Обсуждение: WAL logging problem in 9.4.3?

Поиск

Список

Период

Сортировка

WAL logging problem in 9.4.3?

От

Martijn van Oosterhout

Дата:

03 июля 2015 г., 01:04:47

Hoi,

I ran into this in our CI setup and I thinks it's an actual bug. The
issue appears to be that when a table is created *and truncated* in a
single transaction, that the WAL log is logging a truncate record it
shouldn't, such that if the database crashes you have a broken index.
It would also lose any data that was in the table at commit time.

I didn't test 9.4.4 yet, though I don't see anything in the release
notes that resemble this.

Detail:

=== Start with an empty database

martijn@martijn-jessie:$ psql ctmp -h localhost -U username
Password for user username:
psql (9.4.3)
Type "help" for help.

ctmp=# begin;
BEGIN
ctmp=# create table test(id serial primary key);
CREATE TABLE
ctmp=# truncate table test;
TRUNCATE TABLE
ctmp=# commit;
COMMIT
ctmp=# select relname, relfilenode from pg_class where relname like 'test%';  relname   | relfilenode
-------------+-------------test        |       16389test_id_seq |       16387test_pkey   |       16393
(3 rows)

=== Note: if you do a CHECKPOINT here the issue doesn't happen
=== obviously.

ctmp=# \q
martijn@martijn-jessie:$ sudo ls -l /data/postgres/base/16385/{16389,16387,16393}
[sudo] password for martijn:
-rw------- 1 messagebus ssl-cert 8192 Jul  2 23:34 /data/postgres/base/16385/16387
-rw------- 1 messagebus ssl-cert    0 Jul  2 23:34 /data/postgres/base/16385/16389
-rw------- 1 messagebus ssl-cert 8192 Jul  2 23:34 /data/postgres/base/16385/16393

=== Note the index file is 8KB.
=== At this point nuke the database server (in this case it was simply
=== destroying the container it was running in.

=== Dump the xlogs just to show what got recorded. Note there's a
=== truncate for the data file and the index file.

martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/
000000010000000000000001|grep -wE '16389|16387|16393' 
rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn: 0/016A9240, prev 0/016A9200, bkp: 0000, desc:
checkpoint:redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB
1;oldest multi 1 in DB 1; oldest running xid 0; shutdown 
rmgr: Storage     len (rec/tot):     16/    48, tx:          0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: 0000, desc: file
create:base/16385/16387 
rmgr: Sequence    len (rec/tot):    158/   190, tx:        686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: 0000, desc: log:
rel1663/16385/16387 
rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: 0000, desc: file
create:base/16385/16389 
rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: 0000, desc: file
create:base/16385/16393 
rmgr: Sequence    len (rec/tot):    158/   190, tx:        686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: 0000, desc: log:
rel1663/16385/16387 
rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: 0000, desc: file
truncate:base/16385/16389 to 0 blocks 
rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: 0000, desc: file
truncate:base/16385/16393 to 0 blocks 
pg_xlogdump: FATAL:  error in WAL record at 0/16BE710: record with zero length at 0/16BE740

=== Start the DB up again

database_1 | LOG:  database system was interrupted; last known up at 2015-07-02 21:08:05 UTC
database_1 | LOG:  database system was not properly shut down; automatic recovery in progress
database_1 | LOG:  redo starts at 0/16A92A8
database_1 | LOG:  record with zero length at 0/16BE740
database_1 | LOG:  redo done at 0/16BE710
database_1 | LOG:  last completed transaction was at log time 2015-07-02 21:34:45.664989+00
database_1 | LOG:  database system is ready to accept connections
database_1 | LOG:  autovacuum launcher started

=== Oops, the index file is empty now

martijn@martijn-jessie:$ sudo ls -l /data/postgres/base/16385/{16389,16387,16393}
-rw------- 1 messagebus ssl-cert 8192 Jul  2 23:37 /data/postgres/base/16385/16387
-rw------- 1 messagebus ssl-cert    0 Jul  2 23:34 /data/postgres/base/16385/16389
-rw------- 1 messagebus ssl-cert    0 Jul  2 23:37 /data/postgres/base/16385/16393

martijn@martijn-jessie:$ psql ctmp -h localhost -U username
Password for user username:
psql (9.4.3)
Type "help" for help.

=== And now the index is broken. I think the only reason it doesn't
=== complain about the data file is because zero bytes there is OK.  But if
=== the table had data before it would be gone now.

ctmp=# select * from test;
ERROR:  could not read block 0 in file "base/16385/16393": read only 0 of 8192 bytes

ctmp=# select version();                                           version
-----------------------------------------------------------------------------------------------PostgreSQL 9.4.3 on
x86_64-unknown-linux-gnu,compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit 
(1 row)

Hope this helps.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.  -- Arthur Schopenhauer

Re: WAL logging problem in 9.4.3?

От

Andres Freund

Дата:

03 июля 2015 г., 01:21:13

Hi,

On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
> === Start with an empty database

My guess is you have wal_level = minimal?

> ctmp=# begin;
> BEGIN
> ctmp=# create table test(id serial primary key);
> CREATE TABLE
> ctmp=# truncate table test;
> TRUNCATE TABLE
> ctmp=# commit;
> COMMIT
> ctmp=# select relname, relfilenode from pg_class where relname like 'test%';
>    relname   | relfilenode
> -------------+-------------
>  test        |       16389
>  test_id_seq |       16387
>  test_pkey   |       16393
> (3 rows)
> 

> === Note the index file is 8KB.
> === At this point nuke the database server (in this case it was simply 
> === destroying the container it was running in.

How did you continue from there? The container has persistent storage?
Or are you repapplying the WAL to somewhere else?

> === Dump the xlogs just to show what got recorded. Note there's a
> === truncate for the data file and the index file.

That should be ok.

> martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/
000000010000000000000001|grep -wE '16389|16387|16393'
 
> rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn: 0/016A9240, prev 0/016A9200, bkp: 0000, desc:
checkpoint:redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB
1;oldest multi 1 in DB 1; oldest running xid 0; shutdown
 
> rmgr: Storage     len (rec/tot):     16/    48, tx:          0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: 0000, desc:
filecreate: base/16385/16387
 
> rmgr: Sequence    len (rec/tot):    158/   190, tx:        686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: 0000, desc:
log:rel 1663/16385/16387
 
> rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: 0000, desc:
filecreate: base/16385/16389
 
> rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: 0000, desc:
filecreate: base/16385/16393
 
> rmgr: Sequence    len (rec/tot):    158/   190, tx:        686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: 0000, desc:
log:rel 1663/16385/16387
 
> rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: 0000, desc:
filetruncate: base/16385/16389 to 0 blocks
 
> rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: 0000, desc:
filetruncate: base/16385/16393 to 0 blocks
 
> pg_xlogdump: FATAL:  error in WAL record at 0/16BE710: record with zero length at 0/16BE740

Note that the truncate will lead to a new, different, relfilenode.

> === Start the DB up again
> 
> database_1 | LOG:  database system was interrupted; last known up at 2015-07-02 21:08:05 UTC
> database_1 | LOG:  database system was not properly shut down; automatic recovery in progress
> database_1 | LOG:  redo starts at 0/16A92A8
> database_1 | LOG:  record with zero length at 0/16BE740
> database_1 | LOG:  redo done at 0/16BE710
> database_1 | LOG:  last completed transaction was at log time 2015-07-02 21:34:45.664989+00
> database_1 | LOG:  database system is ready to accept connections
> database_1 | LOG:  autovacuum launcher started
> 
> === Oops, the index file is empty now

That's probably just the old index file?

> martijn@martijn-jessie:$ sudo ls -l /data/postgres/base/16385/{16389,16387,16393}
> -rw------- 1 messagebus ssl-cert 8192 Jul  2 23:37 /data/postgres/base/16385/16387
> -rw------- 1 messagebus ssl-cert    0 Jul  2 23:34 /data/postgres/base/16385/16389
> -rw------- 1 messagebus ssl-cert    0 Jul  2 23:37 /data/postgres/base/16385/16393
> 
> martijn@martijn-jessie:$ psql ctmp -h localhost -U username
> Password for user username:
> psql (9.4.3)
> Type "help" for help.
> 
> === And now the index is broken. I think the only reason it doesn't
> === complain about the data file is because zero bytes there is OK.  But if
> === the table had data before it would be gone now.
> 
> ctmp=# select * from test;
> ERROR:  could not read block 0 in file "base/16385/16393": read only 0 of 8192 bytes

Hm. I can't reproduce this. Can you include a bit more details about how
to reproduce?

Greetings,

Andres Freund

Re: WAL logging problem in 9.4.3?

От

Martijn van Oosterhout

Дата:

03 июля 2015 г., 08:20:19

On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote:
> Hi,
>
> On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
> > === Start with an empty database
>
> My guess is you have wal_level = minimal?

Default config, was just initdb'd. So yes, the default wal_level =
minimal.

> > === Note the index file is 8KB.
> > === At this point nuke the database server (in this case it was simply
> > === destroying the container it was running in.
>
> How did you continue from there? The container has persistent storage?
> Or are you repapplying the WAL to somewhere else?

The container has persistant storage on the host. What I think is
actually unusual is that the script that started postgres was missing
an 'exec" so postgres never gets the signal to shutdown.

> > martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/
000000010000000000000001|grep -wE '16389|16387|16393' 
> > rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn: 0/016A9240, prev 0/016A9200, bkp: 0000, desc:
checkpoint:redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB
1;oldest multi 1 in DB 1; oldest running xid 0; shutdown 
> > rmgr: Storage     len (rec/tot):     16/    48, tx:          0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: 0000, desc:
filecreate: base/16385/16387 
> > rmgr: Sequence    len (rec/tot):    158/   190, tx:        686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: 0000, desc:
log:rel 1663/16385/16387 
> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: 0000, desc:
filecreate: base/16385/16389 
> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: 0000, desc:
filecreate: base/16385/16393 
> > rmgr: Sequence    len (rec/tot):    158/   190, tx:        686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: 0000, desc:
log:rel 1663/16385/16387 
> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: 0000, desc:
filetruncate: base/16385/16389 to 0 blocks 
> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: 0000, desc:
filetruncate: base/16385/16393 to 0 blocks 
> > pg_xlogdump: FATAL:  error in WAL record at 0/16BE710: record with zero length at 0/16BE740
>
> Note that the truncate will lead to a new, different, relfilenode.

Really? Comparing the relfilenodes gives the same values before and
after the truncate.
>
> > ctmp=# select * from test;
> > ERROR:  could not read block 0 in file "base/16385/16393": read only 0 of 8192 bytes
>
> Hm. I can't reproduce this. Can you include a bit more details about how
> to reproduce?

Hmm, for me it is 100% reproducable. Are you familiar with Docker? I
can probably construct a Dockerfile that reproduces it pretty reliably.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.  -- Arthur Schopenhauer

Re: WAL logging problem in 9.4.3?

От

Fujii Masao

Дата:

03 июля 2015 г., 08:34:48

On Fri, Jul 3, 2015 at 2:20 PM, Martijn van Oosterhout
<kleptog@svana.org> wrote:
> On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote:
>> Hi,
>>
>> On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
>> > === Start with an empty database
>>
>> My guess is you have wal_level = minimal?
>
> Default config, was just initdb'd. So yes, the default wal_level =
> minimal.
>
>> > === Note the index file is 8KB.
>> > === At this point nuke the database server (in this case it was simply
>> > === destroying the container it was running in.
>>
>> How did you continue from there? The container has persistent storage?
>> Or are you repapplying the WAL to somewhere else?
>
> The container has persistant storage on the host. What I think is
> actually unusual is that the script that started postgres was missing
> an 'exec" so postgres never gets the signal to shutdown.
>
>> > martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/
000000010000000000000001|grep -wE '16389|16387|16393' 
>> > rmgr: XLOG        len (rec/tot):     72/   104, tx:          0, lsn: 0/016A9240, prev 0/016A9200, bkp: 0000, desc:
checkpoint:redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB
1;oldest multi 1 in DB 1; oldest running xid 0; shutdown 
>> > rmgr: Storage     len (rec/tot):     16/    48, tx:          0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: 0000, desc:
filecreate: base/16385/16387 
>> > rmgr: Sequence    len (rec/tot):    158/   190, tx:        686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: 0000, desc:
log:rel 1663/16385/16387 
>> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: 0000, desc:
filecreate: base/16385/16389 
>> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: 0000, desc:
filecreate: base/16385/16393 
>> > rmgr: Sequence    len (rec/tot):    158/   190, tx:        686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: 0000, desc:
log:rel 1663/16385/16387 
>> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: 0000, desc:
filetruncate: base/16385/16389 to 0 blocks 
>> > rmgr: Storage     len (rec/tot):     16/    48, tx:        686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: 0000, desc:
filetruncate: base/16385/16393 to 0 blocks 
>> > pg_xlogdump: FATAL:  error in WAL record at 0/16BE710: record with zero length at 0/16BE740
>>
>> Note that the truncate will lead to a new, different, relfilenode.
>
> Really? Comparing the relfilenodes gives the same values before and
> after the truncate.

Yep, the relfilenodes are not changed in this case because CREATE TABLE and
TRUNCATE were executed in the same transaction block.

>> > ctmp=# select * from test;
>> > ERROR:  could not read block 0 in file "base/16385/16393": read only 0 of 8192 bytes
>>
>> Hm. I can't reproduce this. Can you include a bit more details about how
>> to reproduce?
>
> Hmm, for me it is 100% reproducable. Are you familiar with Docker? I
> can probably construct a Dockerfile that reproduces it pretty reliably.

I could reproduce the problem in the master branch by doing
the following steps.

1. start the PostgreSQL server with wal_level = minimal
2. execute the following SQL statements    begin;    create table test(id serial primary key);    truncate table test;
 commit; 
3. shutdown the server with immediate mode
4. restart the server (crash recovery occurs)
5. execute the following SQL statement   select * from test;

The optimization of TRUNCATE opereation that we can use when
CREATE TABLE and TRUNCATE are executed in the same transaction block
seems to cause the problem. In this case, only index file truncation is
logged, and index creation in btbuild() is not logged because wal_level
is minimal. Then at the subsequent crash recovery, index file is truncated
to 0 byte... Very simple fix is to log an index creation in that case,
but not sure if that's ok to do..

Regards,

--
Fujii Masao

Re: WAL logging problem in 9.4.3?

От

Martijn van Oosterhout

Дата:

03 июля 2015 г., 09:01:02

On Fri, Jul 03, 2015 at 02:34:44PM +0900, Fujii Masao wrote:
> > Hmm, for me it is 100% reproducable. Are you familiar with Docker? I
> > can probably construct a Dockerfile that reproduces it pretty reliably.
>
> I could reproduce the problem in the master branch by doing
> the following steps.

Thank you, I wasn't sure if you could kill the server fast enough
without containers, but it looks like immediate mode is enough.

> 1. start the PostgreSQL server with wal_level = minimal
> 2. execute the following SQL statements
>      begin;
>      create table test(id serial primary key);
>      truncate table test;
>      commit;
> 3. shutdown the server with immediate mode
> 4. restart the server (crash recovery occurs)
> 5. execute the following SQL statement
>     select * from test;
>
> The optimization of TRUNCATE opereation that we can use when
> CREATE TABLE and TRUNCATE are executed in the same transaction block
> seems to cause the problem. In this case, only index file truncation is
> logged, and index creation in btbuild() is not logged because wal_level
> is minimal. Then at the subsequent crash recovery, index file is truncated
> to 0 byte... Very simple fix is to log an index creation in that case,
> but not sure if that's ok to do..

Looks plausible to me.

For reference I attach a small tarball for reproduction with docker.

1. Unpack tarball into empty dir (it has three small files)
2. docker build -t test .
3. docker run -v /tmp/pgtest:/data test
4. docker run -v /tmp/pgtest:/data test

Data dir is in /tmp/pgtest

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
   -- Arthur Schopenhauer

On 10 July 2015 at 00:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:
> On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
>> Rather than reverting cab9a0656c36739f, which would re-introduce a
>> different performance problem, perhaps we could have COPY create a new
>> relfilenode when it does this. That should be safe if the table was
>> previously empty.

> I'm not convinced that cab9a0656c36739f needs to survive in that
> form. To me only allowing one COPY to benefit from the wal_level =
> minimal optimization has a significantly higher cost than
> cab9a0656c36739f.

What evidence have you got to base that value judgement on?

cab9a0656c36739f was based on an actual user complaint, so we have good
evidence that there are people out there who care about the cost of
truncating a table many times in one transaction. On the other hand,
I know of no evidence that anyone's depending on multiple sequential
COPYs, nor intermixed COPY and INSERT, to be fast. The original argument
for having this COPY optimization at all was to make restoring pg_dump
scripts in a single transaction fast; and that use-case doesn't care
about anything but a single COPY into a virgin table.

We have to backpatch this fix, so it must be both simple and effective.

Heikki's suggestions may be best, maybe not, but they don't seem backpatchable.

Tom's suggestion looks good. So does Andres' suggestion. I have coded both.

And what reason is there to think that this would fix all the problems?

I don't think either suggested fix could be claimed to be a great solution, since there is little principle here, only heuristic. Heikki's solution would be the only safe way, but is not backpatchable.

Forcing SKIP_FSM to always extend has no negative side effects in other code paths, AFAICS.

Patches attached. Martijn, please verify.

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: WAL logging problem in 9.4.3?

От

Heikki Linnakangas

Дата:

22 июля 2015 г., 19:21:34

On 07/22/2015 11:18 AM, Simon Riggs wrote:
> On 10 July 2015 at 00:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> Andres Freund <andres@anarazel.de> writes:
>>> On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
>>>> Rather than reverting cab9a0656c36739f, which would re-introduce a
>>>> different performance problem, perhaps we could have COPY create a new
>>>> relfilenode when it does this.  That should be safe if the table was
>>>> previously empty.
>>
>>> I'm not convinced that cab9a0656c36739f needs to survive in that
>>> form. To me only allowing one COPY to benefit from the wal_level =
>>> minimal optimization has a significantly higher cost than
>>> cab9a0656c36739f.
>>
>> What evidence have you got to base that value judgement on?
>>
>> cab9a0656c36739f was based on an actual user complaint, so we have good
>> evidence that there are people out there who care about the cost of
>> truncating a table many times in one transaction.  On the other hand,
>> I know of no evidence that anyone's depending on multiple sequential
>> COPYs, nor intermixed COPY and INSERT, to be fast.  The original argument
>> for having this COPY optimization at all was to make restoring pg_dump
>> scripts in a single transaction fast; and that use-case doesn't care
>> about anything but a single COPY into a virgin table.
>>
>
> We have to backpatch this fix, so it must be both simple and effective.
>
> Heikki's suggestions may be best, maybe not, but they don't seem
> backpatchable.
>
> Tom's suggestion looks good. So does Andres' suggestion. I have coded both.

Thanks. For comparison, I wrote a patch to implement what I had in mind.

When a WAL-skipping COPY begins, we add an entry for that relation in a
"pending-fsyncs" hash table. Whenever we perform any action on a heap
that would normally be WAL-logged, we check if the relation is in the
hash table, and skip WAL-logging if so.

That was a simplified explanation. In reality, when WAL-skipping COPY
begins, we also memorize the current size of the relation. Any actions
on blocks greater than the old size are not WAL-logged, and any actions
on smaller-numbered blocks are. This ensures that if you did any INSERTs
on the table before the COPY, any new actions on the blocks that were
already WAL-logged by the INSERT are also WAL-logged. And likewise if
you perform any INSERTs after (or during, by trigger) the COPY, and they
modify the new pages, those actions are not WAL-logged. So starting a
WAL-skipping COPY splits the relation into two parts: the first part
that is WAL-logged as usual, and the later part that is not WAL-logged.
(there is one loose end marked with XXX in the patch on this, when one
of the pages involved in a cold UPDATE is before the watermark and the
other is after)

The actual fsync() has been moved to the end of transaction, as we are
now skipping WAL-logging of any actions after the COPY as well.

And truncations complicate things further. If we emit a truncation WAL
record in the transaction, we also make an entry in the hash table to
record that. All operations on a relation that has been truncated must
be WAL-logged as usual, because replaying the truncate record will
destroy all data even if we fsync later. But we still optimize for
"BEGIN; CREATE; COPY; TRUNCATE; COPY;" style patterns, because if we
truncate a relation that has already been marked for fsync-at-COMMIT, we
don't need to WAL-log the truncation either.

This is more invasive than I'd like to backpatch, but I think it's the
simplest approach that works, and doesn't disable any of the important
optimizations we have.

>> And what reason is there to think that this would fix all the problems?
>
> I don't think either suggested fix could be claimed to be a great solution,
> since there is little principle here, only heuristic. Heikki's solution
> would be the only safe way, but is not backpatchable.

I can't get too excited about a half-fix that leaves you with data
corruption in some scenarios.

I wrote a little test script to test all these different scenarios
(attached). Both of your patches fail with the script.

- Heikki

Вложения

Re: WAL logging problem in 9.4.3?

От

Simon Riggs

Дата:

23 июля 2015 г., 11:52:43

On 22 July 2015 at 17:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

When a WAL-skipping COPY begins, we add an entry for that relation in a "pending-fsyncs" hash table. Whenever we perform any action on a heap that would normally be WAL-logged, we check if the relation is in the hash table, and skip WAL-logging if so.

That was a simplified explanation. In reality, when WAL-skipping COPY begins, we also memorize the current size of the relation. Any actions on blocks greater than the old size are not WAL-logged, and any actions on smaller-numbered blocks are. This ensures that if you did any INSERTs on the table before the COPY, any new actions on the blocks that were already WAL-logged by the INSERT are also WAL-logged. And likewise if you perform any INSERTs after (or during, by trigger) the COPY, and they modify the new pages, those actions are not WAL-logged. So starting a WAL-skipping COPY splits the relation into two parts: the first part that is WAL-logged as usual, and the later part that is not WAL-logged. (there is one loose end marked with XXX in the patch on this, when one of the pages involved in a cold UPDATE is before the watermark and the other is after)

The actual fsync() has been moved to the end of transaction, as we are now skipping WAL-logging of any actions after the COPY as well.

And truncations complicate things further. If we emit a truncation WAL record in the transaction, we also make an entry in the hash table to record that. All operations on a relation that has been truncated must be WAL-logged as usual, because replaying the truncate record will destroy all data even if we fsync later. But we still optimize for "BEGIN; CREATE; COPY; TRUNCATE; COPY;" style patterns, because if we truncate a relation that has already been marked for fsync-at-COMMIT, we don't need to WAL-log the truncation either.

This is more invasive than I'd like to backpatch, but I think it's the simplest approach that works, and doesn't disable any of the important optimizations we have.

I didn't like it when I first read this, but I do now. As a by product of fixing the bug it actually extends the optimization.

You can optimize this approach so we always write WAL unless one of the two subid fields are set, so there is no need to call smgrIsSyncPending() every time. I couldn't see where this depended upon wal_level, but I guess its there somewhere.

I'm unhappy about the call during MarkBufferDirtyHint() which is just too costly. The only way to do this cheaply is to specifically mark buffers as being BM_WAL_SKIPPED, so they do not need to be hinted. That flag would be removed when we flush the buffers for the relation.

And what reason is there to think that this would fix all the problems?

I don't think either suggested fix could be claimed to be a great solution,
since there is little principle here, only heuristic. Heikki's solution
would be the only safe way, but is not backpatchable.

I can't get too excited about a half-fix that leaves you with data corruption in some scenarios.

On further consideration, it seems obvious that Andres' suggestion would not work for UPDATE or DELETE, so I now agree.

It does seem a big thing to backpatch; alternative suggestions?

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: WAL logging problem in 9.4.3?

От

Robert Haas

Дата:

23 июля 2015 г., 21:38:20

On Wed, Jul 22, 2015 at 12:21 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> This is more invasive than I'd like to backpatch, but I think it's the
> simplest approach that works, and doesn't disable any of the important
> optimizations we have.

Hmm, isn't HeapNeedsWAL() a lot more costly than RelationNeedsWAL()?
Should we be worried about that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: WAL logging problem in 9.4.3?

От

Heikki Linnakangas

Дата:

24 июля 2015 г., 09:28:09

On 07/23/2015 09:38 PM, Robert Haas wrote:
> On Wed, Jul 22, 2015 at 12:21 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> This is more invasive than I'd like to backpatch, but I think it's the
>> simplest approach that works, and doesn't disable any of the important
>> optimizations we have.
>
> Hmm, isn't HeapNeedsWAL() a lot more costly than RelationNeedsWAL()?

Yes. But it's still very cheap, especially in the common case that the 
pending syncs hash table is empty.

> Should we be worried about that?

It doesn't worry me.

- Heikki

Re: WAL logging problem in 9.4.3?

От

Alvaro Herrera

Дата:

21 октября 2015 г., 19:36:16

Heikki Linnakangas wrote:

> Thanks. For comparison, I wrote a patch to implement what I had in mind.
> 
> When a WAL-skipping COPY begins, we add an entry for that relation in a
> "pending-fsyncs" hash table. Whenever we perform any action on a heap that
> would normally be WAL-logged, we check if the relation is in the hash table,
> and skip WAL-logging if so.

I think this wasn't applied, was it?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

22 октября 2015 г., 03:56:40

On Wed, Oct 21, 2015 at 11:53 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Heikki Linnakangas wrote:
>
>> Thanks. For comparison, I wrote a patch to implement what I had in mind.
>>
>> When a WAL-skipping COPY begins, we add an entry for that relation in a
>> "pending-fsyncs" hash table. Whenever we perform any action on a heap that
>> would normally be WAL-logged, we check if the relation is in the hash table,
>> and skip WAL-logging if so.
>
> I think this wasn't applied, was it?

No, it was not applied.
-- 
Michael

Re: WAL logging problem in 9.4.3?

От

Heikki Linnakangas

Дата:

04 февраля 2016 г., 15:25:40

On 22/10/15 03:56, Michael Paquier wrote:
> On Wed, Oct 21, 2015 at 11:53 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Heikki Linnakangas wrote:
>>
>>> Thanks. For comparison, I wrote a patch to implement what I had in mind.
>>>
>>> When a WAL-skipping COPY begins, we add an entry for that relation in a
>>> "pending-fsyncs" hash table. Whenever we perform any action on a heap that
>>> would normally be WAL-logged, we check if the relation is in the hash table,
>>> and skip WAL-logging if so.
>>
>> I think this wasn't applied, was it?
>
> No, it was not applied.

I dropped the ball on this one back in July, so here's an attempt to
revive this thread.

I spent some time fixing the remaining issues with the prototype patch I
posted earlier, and rebased that on top of current git master. See attached.

Some review of that would be nice. If there are no major issues with it,
I'm going to create backpatchable versions of this for 9.4 and below.

- Heikki

Вложения

0001-Fix-the-optimization-to-skip-WAL-logging-on-table-cr.patch

Re: WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

18 февраля 2016 г., 10:27:18

On Thu, Feb 4, 2016 at 3:24 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I dropped the ball on this one back in July, so here's an attempt to revive
> this thread.
>
> I spent some time fixing the remaining issues with the prototype patch I
> posted earlier, and rebased that on top of current git master. See attached.
>
> Some review of that would be nice. If there are no major issues with it, I'm
> going to create backpatchable versions of this for 9.4 and below.

I am going to look into that very soon. For now and to not forget
about this bug, I have added an entry in the CF app:
https://commitfest.postgresql.org/9/528/
-- 
Michael

Re: WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

19 февраля 2016 г., 10:33:32

On Thu, Feb 18, 2016 at 4:27 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Feb 4, 2016 at 3:24 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I dropped the ball on this one back in July, so here's an attempt to revive
>> this thread.
>>
>> I spent some time fixing the remaining issues with the prototype patch I
>> posted earlier, and rebased that on top of current git master. See attached.
>>
>> Some review of that would be nice. If there are no major issues with it, I'm
>> going to create backpatchable versions of this for 9.4 and below.
>
> I am going to look into that very soon. For now and to not forget
> about this bug, I have added an entry in the CF app:
> https://commitfest.postgresql.org/9/528/

Worth noting that this patch does not address the problem with index
relations when a TRUNCATE is used in the same transaction as its
CREATE TABLE, take that for example when wal_level = minimal:
1) Run transaction
=# begin;
BEGIN
=# create table ab (a int primary key);
CREATE TABLE
=# truncate ab;
TRUNCATE TABLE
=# commit;
COMMIT
2) Restart server with immediate mode.
3) Failure:
=# table ab;
ERROR:  XX001: could not read block 0 in file "base/16384/16388": read
only 0 of 8192 bytes
LOCATION:  mdread, md.c:728

The case where a COPY is issued after TRUNCATE is fixed though, so
that's still an improvement.

Here are other comments.

+   /* Flush updates to relations that we didn't WAL-logged */
+   smgrDoPendingSyncs(true);
"Flush updates to relations there were not WAL-logged"?

+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+   FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
islocal is always set as false, I'd rather remove it this argument
from FlushRelationBuffersWithoutRelCache.
       for (i = 0; i < nrels; i++)
+       {           smgrclose(srels[i]);
+       }
Looks like noise.

+   if (!found)
+   {
+       pending->truncated_to = InvalidBlockNumber;
+       pending->sync_above = nblocks;
+
+       elog(DEBUG2, "registering new pending sync for rel %u/%u/%u at
block %u",
+            rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
+
+   }
+   else if (pending->sync_above == InvalidBlockNumber)
+   {
+       elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+            rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
+       pending->sync_above = nblocks;
+   }
+   else
Here couldn't it be possible that when (sync_above !=
InvalidBlockNumber), nblocks can be higher than sync_above? In which
case we had better increase sync_above to nblocks, no?

+       if (!pendingSyncs)
+           createPendingSyncsHash();
+       pending = (PendingRelSync *) hash_search(pendingSyncs,
+                                                (void *) &rel->rd_node,
+                                                HASH_ENTER, &found);
This is lacking comments.

-       if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT))
+       BufferGetTag(buffer, &rnode, &forknum, &blknum);
+       if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT) &&
+           !smgrIsSyncPending(rnode, blknum))
Here as well explaining in more details why the buffer does not need
to go through XLogSaveBufferForHint would be nice.
-- 
Michael

Re: WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

19 февраля 2016 г., 16:27:09

On Fri, Feb 19, 2016 at 4:33 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Feb 18, 2016 at 4:27 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Thu, Feb 4, 2016 at 3:24 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> I dropped the ball on this one back in July, so here's an attempt to revive
>>> this thread.
>>>
>>> I spent some time fixing the remaining issues with the prototype patch I
>>> posted earlier, and rebased that on top of current git master. See attached.
>>>
>>> Some review of that would be nice. If there are no major issues with it, I'm
>>> going to create backpatchable versions of this for 9.4 and below.
>>
>> I am going to look into that very soon. For now and to not forget
>> about this bug, I have added an entry in the CF app:
>> https://commitfest.postgresql.org/9/528/
>
> Worth noting that this patch does not address the problem with index
> relations when a TRUNCATE is used in the same transaction as its
> CREATE TABLE, take that for example when wal_level = minimal:
> 1) Run transaction
> =# begin;
> BEGIN
> =# create table ab (a int primary key);
> CREATE TABLE
> =# truncate ab;
> TRUNCATE TABLE
> =# commit;
> COMMIT
> 2) Restart server with immediate mode.
> 3) Failure:
> =# table ab;
> ERROR:  XX001: could not read block 0 in file "base/16384/16388": read
> only 0 of 8192 bytes
> LOCATION:  mdread, md.c:728
>
> The case where a COPY is issued after TRUNCATE is fixed though, so
> that's still an improvement.
>
> Here are other comments.
>
> +   /* Flush updates to relations that we didn't WAL-logged */
> +   smgrDoPendingSyncs(true);
> "Flush updates to relations there were not WAL-logged"?
>
> +void
> +FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
> +{
> +   FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
> +}
> islocal is always set as false, I'd rather remove it this argument
> from FlushRelationBuffersWithoutRelCache.
>
>         for (i = 0; i < nrels; i++)
> +       {
>             smgrclose(srels[i]);
> +       }
> Looks like noise.
>
> +   if (!found)
> +   {
> +       pending->truncated_to = InvalidBlockNumber;
> +       pending->sync_above = nblocks;
> +
> +       elog(DEBUG2, "registering new pending sync for rel %u/%u/%u at
> block %u",
> +            rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
> +
> +   }
> +   else if (pending->sync_above == InvalidBlockNumber)
> +   {
> +       elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
> +            rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
> +       pending->sync_above = nblocks;
> +   }
> +   else
> Here couldn't it be possible that when (sync_above !=
> InvalidBlockNumber), nblocks can be higher than sync_above? In which
> case we had better increase sync_above to nblocks, no?
>
> +       if (!pendingSyncs)
> +           createPendingSyncsHash();
> +       pending = (PendingRelSync *) hash_search(pendingSyncs,
> +                                                (void *) &rel->rd_node,
> +                                                HASH_ENTER, &found);
> This is lacking comments.
>
> -       if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT))
> +       BufferGetTag(buffer, &rnode, &forknum, &blknum);
> +       if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT) &&
> +           !smgrIsSyncPending(rnode, blknum))
> Here as well explaining in more details why the buffer does not need
> to go through XLogSaveBufferForHint would be nice.

An additional one:
-   XLogRegisterBuffer(0, newbuf, bufflags);
-   if (oldbuf != newbuf)
-       XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
In log_heap_update, the new buffer is now conditionally logged
depending on if the heap needs WAL or not.

Now during replay the following thing is done:
-   oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
-                                     &obuffer);
+   if (oldblk == newblk)
+       oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+   else if (XLogRecHasBlockRef(record, 1))
+       oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+   else
+       oldaction = BLK_DONE;
Shouldn't we check for XLogRecHasBlockRef(record, 0) when the tuple is
updated on the same page?
-- 
Michael

Re: WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

11 марта 2016 г., 11:34:08

Hello, I considered on the original issue.

At Fri, 19 Feb 2016 22:27:00 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqSGFKUAFqPe5t30jeEA+V9yFMM4yJGa3SnkgY1RHzn7Dg@mail.gmail.com>
> > Worth noting that this patch does not address the problem with index
> > relations when a TRUNCATE is used in the same transaction as its

Focusing this issue, what we should do is somehow building empty
index just after a index truncation. The attached patch does the
following things to fix this.

- make index_build use ambuildempty when the relation on which the index will be built is apparently empty. That is,
whenthe relation has no block.
 

- add one parameter "persistent" to ambuildempty(). It behaves as before if the parameter is false. If not, it creates
anempty index on MAIN_FORK and emits logs even if wal_level is minimal.
 

Creation of an index for an empty table can be safely done by
ambuildempty, since it creates the image for init fork, which can
be simply copied as main fork on initialization. And the heap is
always empty when RelationTruncateIndexes calls index_build.

For nonempty tables, ambuild properly initializes the new index.

The new parameter 'persistent' would be better be forknum because
it actually represents the persistency of the index to be
created. But I'm out of time now..


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index c740952..7f0d3f9 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -675,13 +675,14 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)}void
-brinbuildempty(Relation index)
+brinbuildempty(Relation index, bool persistent){    Buffer        metabuf;
+    ForkNumber    forknum = (persistent ? MAIN_FORKNUM : INIT_FORKNUM);    /* An empty BRIN index has a metapage only.
*/   metabuf =
 
-        ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+        ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);    LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
/*Initialize and xlog metabuffer. */
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index cd21e0e..c041360 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -430,20 +430,23 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)}/*
- *    ginbuildempty() -- build an empty gin index in the initialization fork
+ *    ginbuildempty() -- build an empty gin index
+ *      the new index is built in the intialization fork or main fork according
+ *      to the parameter persistent. */void
-ginbuildempty(Relation index)
+ginbuildempty(Relation index, bool persistent){    Buffer        RootBuffer,                MetaBuffer;
+    ForkNumber    forknum = (persistent ? MAIN_FORKNUM : INIT_FORKNUM);    /* An empty GIN index has two pages. */
MetaBuffer=
 
-        ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+        ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);    LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE);
  RootBuffer =
 
-        ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+        ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);    LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
  /* Initialize and xlog metabuffer and root buffer. */
 
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 996363c..3d73083 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -110,15 +110,18 @@ createTempGistContext(void)}/*
- *    gistbuildempty() -- build an empty gist index in the initialization fork
+ *    gistbuildempty() -- build an empty gist index. 
+ *      the new index is built in the intialization fork or main fork according
+ *      to the parameter persistent. */void
-gistbuildempty(Relation index)
+gistbuildempty(Relation index, bool persistent){    Buffer        buffer;
+    ForkNumber    forknum = (persistent ? MAIN_FORKNUM : INIT_FORKNUM);    /* Initialize the root page */
-    buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+    buffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);    LockBuffer(buffer,
BUFFER_LOCK_EXCLUSIVE);   /* Initialize and xlog buffer */
 
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 3d48c4f..3b9cd66 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -156,12 +156,14 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)}/*
- *    hashbuildempty() -- build an empty hash index in the initialization fork
+ *    hashbuildempty() -- build an empty hash index
+ *      the new index is built in the intialization fork or main fork according
+ *      to the parameter persistent. */void
-hashbuildempty(Relation index)
+hashbuildempty(Relation index, bool persistent){
-    _hash_metapinit(index, 0, INIT_FORKNUM);
+    _hash_metapinit(index, 0, persistent ? MAIN_FORKNUM : INIT_FORKNUM);}/*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f2905cb..c20377d 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -230,12 +230,15 @@ btbuildCallback(Relation index,}/*
- *    btbuildempty() -- build an empty btree index in the initialization fork
+ *    btbuildempty() -- build an empty btree index
+ *      the new index is built in the intialization fork or main fork according
+ *      to the parameter persistent. */void
-btbuildempty(Relation index)
+btbuildempty(Relation index, bool persistent){    Page        metapage;
+    ForkNumber    forknum = persistent ? MAIN_FORKNUM : INIT_FORKNUM;    /* Construct metapage. */    metapage =
(Page)palloc(BLCKSZ);
 
@@ -243,10 +246,9 @@ btbuildempty(Relation index)    /* Write the page.  If archiving/streaming, XLOG it. */
PageSetChecksumInplace(metapage,BTREE_METAPAGE);
 
-    smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
-              (char *) metapage, true);
-    if (XLogIsNeeded())
-        log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+    smgrwrite(index->rd_smgr, forknum, BTREE_METAPAGE, (char *) metapage, true);
+    if (XLogIsNeeded() || persistent)
+        log_newpage(&index->rd_smgr->smgr_rnode.node, forknum,                    BTREE_METAPAGE, metapage, false);
/*
@@ -254,7 +256,7 @@ btbuildempty(Relation index)     * write did not go through shared_buffers and therefore a
concurrent    * checkpoint may have moved the redo pointer past our xlog record.     */
 
-    smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+    smgrimmedsync(index->rd_smgr, forknum);}/*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 44fd644..3d5964b 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -152,12 +152,15 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)}/*
- * Build an empty SPGiST index in the initialization fork
+ * Build an empty SPGiST index
+ *      the new index is built in the intialization fork or main fork according
+ *      to the parameter persistent. */void
-spgbuildempty(Relation index)
+spgbuildempty(Relation index, bool persistent){    Page        page;
+    ForkNumber    forknum = (persistent ? MAIN_FORKNUM : INIT_FORKNUM);    /* Construct metapage. */    page = (Page)
palloc(BLCKSZ);
@@ -165,30 +168,30 @@ spgbuildempty(Relation index)    /* Write the page.  If archiving/streaming, XLOG it. */
PageSetChecksumInplace(page,SPGIST_METAPAGE_BLKNO);
 
-    smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
+    smgrwrite(index->rd_smgr, forknum, SPGIST_METAPAGE_BLKNO,              (char *) page, true);
-    if (XLogIsNeeded())
-        log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+    if (XLogIsNeeded() || persistent)
+        log_newpage(&index->rd_smgr->smgr_rnode.node, forknum,                    SPGIST_METAPAGE_BLKNO, page, false);
  /* Likewise for the root page. */    SpGistInitPage(page, SPGIST_LEAF);    PageSetChecksumInplace(page,
SPGIST_ROOT_BLKNO);
-    smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
+    smgrwrite(index->rd_smgr, forknum, SPGIST_ROOT_BLKNO,              (char *) page, true);
-    if (XLogIsNeeded())
-        log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+    if (XLogIsNeeded() || persistent)
+        log_newpage(&index->rd_smgr->smgr_rnode.node, forknum,                    SPGIST_ROOT_BLKNO, page, true);
/*Likewise for the null-tuples root page. */    SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
PageSetChecksumInplace(page,SPGIST_NULL_BLKNO);
 
-    smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
+    smgrwrite(index->rd_smgr, forknum, SPGIST_NULL_BLKNO,              (char *) page, true);
-    if (XLogIsNeeded())
-        log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+    if (XLogIsNeeded() || persistent)
+        log_newpage(&index->rd_smgr->smgr_rnode.node, forknum,                    SPGIST_NULL_BLKNO, page, true);
/*
@@ -196,7 +199,7 @@ spgbuildempty(Relation index)     * writes did not go through shared buffers and therefore a
concurrent    * checkpoint may have moved the redo pointer past our xlog record.     */
 
-    smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+    smgrimmedsync(index->rd_smgr, forknum);}/*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 31a1438..ea8c623 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1987,7 +1987,8 @@ index_build(Relation heapRelation,            bool isprimary,            bool isreindex){
-    IndexBuildResult *stats;
+    static IndexBuildResult defstats = {0, 0};
+    IndexBuildResult *stats = &defstats;    Oid            save_userid;    int            save_sec_context;    int
      save_nestlevel;
 
@@ -2016,12 +2017,19 @@ index_build(Relation heapRelation,    save_nestlevel = NewGUCNestLevel();    /*
-     * Call the access method's build procedure
+     * Call the access method's build procedure. Build an empty index for
+     * empty heaps.     */
-    stats = indexRelation->rd_amroutine->ambuild(heapRelation, indexRelation,
-                                                 indexInfo);
-    Assert(PointerIsValid(stats));
-
+    if (RelationGetNumberOfBlocks(heapRelation) > 0)
+        stats = indexRelation->rd_amroutine->ambuild(heapRelation,
+                                                     indexRelation,
+                                                     indexInfo);
+    else 
+    {
+        RelationOpenSmgr(indexRelation);
+        indexRelation->rd_amroutine->ambuildempty(indexRelation, true);
+    }
+        /*     * If this is an unlogged index, we may need to write out an init fork for     * it -- but we must first
checkwhether one already exists.  If, for
 
@@ -2032,9 +2040,8 @@ index_build(Relation heapRelation,    if (indexRelation->rd_rel->relpersistence ==
RELPERSISTENCE_UNLOGGED&&        !smgrexists(indexRelation->rd_smgr, INIT_FORKNUM))    {
 
-        RelationOpenSmgr(indexRelation);        smgrcreate(indexRelation->rd_smgr, INIT_FORKNUM, false);
-        indexRelation->rd_amroutine->ambuildempty(indexRelation);
+        indexRelation->rd_amroutine->ambuildempty(indexRelation, false);    }    /*
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 35f1061..220494e 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -36,7 +36,7 @@ typedef IndexBuildResult *(*ambuild_function) (Relation heapRelation,
              struct IndexInfo *indexInfo);/* build empty index */
 
-typedef void (*ambuildempty_function) (Relation indexRelation);
+typedef void (*ambuildempty_function) (Relation indexRelation, bool persistent);/* insert this tuple */typedef bool
(*aminsert_function)(Relation indexRelation,
 
diff --git a/src/include/access/brin_internal.h b/src/include/access/brin_internal.h
index 47317af..f7e600a 100644
--- a/src/include/access/brin_internal.h
+++ b/src/include/access/brin_internal.h
@@ -86,7 +86,7 @@ extern BrinDesc *brin_build_desc(Relation rel);extern void brin_free_desc(BrinDesc *bdesc);extern
IndexBuildResult*brinbuild(Relation heap, Relation index,          struct IndexInfo *indexInfo);
 
-extern void brinbuildempty(Relation index);
+extern void brinbuildempty(Relation index, bool persistent);extern bool brininsert(Relation idxRel, Datum *values,
bool*nulls,           ItemPointer heaptid, Relation heapRel,           IndexUniqueCheck checkUnique);
 
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index d2ea588..91a2622 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -617,7 +617,7 @@ extern Datum gintuple_get_key(GinState *ginstate, IndexTuple tuple,/* gininsert.c */extern
IndexBuildResult*ginbuild(Relation heap, Relation index,         struct IndexInfo *indexInfo);
 
-extern void ginbuildempty(Relation index);
+extern void ginbuildempty(Relation index, bool persistent);extern bool gininsert(Relation index, Datum *values, bool
*isnull,         ItemPointer ht_ctid, Relation heapRel,          IndexUniqueCheck checkUnique);
 
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index f9732ba..448044e 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -428,7 +428,7 @@ typedef struct GiSTOptions/* gist.c */extern Datum gisthandler(PG_FUNCTION_ARGS);
-extern void gistbuildempty(Relation index);
+extern void gistbuildempty(Relation index, bool persistent);extern bool gistinsert(Relation r, Datum *values, bool
*isnull,          ItemPointer ht_ctid, Relation heapRel,           IndexUniqueCheck checkUnique);
 
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 3a68390..ab93e34 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -246,7 +246,7 @@ typedef HashMetaPageData *HashMetaPage;extern Datum hashhandler(PG_FUNCTION_ARGS);extern
IndexBuildResult*hashbuild(Relation heap, Relation index,          struct IndexInfo *indexInfo);
 
-extern void hashbuildempty(Relation index);
+extern void hashbuildempty(Relation index, bool persistent);extern bool hashinsert(Relation rel, Datum *values, bool
*isnull,          ItemPointer ht_ctid, Relation heapRel,           IndexUniqueCheck checkUnique);
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9046b16..64de387 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -656,7 +656,7 @@ typedef BTScanOpaqueData *BTScanOpaque;extern Datum bthandler(PG_FUNCTION_ARGS);extern
IndexBuildResult*btbuild(Relation heap, Relation index,        struct IndexInfo *indexInfo);
 
-extern void btbuildempty(Relation index);
+extern void btbuildempty(Relation index, bool persistent);extern bool btinsert(Relation rel, Datum *values, bool
*isnull,        ItemPointer ht_ctid, Relation heapRel,         IndexUniqueCheck checkUnique);
 
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index 1994f71..3c26cde 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -181,7 +181,7 @@ extern bytea *spgoptions(Datum reloptions, bool validate);/* spginsert.c */extern IndexBuildResult
*spgbuild(Relationheap, Relation index,         struct IndexInfo *indexInfo);
 
-extern void spgbuildempty(Relation index);
+extern void spgbuildempty(Relation index, bool persistent);extern bool spginsert(Relation index, Datum *values, bool
*isnull,         ItemPointer ht_ctid, Relation heapRel,          IndexUniqueCheck checkUnique);

Re: WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

15 марта 2016 г., 20:21:38

On Fri, Mar 11, 2016 at 9:32 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Fri, 19 Feb 2016 22:27:00 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqSGFKUAFqPe5t30jeEA+V9yFMM4yJGa3SnkgY1RHzn7Dg@mail.gmail.com>
>> > Worth noting that this patch does not address the problem with index
>> > relations when a TRUNCATE is used in the same transaction as its
>
> Focusing this issue, what we should do is somehow building empty
> index just after a index truncation. The attached patch does the
> following things to fix this.
>
> - make index_build use ambuildempty when the relation on which
>   the index will be built is apparently empty. That is, when the
>   relation has no block.
> - add one parameter "persistent" to ambuildempty(). It behaves as
>   before if the parameter is false. If not, it creates an empty
>   index on MAIN_FORK and emits logs even if wal_level is minimal.

Hm. It seems to me that this patch is just a bandaid for the real
problem which is that we should not TRUNCATE the underlying index
relations when the TRUNCATE optimization is running. In short I would
let the empty routines in AM code paths alone, and just continue using
them for the generation of INIT_FORKNUM with unlogged relations. Your
patch is not something backpatchable anyway I think.

> The new parameter 'persistent' would be better be forknum because
> it actually represents the persistency of the index to be
> created. But I'm out of time now..

I actually have some users running with wal_level to minimal, even if
I don't think they use this optimization, we had better fix even index
relations at the same time as table relations.. I'll try to get some
time once the patch review storm goes down a little, except if someone
beats me to it first.
-- 
Michael

Re: WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

16 марта 2016 г., 05:02:44

Thank you for the comment.

I understand that this is not an issue in a hurry so don't bother
to reply.

At Tue, 15 Mar 2016 18:21:34 +0100, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqSVm-X1-w9i=U=DCyMxDxzfNT-41pqTSvh0DUmUgi8BQg@mail.gmail.com>
> On Fri, Mar 11, 2016 at 9:32 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > At Fri, 19 Feb 2016 22:27:00 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqSGFKUAFqPe5t30jeEA+V9yFMM4yJGa3SnkgY1RHzn7Dg@mail.gmail.com>
> >> > Worth noting that this patch does not address the problem with index
> >> > relations when a TRUNCATE is used in the same transaction as its
> >
> > Focusing this issue, what we should do is somehow building empty
> > index just after a index truncation. The attached patch does the
> > following things to fix this.
> >
> > - make index_build use ambuildempty when the relation on which
> >   the index will be built is apparently empty. That is, when the
> >   relation has no block.
> > - add one parameter "persistent" to ambuildempty(). It behaves as
> >   before if the parameter is false. If not, it creates an empty
> >   index on MAIN_FORK and emits logs even if wal_level is minimal.
> 
> Hm. It seems to me that this patch is just a bandaid for the real
> problem which is that we should not TRUNCATE the underlying index
> relations when the TRUNCATE optimization is running.

The eventual problem is a 0-length index relation left just after
a relation truncation. We assume that an index with an empty
relation after a recovery is not valid. However just skipping
TRUNCATE of the index relation won't resolve it since it in turn
leaves an index with garbage entries. Am I missing something?

Since the index relation should be "validly emptied" in-place in
any way in the case of TRUNCATE optimization, I tried that by
TRUNCATE + ambuildempty, which can be redo'ed properly, too. A
repeated TRUNCATEs issues eventully-useless logs but it would be
inevitable since we cannot fortell of any succeeding TRUNCATEs.

(TRUNCATE+)COPY+INSERT seems another kind of problem, which would
be fixed by Heikki's patch.

> In short I would
> let the empty routines in AM code paths alone, and just continue using
> them for the generation of INIT_FORKNUM with unlogged relations. Your
> patch is not something backpatchable anyway I think.

It seems to be un-backpatchable if the change of the manner to
call ambuildempty inhibits this.

> > The new parameter 'persistent' would be better be forknum because
> > it actually represents the persistency of the index to be
> > created. But I'm out of time now..
> 
> I actually have some users running with wal_level to minimal, even if
> I don't think they use this optimization, we had better fix even index
> relations at the same time as table relations.. I'll try to get some
> time once the patch review storm goes down a little, except if someone
> beats me to it first.

Ok, I understand that this is not an issue in a hurry. I'll go to
another patch that needs review.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: WAL logging problem in 9.4.3?

От

David Steele

Дата:

22 марта 2016 г., 19:38:47

On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote:

> Ok, I understand that this is not an issue in a hurry. I'll go to
> another patch that needs review.

Since we're getting towards the end of the CF is it time to pick this up
again?

Thanks,
-- 
-David
david@pgmasters.net

Re: WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

23 марта 2016 г., 03:52:44

On Wed, Mar 23, 2016 at 1:38 AM, David Steele <david@pgmasters.net> wrote:
> On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote:
>
>> Ok, I understand that this is not an issue in a hurry. I'll go to
>> another patch that needs review.
>
> Since we're getting towards the end of the CF is it time to pick this up
> again?

Perhaps not. This is a legit bug with an unfinished patch (see index
relation truncation) that is going to need a careful review. I don't
think that this should be impacted by the 4/8 feature freeze, so we
could still work on that after the embargo and we've had this bug for
months actually. FWIW, I am still planning to work on it once the CF
is done, in order to keep my manpower focused on actual patch reviews
as much as possible...
-- 
Michael

Re: WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

23 марта 2016 г., 03:54:44

On Wed, Mar 23, 2016 at 9:52 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Mar 23, 2016 at 1:38 AM, David Steele <david@pgmasters.net> wrote:
>> On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote:
>>
>>> Ok, I understand that this is not an issue in a hurry. I'll go to
>>> another patch that needs review.
>>
>> Since we're getting towards the end of the CF is it time to pick this up
>> again?
>
> Perhaps not. This is a legit bug with an unfinished patch (see index
> relation truncation) that is going to need a careful review. I don't
> think that this should be impacted by the 4/8 feature freeze, so we
> could still work on that after the embargo and we've had this bug for
> months actually. FWIW, I am still planning to work on it once the CF
> is done, in order to keep my manpower focused on actual patch reviews
> as much as possible...

In short, we may want to bump that to next CF... I have already marked
this ticket as something to work on soonish on my side, so it does not
change much seen from here if it's part of the next CF. What we should
just be sure is not to lose track of its existence.
-- 
Michael

Re: WAL logging problem in 9.4.3?

От

David Steele

Дата:

23 марта 2016 г., 05:11:24

On 3/22/16 8:54 PM, Michael Paquier wrote:
> On Wed, Mar 23, 2016 at 9:52 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Wed, Mar 23, 2016 at 1:38 AM, David Steele <david@pgmasters.net> wrote:
>>> On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote:
>>>
>>>> Ok, I understand that this is not an issue in a hurry. I'll go to
>>>> another patch that needs review.
>>>
>>> Since we're getting towards the end of the CF is it time to pick this up
>>> again?
>>
>> Perhaps not. This is a legit bug with an unfinished patch (see index
>> relation truncation) that is going to need a careful review. I don't
>> think that this should be impacted by the 4/8 feature freeze, so we
>> could still work on that after the embargo and we've had this bug for
>> months actually. FWIW, I am still planning to work on it once the CF
>> is done, in order to keep my manpower focused on actual patch reviews
>> as much as possible...
> 
> In short, we may want to bump that to next CF... I have already marked
> this ticket as something to work on soonish on my side, so it does not
> change much seen from here if it's part of the next CF. What we should
> just be sure is not to lose track of its existence.

I would prefer not to bump it to the next CF unless we decide this will
not get fixed for 9.6.

-- 
-David
david@pgmasters.net

Re: WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

23 марта 2016 г., 06:45:24

On Wed, Mar 23, 2016 at 11:11 AM, David Steele <david@pgmasters.net> wrote:
> I would prefer not to bump it to the next CF unless we decide this will
> not get fixed for 9.6.

It may make sense to add that to the list of open items for 9.6
instead. That's not a feature.
-- 
Michael

Re: WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

06 апреля 2016 г., 09:11:21

On Wed, Mar 23, 2016 at 12:45 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Mar 23, 2016 at 11:11 AM, David Steele <david@pgmasters.net> wrote:
>> I would prefer not to bump it to the next CF unless we decide this will
>> not get fixed for 9.6.
>
> It may make sense to add that to the list of open items for 9.6
> instead. That's not a feature.

So I have moved this patch to the next CF for now, and will work on
fixing it rather soonishly as an effort to stabilize 9.6 as well as
back-branches.
-- 
Michael

Re: WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

28 июля 2016 г., 10:59:22

On Wed, Apr 6, 2016 at 3:11 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Mar 23, 2016 at 12:45 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Wed, Mar 23, 2016 at 11:11 AM, David Steele <david@pgmasters.net> wrote:
>>> I would prefer not to bump it to the next CF unless we decide this will
>>> not get fixed for 9.6.
>>
>> It may make sense to add that to the list of open items for 9.6
>> instead. That's not a feature.
>
> So I have moved this patch to the next CF for now, and will work on
> fixing it rather soonishly as an effort to stabilize 9.6 as well as
> back-branches.

Well, not that soon at the end, but I am back on that... I have not
completely reviewed all the code yet, and the case of index relation
referring to a relation optimized with truncate is still broken, but
for now here is a rebased patch if people are interested. I am going
to get as well a TAP tests out of my pocket to ease testing.
--
Michael

Вложения

fix-wal-level-minimal-michael-1.patch

Re: WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

29 июля 2016 г., 10:54:54

On Thu, Jul 28, 2016 at 4:59 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Apr 6, 2016 at 3:11 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Wed, Mar 23, 2016 at 12:45 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> On Wed, Mar 23, 2016 at 11:11 AM, David Steele <david@pgmasters.net> wrote:
>>>> I would prefer not to bump it to the next CF unless we decide this will
>>>> not get fixed for 9.6.
>>>
>>> It may make sense to add that to the list of open items for 9.6
>>> instead. That's not a feature.
>>
>> So I have moved this patch to the next CF for now, and will work on
>> fixing it rather soonishly as an effort to stabilize 9.6 as well as
>> back-branches.
>
> Well, not that soon at the end, but I am back on that... I have not
> completely reviewed all the code yet, and the case of index relation
> referring to a relation optimized with truncate is still broken, but
> for now here is a rebased patch if people are interested. I am going
> to get as well a TAP tests out of my pocket to ease testing.

The patch I sent yesterday was based on an incorrect version. Attached
is a slightly-modified version of the last one I found here
(https://www.postgresql.org/message-id/56B342F5.1050502@iki.fi), which
is rebased on HEAD at ed0b228. I have also converted the test case
script of upthread into a TAP test in src/test/recovery that covers 3
cases and I included that in the patch:
1) CREATE + INSERT + COPY => crash
2) CREATE + trigger + COPY => crash
3) CREATE + TRUNCATE + COPY => incorrect number of rows.
The first two tests make the system crash, the third one reports an
incorrect number of rows.

This is registered in next CF by the way:
https://commitfest.postgresql.org/10/528/
Thoughts?
--
Michael

Вложения

fix-wal-level-minimal-michael-2.patch

Re: WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

26 сентября 2016 г., 11:04:43

Hello, I return to this before my things:)

Though I haven't played with the patch yet..

At Fri, 29 Jul 2016 16:54:42 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqR+3JjS=JB3R=AxxkXCyEB-q77U-ERW7_uKAJCtWNTfrg@mail.gmail.com>
> > Well, not that soon at the end, but I am back on that... I have not
> > completely reviewed all the code yet, and the case of index relation
> > referring to a relation optimized with truncate is still broken, but
> > for now here is a rebased patch if people are interested. I am going
> > to get as well a TAP tests out of my pocket to ease testing.
> 
> The patch I sent yesterday was based on an incorrect version. Attached
> is a slightly-modified version of the last one I found here
> (https://www.postgresql.org/message-id/56B342F5.1050502@iki.fi), which
> is rebased on HEAD at ed0b228. I have also converted the test case
> script of upthread into a TAP test in src/test/recovery that covers 3
> cases and I included that in the patch:
> 1) CREATE + INSERT + COPY => crash
> 2) CREATE + trigger + COPY => crash
> 3) CREATE + TRUNCATE + COPY => incorrect number of rows.
> The first two tests make the system crash, the third one reports an
> incorrect number of rows.

At the first glance, managing sync_above and truncate_to is
workable for these cases, but seems too complicated for the
problem to be resolved.

This provides smgr with a capability to manage pending page
synchs. But the postpone-page-syncs-or-not issue rather seems to
be a matter of the users of that, who are responsible for WAL
issueing. Anyway heap_resgister_sync doesn't use any secret of
smgr. So I think this approach binds smgr with Relation too
tightly.

By this patch, many RelationNeedsWALs, which just accesses local
struct, are replaced with HeapNeedsWAL, which eventually accesses
a hash added by this patch. Especially in log_heap_update, it is
called for every update of single tuple (on a relation that needs
WAL).

Though I don't know how it actually impacts the perfomance, it
seems to me that we can live with truncated_to and sync_above in
RelationData and BufferNeedsWAL(rel, buf) instead of
HeapNeedsWAL(rel, buf).  Anyway up to one entry for one relation
seems to exist at once in the hash.

What do you think?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

29 сентября 2016 г., 10:59:59

On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello, I return to this before my things:)
>
> Though I haven't played with the patch yet..

Be sure to run the test cases in the patch or base your tests on them then!

> Though I don't know how it actually impacts the perfomance, it
> seems to me that we can live with truncated_to and sync_above in
> RelationData and BufferNeedsWAL(rel, buf) instead of
> HeapNeedsWAL(rel, buf).  Anyway up to one entry for one relation
> seems to exist at once in the hash.

TBH, I still think that the design of this patch as proposed is pretty
cool and easy to follow.
-- 
Michael

Re: WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

29 сентября 2016 г., 16:03:58

Hello,

At Thu, 29 Sep 2016 16:59:55 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqT5x05tG7aut1yz+WJN76DqNz1Jzq46fSFtee4YbY0YcA@mail.gmail.com>
> On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Hello, I return to this before my things:)
> >
> > Though I haven't played with the patch yet..
> 
> Be sure to run the test cases in the patch or base your tests on them then!

All items of 006_truncate_opt fail on ed0b228 and they are fixed
with the patch.

> > Though I don't know how it actually impacts the perfomance, it
> > seems to me that we can live with truncated_to and sync_above in
> > RelationData and BufferNeedsWAL(rel, buf) instead of
> > HeapNeedsWAL(rel, buf).  Anyway up to one entry for one relation
> > seems to exist at once in the hash.
> 
> TBH, I still think that the design of this patch as proposed is pretty
> cool and easy to follow.

It is clean from certain viewpoint but additional hash,
especially hash-searching on every HeapNeedsWAL seems to me to be
unacceptable. Do you see it accetable?


The attached patch is quiiiccck-and-dirty-hack of Michael's patch
just as a PoC of my proposal quoted above. This also passes the
006 test.  The major changes are the following.

- Moved sync_above and truncted_to into  RelationData.

- Cleaning up is done in AtEOXact_cleanup instead of explicit calling to smgrDoPendingSyncs().

* BufferNeedsWAL (replace of HeapNeedsWAL) no longer requires hash_search. It just refers to the additional members in
thegiven Relation.
 

X I feel that I have dropped one of the features of the origitnal patch during the hack, but I don't recall it clearly
now:(

X I haven't consider relfilenode replacement, which didn't matter for the original patch. (but there's few places to
consider).

What do you think about this?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 38bba16..02e33cc 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@ *      the POSTGRES heap access method used for all POSTGRES *      relations. *
+ * WAL CONSIDERATIONS
+ *      All heap operations are normally WAL-logged. but there are a few
+ *      exceptions. Temporary and unlogged relations never need to be
+ *      WAL-logged, but we can also skip WAL-logging for a table that was
+ *      created in the same transaction, if we don't need WAL for PITR or
+ *      WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *      the file to disk at COMMIT instead.
+ *
+ *      The same-relation optimization is not employed automatically on all
+ *      updates to a table that was created in the same transacton, because
+ *      for a small number of changes, it's cheaper to just create the WAL
+ *      records than fsyncing() the whole relation at COMMIT. It is only
+ *      worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *      or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *      operation; it will cause any subsequent updates to the table to skip
+ *      WAL-logging, if possible, and cause the heap to be synced to disk at
+ *      COMMIT.
+ *
+ *      To make that work, all modifications to heap must use
+ *      HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ *      for the given block.
+ * *------------------------------------------------------------------------- */#include "postgres.h"
@@ -55,6 +77,7 @@#include "access/xlogutils.h"#include "catalog/catalog.h"#include "catalog/namespace.h"
+#include "catalog/storage.h"#include "miscadmin.h"#include "pgstat.h"#include "storage/bufmgr.h"
@@ -2331,12 +2354,6 @@ FreeBulkInsertState(BulkInsertState bistate) * The new tuple is stamped with current transaction
IDand the specified * command ID. *
 
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- * * The HEAP_INSERT_SKIP_FSM option is passed directly to * RelationGetBufferForTuple, which see for more info. *
@@ -2440,7 +2457,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,    MarkBufferDirty(buffer);    /*
XLOGstuff */
 
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_insert xlrec;        xl_heap_header xlhdr;
@@ -2639,12 +2656,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,    int            ndone;
char       *scratch = NULL;    Page        page;
 
-    bool        needwal;    Size        saveFreeSpace;    bool        need_tuple_data =
RelationIsLogicallyLogged(relation);   bool        need_cids = RelationIsAccessibleInLogicalDecoding(relation);
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);    saveFreeSpace =
RelationGetTargetPageFreeSpace(relation,                                                  HEAP_DEFAULT_FILLFACTOR);
 
@@ -2659,7 +2674,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,     * palloc() within a
criticalsection is not safe, so we allocate this     * beforehand.     */
 
-    if (needwal)
+    if (RelationNeedsWAL(relation))        scratch = palloc(BLCKSZ);    /*
@@ -2694,6 +2709,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,        Buffer
vmbuffer= InvalidBuffer;        bool        all_visible_cleared = false;        int            nthispage;
 
+        bool        needwal;        CHECK_FOR_INTERRUPTS();
@@ -2705,6 +2721,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
            InvalidBuffer, options, bistate,                                           &vmbuffer, NULL);        page =
BufferGetPage(buffer);
+        needwal = BufferNeedsWAL(relation, buffer);        /* NO EREPORT(ERROR) from here till changes are logged */
    START_CRIT_SECTION();
 
@@ -3261,7 +3278,7 @@ l1:     * NB: heap_abort_speculative() uses the same xlog record and replay     * routines.
*/
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_delete xlrec;        XLogRecPtr    recptr;
@@ -4194,7 +4211,8 @@ l2:    MarkBufferDirty(buffer);    /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer) ||
+        BufferNeedsWAL(relation, newbuf))    {        XLogRecPtr    recptr;
@@ -5148,7 +5166,7 @@ failed:     * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG     *
entriesfor everything anyway.)     */
 
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, *buffer))    {        xl_heap_lock xlrec;        XLogRecPtr    recptr;
@@ -5825,7 +5843,7 @@ l4:        MarkBufferDirty(buf);        /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf))        {            xl_heap_lock_updated xlrec;            XLogRecPtr
recptr;
@@ -5980,7 +5998,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)    htup->t_ctid = tuple->t_self;
/*XLOG stuff */
 
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_confirm xlrec;        XLogRecPtr    recptr;
@@ -6112,7 +6130,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)     * The WAL records generated here
matchheap_delete().  The same recovery     * routines are used.     */
 
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_delete xlrec;        XLogRecPtr    recptr;
@@ -6218,7 +6236,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)    MarkBufferDirty(buffer);    /* XLOG
stuff*/
 
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_inplace xlrec;        XLogRecPtr    recptr;
@@ -7331,7 +7349,7 @@ log_heap_clean(Relation reln, Buffer buffer,    XLogRecPtr    recptr;    /* Caller should not
callme on a non-WAL-logged relation */
 
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));    xlrec.latestRemovedXid = latestRemovedXid;    xlrec.nredirected =
nredirected;
@@ -7379,7 +7397,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,    XLogRecPtr    recptr;
 /* Caller should not call me on a non-WAL-logged relation */
 
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));    /* nor when there are no tuples to freeze */    Assert(ntuples > 0);
@@ -7464,7 +7482,7 @@ log_heap_update(Relation reln, Buffer oldbuf,    int            bufflags;    /* Caller should not
callme on a non-WAL-logged relation */
 
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));    XLogBeginInsert();
@@ -7567,76 +7585,86 @@ log_heap_update(Relation reln, Buffer oldbuf,    xlrec.new_offnum =
ItemPointerGetOffsetNumber(&newtup->t_self);   xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
 
+    XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
+    bufflags = REGBUF_STANDARD;    if (init)        bufflags |= REGBUF_WILL_INIT;    if (need_tuple_data)
bufflags|= REGBUF_KEEP_DATA;
 
-    XLogRegisterBuffer(0, newbuf, bufflags);
-    if (oldbuf != newbuf)
-        XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
-
-    XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
-    /*     * Prepare WAL data for the new tuple.     */
-    if (prefixlen > 0 || suffixlen > 0)
+    if (BufferNeedsWAL(reln, newbuf))    {
-        if (prefixlen > 0 && suffixlen > 0)
-        {
-            prefix_suffix[0] = prefixlen;
-            prefix_suffix[1] = suffixlen;
-            XLogRegisterBufData(0, (char *) &prefix_suffix, sizeof(uint16) * 2);
-        }
-        else if (prefixlen > 0)
-        {
-            XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
-        }
-        else
+        XLogRegisterBuffer(0, newbuf, bufflags);
+
+        if ((prefixlen > 0 || suffixlen > 0))        {
-            XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+            if (prefixlen > 0 && suffixlen > 0)
+            {
+                prefix_suffix[0] = prefixlen;
+                prefix_suffix[1] = suffixlen;
+                XLogRegisterBufData(0, (char *) &prefix_suffix,
+                                    sizeof(uint16) * 2);
+            }
+            else if (prefixlen > 0)
+            {
+                XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
+            }
+            else
+            {
+                XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+            }        }
-    }
-    xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
-    xlhdr.t_infomask = newtup->t_data->t_infomask;
-    xlhdr.t_hoff = newtup->t_data->t_hoff;
-    Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
+        xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
+        xlhdr.t_infomask = newtup->t_data->t_infomask;
+        xlhdr.t_hoff = newtup->t_data->t_hoff;
+        Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
-    /*
-     * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
-     *
-     * The 'data' doesn't include the common prefix or suffix.
-     */
-    XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
-    if (prefixlen == 0)
-    {
-        XLogRegisterBufData(0,
-                            ((char *) newtup->t_data) + SizeofHeapTupleHeader,
-                          newtup->t_len - SizeofHeapTupleHeader - suffixlen);
-    }
-    else
-    {        /*
-         * Have to write the null bitmap and data after the common prefix as
-         * two separate rdata entries.
+         * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
+         *
+         * The 'data' doesn't include the common prefix or suffix.         */
-        /* bitmap [+ padding] [+ oid] */
-        if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+        XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
+        if (prefixlen == 0)        {            XLogRegisterBufData(0,                           ((char *)
newtup->t_data)+ SizeofHeapTupleHeader,
 
-                             newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+                          newtup->t_len - SizeofHeapTupleHeader - suffixlen);        }
+        else
+        {
+            /*
+             * Have to write the null bitmap and data after the common prefix
+             * as two separate rdata entries.
+             */
+            /* bitmap [+ padding] [+ oid] */
+            if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+            {
+                XLogRegisterBufData(0,
+                           ((char *) newtup->t_data) + SizeofHeapTupleHeader,
+                             newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+            }
-        /* data after common prefix */
-        XLogRegisterBufData(0,
+            /* data after common prefix */
+            XLogRegisterBufData(0,              ((char *) newtup->t_data) + newtup->t_data->t_hoff + prefixlen,
    newtup->t_len - newtup->t_data->t_hoff - prefixlen - suffixlen);
 
+        }    }
+    /*
+     * If the old and new tuple are on different pages, also register the old
+     * page, so that a full-page image is created for it if necessary. We
+     * don't need any extra information to replay changes to it.
+     */
+    if (oldbuf != newbuf && BufferNeedsWAL(reln, oldbuf))
+        XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
+    /* We need to log a tuple identity */    if (need_tuple_data && old_key_tuple)    {
@@ -8555,8 +8583,13 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)     */    /* Deal with old tuple
version*/
 
-    oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
-                                      &obuffer);
+    if (oldblk == newblk)
+        oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+    else if (XLogRecHasBlockRef(record, 1))
+        oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+    else
+        oldaction = BLK_DONE;
+    if (oldaction == BLK_NEEDS_REDO)    {        page = BufferGetPage(obuffer);
@@ -8610,6 +8643,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)        PageInit(page,
BufferGetPageSize(nbuffer),0);        newaction = BLK_NEEDS_REDO;    }
 
+    else if (!XLogRecHasBlockRef(record, 0))
+        newaction = BLK_DONE;    else        newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
@@ -9046,9 +9081,16 @@ heap2_redo(XLogReaderState *record) *    heap_sync        - sync a heap, for use when no WAL has
beenwritten * * This forces the heap contents (including TOAST heap if any) down to disk.
 
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead. * * Indexes are not touched.  (Currently, index operations associated with * the
commandsthat use this are WAL-logged and so do not need fsync.
 
@@ -9081,3 +9123,33 @@ heap_sync(Relation rel)        heap_close(toastrel, AccessShareLock);    }}
+
+/*
+ *    heap_register_sync    - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RecordPendingSync(rel);
+    if (OidIsValid(rel->rd_rel->reltoastrelid))
+    {
+        Relation    toastrel;
+
+        toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+        RecordPendingSync(toastrel);
+        heap_close(toastrel, AccessShareLock);
+    }
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 6ff9251..27a2447 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@#include "access/htup_details.h"#include "access/xlog.h"#include "catalog/catalog.h"
+#include "catalog/storage.h"#include "miscadmin.h"#include "pgstat.h"#include "storage/bufmgr.h"
@@ -260,7 +261,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,        /*         *
Emita WAL HEAP_CLEAN record showing what we did         */
 
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))        {            XLogRecPtr    recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f9ce986..36ba62a 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -649,9 +649,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)    }    else if (HeapTupleHasExternal(tup) ||
tup->t_len> TOAST_TUPLE_THRESHOLD)        heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
 
-                                         HEAP_INSERT_SKIP_FSM |
-                                         (state->rs_use_wal ?
-                                          0 : HEAP_INSERT_SKIP_WAL));
+                                         HEAP_INSERT_SKIP_FSM);    else        heaptup = tup;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 3ad4a9f..e08623c 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@#include "access/heapam_xlog.h"#include "access/visibilitymap.h"#include "access/xlog.h"
+#include "catalog/storage.h"#include "miscadmin.h"#include "storage/bufmgr.h"#include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,        map[mapByte] |= (flags
<<mapOffset);        MarkBufferDirty(vmBuf);
 
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, heapBuf))        {            if (XLogRecPtrIsInvalid(recptr))            {
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0d8311c..a2f03a7 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -260,31 +260,41 @@ RelationTruncate(Relation rel, BlockNumber nblocks)     */    if (RelationNeedsWAL(rel))    {
-        /*
-         * Make an XLOG entry reporting the file truncation.
-         */
-        XLogRecPtr    lsn;
-        xl_smgr_truncate xlrec;
-
-        xlrec.blkno = nblocks;
-        xlrec.rnode = rel->rd_node;
-        xlrec.flags = SMGR_TRUNCATE_ALL;
-
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
-
-        lsn = XLogInsert(RM_SMGR_ID,
-                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
-
-        /*
-         * Flush, because otherwise the truncation of the main relation might
-         * hit the disk before the WAL record, and the truncation of the FSM
-         * or visibility map. If we crashed during that window, we'd be left
-         * with a truncated heap, but the FSM or visibility map would still
-         * contain entries for the non-existent heap pages.
-         */
-        if (fsm || vm)
-            XLogFlush(lsn);
+        if (rel->sync_above == InvalidBlockNumber ||
+            rel->sync_above < nblocks)
+        {
+            /*
+             * Make an XLOG entry reporting the file truncation.
+             */
+            XLogRecPtr        lsn;
+            xl_smgr_truncate xlrec;
+
+            xlrec.blkno = nblocks;
+            xlrec.rnode = rel->rd_node;
+
+            XLogBeginInsert();
+            XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+            lsn = XLogInsert(RM_SMGR_ID,
+                             XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+            elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+                 nblocks);
+
+            /*
+             * Flush, because otherwise the truncation of the main relation
+             * might hit the disk before the WAL record, and the truncation of
+             * the FSM or visibility map. If we crashed during that window,
+             * we'd be left with a truncated heap, but the FSM or visibility
+             * map would still contain entries for the non-existent heap
+             * pages.
+             */
+            if (fsm || vm)
+                XLogFlush(lsn);
+
+            rel->truncated_to = nblocks;
+        }    }    /* Do the real work */
@@ -419,6 +429,59 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)    return nrels;}
+void
+RecordPendingSync(Relation rel)
+{
+    Assert(RelationNeedsWAL(rel));
+
+    if (rel->sync_above == InvalidBlockNumber)
+    {
+        elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             RelationGetNumberOfBlocks(rel));
+        rel->sync_above = RelationGetNumberOfBlocks(rel);
+    }
+    else
+        elog(DEBUG2, "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             rel->sync_above, RelationGetNumberOfBlocks(rel));
+}
+
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+    BlockNumber blkno = InvalidBlockNumber;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    blkno = BufferGetBlockNumber(buf);
+    if (rel->sync_above == InvalidBlockNumber ||
+         rel->sync_above > blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno, rel->sync_above);
+        return true;
+    }
+
+    /*
+     * We have emitted a truncation record for this block.
+     */
+    if (rel->truncated_to != InvalidBlockNumber &&
+        rel->truncated_to <= blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same
xact",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode, blkno);
+        return true;
+    }
+
+    elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode, blkno);
+
+    return false;
+}
+/* *    PostPrepare_smgr -- Clean up after a successful PREPARE *
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f45b330..a0fe63f 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2269,8 +2269,7 @@ CopyFrom(CopyState cstate)     *    - data is being written to relfilenode created in this
transaction    * then we can skip writing WAL.  It's safe because if the transaction     * doesn't commit, we'll
discardthe table (or the new relfilenode file).
 
-     * If it does commit, we'll have done the heap_sync at the bottom of this
-     * routine first.
+     * If it does commit, commit will do heap_sync().     *     * As mentioned in comments in utils/rel.h, the
in-same-transactiontest     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
 
@@ -2302,7 +2301,7 @@ CopyFrom(CopyState cstate)    {        hi_options |= HEAP_INSERT_SKIP_FSM;        if
(!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(cstate->rel);    }    /*
@@ -2551,11 +2550,11 @@ CopyFrom(CopyState cstate)    FreeExecutorState(estate);    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway)
+     * If we skipped writing WAL, then we will sync the heap at the end of
+     * the transaction. (We used to do it here, but it was later found out
+     * that to be safe, we must also avoid WAL-logging any subsequent
+     * actions on the pages we skipped WAL for). Indexes always use WAL.     */
-    if (hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(cstate->rel);    return processed;}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 5b4f6af..b64d52a 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -567,8 +567,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)     * We can skip
WAL-loggingthe insertions, unless PITR or streaming     * replication is in use. We can skip the FSM in any case.
*/
-    myState->hi_options = HEAP_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+    if (!XLogIsNeeded())
+        heap_register_sync(intoRelationDesc);
+    myState->hi_options = HEAP_INSERT_SKIP_FSM;    myState->bistate = GetBulkInsertState();    /* Not using WAL
requiressmgr_targblock be initially invalid */
 
@@ -617,9 +618,7 @@ intorel_shutdown(DestReceiver *self)    FreeBulkInsertState(myState->bistate);
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->rel);
+    /* If we skipped using WAL, we will sync the relation at commit */    /* close rel, but keep lock until commit */
 heap_close(myState->rel, NoLock);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 6cddcbd..dbef95b 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -456,7 +456,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)     */
myState->hi_options= HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;    if (!XLogIsNeeded())
 
-        myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+        heap_register_sync(transientrel);    myState->bistate = GetBulkInsertState();    /* Not using WAL requires
smgr_targblockbe initially invalid */
 
@@ -499,9 +499,7 @@ transientrel_shutdown(DestReceiver *self)    FreeBulkInsertState(myState->bistate);
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->transientrel);
+    /* If we skipped using WAL, we will sync the relation at commit */    /* close transientrel, but keep lock until
commit*/    heap_close(myState->transientrel, NoLock);
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 86e9814..ca892ea 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -3984,8 +3984,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)        bistate =
GetBulkInsertState();       hi_options = HEAP_INSERT_SKIP_FSM;
 
+        if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(newrel);    }    else    {
@@ -4236,8 +4237,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);       /* If we skipped writing WAL, then we need to sync the heap. */
 
-        if (hi_options & HEAP_INSERT_SKIP_WAL)
-            heap_sync(newrel);        heap_close(newrel, NoLock);    }
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 231e92d..3662f7b 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -879,7 +879,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,                 * page has
beenpreviously WAL-logged, and if not, do that                 * now.                 */
 
-                if (RelationNeedsWAL(onerel) &&
+                if (BufferNeedsWAL(onerel, buf) &&                    PageGetLSN(page) == InvalidXLogRecPtr)
        log_newpage_buffer(buf, true);
 
@@ -1106,7 +1106,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,            }            /*
NowWAL-log freezing if necessary */
 
-            if (RelationNeedsWAL(onerel))
+            if (BufferNeedsWAL(onerel, buf))            {                XLogRecPtr    recptr;
@@ -1462,7 +1462,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,    MarkBufferDirty(buffer);
/* XLOG stuff */
 
-    if (RelationNeedsWAL(onerel))
+    if (BufferNeedsWAL(onerel, buffer))    {        XLogRecPtr    recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 76ade37..d128e63 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,            BufferAccessStrategy strategy,
  bool *foundPtr);static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
 
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);static void AtProcExit_Buffers(int code,
Datumarg);static void CheckForBufferLeaks(void);static int    rnode_comparator(const void *p1, const void *p2);
 
@@ -3130,20 +3131,41 @@ PrintPinnedBufs(void)voidFlushRelationBuffers(Relation rel){
-    int            i;
-    BufferDesc *bufHdr;
-    /* Open rel at the smgr level if not already done */    RelationOpenSmgr(rel);
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int            i;
+    BufferDesc *bufHdr;
+
+    if (islocal)    {        for (i = 0; i < NLocBuffer; i++)        {            uint32        buf_state;
bufHdr= GetLocalBufferDescriptor(i);
 
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&                ((buf_state =
pg_atomic_read_u32(&bufHdr->state))&                 (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))            {
 
@@ -3160,7 +3182,7 @@ FlushRelationBuffers(Relation rel)                PageSetChecksumInplace(localpage,
bufHdr->tag.blockNum);
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,                          bufHdr->tag.forkNum,
bufHdr->tag.blockNum,                         localpage,
 
@@ -3190,18 +3212,18 @@ FlushRelationBuffers(Relation rel)         * As in DropRelFileNodeBuffers, an unlocked precheck
shouldbe safe         * and saves some cycles.         */
 
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))            continue;        ReservePrivateRefCountEntry();
    buf_state = LockBufHdr(bufHdr);
 
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&            (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID
|BM_DIRTY))        {            PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),LW_SHARED);
 
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);            LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr,true);        }
 
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8d2ad01..31ae0f1 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -66,6 +66,7 @@#include "optimizer/var.h"#include "rewrite/rewriteDefine.h"#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"#include "storage/lmgr.h"#include "storage/smgr.h"#include "utils/array.h"
@@ -407,6 +408,9 @@ AllocateRelationDesc(Form_pg_class relp)    /* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount= 1;
 
+    relation->sync_above = InvalidBlockNumber;
+    relation->truncated_to = InvalidBlockNumber;
+    MemoryContextSwitchTo(oldcxt);    return relation;
@@ -1731,6 +1735,9 @@ formrdesc(const char *relationName, Oid relationReltype,        relation->rd_rel->relhasindex =
true;   }
 
+    relation->sync_above = InvalidBlockNumber;
+    relation->truncated_to = InvalidBlockNumber;
+    /*     * add new reldesc to relcache     */
@@ -2055,6 +2062,22 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)    pfree(relation);}
+static void
+RelationDoPendingFlush(Relation relation)
+{
+    if (relation->sync_above != InvalidBlockNumber)
+    {
+        FlushRelationBuffersWithoutRelCache(relation->rd_node, false);
+        smgrimmedsync(smgropen(relation->rd_node, InvalidBackendId),
+                      MAIN_FORKNUM);
+
+        elog(DEBUG2, "syncing rel %u/%u/%u",
+             relation->rd_node.spcNode,
+             relation->rd_node.dbNode, relation->rd_node.relNode);
+        
+    }
+}
+/* * RelationClearRelation *
@@ -2686,7 +2709,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)    if (relation->rd_createSubid !=
InvalidSubTransactionId)   {        if (isCommit)
 
+        {
+            RelationDoPendingFlush(relation);            relation->rd_createSubid = InvalidSubTransactionId;
+        }        else if (RelationHasReferenceCountZero(relation))        {            RelationClearRelation(relation,
false);
@@ -3019,6 +3045,9 @@ RelationBuildLocalRelation(const char *relname,    else        rel->rd_rel->relfilenode =
relfilenode;
+    rel->sync_above = InvalidBlockNumber;
+    rel->truncated_to = InvalidBlockNumber;
+    RelationInitLockInfo(rel);    /* see lmgr.c */    RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b3a595c..1c169ef 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    0x0001
-#define HEAP_INSERT_SKIP_FSM    0x0002
-#define HEAP_INSERT_FROZEN        0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM    0x0001
+#define HEAP_INSERT_FROZEN        0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004typedef struct BulkInsertStateData *BulkInsertState;
@@ -177,6 +176,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);extern void
simple_heap_update(Relationrelation, ItemPointer otid,                   HeapTuple tup);
 
+extern void heap_register_sync(Relation relation);extern void heap_sync(Relation relation);/* in heap/pruneheap.c */
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ef960da..235c2b4 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -29,6 +29,8 @@ extern void RelationTruncate(Relation rel, BlockNumber nblocks); */extern void
smgrDoPendingDeletes(boolisCommit);extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
 
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);extern void AtSubCommit_smgr(void);extern void
AtSubAbort_smgr(void);externvoid PostPrepare_smgr(void);
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3d5dea7..f02ea93 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -202,6 +202,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);extern void FlushOneBuffer(Buffer buffer);extern void FlushRelationBuffers(Relation rel);
 
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                    bool islocal);extern void FlushDatabaseBuffers(Oid dbid);extern void
DropRelFileNodeBuffers(RelFileNodeBackendrnode,                       ForkNumber forkNum, BlockNumber firstDelBlock);
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index ed14442..a8a2b23 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -172,6 +172,9 @@ typedef struct RelationData    /* use "struct" here to avoid needing to include pgstat.h: */
structPgStat_TableStatus *pgstat_info;        /* statistics collection area */
 
+
+    BlockNumber sync_above;
+    BlockNumber truncated_to;} RelationData;

Re: WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

02 октября 2016 г., 15:43:53

On Thu, Sep 29, 2016 at 10:02 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> At Thu, 29 Sep 2016 16:59:55 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqT5x05tG7aut1yz+WJN76DqNz1Jzq46fSFtee4YbY0YcA@mail.gmail.com>
>> On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> > Hello, I return to this before my things:)
>> >
>> > Though I haven't played with the patch yet..
>>
>> Be sure to run the test cases in the patch or base your tests on them then!
>
> All items of 006_truncate_opt fail on ed0b228 and they are fixed
> with the patch.
>
>> > Though I don't know how it actually impacts the perfomance, it
>> > seems to me that we can live with truncated_to and sync_above in
>> > RelationData and BufferNeedsWAL(rel, buf) instead of
>> > HeapNeedsWAL(rel, buf).  Anyway up to one entry for one relation
>> > seems to exist at once in the hash.
>>
>> TBH, I still think that the design of this patch as proposed is pretty
>> cool and easy to follow.
>
> It is clean from certain viewpoint but additional hash,
> especially hash-searching on every HeapNeedsWAL seems to me to be
> unacceptable. Do you see it accetable?
>
>
> The attached patch is quiiiccck-and-dirty-hack of Michael's patch
> just as a PoC of my proposal quoted above. This also passes the
> 006 test.  The major changes are the following.
>
> - Moved sync_above and truncted_to into  RelationData.
>
> - Cleaning up is done in AtEOXact_cleanup instead of explicit
>   calling to smgrDoPendingSyncs().
>
> * BufferNeedsWAL (replace of HeapNeedsWAL) no longer requires
>   hash_search. It just refers to the additional members in the
>   given Relation.
>
> X I feel that I have dropped one of the features of the origitnal
>   patch during the hack, but I don't recall it clearly now:(
>
> X I haven't consider relfilenode replacement, which didn't matter
>   for the original patch. (but there's few places to consider).
>
> What do you think about this?

I have moved this patch to next CF. (I still need to look at your patch.)
-- 
Michael

Re: WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

01 ноября 2016 г., 05:01:36

Hi,

At Sun, 2 Oct 2016 21:43:46 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqTKOyHkrBSxvvSBZCXvU9F8OT_uumXmST_awKsswQA5Sg@mail.gmail.com>
> On Thu, Sep 29, 2016 at 10:02 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Hello,
> >
> > At Thu, 29 Sep 2016 16:59:55 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqT5x05tG7aut1yz+WJN76DqNz1Jzq46fSFtee4YbY0YcA@mail.gmail.com>
> >> On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI
> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >> > Hello, I return to this before my things:)
> >> >
> >> > Though I haven't played with the patch yet..
> >>
> >> Be sure to run the test cases in the patch or base your tests on them then!
> >
> > All items of 006_truncate_opt fail on ed0b228 and they are fixed
> > with the patch.
> >
> >> > Though I don't know how it actually impacts the perfomance, it
> >> > seems to me that we can live with truncated_to and sync_above in
> >> > RelationData and BufferNeedsWAL(rel, buf) instead of
> >> > HeapNeedsWAL(rel, buf).  Anyway up to one entry for one relation
> >> > seems to exist at once in the hash.
> >>
> >> TBH, I still think that the design of this patch as proposed is pretty
> >> cool and easy to follow.
> >
> > It is clean from certain viewpoint but additional hash,
> > especially hash-searching on every HeapNeedsWAL seems to me to be
> > unacceptable. Do you see it accetable?
> >
> >
> > The attached patch is quiiiccck-and-dirty-hack of Michael's patch
> > just as a PoC of my proposal quoted above. This also passes the
> > 006 test.  The major changes are the following.
> >
> > - Moved sync_above and truncted_to into  RelationData.
> >
> > - Cleaning up is done in AtEOXact_cleanup instead of explicit
> >   calling to smgrDoPendingSyncs().
> >
> > * BufferNeedsWAL (replace of HeapNeedsWAL) no longer requires
> >   hash_search. It just refers to the additional members in the
> >   given Relation.
> >
> > X I feel that I have dropped one of the features of the origitnal
> >   patch during the hack, but I don't recall it clearly now:(
> >
> > X I haven't consider relfilenode replacement, which didn't matter
> >   for the original patch. (but there's few places to consider).
> >
> > What do you think about this?
> 
> I have moved this patch to next CF. (I still need to look at your patch.)

Thanks for considering that.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: WAL logging problem in 9.4.3?

От

Robert Haas

Дата:

09 ноября 2016 г., 01:40:39

On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I dropped the ball on this one back in July, so here's an attempt to revive
> this thread.
>
> I spent some time fixing the remaining issues with the prototype patch I
> posted earlier, and rebased that on top of current git master. See attached.
>
> Some review of that would be nice. If there are no major issues with it, I'm
> going to create backpatchable versions of this for 9.4 and below.

Heikki:

Are you going to do commit something here?  This thread and patch are
now 14 months old, which is a long time to make people wait for a bug
fix.  The status in the CF is "Ready for Committer" although I am not
sure if that's accurate.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

09 ноября 2016 г., 04:47:51

On Wed, Nov 9, 2016 at 5:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I dropped the ball on this one back in July, so here's an attempt to revive
>> this thread.
>>
>> I spent some time fixing the remaining issues with the prototype patch I
>> posted earlier, and rebased that on top of current git master. See attached.
>>
>> Some review of that would be nice. If there are no major issues with it, I'm
>> going to create backpatchable versions of this for 9.4 and below.
>
> Are you going to do commit something here?  This thread and patch are
> now 14 months old, which is a long time to make people wait for a bug
> fix.  The status in the CF is "Ready for Committer" although I am not
> sure if that's accurate.

"Needs Review" is definitely a better definition of its current state.
The last time I had a look at this patch I thought that it was in
pretty good shape (not Horiguchi-san's version, but the one in
https://www.postgresql.org/message-id/CAB7nPqR+3JjS=JB3R=AxxkXCyEB-q77U-ERW7_uKAJCtWNTfrg@mail.gmail.com).
With some of the recent changes, surely it needs a second look, things
related to heap handling tend to rot quickly.

I'll look into it once again by the end of this week if Heikki does
not show up, the rest will be on him I am afraid...
-- 
Michael

Re: WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

09 ноября 2016 г., 09:56:01

On Wed, Nov 9, 2016 at 9:27 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Nov 9, 2016 at 5:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> I dropped the ball on this one back in July, so here's an attempt to revive
>>> this thread.
>>>
>>> I spent some time fixing the remaining issues with the prototype patch I
>>> posted earlier, and rebased that on top of current git master. See attached.
>>>
>>> Some review of that would be nice. If there are no major issues with it, I'm
>>> going to create backpatchable versions of this for 9.4 and below.
>>
>> Are you going to do commit something here? This thread and patch are
>> now 14 months old, which is a long time to make people wait for a bug
>> fix. The status in the CF is "Ready for Committer" although I am not
>> sure if that's accurate.
>
> "Needs Review" is definitely a better definition of its current state.
> The last time I had a look at this patch I thought that it was in
> pretty good shape (not Horiguchi-san's version, but the one in
> https://www.postgresql.org/message-id/CAB7nPqR+3JjS=JB3R=AxxkXCyEB-q77U-ERW7_uKAJCtWNTfrg@mail.gmail.com).
> With some of the recent changes, surely it needs a second look, things
> related to heap handling tend to rot quickly.
>
> I'll look into it once again by the end of this week if Heikki does
> not show up, the rest will be on him I am afraid...

I have been able to hit a crash with recovery test 008:
(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff96d48f06 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
* frame #0: 0x00007fff96d48f06 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x00007fff9102e4ec libsystem_pthread.dylib`pthread_kill + 90
frame #2: 0x00007fff8e5cc6df libsystem_c.dylib`abort + 129
frame #3: 0x0000000106ef10f0 postgres`ExceptionalCondition(conditionName="!(( !( ((void) ((bool) (! (!((buffer) <= NBuffers && (buffer) >= -NLocBuffer)) || (ExceptionalCondition(\"!((buffer) <= NBuffers && (buffer) >= -NLocBuffer)\", (\"FailedAssertion\"), \"bufmgr.c\", 2593), 0)))), (buffer) != 0 ) ? ((bool) 0) : ((buffer) < 0) ? (LocalRefCount[-(buffer) - 1] > 0) : (GetPrivateRefCount(buffer) > 0) ))", errorType="FailedAssertion", fileName="bufmgr.c", lineNumber=2593) + 128 at assert.c:54
frame #4: 0x0000000106cf4a2c postgres`BufferGetBlockNumber(buffer=0) + 204 at bufmgr.c:2593
frame #5: 0x000000010694e6ad postgres`HeapNeedsWAL(rel=0x00007f9454804118, buf=0) + 61 at heapam.c:9234
frame #6: 0x000000010696d8bd postgres`visibilitymap_set(rel=0x00007f9454804118, heapBlk=1, heapBuf=0, recptr=50841176, vmBuf=118, cutoff_xid=866, flags='\x01') + 989 at visibilitymap.c:310
frame #7: 0x000000010695d020 postgres`heap_xlog_visible(record=0x00007f94520035d0) + 896 at heapam.c:8148
frame #8: 0x000000010695c582 postgres`heap2_redo(record=0x00007f94520035d0) + 242 at heapam.c:9107
frame #9: 0x00000001069d132d postgres`StartupXLOG + 9181 at xlog.c:6950
frame #10: 0x0000000106c9d783 postgres`StartupProcessMain + 339 at startup.c:216
frame #11: 0x00000001069ee6ec postgres`AuxiliaryProcessMain(argc=2, argv=0x00007fff59316d80) + 1676 at bootstrap.c:420
frame #12: 0x0000000106c98002 postgres`StartChildProcess(type=StartupProcess) + 322 at postmaster.c:5221
frame #13: 0x0000000106c96031 postgres`PostmasterMain(argc=3, argv=0x00007f9451c04210) + 6033 at postmaster.c:1301
frame #14: 0x0000000106bc30cf postgres`main(argc=3, argv=0x00007f9451c04210) + 751 at main.c:228
(lldb) up 1
frame #4: 0x0000000106cf4a2c postgres`BufferGetBlockNumber(buffer=0) + 204 at bufmgr.c:2593
2590 {
2591 BufferDesc *bufHdr;
2592
-> 2593 Assert(BufferIsPinned(buffer));
2594
2595 if (BufferIsLocal(buffer))
2596 bufHdr = GetLocalBufferDescriptor(-buffer - 1);

Michael

Re: WAL logging problem in 9.4.3?

От

Haribabu Kommi

Дата:

02 декабря 2016 г., 07:40:08

On Wed, Nov 9, 2016 at 5:55 PM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Wed, Nov 9, 2016 at 9:27 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Nov 9, 2016 at 5:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> I dropped the ball on this one back in July, so here's an attempt to revive
>>> this thread.
>>>
>>> I spent some time fixing the remaining issues with the prototype patch I
>>> posted earlier, and rebased that on top of current git master. See attached.
>>>
>>> Some review of that would be nice. If there are no major issues with it, I'm
>>> going to create backpatchable versions of this for 9.4 and below.
>>
>> Are you going to do commit something here? This thread and patch are
>> now 14 months old, which is a long time to make people wait for a bug
>> fix. The status in the CF is "Ready for Committer" although I am not
>> sure if that's accurate.
>
> "Needs Review" is definitely a better definition of its current state.
> The last time I had a look at this patch I thought that it was in
> pretty good shape (not Horiguchi-san's version, but the one in
> https://www.postgresql.org/message-id/CAB7nPqR+3JjS=JB3R=AxxkXCyEB-q77U-ERW7_uKAJCtWNTfrg@mail.gmail.com).
> With some of the recent changes, surely it needs a second look, things
> related to heap handling tend to rot quickly.
>
> I'll look into it once again by the end of this week if Heikki does
> not show up, the rest will be on him I am afraid...

I have been able to hit a crash with recovery test 008:
(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff96d48f06 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
* frame #0: 0x00007fff96d48f06 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x00007fff9102e4ec libsystem_pthread.dylib`pthread_kill + 90
frame #2: 0x00007fff8e5cc6df libsystem_c.dylib`abort + 129
frame #3: 0x0000000106ef10f0 postgres`ExceptionalCondition(conditionName="!(( !( ((void) ((bool) (! (!((buffer) <= NBuffers && (buffer) >= -NLocBuffer)) || (ExceptionalCondition(\"!((buffer) <= NBuffers && (buffer) >= -NLocBuffer)\", (\"FailedAssertion\"), \"bufmgr.c\", 2593), 0)))), (buffer) != 0 ) ? ((bool) 0) : ((buffer) < 0) ? (LocalRefCount[-(buffer) - 1] > 0) : (GetPrivateRefCount(buffer) > 0) ))", errorType="FailedAssertion", fileName="bufmgr.c", lineNumber=2593) + 128 at assert.c:54
frame #4: 0x0000000106cf4a2c postgres`BufferGetBlockNumber(buffer=0) + 204 at bufmgr.c:2593
frame #5: 0x000000010694e6ad postgres`HeapNeedsWAL(rel=0x00007f9454804118, buf=0) + 61 at heapam.c:9234
frame #6: 0x000000010696d8bd postgres`visibilitymap_set(rel=0x00007f9454804118, heapBlk=1, heapBuf=0, recptr=50841176, vmBuf=118, cutoff_xid=866, flags='\x01') + 989 at visibilitymap.c:310
frame #7: 0x000000010695d020 postgres`heap_xlog_visible(record=0x00007f94520035d0) + 896 at heapam.c:8148
frame #8: 0x000000010695c582 postgres`heap2_redo(record=0x00007f94520035d0) + 242 at heapam.c:9107
frame #9: 0x00000001069d132d postgres`StartupXLOG + 9181 at xlog.c:6950
frame #10: 0x0000000106c9d783 postgres`StartupProcessMain + 339 at startup.c:216
frame #11: 0x00000001069ee6ec postgres`AuxiliaryProcessMain(argc=2, argv=0x00007fff59316d80) + 1676 at bootstrap.c:420
frame #12: 0x0000000106c98002 postgres`StartChildProcess(type=StartupProcess) + 322 at postmaster.c:5221
frame #13: 0x0000000106c96031 postgres`PostmasterMain(argc=3, argv=0x00007f9451c04210) + 6033 at postmaster.c:1301
frame #14: 0x0000000106bc30cf postgres`main(argc=3, argv=0x00007f9451c04210) + 751 at main.c:228
(lldb) up 1
frame #4: 0x0000000106cf4a2c postgres`BufferGetBlockNumber(buffer=0) + 204 at bufmgr.c:2593
2590 {
2591 BufferDesc *bufHdr;
2592
-> 2593 Assert(BufferIsPinned(buffer));
2594
2595 if (BufferIsLocal(buffer))
2596 bufHdr = GetLocalBufferDescriptor(-buffer - 1);

The latest proposed patch still having problems.

Closed in 2016-11 commitfest with "moved to next CF" status because of a bug fix patch.

Please feel free to update the status once you submit the updated patch.

Regards,

Hari Babu

Fujitsu Australia

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

31 января 2017 г., 10:33:19

On Fri, Dec 2, 2016 at 1:39 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> The latest proposed patch still having problems.
> Closed in 2016-11 commitfest with "moved to next CF" status because of a bug
> fix patch.
> Please feel free to update the status once you submit the updated patch.

And moved to CF 2017-03...
-- 
Michael

Re: [HACKERS] WAL logging problem in 9.4.3?

От

David Steele

Дата:

01 марта 2017 г., 22:23:33

On 1/30/17 11:33 PM, Michael Paquier wrote:
> On Fri, Dec 2, 2016 at 1:39 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>> The latest proposed patch still having problems.
>> Closed in 2016-11 commitfest with "moved to next CF" status because of a bug
>> fix patch.
>> Please feel free to update the status once you submit the updated patch.
> And moved to CF 2017-03...

Are there any plans to post a new patch?  This thread is now 18 months
old and it would be good to get a resolution in this CF.

Thanks,

-- 
-David
david@pgmasters.net

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Alvaro Herrera

Дата:

08 апреля 2017 г., 05:33:21

Kyotaro HORIGUCHI wrote:

> The attached patch is quiiiccck-and-dirty-hack of Michael's patch
> just as a PoC of my proposal quoted above. This also passes the
> 006 test.  The major changes are the following.
> 
> - Moved sync_above and truncted_to into  RelationData.

Interesting.  I wonder if it's possible that a relcache invalidation
would cause these values to get lost for some reason, because that would
be dangerous.

I suppose the rationale is that this shouldn't happen because any
operation that does things this way must hold an exclusive lock on the
relation.  But that doesn't guarantee that the relcache entry is
completely stable, does it?  If we can get proof of that, then this
technique should be safe, I think.

In your version of the patch, which I spent some time skimming, I am
missing comments on various functions.  I added some as I went along,
including one XXX indicating it must be filled.

RecordPendingSync() should really live in relcache.c (and probably get a
different name).

> X I feel that I have dropped one of the features of the origitnal
>   patch during the hack, but I don't recall it clearly now:(

Hah :-)

> X I haven't consider relfilenode replacement, which didn't matter
>   for the original patch. (but there's few places to consider).

Hmm ...  Please provide.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

fix-wal-level-minimal-michael-horiguchi-2.patch

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Alvaro Herrera

Дата:

08 апреля 2017 г., 05:36:25

I have claimed this patch as committer FWIW.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Alvaro Herrera

Дата:

08 апреля 2017 г., 05:42:17

Alvaro Herrera wrote:

> I suppose the rationale is that this shouldn't happen because any
> operation that does things this way must hold an exclusive lock on the
> relation.  But that doesn't guarantee that the relcache entry is
> completely stable, does it?  If we can get proof of that, then this
> technique should be safe, I think.

It occurs to me that in order to test this we could run the recovery
tests (including Michael's new 006 file, which you didn't include in
your patch) under -D CLOBBER_CACHE_ALWAYS.  I think that'd be sufficient
proof that it is solid.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Tom Lane

Дата:

08 апреля 2017 г., 06:38:35

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Interesting.  I wonder if it's possible that a relcache invalidation
> would cause these values to get lost for some reason, because that would
> be dangerous.

> I suppose the rationale is that this shouldn't happen because any
> operation that does things this way must hold an exclusive lock on the
> relation.  But that doesn't guarantee that the relcache entry is
> completely stable,

It ABSOLUTELY is not safe.  Relcache flushes can happen regardless of
how strong a lock you hold.
        regards, tom lane

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

11 апреля 2017 г., 06:56:06

Hello, thank you for looking this.

At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <27309.1491611915@sss.pgh.pa.us>
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Interesting.  I wonder if it's possible that a relcache invalidation
> > would cause these values to get lost for some reason, because that would
> > be dangerous.
> 
> > I suppose the rationale is that this shouldn't happen because any
> > operation that does things this way must hold an exclusive lock on the
> > relation.  But that doesn't guarantee that the relcache entry is
> > completely stable,
> 
> It ABSOLUTELY is not safe.  Relcache flushes can happen regardless of
> how strong a lock you hold.
> 
>             regards, tom lane

Ugh. Yes, relcache invalidation happens anytime and it resets the
added values. pg_stat_info deceived me that it can store
transient values. But I  came up with another thought.

The reason I proposed it was I thought that hash_search for every
buffer is not good. Instead, like pg_stat_info, we can link the
pending-sync hash entry to Relation. This greately reduces the
frequency of hash-searching.

I'll post new patch in this way soon.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

11 апреля 2017 г., 14:33:41

At Tue, 11 Apr 2017 09:56:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170411.095606.245908357.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello, thank you for looking this.
> 
> At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <27309.1491611915@sss.pgh.pa.us>
> > Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > > Interesting.  I wonder if it's possible that a relcache invalidation
> > > would cause these values to get lost for some reason, because that would
> > > be dangerous.
> > 
> > > I suppose the rationale is that this shouldn't happen because any
> > > operation that does things this way must hold an exclusive lock on the
> > > relation.  But that doesn't guarantee that the relcache entry is
> > > completely stable,
> > 
> > It ABSOLUTELY is not safe.  Relcache flushes can happen regardless of
> > how strong a lock you hold.
> > 
> >             regards, tom lane
> 
> Ugh. Yes, relcache invalidation happens anytime and it resets the
> added values. pg_stat_info deceived me that it can store
> transient values. But I  came up with another thought.
> 
> The reason I proposed it was I thought that hash_search for every
> buffer is not good. Instead, like pg_stat_info, we can link the

buffer => buffer modification

> pending-sync hash entry to Relation. This greately reduces the
> frequency of hash-searching.
> 
> I'll post new patch in this way soon.

Here it is.

- Relation has new members no_pending_sync and pending_sync that works as instant cache of an entry in pendingSync
hash.

- Commit-time synchronizing is restored as Michael's patch.

- If relfilenode is replaced, pending_sync for the old node is removed. Anyway this is ignored on abort and meaningless
oncommit.
 

- TAP test is renamed to 012 since some new files have been added.

Accessing pending sync hash occured on every calling of
HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
accessing relations has pending sync.  Almost of them are
eliminated as the result.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0c3e2b0..aa1b97d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@ *      the POSTGRES heap access method used for all POSTGRES *      relations. *
+ * WAL CONSIDERATIONS
+ *      All heap operations are normally WAL-logged. but there are a few
+ *      exceptions. Temporary and unlogged relations never need to be
+ *      WAL-logged, but we can also skip WAL-logging for a table that was
+ *      created in the same transaction, if we don't need WAL for PITR or
+ *      WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *      the file to disk at COMMIT instead.
+ *
+ *      The same-relation optimization is not employed automatically on all
+ *      updates to a table that was created in the same transacton, because
+ *      for a small number of changes, it's cheaper to just create the WAL
+ *      records than fsyncing() the whole relation at COMMIT. It is only
+ *      worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *      or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *      operation; it will cause any subsequent updates to the table to skip
+ *      WAL-logging, if possible, and cause the heap to be synced to disk at
+ *      COMMIT.
+ *
+ *      To make that work, all modifications to heap must use
+ *      HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ *      for the given block.
+ * *------------------------------------------------------------------------- */#include "postgres.h"
@@ -56,6 +78,7 @@#include "access/xlogutils.h"#include "catalog/catalog.h"#include "catalog/namespace.h"
+#include "catalog/storage.h"#include "miscadmin.h"#include "pgstat.h"#include "storage/bufmgr.h"
@@ -2356,12 +2379,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate) * The new tuple is stamped with current
transactionID and the specified * command ID. *
 
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- * * The HEAP_INSERT_SKIP_FSM option is passed directly to * RelationGetBufferForTuple, which see for more info. *
@@ -2465,7 +2482,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,    MarkBufferDirty(buffer);    /*
XLOGstuff */
 
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_insert xlrec;        xl_heap_header xlhdr;
@@ -2664,12 +2681,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,    int            ndone;
char       *scratch = NULL;    Page        page;
 
-    bool        needwal;    Size        saveFreeSpace;    bool        need_tuple_data =
RelationIsLogicallyLogged(relation);   bool        need_cids = RelationIsAccessibleInLogicalDecoding(relation);
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);    saveFreeSpace =
RelationGetTargetPageFreeSpace(relation,                                                  HEAP_DEFAULT_FILLFACTOR);
 
@@ -2684,7 +2699,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,     * palloc() within a
criticalsection is not safe, so we allocate this     * beforehand.     */
 
-    if (needwal)
+    if (RelationNeedsWAL(relation))        scratch = palloc(BLCKSZ);    /*
@@ -2719,6 +2734,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,        Buffer
vmbuffer= InvalidBuffer;        bool        all_visible_cleared = false;        int            nthispage;
 
+        bool        needwal;        CHECK_FOR_INTERRUPTS();
@@ -2730,6 +2746,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
            InvalidBuffer, options, bistate,                                           &vmbuffer, NULL);        page =
BufferGetPage(buffer);
+        needwal = BufferNeedsWAL(relation, buffer);        /* NO EREPORT(ERROR) from here till changes are logged */
    START_CRIT_SECTION();
 
@@ -3286,7 +3303,7 @@ l1:     * NB: heap_abort_speculative() uses the same xlog record and replay     * routines.
*/
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_delete xlrec;        XLogRecPtr    recptr;
@@ -4250,7 +4267,8 @@ l2:    MarkBufferDirty(buffer);    /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer) ||
+        BufferNeedsWAL(relation, newbuf))    {        XLogRecPtr    recptr;
@@ -5141,7 +5159,7 @@ failed:     * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG     *
entriesfor everything anyway.)     */
 
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, *buffer))    {        xl_heap_lock xlrec;        XLogRecPtr    recptr;
@@ -5843,7 +5861,7 @@ l4:        MarkBufferDirty(buf);        /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf))        {            xl_heap_lock_updated xlrec;            XLogRecPtr
recptr;
@@ -5998,7 +6016,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)    htup->t_ctid = tuple->t_self;
/*XLOG stuff */
 
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_confirm xlrec;        XLogRecPtr    recptr;
@@ -6131,7 +6149,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)     * The WAL records generated here
matchheap_delete().  The same recovery     * routines are used.     */
 
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_delete xlrec;        XLogRecPtr    recptr;
@@ -6240,7 +6258,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)    MarkBufferDirty(buffer);    /* XLOG
stuff*/
 
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_inplace xlrec;        XLogRecPtr    recptr;
@@ -7354,7 +7372,7 @@ log_heap_clean(Relation reln, Buffer buffer,    XLogRecPtr    recptr;    /* Caller should not
callme on a non-WAL-logged relation */
 
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));    xlrec.latestRemovedXid = latestRemovedXid;    xlrec.nredirected =
nredirected;
@@ -7402,7 +7420,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,    XLogRecPtr    recptr;
 /* Caller should not call me on a non-WAL-logged relation */
 
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));    /* nor when there are no tuples to freeze */    Assert(ntuples > 0);
@@ -7487,7 +7505,7 @@ log_heap_update(Relation reln, Buffer oldbuf,    int            bufflags;    /* Caller should not
callme on a non-WAL-logged relation */
 
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));    XLogBeginInsert();
@@ -7590,76 +7608,86 @@ log_heap_update(Relation reln, Buffer oldbuf,    xlrec.new_offnum =
ItemPointerGetOffsetNumber(&newtup->t_self);   xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
 
+    XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
+    bufflags = REGBUF_STANDARD;    if (init)        bufflags |= REGBUF_WILL_INIT;    if (need_tuple_data)
bufflags|= REGBUF_KEEP_DATA;
 
-    XLogRegisterBuffer(0, newbuf, bufflags);
-    if (oldbuf != newbuf)
-        XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
-
-    XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
-    /*     * Prepare WAL data for the new tuple.     */
-    if (prefixlen > 0 || suffixlen > 0)
+    if (BufferNeedsWAL(reln, newbuf))    {
-        if (prefixlen > 0 && suffixlen > 0)
-        {
-            prefix_suffix[0] = prefixlen;
-            prefix_suffix[1] = suffixlen;
-            XLogRegisterBufData(0, (char *) &prefix_suffix, sizeof(uint16) * 2);
-        }
-        else if (prefixlen > 0)
-        {
-            XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
-        }
-        else
+        XLogRegisterBuffer(0, newbuf, bufflags);
+
+        if ((prefixlen > 0 || suffixlen > 0))        {
-            XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+            if (prefixlen > 0 && suffixlen > 0)
+            {
+                prefix_suffix[0] = prefixlen;
+                prefix_suffix[1] = suffixlen;
+                XLogRegisterBufData(0, (char *) &prefix_suffix,
+                                    sizeof(uint16) * 2);
+            }
+            else if (prefixlen > 0)
+            {
+                XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
+            }
+            else
+            {
+                XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+            }        }
-    }
-    xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
-    xlhdr.t_infomask = newtup->t_data->t_infomask;
-    xlhdr.t_hoff = newtup->t_data->t_hoff;
-    Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
+        xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
+        xlhdr.t_infomask = newtup->t_data->t_infomask;
+        xlhdr.t_hoff = newtup->t_data->t_hoff;
+        Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
-    /*
-     * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
-     *
-     * The 'data' doesn't include the common prefix or suffix.
-     */
-    XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
-    if (prefixlen == 0)
-    {
-        XLogRegisterBufData(0,
-                            ((char *) newtup->t_data) + SizeofHeapTupleHeader,
-                          newtup->t_len - SizeofHeapTupleHeader - suffixlen);
-    }
-    else
-    {        /*
-         * Have to write the null bitmap and data after the common prefix as
-         * two separate rdata entries.
+         * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
+         *
+         * The 'data' doesn't include the common prefix or suffix.         */
-        /* bitmap [+ padding] [+ oid] */
-        if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+        XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
+        if (prefixlen == 0)        {            XLogRegisterBufData(0,                           ((char *)
newtup->t_data)+ SizeofHeapTupleHeader,
 
-                             newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+                          newtup->t_len - SizeofHeapTupleHeader - suffixlen);        }
+        else
+        {
+            /*
+             * Have to write the null bitmap and data after the common prefix
+             * as two separate rdata entries.
+             */
+            /* bitmap [+ padding] [+ oid] */
+            if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+            {
+                XLogRegisterBufData(0,
+                           ((char *) newtup->t_data) + SizeofHeapTupleHeader,
+                             newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+            }
-        /* data after common prefix */
-        XLogRegisterBufData(0,
+            /* data after common prefix */
+            XLogRegisterBufData(0,              ((char *) newtup->t_data) + newtup->t_data->t_hoff + prefixlen,
    newtup->t_len - newtup->t_data->t_hoff - prefixlen - suffixlen);
 
+        }    }
+    /*
+     * If the old and new tuple are on different pages, also register the old
+     * page, so that a full-page image is created for it if necessary. We
+     * don't need any extra information to replay changes to it.
+     */
+    if (oldbuf != newbuf && BufferNeedsWAL(reln, oldbuf))
+        XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
+    /* We need to log a tuple identity */    if (need_tuple_data && old_key_tuple)    {
@@ -8578,8 +8606,13 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)     */    /* Deal with old tuple
version*/
 
-    oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
-                                      &obuffer);
+    if (oldblk == newblk)
+        oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+    else if (XLogRecHasBlockRef(record, 1))
+        oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+    else
+        oldaction = BLK_DONE;
+    if (oldaction == BLK_NEEDS_REDO)    {        page = BufferGetPage(obuffer);
@@ -8633,6 +8666,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)        PageInit(page,
BufferGetPageSize(nbuffer),0);        newaction = BLK_NEEDS_REDO;    }
 
+    else if (!XLogRecHasBlockRef(record, 0))
+        newaction = BLK_DONE;    else        newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
@@ -9069,9 +9104,16 @@ heap2_redo(XLogReaderState *record) *    heap_sync        - sync a heap, for use when no WAL has
beenwritten * * This forces the heap contents (including TOAST heap if any) down to disk.
 
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead. * * Indexes are not touched.  (Currently, index operations associated with * the
commandsthat use this are WAL-logged and so do not need fsync.
 
@@ -9181,3 +9223,33 @@ heap_mask(char *pagedata, BlockNumber blkno)        }    }}
+
+/*
+ *    heap_register_sync    - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RecordPendingSync(rel);
+    if (OidIsValid(rel->rd_rel->reltoastrelid))
+    {
+        Relation    toastrel;
+
+        toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+        RecordPendingSync(toastrel);
+        heap_close(toastrel, AccessShareLock);
+    }
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index d69a266..4754278 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@#include "access/htup_details.h"#include "access/xlog.h"#include "catalog/catalog.h"
+#include "catalog/storage.h"#include "miscadmin.h"#include "pgstat.h"#include "storage/bufmgr.h"
@@ -260,7 +261,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,        /*         *
Emita WAL HEAP_CLEAN record showing what we did         */
 
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))        {            XLogRecPtr    recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d7f65a5..6462f44 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -649,9 +649,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)    }    else if (HeapTupleHasExternal(tup) ||
tup->t_len> TOAST_TUPLE_THRESHOLD)        heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
 
-                                         HEAP_INSERT_SKIP_FSM |
-                                         (state->rs_use_wal ?
-                                          0 : HEAP_INSERT_SKIP_WAL));
+                                         HEAP_INSERT_SKIP_FSM);    else        heaptup = tup;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e5616ce..933fa9c 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@#include "access/heapam_xlog.h"#include "access/visibilitymap.h"#include "access/xlog.h"
+#include "catalog/storage.h"#include "miscadmin.h"#include "storage/bufmgr.h"#include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,        map[mapByte] |= (flags
<<mapOffset);        MarkBufferDirty(vmBuf);
 
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, heapBuf))        {            if (XLogRecPtrIsInvalid(recptr))            {
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 92b263a..361b50d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2007,6 +2007,9 @@ CommitTransaction(void)    /* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+    /*     * Mark serializable transaction as complete for predicate locking     * purposes.  This should be done as
lateas we can put it and still allow
 
@@ -2238,6 +2241,9 @@ PrepareTransaction(void)    /* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+    /*     * Mark serializable transaction as complete for predicate locking     * purposes.  This should be done as
lateas we can put it and still allow
 
@@ -2545,6 +2551,7 @@ AbortTransaction(void)    AtAbort_Notify();    AtEOXact_RelationMap(false);
AtAbort_Twophase();
+    smgrDoPendingSyncs(false);    /* abandon pending syncs */    /*     * Advertise the fact that we aborted in
pg_xact(assuming that we got as
 
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index f677916..1234325 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,6 +29,7 @@#include "catalog/storage_xlog.h"#include "storage/freespace.h"#include "storage/smgr.h"
+#include "utils/hsearch.h"#include "utils/memutils.h"#include "utils/rel.h"
@@ -64,6 +65,49 @@ typedef struct PendingRelDeletestatic PendingRelDelete *pendingDeletes = NULL; /* head of linked
list*//*
 
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+    RelFileNode relnode;        /* relation created in same xact */
+    BlockNumber sync_above;        /* WAL-logging skipped for blocks >=
+                                 * sync_above */
+    BlockNumber truncated_to;    /* truncation WAL record was written */
+}    PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
+/* * RelationCreateStorage *        Create physical storage for a relation. *
@@ -116,6 +160,14 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)    pending->nestLevel =
GetCurrentTransactionNestLevel();   pending->next = pendingDeletes;    pendingDeletes = pending;
 
+
+    /* pending sync on this file is no longer needed */
+    if (pendingSyncs)
+    {
+        bool found;
+
+        hash_search(pendingSyncs, (void *) &rnode, HASH_REMOVE, &found);
+    }}/*
@@ -226,6 +278,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)voidRelationTruncate(Relation rel,
BlockNumbernblocks){
 
+    PendingRelSync *pending = NULL;
+    bool        found;    bool        fsm;    bool        vm;
@@ -260,37 +314,78 @@ RelationTruncate(Relation rel, BlockNumber nblocks)     */    if (RelationNeedsWAL(rel))    {
-        /*
-         * Make an XLOG entry reporting the file truncation.
-         */
-        XLogRecPtr    lsn;
-        xl_smgr_truncate xlrec;
-
-        xlrec.blkno = nblocks;
-        xlrec.rnode = rel->rd_node;
-        xlrec.flags = SMGR_TRUNCATE_ALL;
-
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+        if (!rel->pending_sync)
+        {
+            if (!pendingSyncs)
+                createPendingSyncsHash();
+            pending = (PendingRelSync *) hash_search(pendingSyncs,
+                                                 (void *) &rel->rd_node,
+                                                 HASH_ENTER, &found);
+            if (!found)
+            {
+                pending->sync_above = InvalidBlockNumber;
+                pending->truncated_to = InvalidBlockNumber;
+            }
-        lsn = XLogInsert(RM_SMGR_ID,
-                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+            rel->pending_sync = pending;
+        }
-        /*
-         * Flush, because otherwise the truncation of the main relation might
-         * hit the disk before the WAL record, and the truncation of the FSM
-         * or visibility map. If we crashed during that window, we'd be left
-         * with a truncated heap, but the FSM or visibility map would still
-         * contain entries for the non-existent heap pages.
-         */
-        if (fsm || vm)
-            XLogFlush(lsn);
+        if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+            rel->pending_sync->sync_above < nblocks)
+        {
+            /*
+             * Make an XLOG entry reporting the file truncation.
+             */
+            XLogRecPtr        lsn;
+            xl_smgr_truncate xlrec;
+
+            xlrec.blkno = nblocks;
+            xlrec.rnode = rel->rd_node;
+
+            XLogBeginInsert();
+            XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+            lsn = XLogInsert(RM_SMGR_ID,
+                             XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+            elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+                 nblocks);
+
+            /*
+             * Flush, because otherwise the truncation of the main relation
+             * might hit the disk before the WAL record, and the truncation of
+             * the FSM or visibility map. If we crashed during that window,
+             * we'd be left with a truncated heap, but the FSM or visibility
+             * map would still contain entries for the non-existent heap
+             * pages.
+             */
+            if (fsm || vm)
+                XLogFlush(lsn);
+
+            rel->pending_sync->truncated_to = nblocks;
+        }    }    /* Do the real work */    smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);}
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+    /* First time through: initialize the hash table */
+    HASHCTL        ctl;
+
+    MemSet(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(RelFileNode);
+    ctl.entrysize = sizeof(PendingRelSync);
+    ctl.hash = tag_hash;
+    pendingSyncs = hash_create("pending relation sync table", 5,
+                               &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+/* *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact. *
@@ -419,6 +514,156 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)    return nrels;}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+    bool found = true;
+    BlockNumber nblocks;
+
+    Assert(RelationNeedsWAL(rel));
+
+    if (!rel->pending_sync)
+    {
+        if (!pendingSyncs)
+            createPendingSyncsHash();
+
+        /* Look up or create an entry */
+        rel->pending_sync =
+            (PendingRelSync *) hash_search(pendingSyncs,
+                                           (void *) &rel->rd_node,
+                                           HASH_ENTER, &found);
+    }        
+
+    nblocks = RelationGetNumberOfBlocks(rel);
+    if (!found)
+    {
+        rel->pending_sync->truncated_to = InvalidBlockNumber;
+        rel->pending_sync->sync_above = nblocks;
+
+        elog(DEBUG2,
+             "registering new pending sync for rel %u/%u/%u at block %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             nblocks);
+
+    }
+    else if (rel->pending_sync->sync_above == InvalidBlockNumber)
+    {
+        elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             nblocks);
+        rel->pending_sync->sync_above = nblocks;
+    }
+    else
+        elog(DEBUG2,
+             "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             rel->pending_sync->sync_above, nblocks);
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+    BlockNumber blkno = InvalidBlockNumber;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    /* WAL is needed if no pending syncs */
+    if (!rel->pending_sync)
+    {
+        bool found = false;
+
+        if (!pendingSyncs)
+            return true;
+
+        rel->pending_sync =
+            (PendingRelSync *) hash_search(pendingSyncs,
+                                           (void *) &rel->rd_node,
+                                           HASH_FIND, &found);
+        if (!found)
+            return true;
+    }
+        
+    blkno = BufferGetBlockNumber(buf);
+    if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+        rel->pending_sync->sync_above > blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno, rel->pending_sync->sync_above);
+        return true;
+    }
+
+    /*
+     * We have emitted a truncation record for this block.
+     */
+    if (rel->pending_sync->truncated_to != InvalidBlockNumber &&
+        rel->pending_sync->truncated_to <= blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same
xact",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno);
+        return true;
+    }
+
+    elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+         blkno);
+
+    return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    if (!pendingSyncs)
+        return;
+
+    if (isCommit)
+    {
+        HASH_SEQ_STATUS status;
+        PendingRelSync *pending;
+
+        hash_seq_init(&status, pendingSyncs);
+
+        while ((pending = hash_seq_search(&status)) != NULL)
+        {
+            if (pending->sync_above != InvalidBlockNumber)
+            {
+                FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+                smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+                elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+                     pending->relnode.dbNode, pending->relnode.relNode);
+            }
+        }
+    }
+
+    hash_destroy(pendingSyncs);
+    pendingSyncs = NULL;
+}
+/* *    PostPrepare_smgr -- Clean up after a successful PREPARE *
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b5af2be..8aa7e7b 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2372,8 +2372,7 @@ CopyFrom(CopyState cstate)     *    - data is being written to relfilenode created in this
transaction    * then we can skip writing WAL.  It's safe because if the transaction     * doesn't commit, we'll
discardthe table (or the new relfilenode file).
 
-     * If it does commit, we'll have done the heap_sync at the bottom of this
-     * routine first.
+     * If it does commit, commit will do heap_sync().     *     * As mentioned in comments in utils/rel.h, the
in-same-transactiontest     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
 
@@ -2405,7 +2404,7 @@ CopyFrom(CopyState cstate)    {        hi_options |= HEAP_INSERT_SKIP_FSM;        if
(!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(cstate->rel);    }    /*
@@ -2782,11 +2781,11 @@ CopyFrom(CopyState cstate)    FreeExecutorState(estate);    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway)
+     * If we skipped writing WAL, then we will sync the heap at the end of
+     * the transaction. (We used to do it here, but it was later found out
+     * that to be safe, we must also avoid WAL-logging any subsequent
+     * actions on the pages we skipped WAL for). Indexes always use WAL.     */
-    if (hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(cstate->rel);    return processed;}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 06425cc..408495e 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -567,8 +567,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)     * We can skip
WAL-loggingthe insertions, unless PITR or streaming     * replication is in use. We can skip the FSM in any case.
*/
-    myState->hi_options = HEAP_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+    if (!XLogIsNeeded())
+        heap_register_sync(intoRelationDesc);
+    myState->hi_options = HEAP_INSERT_SKIP_FSM;    myState->bistate = GetBulkInsertState();    /* Not using WAL
requiressmgr_targblock be initially invalid */
 
@@ -617,9 +618,7 @@ intorel_shutdown(DestReceiver *self)    FreeBulkInsertState(myState->bistate);
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->rel);
+    /* If we skipped using WAL, we will sync the relation at commit */    /* close rel, but keep lock until commit */
 heap_close(myState->rel, NoLock);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 9ffd91e..8b127e3 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -477,7 +477,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)     */
myState->hi_options= HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;    if (!XLogIsNeeded())
 
-        myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+        heap_register_sync(transientrel);    myState->bistate = GetBulkInsertState();    /* Not using WAL requires
smgr_targblockbe initially invalid */
 
@@ -520,9 +520,7 @@ transientrel_shutdown(DestReceiver *self)    FreeBulkInsertState(myState->bistate);
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->transientrel);
+    /* If we skipped using WAL, we will sync the relation at commit */    /* close transientrel, but keep lock until
commit*/    heap_close(myState->transientrel, NoLock);
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index abb262b..ae69954 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4327,8 +4327,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)        bistate =
GetBulkInsertState();       hi_options = HEAP_INSERT_SKIP_FSM;
 
+        if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(newrel);    }    else    {
@@ -4589,8 +4590,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);       /* If we skipped writing WAL, then we need to sync the heap. */
 
-        if (hi_options & HEAP_INSERT_SKIP_WAL)
-            heap_sync(newrel);        heap_close(newrel, NoLock);    }
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 5b43a66..f3dcf6e 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -893,7 +893,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,                 * page has
beenpreviously WAL-logged, and if not, do that                 * now.                 */
 
-                if (RelationNeedsWAL(onerel) &&
+                if (BufferNeedsWAL(onerel, buf) &&                    PageGetLSN(page) == InvalidXLogRecPtr)
        log_newpage_buffer(buf, true);
 
@@ -1120,7 +1120,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,            }            /*
NowWAL-log freezing if necessary */
 
-            if (RelationNeedsWAL(onerel))
+            if (BufferNeedsWAL(onerel, buf))            {                XLogRecPtr    recptr;
@@ -1480,7 +1480,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,    MarkBufferDirty(buffer);
/* XLOG stuff */
 
-    if (RelationNeedsWAL(onerel))
+    if (BufferNeedsWAL(onerel, buffer))    {        XLogRecPtr    recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 2109cbf..e991e9f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,            BufferAccessStrategy strategy,
  bool *foundPtr);static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
 
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);static void AtProcExit_Buffers(int code,
Datumarg);static void CheckForBufferLeaks(void);static int    rnode_comparator(const void *p1, const void *p2);
 
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)voidFlushRelationBuffers(Relation rel){
-    int            i;
-    BufferDesc *bufHdr;
-    /* Open rel at the smgr level if not already done */    RelationOpenSmgr(rel);
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int            i;
+    BufferDesc *bufHdr;
+
+    if (islocal)    {        for (i = 0; i < NLocBuffer; i++)        {            uint32        buf_state;
bufHdr= GetLocalBufferDescriptor(i);
 
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&                ((buf_state =
pg_atomic_read_u32(&bufHdr->state))&                 (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))            {
 
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)                PageSetChecksumInplace(localpage,
bufHdr->tag.blockNum);
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,                          bufHdr->tag.forkNum,
bufHdr->tag.blockNum,                         localpage,
 
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)         * As in DropRelFileNodeBuffers, an unlocked precheck
shouldbe safe         * and saves some cycles.         */
 
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))            continue;        ReservePrivateRefCountEntry();
    buf_state = LockBufHdr(bufHdr);
 
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&            (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID
|BM_DIRTY))        {            PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),LW_SHARED);
 
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);            LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr,true);        }
 
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index ddb9485..61ff7eb 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -72,6 +72,7 @@#include "optimizer/var.h"#include "rewrite/rewriteDefine.h"#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"#include "storage/lmgr.h"#include "storage/smgr.h"#include "utils/array.h"
@@ -418,6 +419,9 @@ AllocateRelationDesc(Form_pg_class relp)    /* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount= 1;
 
+    /* pending_sync is set as required later */
+    relation->pending_sync = NULL;
+    MemoryContextSwitchTo(oldcxt);    return relation;
@@ -3353,6 +3357,8 @@ RelationBuildLocalRelation(const char *relname,    else        rel->rd_rel->relfilenode =
relfilenode;
+    rel->pending_sync = NULL;
+    RelationInitLockInfo(rel);    /* see lmgr.c */    RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 7e85510..3967641 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    0x0001
-#define HEAP_INSERT_SKIP_FSM    0x0002
-#define HEAP_INSERT_FROZEN        0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM    0x0001
+#define HEAP_INSERT_FROZEN        0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004typedef struct BulkInsertStateData *BulkInsertState;
@@ -178,6 +177,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);extern void
simple_heap_update(Relationrelation, ItemPointer otid,                   HeapTuple tup);
 
+extern void heap_register_sync(Relation relation);extern void heap_sync(Relation relation);extern void
heap_update_snapshot(HeapScanDescscan, Snapshot snapshot);
 
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index fea96de..e8e49f1 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -29,6 +29,9 @@ extern void RelationTruncate(Relation rel, BlockNumber nblocks); */extern void
smgrDoPendingDeletes(boolisCommit);extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
 
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);extern void AtSubCommit_smgr(void);extern void
AtSubAbort_smgr(void);externvoid PostPrepare_smgr(void);
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 07a32d6..6ec2d26 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);extern void FlushOneBuffer(Buffer buffer);extern void FlushRelationBuffers(Relation rel);
 
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                    bool islocal);extern void FlushDatabaseBuffers(Oid dbid);extern void
DropRelFileNodeBuffers(RelFileNodeBackendrnode,                       ForkNumber forkNum, BlockNumber firstDelBlock);
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index ab875bb..f802cc1 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -216,6 +216,8 @@ typedef struct RelationData    /* use "struct" here to avoid needing to include pgstat.h: */
structPgStat_TableStatus *pgstat_info;        /* statistics collection area */
 
+
+    struct PendingRelSync *pending_sync;} RelationData;
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
deleted file mode 100644
index ccd5943..0000000
--- a/src/test/recovery/t/001_stream_rep.pl
+++ /dev/null
@@ -1,230 +0,0 @@
-# Minimal test testing streaming replication
-use strict;
-use warnings;
-use PostgresNode;
-use TestLib;
-use Test::More tests => 28;
-
-# Initialize master node
-my $node_master = get_new_node('master');
-$node_master->init(allows_streaming => 1);
-$node_master->start;
-my $backup_name = 'my_backup';
-
-# Take backup
-$node_master->backup($backup_name);
-
-# Create streaming standby linking to master
-my $node_standby_1 = get_new_node('standby_1');
-$node_standby_1->init_from_backup($node_master, $backup_name,
-    has_streaming => 1);
-$node_standby_1->start;
-
-# Take backup of standby 1 (not mandatory, but useful to check if
-# pg_basebackup works on a standby).
-$node_standby_1->backup($backup_name);
-
-# Take a second backup of the standby while the master is offline.
-$node_master->stop;
-$node_standby_1->backup('my_backup_2');
-$node_master->start;
-
-# Create second standby node linking to standby 1
-my $node_standby_2 = get_new_node('standby_2');
-$node_standby_2->init_from_backup($node_standby_1, $backup_name,
-    has_streaming => 1);
-$node_standby_2->start;
-
-# Create some content on master and check its presence in standby 1
-$node_master->safe_psql('postgres',
-    "CREATE TABLE tab_int AS SELECT generate_series(1,1002) AS a");
-
-# Wait for standbys to catch up
-$node_master->wait_for_catchup($node_standby_1, 'replay', $node_master->lsn('insert'));
-$node_standby_1->wait_for_catchup($node_standby_2, 'replay', $node_standby_1->lsn('replay'));
-
-my $result =
-  $node_standby_1->safe_psql('postgres', "SELECT count(*) FROM tab_int");
-print "standby 1: $result\n";
-is($result, qq(1002), 'check streamed content on standby 1');
-
-$result =
-  $node_standby_2->safe_psql('postgres', "SELECT count(*) FROM tab_int");
-print "standby 2: $result\n";
-is($result, qq(1002), 'check streamed content on standby 2');
-
-# Check that only READ-only queries can run on standbys
-is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
-    3, 'read-only queries on standby 1');
-is($node_standby_2->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
-    3, 'read-only queries on standby 2');
-
-# Tests for connection parameter target_session_attrs
-note "testing connection parameter \"target_session_attrs\"";
-
-# Routine designed to run tests on the connection parameter
-# target_session_attrs with multiple nodes.
-sub test_target_session_attrs
-{
-    my $node1 = shift;
-    my $node2 = shift;
-    my $target_node = shift;
-    my $mode = shift;
-    my $status = shift;
-
-    my $node1_host = $node1->host;
-    my $node1_port = $node1->port;
-    my $node1_name = $node1->name;
-    my $node2_host = $node2->host;
-    my $node2_port = $node2->port;
-    my $node2_name = $node2->name;
-
-    my $target_name = $target_node->name;
-
-    # Build connection string for connection attempt.
-    my $connstr = "host=$node1_host,$node2_host ";
-    $connstr .= "port=$node1_port,$node2_port ";
-    $connstr .= "target_session_attrs=$mode";
-
-    # The client used for the connection does not matter, only the backend
-    # point does.
-    my ($ret, $stdout, $stderr) =
-        $node1->psql('postgres', 'SHOW port;', extra_params => ['-d', $connstr]);
-    is($status == $ret && $stdout eq $target_node->port, 1,
-       "connect to node $target_name if mode \"$mode\" and $node1_name,$node2_name listed");
-}
-
-# Connect to master in "read-write" mode with master,standby1 list.
-test_target_session_attrs($node_master, $node_standby_1, $node_master,
-                          "read-write", 0);
-# Connect to master in "read-write" mode with standby1,master list.
-test_target_session_attrs($node_standby_1, $node_master, $node_master,
-                          "read-write", 0);
-# Connect to master in "any" mode with master,standby1 list.
-test_target_session_attrs($node_master, $node_standby_1, $node_master,
-                          "any", 0);
-# Connect to standby1 in "any" mode with standby1,master list.
-test_target_session_attrs($node_standby_1, $node_master, $node_standby_1,
-                          "any", 0);
-
-note "switching to physical replication slot";
-# Switch to using a physical replication slot. We can do this without a new
-# backup since physical slots can go backwards if needed. Do so on both
-# standbys. Since we're going to be testing things that affect the slot state,
-# also increase the standby feedback interval to ensure timely updates.
-my ($slotname_1, $slotname_2) = ('standby_1', 'standby_2');
-$node_master->append_conf('postgresql.conf', "max_replication_slots = 4\n");
-$node_master->restart;
-is($node_master->psql('postgres', qq[SELECT pg_create_physical_replication_slot('$slotname_1');]), 0, 'physical slot
createdon master');
 
-$node_standby_1->append_conf('recovery.conf', "primary_slot_name = $slotname_1\n");
-$node_standby_1->append_conf('postgresql.conf', "wal_receiver_status_interval = 1\n");
-$node_standby_1->append_conf('postgresql.conf', "max_replication_slots = 4\n");
-$node_standby_1->restart;
-is($node_standby_1->psql('postgres', qq[SELECT pg_create_physical_replication_slot('$slotname_2');]), 0, 'physical
slotcreated on intermediate replica');
 
-$node_standby_2->append_conf('recovery.conf', "primary_slot_name = $slotname_2\n");
-$node_standby_2->append_conf('postgresql.conf', "wal_receiver_status_interval = 1\n");
-$node_standby_2->restart;
-
-sub get_slot_xmins
-{
-    my ($node, $slotname) = @_;
-    my $slotinfo = $node->slot($slotname);
-    return ($slotinfo->{'xmin'}, $slotinfo->{'catalog_xmin'});
-}
-
-# There's no hot standby feedback and there are no logical slots on either peer
-# so xmin and catalog_xmin should be null on both slots.
-my ($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
-is($xmin, '', 'non-cascaded slot xmin null with no hs_feedback');
-is($catalog_xmin, '', 'non-cascaded slot xmin null with no hs_feedback');
-
-($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
-is($xmin, '', 'cascaded slot xmin null with no hs_feedback');
-is($catalog_xmin, '', 'cascaded slot xmin null with no hs_feedback');
-
-# Replication still works?
-$node_master->safe_psql('postgres', 'CREATE TABLE replayed(val integer);');
-
-sub replay_check
-{
-    my $newval = $node_master->safe_psql('postgres', 'INSERT INTO replayed(val) SELECT coalesce(max(val),0) + 1 AS
newvalFROM replayed RETURNING val');
 
-    $node_master->wait_for_catchup($node_standby_1, 'replay', $node_master->lsn('insert'));
-    $node_standby_1->wait_for_catchup($node_standby_2, 'replay', $node_standby_1->lsn('replay'));
-    $node_standby_1->safe_psql('postgres', qq[SELECT 1 FROM replayed WHERE val = $newval])
-        or die "standby_1 didn't replay master value $newval";
-    $node_standby_2->safe_psql('postgres', qq[SELECT 1 FROM replayed WHERE val = $newval])
-        or die "standby_2 didn't replay standby_1 value $newval";
-}
-
-replay_check();
-
-note "enabling hot_standby_feedback";
-# Enable hs_feedback. The slot should gain an xmin. We set the status interval
-# so we'll see the results promptly.
-$node_standby_1->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on;');
-$node_standby_1->reload;
-$node_standby_2->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on;');
-$node_standby_2->reload;
-replay_check();
-sleep(2);
-
-($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
-isnt($xmin, '', 'non-cascaded slot xmin non-null with hs feedback');
-is($catalog_xmin, '', 'non-cascaded slot xmin still null with hs_feedback');
-
-($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
-isnt($xmin, '', 'cascaded slot xmin non-null with hs feedback');
-is($catalog_xmin, '', 'cascaded slot xmin still null with hs_feedback');
-
-note "doing some work to advance xmin";
-for my $i (10000..11000) {
-    $node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES ($i);]);
-}
-$node_master->safe_psql('postgres', 'VACUUM;');
-$node_master->safe_psql('postgres', 'CHECKPOINT;');
-
-my ($xmin2, $catalog_xmin2) = get_slot_xmins($node_master, $slotname_1);
-note "new xmin $xmin2, old xmin $xmin";
-isnt($xmin2, $xmin, 'non-cascaded slot xmin with hs feedback has changed');
-is($catalog_xmin2, '', 'non-cascaded slot xmin still null with hs_feedback unchanged');
-
-($xmin2, $catalog_xmin2) = get_slot_xmins($node_standby_1, $slotname_2);
-note "new xmin $xmin2, old xmin $xmin";
-isnt($xmin2, $xmin, 'cascaded slot xmin with hs feedback has changed');
-is($catalog_xmin2, '', 'cascaded slot xmin still null with hs_feedback unchanged');
-
-note "disabling hot_standby_feedback";
-# Disable hs_feedback. Xmin should be cleared.
-$node_standby_1->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off;');
-$node_standby_1->reload;
-$node_standby_2->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off;');
-$node_standby_2->reload;
-replay_check();
-sleep(2);
-
-($xmin, $catalog_xmin) = get_slot_xmins($node_master, $slotname_1);
-is($xmin, '', 'non-cascaded slot xmin null with hs feedback reset');
-is($catalog_xmin, '', 'non-cascaded slot xmin still null with hs_feedback reset');
-
-($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
-is($xmin, '', 'cascaded slot xmin null with hs feedback reset');
-is($catalog_xmin, '', 'cascaded slot xmin still null with hs_feedback reset');
-
-note "re-enabling hot_standby_feedback and disabling while stopped";
-$node_standby_2->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on;');
-$node_standby_2->reload;
-
-$node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES (11000);]);
-replay_check();
-
-$node_standby_2->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off;');
-$node_standby_2->stop;
-
-($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
-isnt($xmin, '', 'cascaded slot xmin non-null with postgres shut down');
-
-# Xmin from a previous run should be cleared on startup.
-$node_standby_2->start;
-
-($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
-is($xmin, '', 'cascaded slot xmin reset after startup with hs feedback reset');
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
deleted file mode 100644
index 83b43bf..0000000
--- a/src/test/recovery/t/002_archiving.pl
+++ /dev/null
@@ -1,53 +0,0 @@
-# test for archiving with hot standby
-use strict;
-use warnings;
-use PostgresNode;
-use TestLib;
-use Test::More tests => 1;
-use File::Copy;
-
-# Initialize master node, doing archives
-my $node_master = get_new_node('master');
-$node_master->init(
-    has_archiving    => 1,
-    allows_streaming => 1);
-my $backup_name = 'my_backup';
-
-# Start it
-$node_master->start;
-
-# Take backup for slave
-$node_master->backup($backup_name);
-
-# Initialize standby node from backup, fetching WAL from archives
-my $node_standby = get_new_node('standby');
-$node_standby->init_from_backup($node_master, $backup_name,
-    has_restoring => 1);
-$node_standby->append_conf(
-    'postgresql.conf', qq(
-wal_retrieve_retry_interval = '100ms'
-));
-$node_standby->start;
-
-# Create some content on master
-$node_master->safe_psql('postgres',
-    "CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a");
-my $current_lsn =
-  $node_master->safe_psql('postgres', "SELECT pg_current_wal_location();");
-
-# Force archiving of WAL file to make it present on master
-$node_master->safe_psql('postgres', "SELECT pg_switch_wal()");
-
-# Add some more content, it should not be present on standby
-$node_master->safe_psql('postgres',
-    "INSERT INTO tab_int VALUES (generate_series(1001,2000))");
-
-# Wait until necessary replay has been done on standby
-my $caughtup_query =
-  "SELECT '$current_lsn'::pg_lsn <= pg_last_wal_replay_location()";
-$node_standby->poll_query_until('postgres', $caughtup_query)
-  or die "Timed out while waiting for standby to catch up";
-
-my $result =
-  $node_standby->safe_psql('postgres', "SELECT count(*) FROM tab_int");
-is($result, qq(1000), 'check content from archives');
diff --git a/src/test/recovery/t/003_recovery_targets.pl b/src/test/recovery/t/003_recovery_targets.pl
deleted file mode 100644
index b7b0caa..0000000
--- a/src/test/recovery/t/003_recovery_targets.pl
+++ /dev/null
@@ -1,146 +0,0 @@
-# Test for recovery targets: name, timestamp, XID
-use strict;
-use warnings;
-use PostgresNode;
-use TestLib;
-use Test::More tests => 9;
-
-# Create and test a standby from given backup, with a certain
-# recovery target.
-sub test_recovery_standby
-{
-    my $test_name       = shift;
-    my $node_name       = shift;
-    my $node_master     = shift;
-    my $recovery_params = shift;
-    my $num_rows        = shift;
-    my $until_lsn       = shift;
-
-    my $node_standby = get_new_node($node_name);
-    $node_standby->init_from_backup($node_master, 'my_backup',
-        has_restoring => 1);
-
-    foreach my $param_item (@$recovery_params)
-    {
-        $node_standby->append_conf(
-            'recovery.conf',
-            qq($param_item
-));
-    }
-
-    $node_standby->start;
-
-    # Wait until standby has replayed enough data
-    my $caughtup_query =
-      "SELECT '$until_lsn'::pg_lsn <= pg_last_wal_replay_location()";
-    $node_standby->poll_query_until('postgres', $caughtup_query)
-      or die "Timed out while waiting for standby to catch up";
-
-    # Create some content on master and check its presence in standby
-    my $result =
-      $node_standby->safe_psql('postgres', "SELECT count(*) FROM tab_int");
-    is($result, qq($num_rows), "check standby content for $test_name");
-
-    # Stop standby node
-    $node_standby->teardown_node;
-}
-
-# Initialize master node
-my $node_master = get_new_node('master');
-$node_master->init(has_archiving => 1, allows_streaming => 1);
-
-# Start it
-$node_master->start;
-
-# Create data before taking the backup, aimed at testing
-# recovery_target = 'immediate'
-$node_master->safe_psql('postgres',
-    "CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a");
-my $lsn1 =
-  $node_master->safe_psql('postgres', "SELECT pg_current_wal_location();");
-
-# Take backup from which all operations will be run
-$node_master->backup('my_backup');
-
-# Insert some data with used as a replay reference, with a recovery
-# target TXID.
-$node_master->safe_psql('postgres',
-    "INSERT INTO tab_int VALUES (generate_series(1001,2000))");
-my $ret = $node_master->safe_psql('postgres',
-    "SELECT pg_current_wal_location(), txid_current();");
-my ($lsn2, $recovery_txid) = split /\|/, $ret;
-
-# More data, with recovery target timestamp
-$node_master->safe_psql('postgres',
-    "INSERT INTO tab_int VALUES (generate_series(2001,3000))");
-$ret = $node_master->safe_psql('postgres',
-    "SELECT pg_current_wal_location(), now();");
-my ($lsn3, $recovery_time) = split /\|/, $ret;
-
-# Even more data, this time with a recovery target name
-$node_master->safe_psql('postgres',
-    "INSERT INTO tab_int VALUES (generate_series(3001,4000))");
-my $recovery_name = "my_target";
-my $lsn4 =
-  $node_master->safe_psql('postgres', "SELECT pg_current_wal_location();");
-$node_master->safe_psql('postgres',
-    "SELECT pg_create_restore_point('$recovery_name');");
-
-# And now for a recovery target LSN
-$node_master->safe_psql('postgres',
-    "INSERT INTO tab_int VALUES (generate_series(4001,5000))");
-my $recovery_lsn = $node_master->safe_psql('postgres', "SELECT pg_current_wal_location()");
-my $lsn5 =
-  $node_master->safe_psql('postgres', "SELECT pg_current_wal_location();");
-
-$node_master->safe_psql('postgres',
-    "INSERT INTO tab_int VALUES (generate_series(5001,6000))");
-
-# Force archiving of WAL file
-$node_master->safe_psql('postgres', "SELECT pg_switch_wal()");
-
-# Test recovery targets
-my @recovery_params = ("recovery_target = 'immediate'");
-test_recovery_standby('immediate target',
-    'standby_1', $node_master, \@recovery_params, "1000", $lsn1);
-@recovery_params = ("recovery_target_xid = '$recovery_txid'");
-test_recovery_standby('XID', 'standby_2', $node_master, \@recovery_params,
-    "2000", $lsn2);
-@recovery_params = ("recovery_target_time = '$recovery_time'");
-test_recovery_standby('time', 'standby_3', $node_master, \@recovery_params,
-    "3000", $lsn3);
-@recovery_params = ("recovery_target_name = '$recovery_name'");
-test_recovery_standby('name', 'standby_4', $node_master, \@recovery_params,
-    "4000", $lsn4);
-@recovery_params = ("recovery_target_lsn = '$recovery_lsn'");
-test_recovery_standby('LSN', 'standby_5', $node_master, \@recovery_params,
-    "5000", $lsn5);
-
-# Multiple targets
-# Last entry has priority (note that an array respects the order of items
-# not hashes).
-@recovery_params = (
-    "recovery_target_name = '$recovery_name'",
-    "recovery_target_xid  = '$recovery_txid'",
-    "recovery_target_time = '$recovery_time'");
-test_recovery_standby('name + XID + time',
-    'standby_6', $node_master, \@recovery_params, "3000", $lsn3);
-@recovery_params = (
-    "recovery_target_time = '$recovery_time'",
-    "recovery_target_name = '$recovery_name'",
-    "recovery_target_xid  = '$recovery_txid'");
-test_recovery_standby('time + name + XID',
-    'standby_7', $node_master, \@recovery_params, "2000", $lsn2);
-@recovery_params = (
-    "recovery_target_xid  = '$recovery_txid'",
-    "recovery_target_time = '$recovery_time'",
-    "recovery_target_name = '$recovery_name'");
-test_recovery_standby('XID + time + name',
-    'standby_8', $node_master, \@recovery_params, "4000", $lsn4);
-@recovery_params = (
-    "recovery_target_xid  = '$recovery_txid'",
-    "recovery_target_time = '$recovery_time'",
-    "recovery_target_name = '$recovery_name'",
-    "recovery_target_lsn = '$recovery_lsn'",);
-test_recovery_standby('XID + time + name + LSN',
-    'standby_9', $node_master, \@recovery_params, "5000", $lsn5);
diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
deleted file mode 100644
index 7c6587a..0000000
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ /dev/null
@@ -1,62 +0,0 @@
-# Test for timeline switch
-# Ensure that a cascading standby is able to follow a newly-promoted standby
-# on a new timeline.
-use strict;
-use warnings;
-use File::Path qw(rmtree);
-use PostgresNode;
-use TestLib;
-use Test::More tests => 1;
-
-$ENV{PGDATABASE} = 'postgres';
-
-# Initialize master node
-my $node_master = get_new_node('master');
-$node_master->init(allows_streaming => 1);
-$node_master->start;
-
-# Take backup
-my $backup_name = 'my_backup';
-$node_master->backup($backup_name);
-
-# Create two standbys linking to it
-my $node_standby_1 = get_new_node('standby_1');
-$node_standby_1->init_from_backup($node_master, $backup_name,
-    has_streaming => 1);
-$node_standby_1->start;
-my $node_standby_2 = get_new_node('standby_2');
-$node_standby_2->init_from_backup($node_master, $backup_name,
-    has_streaming => 1);
-$node_standby_2->start;
-
-# Create some content on master
-$node_master->safe_psql('postgres',
-    "CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a");
-
-# Wait until standby has replayed enough data on standby 1
-$node_master->wait_for_catchup($node_standby_1, 'replay', $node_master->lsn('write'));
-
-# Stop and remove master, and promote standby 1, switching it to a new timeline
-$node_master->teardown_node;
-$node_standby_1->promote;
-
-# Switch standby 2 to replay from standby 1
-rmtree($node_standby_2->data_dir . '/recovery.conf');
-my $connstr_1 = $node_standby_1->connstr;
-$node_standby_2->append_conf(
-    'recovery.conf', qq(
-primary_conninfo='$connstr_1 application_name=@{[$node_standby_2->name]}'
-standby_mode=on
-recovery_target_timeline='latest'
-));
-$node_standby_2->restart;
-
-# Insert some data in standby 1 and check its presence in standby 2
-# to ensure that the timeline switch has been done.
-$node_standby_1->safe_psql('postgres',
-    "INSERT INTO tab_int VALUES (generate_series(1001,2000))");
-$node_standby_1->wait_for_catchup($node_standby_2, 'replay', $node_standby_1->lsn('write'));
-
-my $result =
-  $node_standby_2->safe_psql('postgres', "SELECT count(*) FROM tab_int");
-is($result, qq(2000), 'check content of standby 2');
diff --git a/src/test/recovery/t/005_replay_delay.pl b/src/test/recovery/t/005_replay_delay.pl
deleted file mode 100644
index cd9e8f5..0000000
--- a/src/test/recovery/t/005_replay_delay.pl
+++ /dev/null
@@ -1,69 +0,0 @@
-# Checks for recovery_min_apply_delay
-use strict;
-use warnings;
-
-use PostgresNode;
-use TestLib;
-use Test::More tests => 1;
-
-# Initialize master node
-my $node_master = get_new_node('master');
-$node_master->init(allows_streaming => 1);
-$node_master->start;
-
-# And some content
-$node_master->safe_psql('postgres',
-    "CREATE TABLE tab_int AS SELECT generate_series(1, 10) AS a");
-
-# Take backup
-my $backup_name = 'my_backup';
-$node_master->backup($backup_name);
-
-# Create streaming standby from backup
-my $node_standby = get_new_node('standby');
-my $delay        = 3;
-$node_standby->init_from_backup($node_master, $backup_name,
-    has_streaming => 1);
-$node_standby->append_conf(
-    'recovery.conf', qq(
-recovery_min_apply_delay = '${delay}s'
-));
-$node_standby->start;
-
-# Make new content on master and check its presence in standby depending
-# on the delay applied above. Before doing the insertion, get the
-# current timestamp that will be used as a comparison base. Even on slow
-# machines, this allows to have a predictable behavior when comparing the
-# delay between data insertion moment on master and replay time on standby.
-my $master_insert_time = time();
-$node_master->safe_psql('postgres',
-    "INSERT INTO tab_int VALUES (generate_series(11, 20))");
-
-# Now wait for replay to complete on standby. We're done waiting when the
-# slave has replayed up to the previously saved master LSN.
-my $until_lsn =
-  $node_master->safe_psql('postgres', "SELECT pg_current_wal_location()");
-
-my $remaining = 90;
-while ($remaining-- > 0)
-{
-
-    # Done waiting?
-    my $replay_status = $node_standby->safe_psql('postgres',
-        "SELECT (pg_last_wal_replay_location() - '$until_lsn'::pg_lsn) >= 0"
-    );
-    last if $replay_status eq 't';
-
-    # No, sleep some more.
-    my $sleep = $master_insert_time + $delay - time();
-    $sleep = 1 if $sleep < 1;
-    sleep $sleep;
-}
-
-die "Maximum number of attempts reached ($remaining remain)"
-  if $remaining < 0;
-
-# This test is successful if and only if the LSN has been applied with at least
-# the configured apply delay.
-ok(time() - $master_insert_time >= $delay,
-    "standby applies WAL only after replication delay");
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
deleted file mode 100644
index bf9b50a..0000000
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ /dev/null
@@ -1,104 +0,0 @@
-# Testing of logical decoding using SQL interface and/or pg_recvlogical
-#
-# Most logical decoding tests are in contrib/test_decoding. This module
-# is for work that doesn't fit well there, like where server restarts
-# are required.
-use strict;
-use warnings;
-use PostgresNode;
-use TestLib;
-use Test::More tests => 16;
-
-# Initialize master node
-my $node_master = get_new_node('master');
-$node_master->init(allows_streaming => 1);
-$node_master->append_conf(
-        'postgresql.conf', qq(
-wal_level = logical
-));
-$node_master->start;
-my $backup_name = 'master_backup';
-
-$node_master->safe_psql('postgres', qq[CREATE TABLE decoding_test(x integer, y text);]);
-
-$node_master->safe_psql('postgres', qq[SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding');]);
-
-$node_master->safe_psql('postgres', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,10)
s;]);
-
-# Basic decoding works
-my($result) = $node_master->safe_psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]);
-is(scalar(my @foobar = split /^/m, $result), 12, 'Decoding produced 12 rows inc BEGIN/COMMIT');
-
-# If we immediately crash the server we might lose the progress we just made
-# and replay the same changes again. But a clean shutdown should never repeat
-# the same changes when we use the SQL decoding interface.
-$node_master->restart('fast');
-
-# There are no new writes, so the result should be empty.
-$result = $node_master->safe_psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]);
-chomp($result);
-is($result, '', 'Decoding after fast restart repeats no rows');
-
-# Insert some rows and verify that we get the same results from pg_recvlogical
-# and the SQL interface.
-$node_master->safe_psql('postgres', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,4)
s;]);
-
-my $expected = q{BEGIN
-table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
-table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
-table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
-table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
-COMMIT};
-
-my $stdout_sql = $node_master->safe_psql('postgres', qq[SELECT data FROM pg_logical_slot_peek_changes('test_slot',
NULL,NULL, 'include-xids', '0', 'skip-empty-xacts', '1');]);
 
-is($stdout_sql, $expected, 'got expected output from SQL decoding session');
-
-my $endpos = $node_master->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL,
NULL)ORDER BY location DESC LIMIT 1;");
 
-print "waiting to replay $endpos\n";
-
-my $stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0',
'skip-empty-xacts'=> '1');
 
-chomp($stdout_recv);
-is($stdout_recv, $expected, 'got same expected output from pg_recvlogical decoding session');
-
-$stdout_recv = $node_master->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 10, 'include-xids' => '0',
'skip-empty-xacts'=> '1');
 
-chomp($stdout_recv);
-is($stdout_recv, '', 'pg_recvlogical acknowledged changes, nothing pending on slot');
-
-$node_master->safe_psql('postgres', 'CREATE DATABASE otherdb');
-
-is($node_master->psql('otherdb', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY
locationDESC LIMIT 1;"), 3,
 
-    'replaying logical slot from another database fails');
-
-$node_master->safe_psql('otherdb', qq[SELECT pg_create_logical_replication_slot('otherdb_slot', 'test_decoding');]);
-
-# make sure you can't drop a slot while active
-my $pg_recvlogical = IPC::Run::start(['pg_recvlogical', '-d', $node_master->connstr('otherdb'), '-S', 'otherdb_slot',
'-f','-', '--start']);
 
-$node_master->poll_query_until('otherdb', "SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name =
'otherdb_slot'AND active_pid IS NOT NULL)");
 
-is($node_master->psql('postgres', 'DROP DATABASE otherdb'), 3,
-    'dropping a DB with inactive logical slots fails');
-$pg_recvlogical->kill_kill;
-is($node_master->slot('otherdb_slot')->{'slot_name'}, undef,
-    'logical slot still exists');
-
-$node_master->poll_query_until('otherdb', "SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name =
'otherdb_slot'AND active_pid IS NULL)");
 
-is($node_master->psql('postgres', 'DROP DATABASE otherdb'), 0,
-    'dropping a DB with inactive logical slots succeeds');
-is($node_master->slot('otherdb_slot')->{'slot_name'}, undef,
-    'logical slot was actually dropped with DB');
-
-# Restarting a node with wal_level = logical that has existing
-# slots must succeed, but decoding from those slots must fail.
-$node_master->safe_psql('postgres', 'ALTER SYSTEM SET wal_level = replica');
-is($node_master->safe_psql('postgres', 'SHOW wal_level'), 'logical', 'wal_level is still logical before restart');
-$node_master->restart;
-is($node_master->safe_psql('postgres', 'SHOW wal_level'), 'replica', 'wal_level is replica');
-isnt($node_master->slot('test_slot')->{'catalog_xmin'}, '0',
-    'restored slot catalog_xmin is nonzero');
-is($node_master->psql('postgres', qq[SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);]), 3,
-    'reading from slot with wal_level < logical fails');
-is($node_master->psql('postgres', q[SELECT pg_drop_replication_slot('test_slot')]), 0,
-    'can drop logical slot while wal_level = replica');
-is($node_master->slot('test_slot')->{'catalog_xmin'}, '', 'slot was dropped');
-
-# done with the node
-$node_master->stop;
diff --git a/src/test/recovery/t/007_sync_rep.pl b/src/test/recovery/t/007_sync_rep.pl
deleted file mode 100644
index e11b428..0000000
--- a/src/test/recovery/t/007_sync_rep.pl
+++ /dev/null
@@ -1,205 +0,0 @@
-# Minimal test testing synchronous replication sync_state transition
-use strict;
-use warnings;
-use PostgresNode;
-use TestLib;
-use Test::More tests => 11;
-
-# Query checking sync_priority and sync_state of each standby
-my $check_sql =
-"SELECT application_name, sync_priority, sync_state FROM pg_stat_replication ORDER BY application_name;";
-
-# Check that sync_state of each standby is expected.
-# If $setting is given, synchronous_standby_names is set to it and
-# the configuration file is reloaded before the test.
-sub test_sync_state
-{
-    my ($self, $expected, $msg, $setting) = @_;
-
-    if (defined($setting))
-    {
-        $self->psql('postgres',
-            "ALTER SYSTEM SET synchronous_standby_names = '$setting';");
-        $self->reload;
-    }
-
-    my $timeout_max = 30;
-    my $timeout     = 0;
-    my $result;
-
-    # A reload may take some time to take effect on busy machines,
-    # hence use a loop with a timeout to give some room for the test
-    # to pass.
-    while ($timeout < $timeout_max)
-    {
-        $result = $self->safe_psql('postgres', $check_sql);
-
-        last if ($result eq $expected);
-
-        $timeout++;
-        sleep 1;
-    }
-
-    is($result, $expected, $msg);
-}
-
-# Initialize master node
-my $node_master = get_new_node('master');
-$node_master->init(allows_streaming => 1);
-$node_master->start;
-my $backup_name = 'master_backup';
-
-# Take backup
-$node_master->backup($backup_name);
-
-# Create standby1 linking to master
-my $node_standby_1 = get_new_node('standby1');
-$node_standby_1->init_from_backup($node_master, $backup_name,
-    has_streaming => 1);
-$node_standby_1->start;
-
-# Create standby2 linking to master
-my $node_standby_2 = get_new_node('standby2');
-$node_standby_2->init_from_backup($node_master, $backup_name,
-    has_streaming => 1);
-$node_standby_2->start;
-
-# Create standby3 linking to master
-my $node_standby_3 = get_new_node('standby3');
-$node_standby_3->init_from_backup($node_master, $backup_name,
-    has_streaming => 1);
-$node_standby_3->start;
-
-# Check that sync_state is determined correctly when
-# synchronous_standby_names is specified in old syntax.
-test_sync_state(
-    $node_master, qq(standby1|1|sync
-standby2|2|potential
-standby3|0|async),
-    'old syntax of synchronous_standby_names',
-    'standby1,standby2');
-
-# Check that all the standbys are considered as either sync or
-# potential when * is specified in synchronous_standby_names.
-# Note that standby1 is chosen as sync standby because
-# it's stored in the head of WalSnd array which manages
-# all the standbys though they have the same priority.
-test_sync_state(
-    $node_master, qq(standby1|1|sync
-standby2|1|potential
-standby3|1|potential),
-    'asterisk in synchronous_standby_names',
-    '*');
-
-# Stop and start standbys to rearrange the order of standbys
-# in WalSnd array. Now, if standbys have the same priority,
-# standby2 is selected preferentially and standby3 is next.
-$node_standby_1->stop;
-$node_standby_2->stop;
-$node_standby_3->stop;
-
-$node_standby_2->start;
-$node_standby_3->start;
-
-# Specify 2 as the number of sync standbys.
-# Check that two standbys are in 'sync' state.
-test_sync_state(
-    $node_master, qq(standby2|2|sync
-standby3|3|sync),
-    '2 synchronous standbys',
-    '2(standby1,standby2,standby3)');
-
-# Start standby1
-$node_standby_1->start;
-
-# Create standby4 linking to master
-my $node_standby_4 = get_new_node('standby4');
-$node_standby_4->init_from_backup($node_master, $backup_name,
-    has_streaming => 1);
-$node_standby_4->start;
-
-# Check that standby1 and standby2 whose names appear earlier in
-# synchronous_standby_names are considered as sync. Also check that
-# standby3 appearing later represents potential, and standby4 is
-# in 'async' state because it's not in the list.
-test_sync_state(
-    $node_master, qq(standby1|1|sync
-standby2|2|sync
-standby3|3|potential
-standby4|0|async),
-    '2 sync, 1 potential, and 1 async');
-
-# Check that sync_state of each standby is determined correctly
-# when num_sync exceeds the number of names of potential sync standbys
-# specified in synchronous_standby_names.
-test_sync_state(
-    $node_master, qq(standby1|0|async
-standby2|4|sync
-standby3|3|sync
-standby4|1|sync),
-    'num_sync exceeds the num of potential sync standbys',
-    '6(standby4,standby0,standby3,standby2)');
-
-# The setting that * comes before another standby name is acceptable
-# but does not make sense in most cases. Check that sync_state is
-# chosen properly even in case of that setting.
-# The priority of standby2 should be 2 because it matches * first.
-test_sync_state(
-    $node_master, qq(standby1|1|sync
-standby2|2|sync
-standby3|2|potential
-standby4|2|potential),
-    'asterisk comes before another standby name',
-    '2(standby1,*,standby2)');
-
-# Check that the setting of '2(*)' chooses standby2 and standby3 that are stored
-# earlier in WalSnd array as sync standbys.
-test_sync_state(
-    $node_master, qq(standby1|1|potential
-standby2|1|sync
-standby3|1|sync
-standby4|1|potential),
-    'multiple standbys having the same priority are chosen as sync',
-    '2(*)');
-
-# Stop Standby3 which is considered in 'sync' state.
-$node_standby_3->stop;
-
-# Check that the state of standby1 stored earlier in WalSnd array than
-# standby4 is transited from potential to sync.
-test_sync_state(
-    $node_master, qq(standby1|1|sync
-standby2|1|sync
-standby4|1|potential),
-    'potential standby found earlier in array is promoted to sync');
-
-# Check that standby1 and standby2 are chosen as sync standbys
-# based on their priorities.
-test_sync_state(
-$node_master, qq(standby1|1|sync
-standby2|2|sync
-standby4|0|async),
-'priority-based sync replication specified by FIRST keyword',
-'FIRST 2(standby1, standby2)');
-
-# Check that all the listed standbys are considered as candidates
-# for sync standbys in a quorum-based sync replication.
-test_sync_state(
-$node_master, qq(standby1|1|quorum
-standby2|2|quorum
-standby4|0|async),
-'2 quorum and 1 async',
-'ANY 2(standby1, standby2)');
-
-# Start Standby3 which will be considered in 'quorum' state.
-$node_standby_3->start;
-
-# Check that the setting of 'ANY 2(*)' chooses all standbys as
-# candidates for quorum sync standbys.
-test_sync_state(
-$node_master, qq(standby1|1|quorum
-standby2|1|quorum
-standby3|1|quorum
-standby4|1|quorum),
-'all standbys are considered as candidates for quorum sync standbys',
-'ANY 2(*)');
diff --git a/src/test/recovery/t/008_fsm_truncation.pl b/src/test/recovery/t/008_fsm_truncation.pl
deleted file mode 100644
index 8aa8a4f..0000000
--- a/src/test/recovery/t/008_fsm_truncation.pl
+++ /dev/null
@@ -1,92 +0,0 @@
-# Test WAL replay of FSM changes.
-#
-# FSM changes don't normally need to be WAL-logged, except for truncation.
-# The FSM mustn't return a page that doesn't exist (anymore).
-use strict;
-use warnings;
-
-use PostgresNode;
-use TestLib;
-use Test::More tests => 1;
-
-my $node_master = get_new_node('master');
-$node_master->init(allows_streaming => 1);
-
-$node_master->append_conf('postgresql.conf', qq{
-fsync = on
-wal_log_hints = on
-max_prepared_transactions = 5
-autovacuum = off
-});
-
-# Create a master node and its standby, initializing both with some data
-# at the same time.
-$node_master->start;
-
-$node_master->backup('master_backup');
-my $node_standby = get_new_node('standby');
-$node_standby->init_from_backup($node_master, 'master_backup',
-    has_streaming => 1);
-$node_standby->start;
-
-$node_master->psql('postgres', qq{
-create table testtab (a int, b char(100));
-insert into testtab select generate_series(1,1000), 'foo';
-insert into testtab select generate_series(1,1000), 'foo';
-delete from testtab where ctid > '(8,0)';
-});
-
-# Take a lock on the table to prevent following vacuum from truncating it
-$node_master->psql('postgres', qq{
-begin;
-lock table testtab in row share mode;
-prepare transaction 'p1';
-});
-
-# Vacuum, update FSM without truncation
-$node_master->psql('postgres', 'vacuum verbose testtab');
-
-# Force a checkpoint
-$node_master->psql('postgres', 'checkpoint');
-
-# Now do some more insert/deletes, another vacuum to ensure full-page writes
-# are done
-$node_master->psql('postgres', qq{
-insert into testtab select generate_series(1,1000), 'foo';
-delete from testtab where ctid > '(8,0)';
-vacuum verbose testtab;
-});
-
-# Ensure all buffers are now clean on the standby
-$node_standby->psql('postgres', 'checkpoint');
-
-# Release the lock, vacuum again which should lead to truncation
-$node_master->psql('postgres', qq{
-rollback prepared 'p1';
-vacuum verbose testtab;
-});
-
-$node_master->psql('postgres', 'checkpoint');
-my $until_lsn =
-    $node_master->safe_psql('postgres', "SELECT pg_current_wal_location();");
-
-# Wait long enough for standby to receive and apply all WAL
-my $caughtup_query =
-    "SELECT '$until_lsn'::pg_lsn <= pg_last_wal_replay_location()";
-$node_standby->poll_query_until('postgres', $caughtup_query)
-    or die "Timed out while waiting for standby to catch up";
-
-# Promote the standby
-$node_standby->promote;
-$node_standby->poll_query_until('postgres',
-    "SELECT NOT pg_is_in_recovery()")
-  or die "Timed out while waiting for promotion of standby";
-$node_standby->psql('postgres', 'checkpoint');
-
-# Restart to discard in-memory copy of FSM
-$node_standby->restart;
-
-# Insert should work on standby
-is($node_standby->psql('postgres',
-   qq{insert into testtab select generate_series(1,1000), 'foo';}),
-   0, 'INSERT succeeds with truncated relation FSM');
diff --git a/src/test/recovery/t/009_twophase.pl b/src/test/recovery/t/009_twophase.pl
deleted file mode 100644
index be7f00b..0000000
--- a/src/test/recovery/t/009_twophase.pl
+++ /dev/null
@@ -1,322 +0,0 @@
-# Tests dedicated to two-phase commit in recovery
-use strict;
-use warnings;
-
-use PostgresNode;
-use TestLib;
-use Test::More tests => 13;
-
-# Setup master node
-my $node_master = get_new_node("master");
-$node_master->init(allows_streaming => 1);
-$node_master->append_conf('postgresql.conf', qq(
-    max_prepared_transactions = 10
-    log_checkpoints = true
-));
-$node_master->start;
-$node_master->backup('master_backup');
-$node_master->psql('postgres', "CREATE TABLE t_009_tbl (id int)");
-
-# Setup slave node
-my $node_slave = get_new_node('slave');
-$node_slave->init_from_backup($node_master, 'master_backup', has_streaming => 1);
-$node_slave->start;
-
-# Switch to synchronous replication
-$node_master->append_conf('postgresql.conf', qq(
-    synchronous_standby_names = '*'
-));
-$node_master->psql('postgres', "SELECT pg_reload_conf()");
-
-my $psql_out = '';
-my $psql_rc = '';
-
-###############################################################################
-# Check that we can commit and abort transaction after soft restart.
-# Here checkpoint happens before shutdown and no WAL replay will occur at next
-# startup. In this case postgres re-creates shared-memory state from twophase
-# files.
-###############################################################################
-
-$node_master->psql('postgres', "
-    BEGIN;
-    INSERT INTO t_009_tbl VALUES (42);
-    SAVEPOINT s1;
-    INSERT INTO t_009_tbl VALUES (43);
-    PREPARE TRANSACTION 'xact_009_1';
-    BEGIN;
-    INSERT INTO t_009_tbl VALUES (142);
-    SAVEPOINT s1;
-    INSERT INTO t_009_tbl VALUES (143);
-    PREPARE TRANSACTION 'xact_009_2';");
-$node_master->stop;
-$node_master->start;
-
-$psql_rc = $node_master->psql('postgres', "COMMIT PREPARED 'xact_009_1'");
-is($psql_rc, '0', 'Commit prepared transaction after restart');
-
-$psql_rc = $node_master->psql('postgres', "ROLLBACK PREPARED 'xact_009_2'");
-is($psql_rc, '0', 'Rollback prepared transaction after restart');
-
-###############################################################################
-# Check that we can commit and abort after a hard restart.
-# At next startup, WAL replay will re-create shared memory state for prepared
-# transaction using dedicated WAL records.
-###############################################################################
-
-$node_master->psql('postgres', "
-    CHECKPOINT;
-    BEGIN;
-    INSERT INTO t_009_tbl VALUES (42);
-    SAVEPOINT s1;
-    INSERT INTO t_009_tbl VALUES (43);
-    PREPARE TRANSACTION 'xact_009_1';
-    BEGIN;
-    INSERT INTO t_009_tbl VALUES (142);
-    SAVEPOINT s1;
-    INSERT INTO t_009_tbl VALUES (143);
-    PREPARE TRANSACTION 'xact_009_2';");
-$node_master->teardown_node;
-$node_master->start;
-
-$psql_rc = $node_master->psql('postgres', "COMMIT PREPARED 'xact_009_1'");
-is($psql_rc, '0', 'Commit prepared transaction after teardown');
-
-$psql_rc = $node_master->psql('postgres', "ROLLBACK PREPARED 'xact_009_2'");
-is($psql_rc, '0', 'Rollback prepared transaction after teardown');
-
-###############################################################################
-# Check that WAL replay can handle several transactions with same GID name.
-###############################################################################
-
-$node_master->psql('postgres', "
-    CHECKPOINT;
-    BEGIN;
-    INSERT INTO t_009_tbl VALUES (42);
-    SAVEPOINT s1;
-    INSERT INTO t_009_tbl VALUES (43);
-    PREPARE TRANSACTION 'xact_009_1';
-    COMMIT PREPARED 'xact_009_1';
-    BEGIN;
-    INSERT INTO t_009_tbl VALUES (42);
-    SAVEPOINT s1;
-    INSERT INTO t_009_tbl VALUES (43);
-    PREPARE TRANSACTION 'xact_009_1';");
-$node_master->teardown_node;
-$node_master->start;
-
-$psql_rc = $node_master->psql('postgres', "COMMIT PREPARED 'xact_009_1'");
-is($psql_rc, '0', 'Replay several transactions with same GID');
-
-###############################################################################
-# Check that WAL replay cleans up its shared memory state and releases locks
-# while replaying transaction commits.
-###############################################################################
-
-$node_master->psql('postgres', "
-    BEGIN;
-    INSERT INTO t_009_tbl VALUES (42);
-    SAVEPOINT s1;
-    INSERT INTO t_009_tbl VALUES (43);
-    PREPARE TRANSACTION 'xact_009_1';
-    COMMIT PREPARED 'xact_009_1';");
-$node_master->teardown_node;
-$node_master->start;
-$psql_rc = $node_master->psql('postgres', "
-    BEGIN;
-    INSERT INTO t_009_tbl VALUES (42);
-    SAVEPOINT s1;
-    INSERT INTO t_009_tbl VALUES (43);
-    -- This prepare can fail due to conflicting GID or locks conflicts if
-    -- replay did not fully cleanup its state on previous commit.
-    PREPARE TRANSACTION 'xact_009_1';");
-is($psql_rc, '0', "Cleanup of shared memory state for 2PC commit");
-
-$node_master->psql('postgres', "COMMIT PREPARED 'xact_009_1'");
-
-###############################################################################
-# Check that WAL replay will cleanup its shared memory state on running slave.
-###############################################################################
-
-$node_master->psql('postgres', "
-    BEGIN;
-    INSERT INTO t_009_tbl VALUES (42);
-    SAVEPOINT s1;
-    INSERT INTO t_009_tbl VALUES (43);
-    PREPARE TRANSACTION 'xact_009_1';
-    COMMIT PREPARED 'xact_009_1';");
-$node_slave->psql('postgres', "SELECT count(*) FROM pg_prepared_xacts",
-      stdout => \$psql_out);
-is($psql_out, '0',
-   "Cleanup of shared memory state on running standby without checkpoint");
-
-###############################################################################
-# Same as in previous case, but let's force checkpoint on slave between
-# prepare and commit to use on-disk twophase files.
-###############################################################################
-
-$node_master->psql('postgres', "
-    BEGIN;
-    INSERT INTO t_009_tbl VALUES (42);
-    SAVEPOINT s1;
-    INSERT INTO t_009_tbl VALUES (43);
-    PREPARE TRANSACTION 'xact_009_1';");
-$node_slave->psql('postgres', "CHECKPOINT");
-$node_master->psql('postgres', "COMMIT PREPARED 'xact_009_1'");
-$node_slave->psql('postgres', "SELECT count(*) FROM pg_prepared_xacts",
-      stdout => \$psql_out);
-is($psql_out, '0',
-   "Cleanup of shared memory state on running standby after checkpoint");
-
-###############################################################################
-# Check that prepared transactions can be committed on promoted slave.
-###############################################################################
-
-$node_master->psql('postgres', "
-    BEGIN;
-    INSERT INTO t_009_tbl VALUES (42);
-    SAVEPOINT s1;
-    INSERT INTO t_009_tbl VALUES (43);
-    PREPARE TRANSACTION 'xact_009_1';");
-$node_master->teardown_node;
-$node_slave->promote;
-$node_slave->poll_query_until('postgres',
-    "SELECT NOT pg_is_in_recovery()")
-  or die "Timed out while waiting for promotion of standby";
-
-$psql_rc = $node_slave->psql('postgres', "COMMIT PREPARED 'xact_009_1'");
-is($psql_rc, '0', "Restore of prepared transaction on promoted slave");
-
-# change roles
-($node_master, $node_slave) = ($node_slave, $node_master);
-$node_slave->enable_streaming($node_master);
-$node_slave->append_conf('recovery.conf', qq(
-recovery_target_timeline='latest'
-));
-$node_slave->start;
-
-###############################################################################
-# Check that prepared transactions are replayed after soft restart of standby
-# while master is down. Since standby knows that master is down it uses a
-# different code path on startup to ensure that the status of transactions is
-# consistent.
-###############################################################################
-
-$node_master->psql('postgres', "
-    BEGIN;
-    INSERT INTO t_009_tbl VALUES (42);
-    SAVEPOINT s1;
-    INSERT INTO t_009_tbl VALUES (43);
-    PREPARE TRANSACTION 'xact_009_1';");
-$node_master->stop;
-$node_slave->restart;
-$node_slave->promote;
-$node_slave->poll_query_until('postgres',
-    "SELECT NOT pg_is_in_recovery()")
-  or die "Timed out while waiting for promotion of standby";
-
-$node_slave->psql('postgres', "SELECT count(*) FROM pg_prepared_xacts",
-      stdout => \$psql_out);
-is($psql_out, '1',
-   "Restore prepared transactions from files with master down");
-
-# restore state
-($node_master, $node_slave) = ($node_slave, $node_master);
-$node_slave->enable_streaming($node_master);
-$node_slave->append_conf('recovery.conf', qq(
-recovery_target_timeline='latest'
-));
-$node_slave->start;
-$node_master->psql('postgres', "COMMIT PREPARED 'xact_009_1'");
-
-###############################################################################
-# Check that prepared transactions are correctly replayed after slave hard
-# restart while master is down.
-###############################################################################
-
-$node_master->psql('postgres', "
-    BEGIN;
-    INSERT INTO t_009_tbl VALUES (242);
-    SAVEPOINT s1;
-    INSERT INTO t_009_tbl VALUES (243);
-    PREPARE TRANSACTION 'xact_009_1';
-    ");
-$node_master->stop;
-$node_slave->teardown_node;
-$node_slave->start;
-$node_slave->promote;
-$node_slave->poll_query_until('postgres',
-    "SELECT NOT pg_is_in_recovery()")
-  or die "Timed out while waiting for promotion of standby";
-
-$node_slave->psql('postgres', "SELECT count(*) FROM pg_prepared_xacts",
-      stdout => \$psql_out);
-is($psql_out, '1',
-   "Restore prepared transactions from records with master down");
-
-# restore state
-($node_master, $node_slave) = ($node_slave, $node_master);
-$node_slave->enable_streaming($node_master);
-$node_slave->append_conf('recovery.conf', qq(
-recovery_target_timeline='latest'
-));
-$node_slave->start;
-$node_master->psql('postgres', "COMMIT PREPARED 'xact_009_1'");
-
-
-###############################################################################
-# Check for a lock conflict between prepared transaction with DDL inside and replay of
-# XLOG_STANDBY_LOCK wal record.
-###############################################################################
-
-$node_master->psql('postgres', "
-    BEGIN;
-    CREATE TABLE t_009_tbl2 (id int);
-    SAVEPOINT s1;
-    INSERT INTO t_009_tbl2 VALUES (42);
-    PREPARE TRANSACTION 'xact_009_1';
-    -- checkpoint will issue XLOG_STANDBY_LOCK that can conflict with lock
-    -- held by 'create table' statement
-    CHECKPOINT;
-    COMMIT PREPARED 'xact_009_1';");
-
-$node_slave->psql('postgres', "SELECT count(*) FROM pg_prepared_xacts",
-      stdout => \$psql_out);
-is($psql_out, '0', "Replay prepared transaction with DDL");
-
-
-###############################################################################
-# Check that replay will correctly set SUBTRANS and properly advance nextXid
-# so that it won't conflict with savepoint xids.
-###############################################################################
-
-$node_master->psql('postgres', "
-    BEGIN;
-    DELETE FROM t_009_tbl;
-    INSERT INTO t_009_tbl VALUES (43);
-    SAVEPOINT s1;
-    INSERT INTO t_009_tbl VALUES (43);
-    SAVEPOINT s2;
-    INSERT INTO t_009_tbl VALUES (43);
-    SAVEPOINT s3;
-    INSERT INTO t_009_tbl VALUES (43);
-    SAVEPOINT s4;
-    INSERT INTO t_009_tbl VALUES (43);
-    SAVEPOINT s5;
-    INSERT INTO t_009_tbl VALUES (43);
-    PREPARE TRANSACTION 'xact_009_1';
-    CHECKPOINT;");
-
-$node_master->stop;
-$node_master->start;
-$node_master->psql('postgres', "
-    -- here we can get xid of previous savepoint if nextXid
-    -- wasn't properly advanced
-    BEGIN;
-    INSERT INTO t_009_tbl VALUES (142);
-    ROLLBACK;
-    COMMIT PREPARED 'xact_009_1';");
-
-$node_master->psql('postgres', "SELECT count(*) FROM t_009_tbl",
-      stdout => \$psql_out);
-is($psql_out, '6', "Check nextXid handling for prepared subtransactions");
diff --git a/src/test/recovery/t/010_logical_decoding_timelines.pl
b/src/test/recovery/t/010_logical_decoding_timelines.pl
deleted file mode 100644
index cdddb4d..0000000
--- a/src/test/recovery/t/010_logical_decoding_timelines.pl
+++ /dev/null
@@ -1,184 +0,0 @@
-# Demonstrate that logical can follow timeline switches.
-#
-# Logical replication slots can follow timeline switches but it's
-# normally not possible to have a logical slot on a replica where
-# promotion and a timeline switch can occur. The only ways
-# we can create that circumstance are:
-#
-# * By doing a filesystem-level copy of the DB, since pg_basebackup
-#   excludes pg_replslot but we can copy it directly; or
-#
-# * by creating a slot directly at the C level on the replica and
-#   advancing it as we go using the low level APIs. It can't be done
-#   from SQL since logical decoding isn't allowed on replicas.
-#
-# This module uses the first approach to show that timeline following
-# on a logical slot works.
-#
-# (For convenience, it also tests some recovery-related operations
-# on logical slots).
-#
-use strict;
-use warnings;
-
-use PostgresNode;
-use TestLib;
-use Test::More tests => 13;
-use RecursiveCopy;
-use File::Copy;
-use IPC::Run ();
-use Scalar::Util qw(blessed);
-
-my ($stdout, $stderr, $ret);
-
-# Initialize master node
-my $node_master = get_new_node('master');
-$node_master->init(allows_streaming => 1, has_archiving => 1);
-$node_master->append_conf('postgresql.conf', q[
-wal_level = 'logical'
-max_replication_slots = 3
-max_wal_senders = 2
-log_min_messages = 'debug2'
-hot_standby_feedback = on
-wal_receiver_status_interval = 1
-]);
-$node_master->dump_info;
-$node_master->start;
-
-note "testing logical timeline following with a filesystem-level copy";
-
-$node_master->safe_psql('postgres',
-"SELECT pg_create_logical_replication_slot('before_basebackup', 'test_decoding');"
-);
-$node_master->safe_psql('postgres', "CREATE TABLE decoding(blah text);");
-$node_master->safe_psql('postgres',
-    "INSERT INTO decoding(blah) VALUES ('beforebb');");
-
-# We also want to verify that DROP DATABASE on a standby with a logical
-# slot works. This isn't strictly related to timeline following, but
-# the only way to get a logical slot on a standby right now is to use
-# the same physical copy trick, so:
-$node_master->safe_psql('postgres', 'CREATE DATABASE dropme;');
-$node_master->safe_psql('dropme',
-"SELECT pg_create_logical_replication_slot('dropme_slot', 'test_decoding');"
-);
-
-$node_master->safe_psql('postgres', 'CHECKPOINT;');
-
-my $backup_name = 'b1';
-$node_master->backup_fs_hot($backup_name);
-
-$node_master->safe_psql('postgres',
-    q[SELECT pg_create_physical_replication_slot('phys_slot');]);
-
-my $node_replica = get_new_node('replica');
-$node_replica->init_from_backup(
-    $node_master, $backup_name,
-    has_streaming => 1,
-    has_restoring => 1);
-$node_replica->append_conf(
-    'recovery.conf', q[primary_slot_name = 'phys_slot']);
-
-$node_replica->start;
-
-# If we drop 'dropme' on the master, the standby should drop the
-# db and associated slot.
-is($node_master->psql('postgres', 'DROP DATABASE dropme'), 0,
-    'dropped DB with logical slot OK on master');
-$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('insert'));
-is($node_replica->safe_psql('postgres', q[SELECT 1 FROM pg_database WHERE datname = 'dropme']), '',
-    'dropped DB dropme on standby');
-is($node_master->slot('dropme_slot')->{'slot_name'}, undef,
-    'logical slot was actually dropped on standby');
-
-# Back to testing failover...
-$node_master->safe_psql('postgres',
-"SELECT pg_create_logical_replication_slot('after_basebackup', 'test_decoding');"
-);
-$node_master->safe_psql('postgres',
-    "INSERT INTO decoding(blah) VALUES ('afterbb');");
-$node_master->safe_psql('postgres', 'CHECKPOINT;');
-
-# Verify that only the before base_backup slot is on the replica
-$stdout = $node_replica->safe_psql('postgres',
-    'SELECT slot_name FROM pg_replication_slots ORDER BY slot_name');
-is($stdout, 'before_basebackup',
-    'Expected to find only slot before_basebackup on replica');
-
-# Examine the physical slot the replica uses to stream changes
-# from the master to make sure its hot_standby_feedback
-# has locked in a catalog_xmin on the physical slot, and that
-# any xmin is < the catalog_xmin
-$node_master->poll_query_until('postgres', q[
-    SELECT catalog_xmin IS NOT NULL
-    FROM pg_replication_slots
-    WHERE slot_name = 'phys_slot'
-    ]);
-my $phys_slot = $node_master->slot('phys_slot');
-isnt($phys_slot->{'xmin'}, '',
-    'xmin assigned on physical slot of master');
-isnt($phys_slot->{'catalog_xmin'}, '',
-    'catalog_xmin assigned on physical slot of master');
-# Ignore wrap-around here, we're on a new cluster:
-cmp_ok($phys_slot->{'xmin'}, '>=', $phys_slot->{'catalog_xmin'},
-       'xmin on physical slot must not be lower than catalog_xmin');
-
-$node_master->safe_psql('postgres', 'CHECKPOINT');
-
-# Boom, crash
-$node_master->stop('immediate');
-
-$node_replica->promote;
-print "waiting for replica to come up\n";
-$node_replica->poll_query_until('postgres',
-    "SELECT NOT pg_is_in_recovery();");
-
-$node_replica->safe_psql('postgres',
-    "INSERT INTO decoding(blah) VALUES ('after failover');");
-
-# Shouldn't be able to read from slot created after base backup
-($ret, $stdout, $stderr) = $node_replica->psql('postgres',
-"SELECT data FROM pg_logical_slot_peek_changes('after_basebackup', NULL, NULL, 'include-xids', '0',
'skip-empty-xacts','1');"
 
-);
-is($ret, 3, 'replaying from after_basebackup slot fails');
-like(
-    $stderr,
-    qr/replication slot "after_basebackup" does not exist/,
-    'after_basebackup slot missing');
-
-# Should be able to read from slot created before base backup
-($ret, $stdout, $stderr) = $node_replica->psql(
-    'postgres',
-"SELECT data FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL, 'include-xids', '0',
'skip-empty-xacts','1');",
 
-    timeout => 30);
-is($ret, 0, 'replay from slot before_basebackup succeeds');
-
-my $final_expected_output_bb = q(BEGIN
-table public.decoding: INSERT: blah[text]:'beforebb'
-COMMIT
-BEGIN
-table public.decoding: INSERT: blah[text]:'afterbb'
-COMMIT
-BEGIN
-table public.decoding: INSERT: blah[text]:'after failover'
-COMMIT);
-is($stdout, $final_expected_output_bb, 'decoded expected data from slot before_basebackup');
-is($stderr, '', 'replay from slot before_basebackup produces no stderr');
-
-# So far we've peeked the slots, so when we fetch the same info over
-# pg_recvlogical we should get complete results. First, find out the commit lsn
-# of the last transaction. There's no max(pg_lsn), so:
-
-my $endpos = $node_replica->safe_psql('postgres', "SELECT location FROM
pg_logical_slot_peek_changes('before_basebackup',NULL, NULL) ORDER BY location DESC LIMIT 1;");
 
-
-# now use the walsender protocol to peek the slot changes and make sure we see
-# the same results.
-
-$stdout = $node_replica->pg_recvlogical_upto('postgres', 'before_basebackup',
-    $endpos, 30, 'include-xids' => '0', 'skip-empty-xacts' => '1');
-
-# walsender likes to add a newline
-chomp($stdout);
-is($stdout, $final_expected_output_bb, 'got same output from walsender via pg_recvlogical on before_basebackup');
-
-$node_replica->teardown_node();
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
deleted file mode 100644
index 3c3718e..0000000
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ /dev/null
@@ -1,46 +0,0 @@
-#
-# Tests relating to PostgreSQL crash recovery and redo
-#
-use strict;
-use warnings;
-use PostgresNode;
-use TestLib;
-use Test::More tests => 3;
-
-my $node = get_new_node('master');
-$node->init(allows_streaming => 1);
-$node->start;
-
-my ($stdin, $stdout, $stderr) = ('', '', '');
-
-# Ensure that txid_status reports 'aborted' for xacts
-# that were in-progress during crash. To do that, we need
-# an xact to be in-progress when we crash and we need to know
-# its xid.
-my $tx = IPC::Run::start(
-    ['psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d', $node->connstr('postgres')],
-    '<', \$stdin, '>', \$stdout, '2>', \$stderr);
-$stdin .= q[
-BEGIN;
-CREATE TABLE mine(x integer);
-SELECT txid_current();
-];
-$tx->pump until $stdout =~ /[[:digit:]]+[\r\n]$/;
-
-# Status should be in-progress
-my $xid = $stdout;
-chomp($xid);
-
-is($node->safe_psql('postgres', qq[SELECT txid_status('$xid');]), 'in progress', 'own xid is in-progres');
-
-# Crash and restart the postmaster
-$node->stop('immediate');
-$node->start;
-
-# Make sure we really got a new xid
-cmp_ok($node->safe_psql('postgres', 'SELECT txid_current()'), '>', $xid,
-    'new xid after restart is greater');
-# and make sure we show the in-progress xact as aborted
-is($node->safe_psql('postgres', qq[SELECT txid_status('$xid');]), 'aborted', 'xid is aborted after crash');
-
-$tx->kill_kill;

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

11 апреля 2017 г., 14:38:12

Sorry, what I have just sent was broken.

At Tue, 11 Apr 2017 17:33:41 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170411.173341.257028732.horiguchi.kyotaro@lab.ntt.co.jp>
> At Tue, 11 Apr 2017 09:56:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in<20170411.095606.245908357.horiguchi.kyotaro@lab.ntt.co.jp>
 
> > Hello, thank you for looking this.
> > 
> > At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <27309.1491611915@sss.pgh.pa.us>
> > > Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > > > Interesting.  I wonder if it's possible that a relcache invalidation
> > > > would cause these values to get lost for some reason, because that would
> > > > be dangerous.
> > > 
> > > > I suppose the rationale is that this shouldn't happen because any
> > > > operation that does things this way must hold an exclusive lock on the
> > > > relation.  But that doesn't guarantee that the relcache entry is
> > > > completely stable,
> > > 
> > > It ABSOLUTELY is not safe.  Relcache flushes can happen regardless of
> > > how strong a lock you hold.
> > > 
> > >             regards, tom lane
> > 
> > Ugh. Yes, relcache invalidation happens anytime and it resets the
> > added values. pg_stat_info deceived me that it can store
> > transient values. But I  came up with another thought.
> > 
> > The reason I proposed it was I thought that hash_search for every
> > buffer is not good. Instead, like pg_stat_info, we can link the
> 
> buffer => buffer modification
> 
> > pending-sync hash entry to Relation. This greately reduces the
> > frequency of hash-searching.
> > 
> > I'll post new patch in this way soon.
> 
> Here it is.

It contained tariling space and missing test script.  This is the
correct patch.

> - Relation has new members no_pending_sync and pending_sync that
>   works as instant cache of an entry in pendingSync hash.
> 
> - Commit-time synchronizing is restored as Michael's patch.
> 
> - If relfilenode is replaced, pending_sync for the old node is
>   removed. Anyway this is ignored on abort and meaningless on
>   commit.
> 
> - TAP test is renamed to 012 since some new files have been added.
> 
> Accessing pending sync hash occured on every calling of
> HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> accessing relations has pending sync.  Almost of them are
> eliminated as the result.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0c3e2b0..23a6d56 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@ *      the POSTGRES heap access method used for all POSTGRES *      relations. *
+ * WAL CONSIDERATIONS
+ *      All heap operations are normally WAL-logged. but there are a few
+ *      exceptions. Temporary and unlogged relations never need to be
+ *      WAL-logged, but we can also skip WAL-logging for a table that was
+ *      created in the same transaction, if we don't need WAL for PITR or
+ *      WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *      the file to disk at COMMIT instead.
+ *
+ *      The same-relation optimization is not employed automatically on all
+ *      updates to a table that was created in the same transacton, because
+ *      for a small number of changes, it's cheaper to just create the WAL
+ *      records than fsyncing() the whole relation at COMMIT. It is only
+ *      worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *      or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *      operation; it will cause any subsequent updates to the table to skip
+ *      WAL-logging, if possible, and cause the heap to be synced to disk at
+ *      COMMIT.
+ *
+ *      To make that work, all modifications to heap must use
+ *      HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ *      for the given block.
+ * *------------------------------------------------------------------------- */#include "postgres.h"
@@ -56,6 +78,7 @@#include "access/xlogutils.h"#include "catalog/catalog.h"#include "catalog/namespace.h"
+#include "catalog/storage.h"#include "miscadmin.h"#include "pgstat.h"#include "storage/bufmgr.h"
@@ -2356,12 +2379,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate) * The new tuple is stamped with current
transactionID and the specified * command ID. *
 
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- * * The HEAP_INSERT_SKIP_FSM option is passed directly to * RelationGetBufferForTuple, which see for more info. *
@@ -2392,6 +2409,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate) * TID where the tuple was stored.  But note
thatany toasting of fields * within the tuple data is NOT reflected into *tup. */
 
+extern HTAB *pendingSyncs;Oidheap_insert(Relation relation, HeapTuple tup, CommandId cid,            int options,
BulkInsertStatebistate)
 
@@ -2465,7 +2483,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,    MarkBufferDirty(buffer);    /*
XLOGstuff */
 
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_insert xlrec;        xl_heap_header xlhdr;
@@ -2664,12 +2682,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,    int            ndone;
char       *scratch = NULL;    Page        page;
 
-    bool        needwal;    Size        saveFreeSpace;    bool        need_tuple_data =
RelationIsLogicallyLogged(relation);   bool        need_cids = RelationIsAccessibleInLogicalDecoding(relation);
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);    saveFreeSpace =
RelationGetTargetPageFreeSpace(relation,                                                  HEAP_DEFAULT_FILLFACTOR);
 
@@ -2684,7 +2700,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,     * palloc() within a
criticalsection is not safe, so we allocate this     * beforehand.     */
 
-    if (needwal)
+    if (RelationNeedsWAL(relation))        scratch = palloc(BLCKSZ);    /*
@@ -2719,6 +2735,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,        Buffer
vmbuffer= InvalidBuffer;        bool        all_visible_cleared = false;        int            nthispage;
 
+        bool        needwal;        CHECK_FOR_INTERRUPTS();
@@ -2730,6 +2747,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
            InvalidBuffer, options, bistate,                                           &vmbuffer, NULL);        page =
BufferGetPage(buffer);
+        needwal = BufferNeedsWAL(relation, buffer);        /* NO EREPORT(ERROR) from here till changes are logged */
    START_CRIT_SECTION();
 
@@ -3286,7 +3304,7 @@ l1:     * NB: heap_abort_speculative() uses the same xlog record and replay     * routines.
*/
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_delete xlrec;        XLogRecPtr    recptr;
@@ -4250,7 +4268,8 @@ l2:    MarkBufferDirty(buffer);    /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer) ||
+        BufferNeedsWAL(relation, newbuf))    {        XLogRecPtr    recptr;
@@ -5141,7 +5160,7 @@ failed:     * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG     *
entriesfor everything anyway.)     */
 
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, *buffer))    {        xl_heap_lock xlrec;        XLogRecPtr    recptr;
@@ -5843,7 +5862,7 @@ l4:        MarkBufferDirty(buf);        /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf))        {            xl_heap_lock_updated xlrec;            XLogRecPtr
recptr;
@@ -5998,7 +6017,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)    htup->t_ctid = tuple->t_self;
/*XLOG stuff */
 
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_confirm xlrec;        XLogRecPtr    recptr;
@@ -6131,7 +6150,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)     * The WAL records generated here
matchheap_delete().  The same recovery     * routines are used.     */
 
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_delete xlrec;        XLogRecPtr    recptr;
@@ -6240,7 +6259,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)    MarkBufferDirty(buffer);    /* XLOG
stuff*/
 
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_inplace xlrec;        XLogRecPtr    recptr;
@@ -7354,7 +7373,7 @@ log_heap_clean(Relation reln, Buffer buffer,    XLogRecPtr    recptr;    /* Caller should not
callme on a non-WAL-logged relation */
 
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));    xlrec.latestRemovedXid = latestRemovedXid;    xlrec.nredirected =
nredirected;
@@ -7402,7 +7421,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,    XLogRecPtr    recptr;
 /* Caller should not call me on a non-WAL-logged relation */
 
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));    /* nor when there are no tuples to freeze */    Assert(ntuples > 0);
@@ -7487,7 +7506,7 @@ log_heap_update(Relation reln, Buffer oldbuf,    int            bufflags;    /* Caller should not
callme on a non-WAL-logged relation */
 
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));    XLogBeginInsert();
@@ -7590,76 +7609,86 @@ log_heap_update(Relation reln, Buffer oldbuf,    xlrec.new_offnum =
ItemPointerGetOffsetNumber(&newtup->t_self);   xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
 
+    XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
+    bufflags = REGBUF_STANDARD;    if (init)        bufflags |= REGBUF_WILL_INIT;    if (need_tuple_data)
bufflags|= REGBUF_KEEP_DATA;
 
-    XLogRegisterBuffer(0, newbuf, bufflags);
-    if (oldbuf != newbuf)
-        XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
-
-    XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
-    /*     * Prepare WAL data for the new tuple.     */
-    if (prefixlen > 0 || suffixlen > 0)
+    if (BufferNeedsWAL(reln, newbuf))    {
-        if (prefixlen > 0 && suffixlen > 0)
-        {
-            prefix_suffix[0] = prefixlen;
-            prefix_suffix[1] = suffixlen;
-            XLogRegisterBufData(0, (char *) &prefix_suffix, sizeof(uint16) * 2);
-        }
-        else if (prefixlen > 0)
-        {
-            XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
-        }
-        else
+        XLogRegisterBuffer(0, newbuf, bufflags);
+
+        if ((prefixlen > 0 || suffixlen > 0))        {
-            XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+            if (prefixlen > 0 && suffixlen > 0)
+            {
+                prefix_suffix[0] = prefixlen;
+                prefix_suffix[1] = suffixlen;
+                XLogRegisterBufData(0, (char *) &prefix_suffix,
+                                    sizeof(uint16) * 2);
+            }
+            else if (prefixlen > 0)
+            {
+                XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
+            }
+            else
+            {
+                XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+            }        }
-    }
-    xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
-    xlhdr.t_infomask = newtup->t_data->t_infomask;
-    xlhdr.t_hoff = newtup->t_data->t_hoff;
-    Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
+        xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
+        xlhdr.t_infomask = newtup->t_data->t_infomask;
+        xlhdr.t_hoff = newtup->t_data->t_hoff;
+        Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
-    /*
-     * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
-     *
-     * The 'data' doesn't include the common prefix or suffix.
-     */
-    XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
-    if (prefixlen == 0)
-    {
-        XLogRegisterBufData(0,
-                            ((char *) newtup->t_data) + SizeofHeapTupleHeader,
-                          newtup->t_len - SizeofHeapTupleHeader - suffixlen);
-    }
-    else
-    {        /*
-         * Have to write the null bitmap and data after the common prefix as
-         * two separate rdata entries.
+         * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
+         *
+         * The 'data' doesn't include the common prefix or suffix.         */
-        /* bitmap [+ padding] [+ oid] */
-        if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+        XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
+        if (prefixlen == 0)        {            XLogRegisterBufData(0,                           ((char *)
newtup->t_data)+ SizeofHeapTupleHeader,
 
-                             newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+                          newtup->t_len - SizeofHeapTupleHeader - suffixlen);        }
+        else
+        {
+            /*
+             * Have to write the null bitmap and data after the common prefix
+             * as two separate rdata entries.
+             */
+            /* bitmap [+ padding] [+ oid] */
+            if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+            {
+                XLogRegisterBufData(0,
+                           ((char *) newtup->t_data) + SizeofHeapTupleHeader,
+                             newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+            }
-        /* data after common prefix */
-        XLogRegisterBufData(0,
+            /* data after common prefix */
+            XLogRegisterBufData(0,              ((char *) newtup->t_data) + newtup->t_data->t_hoff + prefixlen,
    newtup->t_len - newtup->t_data->t_hoff - prefixlen - suffixlen);
 
+        }    }
+    /*
+     * If the old and new tuple are on different pages, also register the old
+     * page, so that a full-page image is created for it if necessary. We
+     * don't need any extra information to replay changes to it.
+     */
+    if (oldbuf != newbuf && BufferNeedsWAL(reln, oldbuf))
+        XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
+    /* We need to log a tuple identity */    if (need_tuple_data && old_key_tuple)    {
@@ -8578,8 +8607,13 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)     */    /* Deal with old tuple
version*/
 
-    oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
-                                      &obuffer);
+    if (oldblk == newblk)
+        oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+    else if (XLogRecHasBlockRef(record, 1))
+        oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+    else
+        oldaction = BLK_DONE;
+    if (oldaction == BLK_NEEDS_REDO)    {        page = BufferGetPage(obuffer);
@@ -8633,6 +8667,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)        PageInit(page,
BufferGetPageSize(nbuffer),0);        newaction = BLK_NEEDS_REDO;    }
 
+    else if (!XLogRecHasBlockRef(record, 0))
+        newaction = BLK_DONE;    else        newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
@@ -9069,9 +9105,16 @@ heap2_redo(XLogReaderState *record) *    heap_sync        - sync a heap, for use when no WAL has
beenwritten * * This forces the heap contents (including TOAST heap if any) down to disk.
 
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead. * * Indexes are not touched.  (Currently, index operations associated with * the
commandsthat use this are WAL-logged and so do not need fsync.
 
@@ -9181,3 +9224,33 @@ heap_mask(char *pagedata, BlockNumber blkno)        }    }}
+
+/*
+ *    heap_register_sync    - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RecordPendingSync(rel);
+    if (OidIsValid(rel->rd_rel->reltoastrelid))
+    {
+        Relation    toastrel;
+
+        toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+        RecordPendingSync(toastrel);
+        heap_close(toastrel, AccessShareLock);
+    }
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index d69a266..4754278 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@#include "access/htup_details.h"#include "access/xlog.h"#include "catalog/catalog.h"
+#include "catalog/storage.h"#include "miscadmin.h"#include "pgstat.h"#include "storage/bufmgr.h"
@@ -260,7 +261,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,        /*         *
Emita WAL HEAP_CLEAN record showing what we did         */
 
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))        {            XLogRecPtr    recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d7f65a5..6462f44 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -649,9 +649,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)    }    else if (HeapTupleHasExternal(tup) ||
tup->t_len> TOAST_TUPLE_THRESHOLD)        heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
 
-                                         HEAP_INSERT_SKIP_FSM |
-                                         (state->rs_use_wal ?
-                                          0 : HEAP_INSERT_SKIP_WAL));
+                                         HEAP_INSERT_SKIP_FSM);    else        heaptup = tup;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e5616ce..933fa9c 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@#include "access/heapam_xlog.h"#include "access/visibilitymap.h"#include "access/xlog.h"
+#include "catalog/storage.h"#include "miscadmin.h"#include "storage/bufmgr.h"#include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,        map[mapByte] |= (flags
<<mapOffset);        MarkBufferDirty(vmBuf);
 
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, heapBuf))        {            if (XLogRecPtrIsInvalid(recptr))            {
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 92b263a..313a03b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2007,6 +2007,9 @@ CommitTransaction(void)    /* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+    /*     * Mark serializable transaction as complete for predicate locking     * purposes.  This should be done as
lateas we can put it and still allow
 
@@ -2238,6 +2241,9 @@ PrepareTransaction(void)    /* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+    /*     * Mark serializable transaction as complete for predicate locking     * purposes.  This should be done as
lateas we can put it and still allow
 
@@ -2545,6 +2551,7 @@ AbortTransaction(void)    AtAbort_Notify();    AtEOXact_RelationMap(false);
AtAbort_Twophase();
+    smgrDoPendingSyncs(false);    /* abandone pending syncs */    /*     * Advertise the fact that we aborted in
pg_xact(assuming that we got as
 
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index f677916..14df0b1 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,6 +29,7 @@#include "catalog/storage_xlog.h"#include "storage/freespace.h"#include "storage/smgr.h"
+#include "utils/hsearch.h"#include "utils/memutils.h"#include "utils/rel.h"
@@ -64,6 +65,49 @@ typedef struct PendingRelDeletestatic PendingRelDelete *pendingDeletes = NULL; /* head of linked
list*//*
 
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+    RelFileNode relnode;        /* relation created in same xact */
+    BlockNumber sync_above;        /* WAL-logging skipped for blocks >=
+                                 * sync_above */
+    BlockNumber truncated_to;    /* truncation WAL record was written */
+}    PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
+/* * RelationCreateStorage *        Create physical storage for a relation. *
@@ -226,6 +270,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)voidRelationTruncate(Relation rel,
BlockNumbernblocks){
 
+    PendingRelSync *pending = NULL;
+    bool        found;    bool        fsm;    bool        vm;
@@ -260,37 +306,81 @@ RelationTruncate(Relation rel, BlockNumber nblocks)     */    if (RelationNeedsWAL(rel))    {
-        /*
-         * Make an XLOG entry reporting the file truncation.
-         */
-        XLogRecPtr    lsn;
-        xl_smgr_truncate xlrec;
-
-        xlrec.blkno = nblocks;
-        xlrec.rnode = rel->rd_node;
-        xlrec.flags = SMGR_TRUNCATE_ALL;
-
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+        /* no_pending_sync is ignored since new entry is created here */
+        if (!rel->pending_sync)
+        {
+            if (!pendingSyncs)
+                createPendingSyncsHash();
+            elog(LOG, "RelationTruncate: accessing hash");
+            pending = (PendingRelSync *) hash_search(pendingSyncs,
+                                                 (void *) &rel->rd_node,
+                                                 HASH_ENTER, &found);
+            if (!found)
+            {
+                pending->sync_above = InvalidBlockNumber;
+                pending->truncated_to = InvalidBlockNumber;
+            }
-        lsn = XLogInsert(RM_SMGR_ID,
-                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+            rel->no_pending_sync= false;
+            rel->pending_sync = pending;
+        }
-        /*
-         * Flush, because otherwise the truncation of the main relation might
-         * hit the disk before the WAL record, and the truncation of the FSM
-         * or visibility map. If we crashed during that window, we'd be left
-         * with a truncated heap, but the FSM or visibility map would still
-         * contain entries for the non-existent heap pages.
-         */
-        if (fsm || vm)
-            XLogFlush(lsn);
+        if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+            rel->pending_sync->sync_above < nblocks)
+        {
+            /*
+             * Make an XLOG entry reporting the file truncation.
+             */
+            XLogRecPtr        lsn;
+            xl_smgr_truncate xlrec;
+
+            xlrec.blkno = nblocks;
+            xlrec.rnode = rel->rd_node;
+
+            XLogBeginInsert();
+            XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+            lsn = XLogInsert(RM_SMGR_ID,
+                             XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+            elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+                 nblocks);
+
+            /*
+             * Flush, because otherwise the truncation of the main relation
+             * might hit the disk before the WAL record, and the truncation of
+             * the FSM or visibility map. If we crashed during that window,
+             * we'd be left with a truncated heap, but the FSM or visibility
+             * map would still contain entries for the non-existent heap
+             * pages.
+             */
+            if (fsm || vm)
+                XLogFlush(lsn);
+
+            rel->pending_sync->truncated_to = nblocks;
+        }    }    /* Do the real work */    smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);}
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+    /* First time through: initialize the hash table */
+    HASHCTL        ctl;
+
+    MemSet(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(RelFileNode);
+    ctl.entrysize = sizeof(PendingRelSync);
+    ctl.hash = tag_hash;
+    pendingSyncs = hash_create("pending relation sync table", 5,
+                               &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+/* *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact. *
@@ -369,6 +459,24 @@ smgrDoPendingDeletes(bool isCommit)}/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+    bool found;
+
+    rel->pending_sync = NULL;
+    rel->no_pending_sync = true;
+    if (pendingSyncs)
+    {
+        elog(LOG, "RelationRemovePendingSync: accessing hash");
+        hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+    }
+}
+
+
+/* * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted. * * The return value is the number of
relationsscheduled for termination.
 
@@ -419,6 +527,166 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)    return nrels;}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+    bool found = true;
+    BlockNumber nblocks;
+
+    Assert(RelationNeedsWAL(rel));
+
+    /* ignore no_pending_sync since new entry is created here */
+    if (!rel->pending_sync)
+    {
+        if (!pendingSyncs)
+            createPendingSyncsHash();
+
+        /* Look up or create an entry */
+        rel->no_pending_sync = false;
+        elog(LOG, "RecordPendingSync: accessing hash");
+        rel->pending_sync =
+            (PendingRelSync *) hash_search(pendingSyncs,
+                                           (void *) &rel->rd_node,
+                                           HASH_ENTER, &found);
+    }
+
+    nblocks = RelationGetNumberOfBlocks(rel);
+    if (!found)
+    {
+        rel->pending_sync->truncated_to = InvalidBlockNumber;
+        rel->pending_sync->sync_above = nblocks;
+
+        elog(DEBUG2,
+             "registering new pending sync for rel %u/%u/%u at block %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             nblocks);
+
+    }
+    else if (rel->pending_sync->sync_above == InvalidBlockNumber)
+    {
+        elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             nblocks);
+        rel->pending_sync->sync_above = nblocks;
+    }
+    else
+        elog(DEBUG2,
+             "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             rel->pending_sync->sync_above, nblocks);
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+    BlockNumber blkno = InvalidBlockNumber;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    elog(LOG, "BufferNeedsWAL: pendingSyncs = %p, no_pending_sync = %d", pendingSyncs, rel->no_pending_sync);
+    /* no further work if we know that we don't have pending sync */
+    if (!pendingSyncs || rel->no_pending_sync)
+        return true;
+
+    /* do the real work */
+    if (!rel->pending_sync)
+    {
+        bool found = false;
+
+        elog(LOG, "BufferNeedsWAL: accessing hash");
+        rel->pending_sync =
+            (PendingRelSync *) hash_search(pendingSyncs,
+                                           (void *) &rel->rd_node,
+                                           HASH_FIND, &found);
+        if (!found)
+        {
+            /* we don't have no one. don't access the hash no longer */
+            rel->no_pending_sync = true;
+            return true;
+        }
+    }
+
+    blkno = BufferGetBlockNumber(buf);
+    if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+        rel->pending_sync->sync_above > blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno, rel->pending_sync->sync_above);
+        return true;
+    }
+
+    /*
+     * We have emitted a truncation record for this block.
+     */
+    if (rel->pending_sync->truncated_to != InvalidBlockNumber &&
+        rel->pending_sync->truncated_to <= blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same
xact",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno);
+        return true;
+    }
+
+    elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+         blkno);
+
+    return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    if (!pendingSyncs)
+        return;
+
+    if (isCommit)
+    {
+        HASH_SEQ_STATUS status;
+        PendingRelSync *pending;
+
+        hash_seq_init(&status, pendingSyncs);
+
+        while ((pending = hash_seq_search(&status)) != NULL)
+        {
+            if (pending->sync_above != InvalidBlockNumber)
+            {
+                FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+                smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+                elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+                     pending->relnode.dbNode, pending->relnode.relNode);
+            }
+        }
+    }
+
+    hash_destroy(pendingSyncs);
+    pendingSyncs = NULL;
+}
+/* *    PostPrepare_smgr -- Clean up after a successful PREPARE *
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b5af2be..8aa7e7b 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2372,8 +2372,7 @@ CopyFrom(CopyState cstate)     *    - data is being written to relfilenode created in this
transaction    * then we can skip writing WAL.  It's safe because if the transaction     * doesn't commit, we'll
discardthe table (or the new relfilenode file).
 
-     * If it does commit, we'll have done the heap_sync at the bottom of this
-     * routine first.
+     * If it does commit, commit will do heap_sync().     *     * As mentioned in comments in utils/rel.h, the
in-same-transactiontest     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
 
@@ -2405,7 +2404,7 @@ CopyFrom(CopyState cstate)    {        hi_options |= HEAP_INSERT_SKIP_FSM;        if
(!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(cstate->rel);    }    /*
@@ -2782,11 +2781,11 @@ CopyFrom(CopyState cstate)    FreeExecutorState(estate);    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway)
+     * If we skipped writing WAL, then we will sync the heap at the end of
+     * the transaction. (We used to do it here, but it was later found out
+     * that to be safe, we must also avoid WAL-logging any subsequent
+     * actions on the pages we skipped WAL for). Indexes always use WAL.     */
-    if (hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(cstate->rel);    return processed;}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 06425cc..408495e 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -567,8 +567,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)     * We can skip
WAL-loggingthe insertions, unless PITR or streaming     * replication is in use. We can skip the FSM in any case.
*/
-    myState->hi_options = HEAP_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+    if (!XLogIsNeeded())
+        heap_register_sync(intoRelationDesc);
+    myState->hi_options = HEAP_INSERT_SKIP_FSM;    myState->bistate = GetBulkInsertState();    /* Not using WAL
requiressmgr_targblock be initially invalid */
 
@@ -617,9 +618,7 @@ intorel_shutdown(DestReceiver *self)    FreeBulkInsertState(myState->bistate);
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->rel);
+    /* If we skipped using WAL, we will sync the relation at commit */    /* close rel, but keep lock until commit */
 heap_close(myState->rel, NoLock);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 9ffd91e..8b127e3 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -477,7 +477,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)     */
myState->hi_options= HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;    if (!XLogIsNeeded()) 
-        myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+        heap_register_sync(transientrel);    myState->bistate = GetBulkInsertState();    /* Not using WAL requires
smgr_targblockbe initially invalid */
 
@@ -520,9 +520,7 @@ transientrel_shutdown(DestReceiver *self)    FreeBulkInsertState(myState->bistate);
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->transientrel);
+    /* If we skipped using WAL, we will sync the relation at commit */    /* close transientrel, but keep lock until
commit*/    heap_close(myState->transientrel, NoLock);
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index abb262b..2fd210b 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4327,8 +4327,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)        bistate =
GetBulkInsertState();       hi_options = HEAP_INSERT_SKIP_FSM;
 
+        if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(newrel);    }    else    {
@@ -4589,8 +4590,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);       /* If we skipped writing WAL, then we need to sync the heap. */
 
-        if (hi_options & HEAP_INSERT_SKIP_WAL)
-            heap_sync(newrel);        heap_close(newrel, NoLock);    }
@@ -10510,11 +10509,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)    /*     * Create
andcopy all forks of the relation, and schedule unlinking of
 
-     * old physical files.
+     * old physical files. Pending syncs for the old node is no longer needed.     *     * NOTE: any conflict in
relfilenodevalue will be caught in     * RelationCreateStorage().     */
 
+    RelationRemovePendingSync(rel);    RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);    /* copy main
fork*/
 
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 5b43a66..f3dcf6e 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -893,7 +893,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,                 * page has
beenpreviously WAL-logged, and if not, do that                 * now.                 */
 
-                if (RelationNeedsWAL(onerel) &&
+                if (BufferNeedsWAL(onerel, buf) &&                    PageGetLSN(page) == InvalidXLogRecPtr)
        log_newpage_buffer(buf, true);
 
@@ -1120,7 +1120,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,            }            /*
NowWAL-log freezing if necessary */
 
-            if (RelationNeedsWAL(onerel))
+            if (BufferNeedsWAL(onerel, buf))            {                XLogRecPtr    recptr;
@@ -1480,7 +1480,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,    MarkBufferDirty(buffer);
/* XLOG stuff */
 
-    if (RelationNeedsWAL(onerel))
+    if (BufferNeedsWAL(onerel, buffer))    {        XLogRecPtr    recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 2109cbf..e991e9f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,            BufferAccessStrategy strategy,
  bool *foundPtr);static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
 
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);static void AtProcExit_Buffers(int code,
Datumarg);static void CheckForBufferLeaks(void);static int    rnode_comparator(const void *p1, const void *p2);
 
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)voidFlushRelationBuffers(Relation rel){
-    int            i;
-    BufferDesc *bufHdr;
-    /* Open rel at the smgr level if not already done */    RelationOpenSmgr(rel);
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int            i;
+    BufferDesc *bufHdr;
+
+    if (islocal)    {        for (i = 0; i < NLocBuffer; i++)        {            uint32        buf_state;
bufHdr= GetLocalBufferDescriptor(i);
 
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&                ((buf_state =
pg_atomic_read_u32(&bufHdr->state))&                 (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))            {
 
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)                PageSetChecksumInplace(localpage,
bufHdr->tag.blockNum);
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,                          bufHdr->tag.forkNum,
bufHdr->tag.blockNum,                         localpage,
 
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)         * As in DropRelFileNodeBuffers, an unlocked precheck
shouldbe safe         * and saves some cycles.         */
 
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))            continue;        ReservePrivateRefCountEntry();
    buf_state = LockBufHdr(bufHdr);
 
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&            (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID
|BM_DIRTY))        {            PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),LW_SHARED);
 
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);            LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr,true);        }
 
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index ddb9485..b6b0d78 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -72,6 +72,7 @@#include "optimizer/var.h"#include "rewrite/rewriteDefine.h"#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"#include "storage/lmgr.h"#include "storage/smgr.h"#include "utils/array.h"
@@ -418,6 +419,10 @@ AllocateRelationDesc(Form_pg_class relp)    /* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount= 1;
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+    MemoryContextSwitchTo(oldcxt);    return relation;
@@ -2032,6 +2037,10 @@ formrdesc(const char *relationName, Oid relationReltype,        relation->rd_rel->relhasindex =
true;   }
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+    /*     * add new reldesc to relcache     */
@@ -3353,6 +3362,10 @@ RelationBuildLocalRelation(const char *relname,    else        rel->rd_rel->relfilenode =
relfilenode;
+    /* newly built relation has no pending sync */
+    rel->no_pending_sync = true;
+    rel->pending_sync = NULL;
+    RelationInitLockInfo(rel);    /* see lmgr.c */    RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 7e85510..3967641 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    0x0001
-#define HEAP_INSERT_SKIP_FSM    0x0002
-#define HEAP_INSERT_FROZEN        0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM    0x0001
+#define HEAP_INSERT_FROZEN        0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004typedef struct BulkInsertStateData *BulkInsertState;
@@ -178,6 +177,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);extern void
simple_heap_update(Relationrelation, ItemPointer otid,                   HeapTuple tup);
 
+extern void heap_register_sync(Relation relation);extern void heap_sync(Relation relation);extern void
heap_update_snapshot(HeapScanDescscan, Snapshot snapshot);
 
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index fea96de..b9d485a 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);extern void
RelationDropStorage(Relationrel);extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);extern void
RelationTruncate(Relationrel, BlockNumber nblocks);
 
-
+extern void RelationRemovePendingSync(Relation rel);/* * These functions used to be in storage/smgr/smgr.c, which
explainsthe * naming */extern void smgrDoPendingDeletes(bool isCommit);extern int    smgrGetPendingDeletes(bool
forCommit,RelFileNode **ptr);
 
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);extern void AtSubCommit_smgr(void);extern void
AtSubAbort_smgr(void);externvoid PostPrepare_smgr(void);
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 07a32d6..6ec2d26 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);extern void FlushOneBuffer(Buffer buffer);extern void FlushRelationBuffers(Relation rel);
 
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                    bool islocal);extern void FlushDatabaseBuffers(Oid dbid);extern void
DropRelFileNodeBuffers(RelFileNodeBackendrnode,                       ForkNumber forkNum, BlockNumber firstDelBlock);
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index ab875bb..666273e 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -216,6 +216,14 @@ typedef struct RelationData    /* use "struct" here to avoid needing to include pgstat.h: */
structPgStat_TableStatus *pgstat_info;        /* statistics collection area */
 
+
+    /*
+     * no_pending_sync is true if this relation is kown not to have pending
+     * syncs.  Elsewise searching for registered sync is required if
+     * pending_sync is NULL.
+     */
+    bool                   no_pending_sync;
+    struct PendingRelSync *pending_sync;} RelationData;
diff --git a/src/test/recovery/t/012_truncate_opt.pl b/src/test/recovery/t/012_truncate_opt.pl
new file mode 100644
index 0000000..baf5604
--- /dev/null
+++ b/src/test/recovery/t/012_truncate_opt.pl
@@ -0,0 +1,94 @@
+# Set of tests to check TRUNCATE optimizations with CREATE TABLE
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+my $node = get_new_node('master');
+$node->init;
+
+my $copy_file = $node->backup_dir . "copy_data.txt";
+
+$node->append_conf('postgresql.conf', qq{
+fsync = on
+wal_level = minimal
+});
+
+$node->start;
+
+# Create file containing data to COPY
+TestLib::append_to_file($copy_file, qq{copied row 1
+copied row 2
+copied row 3
+});
+
+# CREATE, INSERT, COPY, crash.
+#
+# If COPY inserts to the existing block, and is not WAL-logged, replaying
+# the implicit FPW of the INSERT record will destroy the COPY data.
+$node->psql('postgres', qq{
+BEGIN;
+CREATE TABLE test1(t text PRIMARY KEY);
+INSERT INTO test1 VALUES ('inserted row');
+COPY test1 FROM '$copy_file';
+COMMIT;
+});
+# Enforce recovery and check the state of table. There should be 4 rows.
+$node->stop('immediate');
+$node->start;
+my $ret = $node->safe_psql('postgres', 'SELECT count(*) FROM test1');
+is($ret, '4', 'SELECT reports 4 rows');
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE test1;');
+
+# CREATE, COPY, crash. Trigger in COPY that inserts more to same table.
+#
+# If the INSERTS from the trigger go to the same block we're copying to,
+# and the INSERTs are WAL-logged, WAL replay will fail when it tries to
+# replay the WAL record but the "before" image doesn't match, because not
+# all changes were WAL-logged.
+$node->psql('postgres', qq{
+BEGIN;
+CREATE TABLE test1(t text PRIMARY KEY);
+CREATE FUNCTION test1_beforetrig() RETURNS trigger LANGUAGE plpgsql as \$\$
+  BEGIN
+  IF new.t NOT LIKE 'triggered%' THEN
+    INSERT INTO test1 VALUES ('triggered ' || NEW.t);
+  END IF;
+  RETURN NEW;
+END;
+\$\$;
+CREATE TRIGGER test1_beforeinsert BEFORE INSERT ON test1
+FOR EACH ROW EXECUTE PROCEDURE test1_beforetrig();
+COPY test1 FROM '$copy_file';
+COMMIT;
+});
+# Enforce recovery and check the state of table. There should be 6
+# rows here.
+$node->stop('immediate');
+$node->start;
+$ret = $node->safe_psql('postgres', 'SELECT count(*) FROM test1');
+is($ret, '6', 'SELECT returns 6 rows');
+# Clean up
+$node->safe_psql('postgres', 'DROP TABLE test1;');
+$node->safe_psql('postgres', 'DROP FUNCTION test1_beforetrig();');
+
+# CREATE, TRUNCATE, COPY, crash.
+#
+# If we skip WAL-logging of the COPY, replaying the TRUNCATE record destroys
+# the newly inserted data.
+$node->psql('postgres', qq{
+BEGIN;
+CREATE TABLE test1(t text PRIMARY KEY);
+TRUNCATE test1;
+COPY test1 FROM '$copy_file';
+COMMIT;
+});
+# Enforce recovery and check the state of table. There should be 3
+# rows here.
+$node->stop('immediate');
+$node->start;
+$ret = $node->safe_psql('postgres', 'SELECT count(*) FROM test1');
+is($ret, '3', 'SELECT returns 3 rows');

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

13 апреля 2017 г., 10:52:40

On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Sorry, what I have just sent was broken.

You can use PROVE_TESTS when running make check to select a subset of
tests you want to run. I use that all the time when working on patches
dedicated to certain code paths.

>> - Relation has new members no_pending_sync and pending_sync that
>>   works as instant cache of an entry in pendingSync hash.
>> - Commit-time synchronizing is restored as Michael's patch.
>> - If relfilenode is replaced, pending_sync for the old node is
>>   removed. Anyway this is ignored on abort and meaningless on
>>   commit.
>> - TAP test is renamed to 012 since some new files have been added.
>>
>> Accessing pending sync hash occurred on every calling of
>> HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
>> accessing relations has pending sync.  Almost of them are
>> eliminated as the result.

Did you actually test this patch? One of the logs added makes the
tests a long time to run:
2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
STATEMENT:  ANALYZE;
2017-04-13 12:12:25.766 JST [85492] LOG:  BufferNeedsWAL: pendingSyncs
= 0x0, no_pending_sync = 0

-       lsn = XLogInsert(RM_SMGR_ID,
-                        XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+           rel->no_pending_sync= false;
+           rel->pending_sync = pending;
+       }
It seems to me that those flags and the pending_sync data should be
kept in the context of backend process and not be part of the Relation
data...

+void
+RecordPendingSync(Relation rel)
I don't think that I agree that this should be part of relcache.c. The
syncs are tracked should be tracked out of the relation context.

Seeing how invasive this change is, I would also advocate for this
patch as only being a HEAD-only change, not many people are
complaining about this optimization of TRUNCATE missing when wal_level
= minimal, and this needs a very careful review.

Should I code something? Or Horiguchi-san, would you take care of it?
The previous crash I saw has been taken care of, but it's been really
some time since I looked at this patch...
-- 
Michael

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

13 апреля 2017 г., 12:29:35

I'd like to put a supplimentary explanation.

At Tue, 11 Apr 2017 17:38:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170411.173812.133964522.horiguchi.kyotaro@lab.ntt.co.jp>
> Sorry, what I have just sent was broken.
> 
> At Tue, 11 Apr 2017 17:33:41 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in<20170411.173341.257028732.horiguchi.kyotaro@lab.ntt.co.jp>
 
> > At Tue, 11 Apr 2017 09:56:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in<20170411.095606.245908357.horiguchi.kyotaro@lab.ntt.co.jp>
 
> > > Hello, thank you for looking this.
> > > 
> > > At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <27309.1491611915@sss.pgh.pa.us>
> > > > Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > > > > Interesting.  I wonder if it's possible that a relcache invalidation
> > > > > would cause these values to get lost for some reason, because that would
> > > > > be dangerous.
> > > > 
> > > > > I suppose the rationale is that this shouldn't happen because any
> > > > > operation that does things this way must hold an exclusive lock on the
> > > > > relation.  But that doesn't guarantee that the relcache entry is
> > > > > completely stable,
> > > > 
> > > > It ABSOLUTELY is not safe.  Relcache flushes can happen regardless of
> > > > how strong a lock you hold.
> > > > 
> > > >             regards, tom lane
> > > 
> > > Ugh. Yes, relcache invalidation happens anytime and it resets the

The pending locations are not stored in relcache hash so the
problem here is not invalidation but that Relation objects are
created as necessary, anywhere. Even if no invalidation happens,
the same thing will happen in a bit different form.

> > > added values. pg_stat_info deceived me that it can store
> > > transient values. But I  came up with another thought.
> > > 
> > > The reason I proposed it was I thought that hash_search for every
> > > buffer is not good. Instead, like pg_stat_info, we can link the
> > 
> > buffer => buffer modification
> > 
> > > pending-sync hash entry to Relation. This greately reduces the
> > > frequency of hash-searching.
> > > 
> > > I'll post new patch in this way soon.
> > 
> > Here it is.
> 
> It contained tariling space and missing test script.  This is the
> correct patch.
> 
> > - Relation has new members no_pending_sync and pending_sync that
> >   works as instant cache of an entry in pendingSync hash.
> > 
> > - Commit-time synchronizing is restored as Michael's patch.
> > 
> > - If relfilenode is replaced, pending_sync for the old node is
> >   removed. Anyway this is ignored on abort and meaningless on
> >   commit.
> > 
> > - TAP test is renamed to 012 since some new files have been added.
> > 
> > Accessing pending sync hash occured on every calling of
> > HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> > accessing relations has pending sync.  Almost of them are
> > eliminated as the result.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

13 апреля 2017 г., 15:42:19

At Thu, 13 Apr 2017 13:52:40 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqTRyica1d-zU+YckveFC876=Sc847etmk7TRgAS2pA9CA@mail.gmail.com>
> On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Sorry, what I have just sent was broken.
> 
> You can use PROVE_TESTS when running make check to select a subset of
> tests you want to run. I use that all the time when working on patches
> dedicated to certain code paths.

Thank you for the information. Removing unwanted test scripts
from t/ directories was annoyance. This makes me happy.

> >> - Relation has new members no_pending_sync and pending_sync that
> >>   works as instant cache of an entry in pendingSync hash.
> >> - Commit-time synchronizing is restored as Michael's patch.
> >> - If relfilenode is replaced, pending_sync for the old node is
> >>   removed. Anyway this is ignored on abort and meaningless on
> >>   commit.
> >> - TAP test is renamed to 012 since some new files have been added.
> >>
> >> Accessing pending sync hash occurred on every calling of
> >> HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> >> accessing relations has pending sync.  Almost of them are
> >> eliminated as the result.
> 
> Did you actually test this patch? One of the logs added makes the
> tests a long time to run:

Maybe this patch requires make clean since it extends the
structure RelationData. (Perhaps I saw the same trouble.)

> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
> STATEMENT:  ANALYZE;
> 2017-04-13 12:12:25.766 JST [85492] LOG:  BufferNeedsWAL: pendingSyncs
> = 0x0, no_pending_sync = 0
> 
> -       lsn = XLogInsert(RM_SMGR_ID,
> -                        XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
> +           rel->no_pending_sync= false;
> +           rel->pending_sync = pending;
> +       }
> 
> It seems to me that those flags and the pending_sync data should be
> kept in the context of backend process and not be part of the Relation
> data...

I understand that the context of "backend process" means
storage.c local. I don't mind the context on which the data is,
but I found only there that can get rid of frequent hash
searching. For pending deletions, just appending to a list is
enough and costs almost nothing, on the other hand pendig syncs
are required to be referenced, sometimes very frequently.

> +void
> +RecordPendingSync(Relation rel)
> I don't think that I agree that this should be part of relcache.c. The
> syncs are tracked should be tracked out of the relation context.

Yeah.. It's in storage.c in the latest patch. (Sorry for the
duplicate name). I think it is a kind of bond between smgr and
relation.

> Seeing how invasive this change is, I would also advocate for this
> patch as only being a HEAD-only change, not many people are
> complaining about this optimization of TRUNCATE missing when wal_level
> = minimal, and this needs a very careful review.

Agreed.

> Should I code something? Or Horiguchi-san, would you take care of it?
> The previous crash I saw has been taken care of, but it's been really
> some time since I looked at this patch...

My point is hash-search on every tuple insertion should be evaded
even if it happens rearely. Once it was a bit apart from your
original patch, but in the latest patch the significant part
(pending-sync hash) is revived from the original one.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Daniel Gustafsson

Дата:

05 сентября 2017 г., 16:05:01

> On 13 Apr 2017, at 11:42, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> At Thu, 13 Apr 2017 13:52:40 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqTRyica1d-zU+YckveFC876=Sc847etmk7TRgAS2pA9CA@mail.gmail.com>
>> On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> Sorry, what I have just sent was broken.
>>
>> You can use PROVE_TESTS when running make check to select a subset of
>> tests you want to run. I use that all the time when working on patches
>> dedicated to certain code paths.
>
> Thank you for the information. Removing unwanted test scripts
> from t/ directories was annoyance. This makes me happy.
>
>>>> - Relation has new members no_pending_sync and pending_sync that
>>>>  works as instant cache of an entry in pendingSync hash.
>>>> - Commit-time synchronizing is restored as Michael's patch.
>>>> - If relfilenode is replaced, pending_sync for the old node is
>>>>  removed. Anyway this is ignored on abort and meaningless on
>>>>  commit.
>>>> - TAP test is renamed to 012 since some new files have been added.
>>>>
>>>> Accessing pending sync hash occurred on every calling of
>>>> HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
>>>> accessing relations has pending sync.  Almost of them are
>>>> eliminated as the result.
>>
>> Did you actually test this patch? One of the logs added makes the
>> tests a long time to run:
>
> Maybe this patch requires make clean since it extends the
> structure RelationData. (Perhaps I saw the same trouble.)
>
>> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
>> STATEMENT:  ANALYZE;
>> 2017-04-13 12:12:25.766 JST [85492] LOG:  BufferNeedsWAL: pendingSyncs
>> = 0x0, no_pending_sync = 0
>>
>> -       lsn = XLogInsert(RM_SMGR_ID,
>> -                        XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
>> +           rel->no_pending_sync= false;
>> +           rel->pending_sync = pending;
>> +       }
>>
>> It seems to me that those flags and the pending_sync data should be
>> kept in the context of backend process and not be part of the Relation
>> data...
>
> I understand that the context of "backend process" means
> storage.c local. I don't mind the context on which the data is,
> but I found only there that can get rid of frequent hash
> searching. For pending deletions, just appending to a list is
> enough and costs almost nothing, on the other hand pendig syncs
> are required to be referenced, sometimes very frequently.
>
>> +void
>> +RecordPendingSync(Relation rel)
>> I don't think that I agree that this should be part of relcache.c. The
>> syncs are tracked should be tracked out of the relation context.
>
> Yeah.. It's in storage.c in the latest patch. (Sorry for the
> duplicate name). I think it is a kind of bond between smgr and
> relation.
>
>> Seeing how invasive this change is, I would also advocate for this
>> patch as only being a HEAD-only change, not many people are
>> complaining about this optimization of TRUNCATE missing when wal_level
>> = minimal, and this needs a very careful review.
>
> Agreed.
>
>> Should I code something? Or Horiguchi-san, would you take care of it?
>> The previous crash I saw has been taken care of, but it's been really
>> some time since I looked at this patch...
>
> My point is hash-search on every tuple insertion should be evaded
> even if it happens rearely. Once it was a bit apart from your
> original patch, but in the latest patch the significant part
> (pending-sync hash) is revived from the original one.

This patch has followed along since CF 2016-03, do we think we can reach a
conclusion in this CF?  It was marked as "Waiting on Author”, based on
developments since in this thread, I’ve changed it back to “Needs Review”
again.

cheers ./daniel

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

08 сентября 2017 г., 13:30:01

Thank you for your notification.

At Tue, 5 Sep 2017 12:05:01 +0200, Daniel Gustafsson <daniel@yesql.se> wrote in
<B3EC34FC-A48E-41AA-8598-BFC5D87CB383@yesql.se>
> > On 13 Apr 2017, at 11:42, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > 
> > At Thu, 13 Apr 2017 13:52:40 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqTRyica1d-zU+YckveFC876=Sc847etmk7TRgAS2pA9CA@mail.gmail.com>
> >> On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI
> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >>> Sorry, what I have just sent was broken.
> >> 
> >> You can use PROVE_TESTS when running make check to select a subset of
> >> tests you want to run. I use that all the time when working on patches
> >> dedicated to certain code paths.
> > 
> > Thank you for the information. Removing unwanted test scripts
> > from t/ directories was annoyance. This makes me happy.
> > 
> >>>> - Relation has new members no_pending_sync and pending_sync that
> >>>>  works as instant cache of an entry in pendingSync hash.
> >>>> - Commit-time synchronizing is restored as Michael's patch.
> >>>> - If relfilenode is replaced, pending_sync for the old node is
> >>>>  removed. Anyway this is ignored on abort and meaningless on
> >>>>  commit.
> >>>> - TAP test is renamed to 012 since some new files have been added.
> >>>> 
> >>>> Accessing pending sync hash occurred on every calling of
> >>>> HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> >>>> accessing relations has pending sync.  Almost of them are
> >>>> eliminated as the result.
> >> 
> >> Did you actually test this patch? One of the logs added makes the
> >> tests a long time to run:
> > 
> > Maybe this patch requires make clean since it extends the
> > structure RelationData. (Perhaps I saw the same trouble.)
> > 
> >> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
> >> STATEMENT:  ANALYZE;
> >> 2017-04-13 12:12:25.766 JST [85492] LOG:  BufferNeedsWAL: pendingSyncs
> >> = 0x0, no_pending_sync = 0
> >> 
> >> -       lsn = XLogInsert(RM_SMGR_ID,
> >> -                        XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
> >> +           rel->no_pending_sync= false;
> >> +           rel->pending_sync = pending;
> >> +       }
> >> 
> >> It seems to me that those flags and the pending_sync data should be
> >> kept in the context of backend process and not be part of the Relation
> >> data...
> > 
> > I understand that the context of "backend process" means
> > storage.c local. I don't mind the context on which the data is,
> > but I found only there that can get rid of frequent hash
> > searching. For pending deletions, just appending to a list is
> > enough and costs almost nothing, on the other hand pendig syncs
> > are required to be referenced, sometimes very frequently.
> > 
> >> +void
> >> +RecordPendingSync(Relation rel)
> >> I don't think that I agree that this should be part of relcache.c. The
> >> syncs are tracked should be tracked out of the relation context.
> > 
> > Yeah.. It's in storage.c in the latest patch. (Sorry for the
> > duplicate name). I think it is a kind of bond between smgr and
> > relation.
> > 
> >> Seeing how invasive this change is, I would also advocate for this
> >> patch as only being a HEAD-only change, not many people are
> >> complaining about this optimization of TRUNCATE missing when wal_level
> >> = minimal, and this needs a very careful review.
> > 
> > Agreed.
> > 
> >> Should I code something? Or Horiguchi-san, would you take care of it?
> >> The previous crash I saw has been taken care of, but it's been really
> >> some time since I looked at this patch...
> > 
> > My point is hash-search on every tuple insertion should be evaded
> > even if it happens rearely. Once it was a bit apart from your
> > original patch, but in the latest patch the significant part
> > (pending-sync hash) is revived from the original one.
> 
> This patch has followed along since CF 2016-03, do we think we can reach a
> conclusion in this CF?  It was marked as "Waiting on Author”, based on
> developments since in this thread, I’ve changed it back to “Needs Review”
> again.

I manged to reload its context into my head. It doesn't apply on
the current master and needs some amendment. I'm going to work on
this.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

12 сентября 2017 г., 10:14:41

Hello,

At Fri, 08 Sep 2017 16:30:01 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170908.163001.53230385.horiguchi.kyotaro@lab.ntt.co.jp>
> > >> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
> > >> STATEMENT:  ANALYZE;
> > >> 2017-04-13 12:12:25.766 JST [85492] LOG:  BufferNeedsWAL: pendingSyncs
> > >> = 0x0, no_pending_sync = 0
> > >> 
> > >> -       lsn = XLogInsert(RM_SMGR_ID,
> > >> -                        XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
> > >> +           rel->no_pending_sync= false;
> > >> +           rel->pending_sync = pending;
> > >> +       }
> > >> 
> > >> It seems to me that those flags and the pending_sync data should be
> > >> kept in the context of backend process and not be part of the Relation
> > >> data...
> > > 
> > > I understand that the context of "backend process" means
> > > storage.c local. I don't mind the context on which the data is,
> > > but I found only there that can get rid of frequent hash
> > > searching. For pending deletions, just appending to a list is
> > > enough and costs almost nothing, on the other hand pendig syncs
> > > are required to be referenced, sometimes very frequently.
> > > 
> > >> +void
> > >> +RecordPendingSync(Relation rel)
> > >> I don't think that I agree that this should be part of relcache.c. The
> > >> syncs are tracked should be tracked out of the relation context.
> > > 
> > > Yeah.. It's in storage.c in the latest patch. (Sorry for the
> > > duplicate name). I think it is a kind of bond between smgr and
> > > relation.
> > > 
> > >> Seeing how invasive this change is, I would also advocate for this
> > >> patch as only being a HEAD-only change, not many people are
> > >> complaining about this optimization of TRUNCATE missing when wal_level
> > >> = minimal, and this needs a very careful review.
> > > 
> > > Agreed.
> > > 
> > >> Should I code something? Or Horiguchi-san, would you take care of it?
> > >> The previous crash I saw has been taken care of, but it's been really
> > >> some time since I looked at this patch...
> > > 
> > > My point is hash-search on every tuple insertion should be evaded
> > > even if it happens rearely. Once it was a bit apart from your
> > > original patch, but in the latest patch the significant part
> > > (pending-sync hash) is revived from the original one.
> > 
> > This patch has followed along since CF 2016-03, do we think we can reach a
> > conclusion in this CF?  It was marked as "Waiting on Author”, based on
> > developments since in this thread, I’ve changed it back to “Needs Review”
> > again.
> 
> I manged to reload its context into my head. It doesn't apply on
> the current master and needs some amendment. I'm going to work on
> this.

Rebased and slightly modified.

Michael's latest patch on which this patch is piggybacking seems
works perfectly. The motive of my addition is avoiding frequent
(I think specifically per tuple modification) hash accessing
occurs while pending-syncs exist. The hash contains at least 6 or
more entries.

The attached patch emits more log messages that will be removed
in the final shape to see how much the addition reduces the hash
access.  As a basis of determining the worthiness of the
additional mechanism, I'll show an example of a set of queries
below.

In the log messages, "r" is relation oid, "b" is buffer number,
"hash" is the pointer to the backend-global hash table for
pending syncs. "ent" is the entry in the hash belongs to the
relation, "neg" is a flag indicates that the existing pending
sync hash doesn't have an entry for the relation.

=# set log_min_message to debug2;
=# begin;
=# create table test1(a text primary key);
> DEBUG:  BufferNeedsWAL(r 2608, b 55): hash = (nil), ent=(nil), neg = 0
# relid=2608 buf=55, hash has not been created

=# insert into test1 values ('inserted row');
> DEBUG:  BufferNeedsWAL(r 24807, b 0): hash = (nil), ent=(nil), neg = 0
# relid=24807, fist buffer, hash has not bee created

=# copy test1 from '/<somewhere>/copy_data.txt';
> DEBUG:  BufferNeedsWAL(r 24807, b 0): hash = 0x171de00, ent=0x171f390, neg = 0
# hash created, pending sync entry linked, no longer needs hash acess
# (repeats for the number of buffers)
COPY 200

=# create table test3(a text primary key);
> DEBUG:  BufferNeedsWAL(r 2608, b 55): hash = 0x171de00, ent=(nil), neg = 1
# no pending sync entry for this relation, no longer needs hash access.

=# insert into test3 (select a from generate_series(0, 99) a);
> DEBUG:  BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 0
> DEBUG:  BufferNeedsWAL: accessing hash : not found
> DEBUG:  BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 1
# This table no longer needs hash access, (repeats for the number of tuples)

=#  truncate test3;
=#  insert into test3 (select a from generate_series(0, 99) a);
> DEBUG:  BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 0
> DEBUG:  BufferNeedsWAL: accessing hash : found
> DEBUG:  BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=0x171f340, neg = 0
# This table has pending sync but no longer needs hash access,
#  (repeats for the number of tuples)

The hash is required in the case of relcache invalidation. When
ent=(nil) and neg = 0 but hash != (nil), it tries hash search and
restores the previous state.

This mechanism avoids most of the hash accesses by replacing into
just following a pointer. On the other hand, the hash access
occurs only after relation truncate in the current
transaction. In other words, this won't be in effect unless any
of table truncation, copy, create as, alter table or refresing
matview occurs.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 34,39 ****
--- 34,61 ----  *      the POSTGRES heap access method used for all POSTGRES  *      relations.  *
+  * WAL CONSIDERATIONS
+  *      All heap operations are normally WAL-logged. but there are a few
+  *      exceptions. Temporary and unlogged relations never need to be
+  *      WAL-logged, but we can also skip WAL-logging for a table that was
+  *      created in the same transaction, if we don't need WAL for PITR or
+  *      WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+  *      the file to disk at COMMIT instead.
+  *
+  *      The same-relation optimization is not employed automatically on all
+  *      updates to a table that was created in the same transacton, because
+  *      for a small number of changes, it's cheaper to just create the WAL
+  *      records than fsyncing() the whole relation at COMMIT. It is only
+  *      worthwhile for (presumably) large operations like COPY, CLUSTER,
+  *      or VACUUM FULL. Use heap_register_sync() to initiate such an
+  *      operation; it will cause any subsequent updates to the table to skip
+  *      WAL-logging, if possible, and cause the heap to be synced to disk at
+  *      COMMIT.
+  *
+  *      To make that work, all modifications to heap must use
+  *      HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+  *      for the given block.
+  *  *-------------------------------------------------------------------------  */ #include "postgres.h"
***************
*** 56,61 ****
--- 78,84 ---- #include "access/xlogutils.h" #include "catalog/catalog.h" #include "catalog/namespace.h"
+ #include "catalog/storage.h" #include "miscadmin.h" #include "pgstat.h" #include "port/atomics.h"
***************
*** 2370,2381 **** ReleaseBulkInsertStatePin(BulkInsertState bistate)  * The new tuple is stamped with current
transactionID and the specified  * command ID.  *
 
-  * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
-  * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
-  * requires that we arrange that all new tuples go into new pages not
-  * containing any tuples from other transactions, and that the relation gets
-  * fsync'd before commit.  (See also heap_sync() comments)
-  *  * The HEAP_INSERT_SKIP_FSM option is passed directly to  * RelationGetBufferForTuple, which see for more info.
*
--- 2393,2398 ----
*** a/src/backend/access/heap/pruneheap.c
--- b/src/backend/access/heap/pruneheap.c
***************
*** 20,25 ****
--- 20,26 ---- #include "access/htup_details.h" #include "access/xlog.h" #include "catalog/catalog.h"
+ #include "catalog/storage.h" #include "miscadmin.h" #include "pgstat.h" #include "storage/bufmgr.h"
***************
*** 259,265 **** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,         /*          * Emit
aWAL HEAP_CLEAN record showing what we did          */
 
!         if (RelationNeedsWAL(relation))         {             XLogRecPtr    recptr; 
--- 260,266 ----         /*          * Emit a WAL HEAP_CLEAN record showing what we did          */
!         if (BufferNeedsWAL(relation, buffer))         {             XLogRecPtr    recptr; 
*** a/src/backend/access/heap/rewriteheap.c
--- b/src/backend/access/heap/rewriteheap.c
***************
*** 649,657 **** raw_heap_insert(RewriteState state, HeapTuple tup)     }     else if (HeapTupleHasExternal(tup) ||
tup->t_len> TOAST_TUPLE_THRESHOLD)         heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
 
!                                          HEAP_INSERT_SKIP_FSM |
!                                          (state->rs_use_wal ?
!                                           0 : HEAP_INSERT_SKIP_WAL));     else         heaptup = tup; 
--- 649,655 ----     }     else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)         heaptup =
toast_insert_or_update(state->rs_new_rel,tup, NULL,
 
!                                          HEAP_INSERT_SKIP_FSM);     else         heaptup = tup; 
*** a/src/backend/access/heap/visibilitymap.c
--- b/src/backend/access/heap/visibilitymap.c
***************
*** 88,93 ****
--- 88,94 ---- #include "access/heapam_xlog.h" #include "access/visibilitymap.h" #include "access/xlog.h"
+ #include "catalog/storage.h" #include "miscadmin.h" #include "storage/bufmgr.h" #include "storage/lmgr.h"
***************
*** 307,313 **** visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,         map[mapByte] |= (flags <<
mapOffset);        MarkBufferDirty(vmBuf); 
 
!         if (RelationNeedsWAL(rel))         {             if (XLogRecPtrIsInvalid(recptr))             {
--- 308,314 ----         map[mapByte] |= (flags << mapOffset);         MarkBufferDirty(vmBuf); 
!         if (BufferNeedsWAL(rel, heapBuf))         {             if (XLogRecPtrIsInvalid(recptr))             {
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 2007,2012 **** CommitTransaction(void)
--- 2007,2015 ----     /* close large objects before lower-level cleanup */     AtEOXact_LargeObject(true); 
+     /* Flush updates to relations that we didn't WAL-logged */
+     smgrDoPendingSyncs(true);
+      /*      * Mark serializable transaction as complete for predicate locking      * purposes.  This should be done
aslate as we can put it and still allow
 
***************
*** 2235,2240 **** PrepareTransaction(void)
--- 2238,2246 ----     /* close large objects before lower-level cleanup */     AtEOXact_LargeObject(true); 
+     /* Flush updates to relations that we didn't WAL-logged */
+     smgrDoPendingSyncs(true);
+      /*      * Mark serializable transaction as complete for predicate locking      * purposes.  This should be done
aslate as we can put it and still allow
 
***************
*** 2548,2553 **** AbortTransaction(void)
--- 2554,2560 ----     AtAbort_Notify();     AtEOXact_RelationMap(false);     AtAbort_Twophase();
+     smgrDoPendingSyncs(false);    /* abandone pending syncs */      /*      * Advertise the fact that we aborted in
pg_xact(assuming that we got as
 
*** a/src/backend/catalog/storage.c
--- b/src/backend/catalog/storage.c
***************
*** 29,34 ****
--- 29,35 ---- #include "catalog/storage_xlog.h" #include "storage/freespace.h" #include "storage/smgr.h"
+ #include "utils/hsearch.h" #include "utils/memutils.h" #include "utils/rel.h" 
***************
*** 64,69 **** typedef struct PendingRelDelete
--- 65,113 ---- static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */  /*
+  * We also track relation files (RelFileNode values) that have been created
+  * in the same transaction, and that have been modified without WAL-logging
+  * the action (an optimization possible with wal_level=minimal). When we are
+  * about to skip WAL-logging, a PendingRelSync entry is created, and
+  * 'sync_above' is set to the current size of the relation. Any operations
+  * on blocks < sync_above need to be WAL-logged as usual, but for operations
+  * on higher blocks, WAL-logging is skipped.
+  *
+  * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+  * any subsequent actions on the same block either. Replaying the WAL record
+  * of the subsequent action might fail otherwise, as the "before" state of
+  * the block might not match, as the earlier actions were not WAL-logged.
+  * Likewise, after we have WAL-logged an operation for a block, we must
+  * WAL-log any subsequent operations on the same page as well. Replaying
+  * a possible full-page-image from the earlier WAL record would otherwise
+  * revert the page to the old state, even if we sync the relation at end
+  * of transaction.
+  *
+  * If a relation is truncated (without creating a new relfilenode), and we
+  * emit a WAL record of the truncation, we can't skip WAL-logging for any
+  * of the truncated blocks anymore, as replaying the truncation record will
+  * destroy all the data inserted after that. But if we have already decided
+  * to skip WAL-logging changes to a relation, and the relation is truncated,
+  * we don't need to WAL-log the truncation either.
+  *
+  * This mechanism is currently only used by heaps. Indexes are always
+  * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+  * WAL levels we need the WAL for PITR/replication anyway.
+  */
+ typedef struct PendingRelSync
+ {
+     RelFileNode relnode;        /* relation created in same xact */
+     BlockNumber sync_above;        /* WAL-logging skipped for blocks >=
+                                  * sync_above */
+     BlockNumber truncated_to;    /* truncation WAL record was written */
+ }    PendingRelSync;
+ 
+ /* Relations that need to be fsync'd at commit */
+ static HTAB *pendingSyncs = NULL;
+ 
+ static void createPendingSyncsHash(void);
+ 
+ /*  * RelationCreateStorage  *        Create physical storage for a relation.  *
***************
*** 226,231 **** RelationPreserveStorage(RelFileNode rnode, bool atCommit)
--- 270,277 ---- void RelationTruncate(Relation rel, BlockNumber nblocks) {
+     PendingRelSync *pending = NULL;
+     bool        found;     bool        fsm;     bool        vm; 
***************
*** 260,296 **** RelationTruncate(Relation rel, BlockNumber nblocks)      */     if (RelationNeedsWAL(rel))     {
!         /*
!          * Make an XLOG entry reporting the file truncation.
!          */
!         XLogRecPtr    lsn;
!         xl_smgr_truncate xlrec;
! 
!         xlrec.blkno = nblocks;
!         xlrec.rnode = rel->rd_node;
!         xlrec.flags = SMGR_TRUNCATE_ALL;
! 
!         XLogBeginInsert();
!         XLogRegisterData((char *) &xlrec, sizeof(xlrec));
! 
!         lsn = XLogInsert(RM_SMGR_ID,
!                          XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
! 
!         /*
!          * Flush, because otherwise the truncation of the main relation might
!          * hit the disk before the WAL record, and the truncation of the FSM
!          * or visibility map. If we crashed during that window, we'd be left
!          * with a truncated heap, but the FSM or visibility map would still
!          * contain entries for the non-existent heap pages.
!          */
!         if (fsm || vm)
!             XLogFlush(lsn);     }      /* Do the real work */     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
} /*  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.  *
 
--- 306,386 ----      */     if (RelationNeedsWAL(rel))     {
!         /* no_pending_sync is ignored since new entry is created here */
!         if (!rel->pending_sync)
!         {
!             if (!pendingSyncs)
!                 createPendingSyncsHash();
!             elog(DEBUG2, "RelationTruncate: accessing hash");
!             pending = (PendingRelSync *) hash_search(pendingSyncs,
!                                                  (void *) &rel->rd_node,
!                                                  HASH_ENTER, &found);
!             if (!found)
!             {
!                 pending->sync_above = InvalidBlockNumber;
!                 pending->truncated_to = InvalidBlockNumber;
!             }
! 
!             rel->no_pending_sync= false;
!             rel->pending_sync = pending;
!         }
! 
!         if (rel->pending_sync->sync_above == InvalidBlockNumber ||
!             rel->pending_sync->sync_above < nblocks)
!         {
!             /*
!              * Make an XLOG entry reporting the file truncation.
!              */
!             XLogRecPtr        lsn;
!             xl_smgr_truncate xlrec;
! 
!             xlrec.blkno = nblocks;
!             xlrec.rnode = rel->rd_node;
! 
!             XLogBeginInsert();
!             XLogRegisterData((char *) &xlrec, sizeof(xlrec));
! 
!             lsn = XLogInsert(RM_SMGR_ID,
!                              XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
! 
!             elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
!                  rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
!                  nblocks);
! 
!             /*
!              * Flush, because otherwise the truncation of the main relation
!              * might hit the disk before the WAL record, and the truncation of
!              * the FSM or visibility map. If we crashed during that window,
!              * we'd be left with a truncated heap, but the FSM or visibility
!              * map would still contain entries for the non-existent heap
!              * pages.
!              */
!             if (fsm || vm)
!                 XLogFlush(lsn);
! 
!             rel->pending_sync->truncated_to = nblocks;
!         }     }      /* Do the real work */     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks); } 
+ /* create the hash table to track pending at-commit fsyncs */
+ static void
+ createPendingSyncsHash(void)
+ {
+     /* First time through: initialize the hash table */
+     HASHCTL        ctl;
+ 
+     MemSet(&ctl, 0, sizeof(ctl));
+     ctl.keysize = sizeof(RelFileNode);
+     ctl.entrysize = sizeof(PendingRelSync);
+     ctl.hash = tag_hash;
+     pendingSyncs = hash_create("pending relation sync table", 5,
+                                &ctl, HASH_ELEM | HASH_FUNCTION);
+ }
+  /*  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.  *
***************
*** 369,374 **** smgrDoPendingDeletes(bool isCommit)
--- 459,482 ---- }  /*
+  * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+  */
+ void
+ RelationRemovePendingSync(Relation rel)
+ {
+     bool found;
+ 
+     rel->pending_sync = NULL;
+     rel->no_pending_sync = true;
+     if (pendingSyncs)
+     {
+         elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+         hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+     }
+ }
+ 
+ 
+ /*  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.  *  * The return value is the number
ofrelations scheduled for termination.
 
***************
*** 419,424 **** smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
--- 527,696 ----     return nrels; } 
+ 
+ /*
+  * Remember that the given relation needs to be sync'd at commit, because we
+  * are going to skip WAL-logging subsequent actions to it.
+  */
+ void
+ RecordPendingSync(Relation rel)
+ {
+     bool found = true;
+     BlockNumber nblocks;
+ 
+     Assert(RelationNeedsWAL(rel));
+ 
+     /* ignore no_pending_sync since new entry is created here */
+     if (!rel->pending_sync)
+     {
+         if (!pendingSyncs)
+             createPendingSyncsHash();
+ 
+         /* Look up or create an entry */
+         rel->no_pending_sync = false;
+         elog(DEBUG2, "RecordPendingSync: accessing hash");
+         rel->pending_sync =
+             (PendingRelSync *) hash_search(pendingSyncs,
+                                            (void *) &rel->rd_node,
+                                            HASH_ENTER, &found);
+     }
+ 
+     nblocks = RelationGetNumberOfBlocks(rel);
+     if (!found)
+     {
+         rel->pending_sync->truncated_to = InvalidBlockNumber;
+         rel->pending_sync->sync_above = nblocks;
+ 
+         elog(DEBUG2,
+              "registering new pending sync for rel %u/%u/%u at block %u",
+              rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+              nblocks);
+ 
+     }
+     else if (rel->pending_sync->sync_above == InvalidBlockNumber)
+     {
+         elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+              rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+              nblocks);
+         rel->pending_sync->sync_above = nblocks;
+     }
+     else
+         elog(DEBUG2,
+              "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+              rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+              rel->pending_sync->sync_above, nblocks);
+ }
+ 
+ /*
+  * Do changes to given heap page need to be WAL-logged?
+  *
+  * This takes into account any previous RecordPendingSync() requests.
+  *
+  * Note that it is required to check this before creating any WAL records for
+  * heap pages - it is not merely an optimization! WAL-logging a record, when
+  * we have already skipped a previous WAL record for the same page could lead
+  * to failure at WAL replay, as the "before" state expected by the record
+  * might not match what's on disk. Also, if the heap was truncated earlier, we
+  * must WAL-log any changes to the once-truncated blocks, because replaying
+  * the truncation record will destroy them.
+  */
+ bool
+ BufferNeedsWAL(Relation rel, Buffer buf)
+ {
+     BlockNumber blkno = InvalidBlockNumber;
+ 
+     if (!RelationNeedsWAL(rel))
+         return false;
+ 
+     elog(DEBUG2, "BufferNeedsWAL(r %d, b %d): hash = %p, ent=%p, neg = %d", rel->rd_id, BufferGetBlockNumber(buf),
pendingSyncs,rel->pending_sync, rel->no_pending_sync);
 
+     /* no further work if we know that we don't have pending sync */
+     if (!pendingSyncs || rel->no_pending_sync)
+         return true;
+ 
+     /* do the real work */
+     if (!rel->pending_sync)
+     {
+         bool found = false;
+ 
+         /*
+          * Hold the entry in rel. This relies on the fact that hash entry
+          * never moves.
+          */
+         rel->pending_sync =
+             (PendingRelSync *) hash_search(pendingSyncs,
+                                            (void *) &rel->rd_node,
+                                            HASH_FIND, &found);
+         elog(DEBUG2, "BufferNeedsWAL: accessing hash : %s", found ? "found" : "not found");
+         if (!found)
+         {
+             /* we don't have no one. don't access the hash no longer */
+             rel->no_pending_sync = true;
+             return true;
+         }
+     }
+ 
+     blkno = BufferGetBlockNumber(buf);
+     if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+         rel->pending_sync->sync_above > blkno)
+     {
+         elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+              rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+              blkno, rel->pending_sync->sync_above);
+         return true;
+     }
+ 
+     /*
+      * We have emitted a truncation record for this block.
+      */
+     if (rel->pending_sync->truncated_to != InvalidBlockNumber &&
+         rel->pending_sync->truncated_to <= blkno)
+     {
+         elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the
samexact",
 
+              rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+              blkno);
+         return true;
+     }
+ 
+     elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+          rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+          blkno);
+ 
+     return false;
+ }
+ 
+ /*
+  * Sync to disk any relations that we skipped WAL-logging for earlier.
+  */
+ void
+ smgrDoPendingSyncs(bool isCommit)
+ {
+     if (!pendingSyncs)
+         return;
+ 
+     if (isCommit)
+     {
+         HASH_SEQ_STATUS status;
+         PendingRelSync *pending;
+ 
+         hash_seq_init(&status, pendingSyncs);
+ 
+         while ((pending = hash_seq_search(&status)) != NULL)
+         {
+             if (pending->sync_above != InvalidBlockNumber)
+             {
+                 FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+                 smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+ 
+                 elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+                      pending->relnode.dbNode, pending->relnode.relNode);
+             }
+         }
+     }
+ 
+     hash_destroy(pendingSyncs);
+     pendingSyncs = NULL;
+ }
+  /*  *    PostPrepare_smgr -- Clean up after a successful PREPARE  *
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
***************
*** 2347,2354 **** CopyFrom(CopyState cstate)      *    - data is being written to relfilenode created in this
transaction     * then we can skip writing WAL.  It's safe because if the transaction      * doesn't commit, we'll
discardthe table (or the new relfilenode file).
 
!      * If it does commit, we'll have done the heap_sync at the bottom of this
!      * routine first.      *      * As mentioned in comments in utils/rel.h, the in-same-transaction test      * is
notalways set correctly, since in rare cases rd_newRelfilenodeSubid
 
--- 2347,2353 ----      *    - data is being written to relfilenode created in this transaction      * then we can skip
writingWAL.  It's safe because if the transaction      * doesn't commit, we'll discard the table (or the new
relfilenodefile).
 
!      * If it does commit, commit will do heap_sync().      *      * As mentioned in comments in utils/rel.h, the
in-same-transactiontest      * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
 
***************
*** 2380,2386 **** CopyFrom(CopyState cstate)     {         hi_options |= HEAP_INSERT_SKIP_FSM;         if
(!XLogIsNeeded())
!             hi_options |= HEAP_INSERT_SKIP_WAL;     }      /*
--- 2379,2385 ----     {         hi_options |= HEAP_INSERT_SKIP_FSM;         if (!XLogIsNeeded())
!             heap_register_sync(cstate->rel);     }      /*
***************
*** 2862,2872 **** CopyFrom(CopyState cstate)     FreeExecutorState(estate);      /*
!      * If we skipped writing WAL, then we need to sync the heap (but not
!      * indexes since those use WAL anyway)      */
-     if (hi_options & HEAP_INSERT_SKIP_WAL)
-         heap_sync(cstate->rel);      return processed; }
--- 2861,2871 ----     FreeExecutorState(estate);      /*
!      * If we skipped writing WAL, then we will sync the heap at the end of
!      * the transaction. (We used to do it here, but it was later found out
!      * that to be safe, we must also avoid WAL-logging any subsequent
!      * actions on the pages we skipped WAL for). Indexes always use WAL.      */      return processed; }
*** a/src/backend/commands/createas.c
--- b/src/backend/commands/createas.c
***************
*** 567,574 **** intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)      * We can skip WAL-logging
theinsertions, unless PITR or streaming      * replication is in use. We can skip the FSM in any case.      */
 
!     myState->hi_options = HEAP_INSERT_SKIP_FSM |
!         (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);     myState->bistate = GetBulkInsertState();      /* Not using
WALrequires smgr_targblock be initially invalid */
 
--- 567,575 ----      * We can skip WAL-logging the insertions, unless PITR or streaming      * replication is in use.
Wecan skip the FSM in any case.      */
 
!     if (!XLogIsNeeded())
!         heap_register_sync(intoRelationDesc);
!     myState->hi_options = HEAP_INSERT_SKIP_FSM;     myState->bistate = GetBulkInsertState();      /* Not using WAL
requiressmgr_targblock be initially invalid */
 
***************
*** 617,625 **** intorel_shutdown(DestReceiver *self)      FreeBulkInsertState(myState->bistate); 
!     /* If we skipped using WAL, must heap_sync before commit */
!     if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
!         heap_sync(myState->rel);      /* close rel, but keep lock until commit */     heap_close(myState->rel,
NoLock);
--- 618,624 ----      FreeBulkInsertState(myState->bistate); 
!     /* If we skipped using WAL, we will sync the relation at commit */      /* close rel, but keep lock until commit
*/    heap_close(myState->rel, NoLock);
 
*** a/src/backend/commands/matview.c
--- b/src/backend/commands/matview.c
***************
*** 477,483 **** transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)      */
myState->hi_options= HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;     if (!XLogIsNeeded())
 
!         myState->hi_options |= HEAP_INSERT_SKIP_WAL;     myState->bistate = GetBulkInsertState();      /* Not using
WALrequires smgr_targblock be initially invalid */
 
--- 477,483 ----      */     myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;     if (!XLogIsNeeded())
!         heap_register_sync(transientrel);     myState->bistate = GetBulkInsertState();      /* Not using WAL requires
smgr_targblockbe initially invalid */
 
***************
*** 520,528 **** transientrel_shutdown(DestReceiver *self)      FreeBulkInsertState(myState->bistate); 
!     /* If we skipped using WAL, must heap_sync before commit */
!     if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
!         heap_sync(myState->transientrel);      /* close transientrel, but keep lock until commit */
heap_close(myState->transientrel,NoLock);
 
--- 520,526 ----      FreeBulkInsertState(myState->bistate); 
!     /* If we skipped using WAL, we will sync the relation at commit */      /* close transientrel, but keep lock
untilcommit */     heap_close(myState->transientrel, NoLock);
 
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***************
*** 4357,4364 **** ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)         bistate =
GetBulkInsertState();         hi_options = HEAP_INSERT_SKIP_FSM;         if (!XLogIsNeeded())
 
!             hi_options |= HEAP_INSERT_SKIP_WAL;     }     else     {
--- 4357,4365 ----         bistate = GetBulkInsertState();          hi_options = HEAP_INSERT_SKIP_FSM;
+          if (!XLogIsNeeded())
!             heap_register_sync(newrel);     }     else     {
***************
*** 4624,4631 **** ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);         /* If we skipped writing WAL, then we need to sync the heap. */
 
-         if (hi_options & HEAP_INSERT_SKIP_WAL)
-             heap_sync(newrel);          heap_close(newrel, NoLock);     }
--- 4625,4630 ----
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 891,897 **** lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,                  * page has been
previouslyWAL-logged, and if not, do that                  * now.                  */
 
!                 if (RelationNeedsWAL(onerel) &&                     PageGetLSN(page) == InvalidXLogRecPtr)
        log_newpage_buffer(buf, true); 
 
--- 891,897 ----                  * page has been previously WAL-logged, and if not, do that                  * now.
             */
 
!                 if (BufferNeedsWAL(onerel, buf) &&                     PageGetLSN(page) == InvalidXLogRecPtr)
           log_newpage_buffer(buf, true); 
 
***************
*** 1118,1124 **** lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,             }              /*
NowWAL-log freezing if necessary */
 
!             if (RelationNeedsWAL(onerel))             {                 XLogRecPtr    recptr; 
--- 1118,1124 ----             }              /* Now WAL-log freezing if necessary */
!             if (BufferNeedsWAL(onerel, buf))             {                 XLogRecPtr    recptr; 
***************
*** 1476,1482 **** lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,     MarkBufferDirty(buffer);
/* XLOG stuff */
 
!     if (RelationNeedsWAL(onerel))     {         XLogRecPtr    recptr; 
--- 1476,1482 ----     MarkBufferDirty(buffer);      /* XLOG stuff */
!     if (BufferNeedsWAL(onerel, buffer))     {         XLogRecPtr    recptr; 
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 451,456 **** static BufferDesc *BufferAlloc(SMgrRelation smgr,
--- 451,457 ----             BufferAccessStrategy strategy,             bool *foundPtr); static void
FlushBuffer(BufferDesc*buf, SMgrRelation reln);
 
+ static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal); static void AtProcExit_Buffers(int code,
Datumarg); static void CheckForBufferLeaks(void); static int    rnode_comparator(const void *p1, const void *p2);
 
***************
*** 3147,3166 **** PrintPinnedBufs(void) void FlushRelationBuffers(Relation rel) {
-     int            i;
-     BufferDesc *bufHdr;
-      /* Open rel at the smgr level if not already done */     RelationOpenSmgr(rel); 
!     if (RelationUsesLocalBuffers(rel))     {         for (i = 0; i < NLocBuffer; i++)         {             uint32
   buf_state;              bufHdr = GetLocalBufferDescriptor(i);
 
!             if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&                 ((buf_state =
pg_atomic_read_u32(&bufHdr->state))&                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))             {
 
--- 3148,3188 ---- void FlushRelationBuffers(Relation rel) {     /* Open rel at the smgr level if not already done */
 RelationOpenSmgr(rel); 
 
!     FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
! }
! 
! /*
!  * Like FlushRelationBuffers(), but the relation is specified by a
!  * RelFileNode
!  */
! void
! FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
! {
!     FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
! }
! 
! /*
!  * Code shared between functions FlushRelationBuffers() and
!  * FlushRelationBuffersWithoutRelCache().
!  */
! static void
! FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
! {
!     RelFileNode rnode = smgr->smgr_rnode.node;
!     int            i;
!     BufferDesc *bufHdr;
! 
!     if (islocal)     {         for (i = 0; i < NLocBuffer; i++)         {             uint32        buf_state;
     bufHdr = GetLocalBufferDescriptor(i);
 
!             if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&                 ((buf_state =
pg_atomic_read_u32(&bufHdr->state))&                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))             {
 
***************
*** 3177,3183 **** FlushRelationBuffers(Relation rel)                  PageSetChecksumInplace(localpage,
bufHdr->tag.blockNum);
 
!                 smgrwrite(rel->rd_smgr,                           bufHdr->tag.forkNum,
bufHdr->tag.blockNum,                          localpage,
 
--- 3199,3205 ----                  PageSetChecksumInplace(localpage, bufHdr->tag.blockNum); 
!                 smgrwrite(smgr,                           bufHdr->tag.forkNum,
bufHdr->tag.blockNum,                          localpage,
 
***************
*** 3207,3224 **** FlushRelationBuffers(Relation rel)          * As in DropRelFileNodeBuffers, an unlocked precheck
shouldbe safe          * and saves some cycles.          */
 
!         if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))             continue;
ReservePrivateRefCountEntry();         buf_state = LockBufHdr(bufHdr);
 
!         if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&             (buf_state & (BM_VALID | BM_DIRTY)) ==
(BM_VALID| BM_DIRTY))         {             PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),LW_SHARED);
 
!             FlushBuffer(bufHdr, rel->rd_smgr);             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
       UnpinBuffer(bufHdr, true);         }
 
--- 3229,3246 ----          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe          * and saves
somecycles.          */
 
!         if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))             continue;
ReservePrivateRefCountEntry();         buf_state = LockBufHdr(bufHdr);
 
!         if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&             (buf_state & (BM_VALID | BM_DIRTY)) ==
(BM_VALID| BM_DIRTY))         {             PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),LW_SHARED);
 
!             FlushBuffer(bufHdr, smgr);             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr,true);         }
 
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
***************
*** 72,77 ****
--- 72,78 ---- #include "optimizer/var.h" #include "rewrite/rewriteDefine.h" #include "rewrite/rowsecurity.h"
+ #include "storage/bufmgr.h" #include "storage/lmgr.h" #include "storage/smgr.h" #include "utils/array.h"
***************
*** 418,423 **** AllocateRelationDesc(Form_pg_class relp)
--- 419,428 ----     /* which we mark as a reference-counted tupdesc */     relation->rd_att->tdrefcount = 1; 
+     /* We don't know if pending sync for this relation exists so far */
+     relation->pending_sync = NULL;
+     relation->no_pending_sync = false;
+      MemoryContextSwitchTo(oldcxt);      return relation;
***************
*** 2040,2045 **** formrdesc(const char *relationName, Oid relationReltype,
--- 2045,2054 ----         relation->rd_rel->relhasindex = true;     } 
+     /* We don't know if pending sync for this relation exists so far */
+     relation->pending_sync = NULL;
+     relation->no_pending_sync = false;
+      /*      * add new reldesc to relcache      */
***************
*** 3364,3369 **** RelationBuildLocalRelation(const char *relname,
--- 3373,3382 ----     else         rel->rd_rel->relfilenode = relfilenode; 
+     /* newly built relation has no pending sync */
+     rel->no_pending_sync = true;
+     rel->pending_sync = NULL;
+      RelationInitLockInfo(rel);    /* see lmgr.c */      RelationInitPhysicalAddr(rel);
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 25,34 ****   /* "options" flag bits for heap_insert */
! #define HEAP_INSERT_SKIP_WAL    0x0001
! #define HEAP_INSERT_SKIP_FSM    0x0002
! #define HEAP_INSERT_FROZEN        0x0004
! #define HEAP_INSERT_SPECULATIVE 0x0008  typedef struct BulkInsertStateData *BulkInsertState; 
--- 25,33 ----   /* "options" flag bits for heap_insert */
! #define HEAP_INSERT_SKIP_FSM    0x0001
! #define HEAP_INSERT_FROZEN        0x0002
! #define HEAP_INSERT_SPECULATIVE 0x0004  typedef struct BulkInsertStateData *BulkInsertState; 
***************
*** 179,184 **** extern void simple_heap_delete(Relation relation, ItemPointer tid);
--- 178,184 ---- extern void simple_heap_update(Relation relation, ItemPointer otid,                    HeapTuple tup);

+ extern void heap_register_sync(Relation relation); extern void heap_sync(Relation relation); extern void
heap_update_snapshot(HeapScanDescscan, Snapshot snapshot); 
 
*** a/src/include/catalog/storage.h
--- b/src/include/catalog/storage.h
***************
*** 22,34 **** extern void RelationCreateStorage(RelFileNode rnode, char relpersistence); extern void
RelationDropStorage(Relationrel); extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit); extern void
RelationTruncate(Relationrel, BlockNumber nblocks);
 
!  /*  * These functions used to be in storage/smgr/smgr.c, which explains the  * naming  */ extern void
smgrDoPendingDeletes(boolisCommit); extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr); extern void
AtSubCommit_smgr(void);extern void AtSubAbort_smgr(void); extern void PostPrepare_smgr(void);
 
--- 22,37 ---- extern void RelationDropStorage(Relation rel); extern void RelationPreserveStorage(RelFileNode rnode,
boolatCommit); extern void RelationTruncate(Relation rel, BlockNumber nblocks);
 
! extern void RelationRemovePendingSync(Relation rel); /*  * These functions used to be in storage/smgr/smgr.c, which
explainsthe  * naming  */ extern void smgrDoPendingDeletes(bool isCommit); extern int    smgrGetPendingDeletes(bool
forCommit,RelFileNode **ptr);
 
+ extern void smgrDoPendingSyncs(bool isCommit);
+ extern void RecordPendingSync(Relation rel);
+ bool BufferNeedsWAL(Relation rel, Buffer buf); extern void AtSubCommit_smgr(void); extern void AtSubAbort_smgr(void);
externvoid PostPrepare_smgr(void);
 
*** a/src/include/storage/bufmgr.h
--- b/src/include/storage/bufmgr.h
***************
*** 190,195 **** extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
--- 190,197 ----                                 ForkNumber forkNum); extern void FlushOneBuffer(Buffer buffer); extern
voidFlushRelationBuffers(Relation rel);
 
+ extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                     bool islocal); extern void FlushDatabaseBuffers(Oid dbid); extern void
DropRelFileNodeBuffers(RelFileNodeBackendrnode,                        ForkNumber forkNum, BlockNumber firstDelBlock);
 
*** a/src/include/utils/rel.h
--- b/src/include/utils/rel.h
***************
*** 216,221 **** typedef struct RelationData
--- 216,229 ----      /* use "struct" here to avoid needing to include pgstat.h: */     struct PgStat_TableStatus
*pgstat_info;/* statistics collection area */
 
+ 
+     /*
+      * no_pending_sync is true if this relation is known not to have pending
+      * syncs.  Elsewise searching for registered sync is required if
+      * pending_sync is NULL.
+      */
+     bool                   no_pending_sync;
+     struct PendingRelSync *pending_sync; } RelationData;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

13 сентября 2017 г., 07:04:21

Hello, (does this seem to be a top post?)

The CF status of this patch turned into "Waiting on Author" by
automated CI checking. However, I still don't get any error even
on the current master (69835bc) after make distclean. Also I
don't see any difference between the "problematic" patch and my
working branch has nothing different other than patching line
shifts. (So I haven't post a new one.)

I looked on the location heapam.c:2502 where the CI complains at
in my working branch and I found a different code with the
complaint.

https://travis-ci.org/postgresql-cfbot/postgresql/builds/274777750

1363 heapam.c:2502:18: error: ‘HEAP_INSERT_SKIP_WAL’ undeclared (first use in this function)
1364   if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))

heapam.c:2502@work branch
2502:   /* XLOG stuff */
2503:   if (BufferNeedsWAL(relation, buffer))

So I conclude that the CI mechinery failed to applly the patch
correctly.


At Thu, 13 Apr 2017 15:29:35 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170413.152935.100104316.horiguchi.kyotaro@lab.ntt.co.jp>
> > > > I'll post new patch in this way soon.
> > > 
> > > Here it is.
> > 
> > It contained tariling space and missing test script.  This is the
> > correct patch.
> > 
> > > - Relation has new members no_pending_sync and pending_sync that
> > >   works as instant cache of an entry in pendingSync hash.
> > > 
> > > - Commit-time synchronizing is restored as Michael's patch.
> > > 
> > > - If relfilenode is replaced, pending_sync for the old node is
> > >   removed. Anyway this is ignored on abort and meaningless on
> > >   commit.
> > > 
> > > - TAP test is renamed to 012 since some new files have been added.
> > > 
> > > Accessing pending sync hash occured on every calling of
> > > HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> > > accessing relations has pending sync.  Almost of them are
> > > eliminated as the result.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Thomas Munro

Дата:

13 сентября 2017 г., 09:05:31

On Wed, Sep 13, 2017 at 1:04 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> The CF status of this patch turned into "Waiting on Author" by
> automated CI checking. However, I still don't get any error even
> on the current master (69835bc) after make distclean. Also I
> don't see any difference between the "problematic" patch and my
> working branch has nothing different other than patching line
> shifts. (So I haven't post a new one.)
>
> I looked on the location heapam.c:2502 where the CI complains at
> in my working branch and I found a different code with the
> complaint.
>
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/274777750
>
> 1363 heapam.c:2502:18: error: ‘HEAP_INSERT_SKIP_WAL’ undeclared (first use in this function)
> 1364   if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
>
> heapam.c:2502@work branch
> 2502:   /* XLOG stuff */
> 2503:   if (BufferNeedsWAL(relation, buffer))
>
> So I conclude that the CI mechinery failed to applly the patch
> correctly.

Hi Horiguchi-san,

Hmm.  Here is that line in heamap.c in unpatched master:


https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/heapam.c;h=d20f0381f3bc23f99c505ef8609d63240ac5d44b;hb=HEAD#l2485

It says:

2485     if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))

After applying fix-wal-level-minimal-michael-horiguchi-3.patch from
this message:

https://www.postgresql.org/message-id/20170912.131441.20602611.horiguchi.kyotaro%40lab.ntt.co.jp

... that line is unchanged, although it has moved to line number 2502.
It doesn't compile for me, because your patch removed the definition
of HEAP_INSERT_SKIP_WAL but hasn't removed that reference to it.

I'm not sure what happened.  Is it possible that your patch was not
created by diffing against master?

--
Thomas Munro
http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Alvaro Herrera

Дата:

13 сентября 2017 г., 14:03:48

Kyotaro HORIGUCHI wrote:

> The CF status of this patch turned into "Waiting on Author" by
> automated CI checking.

I object to automated turning of patches to waiting on author by
machinery.  Sending occasional reminder messages to authors making them
know about outdated patches seems acceptable to me at this stage.

It'll take some time for this machinery to get perfected; only when it
is beyond experimental mode it'll be acceptable to change patches'
status in an automated fashion.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

13 сентября 2017 г., 14:42:39

At Wed, 13 Sep 2017 15:05:31 +1200, Thomas Munro <thomas.munro@enterprisedb.com> wrote in
<CAEepm=0x7CGYmNM5q7TKzz_KrD+Pr7jbFzD8UZad_+=4PG1PyA@mail.gmail.com>
> It doesn't compile for me, because your patch removed the definition
> of HEAP_INSERT_SKIP_WAL but hasn't removed that reference to it.
> 
> I'm not sure what happened.  Is it possible that your patch was not
> created by diffing against master?

It created using filterdiff.

> git diff master --patience | grep options
...
> -       if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))

but the line dissapears from the output of the following command

> git diff master --patience | filterdiff --format=context | grep options

filterdiff seems to did something wrong..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

14 сентября 2017 г., 12:34:59

At Wed, 13 Sep 2017 17:42:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170913.174239.25978735.horiguchi.kyotaro@lab.ntt.co.jp>
> filterdiff seems to did something wrong..

# to did...

The patch is broken by filterdiff so I send a new patch made
directly by git format-patch. I confirmed that a build completes
with applying this.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 7086b5855080065f73de4d099cbaab09511f01fc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 12 Sep 2017 13:01:33 +0900
Subject: [PATCH] Fix WAL logging problem

---src/backend/access/heap/heapam.c        | 113 +++++++++---src/backend/access/heap/pruneheap.c     |   3
+-src/backend/access/heap/rewriteheap.c  |   4 +-src/backend/access/heap/visibilitymap.c |   3
+-src/backend/access/transam/xact.c      |   7 +src/backend/catalog/storage.c           | 318
+++++++++++++++++++++++++++++---src/backend/commands/copy.c            |  13 +-src/backend/commands/createas.c
|  9 +-src/backend/commands/matview.c          |   6 +-src/backend/commands/tablecmds.c        |   8
+-src/backend/commands/vacuumlazy.c      |   6 +-src/backend/storage/buffer/bufmgr.c     |  40
+++-src/backend/utils/cache/relcache.c     |  13 ++src/include/access/heapam.h             |   8
+-src/include/catalog/storage.h          |   5 +-src/include/storage/bufmgr.h            |   2 +src/include/utils/rel.h
               |   8 +17 files changed, 476 insertions(+), 90 deletions(-)
 

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d20f038..e40254d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@ *      the POSTGRES heap access method used for all POSTGRES *      relations. *
+ * WAL CONSIDERATIONS
+ *      All heap operations are normally WAL-logged. but there are a few
+ *      exceptions. Temporary and unlogged relations never need to be
+ *      WAL-logged, but we can also skip WAL-logging for a table that was
+ *      created in the same transaction, if we don't need WAL for PITR or
+ *      WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *      the file to disk at COMMIT instead.
+ *
+ *      The same-relation optimization is not employed automatically on all
+ *      updates to a table that was created in the same transacton, because
+ *      for a small number of changes, it's cheaper to just create the WAL
+ *      records than fsyncing() the whole relation at COMMIT. It is only
+ *      worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *      or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *      operation; it will cause any subsequent updates to the table to skip
+ *      WAL-logging, if possible, and cause the heap to be synced to disk at
+ *      COMMIT.
+ *
+ *      To make that work, all modifications to heap must use
+ *      HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ *      for the given block.
+ * *------------------------------------------------------------------------- */#include "postgres.h"
@@ -56,6 +78,7 @@#include "access/xlogutils.h"#include "catalog/catalog.h"#include "catalog/namespace.h"
+#include "catalog/storage.h"#include "miscadmin.h"#include "pgstat.h"#include "port/atomics.h"
@@ -2373,12 +2396,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate) * The new tuple is stamped with current
transactionID and the specified * command ID. *
 
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- * * The HEAP_INSERT_SKIP_FSM option is passed directly to * RelationGetBufferForTuple, which see for more info. *
@@ -2409,6 +2426,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate) * TID where the tuple was stored.  But note
thatany toasting of fields * within the tuple data is NOT reflected into *tup. */
 
+extern HTAB *pendingSyncs;Oidheap_insert(Relation relation, HeapTuple tup, CommandId cid,            int options,
BulkInsertStatebistate)
 
@@ -2482,7 +2500,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,    MarkBufferDirty(buffer);    /*
XLOGstuff */
 
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_insert xlrec;        xl_heap_header xlhdr;
@@ -2681,12 +2699,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,    int            ndone;
char       *scratch = NULL;    Page        page;
 
-    bool        needwal;    Size        saveFreeSpace;    bool        need_tuple_data =
RelationIsLogicallyLogged(relation);   bool        need_cids = RelationIsAccessibleInLogicalDecoding(relation);
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);    saveFreeSpace =
RelationGetTargetPageFreeSpace(relation,                                                  HEAP_DEFAULT_FILLFACTOR);
 
@@ -2701,7 +2717,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,     * palloc() within a
criticalsection is not safe, so we allocate this     * beforehand.     */
 
-    if (needwal)
+    if (RelationNeedsWAL(relation))        scratch = palloc(BLCKSZ);    /*
@@ -2736,6 +2752,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,        Buffer
vmbuffer= InvalidBuffer;        bool        all_visible_cleared = false;        int            nthispage;
 
+        bool        needwal;        CHECK_FOR_INTERRUPTS();
@@ -2747,6 +2764,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
            InvalidBuffer, options, bistate,                                           &vmbuffer, NULL);        page =
BufferGetPage(buffer);
+        needwal = BufferNeedsWAL(relation, buffer);        /* NO EREPORT(ERROR) from here till changes are logged */
    START_CRIT_SECTION();
 
@@ -3303,7 +3321,7 @@ l1:     * NB: heap_abort_speculative() uses the same xlog record and replay     * routines.
*/
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_delete xlrec;        XLogRecPtr    recptr;
@@ -4269,7 +4287,8 @@ l2:    MarkBufferDirty(buffer);    /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer) ||
+        BufferNeedsWAL(relation, newbuf))    {        XLogRecPtr    recptr;
@@ -5160,7 +5179,7 @@ failed:     * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG     *
entriesfor everything anyway.)     */
 
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, *buffer))    {        xl_heap_lock xlrec;        XLogRecPtr    recptr;
@@ -5894,7 +5913,7 @@ l4:        MarkBufferDirty(buf);        /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf))        {            xl_heap_lock_updated xlrec;            XLogRecPtr
recptr;
@@ -6050,7 +6069,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)    htup->t_ctid = tuple->t_self;
/*XLOG stuff */
 
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_confirm xlrec;        XLogRecPtr    recptr;
@@ -6183,7 +6202,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)     * The WAL records generated here
matchheap_delete().  The same recovery     * routines are used.     */
 
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_delete xlrec;        XLogRecPtr    recptr;
@@ -6292,7 +6311,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)    MarkBufferDirty(buffer);    /* XLOG
stuff*/
 
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))    {        xl_heap_inplace xlrec;        XLogRecPtr    recptr;
@@ -7406,7 +7425,7 @@ log_heap_clean(Relation reln, Buffer buffer,    XLogRecPtr    recptr;    /* Caller should not
callme on a non-WAL-logged relation */
 
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));    xlrec.latestRemovedXid = latestRemovedXid;    xlrec.nredirected =
nredirected;
@@ -7454,7 +7473,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,    XLogRecPtr    recptr;
 /* Caller should not call me on a non-WAL-logged relation */
 
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));    /* nor when there are no tuples to freeze */    Assert(ntuples > 0);
@@ -7539,7 +7558,7 @@ log_heap_update(Relation reln, Buffer oldbuf,    int            bufflags;    /* Caller should not
callme on a non-WAL-logged relation */
 
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));    XLogBeginInsert();
@@ -8630,8 +8649,13 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)     */    /* Deal with old tuple
version*/
 
-    oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
-                                      &obuffer);
+    if (oldblk == newblk)
+        oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+    else if (XLogRecHasBlockRef(record, 1))
+        oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+    else
+        oldaction = BLK_DONE;
+    if (oldaction == BLK_NEEDS_REDO)    {        page = BufferGetPage(obuffer);
@@ -8685,6 +8709,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)        PageInit(page,
BufferGetPageSize(nbuffer),0);        newaction = BLK_NEEDS_REDO;    }
 
+    else if (!XLogRecHasBlockRef(record, 0))
+        newaction = BLK_DONE;    else        newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
@@ -9121,9 +9147,16 @@ heap2_redo(XLogReaderState *record) *    heap_sync        - sync a heap, for use when no WAL has
beenwritten * * This forces the heap contents (including TOAST heap if any) down to disk.
 
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead. * * Indexes are not touched.  (Currently, index operations associated with * the
commandsthat use this are WAL-logged and so do not need fsync.
 
@@ -9233,3 +9266,33 @@ heap_mask(char *pagedata, BlockNumber blkno)        }    }}
+
+/*
+ *    heap_register_sync    - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RecordPendingSync(rel);
+    if (OidIsValid(rel->rd_rel->reltoastrelid))
+    {
+        Relation    toastrel;
+
+        toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+        RecordPendingSync(toastrel);
+        heap_close(toastrel, AccessShareLock);
+    }
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 52231ac..97edb99 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@#include "access/htup_details.h"#include "access/xlog.h"#include "catalog/catalog.h"
+#include "catalog/storage.h"#include "miscadmin.h"#include "pgstat.h"#include "storage/bufmgr.h"
@@ -259,7 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,        /*         *
Emita WAL HEAP_CLEAN record showing what we did         */
 
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))        {            XLogRecPtr    recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bd560e4..3c457db 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -649,9 +649,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)    }    else if (HeapTupleHasExternal(tup) ||
tup->t_len> TOAST_TUPLE_THRESHOLD)        heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
 
-                                         HEAP_INSERT_SKIP_FSM |
-                                         (state->rs_use_wal ?
-                                          0 : HEAP_INSERT_SKIP_WAL));
+                                         HEAP_INSERT_SKIP_FSM);    else        heaptup = tup;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4c2a13a..971d469 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@#include "access/heapam_xlog.h"#include "access/visibilitymap.h"#include "access/xlog.h"
+#include "catalog/storage.h"#include "miscadmin.h"#include "storage/bufmgr.h"#include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,        map[mapByte] |= (flags
<<mapOffset);        MarkBufferDirty(vmBuf);
 
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, heapBuf))        {            if (XLogRecPtrIsInvalid(recptr))            {
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 93dca7a..7fba3df 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2008,6 +2008,9 @@ CommitTransaction(void)    /* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+    /*     * Mark serializable transaction as complete for predicate locking     * purposes.  This should be done as
lateas we can put it and still allow
 
@@ -2236,6 +2239,9 @@ PrepareTransaction(void)    /* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+    /*     * Mark serializable transaction as complete for predicate locking     * purposes.  This should be done as
lateas we can put it and still allow
 
@@ -2549,6 +2555,7 @@ AbortTransaction(void)    AtAbort_Notify();    AtEOXact_RelationMap(false);
AtAbort_Twophase();
+    smgrDoPendingSyncs(false);    /* abandone pending syncs */    /*     * Advertise the fact that we aborted in
pg_xact(assuming that we got as
 
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9a5fde0..6bc1088 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,6 +29,7 @@#include "catalog/storage_xlog.h"#include "storage/freespace.h"#include "storage/smgr.h"
+#include "utils/hsearch.h"#include "utils/memutils.h"#include "utils/rel.h"
@@ -64,6 +65,49 @@ typedef struct PendingRelDeletestatic PendingRelDelete *pendingDeletes = NULL; /* head of linked
list*//*
 
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+    RelFileNode relnode;        /* relation created in same xact */
+    BlockNumber sync_above;        /* WAL-logging skipped for blocks >=
+                                 * sync_above */
+    BlockNumber truncated_to;    /* truncation WAL record was written */
+}    PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
+/* * RelationCreateStorage *        Create physical storage for a relation. *
@@ -226,6 +270,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)voidRelationTruncate(Relation rel,
BlockNumbernblocks){
 
+    PendingRelSync *pending = NULL;
+    bool        found;    bool        fsm;    bool        vm;
@@ -260,37 +306,81 @@ RelationTruncate(Relation rel, BlockNumber nblocks)     */    if (RelationNeedsWAL(rel))    {
-        /*
-         * Make an XLOG entry reporting the file truncation.
-         */
-        XLogRecPtr    lsn;
-        xl_smgr_truncate xlrec;
-
-        xlrec.blkno = nblocks;
-        xlrec.rnode = rel->rd_node;
-        xlrec.flags = SMGR_TRUNCATE_ALL;
-
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+        /* no_pending_sync is ignored since new entry is created here */
+        if (!rel->pending_sync)
+        {
+            if (!pendingSyncs)
+                createPendingSyncsHash();
+            elog(DEBUG2, "RelationTruncate: accessing hash");
+            pending = (PendingRelSync *) hash_search(pendingSyncs,
+                                                 (void *) &rel->rd_node,
+                                                 HASH_ENTER, &found);
+            if (!found)
+            {
+                pending->sync_above = InvalidBlockNumber;
+                pending->truncated_to = InvalidBlockNumber;
+            }
-        lsn = XLogInsert(RM_SMGR_ID,
-                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+            rel->no_pending_sync= false;
+            rel->pending_sync = pending;
+        }
-        /*
-         * Flush, because otherwise the truncation of the main relation might
-         * hit the disk before the WAL record, and the truncation of the FSM
-         * or visibility map. If we crashed during that window, we'd be left
-         * with a truncated heap, but the FSM or visibility map would still
-         * contain entries for the non-existent heap pages.
-         */
-        if (fsm || vm)
-            XLogFlush(lsn);
+        if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+            rel->pending_sync->sync_above < nblocks)
+        {
+            /*
+             * Make an XLOG entry reporting the file truncation.
+             */
+            XLogRecPtr        lsn;
+            xl_smgr_truncate xlrec;
+
+            xlrec.blkno = nblocks;
+            xlrec.rnode = rel->rd_node;
+
+            XLogBeginInsert();
+            XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+            lsn = XLogInsert(RM_SMGR_ID,
+                             XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+            elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+                 nblocks);
+
+            /*
+             * Flush, because otherwise the truncation of the main relation
+             * might hit the disk before the WAL record, and the truncation of
+             * the FSM or visibility map. If we crashed during that window,
+             * we'd be left with a truncated heap, but the FSM or visibility
+             * map would still contain entries for the non-existent heap
+             * pages.
+             */
+            if (fsm || vm)
+                XLogFlush(lsn);
+
+            rel->pending_sync->truncated_to = nblocks;
+        }    }    /* Do the real work */    smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);}
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+    /* First time through: initialize the hash table */
+    HASHCTL        ctl;
+
+    MemSet(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(RelFileNode);
+    ctl.entrysize = sizeof(PendingRelSync);
+    ctl.hash = tag_hash;
+    pendingSyncs = hash_create("pending relation sync table", 5,
+                               &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+/* *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact. *
@@ -369,6 +459,24 @@ smgrDoPendingDeletes(bool isCommit)}/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+    bool found;
+
+    rel->pending_sync = NULL;
+    rel->no_pending_sync = true;
+    if (pendingSyncs)
+    {
+        elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+        hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+    }
+}
+
+
+/* * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted. * * The return value is the number of
relationsscheduled for termination.
 
@@ -419,6 +527,170 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)    return nrels;}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+    bool found = true;
+    BlockNumber nblocks;
+
+    Assert(RelationNeedsWAL(rel));
+
+    /* ignore no_pending_sync since new entry is created here */
+    if (!rel->pending_sync)
+    {
+        if (!pendingSyncs)
+            createPendingSyncsHash();
+
+        /* Look up or create an entry */
+        rel->no_pending_sync = false;
+        elog(DEBUG2, "RecordPendingSync: accessing hash");
+        rel->pending_sync =
+            (PendingRelSync *) hash_search(pendingSyncs,
+                                           (void *) &rel->rd_node,
+                                           HASH_ENTER, &found);
+    }
+
+    nblocks = RelationGetNumberOfBlocks(rel);
+    if (!found)
+    {
+        rel->pending_sync->truncated_to = InvalidBlockNumber;
+        rel->pending_sync->sync_above = nblocks;
+
+        elog(DEBUG2,
+             "registering new pending sync for rel %u/%u/%u at block %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             nblocks);
+
+    }
+    else if (rel->pending_sync->sync_above == InvalidBlockNumber)
+    {
+        elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             nblocks);
+        rel->pending_sync->sync_above = nblocks;
+    }
+    else
+        elog(DEBUG2,
+             "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             rel->pending_sync->sync_above, nblocks);
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+    BlockNumber blkno = InvalidBlockNumber;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    elog(DEBUG2, "BufferNeedsWAL(r %d, b %d): hash = %p, ent=%p, neg = %d", rel->rd_id, BufferGetBlockNumber(buf),
pendingSyncs,rel->pending_sync, rel->no_pending_sync);
 
+    /* no further work if we know that we don't have pending sync */
+    if (!pendingSyncs || rel->no_pending_sync)
+        return true;
+
+    /* do the real work */
+    if (!rel->pending_sync)
+    {
+        bool found = false;
+
+        /*
+         * Hold the entry in rel. This relies on the fact that hash entry
+         * never moves.
+         */
+        rel->pending_sync =
+            (PendingRelSync *) hash_search(pendingSyncs,
+                                           (void *) &rel->rd_node,
+                                           HASH_FIND, &found);
+        elog(DEBUG2, "BufferNeedsWAL: accessing hash : %s", found ? "found" : "not found");
+        if (!found)
+        {
+            /* we don't have no one. don't access the hash no longer */
+            rel->no_pending_sync = true;
+            return true;
+        }
+    }
+
+    blkno = BufferGetBlockNumber(buf);
+    if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+        rel->pending_sync->sync_above > blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno, rel->pending_sync->sync_above);
+        return true;
+    }
+
+    /*
+     * We have emitted a truncation record for this block.
+     */
+    if (rel->pending_sync->truncated_to != InvalidBlockNumber &&
+        rel->pending_sync->truncated_to <= blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same
xact",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno);
+        return true;
+    }
+
+    elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+         blkno);
+
+    return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    if (!pendingSyncs)
+        return;
+
+    if (isCommit)
+    {
+        HASH_SEQ_STATUS status;
+        PendingRelSync *pending;
+
+        hash_seq_init(&status, pendingSyncs);
+
+        while ((pending = hash_seq_search(&status)) != NULL)
+        {
+            if (pending->sync_above != InvalidBlockNumber)
+            {
+                FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+                smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+                elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+                     pending->relnode.dbNode, pending->relnode.relNode);
+            }
+        }
+    }
+
+    hash_destroy(pendingSyncs);
+    pendingSyncs = NULL;
+}
+/* *    PostPrepare_smgr -- Clean up after a successful PREPARE *
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index cfa3f05..6c0ffae 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2347,8 +2347,7 @@ CopyFrom(CopyState cstate)     *    - data is being written to relfilenode created in this
transaction    * then we can skip writing WAL.  It's safe because if the transaction     * doesn't commit, we'll
discardthe table (or the new relfilenode file).
 
-     * If it does commit, we'll have done the heap_sync at the bottom of this
-     * routine first.
+     * If it does commit, commit will do heap_sync().     *     * As mentioned in comments in utils/rel.h, the
in-same-transactiontest     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
 
@@ -2380,7 +2379,7 @@ CopyFrom(CopyState cstate)    {        hi_options |= HEAP_INSERT_SKIP_FSM;        if
(!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(cstate->rel);    }    /*
@@ -2862,11 +2861,11 @@ CopyFrom(CopyState cstate)    FreeExecutorState(estate);    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway)
+     * If we skipped writing WAL, then we will sync the heap at the end of
+     * the transaction. (We used to do it here, but it was later found out
+     * that to be safe, we must also avoid WAL-logging any subsequent
+     * actions on the pages we skipped WAL for). Indexes always use WAL.     */
-    if (hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(cstate->rel);    return processed;}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index e60210c..dbc2028 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -567,8 +567,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)     * We can skip
WAL-loggingthe insertions, unless PITR or streaming     * replication is in use. We can skip the FSM in any case.
*/
-    myState->hi_options = HEAP_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+    if (!XLogIsNeeded())
+        heap_register_sync(intoRelationDesc);
+    myState->hi_options = HEAP_INSERT_SKIP_FSM;    myState->bistate = GetBulkInsertState();    /* Not using WAL
requiressmgr_targblock be initially invalid */
 
@@ -617,9 +618,7 @@ intorel_shutdown(DestReceiver *self)    FreeBulkInsertState(myState->bistate);
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->rel);
+    /* If we skipped using WAL, we will sync the relation at commit */    /* close rel, but keep lock until commit */
 heap_close(myState->rel, NoLock);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index d2e0376..5645a6e 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -477,7 +477,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)     */
myState->hi_options= HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;    if (!XLogIsNeeded())
 
-        myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+        heap_register_sync(transientrel);    myState->bistate = GetBulkInsertState();    /* Not using WAL requires
smgr_targblockbe initially invalid */
 
@@ -520,9 +520,7 @@ transientrel_shutdown(DestReceiver *self)    FreeBulkInsertState(myState->bistate);
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->transientrel);
+    /* If we skipped using WAL, we will sync the relation at commit */    /* close transientrel, but keep lock until
commit*/    heap_close(myState->transientrel, NoLock);
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 96354bd..3fdb99d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4401,8 +4401,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)        bistate =
GetBulkInsertState();       hi_options = HEAP_INSERT_SKIP_FSM;
 
+        if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(newrel);    }    else    {
@@ -4675,8 +4676,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
FreeBulkInsertState(bistate);       /* If we skipped writing WAL, then we need to sync the heap. */
 
-        if (hi_options & HEAP_INSERT_SKIP_WAL)
-            heap_sync(newrel);        heap_close(newrel, NoLock);    }
@@ -10656,11 +10655,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)    /*     * Create
andcopy all forks of the relation, and schedule unlinking of
 
-     * old physical files.
+     * old physical files. Pending syncs for the old node is no longer needed.     *     * NOTE: any conflict in
relfilenodevalue will be caught in     * RelationCreateStorage().     */
 
+    RelationRemovePendingSync(rel);    RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);    /* copy main
fork*/
 
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 45b1859..757ed7f 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -891,7 +891,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,                 * page has
beenpreviously WAL-logged, and if not, do that                 * now.                 */
 
-                if (RelationNeedsWAL(onerel) &&
+                if (BufferNeedsWAL(onerel, buf) &&                    PageGetLSN(page) == InvalidXLogRecPtr)
        log_newpage_buffer(buf, true);
 
@@ -1118,7 +1118,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,            }            /*
NowWAL-log freezing if necessary */
 
-            if (RelationNeedsWAL(onerel))
+            if (BufferNeedsWAL(onerel, buf))            {                XLogRecPtr    recptr;
@@ -1476,7 +1476,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,    MarkBufferDirty(buffer);
/* XLOG stuff */
 
-    if (RelationNeedsWAL(onerel))
+    if (BufferNeedsWAL(onerel, buffer))    {        XLogRecPtr    recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 15795b0..be57547 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,            BufferAccessStrategy strategy,
  bool *foundPtr);static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
 
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);static void AtProcExit_Buffers(int code,
Datumarg);static void CheckForBufferLeaks(void);static int    rnode_comparator(const void *p1, const void *p2);
 
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)voidFlushRelationBuffers(Relation rel){
-    int            i;
-    BufferDesc *bufHdr;
-    /* Open rel at the smgr level if not already done */    RelationOpenSmgr(rel);
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int            i;
+    BufferDesc *bufHdr;
+
+    if (islocal)    {        for (i = 0; i < NLocBuffer; i++)        {            uint32        buf_state;
bufHdr= GetLocalBufferDescriptor(i);
 
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&                ((buf_state =
pg_atomic_read_u32(&bufHdr->state))&                 (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))            {
 
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)                PageSetChecksumInplace(localpage,
bufHdr->tag.blockNum);
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,                          bufHdr->tag.forkNum,
bufHdr->tag.blockNum,                         localpage,
 
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)         * As in DropRelFileNodeBuffers, an unlocked precheck
shouldbe safe         * and saves some cycles.         */
 
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))            continue;        ReservePrivateRefCountEntry();
    buf_state = LockBufHdr(bufHdr);
 
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&            (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID
|BM_DIRTY))        {            PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),LW_SHARED);
 
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);            LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr,true);        }
 
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index b8e3780..3dff4ed 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -72,6 +72,7 @@#include "optimizer/var.h"#include "rewrite/rewriteDefine.h"#include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"#include "storage/lmgr.h"#include "storage/smgr.h"#include "utils/array.h"
@@ -418,6 +419,10 @@ AllocateRelationDesc(Form_pg_class relp)    /* which we mark as a reference-counted tupdesc */
relation->rd_att->tdrefcount= 1;
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+    MemoryContextSwitchTo(oldcxt);    return relation;
@@ -2040,6 +2045,10 @@ formrdesc(const char *relationName, Oid relationReltype,        relation->rd_rel->relhasindex =
true;   }
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+    /*     * add new reldesc to relcache     */
@@ -3364,6 +3373,10 @@ RelationBuildLocalRelation(const char *relname,    else        rel->rd_rel->relfilenode =
relfilenode;
+    /* newly built relation has no pending sync */
+    rel->no_pending_sync = true;
+    rel->pending_sync = NULL;
+    RelationInitLockInfo(rel);    /* see lmgr.c */    RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4e41024..79b964f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@/* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    0x0001
-#define HEAP_INSERT_SKIP_FSM    0x0002
-#define HEAP_INSERT_FROZEN        0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM    0x0001
+#define HEAP_INSERT_FROZEN        0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004typedef struct BulkInsertStateData *BulkInsertState;
@@ -179,6 +178,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);extern void
simple_heap_update(Relationrelation, ItemPointer otid,                   HeapTuple tup);
 
+extern void heap_register_sync(Relation relation);extern void heap_sync(Relation relation);extern void
heap_update_snapshot(HeapScanDescscan, Snapshot snapshot);
 
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index a3a97db..03964e2 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);extern void
RelationDropStorage(Relationrel);extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);extern void
RelationTruncate(Relationrel, BlockNumber nblocks);
 
-
+extern void RelationRemovePendingSync(Relation rel);/* * These functions used to be in storage/smgr/smgr.c, which
explainsthe * naming */extern void smgrDoPendingDeletes(bool isCommit);extern int    smgrGetPendingDeletes(bool
forCommit,RelFileNode **ptr);
 
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);extern void AtSubCommit_smgr(void);extern void
AtSubAbort_smgr(void);externvoid PostPrepare_smgr(void);
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 98b63fc..598d1a0 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
ForkNumber forkNum);extern void FlushOneBuffer(Buffer buffer);extern void FlushRelationBuffers(Relation rel);
 
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                    bool islocal);extern void FlushDatabaseBuffers(Oid dbid);extern void
DropRelFileNodeBuffers(RelFileNodeBackendrnode,                       ForkNumber forkNum, BlockNumber firstDelBlock);
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 4bc61e5..c7610bd 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -216,6 +216,14 @@ typedef struct RelationData    /* use "struct" here to avoid needing to include pgstat.h: */
structPgStat_TableStatus *pgstat_info; /* statistics collection area */
 
+
+    /*
+     * no_pending_sync is true if this relation is known not to have pending
+     * syncs.  Elsewise searching for registered sync is required if
+     * pending_sync is NULL.
+     */
+    bool                   no_pending_sync;
+    struct PendingRelSync *pending_sync;} RelationData;
-- 
2.9.2


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

28 ноября 2017 г., 07:36:39

On Thu, Sep 14, 2017 at 3:34 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Wed, 13 Sep 2017 17:42:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in<20170913.174239.25978735.horiguchi.kyotaro@lab.ntt.co.jp>
 
>> filterdiff seems to did something wrong..
>
> # to did...
>
> The patch is broken by filterdiff so I send a new patch made
> directly by git format-patch. I confirmed that a build completes
> with applying this.

To my surprise this patch still applies but fails recovery tests. I am
bumping it to next CF, for what will be its 8th registration as it is
for a bug fix, switching the status to "waiting on author".
-- 
Michael

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

11 декабря 2017 г., 14:54:24

At Tue, 28 Nov 2017 10:36:39 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqSqukqS5Xx6_6KEk53eRy5ObdvaNG-5aN_4cE8=gTeOdg@mail.gmail.com>
> On Thu, Sep 14, 2017 at 3:34 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > At Wed, 13 Sep 2017 17:42:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in<20170913.174239.25978735.horiguchi.kyotaro@lab.ntt.co.jp>
 
> >> filterdiff seems to did something wrong..
> >
> > # to did...

It's horrid to see that:p

> > The patch is broken by filterdiff so I send a new patch made
> > directly by git format-patch. I confirmed that a build completes
> > with applying this.
> 
> To my surprise this patch still applies but fails recovery tests. I am
> bumping it to next CF, for what will be its 8th registration as it is
> for a bug fix, switching the status to "waiting on author".

Thank you for checking that. I saw maybe the same failure. It
occurred when visibilitymap_set() is called with heapBuf =
InvalidBuffer during recovery. Checking pendingSyncs and
no_pending_sync before the elog fixes it. Anyway the DEBUG2 elogs
are to removed before committing. They are just to look how it
works.

The attached patch applies on the current HEAD and passes all
recovery tests.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From af24850bf8ec5ea082d3affce9d0754daf1862ea Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 12 Sep 2017 13:01:33 +0900
Subject: [PATCH] Fix WAL logging problem

---
 src/backend/access/heap/heapam.c        | 113 ++++++++---
 src/backend/access/heap/pruneheap.c     |   3 +-
 src/backend/access/heap/rewriteheap.c   |   4 +-
 src/backend/access/heap/visibilitymap.c |   3 +-
 src/backend/access/transam/xact.c       |   7 +
 src/backend/catalog/storage.c           | 324 +++++++++++++++++++++++++++++---
 src/backend/commands/copy.c             |  13 +-
 src/backend/commands/createas.c         |   9 +-
 src/backend/commands/matview.c          |   6 +-
 src/backend/commands/tablecmds.c        |   8 +-
 src/backend/commands/vacuumlazy.c       |   6 +-
 src/backend/storage/buffer/bufmgr.c     |  40 +++-
 src/backend/utils/cache/relcache.c      |  13 ++
 src/include/access/heapam.h             |   8 +-
 src/include/catalog/storage.h           |   5 +-
 src/include/storage/bufmgr.h            |   2 +
 src/include/utils/rel.h                 |   8 +
 17 files changed, 482 insertions(+), 90 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3acef27..ecb9ad8 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
  *      the POSTGRES heap access method used for all POSTGRES
  *      relations.
  *
+ * WAL CONSIDERATIONS
+ *      All heap operations are normally WAL-logged. but there are a few
+ *      exceptions. Temporary and unlogged relations never need to be
+ *      WAL-logged, but we can also skip WAL-logging for a table that was
+ *      created in the same transaction, if we don't need WAL for PITR or
+ *      WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *      the file to disk at COMMIT instead.
+ *
+ *      The same-relation optimization is not employed automatically on all
+ *      updates to a table that was created in the same transacton, because
+ *      for a small number of changes, it's cheaper to just create the WAL
+ *      records than fsyncing() the whole relation at COMMIT. It is only
+ *      worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *      or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *      operation; it will cause any subsequent updates to the table to skip
+ *      WAL-logging, if possible, and cause the heap to be synced to disk at
+ *      COMMIT.
+ *
+ *      To make that work, all modifications to heap must use
+ *      HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ *      for the given block.
+ *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -56,6 +78,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -2373,12 +2396,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2409,6 +2426,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * TID where the tuple was stored.  But note that any toasting of fields
  * within the tuple data is NOT reflected into *tup.
  */
+extern HTAB *pendingSyncs;
 Oid
 heap_insert(Relation relation, HeapTuple tup, CommandId cid,
             int options, BulkInsertState bistate)
@@ -2482,7 +2500,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2683,12 +2701,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     int            ndone;
     char       *scratch = NULL;
     Page        page;
-    bool        needwal;
     Size        saveFreeSpace;
     bool        need_tuple_data = RelationIsLogicallyLogged(relation);
     bool        need_cids = RelationIsAccessibleInLogicalDecoding(relation);
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -2703,7 +2719,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
      * palloc() within a critical section is not safe, so we allocate this
      * beforehand.
      */
-    if (needwal)
+    if (RelationNeedsWAL(relation))
         scratch = palloc(BLCKSZ);
 
     /*
@@ -2738,6 +2754,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
         Buffer        vmbuffer = InvalidBuffer;
         bool        all_visible_cleared = false;
         int            nthispage;
+        bool        needwal;
 
         CHECK_FOR_INTERRUPTS();
 
@@ -2749,6 +2766,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
                                            InvalidBuffer, options, bistate,
                                            &vmbuffer, NULL);
         page = BufferGetPage(buffer);
+        needwal = BufferNeedsWAL(relation, buffer);
 
         /* NO EREPORT(ERROR) from here till changes are logged */
         START_CRIT_SECTION();
@@ -3305,7 +3323,7 @@ l1:
      * NB: heap_abort_speculative() uses the same xlog record and replay
      * routines.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -4271,7 +4289,8 @@ l2:
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer) ||
+        BufferNeedsWAL(relation, newbuf))
     {
         XLogRecPtr    recptr;
 
@@ -5162,7 +5181,7 @@ failed:
      * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
      * entries for everything anyway.)
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, *buffer))
     {
         xl_heap_lock xlrec;
         XLogRecPtr    recptr;
@@ -5896,7 +5915,7 @@ l4:
         MarkBufferDirty(buf);
 
         /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf))
         {
             xl_heap_lock_updated xlrec;
             XLogRecPtr    recptr;
@@ -6052,7 +6071,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
     htup->t_ctid = tuple->t_self;
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_confirm xlrec;
         XLogRecPtr    recptr;
@@ -6185,7 +6204,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
      * The WAL records generated here match heap_delete().  The same recovery
      * routines are used.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -6294,7 +6313,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_inplace xlrec;
         XLogRecPtr    recptr;
@@ -7408,7 +7427,7 @@ log_heap_clean(Relation reln, Buffer buffer,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
 
     xlrec.latestRemovedXid = latestRemovedXid;
     xlrec.nredirected = nredirected;
@@ -7456,7 +7475,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
     /* nor when there are no tuples to freeze */
     Assert(ntuples > 0);

@@ -7541,7 +7560,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
     int            bufflags;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
 
     XLogBeginInsert();
 
@@ -8632,8 +8651,13 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
      */
 
     /* Deal with old tuple version */
-    oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
-                                      &obuffer);
+    if (oldblk == newblk)
+        oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+    else if (XLogRecHasBlockRef(record, 1))
+        oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+    else
+        oldaction = BLK_DONE;
+
     if (oldaction == BLK_NEEDS_REDO)
     {
         page = BufferGetPage(obuffer);
@@ -8687,6 +8711,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
         PageInit(page, BufferGetPageSize(nbuffer), 0);
         newaction = BLK_NEEDS_REDO;
     }
+    else if (!XLogRecHasBlockRef(record, 0))
+        newaction = BLK_DONE;
     else
         newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
 
@@ -9123,9 +9149,16 @@ heap2_redo(XLogReaderState *record)
  *    heap_sync        - sync a heap, for use when no WAL has been written
  *
  * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
  *
  * Indexes are not touched.  (Currently, index operations associated with
  * the commands that use this are WAL-logged and so do not need fsync.
@@ -9235,3 +9268,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
         }
     }
 }
+
+/*
+ *    heap_register_sync    - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RecordPendingSync(rel);
+    if (OidIsValid(rel->rd_rel->reltoastrelid))
+    {
+        Relation    toastrel;
+
+        toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+        RecordPendingSync(toastrel);
+        heap_close(toastrel, AccessShareLock);
+    }
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 9f33e0c..1f184c9 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
 #include "access/htup_details.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -259,7 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
         /*
          * Emit a WAL HEAP_CLEAN record showing what we did
          */
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f93c194..899d7a5 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -649,9 +649,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     }
     else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
         heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
-                                         HEAP_INSERT_SKIP_FSM |
-                                         (state->rs_use_wal ?
-                                          0 : HEAP_INSERT_SKIP_WAL));
+                                         HEAP_INSERT_SKIP_FSM);
     else
         heaptup = tup;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4c2a13a..971d469 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
         map[mapByte] |= (flags << mapOffset);
         MarkBufferDirty(vmBuf);
 
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, heapBuf))
         {
             if (XLogRecPtrIsInvalid(recptr))
             {
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 046898c..24400e7 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2000,6 +2000,9 @@ CommitTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2228,6 +2231,9 @@ PrepareTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2541,6 +2547,7 @@ AbortTransaction(void)
     AtAbort_Notify();
     AtEOXact_RelationMap(false);
     AtAbort_Twophase();
+    smgrDoPendingSyncs(false);    /* abandone pending syncs */
 
     /*
      * Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9a5fde0..722f740 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,6 +29,7 @@
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -64,6 +65,49 @@ typedef struct PendingRelDelete
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 
 /*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+    RelFileNode relnode;        /* relation created in same xact */
+    BlockNumber sync_above;        /* WAL-logging skipped for blocks >=
+                                 * sync_above */
+    BlockNumber truncated_to;    /* truncation WAL record was written */
+}    PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
+/*
  * RelationCreateStorage
  *        Create physical storage for a relation.
  *
@@ -226,6 +270,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
 void
 RelationTruncate(Relation rel, BlockNumber nblocks)
 {
+    PendingRelSync *pending = NULL;
+    bool        found;
     bool        fsm;
     bool        vm;
 
@@ -260,37 +306,81 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
      */
     if (RelationNeedsWAL(rel))
     {
-        /*
-         * Make an XLOG entry reporting the file truncation.
-         */
-        XLogRecPtr    lsn;
-        xl_smgr_truncate xlrec;
-
-        xlrec.blkno = nblocks;
-        xlrec.rnode = rel->rd_node;
-        xlrec.flags = SMGR_TRUNCATE_ALL;
-
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+        /* no_pending_sync is ignored since new entry is created here */
+        if (!rel->pending_sync)
+        {
+            if (!pendingSyncs)
+                createPendingSyncsHash();
+            elog(DEBUG2, "RelationTruncate: accessing hash");
+            pending = (PendingRelSync *) hash_search(pendingSyncs,
+                                                 (void *) &rel->rd_node,
+                                                 HASH_ENTER, &found);
+            if (!found)
+            {
+                pending->sync_above = InvalidBlockNumber;
+                pending->truncated_to = InvalidBlockNumber;
+            }
 
-        lsn = XLogInsert(RM_SMGR_ID,
-                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+            rel->no_pending_sync= false;
+            rel->pending_sync = pending;
+        }
 
-        /*
-         * Flush, because otherwise the truncation of the main relation might
-         * hit the disk before the WAL record, and the truncation of the FSM
-         * or visibility map. If we crashed during that window, we'd be left
-         * with a truncated heap, but the FSM or visibility map would still
-         * contain entries for the non-existent heap pages.
-         */
-        if (fsm || vm)
-            XLogFlush(lsn);
+        if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+            rel->pending_sync->sync_above < nblocks)
+        {
+            /*
+             * Make an XLOG entry reporting the file truncation.
+             */
+            XLogRecPtr        lsn;
+            xl_smgr_truncate xlrec;
+
+            xlrec.blkno = nblocks;
+            xlrec.rnode = rel->rd_node;
+
+            XLogBeginInsert();
+            XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+            lsn = XLogInsert(RM_SMGR_ID,
+                             XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+            elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+                 nblocks);
+
+            /*
+             * Flush, because otherwise the truncation of the main relation
+             * might hit the disk before the WAL record, and the truncation of
+             * the FSM or visibility map. If we crashed during that window,
+             * we'd be left with a truncated heap, but the FSM or visibility
+             * map would still contain entries for the non-existent heap
+             * pages.
+             */
+            if (fsm || vm)
+                XLogFlush(lsn);
+
+            rel->pending_sync->truncated_to = nblocks;
+        }
     }
 
     /* Do the real work */
     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
 }
 
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+    /* First time through: initialize the hash table */
+    HASHCTL        ctl;
+
+    MemSet(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(RelFileNode);
+    ctl.entrysize = sizeof(PendingRelSync);
+    ctl.hash = tag_hash;
+    pendingSyncs = hash_create("pending relation sync table", 5,
+                               &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -369,6 +459,24 @@ smgrDoPendingDeletes(bool isCommit)
 }
 
 /*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+    bool found;
+
+    rel->pending_sync = NULL;
+    rel->no_pending_sync = true;
+    if (pendingSyncs)
+    {
+        elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+        hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+    }
+}
+
+
+/*
  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
  *
  * The return value is the number of relations scheduled for termination.
@@ -419,6 +527,176 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     return nrels;
 }
 
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+    bool found = true;
+    BlockNumber nblocks;
+
+    Assert(RelationNeedsWAL(rel));
+
+    /* ignore no_pending_sync since new entry is created here */
+    if (!rel->pending_sync)
+    {
+        if (!pendingSyncs)
+            createPendingSyncsHash();
+
+        /* Look up or create an entry */
+        rel->no_pending_sync = false;
+        elog(DEBUG2, "RecordPendingSync: accessing hash");
+        rel->pending_sync =
+            (PendingRelSync *) hash_search(pendingSyncs,
+                                           (void *) &rel->rd_node,
+                                           HASH_ENTER, &found);
+    }
+
+    nblocks = RelationGetNumberOfBlocks(rel);
+    if (!found)
+    {
+        rel->pending_sync->truncated_to = InvalidBlockNumber;
+        rel->pending_sync->sync_above = nblocks;
+
+        elog(DEBUG2,
+             "registering new pending sync for rel %u/%u/%u at block %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             nblocks);
+
+    }
+    else if (rel->pending_sync->sync_above == InvalidBlockNumber)
+    {
+        elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             nblocks);
+        rel->pending_sync->sync_above = nblocks;
+    }
+    else
+        elog(DEBUG2,
+             "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             rel->pending_sync->sync_above, nblocks);
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+    BlockNumber blkno = InvalidBlockNumber;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    /*
+     * no point in doing further work if we know that we don't have pending
+     * sync
+     */
+    if (!pendingSyncs || rel->no_pending_sync)
+        return true;
+
+    Assert(BufferIsValid(buf));
+
+    elog(DEBUG2, "BufferNeedsWAL(r %d, b %d): hash = %p, ent=%p, neg = %d", rel->rd_id, BufferGetBlockNumber(buf),
pendingSyncs,rel->pending_sync, rel->no_pending_sync); 
+
+    /* do the real work */
+    if (!rel->pending_sync)
+    {
+        bool found = false;
+
+        /*
+         * Hold the entry in rel. This relies on the fact that hash entry
+         * never moves.
+         */
+        rel->pending_sync =
+            (PendingRelSync *) hash_search(pendingSyncs,
+                                           (void *) &rel->rd_node,
+                                           HASH_FIND, &found);
+        elog(DEBUG2, "BufferNeedsWAL: accessing hash : %s", found ? "found" : "not found");
+        if (!found)
+        {
+            /* we don't have no one. don't access the hash no longer */
+            rel->no_pending_sync = true;
+            return true;
+        }
+    }
+
+    blkno = BufferGetBlockNumber(buf);
+    if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+        rel->pending_sync->sync_above > blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno, rel->pending_sync->sync_above);
+        return true;
+    }
+
+    /*
+     * We have emitted a truncation record for this block.
+     */
+    if (rel->pending_sync->truncated_to != InvalidBlockNumber &&
+        rel->pending_sync->truncated_to <= blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same
xact",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno);
+        return true;
+    }
+
+    elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+         blkno);
+
+    return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    if (!pendingSyncs)
+        return;
+
+    if (isCommit)
+    {
+        HASH_SEQ_STATUS status;
+        PendingRelSync *pending;
+
+        hash_seq_init(&status, pendingSyncs);
+
+        while ((pending = hash_seq_search(&status)) != NULL)
+        {
+            if (pending->sync_above != InvalidBlockNumber)
+            {
+                FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+                smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+                elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+                     pending->relnode.dbNode, pending->relnode.relNode);
+            }
+        }
+    }
+
+    hash_destroy(pendingSyncs);
+    pendingSyncs = NULL;
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 254be28..1ba8cce 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2357,8 +2357,7 @@ CopyFrom(CopyState cstate)
      *    - data is being written to relfilenode created in this transaction
      * then we can skip writing WAL.  It's safe because if the transaction
      * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the heap_sync at the bottom of this
-     * routine first.
+     * If it does commit, commit will do heap_sync().
      *
      * As mentioned in comments in utils/rel.h, the in-same-transaction test
      * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2390,7 +2389,7 @@ CopyFrom(CopyState cstate)
     {
         hi_options |= HEAP_INSERT_SKIP_FSM;
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(cstate->rel);
     }
 
     /*
@@ -2887,11 +2886,11 @@ CopyFrom(CopyState cstate)
     FreeExecutorState(estate);
 
     /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway)
+     * If we skipped writing WAL, then we will sync the heap at the end of
+     * the transaction. (We used to do it here, but it was later found out
+     * that to be safe, we must also avoid WAL-logging any subsequent
+     * actions on the pages we skipped WAL for). Indexes always use WAL.
      */
-    if (hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(cstate->rel);
 
     return processed;
 }
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 4d77411..01bbb51 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -567,8 +567,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->hi_options = HEAP_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+    if (!XLogIsNeeded())
+        heap_register_sync(intoRelationDesc);
+    myState->hi_options = HEAP_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -617,9 +618,7 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->rel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close rel, but keep lock until commit */
     heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index d2e0376..5645a6e 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -477,7 +477,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
     if (!XLogIsNeeded())
-        myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+        heap_register_sync(transientrel);
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -520,9 +520,7 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->transientrel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close transientrel, but keep lock until commit */
     heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index d979ce2..594d7bf 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4412,8 +4412,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         hi_options = HEAP_INSERT_SKIP_FSM;
+
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(newrel);
     }
     else
     {
@@ -4686,8 +4687,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         FreeBulkInsertState(bistate);
 
         /* If we skipped writing WAL, then we need to sync the heap. */
-        if (hi_options & HEAP_INSERT_SKIP_WAL)
-            heap_sync(newrel);
 
         heap_close(newrel, NoLock);
     }
@@ -10727,11 +10726,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     /*
      * Create and copy all forks of the relation, and schedule unlinking of
-     * old physical files.
+     * old physical files. Pending syncs for the old node is no longer needed.
      *
      * NOTE: any conflict in relfilenode value will be caught in
      * RelationCreateStorage().
      */
+    RelationRemovePendingSync(rel);
     RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
 
     /* copy main fork */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 20ce431..82bbf05 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -902,7 +902,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
                  * page has been previously WAL-logged, and if not, do that
                  * now.
                  */
-                if (RelationNeedsWAL(onerel) &&
+                if (BufferNeedsWAL(onerel, buf) &&
                     PageGetLSN(page) == InvalidXLogRecPtr)
                     log_newpage_buffer(buf, true);
 
@@ -1129,7 +1129,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
             }
 
             /* Now WAL-log freezing if necessary */
-            if (RelationNeedsWAL(onerel))
+            if (BufferNeedsWAL(onerel, buf))
             {
                 XLogRecPtr    recptr;
 
@@ -1487,7 +1487,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(onerel))
+    if (BufferNeedsWAL(onerel, buffer))
     {
         XLogRecPtr    recptr;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 26df7cb..171b17b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
             BufferAccessStrategy strategy,
             bool *foundPtr);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int    rnode_comparator(const void *p1, const void *p2);
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
     /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int            i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 12a5f15..08711b5 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -73,6 +73,7 @@
 #include "optimizer/var.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/array.h"
@@ -414,6 +415,10 @@ AllocateRelationDesc(Form_pg_class relp)
     /* which we mark as a reference-counted tupdesc */
     relation->rd_att->tdrefcount = 1;
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+
     MemoryContextSwitchTo(oldcxt);
 
     return relation;
@@ -2043,6 +2048,10 @@ formrdesc(const char *relationName, Oid relationReltype,
         relation->rd_rel->relhasindex = true;
     }
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+
     /*
      * add new reldesc to relcache
      */
@@ -3367,6 +3376,10 @@ RelationBuildLocalRelation(const char *relname,
     else
         rel->rd_rel->relfilenode = relfilenode;
 
+    /* newly built relation has no pending sync */
+    rel->no_pending_sync = true;
+    rel->pending_sync = NULL;
+
     RelationInitLockInfo(rel);    /* see lmgr.c */
 
     RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4e41024..79b964f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    0x0001
-#define HEAP_INSERT_SKIP_FSM    0x0002
-#define HEAP_INSERT_FROZEN        0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM    0x0001
+#define HEAP_INSERT_FROZEN        0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
 
 typedef struct BulkInsertStateData *BulkInsertState;
 
@@ -179,6 +178,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                    HeapTuple tup);
 
+extern void heap_register_sync(Relation relation);
 extern void heap_sync(Relation relation);
 extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
 
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index a3a97db..03964e2 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
 extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 98b63fc..598d1a0 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                 ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                    bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                        ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 68fd6fb..507844f 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -216,6 +216,14 @@ typedef struct RelationData
 
     /* use "struct" here to avoid needing to include pgstat.h: */
     struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+    /*
+     * no_pending_sync is true if this relation is known not to have pending
+     * syncs.  Elsewise searching for registered sync is required if
+     * pending_sync is NULL.
+     */
+    bool                   no_pending_sync;
+    struct PendingRelSync *pending_sync;
 } RelationData;
 
 
-- 
2.9.2

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Stephen Frost

Дата:

05 января 2018 г., 10:10:40

Greetings,

* Kyotaro HORIGUCHI (horiguchi.kyotaro@lab.ntt.co.jp) wrote:
> At Tue, 28 Nov 2017 10:36:39 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqSqukqS5Xx6_6KEk53eRy5ObdvaNG-5aN_4cE8=gTeOdg@mail.gmail.com>
> > On Thu, Sep 14, 2017 at 3:34 PM, Kyotaro HORIGUCHI
> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > At Wed, 13 Sep 2017 17:42:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
wrotein <20170913.174239.25978735.horiguchi.kyotaro@lab.ntt.co.jp> 
> > >> filterdiff seems to did something wrong..
> > >
> > > # to did...
>
> It's horrid to see that:p
>
> > > The patch is broken by filterdiff so I send a new patch made
> > > directly by git format-patch. I confirmed that a build completes
> > > with applying this.
> >
> > To my surprise this patch still applies but fails recovery tests. I am
> > bumping it to next CF, for what will be its 8th registration as it is
> > for a bug fix, switching the status to "waiting on author".
>
> Thank you for checking that. I saw maybe the same failure. It
> occurred when visibilitymap_set() is called with heapBuf =
> InvalidBuffer during recovery. Checking pendingSyncs and
> no_pending_sync before the elog fixes it. Anyway the DEBUG2 elogs
> are to removed before committing. They are just to look how it
> works.
>
> The attached patch applies on the current HEAD and passes all
> recovery tests.

This is currently marked as 'waiting on author' in the CF app, but it
sounds like it should be 'Needs review'.  If that's the case, please
update the CF app accordingly.  If you run into any issues with that,
let me know.

Thanks!

Stephen

Вложения

signature.asc

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

11 января 2018 г., 14:03:55

Hello,

At Thu, 4 Jan 2018 23:10:40 -0500, Stephen Frost <sfrost@snowman.net> wrote in
<20180105041040.GI2416@tamriel.snowman.net>
> > The attached patch applies on the current HEAD and passes all
> > recovery tests.
> 
> This is currently marked as 'waiting on author' in the CF app, but it
> sounds like it should be 'Needs review'.  If that's the case, please
> update the CF app accordingly.  If you run into any issues with that,
> let me know.
> 
> Thanks!

Thank you for noticing me of that. The attached is the rebased
patch (the previous version didn't conflict with the current
master, though) and changed the status to "Needs Review".

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 15e3d095b89e9a5bb8025008d1475107b340cbd4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 12 Sep 2017 13:01:33 +0900
Subject: [PATCH] Fix WAL logging problem

---
 src/backend/access/heap/heapam.c        | 113 ++++++++---
 src/backend/access/heap/pruneheap.c     |   3 +-
 src/backend/access/heap/rewriteheap.c   |   4 +-
 src/backend/access/heap/visibilitymap.c |   3 +-
 src/backend/access/transam/xact.c       |   7 +
 src/backend/catalog/storage.c           | 324 +++++++++++++++++++++++++++++---
 src/backend/commands/copy.c             |  13 +-
 src/backend/commands/createas.c         |   9 +-
 src/backend/commands/matview.c          |   6 +-
 src/backend/commands/tablecmds.c        |   8 +-
 src/backend/commands/vacuumlazy.c       |   6 +-
 src/backend/storage/buffer/bufmgr.c     |  40 +++-
 src/backend/utils/cache/relcache.c      |  13 ++
 src/include/access/heapam.h             |   8 +-
 src/include/catalog/storage.h           |   5 +-
 src/include/storage/bufmgr.h            |   2 +
 src/include/utils/rel.h                 |   8 +
 17 files changed, 482 insertions(+), 90 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dbc8f2d..df7e050 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
  *      the POSTGRES heap access method used for all POSTGRES
  *      relations.
  *
+ * WAL CONSIDERATIONS
+ *      All heap operations are normally WAL-logged. but there are a few
+ *      exceptions. Temporary and unlogged relations never need to be
+ *      WAL-logged, but we can also skip WAL-logging for a table that was
+ *      created in the same transaction, if we don't need WAL for PITR or
+ *      WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *      the file to disk at COMMIT instead.
+ *
+ *      The same-relation optimization is not employed automatically on all
+ *      updates to a table that was created in the same transacton, because
+ *      for a small number of changes, it's cheaper to just create the WAL
+ *      records than fsyncing() the whole relation at COMMIT. It is only
+ *      worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *      or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *      operation; it will cause any subsequent updates to the table to skip
+ *      WAL-logging, if possible, and cause the heap to be synced to disk at
+ *      COMMIT.
+ *
+ *      To make that work, all modifications to heap must use
+ *      HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ *      for the given block.
+ *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -56,6 +78,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -2373,12 +2396,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2409,6 +2426,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * TID where the tuple was stored.  But note that any toasting of fields
  * within the tuple data is NOT reflected into *tup.
  */
+extern HTAB *pendingSyncs;
 Oid
 heap_insert(Relation relation, HeapTuple tup, CommandId cid,
             int options, BulkInsertState bistate)
@@ -2482,7 +2500,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2683,12 +2701,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     int            ndone;
     char       *scratch = NULL;
     Page        page;
-    bool        needwal;
     Size        saveFreeSpace;
     bool        need_tuple_data = RelationIsLogicallyLogged(relation);
     bool        need_cids = RelationIsAccessibleInLogicalDecoding(relation);
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -2703,7 +2719,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
      * palloc() within a critical section is not safe, so we allocate this
      * beforehand.
      */
-    if (needwal)
+    if (RelationNeedsWAL(relation))
         scratch = palloc(BLCKSZ);
 
     /*
@@ -2738,6 +2754,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
         Buffer        vmbuffer = InvalidBuffer;
         bool        all_visible_cleared = false;
         int            nthispage;
+        bool        needwal;
 
         CHECK_FOR_INTERRUPTS();
 
@@ -2749,6 +2766,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
                                            InvalidBuffer, options, bistate,
                                            &vmbuffer, NULL);
         page = BufferGetPage(buffer);
+        needwal = BufferNeedsWAL(relation, buffer);
 
         /* NO EREPORT(ERROR) from here till changes are logged */
         START_CRIT_SECTION();
@@ -3305,7 +3323,7 @@ l1:
      * NB: heap_abort_speculative() uses the same xlog record and replay
      * routines.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -4271,7 +4289,8 @@ l2:
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer) ||
+        BufferNeedsWAL(relation, newbuf))
     {
         XLogRecPtr    recptr;
 
@@ -5162,7 +5181,7 @@ failed:
      * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
      * entries for everything anyway.)
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, *buffer))
     {
         xl_heap_lock xlrec;
         XLogRecPtr    recptr;
@@ -5896,7 +5915,7 @@ l4:
         MarkBufferDirty(buf);
 
         /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf))
         {
             xl_heap_lock_updated xlrec;
             XLogRecPtr    recptr;
@@ -6052,7 +6071,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
     htup->t_ctid = tuple->t_self;
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_confirm xlrec;
         XLogRecPtr    recptr;
@@ -6185,7 +6204,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
      * The WAL records generated here match heap_delete().  The same recovery
      * routines are used.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -6294,7 +6313,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_inplace xlrec;
         XLogRecPtr    recptr;
@@ -7480,7 +7499,7 @@ log_heap_clean(Relation reln, Buffer buffer,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
 
     xlrec.latestRemovedXid = latestRemovedXid;
     xlrec.nredirected = nredirected;
@@ -7528,7 +7547,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
     /* nor when there are no tuples to freeze */
     Assert(ntuples > 0);

@@ -7613,7 +7632,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
     int            bufflags;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
 
     XLogBeginInsert();
 
@@ -8704,8 +8723,13 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
      */
 
     /* Deal with old tuple version */
-    oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
-                                      &obuffer);
+    if (oldblk == newblk)
+        oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+    else if (XLogRecHasBlockRef(record, 1))
+        oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+    else
+        oldaction = BLK_DONE;
+
     if (oldaction == BLK_NEEDS_REDO)
     {
         page = BufferGetPage(obuffer);
@@ -8759,6 +8783,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
         PageInit(page, BufferGetPageSize(nbuffer), 0);
         newaction = BLK_NEEDS_REDO;
     }
+    else if (!XLogRecHasBlockRef(record, 0))
+        newaction = BLK_DONE;
     else
         newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
 
@@ -9195,9 +9221,16 @@ heap2_redo(XLogReaderState *record)
  *    heap_sync        - sync a heap, for use when no WAL has been written
  *
  * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
  *
  * Indexes are not touched.  (Currently, index operations associated with
  * the commands that use this are WAL-logged and so do not need fsync.
@@ -9307,3 +9340,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
         }
     }
 }
+
+/*
+ *    heap_register_sync    - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RecordPendingSync(rel);
+    if (OidIsValid(rel->rd_rel->reltoastrelid))
+    {
+        Relation    toastrel;
+
+        toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+        RecordPendingSync(toastrel);
+        heap_close(toastrel, AccessShareLock);
+    }
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index f67d7d1..6dd2ae5 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
 #include "access/htup_details.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -259,7 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
         /*
          * Emit a WAL HEAP_CLEAN record showing what we did
          */
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 7d466c2..7471d74 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -652,9 +652,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     }
     else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
         heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
-                                         HEAP_INSERT_SKIP_FSM |
-                                         (state->rs_use_wal ?
-                                          0 : HEAP_INSERT_SKIP_WAL));
+                                         HEAP_INSERT_SKIP_FSM);
     else
         heaptup = tup;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b251e69..4a46444 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
         map[mapByte] |= (flags << mapOffset);
         MarkBufferDirty(vmBuf);
 
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, heapBuf))
         {
             if (XLogRecPtrIsInvalid(recptr))
             {
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index ea81f4b..8a0c3b4 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2001,6 +2001,9 @@ CommitTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2229,6 +2232,9 @@ PrepareTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2542,6 +2548,7 @@ AbortTransaction(void)
     AtAbort_Notify();
     AtEOXact_RelationMap(false);
     AtAbort_Twophase();
+    smgrDoPendingSyncs(false);    /* abandone pending syncs */
 
     /*
      * Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index cff49ba..e9abd49 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,6 +29,7 @@
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -64,6 +65,49 @@ typedef struct PendingRelDelete
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 
 /*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+    RelFileNode relnode;        /* relation created in same xact */
+    BlockNumber sync_above;        /* WAL-logging skipped for blocks >=
+                                 * sync_above */
+    BlockNumber truncated_to;    /* truncation WAL record was written */
+}    PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
+/*
  * RelationCreateStorage
  *        Create physical storage for a relation.
  *
@@ -226,6 +270,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
 void
 RelationTruncate(Relation rel, BlockNumber nblocks)
 {
+    PendingRelSync *pending = NULL;
+    bool        found;
     bool        fsm;
     bool        vm;
 
@@ -260,37 +306,81 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
      */
     if (RelationNeedsWAL(rel))
     {
-        /*
-         * Make an XLOG entry reporting the file truncation.
-         */
-        XLogRecPtr    lsn;
-        xl_smgr_truncate xlrec;
-
-        xlrec.blkno = nblocks;
-        xlrec.rnode = rel->rd_node;
-        xlrec.flags = SMGR_TRUNCATE_ALL;
-
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+        /* no_pending_sync is ignored since new entry is created here */
+        if (!rel->pending_sync)
+        {
+            if (!pendingSyncs)
+                createPendingSyncsHash();
+            elog(DEBUG2, "RelationTruncate: accessing hash");
+            pending = (PendingRelSync *) hash_search(pendingSyncs,
+                                                 (void *) &rel->rd_node,
+                                                 HASH_ENTER, &found);
+            if (!found)
+            {
+                pending->sync_above = InvalidBlockNumber;
+                pending->truncated_to = InvalidBlockNumber;
+            }
 
-        lsn = XLogInsert(RM_SMGR_ID,
-                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+            rel->no_pending_sync= false;
+            rel->pending_sync = pending;
+        }
 
-        /*
-         * Flush, because otherwise the truncation of the main relation might
-         * hit the disk before the WAL record, and the truncation of the FSM
-         * or visibility map. If we crashed during that window, we'd be left
-         * with a truncated heap, but the FSM or visibility map would still
-         * contain entries for the non-existent heap pages.
-         */
-        if (fsm || vm)
-            XLogFlush(lsn);
+        if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+            rel->pending_sync->sync_above < nblocks)
+        {
+            /*
+             * Make an XLOG entry reporting the file truncation.
+             */
+            XLogRecPtr        lsn;
+            xl_smgr_truncate xlrec;
+
+            xlrec.blkno = nblocks;
+            xlrec.rnode = rel->rd_node;
+
+            XLogBeginInsert();
+            XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+            lsn = XLogInsert(RM_SMGR_ID,
+                             XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+            elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+                 nblocks);
+
+            /*
+             * Flush, because otherwise the truncation of the main relation
+             * might hit the disk before the WAL record, and the truncation of
+             * the FSM or visibility map. If we crashed during that window,
+             * we'd be left with a truncated heap, but the FSM or visibility
+             * map would still contain entries for the non-existent heap
+             * pages.
+             */
+            if (fsm || vm)
+                XLogFlush(lsn);
+
+            rel->pending_sync->truncated_to = nblocks;
+        }
     }
 
     /* Do the real work */
     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
 }
 
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+    /* First time through: initialize the hash table */
+    HASHCTL        ctl;
+
+    MemSet(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(RelFileNode);
+    ctl.entrysize = sizeof(PendingRelSync);
+    ctl.hash = tag_hash;
+    pendingSyncs = hash_create("pending relation sync table", 5,
+                               &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -369,6 +459,24 @@ smgrDoPendingDeletes(bool isCommit)
 }
 
 /*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+    bool found;
+
+    rel->pending_sync = NULL;
+    rel->no_pending_sync = true;
+    if (pendingSyncs)
+    {
+        elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+        hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+    }
+}
+
+
+/*
  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
  *
  * The return value is the number of relations scheduled for termination.
@@ -419,6 +527,176 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     return nrels;
 }
 
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+    bool found = true;
+    BlockNumber nblocks;
+
+    Assert(RelationNeedsWAL(rel));
+
+    /* ignore no_pending_sync since new entry is created here */
+    if (!rel->pending_sync)
+    {
+        if (!pendingSyncs)
+            createPendingSyncsHash();
+
+        /* Look up or create an entry */
+        rel->no_pending_sync = false;
+        elog(DEBUG2, "RecordPendingSync: accessing hash");
+        rel->pending_sync =
+            (PendingRelSync *) hash_search(pendingSyncs,
+                                           (void *) &rel->rd_node,
+                                           HASH_ENTER, &found);
+    }
+
+    nblocks = RelationGetNumberOfBlocks(rel);
+    if (!found)
+    {
+        rel->pending_sync->truncated_to = InvalidBlockNumber;
+        rel->pending_sync->sync_above = nblocks;
+
+        elog(DEBUG2,
+             "registering new pending sync for rel %u/%u/%u at block %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             nblocks);
+
+    }
+    else if (rel->pending_sync->sync_above == InvalidBlockNumber)
+    {
+        elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             nblocks);
+        rel->pending_sync->sync_above = nblocks;
+    }
+    else
+        elog(DEBUG2,
+             "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             rel->pending_sync->sync_above, nblocks);
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+    BlockNumber blkno = InvalidBlockNumber;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    /*
+     * no point in doing further work if we know that we don't have pending
+     * sync
+     */
+    if (!pendingSyncs || rel->no_pending_sync)
+        return true;
+
+    Assert(BufferIsValid(buf));
+
+    elog(DEBUG2, "BufferNeedsWAL(r %d, b %d): hash = %p, ent=%p, neg = %d", rel->rd_id, BufferGetBlockNumber(buf),
pendingSyncs,rel->pending_sync, rel->no_pending_sync); 
+
+    /* do the real work */
+    if (!rel->pending_sync)
+    {
+        bool found = false;
+
+        /*
+         * Hold the entry in rel. This relies on the fact that hash entry
+         * never moves.
+         */
+        rel->pending_sync =
+            (PendingRelSync *) hash_search(pendingSyncs,
+                                           (void *) &rel->rd_node,
+                                           HASH_FIND, &found);
+        elog(DEBUG2, "BufferNeedsWAL: accessing hash : %s", found ? "found" : "not found");
+        if (!found)
+        {
+            /* we don't have no one. don't access the hash no longer */
+            rel->no_pending_sync = true;
+            return true;
+        }
+    }
+
+    blkno = BufferGetBlockNumber(buf);
+    if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+        rel->pending_sync->sync_above > blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno, rel->pending_sync->sync_above);
+        return true;
+    }
+
+    /*
+     * We have emitted a truncation record for this block.
+     */
+    if (rel->pending_sync->truncated_to != InvalidBlockNumber &&
+        rel->pending_sync->truncated_to <= blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same
xact",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno);
+        return true;
+    }
+
+    elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+         blkno);
+
+    return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    if (!pendingSyncs)
+        return;
+
+    if (isCommit)
+    {
+        HASH_SEQ_STATUS status;
+        PendingRelSync *pending;
+
+        hash_seq_init(&status, pendingSyncs);
+
+        while ((pending = hash_seq_search(&status)) != NULL)
+        {
+            if (pending->sync_above != InvalidBlockNumber)
+            {
+                FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+                smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+                elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+                     pending->relnode.dbNode, pending->relnode.relNode);
+            }
+        }
+    }
+
+    hash_destroy(pendingSyncs);
+    pendingSyncs = NULL;
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 6bfca2a..a7f0e5f 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2354,8 +2354,7 @@ CopyFrom(CopyState cstate)
      *    - data is being written to relfilenode created in this transaction
      * then we can skip writing WAL.  It's safe because if the transaction
      * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the heap_sync at the bottom of this
-     * routine first.
+     * If it does commit, commit will do heap_sync().
      *
      * As mentioned in comments in utils/rel.h, the in-same-transaction test
      * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2387,7 +2386,7 @@ CopyFrom(CopyState cstate)
     {
         hi_options |= HEAP_INSERT_SKIP_FSM;
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(cstate->rel);
     }
 
     /*
@@ -2841,11 +2840,11 @@ CopyFrom(CopyState cstate)
     FreeExecutorState(estate);
 
     /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway)
+     * If we skipped writing WAL, then we will sync the heap at the end of
+     * the transaction. (We used to do it here, but it was later found out
+     * that to be safe, we must also avoid WAL-logging any subsequent
+     * actions on the pages we skipped WAL for). Indexes always use WAL.
      */
-    if (hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(cstate->rel);
 
     return processed;
 }
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 3d82edb..a3c3518 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -567,8 +567,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->hi_options = HEAP_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+    if (!XLogIsNeeded())
+        heap_register_sync(intoRelationDesc);
+    myState->hi_options = HEAP_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -617,9 +618,7 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->rel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close rel, but keep lock until commit */
     heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index ab6a889..33a2167 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -477,7 +477,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
     if (!XLogIsNeeded())
-        myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+        heap_register_sync(transientrel);
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -520,9 +520,7 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->transientrel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close transientrel, but keep lock until commit */
     heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index f2a928b..81e5ccf 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4411,8 +4411,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         hi_options = HEAP_INSERT_SKIP_FSM;
+
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(newrel);
     }
     else
     {
@@ -4685,8 +4686,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         FreeBulkInsertState(bistate);
 
         /* If we skipped writing WAL, then we need to sync the heap. */
-        if (hi_options & HEAP_INSERT_SKIP_WAL)
-            heap_sync(newrel);
 
         heap_close(newrel, NoLock);
     }
@@ -10668,11 +10667,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     /*
      * Create and copy all forks of the relation, and schedule unlinking of
-     * old physical files.
+     * old physical files. Pending syncs for the old node is no longer needed.
      *
      * NOTE: any conflict in relfilenode value will be caught in
      * RelationCreateStorage().
      */
+    RelationRemovePendingSync(rel);
     RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
 
     /* copy main fork */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index cf7f5e1..bbb0215 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -904,7 +904,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
                  * page has been previously WAL-logged, and if not, do that
                  * now.
                  */
-                if (RelationNeedsWAL(onerel) &&
+                if (BufferNeedsWAL(onerel, buf) &&
                     PageGetLSN(page) == InvalidXLogRecPtr)
                     log_newpage_buffer(buf, true);
 
@@ -1140,7 +1140,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
             }
 
             /* Now WAL-log freezing if necessary */
-            if (RelationNeedsWAL(onerel))
+            if (BufferNeedsWAL(onerel, buf))
             {
                 XLogRecPtr    recptr;
 
@@ -1498,7 +1498,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(onerel))
+    if (BufferNeedsWAL(onerel, buffer))
     {
         XLogRecPtr    recptr;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4e44336..f0f3ac2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
             BufferAccessStrategy strategy,
             bool *foundPtr);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int    rnode_comparator(const void *p1, const void *p2);
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
     /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int            i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 28a4483..ce9f361 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -73,6 +73,7 @@
 #include "optimizer/var.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/array.h"
@@ -413,6 +414,10 @@ AllocateRelationDesc(Form_pg_class relp)
     /* which we mark as a reference-counted tupdesc */
     relation->rd_att->tdrefcount = 1;
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+
     MemoryContextSwitchTo(oldcxt);
 
     return relation;
@@ -1998,6 +2003,10 @@ formrdesc(const char *relationName, Oid relationReltype,
         relation->rd_rel->relhasindex = true;
     }
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+
     /*
      * add new reldesc to relcache
      */
@@ -3322,6 +3331,10 @@ RelationBuildLocalRelation(const char *relname,
     else
         rel->rd_rel->relfilenode = relfilenode;
 
+    /* newly built relation has no pending sync */
+    rel->no_pending_sync = true;
+    rel->pending_sync = NULL;
+
     RelationInitLockInfo(rel);    /* see lmgr.c */
 
     RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4c0256b..fff3fd4 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    0x0001
-#define HEAP_INSERT_SKIP_FSM    0x0002
-#define HEAP_INSERT_FROZEN        0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM    0x0001
+#define HEAP_INSERT_FROZEN        0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
 
 typedef struct BulkInsertStateData *BulkInsertState;
 
@@ -180,6 +179,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                    HeapTuple tup);
 
+extern void heap_register_sync(Relation relation);
 extern void heap_sync(Relation relation);
 extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
 
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ef52d85..49d93cd 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
 extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3cce390..9fae7c6 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                 ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                    bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                        ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index aa8add5..9fa06a5 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -216,6 +216,14 @@ typedef struct RelationData
 
     /* use "struct" here to avoid needing to include pgstat.h: */
     struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+    /*
+     * no_pending_sync is true if this relation is known not to have pending
+     * syncs.  Elsewise searching for registered sync is required if
+     * pending_sync is NULL.
+     */
+    bool                   no_pending_sync;
+    struct PendingRelSync *pending_sync;
 } RelationData;
 
 
-- 
2.9.2

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

30 марта 2018 г., 07:06:46

Hello.  I found that c203d6cf81 hit this and this is the rebased
version on the current master.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 3dac5baf787dc949cfb22a698a0d72b6eb48e75e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 12 Sep 2017 13:01:33 +0900
Subject: [PATCH] Fix WAL logging problem

---
 src/backend/access/heap/heapam.c        | 113 ++++++++---
 src/backend/access/heap/pruneheap.c     |   3 +-
 src/backend/access/heap/rewriteheap.c   |   4 +-
 src/backend/access/heap/visibilitymap.c |   3 +-
 src/backend/access/transam/xact.c       |   7 +
 src/backend/catalog/storage.c           | 320 +++++++++++++++++++++++++++++---
 src/backend/commands/copy.c             |  13 +-
 src/backend/commands/createas.c         |   9 +-
 src/backend/commands/matview.c          |   6 +-
 src/backend/commands/tablecmds.c        |   8 +-
 src/backend/commands/vacuumlazy.c       |   6 +-
 src/backend/storage/buffer/bufmgr.c     |  40 +++-
 src/backend/utils/cache/relcache.c      |  13 ++
 src/include/access/heapam.h             |   8 +-
 src/include/catalog/storage.h           |   5 +-
 src/include/storage/bufmgr.h            |   2 +
 src/include/utils/rel.h                 |   8 +
 17 files changed, 480 insertions(+), 88 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d7279248e7..8fd2c2948e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
  *      the POSTGRES heap access method used for all POSTGRES
  *      relations.
  *
+ * WAL CONSIDERATIONS
+ *      All heap operations are normally WAL-logged. but there are a few
+ *      exceptions. Temporary and unlogged relations never need to be
+ *      WAL-logged, but we can also skip WAL-logging for a table that was
+ *      created in the same transaction, if we don't need WAL for PITR or
+ *      WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *      the file to disk at COMMIT instead.
+ *
+ *      The same-relation optimization is not employed automatically on all
+ *      updates to a table that was created in the same transacton, because
+ *      for a small number of changes, it's cheaper to just create the WAL
+ *      records than fsyncing() the whole relation at COMMIT. It is only
+ *      worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *      or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *      operation; it will cause any subsequent updates to the table to skip
+ *      WAL-logging, if possible, and cause the heap to be synced to disk at
+ *      COMMIT.
+ *
+ *      To make that work, all modifications to heap must use
+ *      HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ *      for the given block.
+ *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -57,6 +79,7 @@
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
 #include "catalog/index.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -2400,12 +2423,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2436,6 +2453,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * TID where the tuple was stored.  But note that any toasting of fields
  * within the tuple data is NOT reflected into *tup.
  */
+extern HTAB *pendingSyncs;
 Oid
 heap_insert(Relation relation, HeapTuple tup, CommandId cid,
             int options, BulkInsertState bistate)
@@ -2509,7 +2527,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2710,12 +2728,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     int            ndone;
     char       *scratch = NULL;
     Page        page;
-    bool        needwal;
     Size        saveFreeSpace;
     bool        need_tuple_data = RelationIsLogicallyLogged(relation);
     bool        need_cids = RelationIsAccessibleInLogicalDecoding(relation);
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -2730,7 +2746,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
      * palloc() within a critical section is not safe, so we allocate this
      * beforehand.
      */
-    if (needwal)
+    if (RelationNeedsWAL(relation))
         scratch = palloc(BLCKSZ);
 
     /*
@@ -2765,6 +2781,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
         Buffer        vmbuffer = InvalidBuffer;
         bool        all_visible_cleared = false;
         int            nthispage;
+        bool        needwal;
 
         CHECK_FOR_INTERRUPTS();
 
@@ -2776,6 +2793,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
                                            InvalidBuffer, options, bistate,
                                            &vmbuffer, NULL);
         page = BufferGetPage(buffer);
+        needwal = BufferNeedsWAL(relation, buffer);
 
         /* NO EREPORT(ERROR) from here till changes are logged */
         START_CRIT_SECTION();
@@ -3332,7 +3350,7 @@ l1:
      * NB: heap_abort_speculative() uses the same xlog record and replay
      * routines.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -4307,7 +4325,8 @@ l2:
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer) ||
+        BufferNeedsWAL(relation, newbuf))
     {
         XLogRecPtr    recptr;
 
@@ -5276,7 +5295,7 @@ failed:
      * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
      * entries for everything anyway.)
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, *buffer))
     {
         xl_heap_lock xlrec;
         XLogRecPtr    recptr;
@@ -6020,7 +6039,7 @@ l4:
         MarkBufferDirty(buf);
 
         /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf))
         {
             xl_heap_lock_updated xlrec;
             XLogRecPtr    recptr;
@@ -6174,7 +6193,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
     htup->t_ctid = tuple->t_self;
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_confirm xlrec;
         XLogRecPtr    recptr;
@@ -6307,7 +6326,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
      * The WAL records generated here match heap_delete().  The same recovery
      * routines are used.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -6416,7 +6435,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_inplace xlrec;
         XLogRecPtr    recptr;
@@ -7602,7 +7621,7 @@ log_heap_clean(Relation reln, Buffer buffer,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
 
     xlrec.latestRemovedXid = latestRemovedXid;
     xlrec.nredirected = nredirected;
@@ -7650,7 +7669,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
     /* nor when there are no tuples to freeze */
     Assert(ntuples > 0);
 
@@ -7735,7 +7754,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
     int            bufflags;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
 
     XLogBeginInsert();
 
@@ -8826,8 +8845,13 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
      */
 
     /* Deal with old tuple version */
-    oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
-                                      &obuffer);
+    if (oldblk == newblk)
+        oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+    else if (XLogRecHasBlockRef(record, 1))
+        oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+    else
+        oldaction = BLK_DONE;
+
     if (oldaction == BLK_NEEDS_REDO)
     {
         page = BufferGetPage(obuffer);
@@ -8881,6 +8905,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
         PageInit(page, BufferGetPageSize(nbuffer), 0);
         newaction = BLK_NEEDS_REDO;
     }
+    else if (!XLogRecHasBlockRef(record, 0))
+        newaction = BLK_DONE;
     else
         newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
 
@@ -9317,9 +9343,16 @@ heap2_redo(XLogReaderState *record)
  *    heap_sync        - sync a heap, for use when no WAL has been written
  *
  * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
  *
  * Indexes are not touched.  (Currently, index operations associated with
  * the commands that use this are WAL-logged and so do not need fsync.
@@ -9429,3 +9462,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
         }
     }
 }
+
+/*
+ *    heap_register_sync    - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RecordPendingSync(rel);
+    if (OidIsValid(rel->rd_rel->reltoastrelid))
+    {
+        Relation    toastrel;
+
+        toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+        RecordPendingSync(toastrel);
+        heap_close(toastrel, AccessShareLock);
+    }
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index f67d7d15df..6dd2ae5254 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
 #include "access/htup_details.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -259,7 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
         /*
          * Emit a WAL HEAP_CLEAN record showing what we did
          */
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 7d466c2588..7471d7461b 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -652,9 +652,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     }
     else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
         heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
-                                         HEAP_INSERT_SKIP_FSM |
-                                         (state->rs_use_wal ?
-                                          0 : HEAP_INSERT_SKIP_WAL));
+                                         HEAP_INSERT_SKIP_FSM);
     else
         heaptup = tup;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index b251e69703..4a46444f33 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
         map[mapByte] |= (flags << mapOffset);
         MarkBufferDirty(vmBuf);
 
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, heapBuf))
         {
             if (XLogRecPtrIsInvalid(recptr))
             {
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b88d4ccf74..976fbeb02f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2001,6 +2001,9 @@ CommitTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2229,6 +2232,9 @@ PrepareTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2542,6 +2548,7 @@ AbortTransaction(void)
     AtAbort_Notify();
     AtEOXact_RelationMap(false);
     AtAbort_Twophase();
+    smgrDoPendingSyncs(false);    /* abandone pending syncs */
 
     /*
      * Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index cff49bae9e..e9abd49070 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,6 +29,7 @@
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -63,6 +64,49 @@ typedef struct PendingRelDelete
 
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+    RelFileNode relnode;        /* relation created in same xact */
+    BlockNumber sync_above;        /* WAL-logging skipped for blocks >=
+                                 * sync_above */
+    BlockNumber truncated_to;    /* truncation WAL record was written */
+}    PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
 /*
  * RelationCreateStorage
  *        Create physical storage for a relation.
@@ -226,6 +270,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
 void
 RelationTruncate(Relation rel, BlockNumber nblocks)
 {
+    PendingRelSync *pending = NULL;
+    bool        found;
     bool        fsm;
     bool        vm;
 
@@ -260,37 +306,81 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
      */
     if (RelationNeedsWAL(rel))
     {
-        /*
-         * Make an XLOG entry reporting the file truncation.
-         */
-        XLogRecPtr    lsn;
-        xl_smgr_truncate xlrec;
+        /* no_pending_sync is ignored since new entry is created here */
+        if (!rel->pending_sync)
+        {
+            if (!pendingSyncs)
+                createPendingSyncsHash();
+            elog(DEBUG2, "RelationTruncate: accessing hash");
+            pending = (PendingRelSync *) hash_search(pendingSyncs,
+                                                 (void *) &rel->rd_node,
+                                                 HASH_ENTER, &found);
+            if (!found)
+            {
+                pending->sync_above = InvalidBlockNumber;
+                pending->truncated_to = InvalidBlockNumber;
+            }
 
-        xlrec.blkno = nblocks;
-        xlrec.rnode = rel->rd_node;
-        xlrec.flags = SMGR_TRUNCATE_ALL;
+            rel->no_pending_sync= false;
+            rel->pending_sync = pending;
+        }
 
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+        if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+            rel->pending_sync->sync_above < nblocks)
+        {
+            /*
+             * Make an XLOG entry reporting the file truncation.
+             */
+            XLogRecPtr        lsn;
+            xl_smgr_truncate xlrec;
 
-        lsn = XLogInsert(RM_SMGR_ID,
-                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+            xlrec.blkno = nblocks;
+            xlrec.rnode = rel->rd_node;
 
-        /*
-         * Flush, because otherwise the truncation of the main relation might
-         * hit the disk before the WAL record, and the truncation of the FSM
-         * or visibility map. If we crashed during that window, we'd be left
-         * with a truncated heap, but the FSM or visibility map would still
-         * contain entries for the non-existent heap pages.
-         */
-        if (fsm || vm)
-            XLogFlush(lsn);
+            XLogBeginInsert();
+            XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+            lsn = XLogInsert(RM_SMGR_ID,
+                             XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+            elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+                 nblocks);
+
+            /*
+             * Flush, because otherwise the truncation of the main relation
+             * might hit the disk before the WAL record, and the truncation of
+             * the FSM or visibility map. If we crashed during that window,
+             * we'd be left with a truncated heap, but the FSM or visibility
+             * map would still contain entries for the non-existent heap
+             * pages.
+             */
+            if (fsm || vm)
+                XLogFlush(lsn);
+
+            rel->pending_sync->truncated_to = nblocks;
+        }
     }
 
     /* Do the real work */
     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
 }
 
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+    /* First time through: initialize the hash table */
+    HASHCTL        ctl;
+
+    MemSet(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(RelFileNode);
+    ctl.entrysize = sizeof(PendingRelSync);
+    ctl.hash = tag_hash;
+    pendingSyncs = hash_create("pending relation sync table", 5,
+                               &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -368,6 +458,24 @@ smgrDoPendingDeletes(bool isCommit)
     }
 }
 
+/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+    bool found;
+
+    rel->pending_sync = NULL;
+    rel->no_pending_sync = true;
+    if (pendingSyncs)
+    {
+        elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+        hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+    }
+}
+
+
 /*
  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
  *
@@ -419,6 +527,176 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     return nrels;
 }
 
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+    bool found = true;
+    BlockNumber nblocks;
+
+    Assert(RelationNeedsWAL(rel));
+
+    /* ignore no_pending_sync since new entry is created here */
+    if (!rel->pending_sync)
+    {
+        if (!pendingSyncs)
+            createPendingSyncsHash();
+
+        /* Look up or create an entry */
+        rel->no_pending_sync = false;
+        elog(DEBUG2, "RecordPendingSync: accessing hash");
+        rel->pending_sync =
+            (PendingRelSync *) hash_search(pendingSyncs,
+                                           (void *) &rel->rd_node,
+                                           HASH_ENTER, &found);
+    }
+
+    nblocks = RelationGetNumberOfBlocks(rel);
+    if (!found)
+    {
+        rel->pending_sync->truncated_to = InvalidBlockNumber;
+        rel->pending_sync->sync_above = nblocks;
+
+        elog(DEBUG2,
+             "registering new pending sync for rel %u/%u/%u at block %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             nblocks);
+
+    }
+    else if (rel->pending_sync->sync_above == InvalidBlockNumber)
+    {
+        elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             nblocks);
+        rel->pending_sync->sync_above = nblocks;
+    }
+    else
+        elog(DEBUG2,
+             "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             rel->pending_sync->sync_above, nblocks);
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+    BlockNumber blkno = InvalidBlockNumber;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    /*
+     * no point in doing further work if we know that we don't have pending
+     * sync
+     */
+    if (!pendingSyncs || rel->no_pending_sync)
+        return true;
+
+    Assert(BufferIsValid(buf));
+
+    elog(DEBUG2, "BufferNeedsWAL(r %d, b %d): hash = %p, ent=%p, neg = %d", rel->rd_id, BufferGetBlockNumber(buf),
pendingSyncs,rel->pending_sync, rel->no_pending_sync);
 
+
+    /* do the real work */
+    if (!rel->pending_sync)
+    {
+        bool found = false;
+
+        /*
+         * Hold the entry in rel. This relies on the fact that hash entry
+         * never moves.
+         */
+        rel->pending_sync =
+            (PendingRelSync *) hash_search(pendingSyncs,
+                                           (void *) &rel->rd_node,
+                                           HASH_FIND, &found);
+        elog(DEBUG2, "BufferNeedsWAL: accessing hash : %s", found ? "found" : "not found");
+        if (!found)
+        {
+            /* we don't have no one. don't access the hash no longer */
+            rel->no_pending_sync = true;
+            return true;
+        }
+    }
+
+    blkno = BufferGetBlockNumber(buf);
+    if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+        rel->pending_sync->sync_above > blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno, rel->pending_sync->sync_above);
+        return true;
+    }
+
+    /*
+     * We have emitted a truncation record for this block.
+     */
+    if (rel->pending_sync->truncated_to != InvalidBlockNumber &&
+        rel->pending_sync->truncated_to <= blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same
xact",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno);
+        return true;
+    }
+
+    elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+         blkno);
+
+    return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    if (!pendingSyncs)
+        return;
+
+    if (isCommit)
+    {
+        HASH_SEQ_STATUS status;
+        PendingRelSync *pending;
+
+        hash_seq_init(&status, pendingSyncs);
+
+        while ((pending = hash_seq_search(&status)) != NULL)
+        {
+            if (pending->sync_above != InvalidBlockNumber)
+            {
+                FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+                smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+                elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+                     pending->relnode.dbNode, pending->relnode.relNode);
+            }
+        }
+    }
+
+    hash_destroy(pendingSyncs);
+    pendingSyncs = NULL;
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index a42861da0d..de9fc12615 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2352,8 +2352,7 @@ CopyFrom(CopyState cstate)
      *    - data is being written to relfilenode created in this transaction
      * then we can skip writing WAL.  It's safe because if the transaction
      * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the heap_sync at the bottom of this
-     * routine first.
+     * If it does commit, commit will do heap_sync().
      *
      * As mentioned in comments in utils/rel.h, the in-same-transaction test
      * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2385,7 +2384,7 @@ CopyFrom(CopyState cstate)
     {
         hi_options |= HEAP_INSERT_SKIP_FSM;
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(cstate->rel);
     }
 
     /*
@@ -2821,11 +2820,11 @@ CopyFrom(CopyState cstate)
     FreeExecutorState(estate);
 
     /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway)
+     * If we skipped writing WAL, then we will sync the heap at the end of
+     * the transaction. (We used to do it here, but it was later found out
+     * that to be safe, we must also avoid WAL-logging any subsequent
+     * actions on the pages we skipped WAL for). Indexes always use WAL.
      */
-    if (hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(cstate->rel);
 
     return processed;
 }
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 3d82edbf58..a3c3518c69 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -567,8 +567,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->hi_options = HEAP_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+    if (!XLogIsNeeded())
+        heap_register_sync(intoRelationDesc);
+    myState->hi_options = HEAP_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -617,9 +618,7 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->rel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close rel, but keep lock until commit */
     heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 23892b1b81..f1b48583ba 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -464,7 +464,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
     if (!XLogIsNeeded())
-        myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+        heap_register_sync(transientrel);
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -507,9 +507,7 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->transientrel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close transientrel, but keep lock until commit */
     heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 83a881eff3..ee8c80f34f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4481,8 +4481,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         hi_options = HEAP_INSERT_SKIP_FSM;
+
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(newrel);
     }
     else
     {
@@ -4755,8 +4756,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         FreeBulkInsertState(bistate);
 
         /* If we skipped writing WAL, then we need to sync the heap. */
-        if (hi_options & HEAP_INSERT_SKIP_WAL)
-            heap_sync(newrel);
 
         heap_close(newrel, NoLock);
     }
@@ -10811,11 +10810,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     /*
      * Create and copy all forks of the relation, and schedule unlinking of
-     * old physical files.
+     * old physical files. Pending syncs for the old node is no longer needed.
      *
      * NOTE: any conflict in relfilenode value will be caught in
      * RelationCreateStorage().
      */
+    RelationRemovePendingSync(rel);
     RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
 
     /* copy main fork */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index f9da24c491..78909bc519 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -900,7 +900,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
                  * page has been previously WAL-logged, and if not, do that
                  * now.
                  */
-                if (RelationNeedsWAL(onerel) &&
+                if (BufferNeedsWAL(onerel, buf) &&
                     PageGetLSN(page) == InvalidXLogRecPtr)
                     log_newpage_buffer(buf, true);
 
@@ -1170,7 +1170,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
             }
 
             /* Now WAL-log freezing if necessary */
-            if (RelationNeedsWAL(onerel))
+            if (BufferNeedsWAL(onerel, buf))
             {
                 XLogRecPtr    recptr;
 
@@ -1531,7 +1531,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(onerel))
+    if (BufferNeedsWAL(onerel, buffer))
     {
         XLogRecPtr    recptr;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe5706..1095f6c721 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
             BufferAccessStrategy strategy,
             bool *foundPtr);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int    rnode_comparator(const void *p1, const void *p2);
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
     /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int            i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 48f92dc430..399390e6c1 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -76,6 +76,7 @@
 #include "pgstat.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/array.h"
@@ -417,6 +418,10 @@ AllocateRelationDesc(Form_pg_class relp)
     /* which we mark as a reference-counted tupdesc */
     relation->rd_att->tdrefcount = 1;
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+
     MemoryContextSwitchTo(oldcxt);
 
     return relation;
@@ -2072,6 +2077,10 @@ formrdesc(const char *relationName, Oid relationReltype,
         relation->rd_rel->relhasindex = true;
     }
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+
     /*
      * add new reldesc to relcache
      */
@@ -3402,6 +3411,10 @@ RelationBuildLocalRelation(const char *relname,
     else
         rel->rd_rel->relfilenode = relfilenode;
 
+    /* newly built relation has no pending sync */
+    rel->no_pending_sync = true;
+    rel->pending_sync = NULL;
+
     RelationInitLockInfo(rel);    /* see lmgr.c */
 
     RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4c0256b18a..fff3fd42aa 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    0x0001
-#define HEAP_INSERT_SKIP_FSM    0x0002
-#define HEAP_INSERT_FROZEN        0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM    0x0001
+#define HEAP_INSERT_FROZEN        0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
 
 typedef struct BulkInsertStateData *BulkInsertState;
 
@@ -180,6 +179,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                    HeapTuple tup);
 
+extern void heap_register_sync(Relation relation);
 extern void heap_sync(Relation relation);
 extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
 
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ef52d85803..49d93cd01f 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
 extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3cce3906a0..9fae7c6ae5 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                 ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                    bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                        ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index c26c395b0b..040ae3a07a 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -218,6 +218,14 @@ typedef struct RelationData
 
     /* use "struct" here to avoid needing to include pgstat.h: */
     struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+    /*
+     * no_pending_sync is true if this relation is known not to have pending
+     * syncs.  Elsewise searching for registered sync is required if
+     * pending_sync is NULL.
+     */
+    bool                   no_pending_sync;
+    struct PendingRelSync *pending_sync;
 } RelationData;
 
 
-- 
2.16.3

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

04 июля 2018 г., 10:59:12

On Fri, Mar 30, 2018 at 10:06:46AM +0900, Kyotaro HORIGUCHI wrote:
> Hello. I found that c203d6cf81 hit this and this is the rebased
> version on the current master.

Okay, as this is visibly the oldest item in this commit fest, Andrew has
asked me to look at a solution which would allow us to definitely close
the loop for all maintained branches. In consequence, I have been
looking at this problem. Here are my thoughts:
- The set of errors reported on this thread are alarming, depending on
the scenarios used, we could have "could not read file" stuff, or even
data loss after WAL replay comes and wipes out everything.
- Disabling completely the TRUNCATE optimization is definitely not cool,
as there could be an impact for users.
- Removing wal_level = minimal is not acceptable as well, as some people
rely on this feature.
- Rewriting the sync handling of heap relation files in an invasive way
may be something to investigate and improve on HEAD (I am not really
convinced about that actually for the optimizations discussed on this
thread as this may result in more bugs than actual fixes), but that
would do nothing for back-branches.

Hence I propose the patch attached which disables the TRUNCATE and COPY
optimizations for two cases, which are the ones actually causing
problems. One solution has been presented by Simon here for COPY, which
is to disable the optimization when there are no blocks on a relation
with wal_level = minimal:
https://www.postgresql.org/message-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=YETxvmuu1YK4UMXQY2gaGw@mail.gmail.com
For back-patching, I find that really appealing.

The second thing that the patch attached does is to tweak
ExecuteTruncateGuts so as the TRUNCATE optimization never runs for
wal_level = minimal.

Another thing that this patch adds is a set of regression tests to
stress all the various scenarios presented on this thread with table
creation, INSERT, COPY and TRUNCATE running in the same transactions for
both wal_level = minimal and replica, which make sure that there are no
failures and no actual data loss. The test is useful anyway, as any
patch presented did not present a way to test easily all the scenarios,
except for a bash script present upthread, but this discarded some of
the cases.

I would propose that for a back-patch, except for the test which can go
down easily to 9.6 but I have not tested that yet.

Thoughts?
--
Michael

Вложения

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Andrew Dunstan

Дата:

04 июля 2018 г., 17:55:53

On Wed, Jul 4, 2018 at 12:59 AM, Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Mar 30, 2018 at 10:06:46AM +0900, Kyotaro HORIGUCHI wrote:
>> Hello.  I found that c203d6cf81 hit this and this is the rebased
>> version on the current master.
>
> Okay, as this is visibly the oldest item in this commit fest, Andrew has
> asked me to look at a solution which would allow us to definitely close
> the loop for all maintained branches.  In consequence, I have been
> looking at this problem.  Here are my thoughts:
> - The set of errors reported on this thread are alarming, depending on
> the scenarios used, we could have "could not read file" stuff, or even
> data loss after WAL replay comes and wipes out everything.
> - Disabling completely the TRUNCATE optimization is definitely not cool,
> as there could be an impact for users.
> - Removing wal_level = minimal is not acceptable as well, as some people
> rely on this feature.
> - Rewriting the sync handling of heap relation files in an invasive way
> may be something to investigate and improve on HEAD (I am not really
> convinced about that actually for the optimizations discussed on this
> thread as this may result in more bugs than actual fixes), but that
> would do nothing for back-branches.
>
> Hence I propose the patch attached which disables the TRUNCATE and COPY
> optimizations for two cases, which are the ones actually causing
> problems.  One solution has been presented by Simon here for COPY, which
> is to disable the optimization when there are no blocks on a relation
> with wal_level = minimal:
> https://www.postgresql.org/message-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=YETxvmuu1YK4UMXQY2gaGw@mail.gmail.com
> For back-patching, I find that really appealing.
>
> The second thing that the patch attached does is to tweak
> ExecuteTruncateGuts so as the TRUNCATE optimization never runs for
> wal_level = minimal.
>
> Another thing that this patch adds is a set of regression tests to
> stress all the various scenarios presented on this thread with table
> creation, INSERT, COPY and TRUNCATE running in the same transactions for
> both wal_level = minimal and replica, which make sure that there are no
> failures and no actual data loss.  The test is useful anyway, as any
> patch presented did not present a way to test easily all the scenarios,
> except for a bash script present upthread, but this discarded some of
> the cases.
>
> I would propose that for a back-patch, except for the test which can go
> down easily to 9.6 but I have not tested that yet.
>


Many thanks for working on this.

+1 for these changes, even though the TRUNCATE fix looks perverse. If
anyone wants to propose further optimizations in this area this would
at least give us a startpoint which is correct.

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

05 июля 2018 г., 13:11:53

On Wed, Jul 04, 2018 at 07:55:53AM -0400, Andrew Dunstan wrote:
> Many thanks for working on this.

No problem.  Thanks for the lookup.

> +1 for these changes, even though the TRUNCATE fix looks perverse. If
> anyone wants to propose further optimizations in this area this would
> at least give us a startpoint which is correct.

Yes, that's exactly what I am coming at.  The optimizations which are
currently broken just cannot and should not be used.  If anybody wishes
to improve the current set of optimizations in place for wal_level =
minimal, let's also consider the other patch.  Based on the tests I sent
in the previous patch, I have compiled five scenarios by the way:
1) BEGIN -> CREATE TABLE -> TRUNCATE -> COMMIT.
With wal_level = minimal, this fails hard with "could not read block 0
blah" when trying to read the data after commit..
2) BEGIN -> CREATE -> INSERT -> TRUNCATE -> INSERT -> COMMIT, and this
one reports an empty table, without failing, but there should be tuples
from the INSERT.
3) BEGIN -> CREATE -> INSERT -> TRUNCATE -> COPY -> COMMIT, which also
reports an empty table while there should be tuples from the COPY.
4) BEGIN -> CREATE -> INSERT -> TRUNCATE -> INSERT -> COPY -> INSERT ->
COMMIT, which fails at WAL replay with a PANIC: invalid max offset
number.
5) BEGIN -> CREATE -> INSERT -> COPY -> COMMIT, which sees only the
tuple inserted, causing an incorrect number of tuples.  If you reverse
the COPY and INSERT, then this is able to pass.

This stuff really generates a good number of different failures.  There
have been so many people participating on this thread that discussing
more this approach would be surely a good step forward, and this
summarizes quite nicely the set of failures discussed until now here.  I
would be happy to push forward with this patch to close all the holes
mentioned.
--
Michael

Вложения

signature.asc

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Heikki Linnakangas

Дата:

10 июля 2018 г., 20:35:58

Thanks for picking this up!

(I hope this gets through the email filters this time, sending a shell 
script seems to be difficult. I also trimmed the CC list, if that helps.)

On 04/07/18 07:59, Michael Paquier wrote:
> Hence I propose the patch attached which disables the TRUNCATE and COPY
> optimizations for two cases, which are the ones actually causing
> problems.  One solution has been presented by Simon here for COPY, which
> is to disable the optimization when there are no blocks on a relation
> with wal_level = minimal:
> https://www.postgresql.org/message-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=YETxvmuu1YK4UMXQY2gaGw@mail.gmail.com
> For back-patching, I find that really appealing.

This fails in the case that there are any WAL-logged changes to the 
table while the COPY is running. That can happen at least if the table 
has an INSERT trigger, that performs operations on the same table, and 
the COPY fires the trigger. That scenario is covered by the little bash 
script I posted earlier in this thread 
(https://www.postgresql.org/message-id/55AFC302.1060805%40iki.fi). 
Attached is a new version of that script, updated to make it work with v11.

> The second thing that the patch attached does is to tweak
> ExecuteTruncateGuts so as the TRUNCATE optimization never runs for
> wal_level = minimal.

If we go down that route, let's at least keep the TRUNCATE optimization 
for temporary and unlogged tables.

- Heikki

Вложения

test-wal-minimal-2-bash-script

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

11 июля 2018 г., 09:32:41

On Tue, Jul 10, 2018 at 05:35:58PM +0300, Heikki Linnakangas wrote:
> Thanks for picking this up!
>
> (I hope this gets through the email filters this time, sending a shell
> script seems to be difficult. I also trimmed the CC list, if that helps.)
>
> On 04/07/18 07:59, Michael Paquier wrote:
>> Hence I propose the patch attached which disables the TRUNCATE and COPY
>> optimizations for two cases, which are the ones actually causing
>> problems.  One solution has been presented by Simon here for COPY, which
>> is to disable the optimization when there are no blocks on a relation
>> with wal_level = minimal:
>> https://www.postgresql.org/message-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=YETxvmuu1YK4UMXQY2gaGw@mail.gmail.com
>> For back-patching, I find that really appealing.
>
> This fails in the case that there are any WAL-logged changes to the table
> while the COPY is running. That can happen at least if the table has an
> INSERT trigger, that performs operations on the same table, and the COPY
> fires the trigger. That scenario is covered by the little bash script I
> posted earlier in this thread
> (https://www.postgresql.org/message-id/55AFC302.1060805%40iki.fi). Attached
> is a new version of that script, updated to make it work with v11.

Thanks for the pointer.  My tap test has been covering two out of the
three scenarios you have in your script.  I have been able to convert
the extra as the attached, and I have added as well an extra test with
TRUNCATE triggers.  So it seems to me that we want to disable the
optimization if any type of trigger are defined on the relation copied
to as it could be possible that these triggers work on the blocks copied
as well, for any BEFORE/AFTER and STATEMENT/ROW triggers.  What do you
think?

>> The second thing that the patch attached does is to tweak
>> ExecuteTruncateGuts so as the TRUNCATE optimization never runs for
>> wal_level = minimal.
>
> If we go down that route, let's at least keep the TRUNCATE optimization for
> temporary and unlogged tables.

Yes, that sounds right.  Fixed as well.  I have additionally done more
work on the comments.

Thoughts?
--
Michael

On Thu, Jul 12, 2018 at 05:12:21PM +0300, Heikki Linnakangas wrote:
> Doesn't have to be a trigger, could be a CHECK constraint, datatype input
> function, etc. Admittedly, having a datatype input function that inserts to
> the table is worth a "huh?", but I'm feeling very confident that we can
> catch all such cases, and some of them might even be sensible.

Sure, but do we want to be that invasive?  Triggers are easy enough to
block because those are available directly within cstate so you would
know if those are triggered.  CHECK constraint can be also easily looked
after by looking at the Relation information, and actually as DEFAULT
values could have an expression we'd want to block them, no?  The input
datatype is well, more tricky to deal with as there is no actual way to
know if the INSERT is happening within the context of a COPY and this
could be just C code.  One way to tackle that would be to enforce the
optimization to not be used if a non-system data type is used when doing
COPY...

Disabling entirely the optimization for any relation which has a CHECK
constraint or DEFAULT expression basically applies to a hell lot of
them, which makes the optimization, at least it seems to me, useless
because it is never going to apply to most real-world cases.
--
Michael

Вложения

signature.asc

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Robert Haas

Дата:

17 июля 2018 г., 00:38:39

On Thu, Jul 12, 2018 at 10:12 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Doesn't have to be a trigger, could be a CHECK constraint, datatype input
> function, etc. Admittedly, having a datatype input function that inserts to
> the table is worth a "huh?", but I'm feeling very confident that we can
> catch all such cases, and some of them might even be sensible.

Is this sentence missing a "not"?  i.e. "I'm not feeling very confident"?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Heikki Linnakangas

Дата:

17 июля 2018 г., 00:41:51

On 16 July 2018 21:38:39 EEST, Robert Haas <robertmhaas@gmail.com> wrote:
>On Thu, Jul 12, 2018 at 10:12 AM, Heikki Linnakangas <hlinnaka@iki.fi>
>wrote:
>> Doesn't have to be a trigger, could be a CHECK constraint, datatype
>input
>> function, etc. Admittedly, having a datatype input function that
>inserts to
>> the table is worth a "huh?", but I'm feeling very confident that we
>can
>> catch all such cases, and some of them might even be sensible.
>
>Is this sentence missing a "not"?  i.e. "I'm not feeling very
>confident"?

Yes, sorry.

- Heikki

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Alvaro Herrera

Дата:

17 июля 2018 г., 02:14:09

On 2018-Jul-12, Heikki Linnakangas wrote:

> > > Thanks for the pointer.  My tap test has been covering two out of
> > > the three scenarios you have in your script.  I have been able to
> > > convert the extra as the attached, and I have added as well an
> > > extra test with TRUNCATE triggers.  So it seems to me that we want
> > > to disable the optimization if any type of trigger are defined on
> > > the relation copied to as it could be possible that these triggers
> > > work on the blocks copied as well, for any BEFORE/AFTER and
> > > STATEMENT/ROW triggers.  What do you think?
> > 
> > Yeah, this seems like the only sane approach.
> 
> Doesn't have to be a trigger, could be a CHECK constraint, datatype
> input function, etc. Admittedly, having a datatype input function that
> inserts to the table is worth a "huh?", but I'm feeling very confident
> that we can catch all such cases, and some of them might even be
> sensible.

A counterexample could be a a JSON compresion scheme that uses a catalog
for a dictionary of keys.  Hasn't this been described already?  Also not
completely out of the question for GIS data, I think (Not sure if
PostGIS does this already.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

17 июля 2018 г., 06:01:29

On Mon, Jul 16, 2018 at 09:41:51PM +0300, Heikki Linnakangas wrote:
> On 16 July 2018 21:38:39 EEST, Robert Haas <robertmhaas@gmail.com> wrote:
>>On Thu, Jul 12, 2018 at 10:12 AM, Heikki Linnakangas <hlinnaka@iki.fi>
>>wrote:
>>> Doesn't have to be a trigger, could be a CHECK constraint, datatype
>>input
>>> function, etc. Admittedly, having a datatype input function that
>>inserts to
>>> the table is worth a "huh?", but I'm feeling very confident that we
>>can
>>> catch all such cases, and some of them might even be sensible.
>>
>>Is this sentence missing a "not"?  i.e. "I'm not feeling very
>>confident"?
>
> Yes, sorry.

This explains a lot :p

I doubt as well that we'd be able to catch all the holes as well as the
conditions where the optimization could be run safely are rather
basically impossible to catch beforehand.  I'd like to vote for getting
rid of this optimization for COPY, this can hurt more than it is
helpful.  Per the lack of complaints, this could happen only in HEAD?
--
Michael

Вложения

signature.asc

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

17 июля 2018 г., 15:24:22

Hello.

At Mon, 16 Jul 2018 16:14:09 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
<20180716201409.2qfcneo4qkdwjvpv@alvherre.pgsql>
> On 2018-Jul-12, Heikki Linnakangas wrote:
> 
> > > > Thanks for the pointer.  My tap test has been covering two out of
> > > > the three scenarios you have in your script.  I have been able to
> > > > convert the extra as the attached, and I have added as well an
> > > > extra test with TRUNCATE triggers.  So it seems to me that we want
> > > > to disable the optimization if any type of trigger are defined on
> > > > the relation copied to as it could be possible that these triggers
> > > > work on the blocks copied as well, for any BEFORE/AFTER and
> > > > STATEMENT/ROW triggers.  What do you think?
> > > 
> > > Yeah, this seems like the only sane approach.
> > 
> > Doesn't have to be a trigger, could be a CHECK constraint, datatype
> > input function, etc. Admittedly, having a datatype input function that
> > inserts to the table is worth a "huh?", but I'm feeling very confident
> > that we can catch all such cases, and some of them might even be
> > sensible.
> 
> A counterexample could be a a JSON compresion scheme that uses a catalog
> for a dictionary of keys.  Hasn't this been described already?  Also not
> completely out of the question for GIS data, I think (Not sure if
> PostGIS does this already.)

In the third case, IIUC, disabling bulk-insertion after any
WAL-logging insertion happend seems to work. The attached diff to
v2 patch makes the three TAP tests pass. It uses relcache to
store the last insertion XID but it will not be invalidated
during a COPY operation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 72395a50b8..e5c651b498 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2509,6 +2509,18 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 
     MarkBufferDirty(buffer);
 
+    /*
+     * Bulk insertion is not safe after a WAL-logging insertion in the same
+     * transaction. We don't start bulk insertion under inhibitin conditions
+     * but we also need to cancel WAL-skipping in the case where WAL-logging
+     * insertion happens during a bulk insertion. This happens by anything
+     * that can insert a tuple during bulk insertion such like triggers,
+     * constraints or type conversions. We need not worry about relcache flush
+     * happening while a bulk insertion is running.
+     */
+    if (relation->last_logged_insert_xid == xid)
+        options &= ~HEAP_INSERT_SKIP_WAL;
+
     /* XLOG stuff */
     if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
     {
@@ -2582,6 +2594,12 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
         recptr = XLogInsert(RM_HEAP_ID, info);
 
         PageSetLSN(page, recptr);
+
+        /*
+         * If this happens during a bulk insertion, stop WAL skipping for the
+         * rest of the current command.
+         */
+        relation->last_logged_insert_xid = xid;
     }
 
     END_CRIT_SECTION();
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 7674369613..7b9a7af2d2 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2416,10 +2416,8 @@ CopyFrom(CopyState cstate)
     {
         hi_options |= HEAP_INSERT_SKIP_FSM;
 
-        if (!XLogIsNeeded() &&
-            cstate->rel->trigdesc == NULL &&
-            RelationGetNumberOfBlocks(cstate->rel) == 0)
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+        if (!XLogIsNeeded() && RelationGetNumberOfBlocks(cstate->rel) == 0)
+             hi_options |= HEAP_INSERT_SKIP_WAL;
     }
 
     /*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 30a956822f..34a692a497 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1575,6 +1575,9 @@ ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_logged,
         {
             /* Immediate, non-rollbackable truncation is OK */
             heap_truncate_one_rel(rel);
+
+            /* Allow bulk-insert */
+            rel->last_logged_insert_xid = InvalidTransactionId;
         }
         else
         {
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 6125421d39..99fb7e1dd8 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1243,6 +1243,8 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
     /* It's fully valid */
     relation->rd_isvalid = true;
 
+    relation->last_logged_insert_xid = InvalidTransactionId;
+
     return relation;
 }
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index c97f9d1b43..6ee575ad14 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -188,6 +188,9 @@ typedef struct RelationData
 
     /* use "struct" here to avoid needing to include pgstat.h: */
     struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+    /* XID of the last transaction on which WAL-logged insertion happened */
+    TransactionId        last_logged_insert_xid;
 } RelationData;

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Andrew Dunstan

Дата:

17 июля 2018 г., 18:28:47


On 07/16/2018 08:01 PM, Michael Paquier wrote:
>
>
> I doubt as well that we'd be able to catch all the holes as well as the
> conditions where the optimization could be run safely are rather
> basically impossible to catch beforehand.  I'd like to vote for getting
> rid of this optimization for COPY, this can hurt more than it is
> helpful.  Per the lack of complaints, this could happen only in HEAD?


Well, we'd be getting rid of it because of a danger of data loss which 
we can't otherwise mitigate. Maybe it does need to be backpatched, even 
if we haven't had complaints.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Robert Haas

Дата:

18 июля 2018 г., 16:42:10

On Tue, Jul 17, 2018 at 8:28 AM, Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
> Well, we'd be getting rid of it because of a danger of data loss which we
> can't otherwise mitigate. Maybe it does need to be backpatched, even if we
> haven't had complaints.

What's wrong with the approach proposed in
http://postgr.es/m/55AFC302.1060805@iki.fi ?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

18 июля 2018 г., 19:06:22

On Wed, Jul 18, 2018 at 06:42:10AM -0400, Robert Haas wrote:
> On Tue, Jul 17, 2018 at 8:28 AM, Andrew Dunstan
> <andrew.dunstan@2ndquadrant.com> wrote:
>> Well, we'd be getting rid of it because of a danger of data loss which we
>> can't otherwise mitigate. Maybe it does need to be backpatched, even if we
>> haven't had complaints.
>
> What's wrong with the approach proposed in
> http://postgr.es/m/55AFC302.1060805@iki.fi ?

For back-branches that's very invasive so that seems risky to me
particularly seeing the low number of complaints on the matter.
--
Michael

Вложения

signature.asc

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Robert Haas

Дата:

18 июля 2018 г., 19:29:01

On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier <michael@paquier.xyz> wrote:
>> What's wrong with the approach proposed in
>> http://postgr.es/m/55AFC302.1060805@iki.fi ?
>
> For back-branches that's very invasive so that seems risky to me
> particularly seeing the low number of complaints on the matter.

Hmm. I think that if you disable the optimization, you're betting that
people won't mind losing performance in this case in a maintenance
release.  If you back-patch Heikki's approach, you're betting that the
committed version doesn't have any bugs that are worse than the status
quo.  Personally, I'd rather take the latter bet.  Maybe the patch
isn't all there yet, but that seems like something we can work
towards.  If we just give up and disable the optimization, we won't
know how many people we ticked off or how badly until after we've done
it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Heikki Linnakangas

Дата:

18 июля 2018 г., 20:58:16

On 18/07/18 16:29, Robert Haas wrote:
> On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier <michael@paquier.xyz> wrote:
>>> What's wrong with the approach proposed in
>>> http://postgr.es/m/55AFC302.1060805@iki.fi ?
>>
>> For back-branches that's very invasive so that seems risky to me
>> particularly seeing the low number of complaints on the matter.
> 
> Hmm. I think that if you disable the optimization, you're betting that
> people won't mind losing performance in this case in a maintenance
> release.  If you back-patch Heikki's approach, you're betting that the
> committed version doesn't have any bugs that are worse than the status
> quo.  Personally, I'd rather take the latter bet.  Maybe the patch
> isn't all there yet, but that seems like something we can work
> towards.  If we just give up and disable the optimization, we won't
> know how many people we ticked off or how badly until after we've done
> it.

Yeah. I'm not happy about backpatching a big patch like what I proposed, 
and Kyotaro developed further. But I think it's the least bad option we 
have, the other options discussed seem even worse.

One way to review the patch is to look at what it changes, when 
wal_level is *not* set to minimal, i.e. what risk or overhead does it 
pose to users who are not affected by this bug? It seems pretty safe to me.

The other aspect is, how confident are we that this actually fixes the 
bug, with least impact to users using wal_level='minimal'? I think it's 
the best shot we have so far. All the other proposals either don't fully 
fix the bug, or hurt performance in some legit cases.

I'd suggest that we continue based on the patch that Kyotaro posted at 
https://www.postgresql.org/message-id/20180330.100646.86008470.horiguchi.kyotaro%40lab.ntt.co.jp.

- Heikki

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

25 июля 2018 г., 20:08:33

On Wed, Jul 18, 2018 at 05:58:16PM +0300, Heikki Linnakangas wrote:
> I'd suggest that we continue based on the patch that Kyotaro posted at
> https://www.postgresql.org/message-id/20180330.100646.86008470.horiguchi.kyotaro%40lab.ntt.co.jp.

Whatever happens here, perhaps one way to move on would be to commit
first the TAP test that I proposed upthread.  That would not work for
wal_level=minimal so this part should be commented out, but that's
easier this way to test basically all the cases we talked about with any
approach taken.
--
Michael

Вложения

signature.asc

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

26 июля 2018 г., 11:50:11

Hello.

At Wed, 25 Jul 2018 23:08:33 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180725140833.GC6660@paquier.xyz>
> On Wed, Jul 18, 2018 at 05:58:16PM +0300, Heikki Linnakangas wrote:
> > I'd suggest that we continue based on the patch that Kyotaro posted at
> > https://www.postgresql.org/message-id/20180330.100646.86008470.horiguchi.kyotaro%40lab.ntt.co.jp.
> 
> Whatever happens here, perhaps one way to move on would be to commit
> first the TAP test that I proposed upthread.  That would not work for
> wal_level=minimal so this part should be commented out, but that's
> easier this way to test basically all the cases we talked about with any
> approach taken.

https://www.postgresql.org/message-id/20180704045912.GG1672@paquier.xyz

However I'm not sure the policy (if any) allows us to add a test
that should success, I'm not opposed to do that. But even if we
did that, it won't be visible to other than us in this thread. It
seems to me more or less similar to pasting a boilerplate that
points to the above message in this thread, or just writing "this
patch passes "the" test.".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Andrew Dunstan

Дата:

28 июля 2018 г., 01:26:24


On 07/18/2018 10:58 AM, Heikki Linnakangas wrote:
> On 18/07/18 16:29, Robert Haas wrote:
>> On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier 
>> <michael@paquier.xyz> wrote:
>>>> What's wrong with the approach proposed in
>>>> http://postgr.es/m/55AFC302.1060805@iki.fi ?
>>>
>>> For back-branches that's very invasive so that seems risky to me
>>> particularly seeing the low number of complaints on the matter.
>>
>> Hmm. I think that if you disable the optimization, you're betting that
>> people won't mind losing performance in this case in a maintenance
>> release.  If you back-patch Heikki's approach, you're betting that the
>> committed version doesn't have any bugs that are worse than the status
>> quo.  Personally, I'd rather take the latter bet.  Maybe the patch
>> isn't all there yet, but that seems like something we can work
>> towards.  If we just give up and disable the optimization, we won't
>> know how many people we ticked off or how badly until after we've done
>> it.
>
> Yeah. I'm not happy about backpatching a big patch like what I 
> proposed, and Kyotaro developed further. But I think it's the least 
> bad option we have, the other options discussed seem even worse.
>
> One way to review the patch is to look at what it changes, when 
> wal_level is *not* set to minimal, i.e. what risk or overhead does it 
> pose to users who are not affected by this bug? It seems pretty safe 
> to me.
>
> The other aspect is, how confident are we that this actually fixes the 
> bug, with least impact to users using wal_level='minimal'? I think 
> it's the best shot we have so far. All the other proposals either 
> don't fully fix the bug, or hurt performance in some legit cases.
>
> I'd suggest that we continue based on the patch that Kyotaro posted at 
> https://www.postgresql.org/message-id/20180330.100646.86008470.horiguchi.kyotaro%40lab.ntt.co.jp.
>



I have just spent some time reviewing Kyatoro's patch. I'm a bit 
nervous, too, given the size. But I'm also nervous about leaving things 
as they are. I suspect the reason we haven't heard more about this is 
that these days use of "wal_level = minimal" is relatively rare.

I like the fact that this is closer to being a real fix rather than just 
throwing out the optimization. Like Heikki I've come round to the view 
that something like this is the least bad option.

The code looks good to me - some comments might be helpful in 
heap_xlog_update()

Do we want to try this on HEAD and then backpatch it? Do we want to add 
some testing along the lines Michael suggested?

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

11 октября 2018 г., 07:42:35

Hello.

At Fri, 27 Jul 2018 15:26:24 -0400, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote in
<d0c9e197-5219-c094-418a-e5a6fbd8cdda@2ndQuadrant.com>
>
>
> On 07/18/2018 10:58 AM, Heikki Linnakangas wrote:
> > On 18/07/18 16:29, Robert Haas wrote:
> >> On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier <michael@paquier.xyz>
> >> wrote:
> >>>> What's wrong with the approach proposed in
> >>>> http://postgr.es/m/55AFC302.1060805@iki.fi ?
> >>>
> >>> For back-branches that's very invasive so that seems risky to me
> >>> particularly seeing the low number of complaints on the matter.
> >>
> >> Hmm. I think that if you disable the optimization, you're betting that
> >> people won't mind losing performance in this case in a maintenance
> >> release.  If you back-patch Heikki's approach, you're betting that the
> >> committed version doesn't have any bugs that are worse than the status
> >> quo.  Personally, I'd rather take the latter bet.  Maybe the patch
> >> isn't all there yet, but that seems like something we can work
> >> towards.  If we just give up and disable the optimization, we won't
> >> know how many people we ticked off or how badly until after we've done
> >> it.
> >
> > Yeah. I'm not happy about backpatching a big patch like what I
> > proposed, and Kyotaro developed further. But I think it's the least
> > bad option we have, the other options discussed seem even worse.
> >
> > One way to review the patch is to look at what it changes, when
> > wal_level is *not* set to minimal, i.e. what risk or overhead does it
> > pose to users who are not affected by this bug? It seems pretty safe
> > to me.
> >
> > The other aspect is, how confident are we that this actually fixes the
> > bug, with least impact to users using wal_level='minimal'? I think
> > it's the best shot we have so far. All the other proposals either
> > don't fully fix the bug, or hurt performance in some legit cases.
> >
> > I'd suggest that we continue based on the patch that Kyotaro posted at
> > https://www.postgresql.org/message-id/20180330.100646.86008470.horiguchi.kyotaro%40lab.ntt.co.jp.
> >
>
>
>
> I have just spent some time reviewing Kyatoro's patch. I'm a bit
> nervous, too, given the size. But I'm also nervous about leaving
> things as they are. I suspect the reason we haven't heard more about
> this is that these days use of "wal_level = minimal" is relatively
> rare.

Thank you for lokking this (and sorry for the late response).

> I like the fact that this is closer to being a real fix rather than
> just throwing out the optimization. Like Heikki I've come round to the
> view that something like this is the least bad option.
>
> The code looks good to me - some comments might be helpful in
> heap_xlog_update()

Thanks. It is intending to avoid PANIC for a broken record. I
reverted the part since PANIC would be preferable in the case.

> Do we want to try this on HEAD and then backpatch it? Do we want to
> add some testing along the lines Michael suggested?

44cac93464 hit this, rebased. And added Michael's TAP test
contained in [1] as patch 0001.

I regard [2] as an orthogonal issue.

The previous patch didn't care of the case of
BEGIN;CREATE;TRUNCATE;COMMIT case. This version contains a "fix"
of nbtree (patch 0003) so that FPI of the metapage is always
emitted when building an empty index. On the other hand this
emits useless one or two FPIs (136 bytes each) on TRUNCATE in a
separate transaction, but it won't matter so much.. Other index
methods don't have this problem. Some other AMs emits initialize
WALs even in minimal mode.

This still has a bit too many elog(DEBUG2)s to see how it is
working. I'm going to remove most of them in the final version.

I started to prefix the file names with version 2.

regards.

[1] https://www.postgresql.org/message-id/20180711033241.GQ1661@paquier.xyz

[2] https://www.postgresql.org/message-id/CAKJS1f9iF55cwx-LUOreRokyi9UZESXOLHuFDkt0wksZN+KqWw@mail.gmail.com

    or

    https://commitfest.postgresql.org/20/1811/

--
Kyotaro Horiguchi
NTT Open Source Software Center
From 092e7412f361c39530911d4592fb46653ca027ab Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/3] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/016_wal_optimize.pl | 192 ++++++++++++++++++++++++++++++++
 1 file changed, 192 insertions(+)
 create mode 100644 src/test/recovery/t/016_wal_optimize.pl

diff --git a/src/test/recovery/t/016_wal_optimize.pl b/src/test/recovery/t/016_wal_optimize.pl
new file mode 100644
index 0000000000..310772a2b3
--- /dev/null
+++ b/src/test/recovery/t/016_wal_optimize.pl
@@ -0,0 +1,192 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 14;
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    # Primary needs to have wal_level = minimal here
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+    $node->start;
+
+    # Test direct truncation optimization.  No tuples
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test1 (id serial PRIMARY KEY);
+        TRUNCATE test1;
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+    is($result, qq(0),
+       "wal_level = $wal_level, optimized truncation with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2 (id serial PRIMARY KEY);
+        INSERT INTO test2 VALUES (DEFAULT);
+        TRUNCATE test2;
+        INSERT INTO test2 VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with inserted table");
+
+    # Data file for COPY query in follow-up tests.
+    my $basedir = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after the
+    # truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test3;
+        COPY test3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+    is($result, qq(3),
+       "wal_level = $wal_level, optimized truncation with copied table");
+
+    # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test4;
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COPY test4 FROM '$copy_file' DELIMITER ',';
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+    is($result, qq(5),
+       "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test5 VALUES (DEFAULT, 1);
+        COPY test5 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER test6_before_row_insert
+          BEFORE INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+        CREATE TRIGGER test6_after_row_insert
+          AFTER INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+        COPY test6 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+    is($result, qq(9),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER test7_before_stat_truncate
+          BEFORE TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+        CREATE TRIGGER test7_after_stat_truncate
+          AFTER TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+        INSERT INTO test7 VALUES (DEFAULT, 1);
+        TRUNCATE test7;
+        COPY test7 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    $node->teardown_node;
+    $node->clean_node;
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.16.3

From 76d5e5ed12ef510bf7ea43a948979b052bc26aee Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 12 Sep 2017 13:01:33 +0900
Subject: [PATCH 2/3] Fix WAL logging problem

We skip WAL logging for some bulk insertion operations, but this can
cause curruption when such operations are mixd with truncation.  This
patch fixes the issue by introducing buffer-resolution WAL emittion.
With this patch, in minimal mode we still skip WAL logging for
extended pages and fsync them at commit time but write for exiting
pages or for the pages re-extended after a WAL-logged trancation.
---
 src/backend/access/heap/heapam.c        | 100 +++++++---
 src/backend/access/heap/pruneheap.c     |   3 +-
 src/backend/access/heap/rewriteheap.c   |   4 +-
 src/backend/access/heap/visibilitymap.c |   3 +-
 src/backend/access/transam/xact.c       |   7 +
 src/backend/catalog/storage.c           | 321 +++++++++++++++++++++++++++++---
 src/backend/commands/copy.c             |  13 +-
 src/backend/commands/createas.c         |   9 +-
 src/backend/commands/matview.c          |   6 +-
 src/backend/commands/tablecmds.c        |   8 +-
 src/backend/commands/vacuumlazy.c       |   6 +-
 src/backend/storage/buffer/bufmgr.c     |  40 +++-
 src/backend/utils/cache/relcache.c      |  13 ++
 src/include/access/heapam.h             |   8 +-
 src/include/catalog/storage.h           |   5 +-
 src/include/storage/bufmgr.h            |   2 +
 src/include/utils/rel.h                 |   8 +
 17 files changed, 471 insertions(+), 85 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5f1a69ca53..97b4159362 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
  *      the POSTGRES heap access method used for all POSTGRES
  *      relations.
  *
+ * WAL CONSIDERATIONS
+ *      All heap operations are normally WAL-logged. but there are a few
+ *      exceptions. Temporary and unlogged relations never need to be
+ *      WAL-logged, but we can also skip WAL-logging for a table that was
+ *      created in the same transaction, if we don't need WAL for PITR or
+ *      WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *      the file to disk at COMMIT instead.
+ *
+ *      The same-relation optimization is not employed automatically on all
+ *      updates to a table that was created in the same transacton, because
+ *      for a small number of changes, it's cheaper to just create the WAL
+ *      records than fsyncing() the whole relation at COMMIT. It is only
+ *      worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *      or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *      operation; it will cause any subsequent updates to the table to skip
+ *      WAL-logging, if possible, and cause the heap to be synced to disk at
+ *      COMMIT.
+ *
+ *      To make that work, all modifications to heap must use
+ *      BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ *      for the given block.
+ *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -57,6 +79,7 @@
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
 #include "catalog/index.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -2413,12 +2436,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2449,6 +2466,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * TID where the tuple was stored.  But note that any toasting of fields
  * within the tuple data is NOT reflected into *tup.
  */
+extern HTAB *pendingSyncs;
 Oid
 heap_insert(Relation relation, HeapTuple tup, CommandId cid,
             int options, BulkInsertState bistate)
@@ -2522,7 +2540,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2723,12 +2741,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     int            ndone;
     PGAlignedBlock scratch;
     Page        page;
-    bool        needwal;
     Size        saveFreeSpace;
     bool        need_tuple_data = RelationIsLogicallyLogged(relation);
     bool        need_cids = RelationIsAccessibleInLogicalDecoding(relation);
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -2770,6 +2786,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
         Buffer        vmbuffer = InvalidBuffer;
         bool        all_visible_cleared = false;
         int            nthispage;
+        bool        needwal;
 
         CHECK_FOR_INTERRUPTS();
 
@@ -2781,6 +2798,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
                                            InvalidBuffer, options, bistate,
                                            &vmbuffer, NULL);
         page = BufferGetPage(buffer);
+        needwal = BufferNeedsWAL(relation, buffer);
 
         /* NO EREPORT(ERROR) from here till changes are logged */
         START_CRIT_SECTION();
@@ -3343,7 +3361,7 @@ l1:
      * NB: heap_abort_speculative() uses the same xlog record and replay
      * routines.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -4322,7 +4340,8 @@ l2:
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer) ||
+        BufferNeedsWAL(relation, newbuf))
     {
         XLogRecPtr    recptr;
 
@@ -5294,7 +5313,7 @@ failed:
      * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
      * entries for everything anyway.)
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, *buffer))
     {
         xl_heap_lock xlrec;
         XLogRecPtr    recptr;
@@ -6038,7 +6057,7 @@ l4:
         MarkBufferDirty(buf);
 
         /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf))
         {
             xl_heap_lock_updated xlrec;
             XLogRecPtr    recptr;
@@ -6198,7 +6217,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
     htup->t_ctid = tuple->t_self;
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_confirm xlrec;
         XLogRecPtr    recptr;
@@ -6331,7 +6350,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
      * The WAL records generated here match heap_delete().  The same recovery
      * routines are used.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -6440,7 +6459,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_inplace xlrec;
         XLogRecPtr    recptr;
@@ -7636,7 +7655,7 @@ log_heap_clean(Relation reln, Buffer buffer,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
 
     xlrec.latestRemovedXid = latestRemovedXid;
     xlrec.nredirected = nredirected;
@@ -7684,7 +7703,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
     /* nor when there are no tuples to freeze */
     Assert(ntuples > 0);
 
@@ -7769,7 +7788,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
     int            bufflags;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
 
     XLogBeginInsert();
 
@@ -9390,9 +9409,16 @@ heap2_redo(XLogReaderState *record)
  *    heap_sync        - sync a heap, for use when no WAL has been written
  *
  * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
  *
  * Indexes are not touched.  (Currently, index operations associated with
  * the commands that use this are WAL-logged and so do not need fsync.
@@ -9509,3 +9535,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
         }
     }
 }
+
+/*
+ *    heap_register_sync    - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RecordPendingSync(rel);
+    if (OidIsValid(rel->rd_rel->reltoastrelid))
+    {
+        Relation    toastrel;
+
+        toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+        RecordPendingSync(toastrel);
+        heap_close(toastrel, AccessShareLock);
+    }
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index c2f5343dac..d0b68902d9 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
 #include "access/htup_details.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -259,7 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
         /*
          * Emit a WAL HEAP_CLEAN record showing what we did
          */
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 85f92973c9..ec9d1b3113 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -653,9 +653,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     }
     else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
         heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
-                                         HEAP_INSERT_SKIP_FSM |
-                                         (state->rs_use_wal ?
-                                          0 : HEAP_INSERT_SKIP_WAL));
+                                         HEAP_INSERT_SKIP_FSM);
     else
         heaptup = tup;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 695567b4b0..fce14ce35f 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
         map[mapByte] |= (flags << mapOffset);
         MarkBufferDirty(vmBuf);
 
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, heapBuf))
         {
             if (XLogRecPtrIsInvalid(recptr))
             {
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6cd00d9aaa..e0ba2aff29 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2016,6 +2016,9 @@ CommitTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2245,6 +2248,9 @@ PrepareTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2559,6 +2565,7 @@ AbortTransaction(void)
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
     AtAbort_Twophase();
+    smgrDoPendingSyncs(false);    /* abandone pending syncs */
 
     /*
      * Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 5df4382b7e..ef0b75d288 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -62,6 +63,49 @@ typedef struct PendingRelDelete
 
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+    RelFileNode relnode;        /* relation created in same xact */
+    BlockNumber sync_above;        /* WAL-logging skipped for blocks >=
+                                 * sync_above */
+    BlockNumber truncated_to;    /* truncation WAL record was written */
+}    PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
 /*
  * RelationCreateStorage
  *        Create physical storage for a relation.
@@ -225,6 +269,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
 void
 RelationTruncate(Relation rel, BlockNumber nblocks)
 {
+    PendingRelSync *pending = NULL;
+    bool        found;
     bool        fsm;
     bool        vm;
 
@@ -259,37 +305,82 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
      */
     if (RelationNeedsWAL(rel))
     {
-        /*
-         * Make an XLOG entry reporting the file truncation.
-         */
-        XLogRecPtr    lsn;
-        xl_smgr_truncate xlrec;
+        /* no_pending_sync is ignored since new entry is created here */
+        if (!rel->pending_sync)
+        {
+            if (!pendingSyncs)
+                createPendingSyncsHash();
+            elog(DEBUG2, "RelationTruncate: accessing hash");
+            pending = (PendingRelSync *) hash_search(pendingSyncs,
+                                                 (void *) &rel->rd_node,
+                                                 HASH_ENTER, &found);
+            if (!found)
+            {
+                pending->sync_above = InvalidBlockNumber;
+                pending->truncated_to = InvalidBlockNumber;
+            }
 
-        xlrec.blkno = nblocks;
-        xlrec.rnode = rel->rd_node;
-        xlrec.flags = SMGR_TRUNCATE_ALL;
+            rel->no_pending_sync= false;
+            rel->pending_sync = pending;
+        }
 
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+        if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+            rel->pending_sync->sync_above < nblocks)
+        {
+            /*
+             * Make an XLOG entry reporting the file truncation.
+             */
+            XLogRecPtr        lsn;
+            xl_smgr_truncate xlrec;
 
-        lsn = XLogInsert(RM_SMGR_ID,
-                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+            xlrec.blkno = nblocks;
+            xlrec.rnode = rel->rd_node;
+            xlrec.flags = SMGR_TRUNCATE_ALL;
 
-        /*
-         * Flush, because otherwise the truncation of the main relation might
-         * hit the disk before the WAL record, and the truncation of the FSM
-         * or visibility map. If we crashed during that window, we'd be left
-         * with a truncated heap, but the FSM or visibility map would still
-         * contain entries for the non-existent heap pages.
-         */
-        if (fsm || vm)
-            XLogFlush(lsn);
+            XLogBeginInsert();
+            XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+            lsn = XLogInsert(RM_SMGR_ID,
+                             XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+            elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+                 nblocks);
+
+            /*
+             * Flush, because otherwise the truncation of the main relation
+             * might hit the disk before the WAL record, and the truncation of
+             * the FSM or visibility map. If we crashed during that window,
+             * we'd be left with a truncated heap, but the FSM or visibility
+             * map would still contain entries for the non-existent heap
+             * pages.
+             */
+            if (fsm || vm)
+                XLogFlush(lsn);
+
+            rel->pending_sync->truncated_to = nblocks;
+        }
     }
 
     /* Do the real work */
     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
 }
 
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+    /* First time through: initialize the hash table */
+    HASHCTL        ctl;
+
+    MemSet(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(RelFileNode);
+    ctl.entrysize = sizeof(PendingRelSync);
+    ctl.hash = tag_hash;
+    pendingSyncs = hash_create("pending relation sync table", 5,
+                               &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -367,6 +458,24 @@ smgrDoPendingDeletes(bool isCommit)
     }
 }
 
+/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+    bool found;
+
+    rel->pending_sync = NULL;
+    rel->no_pending_sync = true;
+    if (pendingSyncs)
+    {
+        elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+        hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+    }
+}
+
+
 /*
  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
  *
@@ -418,6 +527,176 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     return nrels;
 }
 
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+    bool found = true;
+    BlockNumber nblocks;
+
+    Assert(RelationNeedsWAL(rel));
+
+    /* ignore no_pending_sync since new entry is created here */
+    if (!rel->pending_sync)
+    {
+        if (!pendingSyncs)
+            createPendingSyncsHash();
+
+        /* Look up or create an entry */
+        rel->no_pending_sync = false;
+        elog(DEBUG2, "RecordPendingSync: accessing hash");
+        rel->pending_sync =
+            (PendingRelSync *) hash_search(pendingSyncs,
+                                           (void *) &rel->rd_node,
+                                           HASH_ENTER, &found);
+    }
+
+    nblocks = RelationGetNumberOfBlocks(rel);
+    if (!found)
+    {
+        rel->pending_sync->truncated_to = InvalidBlockNumber;
+        rel->pending_sync->sync_above = nblocks;
+
+        elog(DEBUG2,
+             "registering new pending sync for rel %u/%u/%u at block %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             nblocks);
+
+    }
+    else if (rel->pending_sync->sync_above == InvalidBlockNumber)
+    {
+        elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             nblocks);
+        rel->pending_sync->sync_above = nblocks;
+    }
+    else
+        elog(DEBUG2,
+             "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             rel->pending_sync->sync_above, nblocks);
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+    BlockNumber blkno = InvalidBlockNumber;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    /*
+     * no point in doing further work if we know that we don't have pending
+     * sync
+     */
+    if (!pendingSyncs || rel->no_pending_sync)
+        return true;
+
+    Assert(BufferIsValid(buf));
+
+    elog(DEBUG2, "BufferNeedsWAL(r %d, b %d): hash = %p, ent=%p, neg = %d", rel->rd_id, BufferGetBlockNumber(buf),
pendingSyncs,rel->pending_sync, rel->no_pending_sync);
 
+
+    /* do the real work */
+    if (!rel->pending_sync)
+    {
+        bool found = false;
+
+        /*
+         * Hold the entry in rel. This relies on the fact that hash entry
+         * never moves.
+         */
+        rel->pending_sync =
+            (PendingRelSync *) hash_search(pendingSyncs,
+                                           (void *) &rel->rd_node,
+                                           HASH_FIND, &found);
+        elog(DEBUG2, "BufferNeedsWAL: accessing hash : %s", found ? "found" : "not found");
+        if (!found)
+        {
+            /* we don't have no one. don't access the hash no longer */
+            rel->no_pending_sync = true;
+            return true;
+        }
+    }
+
+    blkno = BufferGetBlockNumber(buf);
+    if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+        rel->pending_sync->sync_above > blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno, rel->pending_sync->sync_above);
+        return true;
+    }
+
+    /*
+     * We have emitted a truncation record for this block.
+     */
+    if (rel->pending_sync->truncated_to != InvalidBlockNumber &&
+        rel->pending_sync->truncated_to <= blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same
xact",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno);
+        return true;
+    }
+
+    elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+         blkno);
+
+    return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    if (!pendingSyncs)
+        return;
+
+    if (isCommit)
+    {
+        HASH_SEQ_STATUS status;
+        PendingRelSync *pending;
+
+        hash_seq_init(&status, pendingSyncs);
+
+        while ((pending = hash_seq_search(&status)) != NULL)
+        {
+            if (pending->sync_above != InvalidBlockNumber)
+            {
+                FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+                smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+                elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+                     pending->relnode.dbNode, pending->relnode.relNode);
+            }
+        }
+    }
+
+    hash_destroy(pendingSyncs);
+    pendingSyncs = NULL;
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 86b0fb300f..07f96fde56 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2391,8 +2391,7 @@ CopyFrom(CopyState cstate)
      *    - data is being written to relfilenode created in this transaction
      * then we can skip writing WAL.  It's safe because if the transaction
      * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the heap_sync at the bottom of this
-     * routine first.
+     * If it does commit, commit will do heap_sync().
      *
      * As mentioned in comments in utils/rel.h, the in-same-transaction test
      * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2424,7 +2423,7 @@ CopyFrom(CopyState cstate)
     {
         hi_options |= HEAP_INSERT_SKIP_FSM;
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(cstate->rel);
     }
 
     /*
@@ -3079,11 +3078,11 @@ CopyFrom(CopyState cstate)
     FreeExecutorState(estate);
 
     /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway)
+     * If we skipped writing WAL, then we will sync the heap at the end of
+     * the transaction. (We used to do it here, but it was later found out
+     * that to be safe, we must also avoid WAL-logging any subsequent
+     * actions on the pages we skipped WAL for). Indexes always use WAL.
      */
-    if (hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(cstate->rel);
 
     return processed;
 }
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index d5cb62da15..0f58da40c6 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -568,8 +568,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->hi_options = HEAP_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+    if (!XLogIsNeeded())
+        heap_register_sync(intoRelationDesc);
+    myState->hi_options = HEAP_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -618,9 +619,7 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->rel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close rel, but keep lock until commit */
     heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index e1eb7c374b..986f7baf39 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -464,7 +464,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
     if (!XLogIsNeeded())
-        myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+        heap_register_sync(transientrel);
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -507,9 +507,7 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->transientrel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close transientrel, but keep lock until commit */
     heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e10d3dbf3d..715718450d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4611,8 +4611,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         hi_options = HEAP_INSERT_SKIP_FSM;
+
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(newrel);
     }
     else
     {
@@ -4885,8 +4886,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         FreeBulkInsertState(bistate);
 
         /* If we skipped writing WAL, then we need to sync the heap. */
-        if (hi_options & HEAP_INSERT_SKIP_WAL)
-            heap_sync(newrel);
 
         heap_close(newrel, NoLock);
     }
@@ -11019,11 +11018,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     /*
      * Create and copy all forks of the relation, and schedule unlinking of
-     * old physical files.
+     * old physical files. Pending syncs for the old node is no longer needed.
      *
      * NOTE: any conflict in relfilenode value will be caught in
      * RelationCreateStorage().
      */
+    RelationRemovePendingSync(rel);
     RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
 
     /* copy main fork */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8996d366e9..72849a9a94 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -924,7 +924,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
                  * page has been previously WAL-logged, and if not, do that
                  * now.
                  */
-                if (RelationNeedsWAL(onerel) &&
+                if (BufferNeedsWAL(onerel, buf) &&
                     PageGetLSN(page) == InvalidXLogRecPtr)
                     log_newpage_buffer(buf, true);
 
@@ -1194,7 +1194,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
             }
 
             /* Now WAL-log freezing if necessary */
-            if (RelationNeedsWAL(onerel))
+            if (BufferNeedsWAL(onerel, buf))
             {
                 XLogRecPtr    recptr;
 
@@ -1575,7 +1575,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(onerel))
+    if (BufferNeedsWAL(onerel, buffer))
     {
         XLogRecPtr    recptr;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe5706..1095f6c721 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
             BufferAccessStrategy strategy,
             bool *foundPtr);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int    rnode_comparator(const void *p1, const void *p2);
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
     /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int            i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index a4fc001103..20ba6fc989 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -77,6 +77,7 @@
 #include "pgstat.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/array.h"
@@ -419,6 +420,10 @@ AllocateRelationDesc(Form_pg_class relp)
     /* which we mark as a reference-counted tupdesc */
     relation->rd_att->tdrefcount = 1;
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+
     MemoryContextSwitchTo(oldcxt);
 
     return relation;
@@ -1872,6 +1877,10 @@ formrdesc(const char *relationName, Oid relationReltype,
         relation->rd_rel->relhasindex = true;
     }
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+
     /*
      * add new reldesc to relcache
      */
@@ -3271,6 +3280,10 @@ RelationBuildLocalRelation(const char *relname,
     else
         rel->rd_rel->relfilenode = relfilenode;
 
+    /* newly built relation has no pending sync */
+    rel->no_pending_sync = true;
+    rel->pending_sync = NULL;
+
     RelationInitLockInfo(rel);    /* see lmgr.c */
 
     RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ca5cad7497..c5e5e9a8b2 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    0x0001
-#define HEAP_INSERT_SKIP_FSM    0x0002
-#define HEAP_INSERT_FROZEN        0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM    0x0001
+#define HEAP_INSERT_FROZEN        0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
 
 typedef struct BulkInsertStateData *BulkInsertState;
 
@@ -180,6 +179,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                    HeapTuple tup);
 
+extern void heap_register_sync(Relation relation);
 extern void heap_sync(Relation relation);
 extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
 
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ef52d85803..49d93cd01f 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
 extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3cce3906a0..9fae7c6ae5 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                 ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                    bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                        ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 6ecbdb6294..ea44e0e15f 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -188,6 +188,14 @@ typedef struct RelationData
 
     /* use "struct" here to avoid needing to include pgstat.h: */
     struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+    /*
+     * no_pending_sync is true if this relation is known not to have pending
+     * syncs.  Elsewise searching for registered sync is required if
+     * pending_sync is NULL.
+     */
+    bool                   no_pending_sync;
+    struct PendingRelSync *pending_sync;
 } RelationData;
 
 
-- 
2.16.3

From c57c30c911031ac3257dd58935486fde7d4ddef0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 3/3] Write WAL for empty nbtree index build

After relation truncation indexes are also rebuild. It doesn't emit
WAL in minimal mode and if truncation happened within its creation
transaction, crash recovery leaves an empty index heap, which is
considered broken. This patch forces to emit WAL when an index_build
turns into empty nbtree index.
---
 src/backend/access/nbtree/nbtsort.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..2c2647b530 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -610,8 +610,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
     /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
     RelationOpenSmgr(wstate->index);
 
-    /* XLOG stuff */
-    if (wstate->btws_use_wal)
+    /* XLOG stuff
+     *
+     * Even if minimal mode, WAL is required here if truncation happened after
+     * being created in the same transaction. It is not needed otherwise but
+     * we don't bother identifying the case precisely.
+     */
+    if (wstate->btws_use_wal ||
+        (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
     {
         /* We use the heap NEWPAGE record type for this */
         log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
@@ -1055,6 +1061,11 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
      * set to point to "P_NONE").  This changes the index to the "valid" state
      * by filling in a valid magic number in the metapage.
      */
+    /*
+     * If no tuple was inserted, it's possible that we are truncating a
+     * relation. We need to emit WAL for the metapage in the case. However it
+     * is not required elsewise,
+     */
     metapage = (Page) palloc(BLCKSZ);
     _bt_initmetapage(metapage, rootblkno, rootlevel);
     _bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
-- 
2.16.3

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

11 октября 2018 г., 11:04:53

At Thu, 11 Oct 2018 13:42:35 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20181011.134235.218062184.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello.
>
> At Fri, 27 Jul 2018 15:26:24 -0400, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote in
<d0c9e197-5219-c094-418a-e5a6fbd8cdda@2ndQuadrant.com>
> >
> >
> > On 07/18/2018 10:58 AM, Heikki Linnakangas wrote:
> > > On 18/07/18 16:29, Robert Haas wrote:
> > >> On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier <michael@paquier.xyz>
> > >> wrote:
> > >>>> What's wrong with the approach proposed in
> > >>>> http://postgr.es/m/55AFC302.1060805@iki.fi ?
> > >>>
> > >>> For back-branches that's very invasive so that seems risky to me
> > >>> particularly seeing the low number of complaints on the matter.
> > >>
> > >> Hmm. I think that if you disable the optimization, you're betting that
> > >> people won't mind losing performance in this case in a maintenance
> > >> release.  If you back-patch Heikki's approach, you're betting that the
> > >> committed version doesn't have any bugs that are worse than the status
> > >> quo.  Personally, I'd rather take the latter bet.  Maybe the patch
> > >> isn't all there yet, but that seems like something we can work
> > >> towards.  If we just give up and disable the optimization, we won't
> > >> know how many people we ticked off or how badly until after we've done
> > >> it.
> > >
> > > Yeah. I'm not happy about backpatching a big patch like what I
> > > proposed, and Kyotaro developed further. But I think it's the least
> > > bad option we have, the other options discussed seem even worse.
> > >
> > > One way to review the patch is to look at what it changes, when
> > > wal_level is *not* set to minimal, i.e. what risk or overhead does it
> > > pose to users who are not affected by this bug? It seems pretty safe
> > > to me.
> > >
> > > The other aspect is, how confident are we that this actually fixes the
> > > bug, with least impact to users using wal_level='minimal'? I think
> > > it's the best shot we have so far. All the other proposals either
> > > don't fully fix the bug, or hurt performance in some legit cases.
> > >
> > > I'd suggest that we continue based on the patch that Kyotaro posted at
> > > https://www.postgresql.org/message-id/20180330.100646.86008470.horiguchi.kyotaro%40lab.ntt.co.jp.
> > >
> >
> >
> >
> > I have just spent some time reviewing Kyatoro's patch. I'm a bit
> > nervous, too, given the size. But I'm also nervous about leaving
> > things as they are. I suspect the reason we haven't heard more about
> > this is that these days use of "wal_level = minimal" is relatively
> > rare.
>
> Thank you for lokking this (and sorry for the late response).
>
> > I like the fact that this is closer to being a real fix rather than
> > just throwing out the optimization. Like Heikki I've come round to the
> > view that something like this is the least bad option.
> >
> > The code looks good to me - some comments might be helpful in
> > heap_xlog_update()
>
> Thanks. It is intending to avoid PANIC for a broken record. I
> reverted the part since PANIC would be preferable in the case.
>
> > Do we want to try this on HEAD and then backpatch it? Do we want to
> > add some testing along the lines Michael suggested?
>
> 44cac93464 hit this, rebased. And added Michael's TAP test
> contained in [1] as patch 0001.
>
> I regard [2] as an orthogonal issue.
>
> The previous patch didn't care of the case of
> BEGIN;CREATE;TRUNCATE;COMMIT case. This version contains a "fix"
> of nbtree (patch 0003) so that FPI of the metapage is always
> emitted when building an empty index. On the other hand this
> emits useless one or two FPIs (136 bytes each) on TRUNCATE in a
> separate transaction, but it won't matter so much.. Other index
> methods don't have this problem. Some other AMs emits initialize
> WALs even in minimal mode.
>
> This still has a bit too many elog(DEBUG2)s to see how it is
> working. I'm going to remove most of them in the final version.
>
> I started to prefix the file names with version 2.
>
> regards.
>
> [1] https://www.postgresql.org/message-id/20180711033241.GQ1661@paquier.xyz
>
> [2] https://www.postgresql.org/message-id/CAKJS1f9iF55cwx-LUOreRokyi9UZESXOLHuFDkt0wksZN+KqWw@mail.gmail.com
>
>     or
>
>     https://commitfest.postgresql.org/20/1811/

I refactored getPendingSyncEntry out of RecordPendingSync,
BufferNeedsWAL and RelationTruncate. And split the second patch
into infrastracture-side and user-side ones. I expect it makes
reviewing far easier.

I reaplce RelationNeedsWAL in a part of code added in
heap_update() by bfa2ab56bb.

- v3-0001-TAP-test-for-copy-truncation-optimization.patch

 TAP test

-v3-0002-Write-WAL-for-empty-nbtree-index-build.patch

 nbtree "fix"

- v3-0003-Add-infrastructure-to-WAL-logging-skip-feature.patch

 Pending-sync infrastructure.

- v3-0004-Fix-WAL-skipping-feature.patch

 Actual fix of WAL skipping feature.


regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
From cdbb6f3af2b66f3b2fefd374e0bcf2bc7096a17a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:18:04 +0900
Subject: [PATCH 4/4] Fix WAL skipping feature.

This patch repalces WAL-skipping means from HEAP_INSERT_SKIP_WAL to
pending-sync tracking infrastructure.
---
 src/backend/access/heap/heapam.c        | 71 ++++++++++++++++++++++-----------
 src/backend/access/heap/pruneheap.c     |  3 +-
 src/backend/access/heap/rewriteheap.c   |  4 +-
 src/backend/access/heap/visibilitymap.c |  3 +-
 src/backend/commands/copy.c             | 13 +++---
 src/backend/commands/createas.c         |  9 ++---
 src/backend/commands/matview.c          |  6 +--
 src/backend/commands/tablecmds.c        |  5 +--
 src/backend/commands/vacuumlazy.c       |  6 +--
 src/include/access/heapam.h             |  7 ++--
 10 files changed, 73 insertions(+), 54 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4e823b6e39..46a3dda09f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
  *      the POSTGRES heap access method used for all POSTGRES
  *      relations.
  *
+ * WAL CONSIDERATIONS
+ *      All heap operations are normally WAL-logged. but there are a few
+ *      exceptions. Temporary and unlogged relations never need to be
+ *      WAL-logged, but we can also skip WAL-logging for a table that was
+ *      created in the same transaction, if we don't need WAL for PITR or
+ *      WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *      the file to disk at COMMIT instead.
+ *
+ *      The same-relation optimization is not employed automatically on all
+ *      updates to a table that was created in the same transacton, because
+ *      for a small number of changes, it's cheaper to just create the WAL
+ *      records than fsyncing() the whole relation at COMMIT. It is only
+ *      worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *      or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *      operation; it will cause any subsequent updates to the table to skip
+ *      WAL-logging, if possible, and cause the heap to be synced to disk at
+ *      COMMIT.
+ *
+ *      To make that work, all modifications to heap must use
+ *      BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ *      for the given block.
+ *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -2414,12 +2436,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2450,6 +2466,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * TID where the tuple was stored.  But note that any toasting of fields
  * within the tuple data is NOT reflected into *tup.
  */
+
 Oid
 heap_insert(Relation relation, HeapTuple tup, CommandId cid,
             int options, BulkInsertState bistate)
@@ -2523,7 +2540,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2724,12 +2741,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     int            ndone;
     PGAlignedBlock scratch;
     Page        page;
-    bool        needwal;
     Size        saveFreeSpace;
     bool        need_tuple_data = RelationIsLogicallyLogged(relation);
     bool        need_cids = RelationIsAccessibleInLogicalDecoding(relation);
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -2771,6 +2786,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
         Buffer        vmbuffer = InvalidBuffer;
         bool        all_visible_cleared = false;
         int            nthispage;
+        bool        needwal;
 
         CHECK_FOR_INTERRUPTS();
 
@@ -2782,6 +2798,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
                                            InvalidBuffer, options, bistate,
                                            &vmbuffer, NULL);
         page = BufferGetPage(buffer);
+        needwal = BufferNeedsWAL(relation, buffer);
 
         /* NO EREPORT(ERROR) from here till changes are logged */
         START_CRIT_SECTION();
@@ -3344,7 +3361,7 @@ l1:
      * NB: heap_abort_speculative() uses the same xlog record and replay
      * routines.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -4101,7 +4118,7 @@ l2:
 
         MarkBufferDirty(buffer);
 
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             xl_heap_lock xlrec;
             XLogRecPtr    recptr;
@@ -4323,7 +4340,8 @@ l2:
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer) ||
+        BufferNeedsWAL(relation, newbuf))
     {
         XLogRecPtr    recptr;
 
@@ -5295,7 +5313,7 @@ failed:
      * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
      * entries for everything anyway.)
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, *buffer))
     {
         xl_heap_lock xlrec;
         XLogRecPtr    recptr;
@@ -6039,7 +6057,7 @@ l4:
         MarkBufferDirty(buf);
 
         /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf))
         {
             xl_heap_lock_updated xlrec;
             XLogRecPtr    recptr;
@@ -6199,7 +6217,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
     htup->t_ctid = tuple->t_self;
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_confirm xlrec;
         XLogRecPtr    recptr;
@@ -6332,7 +6350,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
      * The WAL records generated here match heap_delete().  The same recovery
      * routines are used.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -6441,7 +6459,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_inplace xlrec;
         XLogRecPtr    recptr;
@@ -7637,7 +7655,7 @@ log_heap_clean(Relation reln, Buffer buffer,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
 
     xlrec.latestRemovedXid = latestRemovedXid;
     xlrec.nredirected = nredirected;
@@ -7685,7 +7703,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
     /* nor when there are no tuples to freeze */
     Assert(ntuples > 0);
 
@@ -7770,7 +7788,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
     int            bufflags;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
 
     XLogBeginInsert();
 
@@ -9391,9 +9409,16 @@ heap2_redo(XLogReaderState *record)
  *    heap_sync        - sync a heap, for use when no WAL has been written
  *
  * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
  *
  * Indexes are not touched.  (Currently, index operations associated with
  * the commands that use this are WAL-logged and so do not need fsync.
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index c2f5343dac..d0b68902d9 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
 #include "access/htup_details.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -259,7 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
         /*
          * Emit a WAL HEAP_CLEAN record showing what we did
          */
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 85f92973c9..ec9d1b3113 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -653,9 +653,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     }
     else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
         heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
-                                         HEAP_INSERT_SKIP_FSM |
-                                         (state->rs_use_wal ?
-                                          0 : HEAP_INSERT_SKIP_WAL));
+                                         HEAP_INSERT_SKIP_FSM);
     else
         heaptup = tup;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 695567b4b0..fce14ce35f 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
         map[mapByte] |= (flags << mapOffset);
         MarkBufferDirty(vmBuf);
 
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, heapBuf))
         {
             if (XLogRecPtrIsInvalid(recptr))
             {
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 86b0fb300f..07f96fde56 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2391,8 +2391,7 @@ CopyFrom(CopyState cstate)
      *    - data is being written to relfilenode created in this transaction
      * then we can skip writing WAL.  It's safe because if the transaction
      * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the heap_sync at the bottom of this
-     * routine first.
+     * If it does commit, commit will do heap_sync().
      *
      * As mentioned in comments in utils/rel.h, the in-same-transaction test
      * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2424,7 +2423,7 @@ CopyFrom(CopyState cstate)
     {
         hi_options |= HEAP_INSERT_SKIP_FSM;
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(cstate->rel);
     }
 
     /*
@@ -3079,11 +3078,11 @@ CopyFrom(CopyState cstate)
     FreeExecutorState(estate);
 
     /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway)
+     * If we skipped writing WAL, then we will sync the heap at the end of
+     * the transaction. (We used to do it here, but it was later found out
+     * that to be safe, we must also avoid WAL-logging any subsequent
+     * actions on the pages we skipped WAL for). Indexes always use WAL.
      */
-    if (hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(cstate->rel);
 
     return processed;
 }
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index d5cb62da15..0f58da40c6 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -568,8 +568,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->hi_options = HEAP_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+    if (!XLogIsNeeded())
+        heap_register_sync(intoRelationDesc);
+    myState->hi_options = HEAP_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -618,9 +619,7 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->rel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close rel, but keep lock until commit */
     heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index e1eb7c374b..986f7baf39 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -464,7 +464,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
     if (!XLogIsNeeded())
-        myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+        heap_register_sync(transientrel);
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -507,9 +507,7 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->transientrel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close transientrel, but keep lock until commit */
     heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6dff2c696b..715718450d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4611,8 +4611,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         hi_options = HEAP_INSERT_SKIP_FSM;
+
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(newrel);
     }
     else
     {
@@ -4885,8 +4886,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         FreeBulkInsertState(bistate);
 
         /* If we skipped writing WAL, then we need to sync the heap. */
-        if (hi_options & HEAP_INSERT_SKIP_WAL)
-            heap_sync(newrel);
 
         heap_close(newrel, NoLock);
     }
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8996d366e9..72849a9a94 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -924,7 +924,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
                  * page has been previously WAL-logged, and if not, do that
                  * now.
                  */
-                if (RelationNeedsWAL(onerel) &&
+                if (BufferNeedsWAL(onerel, buf) &&
                     PageGetLSN(page) == InvalidXLogRecPtr)
                     log_newpage_buffer(buf, true);
 
@@ -1194,7 +1194,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
             }
 
             /* Now WAL-log freezing if necessary */
-            if (RelationNeedsWAL(onerel))
+            if (BufferNeedsWAL(onerel, buf))
             {
                 XLogRecPtr    recptr;
 
@@ -1575,7 +1575,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(onerel))
+    if (BufferNeedsWAL(onerel, buffer))
     {
         XLogRecPtr    recptr;
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 169c2f730e..c5e5e9a8b2 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    0x0001
-#define HEAP_INSERT_SKIP_FSM    0x0002
-#define HEAP_INSERT_FROZEN        0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM    0x0001
+#define HEAP_INSERT_FROZEN        0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
 
 typedef struct BulkInsertStateData *BulkInsertState;
 
-- 
2.16.3

From 521064b509f640388e5c0d3fca12d5538d212635 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:00:44 +0900
Subject: [PATCH 3/4] Add infrastructure to WAL-logging skip feature

We used to optimize WAL-logging for truncation of in-transaction
crated tables in minimal mode by just singaling by
HEAP_INSERT_SKIP_WAL option on heap operations. This mechanism can
emit WAL records that results in corrupt state for certain series of
in-transaction operations. This patch provides infrastructure to track
pending at-comit fsyncs for a relation and in-transaction truncations.
heap_register_sync() should be used to start tracking before batch
operations like COPY and CLUSTER, and use BufferNeedsWAL() instead of
RelationNeedsWAL() at the places related to WAL-logging about
heap-modifying operations.
---
 src/backend/access/heap/heapam.c    |  31 ++++
 src/backend/access/transam/xact.c   |   7 +
 src/backend/catalog/storage.c       | 317 +++++++++++++++++++++++++++++++++---
 src/backend/commands/tablecmds.c    |   3 +-
 src/backend/storage/buffer/bufmgr.c |  40 ++++-
 src/backend/utils/cache/relcache.c  |  13 ++
 src/include/access/heapam.h         |   1 +
 src/include/catalog/storage.h       |   5 +-
 src/include/storage/bufmgr.h        |   2 +
 src/include/utils/rel.h             |   8 +
 10 files changed, 395 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5f1a69ca53..4e823b6e39 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -57,6 +57,7 @@
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
 #include "catalog/index.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -9509,3 +9510,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
         }
     }
 }
+
+/*
+ *    heap_register_sync    - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RecordPendingSync(rel);
+    if (OidIsValid(rel->rd_rel->reltoastrelid))
+    {
+        Relation    toastrel;
+
+        toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+        RecordPendingSync(toastrel);
+        heap_close(toastrel, AccessShareLock);
+    }
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6cd00d9aaa..e0ba2aff29 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2016,6 +2016,9 @@ CommitTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2245,6 +2248,9 @@ PrepareTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2559,6 +2565,7 @@ AbortTransaction(void)
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
     AtAbort_Twophase();
+    smgrDoPendingSyncs(false);    /* abandone pending syncs */
 
     /*
      * Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 5df4382b7e..e14ce64fc4 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -62,6 +63,49 @@ typedef struct PendingRelDelete
 
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+    RelFileNode relnode;        /* relation created in same xact */
+    BlockNumber sync_above;        /* WAL-logging skipped for blocks >=
+                                 * sync_above */
+    BlockNumber truncated_to;    /* truncation WAL record was written */
+}    PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static PendingRelSync *getPendingSyncEntry(Relation rel, bool create);
+
 /*
  * RelationCreateStorage
  *        Create physical storage for a relation.
@@ -259,37 +303,117 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
      */
     if (RelationNeedsWAL(rel))
     {
-        /*
-         * Make an XLOG entry reporting the file truncation.
-         */
-        XLogRecPtr    lsn;
-        xl_smgr_truncate xlrec;
+        PendingRelSync *pending_sync;
 
-        xlrec.blkno = nblocks;
-        xlrec.rnode = rel->rd_node;
-        xlrec.flags = SMGR_TRUNCATE_ALL;
+        /* get pending sync entry, create if not yet */
+        pending_sync = getPendingSyncEntry(rel, true);
 
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+        if (pending_sync->sync_above == InvalidBlockNumber ||
+            pending_sync->sync_above < nblocks)
+        {
+            /*
+             * This is the first time truncation of this relation in this
+             * transaction or truncation that leaves pages that need at-commit
+             * fsync.  Make an XLOG entry reporting the file truncation.
+             */
+            XLogRecPtr        lsn;
+            xl_smgr_truncate xlrec;
 
-        lsn = XLogInsert(RM_SMGR_ID,
-                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+            xlrec.blkno = nblocks;
+            xlrec.rnode = rel->rd_node;
+            xlrec.flags = SMGR_TRUNCATE_ALL;
 
-        /*
-         * Flush, because otherwise the truncation of the main relation might
-         * hit the disk before the WAL record, and the truncation of the FSM
-         * or visibility map. If we crashed during that window, we'd be left
-         * with a truncated heap, but the FSM or visibility map would still
-         * contain entries for the non-existent heap pages.
-         */
-        if (fsm || vm)
-            XLogFlush(lsn);
+            XLogBeginInsert();
+            XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+            lsn = XLogInsert(RM_SMGR_ID,
+                             XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+            elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+                 nblocks);
+
+            /*
+             * Flush, because otherwise the truncation of the main relation
+             * might hit the disk before the WAL record, and the truncation of
+             * the FSM or visibility map. If we crashed during that window,
+             * we'd be left with a truncated heap, but the FSM or visibility
+             * map would still contain entries for the non-existent heap
+             * pages.
+             */
+            if (fsm || vm)
+                XLogFlush(lsn);
+
+            rel->pending_sync->truncated_to = nblocks;
+        }
     }
 
     /* Do the real work */
     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
 }
 
+/*
+ * getPendingSyncEntry: get pendig sync entry.
+ *
+ * Returns pending sync entry for the relation. The entry tracks pending
+ * at-commit fsyncs for the relation.  Creates one if needed when create is
+ * true.
+ */  
+static PendingRelSync *
+getPendingSyncEntry(Relation rel, bool create)
+{
+    PendingRelSync *pendsync_entry = NULL;
+    bool            found;
+
+    if (rel->pending_sync)
+        return rel->pending_sync;
+
+    /* we know we don't have pending sync entry */
+    if (!create && rel->no_pending_sync)
+        return NULL;
+
+    if (!pendingSyncs)
+    {
+        /* First time through: initialize the hash table */
+        HASHCTL        ctl;
+
+        if (!create)
+            return NULL;
+
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(RelFileNode);
+        ctl.entrysize = sizeof(PendingRelSync);
+        ctl.hash = tag_hash;
+        pendingSyncs = hash_create("pending relation sync table", 5,
+                                   &ctl, HASH_ELEM | HASH_FUNCTION);
+    }
+
+    elog(DEBUG2, "getPendingSyncEntry: accessing hash for %d",
+         rel->rd_node.relNode);
+    pendsync_entry = (PendingRelSync *)
+        hash_search(pendingSyncs, (void *) &rel->rd_node,
+                    create ? HASH_ENTER: HASH_FIND,    &found);
+
+    if (!pendsync_entry)
+    {
+        rel->no_pending_sync = true;
+        return NULL;
+    }
+
+    /* new entry created */
+    if (!found)
+    {
+        pendsync_entry->truncated_to = InvalidBlockNumber;
+        pendsync_entry->sync_above = InvalidBlockNumber;
+    }
+
+    /* hold shortcut in Relation */
+    rel->no_pending_sync = false;
+    rel->pending_sync = pendsync_entry;
+
+    return pendsync_entry;
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -367,6 +491,24 @@ smgrDoPendingDeletes(bool isCommit)
     }
 }
 
+/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+    bool found;
+
+    rel->pending_sync = NULL;
+    rel->no_pending_sync = true;
+    if (pendingSyncs)
+    {
+        elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+        hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+    }
+}
+
+
 /*
  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
  *
@@ -418,6 +560,139 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     return nrels;
 }
 
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+    BlockNumber nblocks;
+    PendingRelSync *pending_sync;
+
+    Assert(RelationNeedsWAL(rel));
+
+    /* get pending sync entry, create if not yet  */
+    pending_sync = getPendingSyncEntry(rel, true);
+
+    nblocks = RelationGetNumberOfBlocks(rel);
+
+    if (pending_sync->sync_above != InvalidBlockNumber)
+    {
+        elog(DEBUG2,
+             "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             rel->pending_sync->sync_above, nblocks);
+
+        return;
+    }
+
+    elog(DEBUG2,
+         "registering new pending sync for rel %u/%u/%u at block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+         nblocks);
+    pending_sync->sync_above = nblocks;
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+    BlockNumber        blkno = InvalidBlockNumber;
+    PendingRelSync *pending_sync;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    /* fetch exising pending sync entry */
+    pending_sync = getPendingSyncEntry(rel, false);
+
+    /*
+     * no point in doing further work if we know that we don't have pending
+     * sync
+     */
+    if (!pending_sync)
+        return true;
+
+    Assert(BufferIsValid(buf));
+
+    blkno = BufferGetBlockNumber(buf);
+
+    /* we don't skip WAL-logging for pages that already done */
+    if (pending_sync->sync_above == InvalidBlockNumber ||
+        pending_sync->sync_above > blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno, rel->pending_sync->sync_above);
+        return true;
+    }
+
+    /*
+     * We have emitted a truncation record for this block.
+     */
+    if (pending_sync->truncated_to != InvalidBlockNumber &&
+        pending_sync->truncated_to <= blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same
xact",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno);
+        return true;
+    }
+
+    elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+         blkno);
+
+    return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    if (!pendingSyncs)
+        return;
+
+    if (isCommit)
+    {
+        HASH_SEQ_STATUS status;
+        PendingRelSync *pending;
+
+        hash_seq_init(&status, pendingSyncs);
+
+        while ((pending = hash_seq_search(&status)) != NULL)
+        {
+            if (pending->sync_above != InvalidBlockNumber)
+            {
+                FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+                smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+                elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+                     pending->relnode.dbNode, pending->relnode.relNode);
+            }
+        }
+    }
+
+    hash_destroy(pendingSyncs);
+    pendingSyncs = NULL;
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e10d3dbf3d..6dff2c696b 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11019,11 +11019,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     /*
      * Create and copy all forks of the relation, and schedule unlinking of
-     * old physical files.
+     * old physical files. Pending syncs for the old node is no longer needed.
      *
      * NOTE: any conflict in relfilenode value will be caught in
      * RelationCreateStorage().
      */
+    RelationRemovePendingSync(rel);
     RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
 
     /* copy main fork */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe5706..1095f6c721 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
             BufferAccessStrategy strategy,
             bool *foundPtr);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int    rnode_comparator(const void *p1, const void *p2);
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
     /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int            i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index a4fc001103..20ba6fc989 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -77,6 +77,7 @@
 #include "pgstat.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/array.h"
@@ -419,6 +420,10 @@ AllocateRelationDesc(Form_pg_class relp)
     /* which we mark as a reference-counted tupdesc */
     relation->rd_att->tdrefcount = 1;
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+
     MemoryContextSwitchTo(oldcxt);
 
     return relation;
@@ -1872,6 +1877,10 @@ formrdesc(const char *relationName, Oid relationReltype,
         relation->rd_rel->relhasindex = true;
     }
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+
     /*
      * add new reldesc to relcache
      */
@@ -3271,6 +3280,10 @@ RelationBuildLocalRelation(const char *relname,
     else
         rel->rd_rel->relfilenode = relfilenode;
 
+    /* newly built relation has no pending sync */
+    rel->no_pending_sync = true;
+    rel->pending_sync = NULL;
+
     RelationInitLockInfo(rel);    /* see lmgr.c */
 
     RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ca5cad7497..169c2f730e 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -180,6 +180,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                    HeapTuple tup);
 
+extern void heap_register_sync(Relation relation);
 extern void heap_sync(Relation relation);
 extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
 
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ef52d85803..49d93cd01f 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
 extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3cce3906a0..9fae7c6ae5 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                 ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                    bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                        ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 6ecbdb6294..ea44e0e15f 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -188,6 +188,14 @@ typedef struct RelationData
 
     /* use "struct" here to avoid needing to include pgstat.h: */
     struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+    /*
+     * no_pending_sync is true if this relation is known not to have pending
+     * syncs.  Elsewise searching for registered sync is required if
+     * pending_sync is NULL.
+     */
+    bool                   no_pending_sync;
+    struct PendingRelSync *pending_sync;
 } RelationData;
 
 
-- 
2.16.3

From 19d9f2ec8868df606eabf3987140b7a305449536 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 2/4] Write WAL for empty nbtree index build

After relation truncation indexes are also rebuild. It doesn't emit
WAL in minimal mode and if truncation happened within its creation
transaction, crash recovery leaves an empty index heap, which is
considered broken. This patch forces to emit WAL when an index_build
turns into empty nbtree index.
---
 src/backend/access/nbtree/nbtsort.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..2c2647b530 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -610,8 +610,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
     /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
     RelationOpenSmgr(wstate->index);
 
-    /* XLOG stuff */
-    if (wstate->btws_use_wal)
+    /* XLOG stuff
+     *
+     * Even if minimal mode, WAL is required here if truncation happened after
+     * being created in the same transaction. It is not needed otherwise but
+     * we don't bother identifying the case precisely.
+     */
+    if (wstate->btws_use_wal ||
+        (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
     {
         /* We use the heap NEWPAGE record type for this */
         log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
@@ -1055,6 +1061,11 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
      * set to point to "P_NONE").  This changes the index to the "valid" state
      * by filling in a valid magic number in the metapage.
      */
+    /*
+     * If no tuple was inserted, it's possible that we are truncating a
+     * relation. We need to emit WAL for the metapage in the case. However it
+     * is not required elsewise,
+     */
     metapage = (Page) palloc(BLCKSZ);
     _bt_initmetapage(metapage, rootblkno, rootlevel);
     _bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
-- 
2.16.3

From 092e7412f361c39530911d4592fb46653ca027ab Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/4] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/016_wal_optimize.pl | 192 ++++++++++++++++++++++++++++++++
 1 file changed, 192 insertions(+)
 create mode 100644 src/test/recovery/t/016_wal_optimize.pl

diff --git a/src/test/recovery/t/016_wal_optimize.pl b/src/test/recovery/t/016_wal_optimize.pl
new file mode 100644
index 0000000000..310772a2b3
--- /dev/null
+++ b/src/test/recovery/t/016_wal_optimize.pl
@@ -0,0 +1,192 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 14;
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    # Primary needs to have wal_level = minimal here
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+    $node->start;
+
+    # Test direct truncation optimization.  No tuples
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test1 (id serial PRIMARY KEY);
+        TRUNCATE test1;
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+    is($result, qq(0),
+       "wal_level = $wal_level, optimized truncation with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2 (id serial PRIMARY KEY);
+        INSERT INTO test2 VALUES (DEFAULT);
+        TRUNCATE test2;
+        INSERT INTO test2 VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with inserted table");
+
+    # Data file for COPY query in follow-up tests.
+    my $basedir = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after the
+    # truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test3;
+        COPY test3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+    is($result, qq(3),
+       "wal_level = $wal_level, optimized truncation with copied table");
+
+    # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test4;
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COPY test4 FROM '$copy_file' DELIMITER ',';
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+    is($result, qq(5),
+       "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test5 VALUES (DEFAULT, 1);
+        COPY test5 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER test6_before_row_insert
+          BEFORE INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+        CREATE TRIGGER test6_after_row_insert
+          AFTER INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+        COPY test6 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+    is($result, qq(9),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER test7_before_stat_truncate
+          BEFORE TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+        CREATE TRIGGER test7_after_stat_truncate
+          AFTER TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+        INSERT INTO test7 VALUES (DEFAULT, 1);
+        TRUNCATE test7;
+        COPY test7 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    $node->teardown_node;
+    $node->clean_node;
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.16.3

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

14 ноября 2018 г., 06:47:36

Hello.

At Thu, 11 Oct 2018 17:04:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20181011.170453.123148806.horiguchi.kyotaro@lab.ntt.co.jp>
> At Thu, 11 Oct 2018 13:42:35 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in<20181011.134235.218062184.horiguchi.kyotaro@lab.ntt.co.jp>
 
> I refactored getPendingSyncEntry out of RecordPendingSync,
> BufferNeedsWAL and RelationTruncate. And split the second patch
> into infrastracture-side and user-side ones. I expect it makes
> reviewing far easier.
> 
> I reaplce RelationNeedsWAL in a part of code added in
> heap_update() by bfa2ab56bb.
> 
> - v3-0001-TAP-test-for-copy-truncation-optimization.patch
> 
>  TAP test
> 
> -v3-0002-Write-WAL-for-empty-nbtree-index-build.patch
> 
>  nbtree "fix"
> 
> - v3-0003-Add-infrastructure-to-WAL-logging-skip-feature.patch
> 
>  Pending-sync infrastructure.
> 
> - v3-0004-Fix-WAL-skipping-feature.patch
> 
>  Actual fix of WAL skipping feature.

0004 was shot by e9edc1ba0b. Rebased to the current HEAD.
Successfully built and passeed all regression/recovery tests
including additional recovery/t/016_wal_optimize.pl.

reagrds.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 666d27dbc47c9963e5098904ffb9b173effaf853 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:18:04 +0900
Subject: [PATCH 4/4] Fix WAL skipping feature.

This patch repalces WAL-skipping means from HEAP_INSERT_SKIP_WAL to
pending-sync tracking infrastructure.
---
 src/backend/access/heap/heapam.c        | 71 ++++++++++++++++++++++-----------
 src/backend/access/heap/pruneheap.c     |  3 +-
 src/backend/access/heap/rewriteheap.c   |  3 --
 src/backend/access/heap/visibilitymap.c |  3 +-
 src/backend/commands/copy.c             | 13 +++---
 src/backend/commands/createas.c         |  9 ++---
 src/backend/commands/matview.c          |  6 +--
 src/backend/commands/tablecmds.c        |  5 +--
 src/backend/commands/vacuumlazy.c       |  6 +--
 src/include/access/heapam.h             |  9 ++---
 10 files changed, 73 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7caa3ec248..a68eae9b11 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
  *      the POSTGRES heap access method used for all POSTGRES
  *      relations.
  *
+ * WAL CONSIDERATIONS
+ *      All heap operations are normally WAL-logged. but there are a few
+ *      exceptions. Temporary and unlogged relations never need to be
+ *      WAL-logged, but we can also skip WAL-logging for a table that was
+ *      created in the same transaction, if we don't need WAL for PITR or
+ *      WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *      the file to disk at COMMIT instead.
+ *
+ *      The same-relation optimization is not employed automatically on all
+ *      updates to a table that was created in the same transacton, because
+ *      for a small number of changes, it's cheaper to just create the WAL
+ *      records than fsyncing() the whole relation at COMMIT. It is only
+ *      worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *      or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *      operation; it will cause any subsequent updates to the table to skip
+ *      WAL-logging, if possible, and cause the heap to be synced to disk at
+ *      COMMIT.
+ *
+ *      To make that work, all modifications to heap must use
+ *      BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ *      for the given block.
+ *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -2414,12 +2436,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2455,6 +2471,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * TID where the tuple was stored.  But note that any toasting of fields
  * within the tuple data is NOT reflected into *tup.
  */
+
 Oid
 heap_insert(Relation relation, HeapTuple tup, CommandId cid,
             int options, BulkInsertState bistate)
@@ -2528,7 +2545,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2730,7 +2747,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     int            ndone;
     PGAlignedBlock scratch;
     Page        page;
-    bool        needwal;
     Size        saveFreeSpace;
     bool        need_tuple_data = RelationIsLogicallyLogged(relation);
     bool        need_cids = RelationIsAccessibleInLogicalDecoding(relation);
@@ -2738,7 +2754,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -2780,6 +2795,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
         Buffer        vmbuffer = InvalidBuffer;
         bool        all_visible_cleared = false;
         int            nthispage;
+        bool        needwal;
 
         CHECK_FOR_INTERRUPTS();
 
@@ -2791,6 +2807,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
                                            InvalidBuffer, options, bistate,
                                            &vmbuffer, NULL);
         page = BufferGetPage(buffer);
+        needwal = BufferNeedsWAL(relation, buffer);
 
         /* NO EREPORT(ERROR) from here till changes are logged */
         START_CRIT_SECTION();
@@ -3353,7 +3370,7 @@ l1:
      * NB: heap_abort_speculative() uses the same xlog record and replay
      * routines.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -4110,7 +4127,7 @@ l2:
 
         MarkBufferDirty(buffer);
 
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             xl_heap_lock xlrec;
             XLogRecPtr    recptr;
@@ -4332,7 +4349,8 @@ l2:
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer) ||
+        BufferNeedsWAL(relation, newbuf))
     {
         XLogRecPtr    recptr;
 
@@ -5304,7 +5322,7 @@ failed:
      * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
      * entries for everything anyway.)
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, *buffer))
     {
         xl_heap_lock xlrec;
         XLogRecPtr    recptr;
@@ -6048,7 +6066,7 @@ l4:
         MarkBufferDirty(buf);
 
         /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf))
         {
             xl_heap_lock_updated xlrec;
             XLogRecPtr    recptr;
@@ -6208,7 +6226,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
     htup->t_ctid = tuple->t_self;
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_confirm xlrec;
         XLogRecPtr    recptr;
@@ -6341,7 +6359,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
      * The WAL records generated here match heap_delete().  The same recovery
      * routines are used.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -6450,7 +6468,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_inplace xlrec;
         XLogRecPtr    recptr;
@@ -7646,7 +7664,7 @@ log_heap_clean(Relation reln, Buffer buffer,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
 
     xlrec.latestRemovedXid = latestRemovedXid;
     xlrec.nredirected = nredirected;
@@ -7694,7 +7712,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
     /* nor when there are no tuples to freeze */
     Assert(ntuples > 0);
 
@@ -7779,7 +7797,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
     int            bufflags;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
 
     XLogBeginInsert();
 
@@ -9400,9 +9418,16 @@ heap2_redo(XLogReaderState *record)
  *    heap_sync        - sync a heap, for use when no WAL has been written
  *
  * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
  *
  * Indexes are not touched.  (Currently, index operations associated with
  * the commands that use this are WAL-logged and so do not need fsync.
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index c2f5343dac..d0b68902d9 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
 #include "access/htup_details.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -259,7 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
         /*
          * Emit a WAL HEAP_CLEAN record showing what we did
          */
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index c5db75afa1..d2f78199ee 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -655,9 +655,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * The new relfilenode's relcache entrye doesn't have the necessary
          * information to determine whether a relation should emit data for
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 695567b4b0..fce14ce35f 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
         map[mapByte] |= (flags << mapOffset);
         MarkBufferDirty(vmBuf);
 
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, heapBuf))
         {
             if (XLogRecPtrIsInvalid(recptr))
             {
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b58a74f4e3..f54f80777b 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2391,8 +2391,7 @@ CopyFrom(CopyState cstate)
      *    - data is being written to relfilenode created in this transaction
      * then we can skip writing WAL.  It's safe because if the transaction
      * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the heap_sync at the bottom of this
-     * routine first.
+     * If it does commit, commit will do heap_sync().
      *
      * As mentioned in comments in utils/rel.h, the in-same-transaction test
      * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2424,7 +2423,7 @@ CopyFrom(CopyState cstate)
     {
         hi_options |= HEAP_INSERT_SKIP_FSM;
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(cstate->rel);
     }
 
     /*
@@ -3078,11 +3077,11 @@ CopyFrom(CopyState cstate)
     FreeExecutorState(estate);
 
     /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway)
+     * If we skipped writing WAL, then we will sync the heap at the end of
+     * the transaction. (We used to do it here, but it was later found out
+     * that to be safe, we must also avoid WAL-logging any subsequent
+     * actions on the pages we skipped WAL for). Indexes always use WAL.
      */
-    if (hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(cstate->rel);
 
     return processed;
 }
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index d5cb62da15..0f58da40c6 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -568,8 +568,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->hi_options = HEAP_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+    if (!XLogIsNeeded())
+        heap_register_sync(intoRelationDesc);
+    myState->hi_options = HEAP_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -618,9 +619,7 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->rel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close rel, but keep lock until commit */
     heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index e1eb7c374b..986f7baf39 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -464,7 +464,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
     if (!XLogIsNeeded())
-        myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+        heap_register_sync(transientrel);
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -507,9 +507,7 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->transientrel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close transientrel, but keep lock until commit */
     heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c507a1ab34..98084ad98c 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4617,8 +4617,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         hi_options = HEAP_INSERT_SKIP_FSM;
+
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(newrel);
     }
     else
     {
@@ -4891,8 +4892,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         FreeBulkInsertState(bistate);
 
         /* If we skipped writing WAL, then we need to sync the heap. */
-        if (hi_options & HEAP_INSERT_SKIP_WAL)
-            heap_sync(newrel);
 
         heap_close(newrel, NoLock);
     }
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8996d366e9..72849a9a94 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -924,7 +924,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
                  * page has been previously WAL-logged, and if not, do that
                  * now.
                  */
-                if (RelationNeedsWAL(onerel) &&
+                if (BufferNeedsWAL(onerel, buf) &&
                     PageGetLSN(page) == InvalidXLogRecPtr)
                     log_newpage_buffer(buf, true);
 
@@ -1194,7 +1194,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
             }
 
             /* Now WAL-log freezing if necessary */
-            if (RelationNeedsWAL(onerel))
+            if (BufferNeedsWAL(onerel, buf))
             {
                 XLogRecPtr    recptr;
 
@@ -1575,7 +1575,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(onerel))
+    if (BufferNeedsWAL(onerel, buffer))
     {
         XLogRecPtr    recptr;
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f1d4a803ae..708cdd6cc5 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,11 +25,10 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    0x0001
-#define HEAP_INSERT_SKIP_FSM    0x0002
-#define HEAP_INSERT_FROZEN        0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
-#define HEAP_INSERT_NO_LOGICAL    0x0010
+#define HEAP_INSERT_SKIP_FSM    0x0001
+#define HEAP_INSERT_FROZEN        0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
+#define HEAP_INSERT_NO_LOGICAL    0x0008
 
 typedef struct BulkInsertStateData *BulkInsertState;
 
-- 
2.16.3

From ec791053430111c5ec62d659b9104c8163b95916 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:00:44 +0900
Subject: [PATCH 3/4] Add infrastructure to WAL-logging skip feature

We used to optimize WAL-logging for truncation of in-transaction
crated tables in minimal mode by just singaling by
HEAP_INSERT_SKIP_WAL option on heap operations. This mechanism can
emit WAL records that results in corrupt state for certain series of
in-transaction operations. This patch provides infrastructure to track
pending at-comit fsyncs for a relation and in-transaction truncations.
heap_register_sync() should be used to start tracking before batch
operations like COPY and CLUSTER, and use BufferNeedsWAL() instead of
RelationNeedsWAL() at the places related to WAL-logging about
heap-modifying operations.
---
 src/backend/access/heap/heapam.c    |  31 ++++
 src/backend/access/transam/xact.c   |   7 +
 src/backend/catalog/storage.c       | 317 +++++++++++++++++++++++++++++++++---
 src/backend/commands/tablecmds.c    |   3 +-
 src/backend/storage/buffer/bufmgr.c |  40 ++++-
 src/backend/utils/cache/relcache.c  |  13 ++
 src/include/access/heapam.h         |   1 +
 src/include/catalog/storage.h       |   5 +-
 src/include/storage/bufmgr.h        |   2 +
 src/include/utils/rel.h             |   8 +
 10 files changed, 395 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fb63471a0e..7caa3ec248 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -57,6 +57,7 @@
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
 #include "catalog/index.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -9518,3 +9519,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
         }
     }
 }
+
+/*
+ *    heap_register_sync    - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RecordPendingSync(rel);
+    if (OidIsValid(rel->rd_rel->reltoastrelid))
+    {
+        Relation    toastrel;
+
+        toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+        RecordPendingSync(toastrel);
+        heap_close(toastrel, AccessShareLock);
+    }
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a979d7e07b..2a77f7daa3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2016,6 +2016,9 @@ CommitTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2245,6 +2248,9 @@ PrepareTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2559,6 +2565,7 @@ AbortTransaction(void)
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
     AtAbort_Twophase();
+    smgrDoPendingSyncs(false);    /* abandone pending syncs */
 
     /*
      * Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 5df4382b7e..e14ce64fc4 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -62,6 +63,49 @@ typedef struct PendingRelDelete
 
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+    RelFileNode relnode;        /* relation created in same xact */
+    BlockNumber sync_above;        /* WAL-logging skipped for blocks >=
+                                 * sync_above */
+    BlockNumber truncated_to;    /* truncation WAL record was written */
+}    PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static PendingRelSync *getPendingSyncEntry(Relation rel, bool create);
+
 /*
  * RelationCreateStorage
  *        Create physical storage for a relation.
@@ -259,37 +303,117 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
      */
     if (RelationNeedsWAL(rel))
     {
-        /*
-         * Make an XLOG entry reporting the file truncation.
-         */
-        XLogRecPtr    lsn;
-        xl_smgr_truncate xlrec;
+        PendingRelSync *pending_sync;
 
-        xlrec.blkno = nblocks;
-        xlrec.rnode = rel->rd_node;
-        xlrec.flags = SMGR_TRUNCATE_ALL;
+        /* get pending sync entry, create if not yet */
+        pending_sync = getPendingSyncEntry(rel, true);
 
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+        if (pending_sync->sync_above == InvalidBlockNumber ||
+            pending_sync->sync_above < nblocks)
+        {
+            /*
+             * This is the first time truncation of this relation in this
+             * transaction or truncation that leaves pages that need at-commit
+             * fsync.  Make an XLOG entry reporting the file truncation.
+             */
+            XLogRecPtr        lsn;
+            xl_smgr_truncate xlrec;
 
-        lsn = XLogInsert(RM_SMGR_ID,
-                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+            xlrec.blkno = nblocks;
+            xlrec.rnode = rel->rd_node;
+            xlrec.flags = SMGR_TRUNCATE_ALL;
 
-        /*
-         * Flush, because otherwise the truncation of the main relation might
-         * hit the disk before the WAL record, and the truncation of the FSM
-         * or visibility map. If we crashed during that window, we'd be left
-         * with a truncated heap, but the FSM or visibility map would still
-         * contain entries for the non-existent heap pages.
-         */
-        if (fsm || vm)
-            XLogFlush(lsn);
+            XLogBeginInsert();
+            XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+            lsn = XLogInsert(RM_SMGR_ID,
+                             XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+            elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+                 nblocks);
+
+            /*
+             * Flush, because otherwise the truncation of the main relation
+             * might hit the disk before the WAL record, and the truncation of
+             * the FSM or visibility map. If we crashed during that window,
+             * we'd be left with a truncated heap, but the FSM or visibility
+             * map would still contain entries for the non-existent heap
+             * pages.
+             */
+            if (fsm || vm)
+                XLogFlush(lsn);
+
+            rel->pending_sync->truncated_to = nblocks;
+        }
     }
 
     /* Do the real work */
     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
 }
 
+/*
+ * getPendingSyncEntry: get pendig sync entry.
+ *
+ * Returns pending sync entry for the relation. The entry tracks pending
+ * at-commit fsyncs for the relation.  Creates one if needed when create is
+ * true.
+ */  
+static PendingRelSync *
+getPendingSyncEntry(Relation rel, bool create)
+{
+    PendingRelSync *pendsync_entry = NULL;
+    bool            found;
+
+    if (rel->pending_sync)
+        return rel->pending_sync;
+
+    /* we know we don't have pending sync entry */
+    if (!create && rel->no_pending_sync)
+        return NULL;
+
+    if (!pendingSyncs)
+    {
+        /* First time through: initialize the hash table */
+        HASHCTL        ctl;
+
+        if (!create)
+            return NULL;
+
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(RelFileNode);
+        ctl.entrysize = sizeof(PendingRelSync);
+        ctl.hash = tag_hash;
+        pendingSyncs = hash_create("pending relation sync table", 5,
+                                   &ctl, HASH_ELEM | HASH_FUNCTION);
+    }
+
+    elog(DEBUG2, "getPendingSyncEntry: accessing hash for %d",
+         rel->rd_node.relNode);
+    pendsync_entry = (PendingRelSync *)
+        hash_search(pendingSyncs, (void *) &rel->rd_node,
+                    create ? HASH_ENTER: HASH_FIND,    &found);
+
+    if (!pendsync_entry)
+    {
+        rel->no_pending_sync = true;
+        return NULL;
+    }
+
+    /* new entry created */
+    if (!found)
+    {
+        pendsync_entry->truncated_to = InvalidBlockNumber;
+        pendsync_entry->sync_above = InvalidBlockNumber;
+    }
+
+    /* hold shortcut in Relation */
+    rel->no_pending_sync = false;
+    rel->pending_sync = pendsync_entry;
+
+    return pendsync_entry;
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -367,6 +491,24 @@ smgrDoPendingDeletes(bool isCommit)
     }
 }
 
+/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+    bool found;
+
+    rel->pending_sync = NULL;
+    rel->no_pending_sync = true;
+    if (pendingSyncs)
+    {
+        elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+        hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+    }
+}
+
+
 /*
  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
  *
@@ -418,6 +560,139 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     return nrels;
 }
 
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+    BlockNumber nblocks;
+    PendingRelSync *pending_sync;
+
+    Assert(RelationNeedsWAL(rel));
+
+    /* get pending sync entry, create if not yet  */
+    pending_sync = getPendingSyncEntry(rel, true);
+
+    nblocks = RelationGetNumberOfBlocks(rel);
+
+    if (pending_sync->sync_above != InvalidBlockNumber)
+    {
+        elog(DEBUG2,
+             "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             rel->pending_sync->sync_above, nblocks);
+
+        return;
+    }
+
+    elog(DEBUG2,
+         "registering new pending sync for rel %u/%u/%u at block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+         nblocks);
+    pending_sync->sync_above = nblocks;
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+    BlockNumber        blkno = InvalidBlockNumber;
+    PendingRelSync *pending_sync;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    /* fetch exising pending sync entry */
+    pending_sync = getPendingSyncEntry(rel, false);
+
+    /*
+     * no point in doing further work if we know that we don't have pending
+     * sync
+     */
+    if (!pending_sync)
+        return true;
+
+    Assert(BufferIsValid(buf));
+
+    blkno = BufferGetBlockNumber(buf);
+
+    /* we don't skip WAL-logging for pages that already done */
+    if (pending_sync->sync_above == InvalidBlockNumber ||
+        pending_sync->sync_above > blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno, rel->pending_sync->sync_above);
+        return true;
+    }
+
+    /*
+     * We have emitted a truncation record for this block.
+     */
+    if (pending_sync->truncated_to != InvalidBlockNumber &&
+        pending_sync->truncated_to <= blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same
xact",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno);
+        return true;
+    }
+
+    elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+         blkno);
+
+    return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    if (!pendingSyncs)
+        return;
+
+    if (isCommit)
+    {
+        HASH_SEQ_STATUS status;
+        PendingRelSync *pending;
+
+        hash_seq_init(&status, pendingSyncs);
+
+        while ((pending = hash_seq_search(&status)) != NULL)
+        {
+            if (pending->sync_above != InvalidBlockNumber)
+            {
+                FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+                smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+                elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+                     pending->relnode.dbNode, pending->relnode.relNode);
+            }
+        }
+    }
+
+    hash_destroy(pendingSyncs);
+    pendingSyncs = NULL;
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 946119fa86..c507a1ab34 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11025,11 +11025,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     /*
      * Create and copy all forks of the relation, and schedule unlinking of
-     * old physical files.
+     * old physical files. Pending syncs for the old node is no longer needed.
      *
      * NOTE: any conflict in relfilenode value will be caught in
      * RelationCreateStorage().
      */
+    RelationRemovePendingSync(rel);
     RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
 
     /* copy main fork */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe5706..1095f6c721 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
             BufferAccessStrategy strategy,
             bool *foundPtr);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int    rnode_comparator(const void *p1, const void *p2);
@@ -3147,20 +3148,41 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
     /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int            i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3177,7 +3199,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3207,18 +3229,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index aecbd4a943..280b481e88 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -77,6 +77,7 @@
 #include "pgstat.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/array.h"
@@ -419,6 +420,10 @@ AllocateRelationDesc(Form_pg_class relp)
     /* which we mark as a reference-counted tupdesc */
     relation->rd_att->tdrefcount = 1;
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+
     MemoryContextSwitchTo(oldcxt);
 
     return relation;
@@ -1872,6 +1877,10 @@ formrdesc(const char *relationName, Oid relationReltype,
         relation->rd_rel->relhasindex = true;
     }
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+
     /*
      * add new reldesc to relcache
      */
@@ -3271,6 +3280,10 @@ RelationBuildLocalRelation(const char *relname,
     else
         rel->rd_rel->relfilenode = relfilenode;
 
+    /* newly built relation has no pending sync */
+    rel->no_pending_sync = true;
+    rel->pending_sync = NULL;
+
     RelationInitLockInfo(rel);    /* see lmgr.c */
 
     RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 40e153f71a..f1d4a803ae 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -181,6 +181,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                    HeapTuple tup);
 
+extern void heap_register_sync(Relation relation);
 extern void heap_sync(Relation relation);
 extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
 
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ef52d85803..49d93cd01f 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
 extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3cce3906a0..9fae7c6ae5 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                 ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                    bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                        ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 84469f5715..55af2aa6bc 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -188,6 +188,14 @@ typedef struct RelationData
 
     /* use "struct" here to avoid needing to include pgstat.h: */
     struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+    /*
+     * no_pending_sync is true if this relation is known not to have pending
+     * syncs.  Elsewise searching for registered sync is required if
+     * pending_sync is NULL.
+     */
+    bool                   no_pending_sync;
+    struct PendingRelSync *pending_sync;
 } RelationData;
 
 
-- 
2.16.3

From c6e5f68e7b0e6036ff96c7789f9f4314e449a990 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 2/4] Write WAL for empty nbtree index build

After relation truncation indexes are also rebuild. It doesn't emit
WAL in minimal mode and if truncation happened within its creation
transaction, crash recovery leaves an empty index heap, which is
considered broken. This patch forces to emit WAL when an index_build
turns into empty nbtree index.
---
 src/backend/access/nbtree/nbtsort.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..2c2647b530 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -610,8 +610,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
     /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
     RelationOpenSmgr(wstate->index);
 
-    /* XLOG stuff */
-    if (wstate->btws_use_wal)
+    /* XLOG stuff
+     *
+     * Even if minimal mode, WAL is required here if truncation happened after
+     * being created in the same transaction. It is not needed otherwise but
+     * we don't bother identifying the case precisely.
+     */
+    if (wstate->btws_use_wal ||
+        (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
     {
         /* We use the heap NEWPAGE record type for this */
         log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
@@ -1055,6 +1061,11 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
      * set to point to "P_NONE").  This changes the index to the "valid" state
      * by filling in a valid magic number in the metapage.
      */
+    /*
+     * If no tuple was inserted, it's possible that we are truncating a
+     * relation. We need to emit WAL for the metapage in the case. However it
+     * is not required elsewise,
+     */
     metapage = (Page) palloc(BLCKSZ);
     _bt_initmetapage(metapage, rootblkno, rootlevel);
     _bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
-- 
2.16.3

From ee1624fe2f3d556da2ce9b41c32576fedef686fa Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/4] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/016_wal_optimize.pl | 192 ++++++++++++++++++++++++++++++++
 1 file changed, 192 insertions(+)
 create mode 100644 src/test/recovery/t/016_wal_optimize.pl

diff --git a/src/test/recovery/t/016_wal_optimize.pl b/src/test/recovery/t/016_wal_optimize.pl
new file mode 100644
index 0000000000..310772a2b3
--- /dev/null
+++ b/src/test/recovery/t/016_wal_optimize.pl
@@ -0,0 +1,192 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 14;
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    # Primary needs to have wal_level = minimal here
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+    $node->start;
+
+    # Test direct truncation optimization.  No tuples
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test1 (id serial PRIMARY KEY);
+        TRUNCATE test1;
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+    is($result, qq(0),
+       "wal_level = $wal_level, optimized truncation with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2 (id serial PRIMARY KEY);
+        INSERT INTO test2 VALUES (DEFAULT);
+        TRUNCATE test2;
+        INSERT INTO test2 VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with inserted table");
+
+    # Data file for COPY query in follow-up tests.
+    my $basedir = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after the
+    # truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test3;
+        COPY test3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+    is($result, qq(3),
+       "wal_level = $wal_level, optimized truncation with copied table");
+
+    # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test4;
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COPY test4 FROM '$copy_file' DELIMITER ',';
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+    is($result, qq(5),
+       "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test5 VALUES (DEFAULT, 1);
+        COPY test5 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER test6_before_row_insert
+          BEFORE INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+        CREATE TRIGGER test6_after_row_insert
+          AFTER INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+        COPY test6 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+    is($result, qq(9),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER test7_before_stat_truncate
+          BEFORE TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+        CREATE TRIGGER test7_after_stat_truncate
+          AFTER TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+        INSERT INTO test7 VALUES (DEFAULT, 1);
+        TRUNCATE test7;
+        COPY test7 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    $node->teardown_node;
+    $node->clean_node;
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.16.3

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Dmitry Dolgov

Дата:

30 ноября 2018 г., 20:27:05

> On Wed, Nov 14, 2018 at 4:48 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> 0004 was shot by e9edc1ba0b. Rebased to the current HEAD.
> Successfully built and passeed all regression/recovery tests
> including additional recovery/t/016_wal_optimize.pl.

Thank you for working on this patch. Unfortunately, cfbot complains that
v4-0004-Fix-WAL-skipping-feature.patch could not be applied without conflicts.
Could you please post a rebased version one more time?

> On Fri, Jul 27, 2018 at 9:26 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
>
> On 07/18/2018 10:58 AM, Heikki Linnakangas wrote:
> > On 18/07/18 16:29, Robert Haas wrote:
> >> On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier
> >> <michael@paquier.xyz> wrote:
> >>>> What's wrong with the approach proposed in
> >>>> http://postgr.es/m/55AFC302.1060805@iki.fi ?
> >>>
> >>> For back-branches that's very invasive so that seems risky to me
> >>> particularly seeing the low number of complaints on the matter.
> >>
> >> Hmm. I think that if you disable the optimization, you're betting that
> >> people won't mind losing performance in this case in a maintenance
> >> release.  If you back-patch Heikki's approach, you're betting that the
> >> committed version doesn't have any bugs that are worse than the status
> >> quo.  Personally, I'd rather take the latter bet.  Maybe the patch
> >> isn't all there yet, but that seems like something we can work
> >> towards.  If we just give up and disable the optimization, we won't
> >> know how many people we ticked off or how badly until after we've done
> >> it.
> >
> > Yeah. I'm not happy about backpatching a big patch like what I
> > proposed, and Kyotaro developed further. But I think it's the least
> > bad option we have, the other options discussed seem even worse.
> >
> > One way to review the patch is to look at what it changes, when
> > wal_level is *not* set to minimal, i.e. what risk or overhead does it
> > pose to users who are not affected by this bug? It seems pretty safe
> > to me.
> >
> > The other aspect is, how confident are we that this actually fixes the
> > bug, with least impact to users using wal_level='minimal'? I think
> > it's the best shot we have so far. All the other proposals either
> > don't fully fix the bug, or hurt performance in some legit cases.
> >
> > I'd suggest that we continue based on the patch that Kyotaro posted at
> > https://www.postgresql.org/message-id/20180330.100646.86008470.horiguchi.kyotaro%40lab.ntt.co.jp.
> >
> I have just spent some time reviewing Kyatoro's patch. I'm a bit
> nervous, too, given the size. But I'm also nervous about leaving things
> as they are. I suspect the reason we haven't heard more about this is
> that these days use of "wal_level = minimal" is relatively rare.

I'm totally out of context of this patch, but reading this makes me nervous
too. Taking into account that the problem now is lack of review, do you have
plans to spend more time reviewing this patch?

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

20 декабря 2018 г., 11:32:25

Hello.

At Fri, 30 Nov 2018 18:27:05 +0100, Dmitry Dolgov <9erthalion6@gmail.com> wrote in
<CA+q6zcV6MUg1BEoQUywX917Oiz6JoMdoZ1Vu3RT5GgBb-yPszg@mail.gmail.com>
> > On Wed, Nov 14, 2018 at 4:48 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >
> > 0004 was shot by e9edc1ba0b. Rebased to the current HEAD.
> > Successfully built and passeed all regression/recovery tests
> > including additional recovery/t/016_wal_optimize.pl.
> 
> Thank you for working on this patch. Unfortunately, cfbot complains that
> v4-0004-Fix-WAL-skipping-feature.patch could not be applied without conflicts.
> Could you please post a rebased version one more time?

Thanks. Here's the rebased version. I found no other amendment
required other than the apparent conflict.


> > On Fri, Jul 27, 2018 at 9:26 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
> >
> > On 07/18/2018 10:58 AM, Heikki Linnakangas wrote:
> > > On 18/07/18 16:29, Robert Haas wrote:
> > >> On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier
> > >> <michael@paquier.xyz> wrote:
> > >>>> What's wrong with the approach proposed in
> > >>>> http://postgr.es/m/55AFC302.1060805@iki.fi ?
> > >>>
> > >>> For back-branches that's very invasive so that seems risky to me
> > >>> particularly seeing the low number of complaints on the matter.
> > >>
> > >> Hmm. I think that if you disable the optimization, you're betting that
> > >> people won't mind losing performance in this case in a maintenance
> > >> release.  If you back-patch Heikki's approach, you're betting that the
> > >> committed version doesn't have any bugs that are worse than the status
> > >> quo.  Personally, I'd rather take the latter bet.  Maybe the patch
> > >> isn't all there yet, but that seems like something we can work
> > >> towards.  If we just give up and disable the optimization, we won't
> > >> know how many people we ticked off or how badly until after we've done
> > >> it.
> > >
> > > Yeah. I'm not happy about backpatching a big patch like what I
> > > proposed, and Kyotaro developed further. But I think it's the least
> > > bad option we have, the other options discussed seem even worse.
> > >
> > > One way to review the patch is to look at what it changes, when
> > > wal_level is *not* set to minimal, i.e. what risk or overhead does it
> > > pose to users who are not affected by this bug? It seems pretty safe
> > > to me.
> > >
> > > The other aspect is, how confident are we that this actually fixes the
> > > bug, with least impact to users using wal_level='minimal'? I think
> > > it's the best shot we have so far. All the other proposals either
> > > don't fully fix the bug, or hurt performance in some legit cases.
> > >
> > > I'd suggest that we continue based on the patch that Kyotaro posted at
> > > https://www.postgresql.org/message-id/20180330.100646.86008470.horiguchi.kyotaro%40lab.ntt.co.jp.
> > >
> > I have just spent some time reviewing Kyatoro's patch. I'm a bit
> > nervous, too, given the size. But I'm also nervous about leaving things
> > as they are. I suspect the reason we haven't heard more about this is
> > that these days use of "wal_level = minimal" is relatively rare.
> 
> I'm totally out of context of this patch, but reading this makes me nervous
> too. Taking into account that the problem now is lack of review, do you have
> plans to spend more time reviewing this patch?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 120f3f1d4dc47eb74a6ad7fde3c116e31b8eab3e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/4] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/016_wal_optimize.pl | 192 ++++++++++++++++++++++++++++++++
 1 file changed, 192 insertions(+)
 create mode 100644 src/test/recovery/t/016_wal_optimize.pl

diff --git a/src/test/recovery/t/016_wal_optimize.pl b/src/test/recovery/t/016_wal_optimize.pl
new file mode 100644
index 0000000000..310772a2b3
--- /dev/null
+++ b/src/test/recovery/t/016_wal_optimize.pl
@@ -0,0 +1,192 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 14;
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    # Primary needs to have wal_level = minimal here
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+    $node->start;
+
+    # Test direct truncation optimization.  No tuples
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test1 (id serial PRIMARY KEY);
+        TRUNCATE test1;
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+    is($result, qq(0),
+       "wal_level = $wal_level, optimized truncation with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2 (id serial PRIMARY KEY);
+        INSERT INTO test2 VALUES (DEFAULT);
+        TRUNCATE test2;
+        INSERT INTO test2 VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with inserted table");
+
+    # Data file for COPY query in follow-up tests.
+    my $basedir = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after the
+    # truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test3;
+        COPY test3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+    is($result, qq(3),
+       "wal_level = $wal_level, optimized truncation with copied table");
+
+    # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test4;
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COPY test4 FROM '$copy_file' DELIMITER ',';
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+    is($result, qq(5),
+       "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test5 VALUES (DEFAULT, 1);
+        COPY test5 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER test6_before_row_insert
+          BEFORE INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+        CREATE TRIGGER test6_after_row_insert
+          AFTER INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+        COPY test6 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+    is($result, qq(9),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER test7_before_stat_truncate
+          BEFORE TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+        CREATE TRIGGER test7_after_stat_truncate
+          AFTER TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+        INSERT INTO test7 VALUES (DEFAULT, 1);
+        TRUNCATE test7;
+        COPY test7 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    $node->teardown_node;
+    $node->clean_node;
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.16.3

From 7b29c2c9b3d19fd6230bc5663df9d6953197479a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 2/4] Write WAL for empty nbtree index build

After relation truncation indexes are also rebuild. It doesn't emit
WAL in minimal mode and if truncation happened within its creation
transaction, crash recovery leaves an empty index heap, which is
considered broken. This patch forces to emit WAL when an index_build
turns into empty nbtree index.
---
 src/backend/access/nbtree/nbtsort.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..2c2647b530 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -610,8 +610,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
     /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
     RelationOpenSmgr(wstate->index);
 
-    /* XLOG stuff */
-    if (wstate->btws_use_wal)
+    /* XLOG stuff
+     *
+     * Even if minimal mode, WAL is required here if truncation happened after
+     * being created in the same transaction. It is not needed otherwise but
+     * we don't bother identifying the case precisely.
+     */
+    if (wstate->btws_use_wal ||
+        (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
     {
         /* We use the heap NEWPAGE record type for this */
         log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
@@ -1055,6 +1061,11 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
      * set to point to "P_NONE").  This changes the index to the "valid" state
      * by filling in a valid magic number in the metapage.
      */
+    /*
+     * If no tuple was inserted, it's possible that we are truncating a
+     * relation. We need to emit WAL for the metapage in the case. However it
+     * is not required elsewise,
+     */
     metapage = (Page) palloc(BLCKSZ);
     _bt_initmetapage(metapage, rootblkno, rootlevel);
     _bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
-- 
2.16.3

From 92d023071580e3f211a82b191b1afe9afbe824b1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:00:44 +0900
Subject: [PATCH 3/4] Add infrastructure to WAL-logging skip feature

We used to optimize WAL-logging for truncation of in-transaction
crated tables in minimal mode by just singaling by
HEAP_INSERT_SKIP_WAL option on heap operations. This mechanism can
emit WAL records that results in corrupt state for certain series of
in-transaction operations. This patch provides infrastructure to track
pending at-comit fsyncs for a relation and in-transaction truncations.
heap_register_sync() should be used to start tracking before batch
operations like COPY and CLUSTER, and use BufferNeedsWAL() instead of
RelationNeedsWAL() at the places related to WAL-logging about
heap-modifying operations.
---
 src/backend/access/heap/heapam.c    |  31 ++++
 src/backend/access/transam/xact.c   |   7 +
 src/backend/catalog/storage.c       | 317 +++++++++++++++++++++++++++++++++---
 src/backend/commands/tablecmds.c    |   3 +-
 src/backend/storage/buffer/bufmgr.c |  40 ++++-
 src/backend/utils/cache/relcache.c  |  13 ++
 src/include/access/heapam.h         |   1 +
 src/include/catalog/storage.h       |   5 +-
 src/include/storage/bufmgr.h        |   2 +
 src/include/utils/rel.h             |   8 +
 10 files changed, 395 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9650145642..8f1ea73541 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -57,6 +57,7 @@
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
 #include "catalog/index.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -9460,3 +9461,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
         }
     }
 }
+
+/*
+ *    heap_register_sync    - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RecordPendingSync(rel);
+    if (OidIsValid(rel->rd_rel->reltoastrelid))
+    {
+        Relation    toastrel;
+
+        toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+        RecordPendingSync(toastrel);
+        heap_close(toastrel, AccessShareLock);
+    }
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d967400384..d79b2a94dc 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2020,6 +2020,9 @@ CommitTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2249,6 +2252,9 @@ PrepareTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2563,6 +2569,7 @@ AbortTransaction(void)
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
     AtAbort_Twophase();
+    smgrDoPendingSyncs(false);    /* abandone pending syncs */
 
     /*
      * Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 5df4382b7e..e14ce64fc4 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -62,6 +63,49 @@ typedef struct PendingRelDelete
 
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+    RelFileNode relnode;        /* relation created in same xact */
+    BlockNumber sync_above;        /* WAL-logging skipped for blocks >=
+                                 * sync_above */
+    BlockNumber truncated_to;    /* truncation WAL record was written */
+}    PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static PendingRelSync *getPendingSyncEntry(Relation rel, bool create);
+
 /*
  * RelationCreateStorage
  *        Create physical storage for a relation.
@@ -259,37 +303,117 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
      */
     if (RelationNeedsWAL(rel))
     {
-        /*
-         * Make an XLOG entry reporting the file truncation.
-         */
-        XLogRecPtr    lsn;
-        xl_smgr_truncate xlrec;
+        PendingRelSync *pending_sync;
 
-        xlrec.blkno = nblocks;
-        xlrec.rnode = rel->rd_node;
-        xlrec.flags = SMGR_TRUNCATE_ALL;
+        /* get pending sync entry, create if not yet */
+        pending_sync = getPendingSyncEntry(rel, true);
 
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+        if (pending_sync->sync_above == InvalidBlockNumber ||
+            pending_sync->sync_above < nblocks)
+        {
+            /*
+             * This is the first time truncation of this relation in this
+             * transaction or truncation that leaves pages that need at-commit
+             * fsync.  Make an XLOG entry reporting the file truncation.
+             */
+            XLogRecPtr        lsn;
+            xl_smgr_truncate xlrec;
 
-        lsn = XLogInsert(RM_SMGR_ID,
-                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+            xlrec.blkno = nblocks;
+            xlrec.rnode = rel->rd_node;
+            xlrec.flags = SMGR_TRUNCATE_ALL;
 
-        /*
-         * Flush, because otherwise the truncation of the main relation might
-         * hit the disk before the WAL record, and the truncation of the FSM
-         * or visibility map. If we crashed during that window, we'd be left
-         * with a truncated heap, but the FSM or visibility map would still
-         * contain entries for the non-existent heap pages.
-         */
-        if (fsm || vm)
-            XLogFlush(lsn);
+            XLogBeginInsert();
+            XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+            lsn = XLogInsert(RM_SMGR_ID,
+                             XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+            elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+                 nblocks);
+
+            /*
+             * Flush, because otherwise the truncation of the main relation
+             * might hit the disk before the WAL record, and the truncation of
+             * the FSM or visibility map. If we crashed during that window,
+             * we'd be left with a truncated heap, but the FSM or visibility
+             * map would still contain entries for the non-existent heap
+             * pages.
+             */
+            if (fsm || vm)
+                XLogFlush(lsn);
+
+            rel->pending_sync->truncated_to = nblocks;
+        }
     }
 
     /* Do the real work */
     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
 }
 
+/*
+ * getPendingSyncEntry: get pendig sync entry.
+ *
+ * Returns pending sync entry for the relation. The entry tracks pending
+ * at-commit fsyncs for the relation.  Creates one if needed when create is
+ * true.
+ */  
+static PendingRelSync *
+getPendingSyncEntry(Relation rel, bool create)
+{
+    PendingRelSync *pendsync_entry = NULL;
+    bool            found;
+
+    if (rel->pending_sync)
+        return rel->pending_sync;
+
+    /* we know we don't have pending sync entry */
+    if (!create && rel->no_pending_sync)
+        return NULL;
+
+    if (!pendingSyncs)
+    {
+        /* First time through: initialize the hash table */
+        HASHCTL        ctl;
+
+        if (!create)
+            return NULL;
+
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(RelFileNode);
+        ctl.entrysize = sizeof(PendingRelSync);
+        ctl.hash = tag_hash;
+        pendingSyncs = hash_create("pending relation sync table", 5,
+                                   &ctl, HASH_ELEM | HASH_FUNCTION);
+    }
+
+    elog(DEBUG2, "getPendingSyncEntry: accessing hash for %d",
+         rel->rd_node.relNode);
+    pendsync_entry = (PendingRelSync *)
+        hash_search(pendingSyncs, (void *) &rel->rd_node,
+                    create ? HASH_ENTER: HASH_FIND,    &found);
+
+    if (!pendsync_entry)
+    {
+        rel->no_pending_sync = true;
+        return NULL;
+    }
+
+    /* new entry created */
+    if (!found)
+    {
+        pendsync_entry->truncated_to = InvalidBlockNumber;
+        pendsync_entry->sync_above = InvalidBlockNumber;
+    }
+
+    /* hold shortcut in Relation */
+    rel->no_pending_sync = false;
+    rel->pending_sync = pendsync_entry;
+
+    return pendsync_entry;
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -367,6 +491,24 @@ smgrDoPendingDeletes(bool isCommit)
     }
 }
 
+/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+    bool found;
+
+    rel->pending_sync = NULL;
+    rel->no_pending_sync = true;
+    if (pendingSyncs)
+    {
+        elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+        hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+    }
+}
+
+
 /*
  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
  *
@@ -418,6 +560,139 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     return nrels;
 }
 
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+    BlockNumber nblocks;
+    PendingRelSync *pending_sync;
+
+    Assert(RelationNeedsWAL(rel));
+
+    /* get pending sync entry, create if not yet  */
+    pending_sync = getPendingSyncEntry(rel, true);
+
+    nblocks = RelationGetNumberOfBlocks(rel);
+
+    if (pending_sync->sync_above != InvalidBlockNumber)
+    {
+        elog(DEBUG2,
+             "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             rel->pending_sync->sync_above, nblocks);
+
+        return;
+    }
+
+    elog(DEBUG2,
+         "registering new pending sync for rel %u/%u/%u at block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+         nblocks);
+    pending_sync->sync_above = nblocks;
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+    BlockNumber        blkno = InvalidBlockNumber;
+    PendingRelSync *pending_sync;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    /* fetch exising pending sync entry */
+    pending_sync = getPendingSyncEntry(rel, false);
+
+    /*
+     * no point in doing further work if we know that we don't have pending
+     * sync
+     */
+    if (!pending_sync)
+        return true;
+
+    Assert(BufferIsValid(buf));
+
+    blkno = BufferGetBlockNumber(buf);
+
+    /* we don't skip WAL-logging for pages that already done */
+    if (pending_sync->sync_above == InvalidBlockNumber ||
+        pending_sync->sync_above > blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno, rel->pending_sync->sync_above);
+        return true;
+    }
+
+    /*
+     * We have emitted a truncation record for this block.
+     */
+    if (pending_sync->truncated_to != InvalidBlockNumber &&
+        pending_sync->truncated_to <= blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same
xact",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno);
+        return true;
+    }
+
+    elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+         blkno);
+
+    return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    if (!pendingSyncs)
+        return;
+
+    if (isCommit)
+    {
+        HASH_SEQ_STATUS status;
+        PendingRelSync *pending;
+
+        hash_seq_init(&status, pendingSyncs);
+
+        while ((pending = hash_seq_search(&status)) != NULL)
+        {
+            if (pending->sync_above != InvalidBlockNumber)
+            {
+                FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+                smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+                elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+                     pending->relnode.dbNode, pending->relnode.relNode);
+            }
+        }
+    }
+
+    hash_destroy(pendingSyncs);
+    pendingSyncs = NULL;
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ad8c176793..879c3d981e 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -10905,11 +10905,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     /*
      * Create and copy all forks of the relation, and schedule unlinking of
-     * old physical files.
+     * old physical files. Pending syncs for the old node is no longer needed.
      *
      * NOTE: any conflict in relfilenode value will be caught in
      * RelationCreateStorage().
      */
+    RelationRemovePendingSync(rel);
     RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
 
     /* copy main fork */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 9817770aff..1cb93ca486 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
             BufferAccessStrategy strategy,
             bool *foundPtr);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int    rnode_comparator(const void *p1, const void *p2);
@@ -3153,20 +3154,41 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
     /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int            i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3183,7 +3205,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3213,18 +3235,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index c3071db1cd..40b00e1275 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -77,6 +77,7 @@
 #include "pgstat.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/array.h"
@@ -417,6 +418,10 @@ AllocateRelationDesc(Form_pg_class relp)
     /* which we mark as a reference-counted tupdesc */
     relation->rd_att->tdrefcount = 1;
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+
     MemoryContextSwitchTo(oldcxt);
 
     return relation;
@@ -1868,6 +1873,10 @@ formrdesc(const char *relationName, Oid relationReltype,
         relation->rd_rel->relhasindex = true;
     }
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+
     /*
      * add new reldesc to relcache
      */
@@ -3264,6 +3273,10 @@ RelationBuildLocalRelation(const char *relname,
     else
         rel->rd_rel->relfilenode = relfilenode;
 
+    /* newly built relation has no pending sync */
+    rel->no_pending_sync = true;
+    rel->pending_sync = NULL;
+
     RelationInitLockInfo(rel);    /* see lmgr.c */
 
     RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 64cfdbd2f0..4baa287c8c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -181,6 +181,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                    HeapTuple tup);
 
+extern void heap_register_sync(Relation relation);
 extern void heap_sync(Relation relation);
 extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
 
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ef52d85803..49d93cd01f 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
 extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3cce3906a0..9fae7c6ae5 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -190,6 +190,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                 ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                    bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                        ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 2217081dcc..db60eddea0 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -187,6 +187,14 @@ typedef struct RelationData
 
     /* use "struct" here to avoid needing to include pgstat.h: */
     struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+    /*
+     * no_pending_sync is true if this relation is known not to have pending
+     * syncs.  Elsewise searching for registered sync is required if
+     * pending_sync is NULL.
+     */
+    bool                   no_pending_sync;
+    struct PendingRelSync *pending_sync;
 } RelationData;
 
 
-- 
2.16.3

From 5b4cb2ba0065bf40f6eedca35e6c262e4f5d7050 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:18:04 +0900
Subject: [PATCH 4/4] Fix WAL skipping feature.

This patch repalces WAL-skipping means from HEAP_INSERT_SKIP_WAL to
pending-sync tracking infrastructure.
---
 src/backend/access/heap/heapam.c        | 70 ++++++++++++++++++++++-----------
 src/backend/access/heap/pruneheap.c     |  3 +-
 src/backend/access/heap/rewriteheap.c   |  3 --
 src/backend/access/heap/visibilitymap.c |  3 +-
 src/backend/commands/copy.c             | 13 +++---
 src/backend/commands/createas.c         |  9 ++---
 src/backend/commands/matview.c          |  6 +--
 src/backend/commands/tablecmds.c        |  5 +--
 src/backend/commands/vacuumlazy.c       |  6 +--
 src/include/access/heapam.h             |  9 ++---
 10 files changed, 72 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8f1ea73541..c9c254a032 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
  *      the POSTGRES heap access method used for all POSTGRES
  *      relations.
  *
+ * WAL CONSIDERATIONS
+ *      All heap operations are normally WAL-logged. but there are a few
+ *      exceptions. Temporary and unlogged relations never need to be
+ *      WAL-logged, but we can also skip WAL-logging for a table that was
+ *      created in the same transaction, if we don't need WAL for PITR or
+ *      WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *      the file to disk at COMMIT instead.
+ *
+ *      The same-relation optimization is not employed automatically on all
+ *      updates to a table that was created in the same transacton, because
+ *      for a small number of changes, it's cheaper to just create the WAL
+ *      records than fsyncing() the whole relation at COMMIT. It is only
+ *      worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *      or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *      operation; it will cause any subsequent updates to the table to skip
+ *      WAL-logging, if possible, and cause the heap to be synced to disk at
+ *      COMMIT.
+ *
+ *      To make that work, all modifications to heap must use
+ *      BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ *      for the given block.
+ *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -2414,12 +2436,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2528,7 +2544,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2704,7 +2720,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     int            ndone;
     PGAlignedBlock scratch;
     Page        page;
-    bool        needwal;
     Size        saveFreeSpace;
     bool        need_tuple_data = RelationIsLogicallyLogged(relation);
     bool        need_cids = RelationIsAccessibleInLogicalDecoding(relation);
@@ -2712,7 +2727,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -2754,6 +2768,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
         Buffer        vmbuffer = InvalidBuffer;
         bool        all_visible_cleared = false;
         int            nthispage;
+        bool        needwal;
 
         CHECK_FOR_INTERRUPTS();
 
@@ -2765,6 +2780,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
                                            InvalidBuffer, options, bistate,
                                            &vmbuffer, NULL);
         page = BufferGetPage(buffer);
+        needwal = BufferNeedsWAL(relation, buffer);
 
         /* NO EREPORT(ERROR) from here till changes are logged */
         START_CRIT_SECTION();
@@ -3327,7 +3343,7 @@ l1:
      * NB: heap_abort_speculative() uses the same xlog record and replay
      * routines.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -4069,7 +4085,7 @@ l2:
 
         MarkBufferDirty(buffer);
 
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             xl_heap_lock xlrec;
             XLogRecPtr    recptr;
@@ -4291,7 +4307,8 @@ l2:
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer) ||
+        BufferNeedsWAL(relation, newbuf))
     {
         XLogRecPtr    recptr;
 
@@ -5263,7 +5280,7 @@ failed:
      * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
      * entries for everything anyway.)
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, *buffer))
     {
         xl_heap_lock xlrec;
         XLogRecPtr    recptr;
@@ -6007,7 +6024,7 @@ l4:
         MarkBufferDirty(buf);
 
         /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf))
         {
             xl_heap_lock_updated xlrec;
             XLogRecPtr    recptr;
@@ -6167,7 +6184,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
     htup->t_ctid = tuple->t_self;
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_confirm xlrec;
         XLogRecPtr    recptr;
@@ -6300,7 +6317,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
      * The WAL records generated here match heap_delete().  The same recovery
      * routines are used.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -6409,7 +6426,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_inplace xlrec;
         XLogRecPtr    recptr;
@@ -7605,7 +7622,7 @@ log_heap_clean(Relation reln, Buffer buffer,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
 
     xlrec.latestRemovedXid = latestRemovedXid;
     xlrec.nredirected = nredirected;
@@ -7653,7 +7670,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
     /* nor when there are no tuples to freeze */
     Assert(ntuples > 0);
 
@@ -7738,7 +7755,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
     int            bufflags;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
 
     XLogBeginInsert();
 
@@ -9342,9 +9359,16 @@ heap2_redo(XLogReaderState *record)
  *    heap_sync        - sync a heap, for use when no WAL has been written
  *
  * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
  *
  * Indexes are not touched.  (Currently, index operations associated with
  * the commands that use this are WAL-logged and so do not need fsync.
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index c2f5343dac..d0b68902d9 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
 #include "access/htup_details.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -259,7 +260,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
         /*
          * Emit a WAL HEAP_CLEAN record showing what we did
          */
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 44caeca336..ecddc40329 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -655,9 +655,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 695567b4b0..fce14ce35f 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
         map[mapByte] |= (flags << mapOffset);
         MarkBufferDirty(vmBuf);
 
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, heapBuf))
         {
             if (XLogRecPtrIsInvalid(recptr))
             {
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 4311e16007..d583b5a8a3 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2364,8 +2364,7 @@ CopyFrom(CopyState cstate)
      *    - data is being written to relfilenode created in this transaction
      * then we can skip writing WAL.  It's safe because if the transaction
      * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the heap_sync at the bottom of this
-     * routine first.
+     * If it does commit, commit will do heap_sync().
      *
      * As mentioned in comments in utils/rel.h, the in-same-transaction test
      * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2406,7 +2405,7 @@ CopyFrom(CopyState cstate)
     {
         hi_options |= HEAP_INSERT_SKIP_FSM;
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(cstate->rel);
     }
 
     /*
@@ -3036,11 +3035,11 @@ CopyFrom(CopyState cstate)
     FreeExecutorState(estate);
 
     /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway)
+     * If we skipped writing WAL, then we will sync the heap at the end of
+     * the transaction. (We used to do it here, but it was later found out
+     * that to be safe, we must also avoid WAL-logging any subsequent
+     * actions on the pages we skipped WAL for). Indexes always use WAL.
      */
-    if (hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(cstate->rel);
 
     return processed;
 }
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index d01b258b65..3d32d07d69 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -555,8 +555,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->hi_options = HEAP_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+    if (!XLogIsNeeded())
+        heap_register_sync(intoRelationDesc);
+    myState->hi_options = HEAP_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -599,9 +600,7 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->rel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close rel, but keep lock until commit */
     heap_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index a171ebabf8..174aa3376a 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -461,7 +461,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
     if (!XLogIsNeeded())
-        myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+        heap_register_sync(transientrel);
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -507,9 +507,7 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->transientrel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close transientrel, but keep lock until commit */
     heap_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 879c3d981e..ce8f7cd881 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4591,8 +4591,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         hi_options = HEAP_INSERT_SKIP_FSM;
+
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(newrel);
     }
     else
     {
@@ -4857,8 +4858,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         FreeBulkInsertState(bistate);
 
         /* If we skipped writing WAL, then we need to sync the heap. */
-        if (hi_options & HEAP_INSERT_SKIP_WAL)
-            heap_sync(newrel);
 
         heap_close(newrel, NoLock);
     }
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8134c52253..28caf92073 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -924,7 +924,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
                  * page has been previously WAL-logged, and if not, do that
                  * now.
                  */
-                if (RelationNeedsWAL(onerel) &&
+                if (BufferNeedsWAL(onerel, buf) &&
                     PageGetLSN(page) == InvalidXLogRecPtr)
                     log_newpage_buffer(buf, true);
 
@@ -1188,7 +1188,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
             }
 
             /* Now WAL-log freezing if necessary */
-            if (RelationNeedsWAL(onerel))
+            if (BufferNeedsWAL(onerel, buf))
             {
                 XLogRecPtr    recptr;
 
@@ -1569,7 +1569,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(onerel))
+    if (BufferNeedsWAL(onerel, buffer))
     {
         XLogRecPtr    recptr;
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4baa287c8c..d2fbc1ad47 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,11 +25,10 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    0x0001
-#define HEAP_INSERT_SKIP_FSM    0x0002
-#define HEAP_INSERT_FROZEN        0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
-#define HEAP_INSERT_NO_LOGICAL    0x0010
+#define HEAP_INSERT_SKIP_FSM    0x0001
+#define HEAP_INSERT_FROZEN        0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
+#define HEAP_INSERT_NO_LOGICAL    0x0008
 
 typedef struct BulkInsertStateData *BulkInsertState;
 
-- 
2.16.3

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

30 января 2019 г., 04:26:34

Rebased.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From d5f2b47b6ba191d0ad1673f9bd9c5851d91a1b59 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/4] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/016_wal_optimize.pl | 192 ++++++++++++++++++++++++++++++++
 1 file changed, 192 insertions(+)
 create mode 100644 src/test/recovery/t/016_wal_optimize.pl

diff --git a/src/test/recovery/t/016_wal_optimize.pl b/src/test/recovery/t/016_wal_optimize.pl
new file mode 100644
index 0000000000..310772a2b3
--- /dev/null
+++ b/src/test/recovery/t/016_wal_optimize.pl
@@ -0,0 +1,192 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 14;
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    # Primary needs to have wal_level = minimal here
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+    $node->start;
+
+    # Test direct truncation optimization.  No tuples
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test1 (id serial PRIMARY KEY);
+        TRUNCATE test1;
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+    is($result, qq(0),
+       "wal_level = $wal_level, optimized truncation with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2 (id serial PRIMARY KEY);
+        INSERT INTO test2 VALUES (DEFAULT);
+        TRUNCATE test2;
+        INSERT INTO test2 VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with inserted table");
+
+    # Data file for COPY query in follow-up tests.
+    my $basedir = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after the
+    # truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test3;
+        COPY test3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+    is($result, qq(3),
+       "wal_level = $wal_level, optimized truncation with copied table");
+
+    # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test4;
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COPY test4 FROM '$copy_file' DELIMITER ',';
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+    is($result, qq(5),
+       "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test5 VALUES (DEFAULT, 1);
+        COPY test5 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER test6_before_row_insert
+          BEFORE INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+        CREATE TRIGGER test6_after_row_insert
+          AFTER INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+        COPY test6 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+    is($result, qq(9),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER test7_before_stat_truncate
+          BEFORE TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+        CREATE TRIGGER test7_after_stat_truncate
+          AFTER TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+        INSERT INTO test7 VALUES (DEFAULT, 1);
+        TRUNCATE test7;
+        COPY test7 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    $node->teardown_node;
+    $node->clean_node;
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.16.3

From 5613d41deca5a5691d18457db6bfd177ee2febe1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 2/4] Write WAL for empty nbtree index build

After relation truncation indexes are also rebuild. It doesn't emit
WAL in minimal mode and if truncation happened within its creation
transaction, crash recovery leaves an empty index heap, which is
considered broken. This patch forces to emit WAL when an index_build
turns into empty nbtree index.
---
 src/backend/access/nbtree/nbtsort.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc398e1186..70d4380533 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -611,8 +611,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
     /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
     RelationOpenSmgr(wstate->index);
 
-    /* XLOG stuff */
-    if (wstate->btws_use_wal)
+    /* XLOG stuff
+     *
+     * Even if minimal mode, WAL is required here if truncation happened after
+     * being created in the same transaction. It is not needed otherwise but
+     * we don't bother identifying the case precisely.
+     */
+    if (wstate->btws_use_wal ||
+        (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
     {
         /* We use the heap NEWPAGE record type for this */
         log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
@@ -1056,6 +1062,11 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
      * set to point to "P_NONE").  This changes the index to the "valid" state
      * by filling in a valid magic number in the metapage.
      */
+    /*
+     * If no tuple was inserted, it's possible that we are truncating a
+     * relation. We need to emit WAL for the metapage in the case. However it
+     * is not required elsewise,
+     */
     metapage = (Page) palloc(BLCKSZ);
     _bt_initmetapage(metapage, rootblkno, rootlevel);
     _bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
-- 
2.16.3

From ec2f481feb39247584e06b92aaee42c21c9dec2c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:00:44 +0900
Subject: [PATCH 3/4] Add infrastructure to WAL-logging skip feature

We used to optimize WAL-logging for truncation of in-transaction
crated tables in minimal mode by just singaling by
HEAP_INSERT_SKIP_WAL option on heap operations. This mechanism can
emit WAL records that results in corrupt state for certain series of
in-transaction operations. This patch provides infrastructure to track
pending at-comit fsyncs for a relation and in-transaction truncations.
heap_register_sync() should be used to start tracking before batch
operations like COPY and CLUSTER, and use BufferNeedsWAL() instead of
RelationNeedsWAL() at the places related to WAL-logging about
heap-modifying operations.
---
 src/backend/access/heap/heapam.c    |  31 ++++
 src/backend/access/transam/xact.c   |   7 +
 src/backend/catalog/storage.c       | 317 +++++++++++++++++++++++++++++++++---
 src/backend/commands/tablecmds.c    |   3 +-
 src/backend/storage/buffer/bufmgr.c |  40 ++++-
 src/backend/utils/cache/relcache.c  |  13 ++
 src/include/access/heapam.h         |   1 +
 src/include/catalog/storage.h       |   5 +-
 src/include/storage/bufmgr.h        |   2 +
 src/include/utils/rel.h             |   8 +
 10 files changed, 395 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4406a69ef2..5972e9d190 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -50,6 +50,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -9080,3 +9081,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
         }
     }
 }
+
+/*
+ *    heap_register_sync    - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RecordPendingSync(rel);
+    if (OidIsValid(rel->rd_rel->reltoastrelid))
+    {
+        Relation    toastrel;
+
+        toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+        RecordPendingSync(toastrel);
+        heap_close(toastrel, AccessShareLock);
+    }
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 0181976964..fa845bfd45 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2020,6 +2020,9 @@ CommitTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2249,6 +2252,9 @@ PrepareTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2568,6 +2574,7 @@ AbortTransaction(void)
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
     AtAbort_Twophase();
+    smgrDoPendingSyncs(false);    /* abandone pending syncs */
 
     /*
      * Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0302507e6f..68947b017f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -62,6 +63,49 @@ typedef struct PendingRelDelete
 
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+    RelFileNode relnode;        /* relation created in same xact */
+    BlockNumber sync_above;        /* WAL-logging skipped for blocks >=
+                                 * sync_above */
+    BlockNumber truncated_to;    /* truncation WAL record was written */
+}    PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static PendingRelSync *getPendingSyncEntry(Relation rel, bool create);
+
 /*
  * RelationCreateStorage
  *        Create physical storage for a relation.
@@ -259,37 +303,117 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
      */
     if (RelationNeedsWAL(rel))
     {
-        /*
-         * Make an XLOG entry reporting the file truncation.
-         */
-        XLogRecPtr    lsn;
-        xl_smgr_truncate xlrec;
+        PendingRelSync *pending_sync;
 
-        xlrec.blkno = nblocks;
-        xlrec.rnode = rel->rd_node;
-        xlrec.flags = SMGR_TRUNCATE_ALL;
+        /* get pending sync entry, create if not yet */
+        pending_sync = getPendingSyncEntry(rel, true);
 
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+        if (pending_sync->sync_above == InvalidBlockNumber ||
+            pending_sync->sync_above < nblocks)
+        {
+            /*
+             * This is the first time truncation of this relation in this
+             * transaction or truncation that leaves pages that need at-commit
+             * fsync.  Make an XLOG entry reporting the file truncation.
+             */
+            XLogRecPtr        lsn;
+            xl_smgr_truncate xlrec;
 
-        lsn = XLogInsert(RM_SMGR_ID,
-                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+            xlrec.blkno = nblocks;
+            xlrec.rnode = rel->rd_node;
+            xlrec.flags = SMGR_TRUNCATE_ALL;
 
-        /*
-         * Flush, because otherwise the truncation of the main relation might
-         * hit the disk before the WAL record, and the truncation of the FSM
-         * or visibility map. If we crashed during that window, we'd be left
-         * with a truncated heap, but the FSM or visibility map would still
-         * contain entries for the non-existent heap pages.
-         */
-        if (fsm || vm)
-            XLogFlush(lsn);
+            XLogBeginInsert();
+            XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+            lsn = XLogInsert(RM_SMGR_ID,
+                             XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+            elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+                 nblocks);
+
+            /*
+             * Flush, because otherwise the truncation of the main relation
+             * might hit the disk before the WAL record, and the truncation of
+             * the FSM or visibility map. If we crashed during that window,
+             * we'd be left with a truncated heap, but the FSM or visibility
+             * map would still contain entries for the non-existent heap
+             * pages.
+             */
+            if (fsm || vm)
+                XLogFlush(lsn);
+
+            rel->pending_sync->truncated_to = nblocks;
+        }
     }
 
     /* Do the real work */
     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
 }
 
+/*
+ * getPendingSyncEntry: get pendig sync entry.
+ *
+ * Returns pending sync entry for the relation. The entry tracks pending
+ * at-commit fsyncs for the relation.  Creates one if needed when create is
+ * true.
+ */  
+static PendingRelSync *
+getPendingSyncEntry(Relation rel, bool create)
+{
+    PendingRelSync *pendsync_entry = NULL;
+    bool            found;
+
+    if (rel->pending_sync)
+        return rel->pending_sync;
+
+    /* we know we don't have pending sync entry */
+    if (!create && rel->no_pending_sync)
+        return NULL;
+
+    if (!pendingSyncs)
+    {
+        /* First time through: initialize the hash table */
+        HASHCTL        ctl;
+
+        if (!create)
+            return NULL;
+
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(RelFileNode);
+        ctl.entrysize = sizeof(PendingRelSync);
+        ctl.hash = tag_hash;
+        pendingSyncs = hash_create("pending relation sync table", 5,
+                                   &ctl, HASH_ELEM | HASH_FUNCTION);
+    }
+
+    elog(DEBUG2, "getPendingSyncEntry: accessing hash for %d",
+         rel->rd_node.relNode);
+    pendsync_entry = (PendingRelSync *)
+        hash_search(pendingSyncs, (void *) &rel->rd_node,
+                    create ? HASH_ENTER: HASH_FIND,    &found);
+
+    if (!pendsync_entry)
+    {
+        rel->no_pending_sync = true;
+        return NULL;
+    }
+
+    /* new entry created */
+    if (!found)
+    {
+        pendsync_entry->truncated_to = InvalidBlockNumber;
+        pendsync_entry->sync_above = InvalidBlockNumber;
+    }
+
+    /* hold shortcut in Relation */
+    rel->no_pending_sync = false;
+    rel->pending_sync = pendsync_entry;
+
+    return pendsync_entry;
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -367,6 +491,24 @@ smgrDoPendingDeletes(bool isCommit)
     }
 }
 
+/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+    bool found;
+
+    rel->pending_sync = NULL;
+    rel->no_pending_sync = true;
+    if (pendingSyncs)
+    {
+        elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+        hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+    }
+}
+
+
 /*
  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
  *
@@ -418,6 +560,139 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     return nrels;
 }
 
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+    BlockNumber nblocks;
+    PendingRelSync *pending_sync;
+
+    Assert(RelationNeedsWAL(rel));
+
+    /* get pending sync entry, create if not yet  */
+    pending_sync = getPendingSyncEntry(rel, true);
+
+    nblocks = RelationGetNumberOfBlocks(rel);
+
+    if (pending_sync->sync_above != InvalidBlockNumber)
+    {
+        elog(DEBUG2,
+             "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             rel->pending_sync->sync_above, nblocks);
+
+        return;
+    }
+
+    elog(DEBUG2,
+         "registering new pending sync for rel %u/%u/%u at block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+         nblocks);
+    pending_sync->sync_above = nblocks;
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+    BlockNumber        blkno = InvalidBlockNumber;
+    PendingRelSync *pending_sync;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    /* fetch exising pending sync entry */
+    pending_sync = getPendingSyncEntry(rel, false);
+
+    /*
+     * no point in doing further work if we know that we don't have pending
+     * sync
+     */
+    if (!pending_sync)
+        return true;
+
+    Assert(BufferIsValid(buf));
+
+    blkno = BufferGetBlockNumber(buf);
+
+    /* we don't skip WAL-logging for pages that already done */
+    if (pending_sync->sync_above == InvalidBlockNumber ||
+        pending_sync->sync_above > blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno, rel->pending_sync->sync_above);
+        return true;
+    }
+
+    /*
+     * We have emitted a truncation record for this block.
+     */
+    if (pending_sync->truncated_to != InvalidBlockNumber &&
+        pending_sync->truncated_to <= blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same
xact",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno);
+        return true;
+    }
+
+    elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+         blkno);
+
+    return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    if (!pendingSyncs)
+        return;
+
+    if (isCommit)
+    {
+        HASH_SEQ_STATUS status;
+        PendingRelSync *pending;
+
+        hash_seq_init(&status, pendingSyncs);
+
+        while ((pending = hash_seq_search(&status)) != NULL)
+        {
+            if (pending->sync_above != InvalidBlockNumber)
+            {
+                FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+                smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+                elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+                     pending->relnode.dbNode, pending->relnode.relNode);
+            }
+        }
+    }
+
+    hash_destroy(pendingSyncs);
+    pendingSyncs = NULL;
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 434be403fe..e15296e373 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11387,11 +11387,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     /*
      * Create and copy all forks of the relation, and schedule unlinking of
-     * old physical files.
+     * old physical files. Pending syncs for the old node is no longer needed.
      *
      * NOTE: any conflict in relfilenode value will be caught in
      * RelationCreateStorage().
      */
+    RelationRemovePendingSync(rel);
     RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
 
     /* copy main fork */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..a9741f138c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
             BufferAccessStrategy strategy,
             bool *foundPtr);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int    rnode_comparator(const void *p1, const void *p2);
@@ -3153,20 +3154,41 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
     /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int            i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3183,7 +3205,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3213,18 +3235,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index af96a03338..66e7d5a301 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -75,6 +75,7 @@
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/array.h"
@@ -414,6 +415,10 @@ AllocateRelationDesc(Form_pg_class relp)
     /* which we mark as a reference-counted tupdesc */
     relation->rd_att->tdrefcount = 1;
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+
     MemoryContextSwitchTo(oldcxt);
 
     return relation;
@@ -1869,6 +1874,10 @@ formrdesc(const char *relationName, Oid relationReltype,
         relation->rd_rel->relhasindex = true;
     }
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+
     /*
      * add new reldesc to relcache
      */
@@ -3263,6 +3272,10 @@ RelationBuildLocalRelation(const char *relname,
     else
         rel->rd_rel->relfilenode = relfilenode;
 
+    /* newly built relation has no pending sync */
+    rel->no_pending_sync = true;
+    rel->pending_sync = NULL;
+
     RelationInitLockInfo(rel);    /* see lmgr.c */
 
     RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ab0879138f..fab5052868 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -163,6 +163,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                    HeapTuple tup);
 
+extern void heap_register_sync(Relation relation);
 extern void heap_sync(Relation relation);
 extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
 
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9f638be924..95d7898e25 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
 extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c5826f691d..8a9ea041dd 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                 ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                    bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                        ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 1d05465303..0f39f209d3 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -185,6 +185,14 @@ typedef struct RelationData
 
     /* use "struct" here to avoid needing to include pgstat.h: */
     struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+    /*
+     * no_pending_sync is true if this relation is known not to have pending
+     * syncs.  Elsewise searching for registered sync is required if
+     * pending_sync is NULL.
+     */
+    bool                   no_pending_sync;
+    struct PendingRelSync *pending_sync;
 } RelationData;
 
 
-- 
2.16.3

From 7b52c9dd2d6bb76f0264bfd0f17d034001351b6f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:18:04 +0900
Subject: [PATCH 4/4] Fix WAL skipping feature.

This patch repalces WAL-skipping means from HEAP_INSERT_SKIP_WAL to
pending-sync tracking infrastructure.
---
 src/backend/access/heap/heapam.c        | 70 ++++++++++++++++++++++-----------
 src/backend/access/heap/pruneheap.c     |  3 +-
 src/backend/access/heap/rewriteheap.c   |  3 --
 src/backend/access/heap/vacuumlazy.c    |  6 +--
 src/backend/access/heap/visibilitymap.c |  3 +-
 src/backend/commands/copy.c             | 13 +++---
 src/backend/commands/createas.c         |  9 ++---
 src/backend/commands/matview.c          |  6 +--
 src/backend/commands/tablecmds.c        |  5 +--
 src/include/access/heapam.h             |  9 ++---
 10 files changed, 72 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5972e9d190..a2d8aefa28 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -28,6 +28,28 @@
  *      the POSTGRES heap access method used for all POSTGRES
  *      relations.
  *
+ * WAL CONSIDERATIONS
+ *      All heap operations are normally WAL-logged. but there are a few
+ *      exceptions. Temporary and unlogged relations never need to be
+ *      WAL-logged, but we can also skip WAL-logging for a table that was
+ *      created in the same transaction, if we don't need WAL for PITR or
+ *      WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *      the file to disk at COMMIT instead.
+ *
+ *      The same-relation optimization is not employed automatically on all
+ *      updates to a table that was created in the same transacton, because
+ *      for a small number of changes, it's cheaper to just create the WAL
+ *      records than fsyncing() the whole relation at COMMIT. It is only
+ *      worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *      or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *      operation; it will cause any subsequent updates to the table to skip
+ *      WAL-logging, if possible, and cause the heap to be synced to disk at
+ *      COMMIT.
+ *
+ *      To make that work, all modifications to heap must use
+ *      BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ *      for the given block.
+ *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -2127,12 +2149,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2239,7 +2255,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2414,7 +2430,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     int            ndone;
     PGAlignedBlock scratch;
     Page        page;
-    bool        needwal;
     Size        saveFreeSpace;
     bool        need_tuple_data = RelationIsLogicallyLogged(relation);
     bool        need_cids = RelationIsAccessibleInLogicalDecoding(relation);
@@ -2422,7 +2437,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -2464,6 +2478,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
         Buffer        vmbuffer = InvalidBuffer;
         bool        all_visible_cleared = false;
         int            nthispage;
+        bool        needwal;
 
         CHECK_FOR_INTERRUPTS();
 
@@ -2475,6 +2490,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
                                            InvalidBuffer, options, bistate,
                                            &vmbuffer, NULL);
         page = BufferGetPage(buffer);
+        needwal = BufferNeedsWAL(relation, buffer);
 
         /* NO EREPORT(ERROR) from here till changes are logged */
         START_CRIT_SECTION();
@@ -3037,7 +3053,7 @@ l1:
      * NB: heap_abort_speculative() uses the same xlog record and replay
      * routines.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -3777,7 +3793,7 @@ l2:
 
         MarkBufferDirty(buffer);
 
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             xl_heap_lock xlrec;
             XLogRecPtr    recptr;
@@ -3992,7 +4008,8 @@ l2:
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer) ||
+        BufferNeedsWAL(relation, newbuf))
     {
         XLogRecPtr    recptr;
 
@@ -4882,7 +4899,7 @@ failed:
      * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
      * entries for everything anyway.)
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, *buffer))
     {
         xl_heap_lock xlrec;
         XLogRecPtr    recptr;
@@ -5626,7 +5643,7 @@ l4:
         MarkBufferDirty(buf);
 
         /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf))
         {
             xl_heap_lock_updated xlrec;
             XLogRecPtr    recptr;
@@ -5786,7 +5803,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
     htup->t_ctid = tuple->t_self;
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_confirm xlrec;
         XLogRecPtr    recptr;
@@ -5919,7 +5936,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
      * The WAL records generated here match heap_delete().  The same recovery
      * routines are used.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -6028,7 +6045,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_inplace xlrec;
         XLogRecPtr    recptr;
@@ -7225,7 +7242,7 @@ log_heap_clean(Relation reln, Buffer buffer,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
 
     xlrec.latestRemovedXid = latestRemovedXid;
     xlrec.nredirected = nredirected;
@@ -7273,7 +7290,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
     /* nor when there are no tuples to freeze */
     Assert(ntuples > 0);
 
@@ -7358,7 +7375,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
     int            bufflags;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
 
     XLogBeginInsert();
 
@@ -8962,9 +8979,16 @@ heap2_redo(XLogReaderState *record)
  *    heap_sync        - sync a heap, for use when no WAL has been written
  *
  * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
  *
  * Indexes are not touched.  (Currently, index operations associated with
  * the commands that use this are WAL-logged and so do not need fsync.
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index a3e51922d8..a05659b168 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
 #include "access/htup_details.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -258,7 +259,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
         /*
          * Emit a WAL HEAP_CLEAN record showing what we did
          */
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f5cf9ffc9c..1e9c07c9b2 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -654,9 +654,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 37aa484ec3..3309c93bce 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -923,7 +923,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
                  * page has been previously WAL-logged, and if not, do that
                  * now.
                  */
-                if (RelationNeedsWAL(onerel) &&
+                if (BufferNeedsWAL(onerel, buf) &&
                     PageGetLSN(page) == InvalidXLogRecPtr)
                     log_newpage_buffer(buf, true);
 
@@ -1187,7 +1187,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
             }
 
             /* Now WAL-log freezing if necessary */
-            if (RelationNeedsWAL(onerel))
+            if (BufferNeedsWAL(onerel, buf))
             {
                 XLogRecPtr    recptr;
 
@@ -1568,7 +1568,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(onerel))
+    if (BufferNeedsWAL(onerel, buffer))
     {
         XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 931ae81fd6..53da0da68f 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
         map[mapByte] |= (flags << mapOffset);
         MarkBufferDirty(vmBuf);
 
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, heapBuf))
         {
             if (XLogRecPtrIsInvalid(recptr))
             {
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index dbb06397e6..b42bfbfd47 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2390,8 +2390,7 @@ CopyFrom(CopyState cstate)
      *    - data is being written to relfilenode created in this transaction
      * then we can skip writing WAL.  It's safe because if the transaction
      * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the heap_sync at the bottom of this
-     * routine first.
+     * If it does commit, commit will do heap_sync().
      *
      * As mentioned in comments in utils/rel.h, the in-same-transaction test
      * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2437,7 +2436,7 @@ CopyFrom(CopyState cstate)
     {
         hi_options |= HEAP_INSERT_SKIP_FSM;
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(cstate->rel);
     }
 
     /*
@@ -3092,11 +3091,11 @@ CopyFrom(CopyState cstate)
     FreeExecutorState(estate);
 
     /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway)
+     * If we skipped writing WAL, then we will sync the heap at the end of
+     * the transaction. (We used to do it here, but it was later found out
+     * that to be safe, we must also avoid WAL-logging any subsequent
+     * actions on the pages we skipped WAL for). Indexes always use WAL.
      */
-    if (hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(cstate->rel);
 
     return processed;
 }
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 2bc8f928ea..5eb45a4a65 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -556,8 +556,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->hi_options = HEAP_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+    if (!XLogIsNeeded())
+        heap_register_sync(intoRelationDesc);
+    myState->hi_options = HEAP_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -600,9 +601,7 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->rel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close rel, but keep lock until commit */
     table_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 5a47be4b33..5f447c6d94 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,7 +463,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
     if (!XLogIsNeeded())
-        myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+        heap_register_sync(transientrel);
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,9 +509,7 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->transientrel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close transientrel, but keep lock until commit */
     table_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e15296e373..65be3c2869 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4616,8 +4616,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         hi_options = HEAP_INSERT_SKIP_FSM;
+
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(newrel);
     }
     else
     {
@@ -4882,8 +4883,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         FreeBulkInsertState(bistate);

         /* If we skipped writing WAL, then we need to sync the heap. */
-        if (hi_options & HEAP_INSERT_SKIP_WAL)
-            heap_sync(newrel);
 
         table_close(newrel, NoLock);
     }
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index fab5052868..32a365021a 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -27,11 +27,10 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    0x0001
-#define HEAP_INSERT_SKIP_FSM    0x0002
-#define HEAP_INSERT_FROZEN        0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
-#define HEAP_INSERT_NO_LOGICAL    0x0010
+#define HEAP_INSERT_SKIP_FSM    0x0001
+#define HEAP_INSERT_FROZEN        0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
+#define HEAP_INSERT_NO_LOGICAL    0x0008
 
 typedef struct BulkInsertStateData *BulkInsertState;
 
-- 
2.16.3

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

04 марта 2019 г., 06:24:48

Rebased.

No commit hit this but I fixed one space error.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From d048aedbee48a1a0d91ae6e009b7a7903f272720 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/4] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/016_wal_optimize.pl | 192 ++++++++++++++++++++++++++++++++
 1 file changed, 192 insertions(+)
 create mode 100644 src/test/recovery/t/016_wal_optimize.pl

diff --git a/src/test/recovery/t/016_wal_optimize.pl b/src/test/recovery/t/016_wal_optimize.pl
new file mode 100644
index 0000000000..310772a2b3
--- /dev/null
+++ b/src/test/recovery/t/016_wal_optimize.pl
@@ -0,0 +1,192 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 14;
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    # Primary needs to have wal_level = minimal here
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+    $node->start;
+
+    # Test direct truncation optimization.  No tuples
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test1 (id serial PRIMARY KEY);
+        TRUNCATE test1;
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+    is($result, qq(0),
+       "wal_level = $wal_level, optimized truncation with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2 (id serial PRIMARY KEY);
+        INSERT INTO test2 VALUES (DEFAULT);
+        TRUNCATE test2;
+        INSERT INTO test2 VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with inserted table");
+
+    # Data file for COPY query in follow-up tests.
+    my $basedir = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after the
+    # truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test3;
+        COPY test3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+    is($result, qq(3),
+       "wal_level = $wal_level, optimized truncation with copied table");
+
+    # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test4;
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COPY test4 FROM '$copy_file' DELIMITER ',';
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+    is($result, qq(5),
+       "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test5 VALUES (DEFAULT, 1);
+        COPY test5 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER test6_before_row_insert
+          BEFORE INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+        CREATE TRIGGER test6_after_row_insert
+          AFTER INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+        COPY test6 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+    is($result, qq(9),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER test7_before_stat_truncate
+          BEFORE TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+        CREATE TRIGGER test7_after_stat_truncate
+          AFTER TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+        INSERT INTO test7 VALUES (DEFAULT, 1);
+        TRUNCATE test7;
+        COPY test7 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    $node->teardown_node;
+    $node->clean_node;
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.16.3

From 5a435c9c82155204484f31601a12821cf1e5e96e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 2/4] Write WAL for empty nbtree index build

After relation truncation indexes are also rebuild. It doesn't emit
WAL in minimal mode and if truncation happened within its creation
transaction, crash recovery leaves an empty index heap, which is
considered broken. This patch forces to emit WAL when an index_build
turns into empty nbtree index.
---
 src/backend/access/nbtree/nbtsort.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc398e1186..70d4380533 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -611,8 +611,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
     /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
     RelationOpenSmgr(wstate->index);
 
-    /* XLOG stuff */
-    if (wstate->btws_use_wal)
+    /* XLOG stuff
+     *
+     * Even if minimal mode, WAL is required here if truncation happened after
+     * being created in the same transaction. It is not needed otherwise but
+     * we don't bother identifying the case precisely.
+     */
+    if (wstate->btws_use_wal ||
+        (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
     {
         /* We use the heap NEWPAGE record type for this */
         log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
@@ -1056,6 +1062,11 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
      * set to point to "P_NONE").  This changes the index to the "valid" state
      * by filling in a valid magic number in the metapage.
      */
+    /*
+     * If no tuple was inserted, it's possible that we are truncating a
+     * relation. We need to emit WAL for the metapage in the case. However it
+     * is not required elsewise,
+     */
     metapage = (Page) palloc(BLCKSZ);
     _bt_initmetapage(metapage, rootblkno, rootlevel);
     _bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
-- 
2.16.3

From 1123bd8ce20ff177673f614722d3fe092a2bcbeb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:00:44 +0900
Subject: [PATCH 3/4] Add infrastructure to WAL-logging skip feature

We used to optimize WAL-logging for truncation of in-transaction
crated tables in minimal mode by just singaling by
HEAP_INSERT_SKIP_WAL option on heap operations. This mechanism can
emit WAL records that results in corrupt state for certain series of
in-transaction operations. This patch provides infrastructure to track
pending at-comit fsyncs for a relation and in-transaction truncations.
heap_register_sync() should be used to start tracking before batch
operations like COPY and CLUSTER, and use BufferNeedsWAL() instead of
RelationNeedsWAL() at the places related to WAL-logging about
heap-modifying operations.
---
 src/backend/access/heap/heapam.c    |  31 ++++
 src/backend/access/transam/xact.c   |   7 +
 src/backend/catalog/storage.c       | 317 +++++++++++++++++++++++++++++++++---
 src/backend/commands/tablecmds.c    |   3 +-
 src/backend/storage/buffer/bufmgr.c |  40 ++++-
 src/backend/utils/cache/relcache.c  |  13 ++
 src/include/access/heapam.h         |   1 +
 src/include/catalog/storage.h       |   5 +-
 src/include/storage/bufmgr.h        |   2 +
 src/include/utils/rel.h             |   8 +
 10 files changed, 395 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dc3499349b..5ea5ff5848 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -50,6 +50,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -9079,3 +9080,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
         }
     }
 }
+
+/*
+ *    heap_register_sync    - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RecordPendingSync(rel);
+    if (OidIsValid(rel->rd_rel->reltoastrelid))
+    {
+        Relation    toastrel;
+
+        toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+        RecordPendingSync(toastrel);
+        heap_close(toastrel, AccessShareLock);
+    }
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e93262975d..6d62d6e34f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2021,6 +2021,9 @@ CommitTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2250,6 +2253,9 @@ PrepareTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2575,6 +2581,7 @@ AbortTransaction(void)
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
     AtAbort_Twophase();
+    smgrDoPendingSyncs(false);    /* abandone pending syncs */
 
     /*
      * Advertise the fact that we aborted in pg_xact (assuming that we got as
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0302507e6f..26dc3ddb1b 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -62,6 +63,49 @@ typedef struct PendingRelDelete
 
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'sync_above' is set to the current size of the relation. Any operations
+ * on blocks < sync_above need to be WAL-logged as usual, but for operations
+ * on higher blocks, WAL-logging is skipped.
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct PendingRelSync
+{
+    RelFileNode relnode;        /* relation created in same xact */
+    BlockNumber sync_above;        /* WAL-logging skipped for blocks >=
+                                 * sync_above */
+    BlockNumber truncated_to;    /* truncation WAL record was written */
+}    PendingRelSync;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *pendingSyncs = NULL;
+
+static PendingRelSync *getPendingSyncEntry(Relation rel, bool create);
+
 /*
  * RelationCreateStorage
  *        Create physical storage for a relation.
@@ -259,37 +303,117 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
      */
     if (RelationNeedsWAL(rel))
     {
-        /*
-         * Make an XLOG entry reporting the file truncation.
-         */
-        XLogRecPtr    lsn;
-        xl_smgr_truncate xlrec;
+        PendingRelSync *pending_sync;
 
-        xlrec.blkno = nblocks;
-        xlrec.rnode = rel->rd_node;
-        xlrec.flags = SMGR_TRUNCATE_ALL;
+        /* get pending sync entry, create if not yet */
+        pending_sync = getPendingSyncEntry(rel, true);
 
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+        if (pending_sync->sync_above == InvalidBlockNumber ||
+            pending_sync->sync_above < nblocks)
+        {
+            /*
+             * This is the first time truncation of this relation in this
+             * transaction or truncation that leaves pages that need at-commit
+             * fsync.  Make an XLOG entry reporting the file truncation.
+             */
+            XLogRecPtr        lsn;
+            xl_smgr_truncate xlrec;
 
-        lsn = XLogInsert(RM_SMGR_ID,
-                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+            xlrec.blkno = nblocks;
+            xlrec.rnode = rel->rd_node;
+            xlrec.flags = SMGR_TRUNCATE_ALL;
 
-        /*
-         * Flush, because otherwise the truncation of the main relation might
-         * hit the disk before the WAL record, and the truncation of the FSM
-         * or visibility map. If we crashed during that window, we'd be left
-         * with a truncated heap, but the FSM or visibility map would still
-         * contain entries for the non-existent heap pages.
-         */
-        if (fsm || vm)
-            XLogFlush(lsn);
+            XLogBeginInsert();
+            XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+            lsn = XLogInsert(RM_SMGR_ID,
+                             XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+            elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+                 nblocks);
+
+            /*
+             * Flush, because otherwise the truncation of the main relation
+             * might hit the disk before the WAL record, and the truncation of
+             * the FSM or visibility map. If we crashed during that window,
+             * we'd be left with a truncated heap, but the FSM or visibility
+             * map would still contain entries for the non-existent heap
+             * pages.
+             */
+            if (fsm || vm)
+                XLogFlush(lsn);
+
+            rel->pending_sync->truncated_to = nblocks;
+        }
     }
 
     /* Do the real work */
     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
 }
 
+/*
+ * getPendingSyncEntry: get pendig sync entry.
+ *
+ * Returns pending sync entry for the relation. The entry tracks pending
+ * at-commit fsyncs for the relation.  Creates one if needed when create is
+ * true.
+ */
+static PendingRelSync *
+getPendingSyncEntry(Relation rel, bool create)
+{
+    PendingRelSync *pendsync_entry = NULL;
+    bool            found;
+
+    if (rel->pending_sync)
+        return rel->pending_sync;
+
+    /* we know we don't have pending sync entry */
+    if (!create && rel->no_pending_sync)
+        return NULL;
+
+    if (!pendingSyncs)
+    {
+        /* First time through: initialize the hash table */
+        HASHCTL        ctl;
+
+        if (!create)
+            return NULL;
+
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(RelFileNode);
+        ctl.entrysize = sizeof(PendingRelSync);
+        ctl.hash = tag_hash;
+        pendingSyncs = hash_create("pending relation sync table", 5,
+                                   &ctl, HASH_ELEM | HASH_FUNCTION);
+    }
+
+    elog(DEBUG2, "getPendingSyncEntry: accessing hash for %d",
+         rel->rd_node.relNode);
+    pendsync_entry = (PendingRelSync *)
+        hash_search(pendingSyncs, (void *) &rel->rd_node,
+                    create ? HASH_ENTER: HASH_FIND,    &found);
+
+    if (!pendsync_entry)
+    {
+        rel->no_pending_sync = true;
+        return NULL;
+    }
+
+    /* new entry created */
+    if (!found)
+    {
+        pendsync_entry->truncated_to = InvalidBlockNumber;
+        pendsync_entry->sync_above = InvalidBlockNumber;
+    }
+
+    /* hold shortcut in Relation */
+    rel->no_pending_sync = false;
+    rel->pending_sync = pendsync_entry;
+
+    return pendsync_entry;
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -367,6 +491,24 @@ smgrDoPendingDeletes(bool isCommit)
     }
 }
 
+/*
+ * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+ */
+void
+RelationRemovePendingSync(Relation rel)
+{
+    bool found;
+
+    rel->pending_sync = NULL;
+    rel->no_pending_sync = true;
+    if (pendingSyncs)
+    {
+        elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+        hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+    }
+}
+
+
 /*
  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
  *
@@ -418,6 +560,139 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     return nrels;
 }
 
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because we
+ * are going to skip WAL-logging subsequent actions to it.
+ */
+void
+RecordPendingSync(Relation rel)
+{
+    BlockNumber nblocks;
+    PendingRelSync *pending_sync;
+
+    Assert(RelationNeedsWAL(rel));
+
+    /* get pending sync entry, create if not yet  */
+    pending_sync = getPendingSyncEntry(rel, true);
+
+    nblocks = RelationGetNumberOfBlocks(rel);
+
+    if (pending_sync->sync_above != InvalidBlockNumber)
+    {
+        elog(DEBUG2,
+             "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             rel->pending_sync->sync_above, nblocks);
+
+        return;
+    }
+
+    elog(DEBUG2,
+         "registering new pending sync for rel %u/%u/%u at block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+         nblocks);
+    pending_sync->sync_above = nblocks;
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+    BlockNumber        blkno = InvalidBlockNumber;
+    PendingRelSync *pending_sync;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    /* fetch exising pending sync entry */
+    pending_sync = getPendingSyncEntry(rel, false);
+
+    /*
+     * no point in doing further work if we know that we don't have pending
+     * sync
+     */
+    if (!pending_sync)
+        return true;
+
+    Assert(BufferIsValid(buf));
+
+    blkno = BufferGetBlockNumber(buf);
+
+    /* we don't skip WAL-logging for pages that already done */
+    if (pending_sync->sync_above == InvalidBlockNumber ||
+        pending_sync->sync_above > blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno, rel->pending_sync->sync_above);
+        return true;
+    }
+
+    /*
+     * We have emitted a truncation record for this block.
+     */
+    if (pending_sync->truncated_to != InvalidBlockNumber &&
+        pending_sync->truncated_to <= blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same
xact",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno);
+        return true;
+    }
+
+    elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+         blkno);
+
+    return false;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    if (!pendingSyncs)
+        return;
+
+    if (isCommit)
+    {
+        HASH_SEQ_STATUS status;
+        PendingRelSync *pending;
+
+        hash_seq_init(&status, pendingSyncs);
+
+        while ((pending = hash_seq_search(&status)) != NULL)
+        {
+            if (pending->sync_above != InvalidBlockNumber)
+            {
+                FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+                smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+                elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+                     pending->relnode.dbNode, pending->relnode.relNode);
+            }
+        }
+    }
+
+    hash_destroy(pendingSyncs);
+    pendingSyncs = NULL;
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index a93b13c2fe..6190b3f605 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11412,11 +11412,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     /*
      * Create and copy all forks of the relation, and schedule unlinking of
-     * old physical files.
+     * old physical files. Pending syncs for the old node is no longer needed.
      *
      * NOTE: any conflict in relfilenode value will be caught in
      * RelationCreateStorage().
      */
+    RelationRemovePendingSync(rel);
     RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
 
     /* copy main fork */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..a9741f138c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
             BufferAccessStrategy strategy,
             bool *foundPtr);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int    rnode_comparator(const void *p1, const void *p2);
@@ -3153,20 +3154,41 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
     /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int            i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3183,7 +3205,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3213,18 +3235,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 54a40ef00b..b5baa430db 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -75,6 +75,7 @@
 #include "partitioning/partdesc.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/array.h"
@@ -412,6 +413,10 @@ AllocateRelationDesc(Form_pg_class relp)
     /* which we mark as a reference-counted tupdesc */
     relation->rd_att->tdrefcount = 1;
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+
     MemoryContextSwitchTo(oldcxt);
 
     return relation;
@@ -1813,6 +1818,10 @@ formrdesc(const char *relationName, Oid relationReltype,
         relation->rd_rel->relhasindex = true;
     }
 
+    /* We don't know if pending sync for this relation exists so far */
+    relation->pending_sync = NULL;
+    relation->no_pending_sync = false;
+
     /*
      * add new reldesc to relcache
      */
@@ -3207,6 +3216,10 @@ RelationBuildLocalRelation(const char *relname,
     else
         rel->rd_rel->relfilenode = relfilenode;
 
+    /* newly built relation has no pending sync */
+    rel->no_pending_sync = true;
+    rel->pending_sync = NULL;
+
     RelationInitLockInfo(rel);    /* see lmgr.c */
 
     RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ab0879138f..fab5052868 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -163,6 +163,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                    HeapTuple tup);
 
+extern void heap_register_sync(Relation relation);
 extern void heap_sync(Relation relation);
 extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
 
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9f638be924..95d7898e25 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,13 +22,16 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-
+extern void RelationRemovePendingSync(Relation rel);
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void RecordPendingSync(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
 extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c5826f691d..8a9ea041dd 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                 ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                    bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                        ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 1d05465303..0f39f209d3 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -185,6 +185,14 @@ typedef struct RelationData
 
     /* use "struct" here to avoid needing to include pgstat.h: */
     struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+    /*
+     * no_pending_sync is true if this relation is known not to have pending
+     * syncs.  Elsewise searching for registered sync is required if
+     * pending_sync is NULL.
+     */
+    bool                   no_pending_sync;
+    struct PendingRelSync *pending_sync;
 } RelationData;
 
 
-- 
2.16.3

From 256a04a64ffad9f280577e14683113d33a6633e5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:18:04 +0900
Subject: [PATCH 4/4] Fix WAL skipping feature.

This patch repalces WAL-skipping means from HEAP_INSERT_SKIP_WAL to
pending-sync tracking infrastructure.
---
 src/backend/access/heap/heapam.c        | 70 ++++++++++++++++++++++-----------
 src/backend/access/heap/pruneheap.c     |  3 +-
 src/backend/access/heap/rewriteheap.c   |  3 --
 src/backend/access/heap/vacuumlazy.c    |  6 +--
 src/backend/access/heap/visibilitymap.c |  3 +-
 src/backend/commands/copy.c             | 13 +++---
 src/backend/commands/createas.c         |  9 ++---
 src/backend/commands/matview.c          |  6 +--
 src/backend/commands/tablecmds.c        |  5 +--
 src/include/access/heapam.h             |  9 ++---
 10 files changed, 72 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5ea5ff5848..c66a468335 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -28,6 +28,28 @@
  *      the POSTGRES heap access method used for all POSTGRES
  *      relations.
  *
+ * WAL CONSIDERATIONS
+ *      All heap operations are normally WAL-logged. but there are a few
+ *      exceptions. Temporary and unlogged relations never need to be
+ *      WAL-logged, but we can also skip WAL-logging for a table that was
+ *      created in the same transaction, if we don't need WAL for PITR or
+ *      WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *      the file to disk at COMMIT instead.
+ *
+ *      The same-relation optimization is not employed automatically on all
+ *      updates to a table that was created in the same transacton, because
+ *      for a small number of changes, it's cheaper to just create the WAL
+ *      records than fsyncing() the whole relation at COMMIT. It is only
+ *      worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *      or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *      operation; it will cause any subsequent updates to the table to skip
+ *      WAL-logging, if possible, and cause the heap to be synced to disk at
+ *      COMMIT.
+ *
+ *      To make that work, all modifications to heap must use
+ *      BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ *      for the given block.
+ *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -2127,12 +2149,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2239,7 +2255,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2414,7 +2430,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     int            ndone;
     PGAlignedBlock scratch;
     Page        page;
-    bool        needwal;
     Size        saveFreeSpace;
     bool        need_tuple_data = RelationIsLogicallyLogged(relation);
     bool        need_cids = RelationIsAccessibleInLogicalDecoding(relation);
@@ -2422,7 +2437,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -2464,6 +2478,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
         Buffer        vmbuffer = InvalidBuffer;
         bool        all_visible_cleared = false;
         int            nthispage;
+        bool        needwal;
 
         CHECK_FOR_INTERRUPTS();
 
@@ -2475,6 +2490,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
                                            InvalidBuffer, options, bistate,
                                            &vmbuffer, NULL);
         page = BufferGetPage(buffer);
+        needwal = BufferNeedsWAL(relation, buffer);
 
         /* NO EREPORT(ERROR) from here till changes are logged */
         START_CRIT_SECTION();
@@ -3037,7 +3053,7 @@ l1:
      * NB: heap_abort_speculative() uses the same xlog record and replay
      * routines.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         xl_heap_header xlhdr;
@@ -3776,7 +3792,7 @@ l2:
 
         MarkBufferDirty(buffer);
 
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             xl_heap_lock xlrec;
             XLogRecPtr    recptr;
@@ -3991,7 +4007,8 @@ l2:
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer) ||
+        BufferNeedsWAL(relation, newbuf))
     {
         XLogRecPtr    recptr;
 
@@ -4881,7 +4898,7 @@ failed:
      * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
      * entries for everything anyway.)
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, *buffer))
     {
         xl_heap_lock xlrec;
         XLogRecPtr    recptr;
@@ -5625,7 +5642,7 @@ l4:
         MarkBufferDirty(buf);
 
         /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf))
         {
             xl_heap_lock_updated xlrec;
             XLogRecPtr    recptr;
@@ -5785,7 +5802,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
     htup->t_ctid = tuple->t_self;
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_confirm xlrec;
         XLogRecPtr    recptr;
@@ -5918,7 +5935,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
      * The WAL records generated here match heap_delete().  The same recovery
      * routines are used.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -6027,7 +6044,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_inplace xlrec;
         XLogRecPtr    recptr;
@@ -7224,7 +7241,7 @@ log_heap_clean(Relation reln, Buffer buffer,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
 
     xlrec.latestRemovedXid = latestRemovedXid;
     xlrec.nredirected = nredirected;
@@ -7272,7 +7289,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
     /* nor when there are no tuples to freeze */
     Assert(ntuples > 0);
 
@@ -7357,7 +7374,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
     int            bufflags;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
 
     XLogBeginInsert();
 
@@ -8961,9 +8978,16 @@ heap2_redo(XLogReaderState *record)
  *    heap_sync        - sync a heap, for use when no WAL has been written
  *
  * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
  *
  * Indexes are not touched.  (Currently, index operations associated with
  * the commands that use this are WAL-logged and so do not need fsync.
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index a3e51922d8..a05659b168 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
 #include "access/htup_details.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -258,7 +259,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
         /*
          * Emit a WAL HEAP_CLEAN record showing what we did
          */
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f5cf9ffc9c..1e9c07c9b2 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -654,9 +654,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9416c31889..1f66685c88 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -929,7 +929,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
                  * page has been previously WAL-logged, and if not, do that
                  * now.
                  */
-                if (RelationNeedsWAL(onerel) &&
+                if (BufferNeedsWAL(onerel, buf) &&
                     PageGetLSN(page) == InvalidXLogRecPtr)
                     log_newpage_buffer(buf, true);
 
@@ -1193,7 +1193,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
             }
 
             /* Now WAL-log freezing if necessary */
-            if (RelationNeedsWAL(onerel))
+            if (BufferNeedsWAL(onerel, buf))
             {
                 XLogRecPtr    recptr;
 
@@ -1575,7 +1575,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(onerel))
+    if (BufferNeedsWAL(onerel, buffer))
     {
         XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06b26..1f5f7d92dd 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
@@ -276,7 +277,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
         map[mapByte] |= (flags << mapOffset);
         MarkBufferDirty(vmBuf);
 
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, heapBuf))
         {
             if (XLogRecPtrIsInvalid(recptr))
             {
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 5dd6fe02c6..db7a94ff6e 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2390,8 +2390,7 @@ CopyFrom(CopyState cstate)
      *    - data is being written to relfilenode created in this transaction
      * then we can skip writing WAL.  It's safe because if the transaction
      * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the heap_sync at the bottom of this
-     * routine first.
+     * If it does commit, commit will do heap_sync().
      *
      * As mentioned in comments in utils/rel.h, the in-same-transaction test
      * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2437,7 +2436,7 @@ CopyFrom(CopyState cstate)
     {
         hi_options |= HEAP_INSERT_SKIP_FSM;
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(cstate->rel);
     }
 
     /*
@@ -3087,11 +3086,11 @@ CopyFrom(CopyState cstate)
     FreeExecutorState(estate);
 
     /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway)
+     * If we skipped writing WAL, then we will sync the heap at the end of
+     * the transaction. (We used to do it here, but it was later found out
+     * that to be safe, we must also avoid WAL-logging any subsequent
+     * actions on the pages we skipped WAL for). Indexes always use WAL.
      */
-    if (hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(cstate->rel);
 
     return processed;
 }
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 6517ecb738..17fb78ba78 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -556,8 +556,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->hi_options = HEAP_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+    if (!XLogIsNeeded())
+        heap_register_sync(intoRelationDesc);
+    myState->hi_options = HEAP_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -603,9 +604,7 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->rel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close rel, but keep lock until commit */
     table_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 5a47be4b33..5f447c6d94 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,7 +463,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
     if (!XLogIsNeeded())
-        myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+        heap_register_sync(transientrel);
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,9 +509,7 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->transientrel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close transientrel, but keep lock until commit */
     table_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6190b3f605..94d7876b8c 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4617,8 +4617,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         hi_options = HEAP_INSERT_SKIP_FSM;
+
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(newrel);
     }
     else
     {
@@ -4886,8 +4887,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         FreeBulkInsertState(bistate);
 
         /* If we skipped writing WAL, then we need to sync the heap. */
-        if (hi_options & HEAP_INSERT_SKIP_WAL)
-            heap_sync(newrel);
 
         table_close(newrel, NoLock);
     }
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index fab5052868..32a365021a 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -27,11 +27,10 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    0x0001
-#define HEAP_INSERT_SKIP_FSM    0x0002
-#define HEAP_INSERT_FROZEN        0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
-#define HEAP_INSERT_NO_LOGICAL    0x0010
+#define HEAP_INSERT_SKIP_FSM    0x0001
+#define HEAP_INSERT_FROZEN        0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
+#define HEAP_INSERT_NO_LOGICAL    0x0008
 
 typedef struct BulkInsertStateData *BulkInsertState;
 
-- 
2.16.3

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

11 марта 2019 г., 05:27:08

This has been waiting for a review since October, so I reviewed it.  The code
comment at PendingRelSync summarizes the design well, and I like that design.
I also liked the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi
last paragraph, and I suspect it would have been no harder to back-patch.  I
wonder if it would have been simpler and better, but I'm not asking anyone to
investigate that.  Let's keep pursuing your current design.

This moves a shared_buffers scan and smgrimmedsync() from commands like COPY
to COMMIT.  Users setting a timeout on COMMIT may need to adjust, and
log_min_duration_statement analysis will reflect the change.  I feel that's
fine.  (There already exist ways for COMMIT to be slow.)

On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> --- a/src/backend/access/nbtree/nbtsort.c
> +++ b/src/backend/access/nbtree/nbtsort.c
> @@ -611,8 +611,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
>      /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
>      RelationOpenSmgr(wstate->index);
>  
> -    /* XLOG stuff */
> -    if (wstate->btws_use_wal)
> +    /* XLOG stuff
> +     *
> +     * Even if minimal mode, WAL is required here if truncation happened after
> +     * being created in the same transaction. It is not needed otherwise but
> +     * we don't bother identifying the case precisely.
> +     */
> +    if (wstate->btws_use_wal ||
> +        (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))

We initialized "btws_use_wal" like this:

    #define XLogIsNeeded() (wal_level >= WAL_LEVEL_REPLICA)
    #define RelationNeedsWAL(relation) \
        ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
    wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);

Hence, this change causes us to emit WAL for the metapage of a
RELPERSISTENCE_UNLOGGED or RELPERSISTENCE_TEMP relation.  We should never do
that.  If we do that for RELPERSISTENCE_TEMP, redo will write to a permanent
relfilenode.  I've attached a test case for this; it is a patch that applies
on top of your v7 patches.  The test checks for orphaned files after redo.

> +     * If no tuple was inserted, it's possible that we are truncating a
> +     * relation. We need to emit WAL for the metapage in the case. However it
> +     * is not required elsewise,

Did you mean to write more words after that comma?

> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c

> + * NB: after WAL-logging has been skipped for a block, we must not WAL-log
> + * any subsequent actions on the same block either. Replaying the WAL record
> + * of the subsequent action might fail otherwise, as the "before" state of
> + * the block might not match, as the earlier actions were not WAL-logged.

Good point.  To participate in WAL redo properly, each "before" state must
have a distinct pd_lsn.  In CREATE INDEX USING btree, the initial index build
skips WAL, but an INSERT later in the same transaction writes WAL.  There,
however, each "before" state does have a distinct pd_lsn; the initial build
has pd_lsn==0, and each subsequent state has a pd_lsn driven by WAL position.
Hence, I think the CREATE INDEX USING btree behavior is fine, even though it
doesn't conform to this code comment.

I think this restriction applies only to full_page_writes=off.  Otherwise, the
first WAL-logged change will find pd_lsn==0 and emit a full-page image.  With
a full-page image in the record, the block's "before" state doesn't matter.
Also, one could make it safe to write WAL for a particular block by issuing
heap_sync() for the block's relation.

> +/*
> + * RelationRemovePendingSync() -- remove pendingSync entry for a relation
> + */
> +void
> +RelationRemovePendingSync(Relation rel)

What is the coding rule for deciding when to call this?  Currently, only
ATExecSetTableSpace() calls this.  CLUSTER doesn't call it, despite behaving
much like ALTER TABLE SET TABLESPACE behaves.

> +{
> +    bool found;
> +
> +    rel->pending_sync = NULL;
> +    rel->no_pending_sync = true;
> +    if (pendingSyncs)
> +    {
> +        elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
> +        hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
> +    }
> +}

We'd need a mechanism to un-remove the sync at subtransaction abort.  My
attachment includes a test case demonstrating the consequences of that defect.
Please look for other areas that need to know about subtransactions; patch v7
had no code pertaining to subtransactions.

> +        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",

As you mention upthread, you have many debugging elog()s.  These are too
detailed to include in every binary, but I do want them in the code.  See
CACHE_elog() for a good example of achieving that.

> +/*
> + * Sync to disk any relations that we skipped WAL-logging for earlier.
> + */
> +void
> +smgrDoPendingSyncs(bool isCommit)
> +{
> +    if (!pendingSyncs)
> +        return;
> +
> +    if (isCommit)
> +    {
> +        HASH_SEQ_STATUS status;
> +        PendingRelSync *pending;
> +
> +        hash_seq_init(&status, pendingSyncs);
> +
> +        while ((pending = hash_seq_search(&status)) != NULL)
> +        {
> +            if (pending->sync_above != InvalidBlockNumber)

I'm mildly unhappy that pendingSyncs entries with "pending->sync_above ==
InvalidBlockNumber" are not sync requests at all.  Those just record the fact
of a RelationTruncate() happening.  If you can think of a way to improve that,
please do so.  If not, it's okay.

> --- a/src/backend/utils/cache/relcache.c
> +++ b/src/backend/utils/cache/relcache.c

> @@ -412,6 +413,10 @@ AllocateRelationDesc(Form_pg_class relp)
>      /* which we mark as a reference-counted tupdesc */
>      relation->rd_att->tdrefcount = 1;
>  
> +    /* We don't know if pending sync for this relation exists so far */
> +    relation->pending_sync = NULL;
> +    relation->no_pending_sync = false;

RelationData fields other than "pgstat_info" have "rd_" prefixes; add that
prefix to these fields.

This is a nonstandard place to clear fields.  Clear them in
load_relcache_init_file() only, like we do for rd_statvalid.  (Other paths
will then rely on palloc0() for implicit initialization.)

> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c

> @@ -3991,7 +4007,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
>      MarkBufferDirty(buffer);
>  
>      /* XLOG stuff */
> -    if (RelationNeedsWAL(relation))
> +    if (BufferNeedsWAL(relation, buffer) ||
> +        BufferNeedsWAL(relation, newbuf))

This is fine if both buffers need WAL or neither buffer needs WAL.  It is not
fine when one buffer needs WAL and the other buffer does not.  My attachment
includes a test case.  Of the bugs I'm reporting, this one seems most
difficult to solve well.

> @@ -8961,9 +8978,16 @@ heap2_redo(XLogReaderState *record)
>   *    heap_sync        - sync a heap, for use when no WAL has been written
>   *
>   * This forces the heap contents (including TOAST heap if any) down to disk.
> - * If we skipped using WAL, and WAL is otherwise needed, we must force the
> - * relation down to disk before it's safe to commit the transaction.  This
> - * requires writing out any dirty buffers and then doing a forced fsync.
> + * If we did any changes to the heap bypassing the buffer manager, we must
> + * force the relation down to disk before it's safe to commit the
> + * transaction, because the direct modifications will not be flushed by
> + * the next checkpoint.
> + *
> + * We used to also use this after batch operations like COPY and CLUSTER,
> + * if we skipped using WAL and WAL is otherwise needed, but there were
> + * corner-cases involving other WAL-logged operations to the same
> + * relation, where that was not enough. heap_register_sync() should be
> + * used for that purpose instead.

We still use heap_sync() in CLUSTER.  Can we migrate CLUSTER to the newer
heap_register_sync()?  Patch v7 makes some commands use the new way (COPY,
CREATE TABLE AS, REFRESH MATERIALIZED VIEW, ALTER TABLE) and leaves other
commands using the old way (CREATE INDEX USING btree, ALTER TABLE SET
TABLESPACE, CLUSTER).  It would make the system simpler to understand if we
eliminated the old way.  If that creates more problems than it solves, please
at least write down a coding rule to explain why certain commands shouldn't
use the old way.

Thanks,
nm

Вложения

wal-optimize-noah-tests-v1.patch

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

20 марта 2019 г., 11:17:54

Thank you for reviewing!

At Sun, 10 Mar 2019 19:27:08 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190311022708.GA2189728@rfd.leadboat.com>
> This has been waiting for a review since October, so I reviewed it.  The code
> comment at PendingRelSync summarizes the design well, and I like that design.

It is Michael's work.

> I also liked the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi
> last paragraph, and I suspect it would have been no harder to back-patch.  I
> wonder if it would have been simpler and better, but I'm not asking anyone to
> investigate that.  Let's keep pursuing your current design.

I must admit that this is complex..

> This moves a shared_buffers scan and smgrimmedsync() from commands like COPY
> to COMMIT.  Users setting a timeout on COMMIT may need to adjust, and
> log_min_duration_statement analysis will reflect the change.  I feel that's
> fine.  (There already exist ways for COMMIT to be slow.)
> 
> On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> > --- a/src/backend/access/nbtree/nbtsort.c
> > +++ b/src/backend/access/nbtree/nbtsort.c
> > @@ -611,8 +611,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> >      /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
> >      RelationOpenSmgr(wstate->index);
> >  
> > -    /* XLOG stuff */
> > -    if (wstate->btws_use_wal)
> > +    /* XLOG stuff
> > +     *
> > +     * Even if minimal mode, WAL is required here if truncation happened after
> > +     * being created in the same transaction. It is not needed otherwise but
> > +     * we don't bother identifying the case precisely.
> > +     */
> > +    if (wstate->btws_use_wal ||
> > +        (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
> 
> We initialized "btws_use_wal" like this:
> 
>     #define XLogIsNeeded() (wal_level >= WAL_LEVEL_REPLICA)
>     #define RelationNeedsWAL(relation) \
>         ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
>     wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
> 
> Hence, this change causes us to emit WAL for the metapage of a
> RELPERSISTENCE_UNLOGGED or RELPERSISTENCE_TEMP relation.  We should never do
> that.  If we do that for RELPERSISTENCE_TEMP, redo will write to a permanent
> relfilenode.  I've attached a test case for this; it is a patch that applies
> on top of your v7 patches.  The test checks for orphaned files after redo.

Oops!  Added RelationNeedsWAL(index) there. (Attched 1st patch on
top of this patchset)

> > +     * If no tuple was inserted, it's possible that we are truncating a
> > +     * relation. We need to emit WAL for the metapage in the case. However it
> > +     * is not required elsewise,
> 
> Did you mean to write more words after that comma?

Sorry, it is just a garbage. Required work is done in
_bt_blwritepage.

> > --- a/src/backend/catalog/storage.c
> > +++ b/src/backend/catalog/storage.c
> 
> > + * NB: after WAL-logging has been skipped for a block, we must not WAL-log
> > + * any subsequent actions on the same block either. Replaying the WAL record
> > + * of the subsequent action might fail otherwise, as the "before" state of
> > + * the block might not match, as the earlier actions were not WAL-logged.
> 
> Good point.  To participate in WAL redo properly, each "before" state must
> have a distinct pd_lsn.  In CREATE INDEX USING btree, the initial index build
> skips WAL, but an INSERT later in the same transaction writes WAL.  There,
> however, each "before" state does have a distinct pd_lsn; the initial build
> has pd_lsn==0, and each subsequent state has a pd_lsn driven by WAL position.
> Hence, I think the CREATE INDEX USING btree behavior is fine, even though it
> doesn't conform to this code comment.

(The NB is Michael's work.)
Yes. Btree works differently from heap. Thak you for confirmation.

> I think this restriction applies only to full_page_writes=off.  Otherwise, the
> first WAL-logged change will find pd_lsn==0 and emit a full-page image.  With
> a full-page image in the record, the block's "before" state doesn't matter.
> Also, one could make it safe to write WAL for a particular block by issuing
> heap_sync() for the block's relation.

Umm.. Once truncate happens, WAL is emitted for all pages. If we
decide to skip WALs on copy or similar bulk operations, WALs are
not emitted at all, including XLOG_HEAP_INIT_PAGE. So that
doesn't happen. The unlogged data is synced at commit time.

> > +/*
> > + * RelationRemovePendingSync() -- remove pendingSync entry for a relation
> > + */
> > +void
> > +RelationRemovePendingSync(Relation rel)
> 
> What is the coding rule for deciding when to call this?  Currently, only
> ATExecSetTableSpace() calls this.  CLUSTER doesn't call it, despite behaving
> much like ALTER TABLE SET TABLESPACE behaves.
> > +{
> > +    bool found;
> > +
> > +    rel->pending_sync = NULL;
> > +    rel->no_pending_sync = true;
> > +    if (pendingSyncs)
> > +    {
> > +        elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
> > +        hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
> > +    }
> > +}
> 
> We'd need a mechanism to un-remove the sync at subtransaction abort.  My
> attachment includes a test case demonstrating the consequences of that defect.
> Please look for other areas that need to know about subtransactions; patch v7
> had no code pertaining to subtransactions.

Agreed It forgets about subtransaction rollbacks. I'll make
RelationRemovePendingSync just mark as "removed" and make
ROLLBACK TO and RELEASE process the flag make it work. (Attached
2nd patch on top of thie patchset)

> 
> > +        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
> 
> As you mention upthread, you have many debugging elog()s.  These are too
> detailed to include in every binary, but I do want them in the code.  See
> CACHE_elog() for a good example of achieving that.

Agreed will do. They were need to check the behavior precisely
but usually not needed.

> > +/*
> > + * Sync to disk any relations that we skipped WAL-logging for earlier.
> > + */
> > +void
> > +smgrDoPendingSyncs(bool isCommit)
> > +{
> > +    if (!pendingSyncs)
> > +        return;
> > +
> > +    if (isCommit)
> > +    {
> > +        HASH_SEQ_STATUS status;
> > +        PendingRelSync *pending;
> > +
> > +        hash_seq_init(&status, pendingSyncs);
> > +
> > +        while ((pending = hash_seq_search(&status)) != NULL)
> > +        {
> > +            if (pending->sync_above != InvalidBlockNumber)
> 
> I'm mildly unhappy that pendingSyncs entries with "pending->sync_above ==
> InvalidBlockNumber" are not sync requests at all.  Those just record the fact
> of a RelationTruncate() happening.  If you can think of a way to improve that,
> please do so.  If not, it's okay.

After a truncation, required WAL records are emitted for the
truncated pages, so no need to sync. Does this make sense for
you? (Maybe commit is needed there)

> > --- a/src/backend/utils/cache/relcache.c
> > +++ b/src/backend/utils/cache/relcache.c
> 
> > @@ -412,6 +413,10 @@ AllocateRelationDesc(Form_pg_class relp)
> >      /* which we mark as a reference-counted tupdesc */
> >      relation->rd_att->tdrefcount = 1;
> >  
> > +    /* We don't know if pending sync for this relation exists so far */
> > +    relation->pending_sync = NULL;
> > +    relation->no_pending_sync = false;
> 
> RelationData fields other than "pgstat_info" have "rd_" prefixes; add that
> prefix to these fields.
> This is a nonstandard place to clear fields.  Clear them in
> load_relcache_init_file() only, like we do for rd_statvalid.  (Other paths
> will then rely on palloc0() for implicit initialization.)

Agreed, will do in the next version.

> > --- a/src/backend/access/heap/heapam.c
> > +++ b/src/backend/access/heap/heapam.c
> 
> > @@ -3991,7 +4007,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
> >      MarkBufferDirty(buffer);
> >  
> >      /* XLOG stuff */
> > -    if (RelationNeedsWAL(relation))
> > +    if (BufferNeedsWAL(relation, buffer) ||
> > +        BufferNeedsWAL(relation, newbuf))
> 
> This is fine if both buffers need WAL or neither buffer needs WAL.  It is not
> fine when one buffer needs WAL and the other buffer does not.  My attachment
> includes a test case.  Of the bugs I'm reporting, this one seems most
> difficult to solve well.

Yeah, it is right (and it's rather silly). Thank you for
pointing out. Will fix.

> > @@ -8961,9 +8978,16 @@ heap2_redo(XLogReaderState *record)
> >   *    heap_sync        - sync a heap, for use when no WAL has been written
> >   *
> >   * This forces the heap contents (including TOAST heap if any) down to disk.
> > - * If we skipped using WAL, and WAL is otherwise needed, we must force the
> > - * relation down to disk before it's safe to commit the transaction.  This
> > - * requires writing out any dirty buffers and then doing a forced fsync.
> > + * If we did any changes to the heap bypassing the buffer manager, we must
> > + * force the relation down to disk before it's safe to commit the
> > + * transaction, because the direct modifications will not be flushed by
> > + * the next checkpoint.
> > + *
> > + * We used to also use this after batch operations like COPY and CLUSTER,
> > + * if we skipped using WAL and WAL is otherwise needed, but there were
> > + * corner-cases involving other WAL-logged operations to the same
> > + * relation, where that was not enough. heap_register_sync() should be
> > + * used for that purpose instead.
> 
> We still use heap_sync() in CLUSTER.  Can we migrate CLUSTER to the newer
> heap_register_sync()?  Patch v7 makes some commands use the new way (COPY,
> CREATE TABLE AS, REFRESH MATERIALIZED VIEW, ALTER TABLE) and leaves other
> commands using the old way (CREATE INDEX USING btree, ALTER TABLE SET
> TABLESPACE, CLUSTER).  It would make the system simpler to understand if we
> eliminated the old way.  If that creates more problems than it solves, please
> at least write down a coding rule to explain why certain commands shouldn't
> use the old way.

Perhaps doable for TABLESPACE and CLUSTER. I'm not sure about
CREATE INDEX. I'll consider them.

I don't have enough time for now so the new version will be
posted early next week.

Thanks you for the review!

regards.

diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index fb4a80bf1d..060e0171a5 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -627,7 +627,8 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
      * we don't bother identifying the case precisely.
      */
     if (wstate->btws_use_wal ||
-        (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
+        (RelationNeedsWAL(wstate->index) &&
+         (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0)))
     {
         /* We use the heap NEWPAGE record type for this */
         log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
@@ -1071,11 +1072,6 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
      * set to point to "P_NONE").  This changes the index to the "valid" state
      * by filling in a valid magic number in the metapage.
      */
-    /*
-     * If no tuple was inserted, it's possible that we are truncating a
-     * relation. We need to emit WAL for the metapage in the case. However it
-     * is not required elsewise,
-     */
     metapage = (Page) palloc(BLCKSZ);
     _bt_initmetapage(metapage, rootblkno, rootlevel);
     _bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d1210de8f4..3ce69b7a40 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -4037,6 +4037,8 @@ ReleaseSavepoint(const char *name)
                 (errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
                  errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));
 
+    smgrProcessPendingSyncRemoval(s->subTransactionId, true);
+
     /*
      * Mark "commit pending" all subtransactions up to the target
      * subtransaction.  The actual commits will happen when control gets to
@@ -4146,6 +4148,8 @@ RollbackToSavepoint(const char *name)
                 (errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
                  errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));
 
+    smgrProcessPendingSyncRemoval(s->subTransactionId, false);
+
     /*
      * Mark "abort pending" all subtransactions up to the target
      * subtransaction.  The actual aborts will happen when control gets to
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 26dc3ddb1b..ad4a1e5127 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -99,6 +99,7 @@ typedef struct PendingRelSync
     BlockNumber sync_above;        /* WAL-logging skipped for blocks >=
                                  * sync_above */
     BlockNumber truncated_to;    /* truncation WAL record was written */
+    SubTransactionId removed_xid; /* subxid where this is removed */
 }    PendingRelSync;
 
 /* Relations that need to be fsync'd at commit */
@@ -405,6 +406,7 @@ getPendingSyncEntry(Relation rel, bool create)
     {
         pendsync_entry->truncated_to = InvalidBlockNumber;
         pendsync_entry->sync_above = InvalidBlockNumber;
+        pendsync_entry->removed_xid = InvalidSubTransactionId;
     }
 
     /* hold shortcut in Relation */
@@ -498,14 +500,17 @@ void
 RelationRemovePendingSync(Relation rel)
 {
     bool found;
+    PendingRelSync *pending_sync;
 
-    rel->pending_sync = NULL;
-    rel->no_pending_sync = true;
-    if (pendingSyncs)
-    {
-        elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
-        hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
-    }
+    if (rel->no_pending_sync)
+        return;
+
+    pending_sync = getPendingSyncEntry(rel, false);
+    
+    if (pending_sync)
+        return;
+
+    rel->pending_sync->removed_xid = GetCurrentSubTransactionId();
 }
 
 
@@ -693,6 +698,31 @@ smgrDoPendingSyncs(bool isCommit)
     pendingSyncs = NULL;
 }
 
+void
+smgrProcessPendingSyncRemoval(SubTransactionId sxid, bool isCommit)
+{
+    HASH_SEQ_STATUS status;
+    PendingRelSync *pending;
+
+    if (!pendingSyncs)
+        return;
+
+    hash_seq_init(&status, pendingSyncs);
+
+    while ((pending = hash_seq_search(&status)) != NULL)
+    {
+        if (pending->removed_xid == sxid)
+        {
+            pending->removed_xid = InvalidSubTransactionId;
+            if (isCommit)
+            {
+                pending->sync_above = InvalidBlockNumber;
+                pending->truncated_to = InvalidBlockNumber;
+            }
+        }        
+    }    
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

21 марта 2019 г., 08:48:35

On Wed, Mar 20, 2019 at 05:17:54PM +0900, Kyotaro HORIGUCHI wrote:
> At Sun, 10 Mar 2019 19:27:08 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190311022708.GA2189728@rfd.leadboat.com>
> > On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> > > +/*
> > > + * Sync to disk any relations that we skipped WAL-logging for earlier.
> > > + */
> > > +void
> > > +smgrDoPendingSyncs(bool isCommit)
> > > +{
> > > +    if (!pendingSyncs)
> > > +        return;
> > > +
> > > +    if (isCommit)
> > > +    {
> > > +        HASH_SEQ_STATUS status;
> > > +        PendingRelSync *pending;
> > > +
> > > +        hash_seq_init(&status, pendingSyncs);
> > > +
> > > +        while ((pending = hash_seq_search(&status)) != NULL)
> > > +        {
> > > +            if (pending->sync_above != InvalidBlockNumber)
> > 
> > I'm mildly unhappy that pendingSyncs entries with "pending->sync_above ==
> > InvalidBlockNumber" are not sync requests at all.  Those just record the fact
> > of a RelationTruncate() happening.  If you can think of a way to improve that,
> > please do so.  If not, it's okay.
> 
> After a truncation, required WAL records are emitted for the
> truncated pages, so no need to sync. Does this make sense for
> you? (Maybe commit is needed there)

Yes, the behavior makes sense.  I wasn't saying the quoted code had the wrong
behavior.  I was saying that the data structure called "pendingSyncs" is
actually "pending syncs and past truncates".  It's not ideal that the variable
name differs from the variable purpose in this way.  However, it's okay if you
don't find a way to improve that.

> I don't have enough time for now so the new version will be
> posted early next week.

I'll wait for that version.

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

25 марта 2019 г., 15:32:04

Hello. This is a revised version.

At Wed, 20 Mar 2019 22:48:35 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190321054835.GB3842129@rfd.leadboat.com>
> On Wed, Mar 20, 2019 at 05:17:54PM +0900, Kyotaro HORIGUCHI wrote:
> > At Sun, 10 Mar 2019 19:27:08 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190311022708.GA2189728@rfd.leadboat.com>
> > > On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> > > I'm mildly unhappy that pendingSyncs entries with "pending->sync_above ==
> > > InvalidBlockNumber" are not sync requests at all.  Those just record the fact
> > > of a RelationTruncate() happening.  If you can think of a way to improve that,
> > > please do so.  If not, it's okay.
> > 
> > After a truncation, required WAL records are emitted for the
> > truncated pages, so no need to sync. Does this make sense for
> > you? (Maybe commit is needed there)
> 
> Yes, the behavior makes sense.  I wasn't saying the quoted code had the wrong
> behavior.  I was saying that the data structure called "pendingSyncs" is
> actually "pending syncs and past truncates".  It's not ideal that the variable
> name differs from the variable purpose in this way.  However, it's okay if you
> don't find a way to improve that.

It is convincing. The current member names "sync_above" and
"truncated_to" are wordings based on the operations that have
happened on the relation. I changed the names to words based on
what to do on the relation. Renamed to skip_wal_min_blk and
wal_log_min_blk.

> > I don't have enough time for now so the new version will be
> > posted early next week.
> 
> I'll wait for that version.

At Wed, 20 Mar 2019 17:17:54 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190320.171754.171896368.horiguchi.kyotaro@lab.ntt.co.jp>
> > Hence, this change causes us to emit WAL for the metapage of a
> > RELPERSISTENCE_UNLOGGED or RELPERSISTENCE_TEMP relation.  We should never do
> > that.  If we do that for RELPERSISTENCE_TEMP, redo will write to a permanent
> > relfilenode.  I've attached a test case for this; it is a patch that applies
> > on top of your v7 patches.  The test checks for orphaned files after redo.
> 
> Oops!  Added RelationNeedsWAL(index) there. (Attched 1st patch on
> top of this patchset)

Done in the attached patch. But the orphan file check in the TAP
diff was wrong. It detects orphaned pg_class entry for temprary
tables, which dissapears after the first autovacuum. The revised
tap test (check_orphan_relfilenodes) doesn't faultly fail and
catches the bug in the previous patch.

> > > +     * If no tuple was inserted, it's possible that we are truncating a
> > > +     * relation. We need to emit WAL for the metapage in the case. However it
> > > +     * is not required elsewise,
> > 
> > Did you mean to write more words after that comma?
> 
> Sorry, it is just a garbage. Required work is done in
> _bt_blwritepage.

Removed.

> > We'd need a mechanism to un-remove the sync at subtransaction abort.  My
> > attachment includes a test case demonstrating the consequences of that defect.
> > Please look for other areas that need to know about subtransactions; patch v7
> > had no code pertaining to subtransactions.

Added. Passed the new tests.

> > > +        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
> > 
> > As you mention upthread, you have many debugging elog()s.  These are too
> > detailed to include in every binary, but I do want them in the code.  See
> > CACHE_elog() for a good example of achieving that.
> 
> Agreed will do. They were need to check the behavior precisely
> but usually not needed.

I removed all such elog()s.

> > RelationData fields other than "pgstat_info" have "rd_" prefixes; add that
> > prefix to these fields.
> > This is a nonstandard place to clear fields.  Clear them in
> > load_relcache_init_file() only, like we do for rd_statvalid.  (Other paths
> > will then rely on palloc0() for implicit initialization.)

Both are done.

> > > -    if (RelationNeedsWAL(relation))
> > > +    if (BufferNeedsWAL(relation, buffer) ||
> > > +        BufferNeedsWAL(relation, newbuf))
> > 
> > This is fine if both buffers need WAL or neither buffer needs WAL.  It is not
> > fine when one buffer needs WAL and the other buffer does not.  My attachment
> > includes a test case.  Of the bugs I'm reporting, this one seems most
> > difficult to solve well.

I refactored heap_insert/delete so that the XLOG stuff can be
used from heap_update. Then modify heap_update so that it emits
XLOG_INSERT and XLOG_DELETE in addition to XLOG_UPDATE.

> > We still use heap_sync() in CLUSTER.  Can we migrate CLUSTER to the newer
> > heap_register_sync()?  Patch v7 makes some commands use the new way (COPY,
> > CREATE TABLE AS, REFRESH MATERIALIZED VIEW, ALTER TABLE) and leaves other
> > commands using the old way (CREATE INDEX USING btree, ALTER TABLE SET
> > TABLESPACE, CLUSTER).  It would make the system simpler to understand if we
> > eliminated the old way.  If that creates more problems than it solves, please
> > at least write down a coding rule to explain why certain commands shouldn't
> > use the old way.
> 
> Perhaps doable for TABLESPACE and CLUSTER. I'm not sure about
> CREATE INDEX. I'll consider them.

I added the CLUSTER case in the new patchset.  For the SET
TABLESPACE case, it works on SMGR layer and manipulates fork
files explicitly but this stuff is Relation based and doesn't
distinguish forks. We can modify this stuff to work on smgr and
make it fork-aware but I don't think it is worth doing.

CREATE INDEX is not changed in this version. I continue to
consider it.

The attached is the new patchset.

v8-0001-TAP-test-for-copy-truncation-optimization.patch
  - Revised version of test.

v8-0002-Write-WAL-for-empty-nbtree-index-build.patch
  - Fixed version of v7

v8-0003-Move-XLOG-stuff-from-heap_insert-and-heap_delete.patch
  - New file, moves xlog stuff of heap_insert and heap_delete out
    of the functions so that heap_update can use them.

v8-0004-Add-infrastructure-to-WAL-logging-skip-feature.patch
  - Renamed variables, functions. Removed elogs.

v8-0005-Fix-WAL-skipping-feature.patch
  - Fixed heap_update.

v8-0006-Change-cluster-to-use-the-new-pending-sync-infrastru.patch
  - New file, modifies CLUSTER to use this feature.

v8-0007-Add-a-comment-to-ATExecSetTableSpace.patch
  - New file, adds a comment that excuses for not using this stuff.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From 885e9ac73434aa8d5fe80393dc64746c36148acd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/7] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/017_wal_optimize.pl | 254 ++++++++++++++++++++++++++++++++
 1 file changed, 254 insertions(+)
 create mode 100644 src/test/recovery/t/017_wal_optimize.pl

diff --git a/src/test/recovery/t/017_wal_optimize.pl b/src/test/recovery/t/017_wal_optimize.pl
new file mode 100644
index 0000000000..5d67548b54
--- /dev/null
+++ b/src/test/recovery/t/017_wal_optimize.pl
@@ -0,0 +1,254 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+sub check_orphan_relfilenodes
+{
+    my($node, $test_name) = @_;
+
+    my $db_oid = $node->safe_psql('postgres',
+       "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+    my $prefix = "base/$db_oid/";
+    my $filepaths_referenced = $node->safe_psql('postgres', "
+       SELECT pg_relation_filepath(oid) FROM pg_class
+       WHERE reltablespace = 0 and relpersistence <> 't' and
+       pg_relation_filepath(oid) IS NOT NULL;");
+    is_deeply([sort(map { "$prefix$_" }
+                    grep(/^[0-9]+$/,
+                         slurp_dir($node->data_dir . "/$prefix")))],
+              [sort split /\n/, $filepaths_referenced],
+                $test_name);
+    return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    # Primary needs to have wal_level = minimal here
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+    $node->start;
+
+    # Setup
+    my $tablespace_dir = $node->basedir . '/tablespace_other';
+    mkdir ($tablespace_dir);
+    $tablespace_dir = TestLib::real_dir($tablespace_dir);
+    $node->safe_psql('postgres',
+       "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+    # Test direct truncation optimization.  No tuples
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test1 (id serial PRIMARY KEY);
+        TRUNCATE test1;
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+    is($result, qq(0),
+       "wal_level = $wal_level, optimized truncation with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2 (id serial PRIMARY KEY);
+        INSERT INTO test2 VALUES (DEFAULT);
+        TRUNCATE test2;
+        INSERT INTO test2 VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with inserted table");
+
+    # Data file for COPY query in follow-up tests.
+    my $basedir = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after the
+    # truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test3;
+        COPY test3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+    is($result, qq(3),
+       "wal_level = $wal_level, optimized truncation with copied table");
+
+    # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test3a;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+        COPY test3a FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        COPY test3b FROM '$copy_file' DELIMITER ',';  -- set sync_above
+        UPDATE test3b SET id2 = id2 + 1;
+        DELETE FROM test3b;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+    is($result, qq(0),
+       "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+    # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test4;
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COPY test4 FROM '$copy_file' DELIMITER ',';
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+    is($result, qq(5),
+       "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test5 VALUES (DEFAULT, 1);
+        COPY test5 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER test6_before_row_insert
+          BEFORE INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+        CREATE TRIGGER test6_after_row_insert
+          AFTER INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+        COPY test6 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+    is($result, qq(9),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER test7_before_stat_truncate
+          BEFORE TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+        CREATE TRIGGER test7_after_stat_truncate
+          AFTER TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+        INSERT INTO test7 VALUES (DEFAULT, 1);
+        TRUNCATE test7;
+        COPY test7 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test redo of temp table creation.
+    $node->safe_psql('postgres', "
+        CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+    $node->stop('immediate');
+    $node->start;
+
+    check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.16.3

From a28a1e9a87d4cc2135fbaf079a16e7487de8d357 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 2/7] Write WAL for empty nbtree index build

After relation truncation indexes are also rebuild. It doesn't emit
WAL in minimal mode and if truncation happened within its creation
transaction, crash recovery leaves an empty index heap, which is
considered broken. This patch forces to emit WAL when an index_build
turns into empty nbtree index.
---
 src/backend/access/nbtree/nbtsort.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2762a2d548..70fe3bec32 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -622,8 +622,15 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
     /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
     RelationOpenSmgr(wstate->index);
 
-    /* XLOG stuff */
-    if (wstate->btws_use_wal)
+    /* XLOG stuff
+     *
+     * Even if minimal mode, WAL is required here if truncation happened after
+     * being created in the same transaction. It is not needed otherwise but
+     * we don't bother identifying the case precisely.
+     */
+    if (wstate->btws_use_wal ||
+        (RelationNeedsWAL(wstate->index) &&
+         (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0)))
     {
         /* We use the heap NEWPAGE record type for this */
         log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
-- 
2.16.3

From a070ada24a7f448a449435e38a57209725a8c914 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 25 Mar 2019 13:29:50 +0900
Subject: [PATCH 3/7] Move XLOG stuff from heap_insert and heap_delete

Succeeding commit makes heap_update emit insert and delete WAL
records. Move out XLOG stuff for insert and delete so that heap_update
can use the stuff.
---
 src/backend/access/heap/heapam.c | 277 ++++++++++++++++++++++-----------------
 1 file changed, 157 insertions(+), 120 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 65536c7214..fe5d939c45 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -71,6 +71,11 @@
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
                     TransactionId xid, CommandId cid, int options);
+static XLogRecPtr log_heap_insert(Relation relation, Buffer buffer,
+                HeapTuple heaptup, int options, bool all_visible_cleared);
+static XLogRecPtr log_heap_delete(Relation relation, Buffer buffer,
+                HeapTuple tp, HeapTuple old_key_tuple, TransactionId new_xmax,
+                bool changingPart, bool all_visible_cleared);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
                 Buffer newbuf, HeapTuple oldtup,
                 HeapTuple newtup, HeapTuple old_key_tup,
@@ -1889,6 +1894,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     TransactionId xid = GetCurrentTransactionId();
     HeapTuple    heaptup;
     Buffer        buffer;
+    Page        page;
     Buffer        vmbuffer = InvalidBuffer;
     bool        all_visible_cleared = false;
 
@@ -1925,16 +1931,18 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
      */
     CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
 
+    page = BufferGetPage(buffer);
+
     /* NO EREPORT(ERROR) from here till changes are logged */
     START_CRIT_SECTION();
 
     RelationPutHeapTuple(relation, buffer, heaptup,
                          (options & HEAP_INSERT_SPECULATIVE) != 0);
 
-    if (PageIsAllVisible(BufferGetPage(buffer)))
+    if (PageIsAllVisible(page))
     {
         all_visible_cleared = true;
-        PageClearAllVisible(BufferGetPage(buffer));
+        PageClearAllVisible(page);
         visibilitymap_clear(relation,
                             ItemPointerGetBlockNumber(&(heaptup->t_self)),
                             vmbuffer, VISIBILITYMAP_VALID_BITS);
@@ -1956,76 +1964,11 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     /* XLOG stuff */
     if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
     {
-        xl_heap_insert xlrec;
-        xl_heap_header xlhdr;
         XLogRecPtr    recptr;
-        Page        page = BufferGetPage(buffer);
-        uint8        info = XLOG_HEAP_INSERT;
-        int            bufflags = 0;
-
-        /*
-         * If this is a catalog, we need to transmit combocids to properly
-         * decode, so log that as well.
-         */
-        if (RelationIsAccessibleInLogicalDecoding(relation))
-            log_heap_new_cid(relation, heaptup);
-
-        /*
-         * If this is the single and first tuple on page, we can reinit the
-         * page instead of restoring the whole thing.  Set flag, and hide
-         * buffer references from XLogInsert.
-         */
-        if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
-            PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
-        {
-            info |= XLOG_HEAP_INIT_PAGE;
-            bufflags |= REGBUF_WILL_INIT;
-        }
-
-        xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
-        xlrec.flags = 0;
-        if (all_visible_cleared)
-            xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
-        if (options & HEAP_INSERT_SPECULATIVE)
-            xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
-        Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
-
-        /*
-         * For logical decoding, we need the tuple even if we're doing a full
-         * page write, so make sure it's included even if we take a full-page
-         * image. (XXX We could alternatively store a pointer into the FPW).
-         */
-        if (RelationIsLogicallyLogged(relation) &&
-            !(options & HEAP_INSERT_NO_LOGICAL))
-        {
-            xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
-            bufflags |= REGBUF_KEEP_DATA;
-        }
-
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
-
-        xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
-        xlhdr.t_infomask = heaptup->t_data->t_infomask;
-        xlhdr.t_hoff = heaptup->t_data->t_hoff;
-
-        /*
-         * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
-         * write the whole page to the xlog, we don't need to store
-         * xl_heap_header in the xlog.
-         */
-        XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
-        XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
-        /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
-        XLogRegisterBufData(0,
-                            (char *) heaptup->t_data + SizeofHeapTupleHeader,
-                            heaptup->t_len - SizeofHeapTupleHeader);
-
-        /* filtering by origin on a row level is much more efficient */
-        XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
-
-        recptr = XLogInsert(RM_HEAP_ID, info);
 
+        recptr = log_heap_insert(relation, buffer, heaptup,
+                                 options, all_visible_cleared);
+            
         PageSetLSN(page, recptr);
     }
 
@@ -2744,58 +2687,10 @@ l1:
      */
     if (RelationNeedsWAL(relation))
     {
-        xl_heap_delete xlrec;
-        xl_heap_header xlhdr;
         XLogRecPtr    recptr;
 
-        /* For logical decode we need combocids to properly decode the catalog */
-        if (RelationIsAccessibleInLogicalDecoding(relation))
-            log_heap_new_cid(relation, &tp);
-
-        xlrec.flags = 0;
-        if (all_visible_cleared)
-            xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
-        if (changingPart)
-            xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
-        xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
-                                              tp.t_data->t_infomask2);
-        xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
-        xlrec.xmax = new_xmax;
-
-        if (old_key_tuple != NULL)
-        {
-            if (relation->rd_rel->relreplident == REPLICA_IDENTITY_FULL)
-                xlrec.flags |= XLH_DELETE_CONTAINS_OLD_TUPLE;
-            else
-                xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
-        }
-
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
-
-        XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-
-        /*
-         * Log replica identity of the deleted tuple if there is one
-         */
-        if (old_key_tuple != NULL)
-        {
-            xlhdr.t_infomask2 = old_key_tuple->t_data->t_infomask2;
-            xlhdr.t_infomask = old_key_tuple->t_data->t_infomask;
-            xlhdr.t_hoff = old_key_tuple->t_data->t_hoff;
-
-            XLogRegisterData((char *) &xlhdr, SizeOfHeapHeader);
-            XLogRegisterData((char *) old_key_tuple->t_data
-                             + SizeofHeapTupleHeader,
-                             old_key_tuple->t_len
-                             - SizeofHeapTupleHeader);
-        }
-
-        /* filtering by origin on a row level is much more efficient */
-        XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
-
-        recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
-
+        recptr = log_heap_delete(relation, buffer, &tp, old_key_tuple, new_xmax,
+                                 changingPart, all_visible_cleared);
         PageSetLSN(page, recptr);
     }
 
@@ -7045,6 +6940,148 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
     return recptr;
 }
 
+/*
+ * Perform XLogInsert for a heap-insert operation.  Caller must already
+ * have modified the buffer and marked it dirty.
+ */
+XLogRecPtr
+log_heap_insert(Relation relation, Buffer buffer,
+                HeapTuple heaptup, int options, bool all_visible_cleared)
+{
+    xl_heap_insert xlrec;
+    xl_heap_header xlhdr;
+    uint8        info = XLOG_HEAP_INSERT;
+    int            bufflags = 0;
+    Page        page = BufferGetPage(buffer);
+
+    /*
+     * If this is a catalog, we need to transmit combocids to properly
+     * decode, so log that as well.
+     */
+    if (RelationIsAccessibleInLogicalDecoding(relation))
+        log_heap_new_cid(relation, heaptup);
+
+    /*
+     * If this is the single and first tuple on page, we can reinit the
+     * page instead of restoring the whole thing.  Set flag, and hide
+     * buffer references from XLogInsert.
+     */
+    if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
+        PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
+    {
+        info |= XLOG_HEAP_INIT_PAGE;
+        bufflags |= REGBUF_WILL_INIT;
+    }
+
+    xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
+    xlrec.flags = 0;
+    if (all_visible_cleared)
+        xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
+    if (options & HEAP_INSERT_SPECULATIVE)
+        xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
+    Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
+
+    /*
+     * For logical decoding, we need the tuple even if we're doing a full
+     * page write, so make sure it's included even if we take a full-page
+     * image. (XXX We could alternatively store a pointer into the FPW).
+     */
+    if (RelationIsLogicallyLogged(relation) &&
+        !(options & HEAP_INSERT_NO_LOGICAL))
+    {
+        xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
+        bufflags |= REGBUF_KEEP_DATA;
+    }
+
+    XLogBeginInsert();
+    XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
+
+    xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
+    xlhdr.t_infomask = heaptup->t_data->t_infomask;
+    xlhdr.t_hoff = heaptup->t_data->t_hoff;
+
+    /*
+     * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
+     * write the whole page to the xlog, we don't need to store
+     * xl_heap_header in the xlog.
+     */
+    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
+    XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
+    /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
+    XLogRegisterBufData(0,
+                        (char *) heaptup->t_data + SizeofHeapTupleHeader,
+                        heaptup->t_len - SizeofHeapTupleHeader);
+
+    /* filtering by origin on a row level is much more efficient */
+    XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
+    return XLogInsert(RM_HEAP_ID, info);
+}
+
+/*
+ * Perform XLogInsert for a heap-insert operation.  Caller must already
+ * have modified the buffer and marked it dirty.
+ *
+ * NB: heap_abort_speculative() uses the same xlog record and replay
+ * routines.
+ */
+static XLogRecPtr
+log_heap_delete(Relation relation, Buffer buffer,
+                HeapTuple tp, HeapTuple old_key_tuple, TransactionId new_xmax,
+                bool changingPart, bool all_visible_cleared)
+{
+    xl_heap_delete xlrec;
+    xl_heap_header xlhdr;
+
+    /* For logical decode we need combocids to properly decode the catalog */
+    if (RelationIsAccessibleInLogicalDecoding(relation))
+        log_heap_new_cid(relation, tp);
+
+    xlrec.flags = 0;
+    if (all_visible_cleared)
+        xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
+    if (changingPart)
+        xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
+    xlrec.infobits_set = compute_infobits(tp->t_data->t_infomask,
+                                          tp->t_data->t_infomask2);
+    xlrec.offnum = ItemPointerGetOffsetNumber(&tp->t_self);
+    xlrec.xmax = new_xmax;
+
+    if (old_key_tuple != NULL)
+    {
+        if (relation->rd_rel->relreplident == REPLICA_IDENTITY_FULL)
+            xlrec.flags |= XLH_DELETE_CONTAINS_OLD_TUPLE;
+        else
+            xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
+    }
+
+    XLogBeginInsert();
+    XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
+
+    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+    /*
+     * Log replica identity of the deleted tuple if there is one
+     */
+    if (old_key_tuple != NULL)
+    {
+        xlhdr.t_infomask2 = old_key_tuple->t_data->t_infomask2;
+        xlhdr.t_infomask = old_key_tuple->t_data->t_infomask;
+        xlhdr.t_hoff = old_key_tuple->t_data->t_hoff;
+
+        XLogRegisterData((char *) &xlhdr, SizeOfHeapHeader);
+        XLogRegisterData((char *) old_key_tuple->t_data
+                         + SizeofHeapTupleHeader,
+                         old_key_tuple->t_len
+                         - SizeofHeapTupleHeader);
+    }
+
+    /* filtering by origin on a row level is much more efficient */
+    XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
+    return XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
+}
+
 /*
  * Perform XLogInsert for a heap-update operation.  Caller must already
  * have modified the buffer(s) and marked them dirty.
-- 
2.16.3

From 2778e0aa67ccaf58f03da59e9c31706907c2b7e6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:00:44 +0900
Subject: [PATCH 4/7] Add infrastructure to WAL-logging skip feature

We used to optimize WAL-logging for truncation of in-transaction
crated tables in minimal mode by just signaling by
HEAP_INSERT_SKIP_WAL option on heap operations. This mechanism can
emit WAL records that results in corrupt state for certain series of
in-transaction operations. This patch provides infrastructure to track
pending at-commit fsyncs for a relation and in-transaction truncations.
heap_register_sync() should be used to start tracking before batch
operations like COPY and CLUSTER, and use BufferNeedsWAL() instead of
RelationNeedsWAL() at the places related to WAL-logging about
heap-modifying operations.
---
 src/backend/access/heap/heapam.c    |  31 ++++
 src/backend/access/transam/xact.c   |  11 ++
 src/backend/catalog/storage.c       | 344 +++++++++++++++++++++++++++++++++---
 src/backend/commands/tablecmds.c    |   3 +-
 src/backend/storage/buffer/bufmgr.c |  40 ++++-
 src/backend/utils/cache/relcache.c  |   3 +
 src/include/access/heapam.h         |   1 +
 src/include/catalog/storage.h       |   5 +
 src/include/storage/bufmgr.h        |   2 +
 src/include/utils/rel.h             |   8 +
 10 files changed, 417 insertions(+), 31 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fe5d939c45..024620ddc1 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -8829,3 +8830,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
         }
     }
 }
+
+/*
+ *    heap_register_sync    - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RecordWALSkipping(rel);
+    if (OidIsValid(rel->rd_rel->reltoastrelid))
+    {
+        Relation    toastrel;
+
+        toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+        RecordWALSkipping(toastrel);
+        heap_close(toastrel, AccessShareLock);
+    }
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c3214d4f4d..32a6a877f3 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2022,6 +2022,9 @@ CommitTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2254,6 +2257,9 @@ PrepareTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2579,6 +2585,7 @@ AbortTransaction(void)
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
     AtAbort_Twophase();
+    smgrDoPendingSyncs(false);    /* abandone pending syncs */
 
     /*
      * Advertise the fact that we aborted in pg_xact (assuming that we got as
@@ -4097,6 +4104,8 @@ ReleaseSavepoint(const char *name)
                 (errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
                  errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));
 
+    smgrProcessWALRequirementInval(s->subTransactionId, true);
+
     /*
      * Mark "commit pending" all subtransactions up to the target
      * subtransaction.  The actual commits will happen when control gets to
@@ -4206,6 +4215,8 @@ RollbackToSavepoint(const char *name)
                 (errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
                  errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));
 
+    smgrProcessWALRequirementInval(s->subTransactionId, false);
+
     /*
      * Mark "abort pending" all subtransactions up to the target
      * subtransaction.  The actual aborts will happen when control gets to
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0302507e6f..a0cf8d3e27 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -62,6 +63,54 @@ typedef struct PendingRelDelete
 
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a PendingRelSync entry is created, and
+ * 'skip_wal_min_blk' is set to the current size of the relation. Any operations
+ * on blocks < skip_wal_min_blk need to be WAL-logged as usual, but for
+ * operations on higher blocks, WAL-logging is skipped.
+
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct RelWalRequirement
+{
+    RelFileNode relnode;        /* relation created in same xact */
+    BlockNumber skip_wal_min_blk;/* WAL-logging skipped for blocks >=
+                                  * skip_wal_min_blk */
+    BlockNumber wal_log_min_blk; /* The minimum blk number that requires
+                                  * WAL-logging even if skipped by the above*/
+    SubTransactionId invalidate_sxid; /* subxid where this entry is
+                                       * invalidated */
+}    RelWalRequirement;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *relWalRequirements = NULL;
+static int     walreq_pending_invals = 0;
+
+static RelWalRequirement *getWalRequirementEntry(Relation rel, bool create);
+
 /*
  * RelationCreateStorage
  *        Create physical storage for a relation.
@@ -259,37 +308,114 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
      */
     if (RelationNeedsWAL(rel))
     {
-        /*
-         * Make an XLOG entry reporting the file truncation.
-         */
-        XLogRecPtr    lsn;
-        xl_smgr_truncate xlrec;
+        RelWalRequirement *walreq;
 
-        xlrec.blkno = nblocks;
-        xlrec.rnode = rel->rd_node;
-        xlrec.flags = SMGR_TRUNCATE_ALL;
+        /* get pending sync entry, create if not yet */
+        walreq = getWalRequirementEntry(rel, true);
 
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+        if (walreq->skip_wal_min_blk == InvalidBlockNumber ||
+            walreq->skip_wal_min_blk < nblocks)
+        {
+            /*
+             * This is the first time truncation of this relation in this
+             * transaction or truncation that leaves pages that need at-commit
+             * fsync.  Make an XLOG entry reporting the file truncation.
+             */
+            XLogRecPtr        lsn;
+            xl_smgr_truncate xlrec;
 
-        lsn = XLogInsert(RM_SMGR_ID,
-                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+            xlrec.blkno = nblocks;
+            xlrec.rnode = rel->rd_node;
+            xlrec.flags = SMGR_TRUNCATE_ALL;
 
-        /*
-         * Flush, because otherwise the truncation of the main relation might
-         * hit the disk before the WAL record, and the truncation of the FSM
-         * or visibility map. If we crashed during that window, we'd be left
-         * with a truncated heap, but the FSM or visibility map would still
-         * contain entries for the non-existent heap pages.
-         */
-        if (fsm || vm)
-            XLogFlush(lsn);
+            XLogBeginInsert();
+            XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+            lsn = XLogInsert(RM_SMGR_ID,
+                             XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+            /*
+             * Flush, because otherwise the truncation of the main relation
+             * might hit the disk before the WAL record, and the truncation of
+             * the FSM or visibility map. If we crashed during that window,
+             * we'd be left with a truncated heap, but the FSM or visibility
+             * map would still contain entries for the non-existent heap
+             * pages.
+             */
+            if (fsm || vm)
+                XLogFlush(lsn);
+
+            /* no longer skip WAL-logging for the blocks */
+            rel->rd_walrequirement->wal_log_min_blk = nblocks;
+        }
     }
 
     /* Do the real work */
     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
 }
 
+/*
+ * getWalRequirementEntry: get WAL requirement entry.
+ *
+ * Returns WAL requirement entry for the relation. The entry tracks
+ * WAL-skipping blocks for the relation.  The WAL-skipped blocks need fsync at
+ * commit time.  Creates one if needed when create is true.
+ */
+static RelWalRequirement *
+getWalRequirementEntry(Relation rel, bool create)
+{
+    RelWalRequirement *walreq_entry = NULL;
+    bool            found;
+
+    if (rel->rd_walrequirement)
+        return rel->rd_walrequirement;
+
+    /* we know we don't have pending sync entry */
+    if (!create && rel->rd_nowalrequirement)
+        return NULL;
+
+    if (!relWalRequirements)
+    {
+        /* First time through: initialize the hash table */
+        HASHCTL        ctl;
+
+        if (!create)
+            return NULL;
+
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(RelFileNode);
+        ctl.entrysize = sizeof(RelWalRequirement);
+        ctl.hash = tag_hash;
+        relWalRequirements = hash_create("pending relation sync table", 5,
+                                   &ctl, HASH_ELEM | HASH_FUNCTION);
+    }
+
+    walreq_entry = (RelWalRequirement *)
+        hash_search(relWalRequirements, (void *) &rel->rd_node,
+                    create ? HASH_ENTER: HASH_FIND,    &found);
+
+    if (!walreq_entry)
+    {
+        /* prevent further hash lookup */
+        rel->rd_nowalrequirement = true;
+        return NULL;
+    }
+
+    /* new entry created */
+    if (!found)
+    {
+        walreq_entry->wal_log_min_blk = InvalidBlockNumber;
+        walreq_entry->skip_wal_min_blk = InvalidBlockNumber;
+        walreq_entry->invalidate_sxid = InvalidSubTransactionId;
+    }
+
+    /* hold shortcut in Relation */
+    rel->rd_nowalrequirement = false;
+    rel->rd_walrequirement = walreq_entry;
+
+    return walreq_entry;
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -367,6 +493,34 @@ smgrDoPendingDeletes(bool isCommit)
     }
 }
 
+/*
+ * RelationInvalidateWALRequirements() -- invalidate wal requirement entry
+ */
+void
+RelationInvalidateWALRequirements(Relation rel)
+{
+    RelWalRequirement *walreq;
+
+    /* we know we don't have one */
+    if (rel->rd_nowalrequirement)
+        return;
+
+    walreq = getWalRequirementEntry(rel, false);
+    
+    if (!walreq)
+        return;
+
+    /*
+     * The state is reset at subtransaction commit/abort. No invalidation
+     * request must not come for the same relation in the same subtransaction.
+     */
+    Assert(walreq->invalidate_sxid == InvalidSubTransactionId);
+
+    walreq_pending_invals++;
+    walreq->invalidate_sxid = GetCurrentSubTransactionId();
+}
+
+
 /*
  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
  *
@@ -418,6 +572,154 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     return nrels;
 }
 
+
+/*
+ * Remember that the given relation doesn't need WAL-logging for the blocks
+ * after the current block size and the blocks are going to be sync'd at
+ * commit.
+ */
+void
+RecordWALSkipping(Relation rel)
+{
+    BlockNumber nblocks;
+    RelWalRequirement *walreq;
+
+    Assert(RelationNeedsWAL(rel));
+
+    /* get pending sync entry, create if not yet  */
+    walreq = getWalRequirementEntry(rel, true);
+
+    nblocks = RelationGetNumberOfBlocks(rel);
+
+    /*
+     *  Record only the first registration.
+     */
+    if (walreq->skip_wal_min_blk != InvalidBlockNumber)
+        return;
+
+    walreq->skip_wal_min_blk = nblocks;
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+    BlockNumber        blkno = InvalidBlockNumber;
+    RelWalRequirement *walreq;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    /* fetch exising pending sync entry */
+    walreq = getWalRequirementEntry(rel, false);
+
+    /*
+     * no point in doing further work if we know that we don't have special
+     * WAL requirement
+     */
+    if (!walreq)
+        return true;
+
+    Assert(BufferIsValid(buf));
+
+    blkno = BufferGetBlockNumber(buf);
+
+    /*
+     * We don't skip WAL-logging for pages that once done.
+     */
+    if (walreq->skip_wal_min_blk == InvalidBlockNumber ||
+        walreq->skip_wal_min_blk > blkno)
+        return true;
+
+    /*
+     * we don't skip WAL-logging for blocks that have got WAL-logged
+     * truncation
+     */
+    if (walreq->wal_log_min_blk != InvalidBlockNumber &&
+        walreq->wal_log_min_blk <= blkno)
+        return true;
+
+    return false;
+}
+
+/*
+ * Sync to disk any relations that we have skipped WAL-logging earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    if (!relWalRequirements)
+        return;
+
+    if (isCommit)
+    {
+        HASH_SEQ_STATUS status;
+        RelWalRequirement *walreq;
+
+        hash_seq_init(&status, relWalRequirements);
+
+        while ((walreq = hash_seq_search(&status)) != NULL)
+        {
+            if (walreq->skip_wal_min_blk != InvalidBlockNumber)
+            {
+                FlushRelationBuffersWithoutRelCache(walreq->relnode, false);
+                smgrimmedsync(smgropen(walreq->relnode, InvalidBackendId),
+                              MAIN_FORKNUM);
+            }
+        }
+    }
+
+    hash_destroy(relWalRequirements);
+    relWalRequirements = NULL;
+}
+
+/*
+ * Process pending invalidation of WAL requirements happened in the
+ * subtransaction
+ */
+void
+smgrProcessWALRequirementInval(SubTransactionId sxid, bool isCommit)
+{
+    HASH_SEQ_STATUS status;
+    RelWalRequirement *walreq;
+
+    if (!relWalRequirements || walreq_pending_invals == 0)
+        return;
+
+    /*
+     * It may take some time when there're many relWalRequirements entries.  We
+     * expect that we don't have relWalRequirements in almost all cases.
+     */
+    hash_seq_init(&status, relWalRequirements);
+
+    while ((walreq = hash_seq_search(&status)) != NULL)
+    {
+        if (walreq->invalidate_sxid == sxid)
+        {
+            Assert(walreq_pending_invals > 0);
+            walreq->invalidate_sxid = InvalidSubTransactionId;
+            walreq_pending_invals--;
+            if (isCommit)
+            {
+                walreq->skip_wal_min_blk = InvalidBlockNumber;
+                walreq->wal_log_min_blk = InvalidBlockNumber;
+            }
+        }        
+    }    
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3183b2aaa1..45bb0b5614 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11587,11 +11587,12 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     /*
      * Create and copy all forks of the relation, and schedule unlinking of
-     * old physical files.
+     * old physical files. Pending syncs for the old node is no longer needed.
      *
      * NOTE: any conflict in relfilenode value will be caught in
      * RelationCreateStorage().
      */
+    RelationInvalidateWALRequirements(rel);
     RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
 
     /* copy main fork */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..a9741f138c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
             BufferAccessStrategy strategy,
             bool *foundPtr);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int    rnode_comparator(const void *p1, const void *p2);
@@ -3153,20 +3154,41 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
     /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int            i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3183,7 +3205,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3213,18 +3235,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 84609e0725..95e834d45e 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -75,6 +75,7 @@
 #include "partitioning/partdesc.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/array.h"
@@ -5625,6 +5626,8 @@ load_relcache_init_file(bool shared)
         rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
         rel->rd_amcache = NULL;
         MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
+        rel->rd_nowalrequirement = false;
+        rel->rd_walrequirement = NULL;
 
         /*
          * Recompute lock and physical addressing info.  This is needed in
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 945ca50616..509394bb35 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -174,6 +174,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                    HeapTuple tup);
 
+extern void heap_register_sync(Relation relation);
 extern void heap_sync(Relation relation);
 
 /* in heap/pruneheap.c */
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9f638be924..76178b87f2 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -22,6 +22,7 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
+extern void RelationInvalidateWALRequirements(Relation rel);
 
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
@@ -29,6 +30,10 @@ extern void RelationTruncate(Relation rel, BlockNumber nblocks);
  */
 extern void smgrDoPendingDeletes(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void smgrProcessWALRequirementInval(SubTransactionId sxid, bool isCommit);
+extern void RecordWALSkipping(Relation rel);
+bool BufferNeedsWAL(Relation rel, Buffer buf);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
 extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c5826f691d..8a9ea041dd 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                 ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                    bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                        ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 54028515a7..30f0d5bd83 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -198,6 +198,14 @@ typedef struct RelationData
 
     /* use "struct" here to avoid needing to include pgstat.h: */
     struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+    /*
+     * rd_nowalrequirement is true if this relation is known not to have
+     * special WAL requirements.  Otherwise we need to ask smgr for an entry
+     * if rd_walrequirement is NULL.
+     */
+    bool                        rd_nowalrequirement;
+    struct RelWalRequirement   *rd_walrequirement;
 } RelationData;
 
 
-- 
2.16.3

From b7fd4d56f808f98d39861b8d04d2be7839c28202 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:18:04 +0900
Subject: [PATCH 5/7] Fix WAL skipping feature.

This patch replaces WAL-skipping means from HEAP_INSERT_SKIP_WAL to
pending-sync tracking infrastructure.
---
 src/backend/access/heap/heapam.c        | 104 ++++++++++++++++++++++++--------
 src/backend/access/heap/pruneheap.c     |   3 +-
 src/backend/access/heap/rewriteheap.c   |   3 -
 src/backend/access/heap/vacuumlazy.c    |   6 +-
 src/backend/access/heap/visibilitymap.c |   3 +-
 src/backend/commands/copy.c             |  13 ++--
 src/backend/commands/createas.c         |   9 ++-
 src/backend/commands/matview.c          |   6 +-
 src/backend/commands/tablecmds.c        |   5 +-
 src/include/access/heapam.h             |   3 +-
 src/include/access/tableam.h            |  11 +---
 11 files changed, 104 insertions(+), 62 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 024620ddc1..96f2cde3ce 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -28,6 +28,28 @@
  *      the POSTGRES heap access method used for all POSTGRES
  *      relations.
  *
+ * WAL CONSIDERATIONS
+ *      All heap operations are normally WAL-logged. but there are a few
+ *      exceptions. Temporary and unlogged relations never need to be
+ *      WAL-logged, but we can also skip WAL-logging for a table that was
+ *      created in the same transaction, if we don't need WAL for PITR or
+ *      WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *      the file to disk at COMMIT instead.
+ *
+ *      The same-relation optimization is not employed automatically on all
+ *      updates to a table that was created in the same transacton, because
+ *      for a small number of changes, it's cheaper to just create the WAL
+ *      records than fsyncing() the whole relation at COMMIT. It is only
+ *      worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *      or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *      operation; it will cause any subsequent updates to the table to skip
+ *      WAL-logging, if possible, and cause the heap to be synced to disk at
+ *      COMMIT.
+ *
+ *      To make that work, all modifications to heap must use
+ *      BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ *      for the given block.
+ *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -1963,7 +1985,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         XLogRecPtr    recptr;
 
@@ -2073,7 +2095,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     int            ndone;
     PGAlignedBlock scratch;
     Page        page;
-    bool        needwal;
     Size        saveFreeSpace;
     bool        need_tuple_data = RelationIsLogicallyLogged(relation);
     bool        need_cids = RelationIsAccessibleInLogicalDecoding(relation);
@@ -2081,7 +2102,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -2123,6 +2143,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
         Buffer        vmbuffer = InvalidBuffer;
         bool        all_visible_cleared = false;
         int            nthispage;
+        bool        needwal;
 
         CHECK_FOR_INTERRUPTS();
 
@@ -2134,6 +2155,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
                                            InvalidBuffer, options, bistate,
                                            &vmbuffer, NULL);
         page = BufferGetPage(buffer);
+        needwal = BufferNeedsWAL(relation, buffer);
 
         /* NO EREPORT(ERROR) from here till changes are logged */
         START_CRIT_SECTION();
@@ -2686,7 +2708,7 @@ l1:
      * NB: heap_abort_speculative() uses the same xlog record and replay
      * routines.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         XLogRecPtr    recptr;
 
@@ -2820,6 +2842,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
                 vmbuffer = InvalidBuffer,
                 vmbuffer_new = InvalidBuffer;
     bool        need_toast;
+    bool        oldbuf_needs_wal,
+                newbuf_needs_wal;
     Size        newtupsize,
                 pagefree;
     bool        have_tuple_lock = false;
@@ -3371,7 +3395,7 @@ l2:
 
         MarkBufferDirty(buffer);
 
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             xl_heap_lock xlrec;
             XLogRecPtr    recptr;
@@ -3585,8 +3609,20 @@ l2:
         MarkBufferDirty(newbuf);
     MarkBufferDirty(buffer);
 
-    /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    /*
+     *  XLOG stuff
+     *
+     * Emit heap-update log. When wal_level = minimal, we may emit insert or
+     * delete record according to wal-optimization.
+     */
+    oldbuf_needs_wal = BufferNeedsWAL(relation, buffer);
+
+    if (newbuf == buffer)
+        newbuf_needs_wal = oldbuf_needs_wal;
+    else
+        newbuf_needs_wal = BufferNeedsWAL(relation, newbuf);
+
+    if (oldbuf_needs_wal || newbuf_needs_wal)
     {
         XLogRecPtr    recptr;
 
@@ -3596,15 +3632,26 @@ l2:
          */
         if (RelationIsAccessibleInLogicalDecoding(relation))
         {
-            log_heap_new_cid(relation, &oldtup);
-            log_heap_new_cid(relation, heaptup);
+            if (oldbuf_needs_wal)
+                log_heap_new_cid(relation, &oldtup);
+            if (newbuf_needs_wal)
+                log_heap_new_cid(relation, heaptup);
         }
 
-        recptr = log_heap_update(relation, buffer,
-                                 newbuf, &oldtup, heaptup,
-                                 old_key_tuple,
-                                 all_visible_cleared,
-                                 all_visible_cleared_new);
+        if (oldbuf_needs_wal && newbuf_needs_wal)
+            recptr = log_heap_update(relation, buffer, newbuf,
+                                     &oldtup, heaptup,
+                                     old_key_tuple,
+                                     all_visible_cleared,
+                                     all_visible_cleared_new);
+        else if (oldbuf_needs_wal)
+            recptr = log_heap_delete(relation, buffer, &oldtup, old_key_tuple,
+                                     xmax_old_tuple, false,
+                                     all_visible_cleared);
+        else
+            recptr = log_heap_insert(relation, buffer, newtup,
+                                     0, all_visible_cleared_new);
+
         if (newbuf != buffer)
         {
             PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4482,7 +4529,7 @@ failed:
      * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
      * entries for everything anyway.)
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, *buffer))
     {
         xl_heap_lock xlrec;
         XLogRecPtr    recptr;
@@ -5234,7 +5281,7 @@ l4:
         MarkBufferDirty(buf);
 
         /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf))
         {
             xl_heap_lock_updated xlrec;
             XLogRecPtr    recptr;
@@ -5394,7 +5441,7 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
     htup->t_ctid = *tid;
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_confirm xlrec;
         XLogRecPtr    recptr;
@@ -5526,7 +5573,7 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
      * The WAL records generated here match heap_delete().  The same recovery
      * routines are used.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -5635,7 +5682,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_inplace xlrec;
         XLogRecPtr    recptr;
@@ -6832,7 +6879,7 @@ log_heap_clean(Relation reln, Buffer buffer,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
 
     xlrec.latestRemovedXid = latestRemovedXid;
     xlrec.nredirected = nredirected;
@@ -6880,7 +6927,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
     /* nor when there are no tuples to freeze */
     Assert(ntuples > 0);
 
@@ -7107,7 +7154,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
     int            bufflags;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
 
     XLogBeginInsert();
 
@@ -8711,9 +8758,16 @@ heap2_redo(XLogReaderState *record)
  *    heap_sync        - sync a heap, for use when no WAL has been written
  *
  * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
  *
  * Indexes are not touched.  (Currently, index operations associated with
  * the commands that use this are WAL-logged and so do not need fsync.
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index a3e51922d8..a05659b168 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
 #include "access/htup_details.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -258,7 +259,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
         /*
          * Emit a WAL HEAP_CLEAN record showing what we did
          */
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bce4274362..1ac77f7c14 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -654,9 +654,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 5c554f9465..3f5df63df8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -929,7 +929,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
                  * page has been previously WAL-logged, and if not, do that
                  * now.
                  */
-                if (RelationNeedsWAL(onerel) &&
+                if (BufferNeedsWAL(onerel, buf) &&
                     PageGetLSN(page) == InvalidXLogRecPtr)
                     log_newpage_buffer(buf, true);
 
@@ -1193,7 +1193,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
             }
 
             /* Now WAL-log freezing if necessary */
-            if (RelationNeedsWAL(onerel))
+            if (BufferNeedsWAL(onerel, buf))
             {
                 XLogRecPtr    recptr;
 
@@ -1575,7 +1575,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(onerel))
+    if (BufferNeedsWAL(onerel, buffer))
     {
         XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06b26..1f5f7d92dd 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
@@ -276,7 +277,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
         map[mapByte] |= (flags << mapOffset);
         MarkBufferDirty(vmBuf);
 
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, heapBuf))
         {
             if (XLogRecPtrIsInvalid(recptr))
             {
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 705df8900b..1074320a5a 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2391,8 +2391,7 @@ CopyFrom(CopyState cstate)
      *    - data is being written to relfilenode created in this transaction
      * then we can skip writing WAL.  It's safe because if the transaction
      * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the heap_sync at the bottom of this
-     * routine first.
+     * If it does commit, commit will do heap_sync().
      *
      * As mentioned in comments in utils/rel.h, the in-same-transaction test
      * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2438,7 +2437,7 @@ CopyFrom(CopyState cstate)
     {
         hi_options |= HEAP_INSERT_SKIP_FSM;
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(cstate->rel);
     }
 
     /*
@@ -3091,11 +3090,11 @@ CopyFrom(CopyState cstate)
     FreeExecutorState(estate);
 
     /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway)
+     * If we skipped writing WAL, then we will sync the heap at the end of
+     * the transaction. (We used to do it here, but it was later found out
+     * that to be safe, we must also avoid WAL-logging any subsequent
+     * actions on the pages we skipped WAL for). Indexes always use WAL.
      */
-    if (hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(cstate->rel);
 
     return processed;
 }
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 36e3d44aad..8cba15fd3c 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -557,8 +557,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->hi_options = HEAP_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+    if (!XLogIsNeeded())
+        heap_register_sync(intoRelationDesc);
+    myState->hi_options = HEAP_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,9 +605,7 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->rel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close rel, but keep lock until commit */
     table_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 5a47be4b33..5f447c6d94 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,7 +463,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
     if (!XLogIsNeeded())
-        myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+        heap_register_sync(transientrel);
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,9 +509,7 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->transientrel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close transientrel, but keep lock until commit */
     table_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 45bb0b5614..242311b0d7 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4666,8 +4666,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         hi_options = HEAP_INSERT_SKIP_FSM;
+
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(newrel);
     }
     else
     {
@@ -4958,8 +4959,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         FreeBulkInsertState(bistate);
 
         /* If we skipped writing WAL, then we need to sync the heap. */
-        if (hi_options & HEAP_INSERT_SKIP_WAL)
-            heap_sync(newrel);
 
         table_close(newrel, NoLock);
     }
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 509394bb35..a9aec90e86 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,11 +29,10 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
-#define HEAP_INSERT_SPECULATIVE 0x0010
+#define HEAP_INSERT_SPECULATIVE 0x0008
 
 typedef struct BulkInsertStateData *BulkInsertState;
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index c2baa9d7a8..268e672470 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -94,10 +94,9 @@ typedef struct TM_FailureData
 } TM_FailureData;
 
 /* "options" flag bits for table_insert */
-#define TABLE_INSERT_SKIP_WAL        0x0001
-#define TABLE_INSERT_SKIP_FSM        0x0002
-#define TABLE_INSERT_FROZEN            0x0004
-#define TABLE_INSERT_NO_LOGICAL        0x0008
+#define TABLE_INSERT_SKIP_FSM        0x0001
+#define TABLE_INSERT_FROZEN            0x0002
+#define TABLE_INSERT_NO_LOGICAL        0x0004
 
 /* flag bits fortable_lock_tuple */
 /* Follow tuples whose update is in progress if lock modes don't conflict  */
@@ -634,10 +633,6 @@ table_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot, Snapshot snap
  * behaviour of the AM. Several options might be ignored by AMs not supporting
  * them.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple will not
- * necessarily logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.  It's
-- 
2.16.3

From b703b1287f9cb6ab1c556909f90473fa3fe25877 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 25 Mar 2019 18:29:37 +0900
Subject: [PATCH 6/7] Change cluster to use the new pending sync infrastructure

When wal_level is minimal, CLUSTER gets benefit by moving file sync
from command end to transaction end by the pending-sync infrastructure
that file sync is performed at commit time.
---
 src/backend/access/heap/rewriteheap.c | 25 +++++-------------------
 src/backend/catalog/storage.c         | 36 +++++++++++++++++++++++++++++++++++
 src/backend/commands/cluster.c        | 13 +++++--------
 src/include/access/rewriteheap.h      |  2 +-
 src/include/catalog/storage.h         |  3 ++-
 5 files changed, 49 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 1ac77f7c14..494f7fcd41 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
 #include "access/xloginsert.h"
 
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 
 #include "lib/ilist.h"
 
@@ -144,7 +145,6 @@ typedef struct RewriteStateData
     Page        rs_buffer;        /* page currently being built */
     BlockNumber rs_blockno;        /* block where page will go */
     bool        rs_buffer_valid;    /* T if any tuples in buffer */
-    bool        rs_use_wal;        /* must we WAL-log inserts? */
     bool        rs_logical_rewrite; /* do we need to do logical rewriting */
     TransactionId rs_oldest_xmin;    /* oldest xmin used by caller to determine
                                      * tuple visibility */
@@ -238,15 +238,13 @@ static void logical_end_heap_rewrite(RewriteState state);
  * oldest_xmin    xid used by the caller to determine which tuples are dead
  * freeze_xid    xid before which tuples will be frozen
  * min_multi    multixact before which multis will be removed
- * use_wal        should the inserts to the new heap be WAL-logged?
  *
  * Returns an opaque RewriteState, allocated in current memory context,
  * to be used in subsequent calls to the other functions.
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-                   TransactionId freeze_xid, MultiXactId cutoff_multi,
-                   bool use_wal)
+                   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
     RewriteState state;
     MemoryContext rw_cxt;
@@ -271,7 +269,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     /* new_heap needn't be empty, just locked */
     state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
     state->rs_buffer_valid = false;
-    state->rs_use_wal = use_wal;
     state->rs_oldest_xmin = oldest_xmin;
     state->rs_freeze_xid = freeze_xid;
     state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +327,7 @@ end_heap_rewrite(RewriteState state)
     /* Write the last page, if any */
     if (state->rs_buffer_valid)
     {
-        if (state->rs_use_wal)
+        if (BlockNeedsWAL(state->rs_new_rel, state->rs_blockno))
             log_newpage(&state->rs_new_rel->rd_node,
                         MAIN_FORKNUM,
                         state->rs_blockno,
@@ -344,19 +341,7 @@ end_heap_rewrite(RewriteState state)
                    (char *) state->rs_buffer, true);
     }
 
-    /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.
-     *
-     * It's obvious that we must do this when not WAL-logging. It's less
-     * obvious that we have to do it even if we did WAL-log the pages. The
-     * reason is the same as in tablecmds.c's copy_relation_data(): we're
-     * writing data that's not in shared buffers, and so a CHECKPOINT
-     * occurring during the rewriteheap operation won't have fsync'd data we
-     * wrote before the checkpoint.
-     */
-    if (RelationNeedsWAL(state->rs_new_rel))
-        heap_sync(state->rs_new_rel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     logical_end_heap_rewrite(state);
 
@@ -692,7 +677,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
             /* Doesn't fit, so write out the existing page */
 
             /* XLOG stuff */
-            if (state->rs_use_wal)
+            if (BlockNeedsWAL(state->rs_new_rel, state->rs_blockno))
                 log_newpage(&state->rs_new_rel->rd_node,
                             MAIN_FORKNUM,
                             state->rs_blockno,
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index a0cf8d3e27..cd623eb3bb 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -613,6 +613,42 @@ RecordWALSkipping(Relation rel)
  * must WAL-log any changes to the once-truncated blocks, because replaying
  * the truncation record will destroy them.
  */
+bool
+BlockNeedsWAL(Relation rel, BlockNumber blkno)
+{
+    RelWalRequirement *walreq;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    /* fetch exising pending sync entry */
+    walreq = getWalRequirementEntry(rel, false);
+
+    /*
+     * no point in doing further work if we know that we don't have special
+     * WAL requirement
+     */
+    if (!walreq)
+        return true;
+
+    /*
+     * We don't skip WAL-logging for pages that once done.
+     */
+    if (walreq->skip_wal_min_blk == InvalidBlockNumber ||
+        walreq->skip_wal_min_blk > blkno)
+        return true;
+
+    /*
+     * we don't skip WAL-logging for blocks that have got WAL-logged
+     * truncation
+     */
+    if (walreq->wal_log_min_blk != InvalidBlockNumber &&
+        walreq->wal_log_min_blk <= blkno)
+        return true;
+
+    return false;
+}
+
 bool
 BufferNeedsWAL(Relation rel, Buffer buf)
 {
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 3e2a807640..e2c4897d07 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -767,7 +767,6 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
     IndexScanDesc indexScan;
     TableScanDesc tableScan;
     HeapScanDesc heapScan;
-    bool        use_wal;
     bool        is_system_catalog;
     TransactionId OldestXmin;
     TransactionId FreezeXid;
@@ -826,13 +825,11 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
         LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
 
     /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a WAL-logged rel.
+     * If wal_level is minimal, we skip WAL-logging even for WAL-requiring
+     * relations.  Otherwise follow whether it's a WAL-logged rel.
      */
-    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
-    /* use_wal off requires smgr_targblock be initially invalid */
-    Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
+    if (!XLogIsNeeded())
+        heap_register_sync(NewHeap);
 
     /*
      * If both tables have TOAST tables, perform toast swap by content.  It is
@@ -899,7 +896,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 
     /* Initialize the rewrite operation */
     rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, FreezeXid,
-                                 MultiXactCutoff, use_wal);
+                                 MultiXactCutoff);
 
     /*
      * Decide whether to use an indexscan or seqscan-and-optional-sort to scan
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 6006249d96..64efecf48b 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
                    TransactionId OldestXmin, TransactionId FreezeXid,
-                   MultiXactId MultiXactCutoff, bool use_wal);
+                   MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
                    HeapTuple newTuple);
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 76178b87f2..e8edbe5d71 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -33,7 +33,8 @@ extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
 extern void smgrDoPendingSyncs(bool isCommit);
 extern void smgrProcessWALRequirementInval(SubTransactionId sxid, bool isCommit);
 extern void RecordWALSkipping(Relation rel);
-bool BufferNeedsWAL(Relation rel, Buffer buf);
+extern bool BlockNeedsWAL(Relation rel, BlockNumber blkno);
+extern bool BufferNeedsWAL(Relation rel, Buffer buf);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
 extern void PostPrepare_smgr(void);
-- 
2.16.3

From 945ab5fa80089d489c204a010f2f5d551e6bec79 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 25 Mar 2019 20:39:21 +0900
Subject: [PATCH 7/7] Add a comment to ATExecSetTableSpace.

We use heap_register_sync() stuff to control WAL-logging and file sync
on bulk insertion, but we cannot use it because the function lacks the
ability to handle forks explicitly. Add a comment to explain that.
---
 src/backend/commands/tablecmds.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 242311b0d7..c7c7bcb308 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11594,7 +11594,13 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
     RelationInvalidateWALRequirements(rel);
     RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
 
-    /* copy main fork */
+    /*
+     * copy main fork
+     *
+     * You might think that we could use heap_register_sync() to control file
+     * sync and WAL-logging, but we cannot because the sutff lacks the ability
+     * to handle each fork explicitly.
+     */
     copy_relation_data(rel->rd_smgr, dstrel, MAIN_FORKNUM,
                        rel->rd_rel->relpersistence);
 
-- 
2.16.3

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

26 марта 2019 г., 10:35:07

Hello. I revised the patch I think addressing all your comments.

Differences from v7 patch are:

v9-0001:

 - Renamed the script from 016_ to 017_.

 - Added some additional tests.


v9-0002:
 - Fixed _bt_blwritepage().
   It is re-modified by v9-0007.


v9-0003: New patch.
 - Refactors out xlog sutff from heap_insert/delete.
   (log_heap_insert(), log_heap_udpate())


v9-0004: (v7-0003, v8-0004)
 - Renamed some struct names and member names.
   (PendingRelSync -> RelWalRequirement
     .sync_above -> skip_wal_min_blk, .truncated_to -> wal_log_min_blk)

 - Rename the addtional members in RelationData to rd_*.

 - Explicitly initialize the additional members only in
   load_relcache_init_file().

 - Added new interface functions that accept block number and
   SMgrRelation.
   (BlockNeedsWAL(), RecordPendingSync())

 - Support subtransaction, (or invalidation).
   (RelWalRequirement.create_sxid, invalidate_sxid,
    RelationInvalidateWALRequirements(), smgrDoPendingSyncs())

 - Support forks.
   (RelWalRequirement.forks, smgrDoPendingSyncs(), RecordPendingSync())

 - Removd elog(LOG)s and a leftover comment.

v9-0005: (v7-0004, v8-0005)

 - Fixed heap_update().
   (heap_update())

v9-0006: New patch.

 - Modifies CLUSTER to skip WAL logging.

v9-0007: New patch.

 - Modifies ALTER TABLE SET TABLESPACE to skip WAL logging.

v9-0008: New patch.

 - Modifies btbuild to skip WAL logging.

 - Modifies btinsertonpg to skip WAL logging after truncation.

 - Overrites on v9-0002's change.


ALL:

 - Rebased.

 - Fixed typos and mistakes in comments.
   

> At Wed, 20 Mar 2019 17:17:54 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in<20190320.171754.171896368.horiguchi.kyotaro@lab.ntt.co.jp>
 
> > > We still use heap_sync() in CLUSTER.  Can we migrate CLUSTER to the newer
> > > heap_register_sync()?  Patch v7 makes some commands use the new way (COPY,
> > > CREATE TABLE AS, REFRESH MATERIALIZED VIEW, ALTER TABLE) and leaves other
> > > commands using the old way (CREATE INDEX USING btree, ALTER TABLE SET
> > > TABLESPACE, CLUSTER).  It would make the system simpler to understand if we
> > > eliminated the old way.  If that creates more problems than it solves, please
> > > at least write down a coding rule to explain why certain commands shouldn't
> > > use the old way.
> > 
> > Perhaps doable for TABLESPACE and CLUSTER. I'm not sure about
> > CREATE INDEX. I'll consider them.
> 
> I added the CLUSTER case in the new patchset.  For the SET
> TABLESPACE case, it works on SMGR layer and manipulates fork
> files explicitly but this stuff is Relation based and doesn't
> distinguish forks. We can modify this stuff to work on smgr and
> make it fork-aware but I don't think it is worth doing.
> 
> CREATE INDEX is not changed in this version. I continue to
> consider it.

I managed to simplify the change. Please look at v9-0008.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 13fe16c4527273426d93429986700ac66810945d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/8] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/017_wal_optimize.pl | 254 ++++++++++++++++++++++++++++++++
 1 file changed, 254 insertions(+)
 create mode 100644 src/test/recovery/t/017_wal_optimize.pl

diff --git a/src/test/recovery/t/017_wal_optimize.pl b/src/test/recovery/t/017_wal_optimize.pl
new file mode 100644
index 0000000000..5d67548b54
--- /dev/null
+++ b/src/test/recovery/t/017_wal_optimize.pl
@@ -0,0 +1,254 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+sub check_orphan_relfilenodes
+{
+    my($node, $test_name) = @_;
+
+    my $db_oid = $node->safe_psql('postgres',
+       "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+    my $prefix = "base/$db_oid/";
+    my $filepaths_referenced = $node->safe_psql('postgres', "
+       SELECT pg_relation_filepath(oid) FROM pg_class
+       WHERE reltablespace = 0 and relpersistence <> 't' and
+       pg_relation_filepath(oid) IS NOT NULL;");
+    is_deeply([sort(map { "$prefix$_" }
+                    grep(/^[0-9]+$/,
+                         slurp_dir($node->data_dir . "/$prefix")))],
+              [sort split /\n/, $filepaths_referenced],
+                $test_name);
+    return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    # Primary needs to have wal_level = minimal here
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+    $node->start;
+
+    # Setup
+    my $tablespace_dir = $node->basedir . '/tablespace_other';
+    mkdir ($tablespace_dir);
+    $tablespace_dir = TestLib::real_dir($tablespace_dir);
+    $node->safe_psql('postgres',
+       "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+    # Test direct truncation optimization.  No tuples
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test1 (id serial PRIMARY KEY);
+        TRUNCATE test1;
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+    is($result, qq(0),
+       "wal_level = $wal_level, optimized truncation with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2 (id serial PRIMARY KEY);
+        INSERT INTO test2 VALUES (DEFAULT);
+        TRUNCATE test2;
+        INSERT INTO test2 VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with inserted table");
+
+    # Data file for COPY query in follow-up tests.
+    my $basedir = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after the
+    # truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test3;
+        COPY test3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+    is($result, qq(3),
+       "wal_level = $wal_level, optimized truncation with copied table");
+
+    # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test3a;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+        COPY test3a FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        COPY test3b FROM '$copy_file' DELIMITER ',';  -- set sync_above
+        UPDATE test3b SET id2 = id2 + 1;
+        DELETE FROM test3b;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+    is($result, qq(0),
+       "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+    # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test4;
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COPY test4 FROM '$copy_file' DELIMITER ',';
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+    is($result, qq(5),
+       "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test5 VALUES (DEFAULT, 1);
+        COPY test5 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER test6_before_row_insert
+          BEFORE INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+        CREATE TRIGGER test6_after_row_insert
+          AFTER INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+        COPY test6 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+    is($result, qq(9),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER test7_before_stat_truncate
+          BEFORE TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+        CREATE TRIGGER test7_after_stat_truncate
+          AFTER TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+        INSERT INTO test7 VALUES (DEFAULT, 1);
+        TRUNCATE test7;
+        COPY test7 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test redo of temp table creation.
+    $node->safe_psql('postgres', "
+        CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+    $node->stop('immediate');
+    $node->start;
+
+    check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.16.3

From 01691f5cf36e3bc75952b630088788c0da36b594 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 2/8] Write WAL for empty nbtree index build

After relation truncation indexes are also rebuild. It doesn't emit
WAL in minimal mode and if truncation happened within its creation
transaction, crash recovery leaves an empty index heap, which is
considered broken. This patch forces to emit WAL when an index_build
turns into empty nbtree index.
---
 src/backend/access/nbtree/nbtsort.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 46e0831834..e65d4aab0f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -622,8 +622,15 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
     /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
     RelationOpenSmgr(wstate->index);
 
-    /* XLOG stuff */
-    if (wstate->btws_use_wal)
+    /* XLOG stuff
+     *
+     * Even if minimal mode, WAL is required here if truncation happened after
+     * being created in the same transaction. It is not needed otherwise but
+     * we don't bother identifying the case precisely.
+     */
+    if (wstate->btws_use_wal ||
+        (RelationNeedsWAL(wstate->index) &&
+         (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0)))
     {
         /* We use the heap NEWPAGE record type for this */
         log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
-- 
2.16.3

From 09ecd87dee4187d1266799c8cc68e2ea9f700c9b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 25 Mar 2019 13:29:50 +0900
Subject: [PATCH 3/8] Move XLOG stuff from heap_insert and heap_delete

Succeeding commit makes heap_update emit insert and delete WAL
records. Move out XLOG stuff for insert and delete so that heap_update
can use the stuff.
---
 src/backend/access/heap/heapam.c | 277 ++++++++++++++++++++++-----------------
 1 file changed, 157 insertions(+), 120 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 137cc9257d..c6e71dba6b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -71,6 +71,11 @@
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
                     TransactionId xid, CommandId cid, int options);
+static XLogRecPtr log_heap_insert(Relation relation, Buffer buffer,
+                HeapTuple heaptup, int options, bool all_visible_cleared);
+static XLogRecPtr log_heap_delete(Relation relation, Buffer buffer,
+                HeapTuple tp, HeapTuple old_key_tuple, TransactionId new_xmax,
+                bool changingPart, bool all_visible_cleared);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
                 Buffer newbuf, HeapTuple oldtup,
                 HeapTuple newtup, HeapTuple old_key_tup,
@@ -1860,6 +1865,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     TransactionId xid = GetCurrentTransactionId();
     HeapTuple    heaptup;
     Buffer        buffer;
+    Page        page;
     Buffer        vmbuffer = InvalidBuffer;
     bool        all_visible_cleared = false;
 
@@ -1896,16 +1902,18 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
      */
     CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
 
+    page = BufferGetPage(buffer);
+
     /* NO EREPORT(ERROR) from here till changes are logged */
     START_CRIT_SECTION();
 
     RelationPutHeapTuple(relation, buffer, heaptup,
                          (options & HEAP_INSERT_SPECULATIVE) != 0);
 
-    if (PageIsAllVisible(BufferGetPage(buffer)))
+    if (PageIsAllVisible(page))
     {
         all_visible_cleared = true;
-        PageClearAllVisible(BufferGetPage(buffer));
+        PageClearAllVisible(page);
         visibilitymap_clear(relation,
                             ItemPointerGetBlockNumber(&(heaptup->t_self)),
                             vmbuffer, VISIBILITYMAP_VALID_BITS);
@@ -1927,76 +1935,11 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     /* XLOG stuff */
     if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
     {
-        xl_heap_insert xlrec;
-        xl_heap_header xlhdr;
         XLogRecPtr    recptr;
-        Page        page = BufferGetPage(buffer);
-        uint8        info = XLOG_HEAP_INSERT;
-        int            bufflags = 0;
-
-        /*
-         * If this is a catalog, we need to transmit combocids to properly
-         * decode, so log that as well.
-         */
-        if (RelationIsAccessibleInLogicalDecoding(relation))
-            log_heap_new_cid(relation, heaptup);
-
-        /*
-         * If this is the single and first tuple on page, we can reinit the
-         * page instead of restoring the whole thing.  Set flag, and hide
-         * buffer references from XLogInsert.
-         */
-        if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
-            PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
-        {
-            info |= XLOG_HEAP_INIT_PAGE;
-            bufflags |= REGBUF_WILL_INIT;
-        }
-
-        xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
-        xlrec.flags = 0;
-        if (all_visible_cleared)
-            xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
-        if (options & HEAP_INSERT_SPECULATIVE)
-            xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
-        Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
-
-        /*
-         * For logical decoding, we need the tuple even if we're doing a full
-         * page write, so make sure it's included even if we take a full-page
-         * image. (XXX We could alternatively store a pointer into the FPW).
-         */
-        if (RelationIsLogicallyLogged(relation) &&
-            !(options & HEAP_INSERT_NO_LOGICAL))
-        {
-            xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
-            bufflags |= REGBUF_KEEP_DATA;
-        }
-
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
-
-        xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
-        xlhdr.t_infomask = heaptup->t_data->t_infomask;
-        xlhdr.t_hoff = heaptup->t_data->t_hoff;
-
-        /*
-         * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
-         * write the whole page to the xlog, we don't need to store
-         * xl_heap_header in the xlog.
-         */
-        XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
-        XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
-        /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
-        XLogRegisterBufData(0,
-                            (char *) heaptup->t_data + SizeofHeapTupleHeader,
-                            heaptup->t_len - SizeofHeapTupleHeader);
-
-        /* filtering by origin on a row level is much more efficient */
-        XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
-
-        recptr = XLogInsert(RM_HEAP_ID, info);
 
+        recptr = log_heap_insert(relation, buffer, heaptup,
+                                 options, all_visible_cleared);
+            
         PageSetLSN(page, recptr);
     }
 
@@ -2715,58 +2658,10 @@ l1:
      */
     if (RelationNeedsWAL(relation))
     {
-        xl_heap_delete xlrec;
-        xl_heap_header xlhdr;
         XLogRecPtr    recptr;
 
-        /* For logical decode we need combocids to properly decode the catalog */
-        if (RelationIsAccessibleInLogicalDecoding(relation))
-            log_heap_new_cid(relation, &tp);
-
-        xlrec.flags = 0;
-        if (all_visible_cleared)
-            xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
-        if (changingPart)
-            xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
-        xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
-                                              tp.t_data->t_infomask2);
-        xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
-        xlrec.xmax = new_xmax;
-
-        if (old_key_tuple != NULL)
-        {
-            if (relation->rd_rel->relreplident == REPLICA_IDENTITY_FULL)
-                xlrec.flags |= XLH_DELETE_CONTAINS_OLD_TUPLE;
-            else
-                xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
-        }
-
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
-
-        XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-
-        /*
-         * Log replica identity of the deleted tuple if there is one
-         */
-        if (old_key_tuple != NULL)
-        {
-            xlhdr.t_infomask2 = old_key_tuple->t_data->t_infomask2;
-            xlhdr.t_infomask = old_key_tuple->t_data->t_infomask;
-            xlhdr.t_hoff = old_key_tuple->t_data->t_hoff;
-
-            XLogRegisterData((char *) &xlhdr, SizeOfHeapHeader);
-            XLogRegisterData((char *) old_key_tuple->t_data
-                             + SizeofHeapTupleHeader,
-                             old_key_tuple->t_len
-                             - SizeofHeapTupleHeader);
-        }
-
-        /* filtering by origin on a row level is much more efficient */
-        XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
-
-        recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
-
+        recptr = log_heap_delete(relation, buffer, &tp, old_key_tuple, new_xmax,
+                                 changingPart, all_visible_cleared);
         PageSetLSN(page, recptr);
     }
 
@@ -7016,6 +6911,148 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
     return recptr;
 }
 
+/*
+ * Perform XLogInsert for a heap-insert operation.  Caller must already
+ * have modified the buffer and marked it dirty.
+ */
+XLogRecPtr
+log_heap_insert(Relation relation, Buffer buffer,
+                HeapTuple heaptup, int options, bool all_visible_cleared)
+{
+    xl_heap_insert xlrec;
+    xl_heap_header xlhdr;
+    uint8        info = XLOG_HEAP_INSERT;
+    int            bufflags = 0;
+    Page        page = BufferGetPage(buffer);
+
+    /*
+     * If this is a catalog, we need to transmit combocids to properly
+     * decode, so log that as well.
+     */
+    if (RelationIsAccessibleInLogicalDecoding(relation))
+        log_heap_new_cid(relation, heaptup);
+
+    /*
+     * If this is the single and first tuple on page, we can reinit the
+     * page instead of restoring the whole thing.  Set flag, and hide
+     * buffer references from XLogInsert.
+     */
+    if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
+        PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
+    {
+        info |= XLOG_HEAP_INIT_PAGE;
+        bufflags |= REGBUF_WILL_INIT;
+    }
+
+    xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
+    xlrec.flags = 0;
+    if (all_visible_cleared)
+        xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
+    if (options & HEAP_INSERT_SPECULATIVE)
+        xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
+    Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
+
+    /*
+     * For logical decoding, we need the tuple even if we're doing a full
+     * page write, so make sure it's included even if we take a full-page
+     * image. (XXX We could alternatively store a pointer into the FPW).
+     */
+    if (RelationIsLogicallyLogged(relation) &&
+        !(options & HEAP_INSERT_NO_LOGICAL))
+    {
+        xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
+        bufflags |= REGBUF_KEEP_DATA;
+    }
+
+    XLogBeginInsert();
+    XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
+
+    xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
+    xlhdr.t_infomask = heaptup->t_data->t_infomask;
+    xlhdr.t_hoff = heaptup->t_data->t_hoff;
+
+    /*
+     * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
+     * write the whole page to the xlog, we don't need to store
+     * xl_heap_header in the xlog.
+     */
+    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
+    XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
+    /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
+    XLogRegisterBufData(0,
+                        (char *) heaptup->t_data + SizeofHeapTupleHeader,
+                        heaptup->t_len - SizeofHeapTupleHeader);
+
+    /* filtering by origin on a row level is much more efficient */
+    XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
+    return XLogInsert(RM_HEAP_ID, info);
+}
+
+/*
+ * Perform XLogInsert for a heap-insert operation.  Caller must already
+ * have modified the buffer and marked it dirty.
+ *
+ * NB: heap_abort_speculative() uses the same xlog record and replay
+ * routines.
+ */
+static XLogRecPtr
+log_heap_delete(Relation relation, Buffer buffer,
+                HeapTuple tp, HeapTuple old_key_tuple, TransactionId new_xmax,
+                bool changingPart, bool all_visible_cleared)
+{
+    xl_heap_delete xlrec;
+    xl_heap_header xlhdr;
+
+    /* For logical decode we need combocids to properly decode the catalog */
+    if (RelationIsAccessibleInLogicalDecoding(relation))
+        log_heap_new_cid(relation, tp);
+
+    xlrec.flags = 0;
+    if (all_visible_cleared)
+        xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
+    if (changingPart)
+        xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
+    xlrec.infobits_set = compute_infobits(tp->t_data->t_infomask,
+                                          tp->t_data->t_infomask2);
+    xlrec.offnum = ItemPointerGetOffsetNumber(&tp->t_self);
+    xlrec.xmax = new_xmax;
+
+    if (old_key_tuple != NULL)
+    {
+        if (relation->rd_rel->relreplident == REPLICA_IDENTITY_FULL)
+            xlrec.flags |= XLH_DELETE_CONTAINS_OLD_TUPLE;
+        else
+            xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
+    }
+
+    XLogBeginInsert();
+    XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
+
+    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+    /*
+     * Log replica identity of the deleted tuple if there is one
+     */
+    if (old_key_tuple != NULL)
+    {
+        xlhdr.t_infomask2 = old_key_tuple->t_data->t_infomask2;
+        xlhdr.t_infomask = old_key_tuple->t_data->t_infomask;
+        xlhdr.t_hoff = old_key_tuple->t_data->t_hoff;
+
+        XLogRegisterData((char *) &xlhdr, SizeOfHeapHeader);
+        XLogRegisterData((char *) old_key_tuple->t_data
+                         + SizeofHeapTupleHeader,
+                         old_key_tuple->t_len
+                         - SizeofHeapTupleHeader);
+    }
+
+    /* filtering by origin on a row level is much more efficient */
+    XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
+    return XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
+}
+
 /*
  * Perform XLogInsert for a heap-update operation.  Caller must already
  * have modified the buffer(s) and marked them dirty.
-- 
2.16.3

From ed3a737a571b268503804372ebac3a31247493be Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 26 Mar 2019 15:34:48 +0900
Subject: [PATCH 4/8] Add infrastructure to WAL-logging skip feature

We used to optimize WAL-logging for truncation of in-transaction
crated tables in minimal mode by just signaling by
HEAP_INSERT_SKIP_WAL option on heap operations. This mechanism can
emit WAL records that results in corrupt state for certain series of
in-transaction operations. This patch provides infrastructure to track
pending at-commit fsyncs for a relation and in-transaction
truncations. heap_register_sync() should be used to start tracking
before batch operations like COPY and CLUSTER, and use
BufferNeedsWAL() instead of RelationNeedsWAL() at the places related
to WAL-logging about heap-modifying operations.
---
 src/backend/access/heap/heapam.c    |  31 +++
 src/backend/access/transam/xact.c   |  11 +
 src/backend/catalog/storage.c       | 418 ++++++++++++++++++++++++++++++++++--
 src/backend/commands/tablecmds.c    |   4 +-
 src/backend/storage/buffer/bufmgr.c |  39 +++-
 src/backend/utils/cache/relcache.c  |   3 +
 src/include/access/heapam.h         |   1 +
 src/include/catalog/storage.h       |   8 +
 src/include/storage/bufmgr.h        |   2 +
 src/include/utils/rel.h             |   8 +
 10 files changed, 493 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c6e71dba6b..5a8627507f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -8800,3 +8801,33 @@ heap_mask(char *pagedata, BlockNumber blkno)
         }
     }
 }
+
+/*
+ *    heap_register_sync    - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RecordWALSkipping(rel);
+    if (OidIsValid(rel->rd_rel->reltoastrelid))
+    {
+        Relation    toastrel;
+
+        toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+        RecordWALSkipping(toastrel);
+        heap_close(toastrel, AccessShareLock);
+    }
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c3214d4f4d..ad7cb3bcb9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2022,6 +2022,9 @@ CommitTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2254,6 +2257,9 @@ PrepareTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrDoPendingSyncs(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2579,6 +2585,7 @@ AbortTransaction(void)
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
     AtAbort_Twophase();
+    smgrDoPendingSyncs(false);    /* abandon pending syncs */
 
     /*
      * Advertise the fact that we aborted in pg_xact (assuming that we got as
@@ -4097,6 +4104,8 @@ ReleaseSavepoint(const char *name)
                 (errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
                  errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));
 
+    smgrProcessWALRequirementInval(s->subTransactionId, true);
+
     /*
      * Mark "commit pending" all subtransactions up to the target
      * subtransaction.  The actual commits will happen when control gets to
@@ -4206,6 +4215,8 @@ RollbackToSavepoint(const char *name)
                 (errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
                  errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));
 
+    smgrProcessWALRequirementInval(s->subTransactionId, false);
+
     /*
      * Mark "abort pending" all subtransactions up to the target
      * subtransaction.  The actual aborts will happen when control gets to
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0302507e6f..be37174ef2 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -27,7 +27,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
-#include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -62,6 +62,58 @@ typedef struct PendingRelDelete
 
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a RelWalRequirement entry is created, and
+ * 'skip_wal_min_blk' is set to the current size of the relation. Any operations
+ * on blocks < skip_wal_min_blk need to be WAL-logged as usual, but for
+ * operations on higher blocks, WAL-logging is skipped.
+
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct RelWalRequirement
+{
+    RelFileNode relnode;            /* relation created in same xact */
+    bool        forks[MAX_FORKNUM + 1];    /* target forknums */
+    BlockNumber skip_wal_min_blk;    /* WAL-logging skipped for blocks >=
+                                     * skip_wal_min_blk */
+    BlockNumber wal_log_min_blk;     /* The minimum blk number that requires
+                                     * WAL-logging even if skipped by the
+                                     * above*/
+    SubTransactionId create_sxid;    /* subxid where this entry is created */
+    SubTransactionId invalidate_sxid; /* subxid where this entry is
+                                       * invalidated */
+}    RelWalRequirement;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *walRequirements = NULL;
+
+static RelWalRequirement *getWalRequirementEntry(Relation rel, bool create);
+static RelWalRequirement *getWalRequirementEntryRNode(RelFileNode *node,
+                                                      bool create);
+
 /*
  * RelationCreateStorage
  *        Create physical storage for a relation.
@@ -259,37 +311,290 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
      */
     if (RelationNeedsWAL(rel))
     {
-        /*
-         * Make an XLOG entry reporting the file truncation.
-         */
-        XLogRecPtr    lsn;
-        xl_smgr_truncate xlrec;
+        RelWalRequirement *walreq;
 
-        xlrec.blkno = nblocks;
-        xlrec.rnode = rel->rd_node;
-        xlrec.flags = SMGR_TRUNCATE_ALL;
+        /* get pending sync entry, create if not yet */
+        walreq = getWalRequirementEntry(rel, true);
 
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+        if (walreq->skip_wal_min_blk == InvalidBlockNumber ||
+            walreq->skip_wal_min_blk < nblocks)
+        {
+            /*
+             * This is the first time truncation of this relation in this
+             * transaction or truncation that leaves pages that need at-commit
+             * fsync.  Make an XLOG entry reporting the file truncation.
+             */
+            XLogRecPtr        lsn;
+            xl_smgr_truncate xlrec;
 
-        lsn = XLogInsert(RM_SMGR_ID,
-                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+            xlrec.blkno = nblocks;
+            xlrec.rnode = rel->rd_node;
+            xlrec.flags = SMGR_TRUNCATE_ALL;
 
-        /*
-         * Flush, because otherwise the truncation of the main relation might
-         * hit the disk before the WAL record, and the truncation of the FSM
-         * or visibility map. If we crashed during that window, we'd be left
-         * with a truncated heap, but the FSM or visibility map would still
-         * contain entries for the non-existent heap pages.
-         */
-        if (fsm || vm)
-            XLogFlush(lsn);
+            XLogBeginInsert();
+            XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+            lsn = XLogInsert(RM_SMGR_ID,
+                             XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+            /*
+             * Flush, because otherwise the truncation of the main relation
+             * might hit the disk before the WAL record, and the truncation of
+             * the FSM or visibility map. If we crashed during that window,
+             * we'd be left with a truncated heap, but the FSM or visibility
+             * map would still contain entries for the non-existent heap
+             * pages.
+             */
+            if (fsm || vm)
+                XLogFlush(lsn);
+
+            /* no longer skip WAL-logging for the blocks */
+            rel->rd_walrequirement->wal_log_min_blk = nblocks;
+        }
     }
 
     /* Do the real work */
     smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
 }
 
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+    BlockNumber        blkno = InvalidBlockNumber;
+    RelWalRequirement *walreq;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    /* fetch existing pending sync entry */
+    walreq = getWalRequirementEntry(rel, false);
+
+    /*
+     * no point in doing further work if we know that we don't have special
+     * WAL requirement
+     */
+    if (!walreq)
+        return true;
+
+    Assert(BufferIsValid(buf));
+
+    blkno = BufferGetBlockNumber(buf);
+
+    /*
+     * We don't skip WAL-logging for pages that once done.
+     */
+    if (walreq->skip_wal_min_blk == InvalidBlockNumber ||
+        walreq->skip_wal_min_blk > blkno)
+        return true;
+
+    /*
+     * we don't skip WAL-logging for blocks that have got WAL-logged
+     * truncation
+     */
+    if (walreq->wal_log_min_blk != InvalidBlockNumber &&
+        walreq->wal_log_min_blk <= blkno)
+        return true;
+
+    return false;
+}
+
+bool
+BlockNeedsWAL(Relation rel, BlockNumber blkno)
+{
+    RelWalRequirement *walreq;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    /* fetch exising pending sync entry */
+    walreq = getWalRequirementEntry(rel, false);
+
+    /*
+     * no point in doing further work if we know that we don't have special
+     * WAL requirement
+     */
+    if (!walreq)
+        return true;
+
+    /*
+     * We don't skip WAL-logging for pages that once done.
+     */
+    if (walreq->skip_wal_min_blk == InvalidBlockNumber ||
+        walreq->skip_wal_min_blk > blkno)
+        return true;
+
+    /*
+     * we don't skip WAL-logging for blocks that have got WAL-logged
+     * truncation
+     */
+    if (walreq->wal_log_min_blk != InvalidBlockNumber &&
+        walreq->wal_log_min_blk <= blkno)
+        return true;
+
+    return false;
+}
+
+/*
+ * Remember that the given relation doesn't need WAL-logging for the blocks
+ * after the current block size and fo the blocks that are going to be synced
+ * at commit.
+ */
+void
+RecordWALSkipping(Relation rel)
+{
+    RelWalRequirement *walreq;
+
+    Assert(RelationNeedsWAL(rel));
+
+    /* get pending sync entry, create if not yet  */
+    walreq = getWalRequirementEntry(rel, true);
+
+    /*
+     *  Record only the first registration.
+     */
+    if (walreq->skip_wal_min_blk != InvalidBlockNumber)
+        return;
+
+    walreq->skip_wal_min_blk = RelationGetNumberOfBlocks(rel);
+}
+
+/*
+ * Record commit-time file sync. This shouldn't be used mixing with
+ * RecordWALSkipping.
+ */
+void
+RecordPendingSync(SMgrRelation rel, ForkNumber forknum)
+{
+    RelWalRequirement *walreq;
+
+    walreq = getWalRequirementEntryRNode(&rel->smgr_rnode.node, true);
+    walreq->forks[forknum] = true;
+    walreq->skip_wal_min_blk = 0;
+}
+
+/*
+ * RelationInvalidateWALRequirements() -- invalidate wal requirement entry
+ */
+void
+RelationInvalidateWALRequirements(Relation rel)
+{
+    RelWalRequirement *walreq;
+
+    /* we know we don't have one */
+    if (rel->rd_nowalrequirement)
+        return;
+
+    walreq = getWalRequirementEntry(rel, false);
+    
+    if (!walreq)
+        return;
+
+    /*
+     * The state is reset at subtransaction commit/abort. No invalidation
+     * request must not come for the same relation in the same subtransaction.
+     */
+    Assert(walreq->invalidate_sxid == InvalidSubTransactionId);
+
+    walreq->invalidate_sxid = GetCurrentSubTransactionId();
+}
+
+/*
+ * getWalRequirementEntry: get WAL requirement entry.
+ *
+ * Returns WAL requirement entry for the relation. The entry tracks
+ * WAL-skipping blocks for the relation.  The WAL-skipped blocks need fsync at
+ * commit time.  Creates one if needed when create is true.
+ */
+static RelWalRequirement *
+getWalRequirementEntry(Relation rel, bool create)
+{
+    RelWalRequirement *walreq_entry = NULL;
+
+    if (rel->rd_walrequirement)
+        return rel->rd_walrequirement;
+
+    /* we know we don't have pending sync entry */
+    if (!create && rel->rd_nowalrequirement)
+        return NULL;
+
+    walreq_entry = getWalRequirementEntryRNode(&rel->rd_node, create);
+
+    if (!walreq_entry)
+    {
+        /* prevent further hash lookup */
+        rel->rd_nowalrequirement = true;
+        return NULL;
+    }
+
+    walreq_entry->forks[MAIN_FORKNUM] = true;
+
+    /* hold shortcut in Relation */
+    rel->rd_nowalrequirement = false;
+    rel->rd_walrequirement = walreq_entry;
+
+    return walreq_entry;
+}
+
+/*
+ * getWalRequirementEntryRNode: get WAL requirement entry by rnode
+ *
+ * Returns WAL requirement entry for the RelFileNode.
+ */
+static RelWalRequirement *
+getWalRequirementEntryRNode(RelFileNode *rnode, bool create)
+{
+    RelWalRequirement *walreq_entry = NULL;
+    bool            found;
+
+    if (!walRequirements)
+    {
+        /* First time through: initialize the hash table */
+        HASHCTL        ctl;
+
+        if (!create)
+            return NULL;
+
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(RelFileNode);
+        ctl.entrysize = sizeof(RelWalRequirement);
+        ctl.hash = tag_hash;
+        walRequirements = hash_create("pending relation sync table", 5,
+                                   &ctl, HASH_ELEM | HASH_FUNCTION);
+    }
+
+    walreq_entry = (RelWalRequirement *)
+        hash_search(walRequirements, (void *) rnode,
+                    create ? HASH_ENTER: HASH_FIND,    &found);
+
+    if (!walreq_entry)
+        return NULL;
+
+    /* new entry created */
+    if (!found)
+    {
+        memset(&walreq_entry->forks, 0, sizeof(sizeof(walreq_entry->forks)));
+        walreq_entry->wal_log_min_blk = InvalidBlockNumber;
+        walreq_entry->skip_wal_min_blk = InvalidBlockNumber;
+        walreq_entry->create_sxid = GetCurrentSubTransactionId();
+        walreq_entry->invalidate_sxid = InvalidSubTransactionId;
+    }
+
+    return walreq_entry;
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -418,6 +723,75 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     return nrels;
 }
 
+/*
+ * Sync to disk any relations that we have skipped WAL-logging earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    if (!walRequirements)
+        return;
+
+    if (isCommit)
+    {
+        HASH_SEQ_STATUS status;
+        RelWalRequirement *walreq;
+
+        hash_seq_init(&status, walRequirements);
+
+        while ((walreq = hash_seq_search(&status)) != NULL)
+        {
+            if (walreq->skip_wal_min_blk != InvalidBlockNumber &&
+                walreq->invalidate_sxid == InvalidSubTransactionId)
+            {
+                int f;
+
+                FlushRelationBuffersWithoutRelCache(walreq->relnode, false);
+
+                /* flush all requested forks  */
+                for (f = MAIN_FORKNUM ; f <= MAX_FORKNUM ; f++)
+                {
+                    if (walreq->forks[f])
+                        smgrimmedsync(smgropen(walreq->relnode,
+                                               InvalidBackendId), f);
+                }
+            }
+        }
+    }
+
+    hash_destroy(walRequirements);
+    walRequirements = NULL;
+}
+
+/*
+ * Process pending invalidation of WAL requirements happened in the
+ * subtransaction
+ */
+void
+smgrProcessWALRequirementInval(SubTransactionId sxid, bool isCommit)
+{
+    HASH_SEQ_STATUS status;
+    RelWalRequirement *walreq;
+
+    if (!walRequirements)
+        return;
+
+    /* We expect that we don't have walRequirements in almost all cases */
+    hash_seq_init(&status, walRequirements);
+
+    while ((walreq = hash_seq_search(&status)) != NULL)
+    {
+        /* remove useless entry */
+        if (isCommit ?
+            walreq->invalidate_sxid == sxid :
+            walreq->create_sxid == sxid)
+            hash_search(walRequirements, &walreq->relnode, HASH_REMOVE, NULL);
+        /* or cancel invalidation  */
+        else if (!isCommit && walreq->invalidate_sxid == sxid)
+            walreq->invalidate_sxid = InvalidSubTransactionId;
+    }
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3183b2aaa1..c9a0e02168 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11587,11 +11587,13 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     /*
      * Create and copy all forks of the relation, and schedule unlinking of
-     * old physical files.
+     * old physical files. WAL requirements for the old node is no longer
+     * needed.
      *
      * NOTE: any conflict in relfilenode value will be caught in
      * RelationCreateStorage().
      */
+    RelationInvalidateWALRequirements(rel);
     RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
 
     /* copy main fork */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..f00826712a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
             BufferAccessStrategy strategy,
             bool *foundPtr);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int    rnode_comparator(const void *p1, const void *p2);
@@ -3153,20 +3154,40 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
     /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int            i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3183,7 +3204,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3213,18 +3234,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 84609e0725..95e834d45e 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -75,6 +75,7 @@
 #include "partitioning/partdesc.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/array.h"
@@ -5625,6 +5626,8 @@ load_relcache_init_file(bool shared)
         rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
         rel->rd_amcache = NULL;
         MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
+        rel->rd_nowalrequirement = false;
+        rel->rd_walrequirement = NULL;
 
         /*
          * Recompute lock and physical addressing info.  This is needed in
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 3773a4df85..3d4fb7f3c3 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -172,6 +172,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                    HeapTuple tup);
 
+extern void heap_register_sync(Relation relation);
 extern void heap_sync(Relation relation);
 
 /* in heap/pruneheap.c */
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9f638be924..9034465001 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -16,12 +16,18 @@
 
 #include "storage/block.h"
 #include "storage/relfilenode.h"
+#include "storage/smgr.h"
 #include "utils/relcache.h"
 
 extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
+extern bool BufferNeedsWAL(Relation rel, Buffer buf);
+extern bool BlockNeedsWAL(Relation rel, BlockNumber blkno);
+extern void RecordWALSkipping(Relation rel);
+extern void RecordPendingSync(SMgrRelation rel, ForkNumber forknum);
+extern void RelationInvalidateWALRequirements(Relation rel);
 
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
@@ -29,6 +35,8 @@ extern void RelationTruncate(Relation rel, BlockNumber nblocks);
  */
 extern void smgrDoPendingDeletes(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingSyncs(bool isCommit);
+extern void smgrProcessWALRequirementInval(SubTransactionId sxid, bool isCommit);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
 extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c5826f691d..8a9ea041dd 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                 ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                    bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                        ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 54028515a7..30f0d5bd83 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -198,6 +198,14 @@ typedef struct RelationData
 
     /* use "struct" here to avoid needing to include pgstat.h: */
     struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+    /*
+     * rd_nowalrequirement is true if this relation is known not to have
+     * special WAL requirements.  Otherwise we need to ask smgr for an entry
+     * if rd_walrequirement is NULL.
+     */
+    bool                        rd_nowalrequirement;
+    struct RelWalRequirement   *rd_walrequirement;
 } RelationData;
 
 
-- 
2.16.3

From 24e392a8423c1bed350b58e5cda56a140d2730ce Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 16:18:04 +0900
Subject: [PATCH 5/8] Fix WAL skipping feature.

This patch replaces WAL-skipping means from HEAP_INSERT_SKIP_WAL to
pending-sync tracking infrastructure.
---
 src/backend/access/heap/heapam.c        | 109 ++++++++++++++++++++++++--------
 src/backend/access/heap/pruneheap.c     |   3 +-
 src/backend/access/heap/rewriteheap.c   |   3 -
 src/backend/access/heap/vacuumlazy.c    |   6 +-
 src/backend/access/heap/visibilitymap.c |   3 +-
 src/backend/commands/copy.c             |  13 ++--
 src/backend/commands/createas.c         |   9 ++-
 src/backend/commands/matview.c          |   6 +-
 src/backend/commands/tablecmds.c        |   6 +-
 src/include/access/heapam.h             |   3 +-
 src/include/access/tableam.h            |  11 +---
 11 files changed, 106 insertions(+), 66 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5a8627507f..00416c4a99 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -28,6 +28,27 @@
  *      the POSTGRES heap access method used for all POSTGRES
  *      relations.
  *
+ * WAL CONSIDERATIONS
+ *      All heap operations are normally WAL-logged. but there are a few
+ *      exceptions. Temporary and unlogged relations never need to be
+ *      WAL-logged, but we can also skip WAL-logging for a table that was
+ *      created in the same transaction, if we don't need WAL for PITR or WAL
+ *      archival purposes (i.e. if wal_level=minimal), and we fsync() the file
+ *      to disk at COMMIT instead.
+ *
+ *      The same-relation optimization is not employed automatically on all
+ *      updates to a table that was created in the same transaction, because for
+ *      a small number of changes, it's cheaper to just create the WAL records
+ *      than fsync()ing the whole relation at COMMIT. It is only worthwhile for
+ *      (presumably) large operations like COPY, CLUSTER, or VACUUM FULL. Use
+ *      heap_register_sync() to initiate such an operation; it will cause any
+ *      subsequent updates to the table to skip WAL-logging, if possible, and
+ *      cause the heap to be synced to disk at COMMIT.
+ *
+ *      To make that work, all modifications to heap must use
+ *      BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ *      for the given block.
+ *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -1934,7 +1955,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         XLogRecPtr    recptr;
 
@@ -2044,7 +2065,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     int            ndone;
     PGAlignedBlock scratch;
     Page        page;
-    bool        needwal;
     Size        saveFreeSpace;
     bool        need_tuple_data = RelationIsLogicallyLogged(relation);
     bool        need_cids = RelationIsAccessibleInLogicalDecoding(relation);
@@ -2052,7 +2072,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -2094,6 +2113,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
         Buffer        vmbuffer = InvalidBuffer;
         bool        all_visible_cleared = false;
         int            nthispage;
+        bool        needwal;
 
         CHECK_FOR_INTERRUPTS();
 
@@ -2105,6 +2125,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
                                            InvalidBuffer, options, bistate,
                                            &vmbuffer, NULL);
         page = BufferGetPage(buffer);
+        needwal = BufferNeedsWAL(relation, buffer);
 
         /* NO EREPORT(ERROR) from here till changes are logged */
         START_CRIT_SECTION();
@@ -2657,7 +2678,7 @@ l1:
      * NB: heap_abort_speculative() uses the same xlog record and replay
      * routines.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         XLogRecPtr    recptr;
 
@@ -2791,6 +2812,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
                 vmbuffer = InvalidBuffer,
                 vmbuffer_new = InvalidBuffer;
     bool        need_toast;
+    bool        oldbuf_needs_wal,
+                newbuf_needs_wal;
     Size        newtupsize,
                 pagefree;
     bool        have_tuple_lock = false;
@@ -3342,7 +3365,7 @@ l2:
 
         MarkBufferDirty(buffer);
 
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             xl_heap_lock xlrec;
             XLogRecPtr    recptr;
@@ -3556,8 +3579,20 @@ l2:
         MarkBufferDirty(newbuf);
     MarkBufferDirty(buffer);
 
-    /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    /*
+     *  XLOG stuff
+     *
+     * Emit heap-update log. When wal_level = minimal, we may emit insert or
+     * delete record according to wal-optimization.
+     */
+    oldbuf_needs_wal = BufferNeedsWAL(relation, buffer);
+
+    if (newbuf == buffer)
+        newbuf_needs_wal = oldbuf_needs_wal;
+    else
+        newbuf_needs_wal = BufferNeedsWAL(relation, newbuf);
+
+    if (oldbuf_needs_wal || newbuf_needs_wal)
     {
         XLogRecPtr    recptr;
 
@@ -3567,15 +3602,26 @@ l2:
          */
         if (RelationIsAccessibleInLogicalDecoding(relation))
         {
-            log_heap_new_cid(relation, &oldtup);
-            log_heap_new_cid(relation, heaptup);
+            if (oldbuf_needs_wal)
+                log_heap_new_cid(relation, &oldtup);
+            if (newbuf_needs_wal)
+                log_heap_new_cid(relation, heaptup);
         }
 
-        recptr = log_heap_update(relation, buffer,
-                                 newbuf, &oldtup, heaptup,
-                                 old_key_tuple,
-                                 all_visible_cleared,
-                                 all_visible_cleared_new);
+        if (oldbuf_needs_wal && newbuf_needs_wal)
+            recptr = log_heap_update(relation, buffer, newbuf,
+                                     &oldtup, heaptup,
+                                     old_key_tuple,
+                                     all_visible_cleared,
+                                     all_visible_cleared_new);
+        else if (oldbuf_needs_wal)
+            recptr = log_heap_delete(relation, buffer, &oldtup, old_key_tuple,
+                                     xmax_old_tuple, false,
+                                     all_visible_cleared);
+        else
+            recptr = log_heap_insert(relation, buffer, newtup,
+                                     0, all_visible_cleared_new);
+
         if (newbuf != buffer)
         {
             PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4453,7 +4499,7 @@ failed:
      * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
      * entries for everything anyway.)
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, *buffer))
     {
         xl_heap_lock xlrec;
         XLogRecPtr    recptr;
@@ -5205,7 +5251,7 @@ l4:
         MarkBufferDirty(buf);
 
         /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf))
         {
             xl_heap_lock_updated xlrec;
             XLogRecPtr    recptr;
@@ -5365,7 +5411,7 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
     htup->t_ctid = *tid;
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_confirm xlrec;
         XLogRecPtr    recptr;
@@ -5497,7 +5543,7 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
      * The WAL records generated here match heap_delete().  The same recovery
      * routines are used.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -5606,7 +5652,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_inplace xlrec;
         XLogRecPtr    recptr;
@@ -6802,8 +6848,8 @@ log_heap_clean(Relation reln, Buffer buffer,
     xl_heap_clean xlrec;
     XLogRecPtr    recptr;
 
-    /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    /* Caller should not call me on non-WAL-logged buffers */
+    Assert(BufferNeedsWAL(reln, buffer));
 
     xlrec.latestRemovedXid = latestRemovedXid;
     xlrec.nredirected = nredirected;
@@ -6850,8 +6896,8 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
     xl_heap_freeze_page xlrec;
     XLogRecPtr    recptr;
 
-    /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    /* Caller should not call me on non-WAL-logged buffers */
+    Assert(BufferNeedsWAL(reln, buffer));
     /* nor when there are no tuples to freeze */
     Assert(ntuples > 0);
 
@@ -7077,8 +7123,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
     bool        init;
     int            bufflags;
 
-    /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    /* Caller should not call me when no buffer needs WAL-logging */
+    Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
 
     XLogBeginInsert();
 
@@ -8682,9 +8728,16 @@ heap2_redo(XLogReaderState *record)
  *    heap_sync        - sync a heap, for use when no WAL has been written
  *
  * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
  *
  * Indexes are not touched.  (Currently, index operations associated with
  * the commands that use this are WAL-logged and so do not need fsync.
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index a3e51922d8..a05659b168 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
 #include "access/htup_details.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -258,7 +259,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
         /*
          * Emit a WAL HEAP_CLEAN record showing what we did
          */
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bce4274362..1ac77f7c14 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -654,9 +654,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 5c554f9465..3f5df63df8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -929,7 +929,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
                  * page has been previously WAL-logged, and if not, do that
                  * now.
                  */
-                if (RelationNeedsWAL(onerel) &&
+                if (BufferNeedsWAL(onerel, buf) &&
                     PageGetLSN(page) == InvalidXLogRecPtr)
                     log_newpage_buffer(buf, true);
 
@@ -1193,7 +1193,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
             }
 
             /* Now WAL-log freezing if necessary */
-            if (RelationNeedsWAL(onerel))
+            if (BufferNeedsWAL(onerel, buf))
             {
                 XLogRecPtr    recptr;
 
@@ -1575,7 +1575,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(onerel))
+    if (BufferNeedsWAL(onerel, buffer))
     {
         XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06b26..1f5f7d92dd 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
@@ -276,7 +277,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
         map[mapByte] |= (flags << mapOffset);
         MarkBufferDirty(vmBuf);
 
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, heapBuf))
         {
             if (XLogRecPtrIsInvalid(recptr))
             {
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 705df8900b..1074320a5a 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2391,8 +2391,7 @@ CopyFrom(CopyState cstate)
      *    - data is being written to relfilenode created in this transaction
      * then we can skip writing WAL.  It's safe because if the transaction
      * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the heap_sync at the bottom of this
-     * routine first.
+     * If it does commit, commit will do heap_sync().
      *
      * As mentioned in comments in utils/rel.h, the in-same-transaction test
      * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2438,7 +2437,7 @@ CopyFrom(CopyState cstate)
     {
         hi_options |= HEAP_INSERT_SKIP_FSM;
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(cstate->rel);
     }
 
     /*
@@ -3091,11 +3090,11 @@ CopyFrom(CopyState cstate)
     FreeExecutorState(estate);
 
     /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway)
+     * If we skipped writing WAL, then we will sync the heap at the end of
+     * the transaction. (We used to do it here, but it was later found out
+     * that to be safe, we must also avoid WAL-logging any subsequent
+     * actions on the pages we skipped WAL for). Indexes always use WAL.
      */
-    if (hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(cstate->rel);
 
     return processed;
 }
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 3bdb67c697..b4431f2af3 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->hi_options = HEAP_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+    if (!XLogIsNeeded())
+        heap_register_sync(intoRelationDesc);
+    myState->hi_options = HEAP_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,9 +605,7 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->rel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close rel, but keep lock until commit */
     table_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 5b2cbc7c89..45e693129d 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,7 +463,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
     if (!XLogIsNeeded())
-        myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+        heap_register_sync(transientrel);
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -508,9 +508,7 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->transientrel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close transientrel, but keep lock until commit */
     table_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c9a0e02168..54ce52eaae 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4664,10 +4664,10 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
     {
         mycid = GetCurrentCommandId(true);
         bistate = GetBulkInsertState();
-
         hi_options = HEAP_INSERT_SKIP_FSM;
+
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(newrel);
     }
     else
     {
@@ -4958,8 +4958,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         FreeBulkInsertState(bistate);
 
         /* If we skipped writing WAL, then we need to sync the heap. */
-        if (hi_options & HEAP_INSERT_SKIP_WAL)
-            heap_sync(newrel);
 
         table_close(newrel, NoLock);
     }
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 3d4fb7f3c3..97114aed3e 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,11 +29,10 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
-#define HEAP_INSERT_SPECULATIVE 0x0010
+#define HEAP_INSERT_SPECULATIVE 0x0008
 
 typedef struct BulkInsertStateData *BulkInsertState;
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 4699335cdf..cf7f8e7da0 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -94,10 +94,9 @@ typedef struct TM_FailureData
 } TM_FailureData;
 
 /* "options" flag bits for table_insert */
-#define TABLE_INSERT_SKIP_WAL        0x0001
-#define TABLE_INSERT_SKIP_FSM        0x0002
-#define TABLE_INSERT_FROZEN            0x0004
-#define TABLE_INSERT_NO_LOGICAL        0x0008
+#define TABLE_INSERT_SKIP_FSM        0x0001
+#define TABLE_INSERT_FROZEN            0x0002
+#define TABLE_INSERT_NO_LOGICAL        0x0004
 
 /* flag bits fortable_lock_tuple */
 /* Follow tuples whose update is in progress if lock modes don't conflict  */
@@ -702,10 +701,6 @@ table_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot, Snapshot snap
  * behaviour of the AM. Several options might be ignored by AMs not supporting
  * them.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple will not
- * necessarily logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.  It's
-- 
2.16.3

From 5b047a9514613c42c9ef1fb395ca401b55d7e2de Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 26 Mar 2019 15:32:28 +0900
Subject: [PATCH 6/8] Change cluster to use the new pending sync infrastructure

Apply the pending-sync infrastructure to CLUSTER command. It gets
benefits from moving file sync from command end to transaction end
when wal_level is minimal.
---
 src/backend/access/heap/rewriteheap.c | 25 +++++--------------------
 src/backend/commands/cluster.c        | 13 +++++--------
 src/include/access/rewriteheap.h      |  2 +-
 3 files changed, 11 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 1ac77f7c14..494f7fcd41 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
 #include "access/xloginsert.h"
 
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 
 #include "lib/ilist.h"
 
@@ -144,7 +145,6 @@ typedef struct RewriteStateData
     Page        rs_buffer;        /* page currently being built */
     BlockNumber rs_blockno;        /* block where page will go */
     bool        rs_buffer_valid;    /* T if any tuples in buffer */
-    bool        rs_use_wal;        /* must we WAL-log inserts? */
     bool        rs_logical_rewrite; /* do we need to do logical rewriting */
     TransactionId rs_oldest_xmin;    /* oldest xmin used by caller to determine
                                      * tuple visibility */
@@ -238,15 +238,13 @@ static void logical_end_heap_rewrite(RewriteState state);
  * oldest_xmin    xid used by the caller to determine which tuples are dead
  * freeze_xid    xid before which tuples will be frozen
  * min_multi    multixact before which multis will be removed
- * use_wal        should the inserts to the new heap be WAL-logged?
  *
  * Returns an opaque RewriteState, allocated in current memory context,
  * to be used in subsequent calls to the other functions.
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-                   TransactionId freeze_xid, MultiXactId cutoff_multi,
-                   bool use_wal)
+                   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
     RewriteState state;
     MemoryContext rw_cxt;
@@ -271,7 +269,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     /* new_heap needn't be empty, just locked */
     state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
     state->rs_buffer_valid = false;
-    state->rs_use_wal = use_wal;
     state->rs_oldest_xmin = oldest_xmin;
     state->rs_freeze_xid = freeze_xid;
     state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +327,7 @@ end_heap_rewrite(RewriteState state)
     /* Write the last page, if any */
     if (state->rs_buffer_valid)
     {
-        if (state->rs_use_wal)
+        if (BlockNeedsWAL(state->rs_new_rel, state->rs_blockno))
             log_newpage(&state->rs_new_rel->rd_node,
                         MAIN_FORKNUM,
                         state->rs_blockno,
@@ -344,19 +341,7 @@ end_heap_rewrite(RewriteState state)
                    (char *) state->rs_buffer, true);
     }
 
-    /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.
-     *
-     * It's obvious that we must do this when not WAL-logging. It's less
-     * obvious that we have to do it even if we did WAL-log the pages. The
-     * reason is the same as in tablecmds.c's copy_relation_data(): we're
-     * writing data that's not in shared buffers, and so a CHECKPOINT
-     * occurring during the rewriteheap operation won't have fsync'd data we
-     * wrote before the checkpoint.
-     */
-    if (RelationNeedsWAL(state->rs_new_rel))
-        heap_sync(state->rs_new_rel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     logical_end_heap_rewrite(state);
 
@@ -692,7 +677,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
             /* Doesn't fit, so write out the existing page */
 
             /* XLOG stuff */
-            if (state->rs_use_wal)
+            if (BlockNeedsWAL(state->rs_new_rel, state->rs_blockno))
                 log_newpage(&state->rs_new_rel->rd_node,
                             MAIN_FORKNUM,
                             state->rs_blockno,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 205070b83d..34c1a5e96c 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -788,7 +788,6 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
     IndexScanDesc indexScan;
     TableScanDesc tableScan;
     HeapScanDesc heapScan;
-    bool        use_wal;
     bool        is_system_catalog;
     TransactionId OldestXmin;
     TransactionId FreezeXid;
@@ -847,13 +846,11 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
         LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
 
     /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a WAL-logged rel.
+     * If wal_level is minimal, we skip WAL-logging even for WAL-logging
+     * relations. The heap will be synced at commit.
      */
-    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
-    /* use_wal off requires smgr_targblock be initially invalid */
-    Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
+    if (!XLogIsNeeded())
+        heap_register_sync(NewHeap);
 
     /*
      * If both tables have TOAST tables, perform toast swap by content.  It is
@@ -920,7 +917,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 
     /* Initialize the rewrite operation */
     rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, FreezeXid,
-                                 MultiXactCutoff, use_wal);
+                                 MultiXactCutoff);
 
     /*
      * Decide whether to use an indexscan or seqscan-and-optional-sort to scan
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 6006249d96..64efecf48b 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
                    TransactionId OldestXmin, TransactionId FreezeXid,
-                   MultiXactId MultiXactCutoff, bool use_wal);
+                   MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
                    HeapTuple newTuple);
-- 
2.16.3

From 50740e21bcb34b89334f7e5756d757b469a087c9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 25 Mar 2019 20:39:21 +0900
Subject: [PATCH 7/8] Change ALTER TABLESPACE to use the pending-sync
 infrastructure

Apply heap_register_sync() to ATLER TABLESPACE stuff.
---
 src/backend/commands/tablecmds.c | 54 +++++++++++++++++++++-------------------
 1 file changed, 28 insertions(+), 26 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 54ce52eaae..aabb3806f6 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -84,7 +84,6 @@
 #include "storage/lmgr.h"
 #include "storage/lock.h"
 #include "storage/predicate.h"
-#include "storage/smgr.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -11891,7 +11890,7 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
 {
     PGAlignedBlock buf;
     Page        page;
-    bool        use_wal;
+    bool        use_wal = false;
     bool        copying_initfork;
     BlockNumber nblocks;
     BlockNumber blkno;
@@ -11906,12 +11905,33 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
     copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
         forkNum == INIT_FORKNUM;
 
-    /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a permanent relation.
-     */
-    use_wal = XLogIsNeeded() &&
-        (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+    if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
+    {
+        /*
+         * We need to log the copied data in WAL iff WAL archiving/streaming
+         * is enabled AND it's a permanent relation.
+         */
+        if (XLogIsNeeded())
+            use_wal = true;
+
+        /*
+         * If the rel is WAL-logged, must fsync at commit.  We do the same to
+         * ensure that the toast table gets fsync'd too.  (For a temp or
+         * unlogged rel we don't care since the data will be gone after a
+         * crash anyway.)
+         *
+         * It's obvious that we must do this when not WAL-logging the
+         * copy. It's less obvious that we have to do it even if we did
+         * WAL-log the copied pages. The reason is that since we're copying
+         * outside shared buffers, a CHECKPOINT occurring during the copy has
+         * no way to flush the previously written data to disk (indeed it
+         * won't know the new rel even exists).  A crash later on would replay
+         * WAL from the checkpoint, therefore it wouldn't replay our earlier
+         * WAL entries. If we do not fsync those pages here, they might still
+         * not be on disk when the crash occurs.
+         */
+        RecordPendingSync(dst, forkNum);
+    }
 
     nblocks = smgrnblocks(src, forkNum);
 
@@ -11948,24 +11968,6 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
          */
         smgrextend(dst, forkNum, blkno, buf.data, true);
     }
-
-    /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.  (For a temp or
-     * unlogged rel we don't care since the data will be gone after a crash
-     * anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the copy. It's
-     * less obvious that we have to do it even if we did WAL-log the copied
-     * pages. The reason is that since we're copying outside shared buffers, a
-     * CHECKPOINT occurring during the copy has no way to flush the previously
-     * written data to disk (indeed it won't know the new rel even exists).  A
-     * crash later on would replay WAL from the checkpoint, therefore it
-     * wouldn't replay our earlier WAL entries. If we do not fsync those pages
-     * here, they might still not be on disk when the crash occurs.
-     */
-    if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
-        smgrimmedsync(dst, forkNum);
 }
 
 /*
-- 
2.16.3

From 2928ccd4197d237294215e4b9f0c9a6e8aa42eae Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 26 Mar 2019 14:48:26 +0900
Subject: [PATCH 8/8] Optimize WAL-logging on btree bulk insertion

Likewise the heap case, bulk insertion into btree can be optimized to
omit WAL-logging on certain conditions.
---
 src/backend/access/heap/heapam.c      | 13 +++++++++++++
 src/backend/access/nbtree/nbtinsert.c |  5 ++++-
 src/backend/access/nbtree/nbtsort.c   | 23 +++++++----------------
 3 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 00416c4a99..c28b479141 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -8870,6 +8870,8 @@ heap_mask(char *pagedata, BlockNumber blkno)
 void
 heap_register_sync(Relation rel)
 {
+    ListCell   *indlist;
+
     /* non-WAL-logged tables never need fsync */
     if (!RelationNeedsWAL(rel))
         return;
@@ -8883,4 +8885,15 @@ heap_register_sync(Relation rel)
         RecordWALSkipping(toastrel);
         heap_close(toastrel, AccessShareLock);
     }
+
+    /* Do the same to all index relations */
+    foreach(indlist, RelationGetIndexList(rel))
+    {
+        Oid            indexId = lfirst_oid(indlist);
+        Relation    indexRel;
+
+        indexRel = index_open(indexId, AccessShareLock);
+        RecordWALSkipping(indexRel);
+        index_close(indexRel, NoLock);
+    }
 }
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 96b7593fc1..fadcc09cb1 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -20,6 +20,7 @@
 #include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xloginsert.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -1096,7 +1097,9 @@ _bt_insertonpg(Relation rel,
             cachedBlock = BufferGetBlockNumber(buf);
 
         /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf) ||
+            (!P_ISLEAF(lpageop) && BufferNeedsWAL(rel, cbuf)) ||
+            (BufferIsValid(metabuf) && BufferNeedsWAL(rel, metabuf)))
         {
             xl_btree_insert xlrec;
             xl_btree_metadata xlmeta;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index e65d4aab0f..90a5d6ae13 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -66,6 +66,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/index.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/smgr.h"
@@ -264,7 +265,6 @@ typedef struct BTWriteState
     Relation    heap;
     Relation    index;
     BTScanInsert inskey;        /* generic insertion scankey */
-    bool        btws_use_wal;    /* dump pages to WAL? */
     BlockNumber btws_pages_alloced; /* # pages allocated */
     BlockNumber btws_pages_written; /* # pages written out */
     Page        btws_zeropage;    /* workspace for filling zeroes */
@@ -334,6 +334,10 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
     reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
 
+    /* Skip WAL-logging if wal_level = minimal */
+    if (!XLogIsNeeded())
+        RecordWALSkipping(index);
+
     /*
      * Finish the build by (1) completing the sort of the spool file, (2)
      * inserting the sorted tuples into btree pages and (3) building the upper
@@ -543,12 +547,6 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
     wstate.index = btspool->index;
     wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 
-    /*
-     * We need to log index creation in WAL iff WAL archiving/streaming is
-     * enabled UNLESS the index isn't WAL-logged anyway.
-     */
-    wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
-
     /* reserve the metapage */
     wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
     wstate.btws_pages_written = 0;
@@ -622,15 +620,8 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
     /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
     RelationOpenSmgr(wstate->index);
 
-    /* XLOG stuff
-     *
-     * Even if minimal mode, WAL is required here if truncation happened after
-     * being created in the same transaction. It is not needed otherwise but
-     * we don't bother identifying the case precisely.
-     */
-    if (wstate->btws_use_wal ||
-        (RelationNeedsWAL(wstate->index) &&
-         (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0)))
+    /* XLOG stuff */
+    if (BlockNeedsWAL(wstate->index, blkno))
     {
         /* We use the heap NEWPAGE record type for this */
         log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
-- 
2.16.3

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

01 апреля 2019 г., 01:31:58

On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
> I also liked the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi
> last paragraph, and I suspect it would have been no harder to back-patch.  I
> wonder if it would have been simpler and better, but I'm not asking anyone to
> investigate that.

Now I am asking for that.  Would anyone like to try implementing that other
design, to see how much simpler it would be?  I now expect the already-drafted
design to need several more iterations before it reaches a finished patch.

Separately, I reviewed v9 of the already-drafted design:

> On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> > +/*
> > + * RelationRemovePendingSync() -- remove pendingSync entry for a relation
> > + */
> > +void
> > +RelationRemovePendingSync(Relation rel)
> 
> What is the coding rule for deciding when to call this?  Currently, only
> ATExecSetTableSpace() calls this.  CLUSTER doesn't call it, despite behaving
> much like ALTER TABLE SET TABLESPACE behaves.

This question still applies.  (The function name did change from
RelationRemovePendingSync() to RelationInvalidateWALRequirements().)

On Mon, Mar 25, 2019 at 09:32:04PM +0900, Kyotaro HORIGUCHI wrote:
> At Wed, 20 Mar 2019 22:48:35 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190321054835.GB3842129@rfd.leadboat.com>
> > On Wed, Mar 20, 2019 at 05:17:54PM +0900, Kyotaro HORIGUCHI wrote:
> > > At Sun, 10 Mar 2019 19:27:08 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190311022708.GA2189728@rfd.leadboat.com>
> > > > On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> > > > +        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
> > > 
> > > As you mention upthread, you have many debugging elog()s.  These are too
> > > detailed to include in every binary, but I do want them in the code.  See
> > > CACHE_elog() for a good example of achieving that.
> > 
> > Agreed will do. They were need to check the behavior precisely
> > but usually not needed.
> 
> I removed all such elog()s.

Again, I do want them in the code.  Please restore them, but use a mechanism
like CACHE_elog() so they're built only if one defines a preprocessor symbol.

On Tue, Mar 26, 2019 at 04:35:07PM +0900, Kyotaro HORIGUCHI wrote:
> @@ -4097,6 +4104,8 @@ ReleaseSavepoint(const char *name)
>                  (errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
>                   errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));
>  
> +    smgrProcessWALRequirementInval(s->subTransactionId, true);
> +
>      /*
>       * Mark "commit pending" all subtransactions up to the target
>       * subtransaction.  The actual commits will happen when control gets to
> @@ -4206,6 +4215,8 @@ RollbackToSavepoint(const char *name)
>                  (errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
>                   errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));
>  
> +    smgrProcessWALRequirementInval(s->subTransactionId, false);

The smgrProcessWALRequirementInval() calls almost certainly belong in
CommitSubTransaction() and AbortSubTransaction(), not in these functions.  By
doing it here, you'd get the wrong behavior in a subtransaction created via a
plpgsql "BEGIN ... EXCEPTION WHEN OTHERS THEN" block.

> +/*
> + * Process pending invalidation of WAL requirements happened in the
> + * subtransaction
> + */
> +void
> +smgrProcessWALRequirementInval(SubTransactionId sxid, bool isCommit)
> +{
> +    HASH_SEQ_STATUS status;
> +    RelWalRequirement *walreq;
> +
> +    if (!walRequirements)
> +        return;
> +
> +    /* We expect that we don't have walRequirements in almost all cases */
> +    hash_seq_init(&status, walRequirements);
> +
> +    while ((walreq = hash_seq_search(&status)) != NULL)
> +    {
> +        /* remove useless entry */
> +        if (isCommit ?
> +            walreq->invalidate_sxid == sxid :
> +            walreq->create_sxid == sxid)
> +            hash_search(walRequirements, &walreq->relnode, HASH_REMOVE, NULL);

Do not remove entries during subtransaction commit, because a parent
subtransaction might still abort.  See other CommitSubTransaction() callees
for examples of correct subtransaction handling.  AtEOSubXact_Files() is one
simple example.

> @@ -3567,15 +3602,26 @@ heap_update
>           */
>          if (RelationIsAccessibleInLogicalDecoding(relation))
>          {
> -            log_heap_new_cid(relation, &oldtup);
> -            log_heap_new_cid(relation, heaptup);
> +            if (oldbuf_needs_wal)
> +                log_heap_new_cid(relation, &oldtup);
> +            if (newbuf_needs_wal)
> +                log_heap_new_cid(relation, heaptup);

These if(...) conditions are always true, since they're redundant with
RelationIsAccessibleInLogicalDecoding(relation).  Remove the conditions or
replace them with asserts.

>          }
>  
> -        recptr = log_heap_update(relation, buffer,
> -                                 newbuf, &oldtup, heaptup,
> -                                 old_key_tuple,
> -                                 all_visible_cleared,
> -                                 all_visible_cleared_new);
> +        if (oldbuf_needs_wal && newbuf_needs_wal)
> +            recptr = log_heap_update(relation, buffer, newbuf,
> +                                     &oldtup, heaptup,
> +                                     old_key_tuple,
> +                                     all_visible_cleared,
> +                                     all_visible_cleared_new);
> +        else if (oldbuf_needs_wal)
> +            recptr = log_heap_delete(relation, buffer, &oldtup, old_key_tuple,
> +                                     xmax_old_tuple, false,
> +                                     all_visible_cleared);
> +        else
> +            recptr = log_heap_insert(relation, buffer, newtup,
> +                                     0, all_visible_cleared_new);

By using DELETE and INSERT records to implement an UPDATE, you lose the ctid
chain and infomask bits that were present before crash recovery.  If that's
okay in these circumstances, please write a comment explaining why.

> @@ -1096,7 +1097,9 @@ _bt_insertonpg(Relation rel,
>              cachedBlock = BufferGetBlockNumber(buf);
>  
>          /* XLOG stuff */
> -        if (RelationNeedsWAL(rel))
> +        if (BufferNeedsWAL(rel, buf) ||
> +            (!P_ISLEAF(lpageop) && BufferNeedsWAL(rel, cbuf)) ||
> +            (BufferIsValid(metabuf) && BufferNeedsWAL(rel, metabuf)))

This appears to have the same problem that heap_update() had in v7; if
BufferNeedsWAL(rel, buf) is false and BufferNeedsWAL(rel, metabuf) is true, we
emit WAL for both buffers.  If that can't actually happen today, use asserts.

I don't want the btree code to get significantly more complicated in order to
participate in the RelWalRequirement system.  If btree code would get more
complicated, it's better to have btree continue using the old system.  If
btree's complexity would be essentially unchanged, it's still good to use the
new system.

> @@ -334,6 +334,10 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
>  
>      reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
>  
> +    /* Skip WAL-logging if wal_level = minimal */
> +    if (!XLogIsNeeded())
> +        RecordWALSkipping(index);

_bt_load() still has an smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM),
which should be unnecessary after you add this end-of-transaction sync.  Also,
this code can reach an assertion failure at wal_level=minimal:

910024 2019-03-31 19:12:13.728 GMT LOG:  statement: create temp table x (c int primary key)
910024 2019-03-31 19:12:13.729 GMT DEBUG:  CREATE TABLE / PRIMARY KEY will create implicit index "x_pkey" for table
"x"
910024 2019-03-31 19:12:13.730 GMT DEBUG:  building index "x_pkey" on table "x" serially
TRAP: FailedAssertion("!(((rel)->rd_rel->relpersistence == 'p'))", File: "storage.c", Line: 460)

Also, please fix whitespace problems that "git diff --check master" reports.

nm

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

02 апреля 2019 г., 13:54:06

Thank you for reviewing.

At Sun, 31 Mar 2019 15:31:58 -0700, Noah Misch <noah@leadboat.com> wrote in <20190331223158.GB891537@rfd.leadboat.com>
> On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
> > On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> > > +/*
> > > + * RelationRemovePendingSync() -- remove pendingSync entry for a relation
> > > + */
> > > +void
> > > +RelationRemovePendingSync(Relation rel)
> > 
> > What is the coding rule for deciding when to call this?  Currently, only
> > ATExecSetTableSpace() calls this.  CLUSTER doesn't call it, despite behaving
> > much like ALTER TABLE SET TABLESPACE behaves.
> 
> This question still applies.  (The function name did change from
> RelationRemovePendingSync() to RelationInvalidateWALRequirements().)

It is called for heap_register_sync()'ed relations to avoid
syncing useless or trying to sync nonexistent files. I modifed
all CLUSTER, COPY FROM, CREATE AS, REFRESH MATVIEW and SET
TABLESPACE uses the function. (The function is renamed to
table_relation_invalidate_walskip()).

I noticed that heap_register_sync and friends are now a kind of
Table-AM function. So I added .relation_register_walskip and
.relation_invalidate_walskip in TableAMRoutine and moved the
heap_register_sync stuff as heapam_relation_register_walskip and
friends. .finish_bulk_insert() is modified to be used only
WAL-skip is active on the relation. (0004, 0005) But I'm not sure
that is the right direction.

(RelWALRequirements is renamed to RelWALSkip)

The change made smgrFinishBulkInsert (known as smgrDoPendingSync)
need to call a tableam interface. Relation is required to call it
in the designed way but relcache cannot live until there. In the
attached patch 0005, a new member TableAmRoutine *tableam is
added to RelWalSkip and calls finish_bulk_insert() via the
tableAm. But I'm quite uneasy with that...

> On Mon, Mar 25, 2019 at 09:32:04PM +0900, Kyotaro HORIGUCHI wrote:
> > At Wed, 20 Mar 2019 22:48:35 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190321054835.GB3842129@rfd.leadboat.com>
> Again, I do want them in the code.  Please restore them, but use a mechanism
> like CACHE_elog() so they're built only if one defines a preprocessor symbol.

Ah, sorry. I restored the messages using STORAGE_elog(). I also
needed this. (SMGR_ might be better but I'm not sure.)

> On Tue, Mar 26, 2019 at 04:35:07PM +0900, Kyotaro HORIGUCHI wrote:
> > +    smgrProcessWALRequirementInval(s->subTransactionId, false);
> 
> The smgrProcessWALRequirementInval() calls almost certainly belong in
> CommitSubTransaction() and AbortSubTransaction(), not in these functions.  By
> doing it here, you'd get the wrong behavior in a subtransaction created via a
> plpgsql "BEGIN ... EXCEPTION WHEN OTHERS THEN" block.

Thanks. Moved it to AtSubAbort_smgr() and AtSubCommit_smgr(). (0005)

> > +/*
> > + * Process pending invalidation of WAL requirements happened in the
> > + * subtransaction
> > + */
> > +void
> > +smgrProcessWALRequirementInval(SubTransactionId sxid, bool isCommit)
> > +{
> > +    HASH_SEQ_STATUS status;
> > +    RelWalRequirement *walreq;
> > +
> > +    if (!walRequirements)
> > +        return;
> > +
> > +    /* We expect that we don't have walRequirements in almost all cases */
> > +    hash_seq_init(&status, walRequirements);
> > +
> > +    while ((walreq = hash_seq_search(&status)) != NULL)
> > +    {
> > +        /* remove useless entry */
> > +        if (isCommit ?
> > +            walreq->invalidate_sxid == sxid :
> > +            walreq->create_sxid == sxid)
> > +            hash_search(walRequirements, &walreq->relnode, HASH_REMOVE, NULL);
> 
> Do not remove entries during subtransaction commit, because a parent
> subtransaction might still abort.  See other CommitSubTransaction() callees
> for examples of correct subtransaction handling.  AtEOSubXact_Files() is one
> simple example.

Thanks. smgrProcessWALSkipInval() (0005) is changed so that:

 - If a RelWalSkip entry is created in aborted subtransaction,
   remove it.

 - If a RelWalSkip entry is created then invalidated in committed
   subtransaction, remove it.

 - If a RelWalSkip entry is created and committed, change the
   creator subtransaction to the parent subtransaction.

 - If a RelWalSkip entry is create elsewhere and invalidated in
   committed subtransaction, move the invalidation to the parent
   subtransaction.

 - If a RelWalSkip entry is created elsewhere and invalidated in
   aborted subtransaction, cancel the invalidation.

Test is added as test3a2 and test3a3. (0001)

> > @@ -3567,15 +3602,26 @@ heap_update
> >           */
> >          if (RelationIsAccessibleInLogicalDecoding(relation))
> >          {
> > -            log_heap_new_cid(relation, &oldtup);
> > -            log_heap_new_cid(relation, heaptup);
> > +            if (oldbuf_needs_wal)
> > +                log_heap_new_cid(relation, &oldtup);
> > +            if (newbuf_needs_wal)
> > +                log_heap_new_cid(relation, heaptup);
> 
> These if(...) conditions are always true, since they're redundant with
> RelationIsAccessibleInLogicalDecoding(relation).  Remove the conditions or
> replace them with asserts.

Ah.. I see. It is not the minimal case. Added a comment and an
assertion. (0006)

+  * catalog. Both oldbuf_needs_wal and newbuf_needs_wal must be true
+  * when logical decoding is active.

> By using DELETE and INSERT records to implement an UPDATE, you lose the ctid
> chain and infomask bits that were present before crash recovery.  If that's
> okay in these circumstances, please write a comment explaining why.

Sounds reasonable. Added a comment. (Honestly I completely forgot
about that.. Thanks!) (0006)

+  * Insert log record. Using delete or insert log loses HOT chain
+  * information but that happens only when newbuf is different from
+  * buffer, where HOT cannot happen.


> > @@ -1096,7 +1097,9 @@ _bt_insertonpg(Relation rel,
> >   |  |  | cachedBlock = BufferGetBlockNumber(buf);
> >  
> >   |  | /* XLOG stuff */
> > - |  | if (RelationNeedsWAL(rel))
> > + |  | if (BufferNeedsWAL(rel, buf) ||
> > + |  |  | (!P_ISLEAF(lpageop) && BufferNeedsWAL(rel, cbuf)) ||
> > + |  |  | (BufferIsValid(metabuf) && BufferNeedsWAL(rel, metabuf)))
> 
> This appears to have the same problem that heap_update() had in v7; if
> BufferNeedsWAL(rel, buf) is false and BufferNeedsWAL(rel, metabuf) is true, we
> emit WAL for both buffers.  If that can't actually happen today, use asserts.
> 
> I don't want the btree code to get significantly more complicated in order to
> participate in the RelWalRequirement system.  If btree code would get more
> complicated, it's better to have btree continue using the old system.  If
> btree's complexity would be essentially unchanged, it's still good to use the
> new system.

It was broken. I tried to fix it but page split baffled me. I
reverted it and added a comment there explaining the reason for
not applying BufferNeedsWAL stuff to nbtree. WAL-logging skip
feature is now restricted to work only on non-index
heaps. (getWalSkipEntry and RecordPendingSync in 0005)

> > @@ -334,6 +334,10 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
> >  
> >   | reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
> >  
> > + | /* Skip WAL-logging if wal_level = minimal */
> > + | if (!XLogIsNeeded())
> > + |  | RecordWALSkipping(index);
> 
> _bt_load() still has an smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM),
> which should be unnecessary after you add this end-of-transaction sync.  Also,
> this code can reach an assertion failure at wal_level=minimal:
> 
> 910024 2019-03-31 19:12:13.728 GMT LOG:  statement: create temp table x (c int primary key)
> 910024 2019-03-31 19:12:13.729 GMT DEBUG:  CREATE TABLE / PRIMARY KEY will create implicit index "x_pkey" for table
"x"
> 910024 2019-03-31 19:12:13.730 GMT DEBUG:  building index "x_pkey" on table "x" serially
> TRAP: FailedAssertion("!(((rel)->rd_rel->relpersistence == 'p'))", File: "storage.c", Line: 460)

This is what I mentioned as "broken" above. Sorry for the
silly mistake.

> Also, please fix whitespace problems that "git diff --check master" reports.

Thanks. Good to know the command.


After all, this patch set contains the following files.

v10-0001-TAP-test-for-copy-truncation-optimization.patch

 Tap test script. Multi-level subtransaction case is added.

v10-0002-Write-WAL-for-empty-nbtree-index-build.patch

 As mentioned above, nbtree patch has been shrinked to the
 initial state of a workaround. Comment is rewrited. (v9-0002 +
 v9-0008)

v10-0003-Move-XLOG-stuff-from-heap_insert-and-heap_delete.patch

 Not substantially changed.

v10-0004-Add-new-interface-to-TableAmRoutine.patch

 New file. Adds two new interfaces to TableAmRoutine and modified
 one interface.

v10-0005-Add-infrastructure-to-WAL-logging-skip-feature.patch

 Heavily revised version of v9-0004.
   Some functions are renamed.
   Fixed subtransaction handling.
   Added STORAGE_elog() stuff.
   Uses table-am functions.
   Changes heapam stuff.

v10-0006-Fix-WAL-skipping-feature.patch

  Revised version of v9-0005 + v9-0006 + v9-0007.

    Added comment and assertion in heap_insert().

v10-0007-Remove-TABLE-HEAP_INSERT_SKIP_WAL.patch

 Separated from v9-0005 so that subsequent patches are sane.

 Removes TABLE/HEAP_ISNERT_SKIP_WAL.
 
regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 55c85f06a9dc0a77f4cc6b02d4538b2e7169b3dc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/7] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/017_wal_optimize.pl | 291 ++++++++++++++++++++++++++++++++
 1 file changed, 291 insertions(+)
 create mode 100644 src/test/recovery/t/017_wal_optimize.pl

diff --git a/src/test/recovery/t/017_wal_optimize.pl b/src/test/recovery/t/017_wal_optimize.pl
new file mode 100644
index 0000000000..4fa8be728e
--- /dev/null
+++ b/src/test/recovery/t/017_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+    my($node, $test_name) = @_;
+
+    my $db_oid = $node->safe_psql('postgres',
+       "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+    my $prefix = "base/$db_oid/";
+    my $filepaths_referenced = $node->safe_psql('postgres', "
+       SELECT pg_relation_filepath(oid) FROM pg_class
+       WHERE reltablespace = 0 and relpersistence <> 't' and
+       pg_relation_filepath(oid) IS NOT NULL;");
+    is_deeply([sort(map { "$prefix$_" }
+                    grep(/^[0-9]+$/,
+                         slurp_dir($node->data_dir . "/$prefix")))],
+              [sort split /\n/, $filepaths_referenced],
+              $test_name);
+    return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    # Primary needs to have wal_level = minimal here
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+    $node->start;
+
+    # Setup
+    my $tablespace_dir = $node->basedir . '/tablespace_other';
+    mkdir ($tablespace_dir);
+    $tablespace_dir = TestLib::real_dir($tablespace_dir);
+    $node->safe_psql('postgres',
+       "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+    # Test direct truncation optimization.  No tuples
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test1 (id serial PRIMARY KEY);
+        TRUNCATE test1;
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+    is($result, qq(0),
+       "wal_level = $wal_level, optimized truncation with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2 (id serial PRIMARY KEY);
+        INSERT INTO test2 VALUES (DEFAULT);
+        TRUNCATE test2;
+        INSERT INTO test2 VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with inserted table");
+
+    # Data file for COPY query in follow-up tests.
+    my $basedir = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after the
+    # truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3;
+        COPY test3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+    is($result, qq(3),
+       "wal_level = $wal_level, optimized truncation with copied table");
+
+    # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+        COPY test3a FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # in different subtransaction patterns
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a2;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; RELEASE s;
+        COPY test3a2 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a3;
+        SAVEPOINT s;
+            ALTER TABLE test3a3 SET TABLESPACE other;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            ROLLBACK TO s2;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            RELEASE s2;
+        ROLLBACK TO s;
+        COPY test3a3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        COPY test3b FROM '$copy_file' DELIMITER ',';  -- set sync_above
+        UPDATE test3b SET id2 = id2 + 1;
+        DELETE FROM test3b;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+    is($result, qq(0),
+       "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+    # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test4;
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COPY test4 FROM '$copy_file' DELIMITER ',';
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+    is($result, qq(5),
+       "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test5 VALUES (DEFAULT, 1);
+        COPY test5 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER test6_before_row_insert
+          BEFORE INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+        CREATE TRIGGER test6_after_row_insert
+          AFTER INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+        COPY test6 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+    is($result, qq(9),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER test7_before_stat_truncate
+          BEFORE TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+        CREATE TRIGGER test7_after_stat_truncate
+          AFTER TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+        INSERT INTO test7 VALUES (DEFAULT, 1);
+        TRUNCATE test7;
+        COPY test7 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test redo of temp table creation.
+    $node->safe_psql('postgres', "
+        CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+    $node->stop('immediate');
+    $node->start;
+
+    check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.16.3

From fda405f0f0f9a5fa816c426adc5eb8850f20f6eb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 2/7] Write WAL for empty nbtree index build

After relation truncation indexes are also rebuild. It doesn't emit
WAL in minimal mode and if truncation happened within its creation
transaction, crash recovery leaves an empty index heap, which is
considered broken. This patch forces to emit WAL when an index_build
turns into empty nbtree index.
---
 src/backend/access/nbtree/nbtsort.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 14d9545768..5551a9c227 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -622,8 +622,16 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
     /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
     RelationOpenSmgr(wstate->index);
 
-    /* XLOG stuff */
-    if (wstate->btws_use_wal)
+    /* XLOG stuff
+     *
+     * Even when wal_level is minimal, WAL is required here if truncation
+     * happened after being created in the same transaction. This is hacky but
+     * we cannot use BufferNeedsWAL() stuff for nbtree since it can emit
+     * atomic WAL records on multiple buffers.
+     */
+    if (wstate->btws_use_wal ||
+        (RelationNeedsWAL(wstate->index) &&
+         (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0)))
     {
         /* We use the heap NEWPAGE record type for this */
         log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
-- 
2.16.3

From d15655d7bfe0b44c3b027ccdcc36fe0087f823c1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 25 Mar 2019 13:29:50 +0900
Subject: [PATCH 3/7] Move XLOG stuff from heap_insert and heap_delete

Succeeding commit makes heap_update emit insert and delete WAL
records. Move out XLOG stuff for insert and delete so that heap_update
can use the stuff.
---
 src/backend/access/heap/heapam.c | 275 ++++++++++++++++++++++-----------------
 1 file changed, 156 insertions(+), 119 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 05ceb6550d..267570b461 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -72,6 +72,11 @@
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
                     TransactionId xid, CommandId cid, int options);
+static XLogRecPtr log_heap_insert(Relation relation, Buffer buffer,
+                HeapTuple heaptup, int options, bool all_visible_cleared);
+static XLogRecPtr log_heap_delete(Relation relation, Buffer buffer,
+                HeapTuple tp, HeapTuple old_key_tuple, TransactionId new_xmax,
+                bool changingPart, bool all_visible_cleared);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
                 Buffer newbuf, HeapTuple oldtup,
                 HeapTuple newtup, HeapTuple old_key_tup,
@@ -1875,6 +1880,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     TransactionId xid = GetCurrentTransactionId();
     HeapTuple    heaptup;
     Buffer        buffer;
+    Page        page;
     Buffer        vmbuffer = InvalidBuffer;
     bool        all_visible_cleared = false;
 
@@ -1911,16 +1917,18 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
      */
     CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
 
+    page = BufferGetPage(buffer);
+
     /* NO EREPORT(ERROR) from here till changes are logged */
     START_CRIT_SECTION();
 
     RelationPutHeapTuple(relation, buffer, heaptup,
                          (options & HEAP_INSERT_SPECULATIVE) != 0);
 
-    if (PageIsAllVisible(BufferGetPage(buffer)))
+    if (PageIsAllVisible(page))
     {
         all_visible_cleared = true;
-        PageClearAllVisible(BufferGetPage(buffer));
+        PageClearAllVisible(page);
         visibilitymap_clear(relation,
                             ItemPointerGetBlockNumber(&(heaptup->t_self)),
                             vmbuffer, VISIBILITYMAP_VALID_BITS);
@@ -1942,75 +1950,10 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     /* XLOG stuff */
     if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
     {
-        xl_heap_insert xlrec;
-        xl_heap_header xlhdr;
         XLogRecPtr    recptr;
-        Page        page = BufferGetPage(buffer);
-        uint8        info = XLOG_HEAP_INSERT;
-        int            bufflags = 0;
 
-        /*
-         * If this is a catalog, we need to transmit combocids to properly
-         * decode, so log that as well.
-         */
-        if (RelationIsAccessibleInLogicalDecoding(relation))
-            log_heap_new_cid(relation, heaptup);
-
-        /*
-         * If this is the single and first tuple on page, we can reinit the
-         * page instead of restoring the whole thing.  Set flag, and hide
-         * buffer references from XLogInsert.
-         */
-        if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
-            PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
-        {
-            info |= XLOG_HEAP_INIT_PAGE;
-            bufflags |= REGBUF_WILL_INIT;
-        }
-
-        xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
-        xlrec.flags = 0;
-        if (all_visible_cleared)
-            xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
-        if (options & HEAP_INSERT_SPECULATIVE)
-            xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
-        Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
-
-        /*
-         * For logical decoding, we need the tuple even if we're doing a full
-         * page write, so make sure it's included even if we take a full-page
-         * image. (XXX We could alternatively store a pointer into the FPW).
-         */
-        if (RelationIsLogicallyLogged(relation) &&
-            !(options & HEAP_INSERT_NO_LOGICAL))
-        {
-            xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
-            bufflags |= REGBUF_KEEP_DATA;
-        }
-
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
-
-        xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
-        xlhdr.t_infomask = heaptup->t_data->t_infomask;
-        xlhdr.t_hoff = heaptup->t_data->t_hoff;
-
-        /*
-         * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
-         * write the whole page to the xlog, we don't need to store
-         * xl_heap_header in the xlog.
-         */
-        XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
-        XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
-        /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
-        XLogRegisterBufData(0,
-                            (char *) heaptup->t_data + SizeofHeapTupleHeader,
-                            heaptup->t_len - SizeofHeapTupleHeader);
-
-        /* filtering by origin on a row level is much more efficient */
-        XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
-
-        recptr = XLogInsert(RM_HEAP_ID, info);
+        recptr = log_heap_insert(relation, buffer, heaptup,
+                                 options, all_visible_cleared);
 
         PageSetLSN(page, recptr);
     }
@@ -2730,58 +2673,10 @@ l1:
      */
     if (RelationNeedsWAL(relation))
     {
-        xl_heap_delete xlrec;
-        xl_heap_header xlhdr;
         XLogRecPtr    recptr;
 
-        /* For logical decode we need combocids to properly decode the catalog */
-        if (RelationIsAccessibleInLogicalDecoding(relation))
-            log_heap_new_cid(relation, &tp);
-
-        xlrec.flags = 0;
-        if (all_visible_cleared)
-            xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
-        if (changingPart)
-            xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
-        xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
-                                              tp.t_data->t_infomask2);
-        xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
-        xlrec.xmax = new_xmax;
-
-        if (old_key_tuple != NULL)
-        {
-            if (relation->rd_rel->relreplident == REPLICA_IDENTITY_FULL)
-                xlrec.flags |= XLH_DELETE_CONTAINS_OLD_TUPLE;
-            else
-                xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
-        }
-
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
-
-        XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-
-        /*
-         * Log replica identity of the deleted tuple if there is one
-         */
-        if (old_key_tuple != NULL)
-        {
-            xlhdr.t_infomask2 = old_key_tuple->t_data->t_infomask2;
-            xlhdr.t_infomask = old_key_tuple->t_data->t_infomask;
-            xlhdr.t_hoff = old_key_tuple->t_data->t_hoff;
-
-            XLogRegisterData((char *) &xlhdr, SizeOfHeapHeader);
-            XLogRegisterData((char *) old_key_tuple->t_data
-                             + SizeofHeapTupleHeader,
-                             old_key_tuple->t_len
-                             - SizeofHeapTupleHeader);
-        }
-
-        /* filtering by origin on a row level is much more efficient */
-        XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
-
-        recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
-
+        recptr = log_heap_delete(relation, buffer, &tp, old_key_tuple, new_xmax,
+                                 changingPart, all_visible_cleared);
         PageSetLSN(page, recptr);
     }
 
@@ -7245,6 +7140,148 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
     return recptr;
 }
 
+/*
+ * Perform XLogInsert for a heap-insert operation.  Caller must already
+ * have modified the buffer and marked it dirty.
+ */
+XLogRecPtr
+log_heap_insert(Relation relation, Buffer buffer,
+                HeapTuple heaptup, int options, bool all_visible_cleared)
+{
+    xl_heap_insert xlrec;
+    xl_heap_header xlhdr;
+    uint8        info = XLOG_HEAP_INSERT;
+    int            bufflags = 0;
+    Page        page = BufferGetPage(buffer);
+
+    /*
+     * If this is a catalog, we need to transmit combocids to properly
+     * decode, so log that as well.
+     */
+    if (RelationIsAccessibleInLogicalDecoding(relation))
+        log_heap_new_cid(relation, heaptup);
+
+    /*
+     * If this is the single and first tuple on page, we can reinit the
+     * page instead of restoring the whole thing.  Set flag, and hide
+     * buffer references from XLogInsert.
+     */
+    if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
+        PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
+    {
+        info |= XLOG_HEAP_INIT_PAGE;
+        bufflags |= REGBUF_WILL_INIT;
+    }
+
+    xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
+    xlrec.flags = 0;
+    if (all_visible_cleared)
+        xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
+    if (options & HEAP_INSERT_SPECULATIVE)
+        xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
+    Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
+
+    /*
+     * For logical decoding, we need the tuple even if we're doing a full
+     * page write, so make sure it's included even if we take a full-page
+     * image. (XXX We could alternatively store a pointer into the FPW).
+     */
+    if (RelationIsLogicallyLogged(relation) &&
+        !(options & HEAP_INSERT_NO_LOGICAL))
+    {
+        xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
+        bufflags |= REGBUF_KEEP_DATA;
+    }
+
+    XLogBeginInsert();
+    XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
+
+    xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
+    xlhdr.t_infomask = heaptup->t_data->t_infomask;
+    xlhdr.t_hoff = heaptup->t_data->t_hoff;
+
+    /*
+     * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
+     * write the whole page to the xlog, we don't need to store
+     * xl_heap_header in the xlog.
+     */
+    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
+    XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
+    /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
+    XLogRegisterBufData(0,
+                        (char *) heaptup->t_data + SizeofHeapTupleHeader,
+                        heaptup->t_len - SizeofHeapTupleHeader);
+
+    /* filtering by origin on a row level is much more efficient */
+    XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
+    return XLogInsert(RM_HEAP_ID, info);
+}
+
+/*
+ * Perform XLogInsert for a heap-insert operation.  Caller must already
+ * have modified the buffer and marked it dirty.
+ *
+ * NB: heap_abort_speculative() uses the same xlog record and replay
+ * routines.
+ */
+static XLogRecPtr
+log_heap_delete(Relation relation, Buffer buffer,
+                HeapTuple tp, HeapTuple old_key_tuple, TransactionId new_xmax,
+                bool changingPart, bool all_visible_cleared)
+{
+    xl_heap_delete xlrec;
+    xl_heap_header xlhdr;
+
+    /* For logical decode we need combocids to properly decode the catalog */
+    if (RelationIsAccessibleInLogicalDecoding(relation))
+        log_heap_new_cid(relation, tp);
+
+    xlrec.flags = 0;
+    if (all_visible_cleared)
+        xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
+    if (changingPart)
+        xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
+    xlrec.infobits_set = compute_infobits(tp->t_data->t_infomask,
+                                          tp->t_data->t_infomask2);
+    xlrec.offnum = ItemPointerGetOffsetNumber(&tp->t_self);
+    xlrec.xmax = new_xmax;
+
+    if (old_key_tuple != NULL)
+    {
+        if (relation->rd_rel->relreplident == REPLICA_IDENTITY_FULL)
+            xlrec.flags |= XLH_DELETE_CONTAINS_OLD_TUPLE;
+        else
+            xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
+    }
+
+    XLogBeginInsert();
+    XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
+
+    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+    /*
+     * Log replica identity of the deleted tuple if there is one
+     */
+    if (old_key_tuple != NULL)
+    {
+        xlhdr.t_infomask2 = old_key_tuple->t_data->t_infomask2;
+        xlhdr.t_infomask = old_key_tuple->t_data->t_infomask;
+        xlhdr.t_hoff = old_key_tuple->t_data->t_hoff;
+
+        XLogRegisterData((char *) &xlhdr, SizeOfHeapHeader);
+        XLogRegisterData((char *) old_key_tuple->t_data
+                         + SizeofHeapTupleHeader,
+                         old_key_tuple->t_len
+                         - SizeofHeapTupleHeader);
+    }
+
+    /* filtering by origin on a row level is much more efficient */
+    XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
+    return XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
+}
+
 /*
  * Perform XLogInsert for a heap-update operation.  Caller must already
  * have modified the buffer(s) and marked them dirty.
-- 
2.16.3

From 255e3b3d5998318a9aa7abd0d3f9dab67dd0053a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 2 Apr 2019 11:53:36 +0900
Subject: [PATCH 4/7] Add new interface to TableAmRoutine

Add two interface functions to TableAmRoutine, which are related to
WAL-skipping feature.
---
 src/backend/access/table/tableamapi.c |  4 ++
 src/include/access/tableam.h          | 79 +++++++++++++++++++++++------------
 2 files changed, 56 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 51c0deaaf2..fef4e523e8 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -94,6 +94,10 @@ GetTableAmRoutine(Oid amhandler)
            (routine->scan_bitmap_next_tuple == NULL));
     Assert(routine->scan_sample_next_block != NULL);
     Assert(routine->scan_sample_next_tuple != NULL);
+    Assert((routine->relation_register_walskip == NULL) ==
+           (routine->relation_invalidate_walskip == NULL) &&
+           (routine->relation_register_walskip == NULL) ==
+           (routine->finish_bulk_insert == NULL));
 
     return routine;
 }
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 4efe178ed1..1a3a3c6711 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -382,19 +382,15 @@ typedef struct TableAmRoutine
 
     /*
      * Perform operations necessary to complete insertions made via
-     * tuple_insert and multi_insert with a BulkInsertState specified. This
-     * e.g. may e.g. used to flush the relation when inserting with
-     * TABLE_INSERT_SKIP_WAL specified.
+     * tuple_insert and multi_insert or page-level copying performed by ALTER
+     * TABLE rewrite. This is called at commit time if WAL-skipping is
+     * activated and the caller decided that any finish work is required to
+     * the file.
      *
-     * Typically callers of tuple_insert and multi_insert will just pass all
-     * the flags the apply to them, and each AM has to decide which of them
-     * make sense for it, and then only take actions in finish_bulk_insert
-     * that make sense for a specific AM.
-     *
-     * Optional callback.
+     * Optional callback. Must be provided when relation_register_walskip is
+     * provided.
      */
-    void        (*finish_bulk_insert) (Relation rel, int options);
-
+    void        (*finish_bulk_insert) (RelFileNode rnode, ForkNumber forkNum);
 
     /* ------------------------------------------------------------------------
      * DDL related functionality.
@@ -447,6 +443,26 @@ typedef struct TableAmRoutine
                                               double *tups_vacuumed,
                                               double *tups_recently_dead);
 
+    /*
+     * Register WAL-skipping on the current storage of rel. WAL-logging on the
+     * relation is skipped and the storage will be synced at commit. Returns
+     * true if successfully registered, and finish_bulk_insert() is called at
+     * commit.
+     *
+     * Optional callback.
+     */
+    void        (*relation_register_walskip) (Relation rel);
+
+    /*
+     * Invalidate registered WAL skipping on the current storage of rel. The
+     * function is called when the storage of the relation is going to be
+     * out-of-use after commit.
+     *
+     * Optional callback. Must be provided when relation_register_walskip is
+     * provided.
+     */
+    void        (*relation_invalidate_walskip) (Relation rel);
+
     /*
      * React to VACUUM command on the relation. The VACUUM might be user
      * triggered or by autovacuum. The specific actions performed by the AM
@@ -1026,8 +1042,7 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  *
  *
  * The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
  *
  * On return the slot's tts_tid and tts_tableOid are updated to reflect the
  * insertion. But note that any toasting of fields within the slot is NOT
@@ -1201,20 +1216,6 @@ table_lock_tuple(Relation rel, ItemPointer tid, Snapshot snapshot,
                                        flags, tmfd);
 }
 
-/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
- */
-static inline void
-table_finish_bulk_insert(Relation rel, int options)
-{
-    /* optional callback */
-    if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
-        rel->rd_tableam->finish_bulk_insert(rel, options);
-}
-
 
 /* ------------------------------------------------------------------------
  * DDL related functionality.
@@ -1298,6 +1299,30 @@ table_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
                                                    tups_recently_dead);
 }
 
+/*
+ * Register WAL-skipping to the relation. WAL-logging is skipped for the new
+ * pages after this call and the relation file is going to be synced at
+ * commit.
+ */
+static inline void
+table_relation_register_walskip(Relation rel)
+{
+    if (rel->rd_tableam && rel->rd_tableam->relation_register_walskip)
+        rel->rd_tableam->relation_register_walskip(rel);
+}
+
+/*
+ * Unregister WAL-skipping to the relation. Call this when the relation is
+ * going to be out-of-use after commit. WAL-skipping continues but the
+ * relation won't be synced at commit.
+ */
+static inline void
+table_relation_invalidate_walskip(Relation rel)
+{
+    if (rel->rd_tableam && rel->rd_tableam->relation_invalidate_walskip)
+        rel->rd_tableam->relation_invalidate_walskip(rel);
+}
+
 /*
  * Perform VACUUM on the relation. The VACUUM can be user triggered or by
  * autovacuum. The specific actions performed by the AM will depend heavily on
-- 
2.16.3

From 24c9b0b9b9698d86fce3ad129400e3042a2e0afd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 2 Apr 2019 18:05:10 +0900
Subject: [PATCH 5/7] Add infrastructure to WAL-logging skip feature

We used to optimize WAL-logging for truncation of in-transaction
crated tables in minimal mode by just signaling by
HEAP_INSERT_SKIP_WAL option on heap operations. This mechanism can
emit WAL records that results in corrupt state for certain series of
in-transaction operations. This patch provides infrastructure to track
pending at-commit fsyncs for a relation and in-transaction
truncations. table_relation_register_walskip() should be used to start
tracking before batch operations like COPY and CLUSTER, and use
BufferNeedsWAL() instead of RelationNeedsWAL() at the places related
to WAL-logging about heap-modifying operations, then remove
call to table_finish_bulk_insert() and the tableam intaface.
---
 src/backend/access/transam/xact.c   |  12 +-
 src/backend/catalog/storage.c       | 612 +++++++++++++++++++++++++++++++++---
 src/backend/commands/tablecmds.c    |   6 +-
 src/backend/storage/buffer/bufmgr.c |  39 ++-
 src/backend/utils/cache/relcache.c  |   3 +
 src/include/catalog/storage.h       |  17 +-
 src/include/storage/bufmgr.h        |   2 +
 src/include/utils/rel.h             |   7 +
 8 files changed, 631 insertions(+), 67 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e9ed92b70b..33a83dc784 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2102,6 +2102,9 @@ CommitTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrFinishBulkInsert(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2334,6 +2337,9 @@ PrepareTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrFinishBulkInsert(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2659,6 +2665,7 @@ AbortTransaction(void)
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
     AtAbort_Twophase();
+    smgrFinishBulkInsert(false);    /* abandon pending syncs */
 
     /*
      * Advertise the fact that we aborted in pg_xact (assuming that we got as
@@ -4792,8 +4799,7 @@ CommitSubTransaction(void)
     AtEOSubXact_RelationCache(true, s->subTransactionId,
                               s->parent->subTransactionId);
     AtEOSubXact_Inval(true);
-    AtSubCommit_smgr();
-
+    AtSubCommit_smgr(s->subTransactionId, s->parent->subTransactionId);
     /*
      * The only lock we actually release here is the subtransaction XID lock.
      */
@@ -4970,7 +4976,7 @@ AbortSubTransaction(void)
         ResourceOwnerRelease(s->curTransactionOwner,
                              RESOURCE_RELEASE_AFTER_LOCKS,
                              false, false);
-        AtSubAbort_smgr();
+        AtSubAbort_smgr(s->subTransactionId, s->parent->subTransactionId);
 
         AtEOXact_GUC(false, s->gucNestLevel);
         AtEOSubXact_SPI(false, s->subTransactionId);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 72242b2476..4cd112f86c 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -21,6 +21,7 @@
 
 #include "miscadmin.h"
 
+#include "access/tableam.h"
 #include "access/visibilitymap.h"
 #include "access/xact.h"
 #include "access/xlog.h"
@@ -29,10 +30,18 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
-#include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+ /* #define STORAGEDEBUG */    /* turns DEBUG elogs on */
+
+#ifdef STORAGEDEBUG
+#define STORAGE_elog(...)                elog(__VA_ARGS__)
+#else
+#define STORAGE_elog(...)
+#endif
+
 /*
  * We keep a list of all relations (represented as RelFileNode values)
  * that have been created or deleted in the current transaction.  When
@@ -64,6 +73,61 @@ typedef struct PendingRelDelete
 
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a RelWalSkip entry is created, and
+ * 'skip_wal_min_blk' is set to the current size of the relation. Any
+ * operations on blocks < skip_wal_min_blk need to be WAL-logged as usual, but
+ * for operations on higher blocks, WAL-logging is skipped.
+
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct RelWalSkip
+{
+    RelFileNode relnode;            /* relation created in same xact */
+    bool        forks[MAX_FORKNUM + 1];    /* target forknums */
+    BlockNumber skip_wal_min_blk;    /* WAL-logging skipped for blocks >=
+                                     * skip_wal_min_blk */
+    BlockNumber wal_log_min_blk;     /* The minimum blk number that requires
+                                     * WAL-logging even if skipped by the
+                                     * above*/
+    SubTransactionId create_sxid;    /* subxid where this entry is created */
+    SubTransactionId invalidate_sxid; /* subxid where this entry is
+                                       * invalidated */
+    const TableAmRoutine *tableam;    /* Table access routine */
+}    RelWalSkip;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *walSkipHash = NULL;
+
+static RelWalSkip *getWalSkipEntry(Relation rel, bool create);
+static RelWalSkip *getWalSkipEntryRNode(RelFileNode *node,
+                                                      bool create);
+static void smgrProcessWALSkipInval(bool isCommit, SubTransactionId mySubid,
+                        SubTransactionId parentSubid);
+
 /*
  * RelationCreateStorage
  *        Create physical storage for a relation.
@@ -261,31 +325,59 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
      */
     if (RelationNeedsWAL(rel))
     {
-        /*
-         * Make an XLOG entry reporting the file truncation.
-         */
-        XLogRecPtr    lsn;
-        xl_smgr_truncate xlrec;
+        RelWalSkip *walskip;
 
-        xlrec.blkno = nblocks;
-        xlrec.rnode = rel->rd_node;
-        xlrec.flags = SMGR_TRUNCATE_ALL;
-
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
-
-        lsn = XLogInsert(RM_SMGR_ID,
-                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+        /* get pending sync entry, create if not yet */
+        walskip = getWalSkipEntry(rel, true);
 
         /*
-         * Flush, because otherwise the truncation of the main relation might
-         * hit the disk before the WAL record, and the truncation of the FSM
-         * or visibility map. If we crashed during that window, we'd be left
-         * with a truncated heap, but the FSM or visibility map would still
-         * contain entries for the non-existent heap pages.
+         * walskip is null here if rel doesn't support WAL-logging skip,
+         * otherwise check for WAL-skipping status.
          */
-        if (fsm || vm)
-            XLogFlush(lsn);
+        if (walskip == NULL ||
+            walskip->skip_wal_min_blk == InvalidBlockNumber ||
+            walskip->skip_wal_min_blk < nblocks)
+        {
+            /*
+             * If WAL-skipping is enabled, this is the first time truncation
+             * of this relation in this transaction or truncation that leaves
+             * pages that need at-commit fsync.  Make an XLOG entry reporting
+             * the file truncation.
+             */
+            XLogRecPtr        lsn;
+            xl_smgr_truncate xlrec;
+
+            xlrec.blkno = nblocks;
+            xlrec.rnode = rel->rd_node;
+            xlrec.flags = SMGR_TRUNCATE_ALL;
+
+            XLogBeginInsert();
+            XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+            lsn = XLogInsert(RM_SMGR_ID,
+                             XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+            STORAGE_elog(DEBUG2,
+                         "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+                         rel->rd_node.spcNode, rel->rd_node.dbNode,
+                         rel->rd_node.relNode, nblocks);
+            /*
+             * Flush, because otherwise the truncation of the main relation
+             * might hit the disk before the WAL record, and the truncation of
+             * the FSM or visibility map. If we crashed during that window,
+             * we'd be left with a truncated heap, but the FSM or visibility
+             * map would still contain entries for the non-existent heap
+             * pages.
+             */
+            if (fsm || vm)
+                XLogFlush(lsn);
+
+            if (walskip)
+            {
+                /* no longer skip WAL-logging for the blocks */
+                walskip->wal_log_min_blk = nblocks;
+            }
+        }
     }
 
     /* Do the real work */
@@ -296,8 +388,7 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
  * Copy a fork's data, block by block.
  */
 void
-RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
-                    ForkNumber forkNum, char relpersistence)
+RelationCopyStorage(Relation srcrel, SMgrRelation dst, ForkNumber forkNum)
 {
     PGAlignedBlock buf;
     Page        page;
@@ -305,6 +396,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
     bool        copying_initfork;
     BlockNumber nblocks;
     BlockNumber blkno;
+    SMgrRelation src = srcrel->rd_smgr;
+    char         relpersistence = srcrel->rd_rel->relpersistence;
 
     page = (Page) buf.data;
 
@@ -316,12 +409,33 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
     copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
         forkNum == INIT_FORKNUM;
 
-    /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a permanent relation.
-     */
-    use_wal = XLogIsNeeded() &&
-        (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+    if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
+    {
+        /*
+         * We need to log the copied data in WAL iff WAL archiving/streaming
+         * is enabled AND it's a permanent relation.
+         */
+        if (XLogIsNeeded())
+            use_wal = true;
+
+        /*
+         * If the rel is WAL-logged, must fsync before commit.  We use
+         * heap_sync to ensure that the toast table gets fsync'd too.  (For a
+         * temp or unlogged rel we don't care since the data will be gone
+         * after a crash anyway.)
+         *
+         * It's obvious that we must do this when not WAL-logging the
+         * copy. It's less obvious that we have to do it even if we did
+         * WAL-log the copied pages. The reason is that since we're copying
+         * outside shared buffers, a CHECKPOINT occurring during the copy has
+         * no way to flush the previously written data to disk (indeed it
+         * won't know the new rel even exists).  A crash later on would replay
+         * WAL from the checkpoint, therefore it wouldn't replay our earlier
+         * WAL entries. If we do not fsync those pages here, they might still
+         * not be on disk when the crash occurs.
+         */
+        RecordPendingSync(srcrel, dst, forkNum);
+    }
 
     nblocks = smgrnblocks(src, forkNum);
 
@@ -358,24 +472,321 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
          */
         smgrextend(dst, forkNum, blkno, buf.data, true);
     }
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+    BlockNumber        blkno = InvalidBlockNumber;
+    RelWalSkip *walskip;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    /* fetch existing pending sync entry */
+    walskip = getWalSkipEntry(rel, false);
 
     /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.  (For a temp or
-     * unlogged rel we don't care since the data will be gone after a crash
-     * anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the copy. It's
-     * less obvious that we have to do it even if we did WAL-log the copied
-     * pages. The reason is that since we're copying outside shared buffers, a
-     * CHECKPOINT occurring during the copy has no way to flush the previously
-     * written data to disk (indeed it won't know the new rel even exists).  A
-     * crash later on would replay WAL from the checkpoint, therefore it
-     * wouldn't replay our earlier WAL entries. If we do not fsync those pages
-     * here, they might still not be on disk when the crash occurs.
+     * no point in doing further work if we know that we don't skip
+     * WAL-logging.
      */
-    if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
-        smgrimmedsync(dst, forkNum);
+    if (!walskip)
+    {
+        STORAGE_elog(DEBUG2,
+                     "not skipping WAL-logging for rel %u/%u/%u block %u",
+                     rel->rd_node.spcNode, rel->rd_node.dbNode,
+                     rel->rd_node.relNode, BufferGetBlockNumber(buf));
+        return true;
+    }
+
+    Assert(BufferIsValid(buf));
+
+    blkno = BufferGetBlockNumber(buf);
+
+    /*
+     * We don't skip WAL-logging for pages that once done.
+     */
+    if (walskip->skip_wal_min_blk == InvalidBlockNumber ||
+        walskip->skip_wal_min_blk > blkno)
+    {
+        STORAGE_elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because skip_wal_min_blk is %u",
+                     rel->rd_node.spcNode, rel->rd_node.dbNode,
+                     rel->rd_node.relNode, blkno, walskip->skip_wal_min_blk);
+        return true;
+    }
+
+    /*
+     * we don't skip WAL-logging for blocks that have got WAL-logged
+     * truncation
+     */
+    if (walskip->wal_log_min_blk != InvalidBlockNumber &&
+        walskip->wal_log_min_blk <= blkno)
+    {
+        STORAGE_elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because wal_log_min_blk is %u",
+                     rel->rd_node.spcNode, rel->rd_node.dbNode,
+                     rel->rd_node.relNode, blkno, walskip->wal_log_min_blk);
+        return true;
+    }
+
+    STORAGE_elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode,
+                 rel->rd_node.relNode, blkno);
+
+    return false;
+}
+
+bool
+BlockNeedsWAL(Relation rel, BlockNumber blkno)
+{
+    RelWalSkip *walskip;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    /* fetch exising pending sync entry */
+    walskip = getWalSkipEntry(rel, false);
+
+    /*
+     * no point in doing further work if we know that we don't skip
+     * WAL-logging.
+     */
+    if (!walskip)
+        return true;
+
+    /*
+     * We don't skip WAL-logging for pages that once done.
+     */
+    if (walskip->skip_wal_min_blk == InvalidBlockNumber ||
+        walskip->skip_wal_min_blk > blkno)
+    {
+        STORAGE_elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because skip_wal_min_blk is %u",
+                     rel->rd_node.spcNode, rel->rd_node.dbNode,
+                     rel->rd_node.relNode, blkno, walskip->skip_wal_min_blk);
+        return true;
+    }
+
+    /*
+     * we don't skip WAL-logging for blocks that have got WAL-logged
+     * truncation
+     */
+    if (walskip->wal_log_min_blk != InvalidBlockNumber &&
+        walskip->wal_log_min_blk <= blkno)
+    {
+        STORAGE_elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because wal_log_min_blk is %u",
+                     rel->rd_node.spcNode, rel->rd_node.dbNode,
+                     rel->rd_node.relNode, blkno, walskip->wal_log_min_blk);
+
+        return true;
+    }
+
+    STORAGE_elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode,
+                 rel->rd_node.relNode, blkno);
+
+    return false;
+}
+
+/*
+ * Remember that the given relation doesn't need WAL-logging for the blocks
+ * after the current block size and for the blocks that are going to be synced
+ * at commit.
+ */
+void
+RecordWALSkipping(Relation rel)
+{
+    RelWalSkip *walskip;
+
+    Assert(RelationNeedsWAL(rel));
+
+    /* get pending sync entry, create if not yet  */
+    walskip = getWalSkipEntry(rel, true);
+
+    if (walskip == NULL)
+        return;
+
+    /*
+     *  Record only the first registration.
+     */
+    if (walskip->skip_wal_min_blk != InvalidBlockNumber)
+    {
+        STORAGE_elog(DEBUG2, "WAL skipping for rel %u/%u/%u was already registered at block %u (new %u)",
+                     rel->rd_node.spcNode, rel->rd_node.dbNode,
+                     rel->rd_node.relNode, walskip->skip_wal_min_blk,
+                     RelationGetNumberOfBlocks(rel));
+        return;
+    }
+
+    STORAGE_elog(DEBUG2, "registering new WAL skipping rel %u/%u/%u at block %u",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode,
+                 rel->rd_node.relNode, RelationGetNumberOfBlocks(rel));
+
+    walskip->skip_wal_min_blk = RelationGetNumberOfBlocks(rel);
+}
+
+/*
+ * Record commit-time file sync. This shouldn't be used mixing with
+ * RecordWALSkipping.
+ */
+void
+RecordPendingSync(Relation rel, SMgrRelation targetsrel, ForkNumber forknum)
+{
+    RelWalSkip *walskip;
+
+    Assert(RelationNeedsWAL(rel));
+
+    /* check for support for this feature */
+    if (rel->rd_tableam == NULL ||
+        rel->rd_tableam->relation_register_walskip == NULL)
+        return;
+
+    walskip = getWalSkipEntryRNode(&targetsrel->smgr_rnode.node, true);
+    walskip->forks[forknum] = true;
+    walskip->skip_wal_min_blk = 0;
+    walskip->tableam = rel->rd_tableam;
+
+    STORAGE_elog(DEBUG2,
+                 "registering new pending sync for rel %u/%u/%u at block %u",
+                 walskip->relnode.spcNode, walskip->relnode.dbNode,
+                 walskip->relnode.relNode, 0);
+}
+
+/*
+ * RelationInvalidateWALSkip() -- invalidate WAL-skip entry
+ */
+void
+RelationInvalidateWALSkip(Relation rel)
+{
+    RelWalSkip *walskip;
+
+    /* we know we don't have one */
+    if (rel->rd_nowalskip)
+        return;
+
+    walskip = getWalSkipEntry(rel, false);
+
+    if (!walskip)
+        return;
+
+    /*
+     * The state is reset at subtransaction commit/abort. No invalidation
+     * request must not come for the same relation in the same subtransaction.
+     */
+    Assert(walskip->invalidate_sxid == InvalidSubTransactionId);
+
+    walskip->invalidate_sxid = GetCurrentSubTransactionId();
+
+    STORAGE_elog(DEBUG2,
+                 "WAL skip of rel %u/%u/%u invalidated by sxid %d",
+                 walskip->relnode.spcNode, walskip->relnode.dbNode,
+                 walskip->relnode.relNode, walskip->invalidate_sxid);
+}
+
+/*
+ * getWalSkipEntry: get WAL skip entry.
+ *
+ * Returns WAL skip entry for the relation. The entry tracks WAL-skipping
+ * blocks for the relation.  The WAL-skipped blocks need fsync at commit time.
+ * Creates one if needed when create is true. If rel doesn't support this
+ * feature, returns true even if create is true.
+ */
+static inline RelWalSkip *
+getWalSkipEntry(Relation rel, bool create)
+{
+    RelWalSkip *walskip_entry = NULL;
+
+    if (rel->rd_walskip)
+        return rel->rd_walskip;
+
+    /* we know we don't have pending sync entry */
+    if (!create && rel->rd_nowalskip)
+        return NULL;
+
+    /* check for support for this feature */
+    if (rel->rd_tableam == NULL ||
+        rel->rd_tableam->relation_register_walskip == NULL)
+    {
+        rel->rd_nowalskip = true;
+        return NULL;
+    }
+
+    walskip_entry = getWalSkipEntryRNode(&rel->rd_node, create);
+
+    if (!walskip_entry)
+    {
+        /* prevent further hash lookup */
+        rel->rd_nowalskip = true;
+        return NULL;
+    }
+
+    walskip_entry->forks[MAIN_FORKNUM] = true;
+    walskip_entry->tableam = rel->rd_tableam;
+
+    /* hold shortcut in Relation */
+    rel->rd_nowalskip = false;
+    rel->rd_walskip = walskip_entry;
+
+    return walskip_entry;
+}
+
+/*
+ * getWalSkipEntryRNode: get WAL skip entry by rnode
+ *
+ * Returns a WAL skip entry for the RelFileNode.
+ */
+static RelWalSkip *
+getWalSkipEntryRNode(RelFileNode *rnode, bool create)
+{
+    RelWalSkip *walskip_entry = NULL;
+    bool            found;
+
+    if (!walSkipHash)
+    {
+        /* First time through: initialize the hash table */
+        HASHCTL        ctl;
+
+        if (!create)
+            return NULL;
+
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(RelFileNode);
+        ctl.entrysize = sizeof(RelWalSkip);
+        ctl.hash = tag_hash;
+        walSkipHash = hash_create("pending relation sync table", 5,
+                                   &ctl, HASH_ELEM | HASH_FUNCTION);
+    }
+
+    walskip_entry = (RelWalSkip *)
+        hash_search(walSkipHash, (void *) rnode,
+                    create ? HASH_ENTER: HASH_FIND,    &found);
+
+    if (!walskip_entry)
+        return NULL;
+
+    /* new entry created */
+    if (!found)
+    {
+        memset(&walskip_entry->forks, 0, sizeof(walskip_entry->forks));
+        walskip_entry->wal_log_min_blk = InvalidBlockNumber;
+        walskip_entry->skip_wal_min_blk = InvalidBlockNumber;
+        walskip_entry->create_sxid = GetCurrentSubTransactionId();
+        walskip_entry->invalidate_sxid = InvalidSubTransactionId;
+        walskip_entry->tableam = NULL;
+    }
+
+    return walskip_entry;
 }
 
 /*
@@ -506,6 +917,107 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     return nrels;
 }
 
+/*
+ * Finish bulk insert of files.
+ */
+void
+smgrFinishBulkInsert(bool isCommit)
+{
+    if (!walSkipHash)
+        return;
+
+    if (isCommit)
+    {
+        HASH_SEQ_STATUS status;
+        RelWalSkip *walskip;
+
+        hash_seq_init(&status, walSkipHash);
+
+        while ((walskip = hash_seq_search(&status)) != NULL)
+        {
+            /*
+             * On commit, process valid entreis. Rollback doesn't need sync on
+             * all changes during the transaction.
+             */
+            if (walskip->skip_wal_min_blk != InvalidBlockNumber &&
+                walskip->invalidate_sxid == InvalidSubTransactionId)
+            {
+                int f;
+
+                FlushRelationBuffersWithoutRelCache(walskip->relnode, false);
+
+                /*
+                 * We mustn't create an entry when the table AM doesn't
+                 * support WAL-skipping.
+                 */
+                Assert (walskip->tableam->finish_bulk_insert);
+
+                /* flush all requested forks  */
+                for (f = MAIN_FORKNUM ; f <= MAX_FORKNUM ; f++)
+                {
+                    if (walskip->forks[f])
+                    {
+                        walskip->tableam->finish_bulk_insert(walskip->relnode, f);
+                        STORAGE_elog(DEBUG2, "finishing bulk insert to rel %u/%u/%u fork %d",
+                                     walskip->relnode.spcNode,
+                                     walskip->relnode.dbNode,
+                                     walskip->relnode.relNode, f);
+                    }
+                }
+            }
+        }
+    }
+
+    hash_destroy(walSkipHash);
+    walSkipHash = NULL;
+}
+
+/*
+ * Process pending invalidation of WAL skip happened in the subtransaction
+ */
+void
+smgrProcessWALSkipInval(bool isCommit, SubTransactionId mySubid,
+                        SubTransactionId parentSubid)
+{
+    HASH_SEQ_STATUS status;
+    RelWalSkip *walskip;
+
+    if (!walSkipHash)
+        return;
+
+    /* We expect that we don't have walSkipHash in almost all cases */
+    hash_seq_init(&status, walSkipHash);
+
+    while ((walskip = hash_seq_search(&status)) != NULL)
+    {
+        if (walskip->create_sxid == mySubid)
+        {
+            /*
+             * The entry was created in this subxact. Remove it on abort, or
+             * on commit after invalidation.
+             */
+            if (!isCommit || walskip->invalidate_sxid == mySubid)
+                hash_search(walSkipHash, &walskip->relnode,
+                            HASH_REMOVE, NULL);
+            /* Treat committing valid entry as creation by the parent. */
+            else if (walskip->invalidate_sxid == InvalidSubTransactionId)
+                walskip->create_sxid = parentSubid;
+        }
+        else if (walskip->invalidate_sxid == mySubid)
+        {
+            /*
+             * This entry was created elsewhere then invalidated by this
+             * subxact. Treat commit as invalidation by the parent. Otherwise
+             * cancel invalidation.
+             */
+            if (isCommit)
+                walskip->invalidate_sxid = parentSubid;
+            else
+                walskip->invalidate_sxid = InvalidSubTransactionId;
+        }
+    }
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *
@@ -535,7 +1047,7 @@ PostPrepare_smgr(void)
  * Reassign all items in the pending-deletes list to the parent transaction.
  */
 void
-AtSubCommit_smgr(void)
+AtSubCommit_smgr(SubTransactionId mySubid, SubTransactionId parentSubid)
 {
     int            nestLevel = GetCurrentTransactionNestLevel();
     PendingRelDelete *pending;
@@ -545,6 +1057,9 @@ AtSubCommit_smgr(void)
         if (pending->nestLevel >= nestLevel)
             pending->nestLevel = nestLevel - 1;
     }
+
+    /* Remove invalidated WAL skip in this subtransaction */
+    smgrProcessWALSkipInval(true, mySubid, parentSubid);
 }
 
 /*
@@ -555,9 +1070,12 @@ AtSubCommit_smgr(void)
  * subtransaction will not commit.
  */
 void
-AtSubAbort_smgr(void)
+AtSubAbort_smgr(SubTransactionId mySubid, SubTransactionId parentSubid)
 {
     smgrDoPendingDeletes(false);
+
+    /* Remove invalidated WAL skip in this subtransaction */
+    smgrProcessWALSkipInval(false, mySubid, parentSubid);
 }
 
 void
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 654179297c..8908b77d98 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11983,8 +11983,7 @@ index_copy_data(Relation rel, RelFileNode newrnode)
     RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
 
     /* copy main fork */
-    RelationCopyStorage(rel->rd_smgr, dstrel, MAIN_FORKNUM,
-                        rel->rd_rel->relpersistence);
+    RelationCopyStorage(rel, dstrel, MAIN_FORKNUM);
 
     /* copy those extra forks that exist */
     for (ForkNumber forkNum = MAIN_FORKNUM + 1;
@@ -12002,8 +12001,7 @@ index_copy_data(Relation rel, RelFileNode newrnode)
                 (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
                  forkNum == INIT_FORKNUM))
                 log_smgrcreate(&newrnode, forkNum);
-            RelationCopyStorage(rel->rd_smgr, dstrel, forkNum,
-                                rel->rd_rel->relpersistence);
+            RelationCopyStorage(rel, dstrel, forkNum);
         }
     }
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..f00826712a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
             BufferAccessStrategy strategy,
             bool *foundPtr);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int    rnode_comparator(const void *p1, const void *p2);
@@ -3153,20 +3154,40 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
     /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int            i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3183,7 +3204,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3213,18 +3234,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 64f3c2e887..f06d55a8fe 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -75,6 +75,7 @@
 #include "partitioning/partdesc.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/array.h"
@@ -5644,6 +5645,8 @@ load_relcache_init_file(bool shared)
         rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
         rel->rd_amcache = NULL;
         MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
+        rel->rd_nowalskip = false;
+        rel->rd_walskip = NULL;
 
         /*
          * Recompute lock and physical addressing info.  This is needed in
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 882dc65c89..83fee7dbfe 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,8 +23,14 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
-                                ForkNumber forkNum, char relpersistence);
+extern void RelationCopyStorage(Relation srcrel, SMgrRelation dst,
+                                ForkNumber forkNum);
+extern bool BufferNeedsWAL(Relation rel, Buffer buf);
+extern bool BlockNeedsWAL(Relation rel, BlockNumber blkno);
+extern void RecordWALSkipping(Relation rel);
+extern void RecordPendingSync(Relation rel, SMgrRelation srel,
+                              ForkNumber forknum);
+extern void RelationInvalidateWALSkip(Relation rel);
 
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
@@ -32,8 +38,11 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
  */
 extern void smgrDoPendingDeletes(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
-extern void AtSubCommit_smgr(void);
-extern void AtSubAbort_smgr(void);
+extern void smgrFinishBulkInsert(bool isCommit);
+extern void AtSubCommit_smgr(SubTransactionId mySubid,
+                             SubTransactionId parentSubid);
+extern void AtSubAbort_smgr(SubTransactionId mySubid,
+                             SubTransactionId parentSubid);
 extern void PostPrepare_smgr(void);
 
 #endif                            /* STORAGE_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c5826f691d..8a9ea041dd 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                 ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                    bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                        ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 54028515a7..b2b46322b2 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -198,6 +198,13 @@ typedef struct RelationData
 
     /* use "struct" here to avoid needing to include pgstat.h: */
     struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+    /*
+     * rd_nowalskip is true if this relation is known not to skip WAL.
+     * Otherwise we need to ask smgr for an entry if rd_walskip is NULL.
+     */
+    bool                rd_nowalskip;
+    struct RelWalSkip   *rd_walskip;
 } RelationData;
 
 
-- 
2.16.3

From 3e816b09365dc8d388832460820a3ee2ca58dc5b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 2 Apr 2019 13:29:23 +0900
Subject: [PATCH 6/7] Fix WAL skipping feature.

This patch replaces WAL-skipping means from HEAP_INSERT_SKIP_WAL to
the new infrastructure.
---
 src/backend/access/heap/heapam.c         | 114 +++++++++++++++++++++++--------
 src/backend/access/heap/heapam_handler.c |  88 ++++++++++++++++++------
 src/backend/access/heap/pruneheap.c      |   3 +-
 src/backend/access/heap/rewriteheap.c    |  28 ++------
 src/backend/access/heap/vacuumlazy.c     |   6 +-
 src/backend/access/heap/visibilitymap.c  |   3 +-
 src/backend/commands/cluster.c           |  27 ++++++++
 src/backend/commands/copy.c              |  15 +++-
 src/backend/commands/createas.c          |   7 +-
 src/backend/commands/matview.c           |   7 +-
 src/backend/commands/tablecmds.c         |   8 ++-
 src/include/access/rewriteheap.h         |   2 +-
 12 files changed, 219 insertions(+), 89 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 267570b461..cc516e599d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -28,6 +28,27 @@
  *      the POSTGRES heap access method used for all POSTGRES
  *      relations.
  *
+ * WAL CONSIDERATIONS
+ *      All heap operations are normally WAL-logged. but there are a few
+ *      exceptions. Temporary and unlogged relations never need to be
+ *      WAL-logged, but we can also skip WAL-logging for a table that was
+ *      created in the same transaction, if we don't need WAL for PITR or WAL
+ *      archival purposes (i.e. if wal_level=minimal), and we fsync() the file
+ *      to disk at COMMIT instead.
+ *
+ *      The same-relation optimization is not employed automatically on all
+ *      updates to a table that was created in the same transaction, because for
+ *      a small number of changes, it's cheaper to just create the WAL records
+ *      than fsync()ing the whole relation at COMMIT. It is only worthwhile for
+ *      (presumably) large operations like COPY, CLUSTER, or VACUUM FULL. Use
+ *      table_relation_register_sync() to initiate such an operation; it will
+ *      cause any subsequent updates to the table to skip WAL-logging, if
+ *      possible, and cause the heap to be synced to disk at COMMIT.
+ *
+ *      To make that work, all modifications to heap must use
+ *      BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ *      for the given block.
+ *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -51,6 +72,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -1948,7 +1970,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         XLogRecPtr    recptr;
 
@@ -2058,7 +2080,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     int            ndone;
     PGAlignedBlock scratch;
     Page        page;
-    bool        needwal;
     Size        saveFreeSpace;
     bool        need_tuple_data = RelationIsLogicallyLogged(relation);
     bool        need_cids = RelationIsAccessibleInLogicalDecoding(relation);
@@ -2066,7 +2087,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -2108,6 +2128,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
         Buffer        vmbuffer = InvalidBuffer;
         bool        all_visible_cleared = false;
         int            nthispage;
+        bool        needwal;
 
         CHECK_FOR_INTERRUPTS();
 
@@ -2119,6 +2140,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
                                            InvalidBuffer, options, bistate,
                                            &vmbuffer, NULL);
         page = BufferGetPage(buffer);
+        needwal = BufferNeedsWAL(relation, buffer);
 
         /* NO EREPORT(ERROR) from here till changes are logged */
         START_CRIT_SECTION();
@@ -2671,7 +2693,7 @@ l1:
      * NB: heap_abort_speculative() uses the same xlog record and replay
      * routines.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         XLogRecPtr    recptr;
 
@@ -2805,6 +2827,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
                 vmbuffer = InvalidBuffer,
                 vmbuffer_new = InvalidBuffer;
     bool        need_toast;
+    bool        oldbuf_needs_wal,
+                newbuf_needs_wal;
     Size        newtupsize,
                 pagefree;
     bool        have_tuple_lock = false;
@@ -3356,7 +3380,7 @@ l2:
 
         MarkBufferDirty(buffer);
 
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             xl_heap_lock xlrec;
             XLogRecPtr    recptr;
@@ -3570,26 +3594,55 @@ l2:
         MarkBufferDirty(newbuf);
     MarkBufferDirty(buffer);
 
-    /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    /*
+     *  XLOG stuff
+     *
+     * Emit heap-update log. When wal_level = minimal, we may emit insert or
+     * delete record according to wal-optimization.
+     */
+    oldbuf_needs_wal = BufferNeedsWAL(relation, buffer);
+
+    if (newbuf == buffer)
+        newbuf_needs_wal = oldbuf_needs_wal;
+    else
+        newbuf_needs_wal = BufferNeedsWAL(relation, newbuf);
+
+    if (oldbuf_needs_wal || newbuf_needs_wal)
     {
         XLogRecPtr    recptr;
 
         /*
          * For logical decoding we need combocids to properly decode the
-         * catalog.
+         * catalog. Both oldbuf_needs_wal and newbuf_needs_wal must be true
+         * when logical decoding is active.
          */
         if (RelationIsAccessibleInLogicalDecoding(relation))
         {
+            Assert(oldbuf_needs_wal && newbuf_needs_wal);
+
             log_heap_new_cid(relation, &oldtup);
             log_heap_new_cid(relation, heaptup);
         }
 
-        recptr = log_heap_update(relation, buffer,
-                                 newbuf, &oldtup, heaptup,
-                                 old_key_tuple,
-                                 all_visible_cleared,
-                                 all_visible_cleared_new);
+        /*
+         * Insert log record. Using delete or insert log loses HOT chain
+         * information but that happens only when newbuf is different from
+         * buffer, where HOT cannot happen.
+         */
+        if (oldbuf_needs_wal && newbuf_needs_wal)
+            recptr = log_heap_update(relation, buffer, newbuf,
+                                     &oldtup, heaptup,
+                                     old_key_tuple,
+                                     all_visible_cleared,
+                                     all_visible_cleared_new);
+        else if (oldbuf_needs_wal)
+            recptr = log_heap_delete(relation, buffer, &oldtup, old_key_tuple,
+                                     xmax_old_tuple, false,
+                                     all_visible_cleared);
+        else
+            recptr = log_heap_insert(relation, buffer, newtup,
+                                     0, all_visible_cleared_new);
+
         if (newbuf != buffer)
         {
             PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4467,7 +4520,7 @@ failed:
      * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
      * entries for everything anyway.)
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, *buffer))
     {
         xl_heap_lock xlrec;
         XLogRecPtr    recptr;
@@ -5219,7 +5272,7 @@ l4:
         MarkBufferDirty(buf);
 
         /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf))
         {
             xl_heap_lock_updated xlrec;
             XLogRecPtr    recptr;
@@ -5379,7 +5432,7 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
     htup->t_ctid = *tid;
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_confirm xlrec;
         XLogRecPtr    recptr;
@@ -5511,7 +5564,7 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
      * The WAL records generated here match heap_delete().  The same recovery
      * routines are used.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -5620,7 +5673,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_inplace xlrec;
         XLogRecPtr    recptr;
@@ -7030,8 +7083,8 @@ log_heap_clean(Relation reln, Buffer buffer,
     xl_heap_clean xlrec;
     XLogRecPtr    recptr;
 
-    /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    /* Caller should not call me on non-WAL-logged buffers */
+    Assert(BufferNeedsWAL(reln, buffer));
 
     xlrec.latestRemovedXid = latestRemovedXid;
     xlrec.nredirected = nredirected;
@@ -7078,8 +7131,8 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
     xl_heap_freeze_page xlrec;
     XLogRecPtr    recptr;
 
-    /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    /* Caller should not call me on non-WAL-logged buffers */
+    Assert(BufferNeedsWAL(reln, buffer));
     /* nor when there are no tuples to freeze */
     Assert(ntuples > 0);
 
@@ -7305,8 +7358,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
     bool        init;
     int            bufflags;
 
-    /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    /* Caller should not call me when no buffer needs WAL-logging */
+    Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
 
     XLogBeginInsert();
 
@@ -8910,9 +8963,16 @@ heap2_redo(XLogReaderState *record)
  *    heap_sync        - sync a heap, for use when no WAL has been written
  *
  * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. table_relation_register_sync() should
+ * be used for that purpose instead.
  *
  * Indexes are not touched.  (Currently, index operations associated with
  * the commands that use this are WAL-logged and so do not need fsync.
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 5c96fc91b7..bddf026b81 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -57,6 +57,9 @@ static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
                        HeapTuple tuple,
                        OffsetNumber tupoffset);
 
+static void heapam_relation_register_walskip(Relation rel);
+static void heapam_relation_invalidate_walskip(Relation rel);
+
 static const TableAmRoutine heapam_methods;
 
 
@@ -541,14 +544,10 @@ tuple_lock_retry:
 }
 
 static void
-heapam_finish_bulk_insert(Relation relation, int options)
+heapam_finish_bulk_insert(RelFileNode rnode, ForkNumber forkNum)
 {
-    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway / don't go through tableam)
-     */
-    if (options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(relation);
+    /* Sync the file immedately */
+    smgrimmedsync(smgropen(rnode, InvalidBackendId), forkNum);
 }
 
 
@@ -616,6 +615,12 @@ heapam_relation_copy_data(Relation rel, RelFileNode newrnode)
     dstrel = smgropen(newrnode, rel->rd_backend);
     RelationOpenSmgr(rel);
 
+    /*
+     * Register WAL-skipping for the relation. WAL-logging is skipped and sync
+     * the file at commit if the AM supports the feature.
+     */
+    table_relation_register_walskip(rel);
+
     /*
      * Create and copy all forks of the relation, and schedule unlinking of
      * old physical files.
@@ -626,8 +631,7 @@ heapam_relation_copy_data(Relation rel, RelFileNode newrnode)
     RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
 
     /* copy main fork */
-    RelationCopyStorage(rel->rd_smgr, dstrel, MAIN_FORKNUM,
-                        rel->rd_rel->relpersistence);
+    RelationCopyStorage(rel, dstrel, MAIN_FORKNUM);
 
     /* copy those extra forks that exist */
     for (ForkNumber forkNum = MAIN_FORKNUM + 1;
@@ -645,8 +649,7 @@ heapam_relation_copy_data(Relation rel, RelFileNode newrnode)
                 (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
                  forkNum == INIT_FORKNUM))
                 log_smgrcreate(&newrnode, forkNum);
-            RelationCopyStorage(rel->rd_smgr, dstrel, forkNum,
-                                rel->rd_rel->relpersistence);
+            RelationCopyStorage(rel, dstrel, forkNum);
         }
     }
 
@@ -670,7 +673,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     IndexScanDesc indexScan;
     TableScanDesc tableScan;
     HeapScanDesc heapScan;
-    bool        use_wal;
     bool        is_system_catalog;
     Tuplesortstate *tuplesort;
     TupleDesc    oldTupDesc = RelationGetDescr(OldHeap);
@@ -684,15 +686,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     /* Remember if it's a system catalog */
     is_system_catalog = IsSystemRelation(OldHeap);
 
-    /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a WAL-logged rel.
-     */
-    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
-    /* use_wal off requires smgr_targblock be initially invalid */
-    Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
-
     /* Preallocate values/isnull arrays */
     natts = newTupDesc->natts;
     values = (Datum *) palloc(natts * sizeof(Datum));
@@ -700,7 +693,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
     /* Initialize the rewrite operation */
     rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, FreezeXid,
-                                 MultiXactCutoff, use_wal);
+                                 MultiXactCutoff);
 
 
     /* Set up sorting if wanted */
@@ -946,6 +939,55 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     pfree(isnull);
 }
 
+/*
+ *    heapam_relation_register_walskip - register a heap to be WAL-skipped then
+ *                                       synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file. This makes
+ * note of the current size of the relation, and ensures that when the
+ * relation is extended, any changes to the new blocks in the heap, in the
+ * same transaction, will not be WAL-logged. Instead, the heap contents are
+ * flushed to disk at commit.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+static void
+heapam_relation_register_walskip(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RecordWALSkipping(rel);
+    if (OidIsValid(rel->rd_rel->reltoastrelid))
+    {
+        Relation    toastrel;
+
+        toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+        RecordWALSkipping(toastrel);
+        heap_close(toastrel, AccessShareLock);
+    }
+
+    return;
+}
+
+/*
+ *    heapam_relation_invalidate_walskip    - invalidate registered WAL skipping
+ *
+ *  After some file-replacing operations like CLUSTER, the old file no longe
+ *  needs to be synced to disk. This function invalidates the registered
+ *  WAL-skipping on the current relfilenode of the relation.
+ */
+static void
+heapam_relation_invalidate_walskip(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RelationInvalidateWALSkip(rel);
+}
+
 static bool
 heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
                                BufferAccessStrategy bstrategy)
@@ -2423,6 +2465,8 @@ static const TableAmRoutine heapam_methods = {
     .relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
     .relation_copy_data = heapam_relation_copy_data,
     .relation_copy_for_cluster = heapam_relation_copy_for_cluster,
+    .relation_register_walskip = heapam_relation_register_walskip,
+    .relation_invalidate_walskip = heapam_relation_invalidate_walskip,
     .relation_vacuum = heap_vacuum_rel,
     .scan_analyze_next_block = heapam_scan_analyze_next_block,
     .scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index a3e51922d8..a05659b168 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
 #include "access/htup_details.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -258,7 +259,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
         /*
          * Emit a WAL HEAP_CLEAN record showing what we did
          */
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bce4274362..494f7fcd41 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
 #include "access/xloginsert.h"
 
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 
 #include "lib/ilist.h"
 
@@ -144,7 +145,6 @@ typedef struct RewriteStateData
     Page        rs_buffer;        /* page currently being built */
     BlockNumber rs_blockno;        /* block where page will go */
     bool        rs_buffer_valid;    /* T if any tuples in buffer */
-    bool        rs_use_wal;        /* must we WAL-log inserts? */
     bool        rs_logical_rewrite; /* do we need to do logical rewriting */
     TransactionId rs_oldest_xmin;    /* oldest xmin used by caller to determine
                                      * tuple visibility */
@@ -238,15 +238,13 @@ static void logical_end_heap_rewrite(RewriteState state);
  * oldest_xmin    xid used by the caller to determine which tuples are dead
  * freeze_xid    xid before which tuples will be frozen
  * min_multi    multixact before which multis will be removed
- * use_wal        should the inserts to the new heap be WAL-logged?
  *
  * Returns an opaque RewriteState, allocated in current memory context,
  * to be used in subsequent calls to the other functions.
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-                   TransactionId freeze_xid, MultiXactId cutoff_multi,
-                   bool use_wal)
+                   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
     RewriteState state;
     MemoryContext rw_cxt;
@@ -271,7 +269,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     /* new_heap needn't be empty, just locked */
     state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
     state->rs_buffer_valid = false;
-    state->rs_use_wal = use_wal;
     state->rs_oldest_xmin = oldest_xmin;
     state->rs_freeze_xid = freeze_xid;
     state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +327,7 @@ end_heap_rewrite(RewriteState state)
     /* Write the last page, if any */
     if (state->rs_buffer_valid)
     {
-        if (state->rs_use_wal)
+        if (BlockNeedsWAL(state->rs_new_rel, state->rs_blockno))
             log_newpage(&state->rs_new_rel->rd_node,
                         MAIN_FORKNUM,
                         state->rs_blockno,
@@ -344,19 +341,7 @@ end_heap_rewrite(RewriteState state)
                    (char *) state->rs_buffer, true);
     }
 
-    /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.
-     *
-     * It's obvious that we must do this when not WAL-logging. It's less
-     * obvious that we have to do it even if we did WAL-log the pages. The
-     * reason is the same as in tablecmds.c's copy_relation_data(): we're
-     * writing data that's not in shared buffers, and so a CHECKPOINT
-     * occurring during the rewriteheap operation won't have fsync'd data we
-     * wrote before the checkpoint.
-     */
-    if (RelationNeedsWAL(state->rs_new_rel))
-        heap_sync(state->rs_new_rel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     logical_end_heap_rewrite(state);
 
@@ -654,9 +639,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
@@ -695,7 +677,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
             /* Doesn't fit, so write out the existing page */
 
             /* XLOG stuff */
-            if (state->rs_use_wal)
+            if (BlockNeedsWAL(state->rs_new_rel, state->rs_blockno))
                 log_newpage(&state->rs_new_rel->rd_node,
                             MAIN_FORKNUM,
                             state->rs_blockno,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index b5b464e4a9..45139ec70e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -945,7 +945,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
                  * page has been previously WAL-logged, and if not, do that
                  * now.
                  */
-                if (RelationNeedsWAL(onerel) &&
+                if (BufferNeedsWAL(onerel, buf) &&
                     PageGetLSN(page) == InvalidXLogRecPtr)
                     log_newpage_buffer(buf, true);
 
@@ -1209,7 +1209,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
             }
 
             /* Now WAL-log freezing if necessary */
-            if (RelationNeedsWAL(onerel))
+            if (BufferNeedsWAL(onerel, buf))
             {
                 XLogRecPtr    recptr;
 
@@ -1591,7 +1591,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(onerel))
+    if (BufferNeedsWAL(onerel, buffer))
     {
         XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06b26..1f5f7d92dd 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
@@ -276,7 +277,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
         map[mapByte] |= (flags << mapOffset);
         MarkBufferDirty(vmBuf);
 
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, heapBuf))
         {
             if (XLogRecPtrIsInvalid(recptr))
             {
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 4f4be1efbf..b5db26fda5 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -612,6 +612,18 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
                                relpersistence,
                                AccessExclusiveLock);
 
+    /*
+     * If wal_level is minimal, we skip WAL-logging even for WAL-logging
+     * relations. The filenode is synced at commit.
+     */
+    if (!XLogIsNeeded())
+    {
+        /* make_new_heap doesn't lock OIDNewHeap */
+        Relation newheap = table_open(OIDNewHeap, AccessShareLock);
+        table_relation_register_walskip(newheap);
+        table_close(newheap, AccessShareLock);
+    }
+
     /* Copy the heap data into the new table in the desired order */
     copy_table_data(OIDNewHeap, tableOid, indexOid, verbose,
                    &swap_toast_by_content, &frozenXid, &cutoffMulti);
@@ -1355,6 +1367,21 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
     /* Zero out possible results from swapped_relation_files */
     memset(mapped_tables, 0, sizeof(mapped_tables));
 
+    /*
+     * Unregister useless pending file-sync. table_relation_unregister_sync
+     * relies on a premise that relation cache has the correct relfilenode and
+     * related members. After swap_relation_files, the relcache entry for the
+     * heaps gets inconsistent with pg_class entry so we should do this before
+     * the call.
+     */
+    if (!XLogIsNeeded())
+    {
+        Relation oldheap = table_open(OIDOldHeap, AccessShareLock);
+
+        table_relation_invalidate_walskip(oldheap);
+        table_close(oldheap, AccessShareLock);
+    }
+
     /*
      * Swap the contents of the heap relations (including any toast tables).
      * Also set old heap's relfrozenxid to frozenXid.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index c1fd7b78ce..6a85ab890e 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2437,9 +2437,13 @@ CopyFrom(CopyState cstate)
         (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
          cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
     {
-        ti_options |= TABLE_INSERT_SKIP_FSM;
+        /*
+         * We can skip WAL-logging the insertions, unless PITR or streaming
+         * replication is in use. We can skip the FSM in any case.
+         */
         if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
+            table_relation_register_walskip(cstate->rel);
+        ti_options |= TABLE_INSERT_SKIP_FSM;
     }
 
     /*
@@ -3106,7 +3110,12 @@ CopyFrom(CopyState cstate)
 
     FreeExecutorState(estate);
 
-    table_finish_bulk_insert(cstate->rel, ti_options);
+    /*
+     * If we skipped writing WAL, then we will sync the heap at the end of
+     * the transaction. (We used to do it here, but it was later found out
+     * that to be safe, we must also avoid WAL-logging any subsequent
+     * actions on the pages we skipped WAL for). Indexes always use WAL.
+     */
 
     return processed;
 }
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 43c2fa9124..8b73654413 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+    if (!XLogIsNeeded())
+        table_relation_register_walskip(intoRelationDesc);
+    myState->ti_options = HEAP_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,7 +605,7 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    table_finish_bulk_insert(myState->rel, myState->ti_options);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close rel, but keep lock until commit */
     table_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 2aac63296b..33b7bc4c16 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -462,9 +462,10 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
     if (!XLogIsNeeded())
-        myState->ti_options |= TABLE_INSERT_SKIP_WAL;
+        table_relation_register_walskip(transientrel);
+    myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
+
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,7 +510,7 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    table_finish_bulk_insert(myState->transientrel, myState->ti_options);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close transientrel, but keep lock until commit */
     table_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8908b77d98..deb147c45a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4716,7 +4716,11 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 
         ti_options = TABLE_INSERT_SKIP_FSM;
         if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
+        {
+            /* Forget old relation's registerd sync */
+            table_relation_invalidate_walskip(oldrel);
+            table_relation_register_walskip(newrel);
+        }
     }
     else
     {
@@ -5000,7 +5004,7 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
     {
         FreeBulkInsertState(bistate);
 
-        table_finish_bulk_insert(newrel, ti_options);
+        /* If we skipped writing WAL, then it will be done at commit. */
 
         table_close(newrel, NoLock);
     }
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 6006249d96..64efecf48b 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
                    TransactionId OldestXmin, TransactionId FreezeXid,
-                   MultiXactId MultiXactCutoff, bool use_wal);
+                   MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
                    HeapTuple newTuple);
-- 
2.16.3

From f4a0cc5382805500c3db3d4ec2231cee383841f3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 2 Apr 2019 13:31:33 +0900
Subject: [PATCH 7/7] Remove TABLE/HEAP_INSERT_SKIP_WAL

Remove no-longer-used symbol TABLE/HEAP_INSERT_SKIP_WAL.
---
 src/include/access/heapam.h  |  3 +--
 src/include/access/tableam.h | 11 +++--------
 2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4c077755d5..5b084c2f5a 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,11 +29,10 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
-#define HEAP_INSERT_SPECULATIVE 0x0010
+#define HEAP_INSERT_SPECULATIVE 0x0008
 
 typedef struct BulkInsertStateData *BulkInsertState;
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 1a3a3c6711..b5203dd485 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -100,10 +100,9 @@ typedef struct TM_FailureData
 } TM_FailureData;
 
 /* "options" flag bits for table_insert */
-#define TABLE_INSERT_SKIP_WAL        0x0001
-#define TABLE_INSERT_SKIP_FSM        0x0002
-#define TABLE_INSERT_FROZEN            0x0004
-#define TABLE_INSERT_NO_LOGICAL        0x0008
+#define TABLE_INSERT_SKIP_FSM        0x0001
+#define TABLE_INSERT_FROZEN            0x0002
+#define TABLE_INSERT_NO_LOGICAL        0x0004
 
 /* flag bits fortable_lock_tuple */
 /* Follow tuples whose update is in progress if lock modes don't conflict  */
@@ -1017,10 +1016,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * behaviour of the AM. Several options might be ignored by AMs not supporting
  * them.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple will not
- * necessarily logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.  It's
-- 
2.16.3

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Robert Haas

Дата:

03 апреля 2019 г., 17:16:02

On Tue, Apr 2, 2019 at 6:54 AM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > By using DELETE and INSERT records to implement an UPDATE, you lose the ctid
> > chain and infomask bits that were present before crash recovery.  If that's
> > okay in these circumstances, please write a comment explaining why.
>
> Sounds reasonable. Added a comment. (Honestly I completely forgot
> about that.. Thanks!) (0006)

If you haven't already, I think you should set up a master and a
standby and wal_consistency_checking=all and run tests of this feature
on the master and see if anything breaks on the master or the standby.
I'm not sure that emitting an insert or delete record is going to
reproduce the exact same state on the standby that exists on the
master.

+ * Insert log record. Using delete or insert log loses HOT chain
+ * information but that happens only when newbuf is different from
+ * buffer, where HOT cannot happen.

"HOT chain information" seems pretty vague.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

04 апреля 2019 г., 05:03:20

Thank you for looking this.

At Wed, 3 Apr 2019 10:16:02 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoYEST4xYaU10gM=XXeA-oxbFh=qSfy0X4PXDCWubcgj=g@mail.gmail.com>
> On Tue, Apr 2, 2019 at 6:54 AM Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > By using DELETE and INSERT records to implement an UPDATE, you lose the ctid
> > > chain and infomask bits that were present before crash recovery.  If that's
> > > okay in these circumstances, please write a comment explaining why.
> >
> > Sounds reasonable. Added a comment. (Honestly I completely forgot
> > about that.. Thanks!) (0006)
> 
> If you haven't already, I think you should set up a master and a
> standby and wal_consistency_checking=all and run tests of this feature
> on the master and see if anything breaks on the master or the standby.
> I'm not sure that emitting an insert or delete record is going to
> reproduce the exact same state on the standby that exists on the
> master.

All of this patch is for wal_level = minimal. Doesn't make
changes in other cases. Updates are always replicated as
XLOG_HEAP_(HOT_)UPDATE. Crash recovery cases involving log_insert
or log_update are exercised by the TAP test.

> + * Insert log record. Using delete or insert log loses HOT chain
> + * information but that happens only when newbuf is different from
> + * buffer, where HOT cannot happen.
> 
> "HOT chain information" seems pretty vague.

Thanks. Actually I was a bit uneasy with "information". Does the
following make sense?

> * Insert log record, using delete or insert instead of update log
> * when only one of the two buffers needs WAL-logging. If this were a
> * HOT-update, redoing the WAL record would result in a broken
> * hot-chain. However, that never happens because updates complete on
> * a single page always use log_update.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Robert Haas

Дата:

04 апреля 2019 г., 17:52:59

On Wed, Apr 3, 2019 at 10:03 PM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > * Insert log record, using delete or insert instead of update log
> > * when only one of the two buffers needs WAL-logging. If this were a
> > * HOT-update, redoing the WAL record would result in a broken
> > * hot-chain. However, that never happens because updates complete on
> > * a single page always use log_update.

It makes sense grammatically, but I'm not sure I believe that it's
sound technically.  Even though it's only used in the non-HOT case,
it's still important that the CTID, XMIN, and XMAX fields are set
correctly during both normal operation and recovery.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

05 апреля 2019 г., 06:55:20

At Thu, 4 Apr 2019 10:52:59 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoZE0jW0jbQxAtoJgJNwrR1hyx3x8pUjQr=ggenLxnPoEQ@mail.gmail.com>
> On Wed, Apr 3, 2019 at 10:03 PM Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > * Insert log record, using delete or insert instead of update log
> > > * when only one of the two buffers needs WAL-logging. If this were a
> > > * HOT-update, redoing the WAL record would result in a broken
> > > * hot-chain. However, that never happens because updates complete on
> > > * a single page always use log_update.
> 
> It makes sense grammatically, but I'm not sure I believe that it's

Great to hear that!  I rewrote it as the following.

+ * Insert log record. When we are not running WAL-skipping, always use
+ * update log. Otherwise use delete or insert log instead when only
+ * one of the two buffers needs WAL-logging. If this were a
+ * HOT-update, redoing the WAL record would result in a broken
+ * hot-chain. However, that never happens because updates complete on
+ * a single page always use log_update.
+ *
+ * Using delete or insert log in place of udpate log leads to
+ * inconsistent series of WAL records. But note that WAL-skipping
+ * happens only when we are updating a tuple in a relation that has
+ * been create in the same transaction. Once commited, the WAL records
+ * recovers the same state of the relation as the synced state at the
+ * commit. Or the maybe-broken relation due to a crash before commit
+ * will be removed in recovery.

> sound technically.  Even though it's only used in the non-HOT case,
> it's still important that the CTID, XMIN, and XMAX fields are set
> correctly during both normal operation and recovery.

log_heap_delete()/log_heap_update() record the infomasks of the
deleted tuple as is. Xmax is stored from the same
variable. offnum is taken from the deleted tuple and buffer is
registered and xlrec.flags is set to the same value. As the
result Xmax, infomasks and ctid are restored to the same state by
heap_xlog_xlog_delete(). I didn't add a comment about that.

log_heap_insert()/log_heap_update() record the infomasks of the
inserted tuple as is. Xmin/Cmin and ctid related info are handled
the same way. But log_heap_insert() assumes that Xmax =
invalid. But that happens only when another transaction can see
it, which is not the case here. I added a command and assertion
before calling log_heap_insert().

+   * Coming here means that the old tuple is invisible and
+   * inoperable to another transaction. So xmax_new_tuple is
+   * expected to be InvalidTransactionId here.
+   */
+  Assert (xmax_new_tuple == InvalidTransactionId);
+  recptr = log_heap_insert(relation, buffer, newtup,


I noticed that I accidentally moved log_heap_new_cid stuff to
log_heap_insert/delete(). I restored them.

The attached v11 is the new version addressing the aboves and
rebased.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 12a6bc81a98bd15d7c8059c797fdca558d82f0d7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/7] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 291 ++++++++++++++++++++++++++++++++
 1 file changed, 291 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..4fa8be728e
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+    my($node, $test_name) = @_;
+
+    my $db_oid = $node->safe_psql('postgres',
+       "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+    my $prefix = "base/$db_oid/";
+    my $filepaths_referenced = $node->safe_psql('postgres', "
+       SELECT pg_relation_filepath(oid) FROM pg_class
+       WHERE reltablespace = 0 and relpersistence <> 't' and
+       pg_relation_filepath(oid) IS NOT NULL;");
+    is_deeply([sort(map { "$prefix$_" }
+                    grep(/^[0-9]+$/,
+                         slurp_dir($node->data_dir . "/$prefix")))],
+              [sort split /\n/, $filepaths_referenced],
+              $test_name);
+    return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    # Primary needs to have wal_level = minimal here
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+    $node->start;
+
+    # Setup
+    my $tablespace_dir = $node->basedir . '/tablespace_other';
+    mkdir ($tablespace_dir);
+    $tablespace_dir = TestLib::real_dir($tablespace_dir);
+    $node->safe_psql('postgres',
+       "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+    # Test direct truncation optimization.  No tuples
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test1 (id serial PRIMARY KEY);
+        TRUNCATE test1;
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+    is($result, qq(0),
+       "wal_level = $wal_level, optimized truncation with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2 (id serial PRIMARY KEY);
+        INSERT INTO test2 VALUES (DEFAULT);
+        TRUNCATE test2;
+        INSERT INTO test2 VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with inserted table");
+
+    # Data file for COPY query in follow-up tests.
+    my $basedir = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after the
+    # truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3;
+        COPY test3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+    is($result, qq(3),
+       "wal_level = $wal_level, optimized truncation with copied table");
+
+    # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+        COPY test3a FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # in different subtransaction patterns
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a2;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; RELEASE s;
+        COPY test3a2 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a3;
+        SAVEPOINT s;
+            ALTER TABLE test3a3 SET TABLESPACE other;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            ROLLBACK TO s2;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            RELEASE s2;
+        ROLLBACK TO s;
+        COPY test3a3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        COPY test3b FROM '$copy_file' DELIMITER ',';  -- set sync_above
+        UPDATE test3b SET id2 = id2 + 1;
+        DELETE FROM test3b;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+    is($result, qq(0),
+       "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+    # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test4;
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COPY test4 FROM '$copy_file' DELIMITER ',';
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+    is($result, qq(5),
+       "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test5 VALUES (DEFAULT, 1);
+        COPY test5 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER test6_before_row_insert
+          BEFORE INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+        CREATE TRIGGER test6_after_row_insert
+          AFTER INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+        COPY test6 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+    is($result, qq(9),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER test7_before_stat_truncate
+          BEFORE TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+        CREATE TRIGGER test7_after_stat_truncate
+          AFTER TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+        INSERT INTO test7 VALUES (DEFAULT, 1);
+        TRUNCATE test7;
+        COPY test7 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test redo of temp table creation.
+    $node->safe_psql('postgres', "
+        CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+    $node->stop('immediate');
+    $node->start;
+
+    check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.16.3

From 694d146936a0fe0943854b7ca81a59b251fa9c2a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:08 +0900
Subject: [PATCH 2/7] Write WAL for empty nbtree index build

After relation truncation indexes are also rebuild. It doesn't emit
WAL in minimal mode and if truncation happened within its creation
transaction, crash recovery leaves an empty index heap, which is
considered broken. This patch forces to emit WAL when an index_build
turns into empty nbtree index.
---
 src/backend/access/nbtree/nbtsort.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 9ac4c1e1c0..a31d58025f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -654,8 +654,16 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
     /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
     RelationOpenSmgr(wstate->index);
 
-    /* XLOG stuff */
-    if (wstate->btws_use_wal)
+    /* XLOG stuff
+     *
+     * Even when wal_level is minimal, WAL is required here if truncation
+     * happened after being created in the same transaction. This is hacky but
+     * we cannot use BufferNeedsWAL() stuff for nbtree since it can emit
+     * atomic WAL records on multiple buffers.
+     */
+    if (wstate->btws_use_wal ||
+        (RelationNeedsWAL(wstate->index) &&
+         (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0)))
     {
         /* We use the heap NEWPAGE record type for this */
         log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
-- 
2.16.3

From 0d2a38f20dabec2d87d7d021b3d0cc12c3fa016b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 25 Mar 2019 13:29:50 +0900
Subject: [PATCH 3/7] Move XLOG stuff from heap_insert and heap_delete

Succeeding commit makes heap_update emit insert and delete WAL
records. Move out XLOG stuff for insert and delete so that heap_update
can use the stuff.
---
 src/backend/access/heap/heapam.c | 252 ++++++++++++++++++++++-----------------
 1 file changed, 145 insertions(+), 107 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a05b6a07ad..223be30eb3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -72,6 +72,11 @@
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
                     TransactionId xid, CommandId cid, int options);
+static XLogRecPtr log_heap_insert(Relation relation, Buffer buffer,
+                HeapTuple heaptup, int options, bool all_visible_cleared);
+static XLogRecPtr log_heap_delete(Relation relation, Buffer buffer,
+                HeapTuple tp, HeapTuple old_key_tuple, TransactionId new_xmax,
+                bool changingPart, bool all_visible_cleared);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
                 Buffer newbuf, HeapTuple oldtup,
                 HeapTuple newtup, HeapTuple old_key_tup,
@@ -1875,6 +1880,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     TransactionId xid = GetCurrentTransactionId();
     HeapTuple    heaptup;
     Buffer        buffer;
+    Page        page;
     Buffer        vmbuffer = InvalidBuffer;
     bool        all_visible_cleared = false;
 
@@ -1911,16 +1917,18 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
      */
     CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
 
+    page = BufferGetPage(buffer);
+
     /* NO EREPORT(ERROR) from here till changes are logged */
     START_CRIT_SECTION();
 
     RelationPutHeapTuple(relation, buffer, heaptup,
                          (options & HEAP_INSERT_SPECULATIVE) != 0);
 
-    if (PageIsAllVisible(BufferGetPage(buffer)))
+    if (PageIsAllVisible(page))
     {
         all_visible_cleared = true;
-        PageClearAllVisible(BufferGetPage(buffer));
+        PageClearAllVisible(page);
         visibilitymap_clear(relation,
                             ItemPointerGetBlockNumber(&(heaptup->t_self)),
                             vmbuffer, VISIBILITYMAP_VALID_BITS);
@@ -1942,12 +1950,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     /* XLOG stuff */
     if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
     {
-        xl_heap_insert xlrec;
-        xl_heap_header xlhdr;
         XLogRecPtr    recptr;
-        Page        page = BufferGetPage(buffer);
-        uint8        info = XLOG_HEAP_INSERT;
-        int            bufflags = 0;
 
         /*
          * If this is a catalog, we need to transmit combocids to properly
@@ -1956,61 +1959,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
         if (RelationIsAccessibleInLogicalDecoding(relation))
             log_heap_new_cid(relation, heaptup);
 
-        /*
-         * If this is the single and first tuple on page, we can reinit the
-         * page instead of restoring the whole thing.  Set flag, and hide
-         * buffer references from XLogInsert.
-         */
-        if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
-            PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
-        {
-            info |= XLOG_HEAP_INIT_PAGE;
-            bufflags |= REGBUF_WILL_INIT;
-        }
-
-        xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
-        xlrec.flags = 0;
-        if (all_visible_cleared)
-            xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
-        if (options & HEAP_INSERT_SPECULATIVE)
-            xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
-        Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
-
-        /*
-         * For logical decoding, we need the tuple even if we're doing a full
-         * page write, so make sure it's included even if we take a full-page
-         * image. (XXX We could alternatively store a pointer into the FPW).
-         */
-        if (RelationIsLogicallyLogged(relation) &&
-            !(options & HEAP_INSERT_NO_LOGICAL))
-        {
-            xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
-            bufflags |= REGBUF_KEEP_DATA;
-        }
-
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
-
-        xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
-        xlhdr.t_infomask = heaptup->t_data->t_infomask;
-        xlhdr.t_hoff = heaptup->t_data->t_hoff;
-
-        /*
-         * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
-         * write the whole page to the xlog, we don't need to store
-         * xl_heap_header in the xlog.
-         */
-        XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
-        XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
-        /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
-        XLogRegisterBufData(0,
-                            (char *) heaptup->t_data + SizeofHeapTupleHeader,
-                            heaptup->t_len - SizeofHeapTupleHeader);
-
-        /* filtering by origin on a row level is much more efficient */
-        XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
-
-        recptr = XLogInsert(RM_HEAP_ID, info);
+        recptr = log_heap_insert(relation, buffer, heaptup,
+                                 options, all_visible_cleared);
 
         PageSetLSN(page, recptr);
     }
@@ -2733,58 +2683,15 @@ l1:
      */
     if (RelationNeedsWAL(relation))
     {
-        xl_heap_delete xlrec;
-        xl_heap_header xlhdr;
         XLogRecPtr    recptr;
 
         /* For logical decode we need combocids to properly decode the catalog */
         if (RelationIsAccessibleInLogicalDecoding(relation))
             log_heap_new_cid(relation, &tp);
 
-        xlrec.flags = 0;
-        if (all_visible_cleared)
-            xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
-        if (changingPart)
-            xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
-        xlrec.infobits_set = compute_infobits(tp.t_data->t_infomask,
-                                              tp.t_data->t_infomask2);
-        xlrec.offnum = ItemPointerGetOffsetNumber(&tp.t_self);
-        xlrec.xmax = new_xmax;
-
-        if (old_key_tuple != NULL)
-        {
-            if (relation->rd_rel->relreplident == REPLICA_IDENTITY_FULL)
-                xlrec.flags |= XLH_DELETE_CONTAINS_OLD_TUPLE;
-            else
-                xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
-        }
-
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
-
-        XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-
-        /*
-         * Log replica identity of the deleted tuple if there is one
-         */
-        if (old_key_tuple != NULL)
-        {
-            xlhdr.t_infomask2 = old_key_tuple->t_data->t_infomask2;
-            xlhdr.t_infomask = old_key_tuple->t_data->t_infomask;
-            xlhdr.t_hoff = old_key_tuple->t_data->t_hoff;
-
-            XLogRegisterData((char *) &xlhdr, SizeOfHeapHeader);
-            XLogRegisterData((char *) old_key_tuple->t_data
-                             + SizeofHeapTupleHeader,
-                             old_key_tuple->t_len
-                             - SizeofHeapTupleHeader);
-        }
-
-        /* filtering by origin on a row level is much more efficient */
-        XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
-
-        recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
 
+        recptr = log_heap_delete(relation, buffer, &tp, old_key_tuple, new_xmax,
+                                 changingPart, all_visible_cleared);
         PageSetLSN(page, recptr);
     }
 
@@ -7248,6 +7155,137 @@ log_heap_visible(RelFileNode rnode, Buffer heap_buffer, Buffer vm_buffer,
     return recptr;
 }
 
+/*
+ * Perform XLogInsert for a heap-insert operation.  Caller must already
+ * have modified the buffer and marked it dirty.
+ */
+XLogRecPtr
+log_heap_insert(Relation relation, Buffer buffer,
+                HeapTuple heaptup, int options, bool all_visible_cleared)
+{
+    xl_heap_insert xlrec;
+    xl_heap_header xlhdr;
+    uint8        info = XLOG_HEAP_INSERT;
+    int            bufflags = 0;
+    Page        page = BufferGetPage(buffer);
+
+    /*
+     * If this is the single and first tuple on page, we can reinit the
+     * page instead of restoring the whole thing.  Set flag, and hide
+     * buffer references from XLogInsert.
+     */
+    if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
+        PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
+    {
+        info |= XLOG_HEAP_INIT_PAGE;
+        bufflags |= REGBUF_WILL_INIT;
+    }
+
+    xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
+    xlrec.flags = 0;
+    if (all_visible_cleared)
+        xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
+    if (options & HEAP_INSERT_SPECULATIVE)
+        xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
+    Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
+
+    /*
+     * For logical decoding, we need the tuple even if we're doing a full
+     * page write, so make sure it's included even if we take a full-page
+     * image. (XXX We could alternatively store a pointer into the FPW).
+     */
+    if (RelationIsLogicallyLogged(relation) &&
+        !(options & HEAP_INSERT_NO_LOGICAL))
+    {
+        xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
+        bufflags |= REGBUF_KEEP_DATA;
+    }
+
+    XLogBeginInsert();
+    XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
+
+    xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
+    xlhdr.t_infomask = heaptup->t_data->t_infomask;
+    xlhdr.t_hoff = heaptup->t_data->t_hoff;
+
+    /*
+     * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
+     * write the whole page to the xlog, we don't need to store
+     * xl_heap_header in the xlog.
+     */
+    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
+    XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
+    /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
+    XLogRegisterBufData(0,
+                        (char *) heaptup->t_data + SizeofHeapTupleHeader,
+                        heaptup->t_len - SizeofHeapTupleHeader);
+
+    /* filtering by origin on a row level is much more efficient */
+    XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
+    return XLogInsert(RM_HEAP_ID, info);
+}
+
+/*
+ * Perform XLogInsert for a heap-insert operation.  Caller must already
+ * have modified the buffer and marked it dirty.
+ *
+ * NB: heap_abort_speculative() uses the same xlog record and replay
+ * routines.
+ */
+static XLogRecPtr
+log_heap_delete(Relation relation, Buffer buffer,
+                HeapTuple tp, HeapTuple old_key_tuple, TransactionId new_xmax,
+                bool changingPart, bool all_visible_cleared)
+{
+    xl_heap_delete xlrec;
+    xl_heap_header xlhdr;
+
+    xlrec.flags = 0;
+    if (all_visible_cleared)
+        xlrec.flags |= XLH_DELETE_ALL_VISIBLE_CLEARED;
+    if (changingPart)
+        xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;
+    xlrec.infobits_set = compute_infobits(tp->t_data->t_infomask,
+                                          tp->t_data->t_infomask2);
+    xlrec.offnum = ItemPointerGetOffsetNumber(&tp->t_self);
+    xlrec.xmax = new_xmax;
+
+    if (old_key_tuple != NULL)
+    {
+        if (relation->rd_rel->relreplident == REPLICA_IDENTITY_FULL)
+            xlrec.flags |= XLH_DELETE_CONTAINS_OLD_TUPLE;
+        else
+            xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
+    }
+
+    XLogBeginInsert();
+    XLogRegisterData((char *) &xlrec, SizeOfHeapDelete);
+
+    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+    /*
+     * Log replica identity of the deleted tuple if there is one
+     */
+    if (old_key_tuple != NULL)
+    {
+        xlhdr.t_infomask2 = old_key_tuple->t_data->t_infomask2;
+        xlhdr.t_infomask = old_key_tuple->t_data->t_infomask;
+        xlhdr.t_hoff = old_key_tuple->t_data->t_hoff;
+
+        XLogRegisterData((char *) &xlhdr, SizeOfHeapHeader);
+        XLogRegisterData((char *) old_key_tuple->t_data
+                         + SizeofHeapTupleHeader,
+                         old_key_tuple->t_len
+                         - SizeofHeapTupleHeader);
+    }
+
+    /* filtering by origin on a row level is much more efficient */
+    XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
+
+    return XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
+}
+
 /*
  * Perform XLogInsert for a heap-update operation.  Caller must already
  * have modified the buffer(s) and marked them dirty.
-- 
2.16.3

From 9e1295b47c3a55298b96e183f158328c29d1adf8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 2 Apr 2019 11:53:36 +0900
Subject: [PATCH 4/7] Add new interface to TableAmRoutine

Add two interface functions to TableAmRoutine, which are related to
WAL-skipping feature.
---
 src/backend/access/table/tableamapi.c |  4 ++
 src/include/access/tableam.h          | 79 +++++++++++++++++++++++------------
 2 files changed, 56 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index bfd713f3af..56b5d521de 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -93,6 +93,10 @@ GetTableAmRoutine(Oid amhandler)
            (routine->scan_bitmap_next_tuple == NULL));
     Assert(routine->scan_sample_next_block != NULL);
     Assert(routine->scan_sample_next_tuple != NULL);
+    Assert((routine->relation_register_walskip == NULL) ==
+           (routine->relation_invalidate_walskip == NULL) &&
+           (routine->relation_register_walskip == NULL) ==
+           (routine->finish_bulk_insert == NULL));
 
     return routine;
 }
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index a647e7db32..38a00d8823 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -389,19 +389,15 @@ typedef struct TableAmRoutine
 
     /*
      * Perform operations necessary to complete insertions made via
-     * tuple_insert and multi_insert with a BulkInsertState specified. This
-     * e.g. may e.g. used to flush the relation when inserting with
-     * TABLE_INSERT_SKIP_WAL specified.
+     * tuple_insert and multi_insert or page-level copying performed by ALTER
+     * TABLE rewrite. This is called at commit time if WAL-skipping is
+     * activated and the caller decided that any finish work is required to
+     * the file.
      *
-     * Typically callers of tuple_insert and multi_insert will just pass all
-     * the flags the apply to them, and each AM has to decide which of them
-     * make sense for it, and then only take actions in finish_bulk_insert
-     * that make sense for a specific AM.
-     *
-     * Optional callback.
+     * Optional callback. Must be provided when relation_register_walskip is
+     * provided.
      */
-    void        (*finish_bulk_insert) (Relation rel, int options);
-
+    void        (*finish_bulk_insert) (RelFileNode rnode, ForkNumber forkNum);
 
     /* ------------------------------------------------------------------------
      * DDL related functionality.
@@ -454,6 +450,26 @@ typedef struct TableAmRoutine
                                               double *tups_vacuumed,
                                               double *tups_recently_dead);
 
+    /*
+     * Register WAL-skipping on the current storage of rel. WAL-logging on the
+     * relation is skipped and the storage will be synced at commit. Returns
+     * true if successfully registered, and finish_bulk_insert() is called at
+     * commit.
+     *
+     * Optional callback.
+     */
+    void        (*relation_register_walskip) (Relation rel);
+
+    /*
+     * Invalidate registered WAL skipping on the current storage of rel. The
+     * function is called when the storage of the relation is going to be
+     * out-of-use after commit.
+     *
+     * Optional callback. Must be provided when relation_register_walskip is
+     * provided.
+     */
+    void        (*relation_invalidate_walskip) (Relation rel);
+
     /*
      * React to VACUUM command on the relation. The VACUUM might be user
      * triggered or by autovacuum. The specific actions performed by the AM
@@ -1034,8 +1050,7 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  *
  *
  * The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
  *
  * On return the slot's tts_tid and tts_tableOid are updated to reflect the
  * insertion. But note that any toasting of fields within the slot is NOT
@@ -1231,20 +1246,6 @@ table_lock_tuple(Relation rel, ItemPointer tid, Snapshot snapshot,
                                        flags, tmfd);
 }
 
-/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
- */
-static inline void
-table_finish_bulk_insert(Relation rel, int options)
-{
-    /* optional callback */
-    if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
-        rel->rd_tableam->finish_bulk_insert(rel, options);
-}
-
 
 /* ------------------------------------------------------------------------
  * DDL related functionality.
@@ -1328,6 +1329,30 @@ table_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
                                                    tups_recently_dead);
 }
 
+/*
+ * Register WAL-skipping to the relation. WAL-logging is skipped for the new
+ * pages after this call and the relation file is going to be synced at
+ * commit.
+ */
+static inline void
+table_relation_register_walskip(Relation rel)
+{
+    if (rel->rd_tableam && rel->rd_tableam->relation_register_walskip)
+        rel->rd_tableam->relation_register_walskip(rel);
+}
+
+/*
+ * Unregister WAL-skipping to the relation. Call this when the relation is
+ * going to be out-of-use after commit. WAL-skipping continues but the
+ * relation won't be synced at commit.
+ */
+static inline void
+table_relation_invalidate_walskip(Relation rel)
+{
+    if (rel->rd_tableam && rel->rd_tableam->relation_invalidate_walskip)
+        rel->rd_tableam->relation_invalidate_walskip(rel);
+}
+
 /*
  * Perform VACUUM on the relation. The VACUUM can be user triggered or by
  * autovacuum. The specific actions performed by the AM will depend heavily on
-- 
2.16.3

From 0b8b3ce573e27941692ac5462db7cd6f8d0b2209 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 2 Apr 2019 18:05:10 +0900
Subject: [PATCH 5/7] Add infrastructure to WAL-logging skip feature

We used to optimize WAL-logging for truncation of in-transaction
crated tables in minimal mode by just signaling by
HEAP_INSERT_SKIP_WAL option on heap operations. This mechanism can
emit WAL records that results in corrupt state for certain series of
in-transaction operations. This patch provides infrastructure to track
pending at-commit fsyncs for a relation and in-transaction
truncations. table_relation_register_walskip() should be used to start
tracking before batch operations like COPY and CLUSTER, and use
BufferNeedsWAL() instead of RelationNeedsWAL() at the places related
to WAL-logging about heap-modifying operations, then remove
call to table_finish_bulk_insert() and the tableam intaface.
---
 src/backend/access/transam/xact.c   |  12 +-
 src/backend/catalog/storage.c       | 612 +++++++++++++++++++++++++++++++++---
 src/backend/commands/tablecmds.c    |   6 +-
 src/backend/storage/buffer/bufmgr.c |  39 ++-
 src/backend/utils/cache/relcache.c  |   3 +
 src/include/catalog/storage.h       |  17 +-
 src/include/storage/bufmgr.h        |   2 +
 src/include/utils/rel.h             |   7 +
 8 files changed, 631 insertions(+), 67 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index bd5024ef00..a2c689f414 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2111,6 +2111,9 @@ CommitTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrFinishBulkInsert(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2343,6 +2346,9 @@ PrepareTransaction(void)
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
+    /* Flush updates to relations that we didn't WAL-logged */
+    smgrFinishBulkInsert(true);
+
     /*
      * Mark serializable transaction as complete for predicate locking
      * purposes.  This should be done as late as we can put it and still allow
@@ -2668,6 +2674,7 @@ AbortTransaction(void)
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
     AtAbort_Twophase();
+    smgrFinishBulkInsert(false);    /* abandon pending syncs */
 
     /*
      * Advertise the fact that we aborted in pg_xact (assuming that we got as
@@ -4801,8 +4808,7 @@ CommitSubTransaction(void)
     AtEOSubXact_RelationCache(true, s->subTransactionId,
                               s->parent->subTransactionId);
     AtEOSubXact_Inval(true);
-    AtSubCommit_smgr();
-
+    AtSubCommit_smgr(s->subTransactionId, s->parent->subTransactionId);
     /*
      * The only lock we actually release here is the subtransaction XID lock.
      */
@@ -4979,7 +4985,7 @@ AbortSubTransaction(void)
         ResourceOwnerRelease(s->curTransactionOwner,
                              RESOURCE_RELEASE_AFTER_LOCKS,
                              false, false);
-        AtSubAbort_smgr();
+        AtSubAbort_smgr(s->subTransactionId, s->parent->subTransactionId);
 
         AtEOXact_GUC(false, s->gucNestLevel);
         AtEOSubXact_SPI(false, s->subTransactionId);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 72242b2476..4cd112f86c 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -21,6 +21,7 @@
 
 #include "miscadmin.h"
 
+#include "access/tableam.h"
 #include "access/visibilitymap.h"
 #include "access/xact.h"
 #include "access/xlog.h"
@@ -29,10 +30,18 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
-#include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+ /* #define STORAGEDEBUG */    /* turns DEBUG elogs on */
+
+#ifdef STORAGEDEBUG
+#define STORAGE_elog(...)                elog(__VA_ARGS__)
+#else
+#define STORAGE_elog(...)
+#endif
+
 /*
  * We keep a list of all relations (represented as RelFileNode values)
  * that have been created or deleted in the current transaction.  When
@@ -64,6 +73,61 @@ typedef struct PendingRelDelete
 
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 
+/*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action (an optimization possible with wal_level=minimal). When we are
+ * about to skip WAL-logging, a RelWalSkip entry is created, and
+ * 'skip_wal_min_blk' is set to the current size of the relation. Any
+ * operations on blocks < skip_wal_min_blk need to be WAL-logged as usual, but
+ * for operations on higher blocks, WAL-logging is skipped.
+
+ *
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+typedef struct RelWalSkip
+{
+    RelFileNode relnode;            /* relation created in same xact */
+    bool        forks[MAX_FORKNUM + 1];    /* target forknums */
+    BlockNumber skip_wal_min_blk;    /* WAL-logging skipped for blocks >=
+                                     * skip_wal_min_blk */
+    BlockNumber wal_log_min_blk;     /* The minimum blk number that requires
+                                     * WAL-logging even if skipped by the
+                                     * above*/
+    SubTransactionId create_sxid;    /* subxid where this entry is created */
+    SubTransactionId invalidate_sxid; /* subxid where this entry is
+                                       * invalidated */
+    const TableAmRoutine *tableam;    /* Table access routine */
+}    RelWalSkip;
+
+/* Relations that need to be fsync'd at commit */
+static HTAB *walSkipHash = NULL;
+
+static RelWalSkip *getWalSkipEntry(Relation rel, bool create);
+static RelWalSkip *getWalSkipEntryRNode(RelFileNode *node,
+                                                      bool create);
+static void smgrProcessWALSkipInval(bool isCommit, SubTransactionId mySubid,
+                        SubTransactionId parentSubid);
+
 /*
  * RelationCreateStorage
  *        Create physical storage for a relation.
@@ -261,31 +325,59 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
      */
     if (RelationNeedsWAL(rel))
     {
-        /*
-         * Make an XLOG entry reporting the file truncation.
-         */
-        XLogRecPtr    lsn;
-        xl_smgr_truncate xlrec;
+        RelWalSkip *walskip;
 
-        xlrec.blkno = nblocks;
-        xlrec.rnode = rel->rd_node;
-        xlrec.flags = SMGR_TRUNCATE_ALL;
-
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
-
-        lsn = XLogInsert(RM_SMGR_ID,
-                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+        /* get pending sync entry, create if not yet */
+        walskip = getWalSkipEntry(rel, true);
 
         /*
-         * Flush, because otherwise the truncation of the main relation might
-         * hit the disk before the WAL record, and the truncation of the FSM
-         * or visibility map. If we crashed during that window, we'd be left
-         * with a truncated heap, but the FSM or visibility map would still
-         * contain entries for the non-existent heap pages.
+         * walskip is null here if rel doesn't support WAL-logging skip,
+         * otherwise check for WAL-skipping status.
          */
-        if (fsm || vm)
-            XLogFlush(lsn);
+        if (walskip == NULL ||
+            walskip->skip_wal_min_blk == InvalidBlockNumber ||
+            walskip->skip_wal_min_blk < nblocks)
+        {
+            /*
+             * If WAL-skipping is enabled, this is the first time truncation
+             * of this relation in this transaction or truncation that leaves
+             * pages that need at-commit fsync.  Make an XLOG entry reporting
+             * the file truncation.
+             */
+            XLogRecPtr        lsn;
+            xl_smgr_truncate xlrec;
+
+            xlrec.blkno = nblocks;
+            xlrec.rnode = rel->rd_node;
+            xlrec.flags = SMGR_TRUNCATE_ALL;
+
+            XLogBeginInsert();
+            XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+            lsn = XLogInsert(RM_SMGR_ID,
+                             XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+            STORAGE_elog(DEBUG2,
+                         "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+                         rel->rd_node.spcNode, rel->rd_node.dbNode,
+                         rel->rd_node.relNode, nblocks);
+            /*
+             * Flush, because otherwise the truncation of the main relation
+             * might hit the disk before the WAL record, and the truncation of
+             * the FSM or visibility map. If we crashed during that window,
+             * we'd be left with a truncated heap, but the FSM or visibility
+             * map would still contain entries for the non-existent heap
+             * pages.
+             */
+            if (fsm || vm)
+                XLogFlush(lsn);
+
+            if (walskip)
+            {
+                /* no longer skip WAL-logging for the blocks */
+                walskip->wal_log_min_blk = nblocks;
+            }
+        }
     }
 
     /* Do the real work */
@@ -296,8 +388,7 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
  * Copy a fork's data, block by block.
  */
 void
-RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
-                    ForkNumber forkNum, char relpersistence)
+RelationCopyStorage(Relation srcrel, SMgrRelation dst, ForkNumber forkNum)
 {
     PGAlignedBlock buf;
     Page        page;
@@ -305,6 +396,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
     bool        copying_initfork;
     BlockNumber nblocks;
     BlockNumber blkno;
+    SMgrRelation src = srcrel->rd_smgr;
+    char         relpersistence = srcrel->rd_rel->relpersistence;
 
     page = (Page) buf.data;
 
@@ -316,12 +409,33 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
     copying_initfork = relpersistence == RELPERSISTENCE_UNLOGGED &&
         forkNum == INIT_FORKNUM;
 
-    /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a permanent relation.
-     */
-    use_wal = XLogIsNeeded() &&
-        (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
+    if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
+    {
+        /*
+         * We need to log the copied data in WAL iff WAL archiving/streaming
+         * is enabled AND it's a permanent relation.
+         */
+        if (XLogIsNeeded())
+            use_wal = true;
+
+        /*
+         * If the rel is WAL-logged, must fsync before commit.  We use
+         * heap_sync to ensure that the toast table gets fsync'd too.  (For a
+         * temp or unlogged rel we don't care since the data will be gone
+         * after a crash anyway.)
+         *
+         * It's obvious that we must do this when not WAL-logging the
+         * copy. It's less obvious that we have to do it even if we did
+         * WAL-log the copied pages. The reason is that since we're copying
+         * outside shared buffers, a CHECKPOINT occurring during the copy has
+         * no way to flush the previously written data to disk (indeed it
+         * won't know the new rel even exists).  A crash later on would replay
+         * WAL from the checkpoint, therefore it wouldn't replay our earlier
+         * WAL entries. If we do not fsync those pages here, they might still
+         * not be on disk when the crash occurs.
+         */
+        RecordPendingSync(srcrel, dst, forkNum);
+    }
 
     nblocks = smgrnblocks(src, forkNum);
 
@@ -358,24 +472,321 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
          */
         smgrextend(dst, forkNum, blkno, buf.data, true);
     }
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous RecordPendingSync() requests.
+ *
+ * Note that it is required to check this before creating any WAL records for
+ * heap pages - it is not merely an optimization! WAL-logging a record, when
+ * we have already skipped a previous WAL record for the same page could lead
+ * to failure at WAL replay, as the "before" state expected by the record
+ * might not match what's on disk. Also, if the heap was truncated earlier, we
+ * must WAL-log any changes to the once-truncated blocks, because replaying
+ * the truncation record will destroy them.
+ */
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+    BlockNumber        blkno = InvalidBlockNumber;
+    RelWalSkip *walskip;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    /* fetch existing pending sync entry */
+    walskip = getWalSkipEntry(rel, false);
 
     /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.  (For a temp or
-     * unlogged rel we don't care since the data will be gone after a crash
-     * anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the copy. It's
-     * less obvious that we have to do it even if we did WAL-log the copied
-     * pages. The reason is that since we're copying outside shared buffers, a
-     * CHECKPOINT occurring during the copy has no way to flush the previously
-     * written data to disk (indeed it won't know the new rel even exists).  A
-     * crash later on would replay WAL from the checkpoint, therefore it
-     * wouldn't replay our earlier WAL entries. If we do not fsync those pages
-     * here, they might still not be on disk when the crash occurs.
+     * no point in doing further work if we know that we don't skip
+     * WAL-logging.
      */
-    if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
-        smgrimmedsync(dst, forkNum);
+    if (!walskip)
+    {
+        STORAGE_elog(DEBUG2,
+                     "not skipping WAL-logging for rel %u/%u/%u block %u",
+                     rel->rd_node.spcNode, rel->rd_node.dbNode,
+                     rel->rd_node.relNode, BufferGetBlockNumber(buf));
+        return true;
+    }
+
+    Assert(BufferIsValid(buf));
+
+    blkno = BufferGetBlockNumber(buf);
+
+    /*
+     * We don't skip WAL-logging for pages that once done.
+     */
+    if (walskip->skip_wal_min_blk == InvalidBlockNumber ||
+        walskip->skip_wal_min_blk > blkno)
+    {
+        STORAGE_elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because skip_wal_min_blk is %u",
+                     rel->rd_node.spcNode, rel->rd_node.dbNode,
+                     rel->rd_node.relNode, blkno, walskip->skip_wal_min_blk);
+        return true;
+    }
+
+    /*
+     * we don't skip WAL-logging for blocks that have got WAL-logged
+     * truncation
+     */
+    if (walskip->wal_log_min_blk != InvalidBlockNumber &&
+        walskip->wal_log_min_blk <= blkno)
+    {
+        STORAGE_elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because wal_log_min_blk is %u",
+                     rel->rd_node.spcNode, rel->rd_node.dbNode,
+                     rel->rd_node.relNode, blkno, walskip->wal_log_min_blk);
+        return true;
+    }
+
+    STORAGE_elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode,
+                 rel->rd_node.relNode, blkno);
+
+    return false;
+}
+
+bool
+BlockNeedsWAL(Relation rel, BlockNumber blkno)
+{
+    RelWalSkip *walskip;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    /* fetch exising pending sync entry */
+    walskip = getWalSkipEntry(rel, false);
+
+    /*
+     * no point in doing further work if we know that we don't skip
+     * WAL-logging.
+     */
+    if (!walskip)
+        return true;
+
+    /*
+     * We don't skip WAL-logging for pages that once done.
+     */
+    if (walskip->skip_wal_min_blk == InvalidBlockNumber ||
+        walskip->skip_wal_min_blk > blkno)
+    {
+        STORAGE_elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because skip_wal_min_blk is %u",
+                     rel->rd_node.spcNode, rel->rd_node.dbNode,
+                     rel->rd_node.relNode, blkno, walskip->skip_wal_min_blk);
+        return true;
+    }
+
+    /*
+     * we don't skip WAL-logging for blocks that have got WAL-logged
+     * truncation
+     */
+    if (walskip->wal_log_min_blk != InvalidBlockNumber &&
+        walskip->wal_log_min_blk <= blkno)
+    {
+        STORAGE_elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because wal_log_min_blk is %u",
+                     rel->rd_node.spcNode, rel->rd_node.dbNode,
+                     rel->rd_node.relNode, blkno, walskip->wal_log_min_blk);
+
+        return true;
+    }
+
+    STORAGE_elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode,
+                 rel->rd_node.relNode, blkno);
+
+    return false;
+}
+
+/*
+ * Remember that the given relation doesn't need WAL-logging for the blocks
+ * after the current block size and for the blocks that are going to be synced
+ * at commit.
+ */
+void
+RecordWALSkipping(Relation rel)
+{
+    RelWalSkip *walskip;
+
+    Assert(RelationNeedsWAL(rel));
+
+    /* get pending sync entry, create if not yet  */
+    walskip = getWalSkipEntry(rel, true);
+
+    if (walskip == NULL)
+        return;
+
+    /*
+     *  Record only the first registration.
+     */
+    if (walskip->skip_wal_min_blk != InvalidBlockNumber)
+    {
+        STORAGE_elog(DEBUG2, "WAL skipping for rel %u/%u/%u was already registered at block %u (new %u)",
+                     rel->rd_node.spcNode, rel->rd_node.dbNode,
+                     rel->rd_node.relNode, walskip->skip_wal_min_blk,
+                     RelationGetNumberOfBlocks(rel));
+        return;
+    }
+
+    STORAGE_elog(DEBUG2, "registering new WAL skipping rel %u/%u/%u at block %u",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode,
+                 rel->rd_node.relNode, RelationGetNumberOfBlocks(rel));
+
+    walskip->skip_wal_min_blk = RelationGetNumberOfBlocks(rel);
+}
+
+/*
+ * Record commit-time file sync. This shouldn't be used mixing with
+ * RecordWALSkipping.
+ */
+void
+RecordPendingSync(Relation rel, SMgrRelation targetsrel, ForkNumber forknum)
+{
+    RelWalSkip *walskip;
+
+    Assert(RelationNeedsWAL(rel));
+
+    /* check for support for this feature */
+    if (rel->rd_tableam == NULL ||
+        rel->rd_tableam->relation_register_walskip == NULL)
+        return;
+
+    walskip = getWalSkipEntryRNode(&targetsrel->smgr_rnode.node, true);
+    walskip->forks[forknum] = true;
+    walskip->skip_wal_min_blk = 0;
+    walskip->tableam = rel->rd_tableam;
+
+    STORAGE_elog(DEBUG2,
+                 "registering new pending sync for rel %u/%u/%u at block %u",
+                 walskip->relnode.spcNode, walskip->relnode.dbNode,
+                 walskip->relnode.relNode, 0);
+}
+
+/*
+ * RelationInvalidateWALSkip() -- invalidate WAL-skip entry
+ */
+void
+RelationInvalidateWALSkip(Relation rel)
+{
+    RelWalSkip *walskip;
+
+    /* we know we don't have one */
+    if (rel->rd_nowalskip)
+        return;
+
+    walskip = getWalSkipEntry(rel, false);
+
+    if (!walskip)
+        return;
+
+    /*
+     * The state is reset at subtransaction commit/abort. No invalidation
+     * request must not come for the same relation in the same subtransaction.
+     */
+    Assert(walskip->invalidate_sxid == InvalidSubTransactionId);
+
+    walskip->invalidate_sxid = GetCurrentSubTransactionId();
+
+    STORAGE_elog(DEBUG2,
+                 "WAL skip of rel %u/%u/%u invalidated by sxid %d",
+                 walskip->relnode.spcNode, walskip->relnode.dbNode,
+                 walskip->relnode.relNode, walskip->invalidate_sxid);
+}
+
+/*
+ * getWalSkipEntry: get WAL skip entry.
+ *
+ * Returns WAL skip entry for the relation. The entry tracks WAL-skipping
+ * blocks for the relation.  The WAL-skipped blocks need fsync at commit time.
+ * Creates one if needed when create is true. If rel doesn't support this
+ * feature, returns true even if create is true.
+ */
+static inline RelWalSkip *
+getWalSkipEntry(Relation rel, bool create)
+{
+    RelWalSkip *walskip_entry = NULL;
+
+    if (rel->rd_walskip)
+        return rel->rd_walskip;
+
+    /* we know we don't have pending sync entry */
+    if (!create && rel->rd_nowalskip)
+        return NULL;
+
+    /* check for support for this feature */
+    if (rel->rd_tableam == NULL ||
+        rel->rd_tableam->relation_register_walskip == NULL)
+    {
+        rel->rd_nowalskip = true;
+        return NULL;
+    }
+
+    walskip_entry = getWalSkipEntryRNode(&rel->rd_node, create);
+
+    if (!walskip_entry)
+    {
+        /* prevent further hash lookup */
+        rel->rd_nowalskip = true;
+        return NULL;
+    }
+
+    walskip_entry->forks[MAIN_FORKNUM] = true;
+    walskip_entry->tableam = rel->rd_tableam;
+
+    /* hold shortcut in Relation */
+    rel->rd_nowalskip = false;
+    rel->rd_walskip = walskip_entry;
+
+    return walskip_entry;
+}
+
+/*
+ * getWalSkipEntryRNode: get WAL skip entry by rnode
+ *
+ * Returns a WAL skip entry for the RelFileNode.
+ */
+static RelWalSkip *
+getWalSkipEntryRNode(RelFileNode *rnode, bool create)
+{
+    RelWalSkip *walskip_entry = NULL;
+    bool            found;
+
+    if (!walSkipHash)
+    {
+        /* First time through: initialize the hash table */
+        HASHCTL        ctl;
+
+        if (!create)
+            return NULL;
+
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(RelFileNode);
+        ctl.entrysize = sizeof(RelWalSkip);
+        ctl.hash = tag_hash;
+        walSkipHash = hash_create("pending relation sync table", 5,
+                                   &ctl, HASH_ELEM | HASH_FUNCTION);
+    }
+
+    walskip_entry = (RelWalSkip *)
+        hash_search(walSkipHash, (void *) rnode,
+                    create ? HASH_ENTER: HASH_FIND,    &found);
+
+    if (!walskip_entry)
+        return NULL;
+
+    /* new entry created */
+    if (!found)
+    {
+        memset(&walskip_entry->forks, 0, sizeof(walskip_entry->forks));
+        walskip_entry->wal_log_min_blk = InvalidBlockNumber;
+        walskip_entry->skip_wal_min_blk = InvalidBlockNumber;
+        walskip_entry->create_sxid = GetCurrentSubTransactionId();
+        walskip_entry->invalidate_sxid = InvalidSubTransactionId;
+        walskip_entry->tableam = NULL;
+    }
+
+    return walskip_entry;
 }
 
 /*
@@ -506,6 +917,107 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     return nrels;
 }
 
+/*
+ * Finish bulk insert of files.
+ */
+void
+smgrFinishBulkInsert(bool isCommit)
+{
+    if (!walSkipHash)
+        return;
+
+    if (isCommit)
+    {
+        HASH_SEQ_STATUS status;
+        RelWalSkip *walskip;
+
+        hash_seq_init(&status, walSkipHash);
+
+        while ((walskip = hash_seq_search(&status)) != NULL)
+        {
+            /*
+             * On commit, process valid entreis. Rollback doesn't need sync on
+             * all changes during the transaction.
+             */
+            if (walskip->skip_wal_min_blk != InvalidBlockNumber &&
+                walskip->invalidate_sxid == InvalidSubTransactionId)
+            {
+                int f;
+
+                FlushRelationBuffersWithoutRelCache(walskip->relnode, false);
+
+                /*
+                 * We mustn't create an entry when the table AM doesn't
+                 * support WAL-skipping.
+                 */
+                Assert (walskip->tableam->finish_bulk_insert);
+
+                /* flush all requested forks  */
+                for (f = MAIN_FORKNUM ; f <= MAX_FORKNUM ; f++)
+                {
+                    if (walskip->forks[f])
+                    {
+                        walskip->tableam->finish_bulk_insert(walskip->relnode, f);
+                        STORAGE_elog(DEBUG2, "finishing bulk insert to rel %u/%u/%u fork %d",
+                                     walskip->relnode.spcNode,
+                                     walskip->relnode.dbNode,
+                                     walskip->relnode.relNode, f);
+                    }
+                }
+            }
+        }
+    }
+
+    hash_destroy(walSkipHash);
+    walSkipHash = NULL;
+}
+
+/*
+ * Process pending invalidation of WAL skip happened in the subtransaction
+ */
+void
+smgrProcessWALSkipInval(bool isCommit, SubTransactionId mySubid,
+                        SubTransactionId parentSubid)
+{
+    HASH_SEQ_STATUS status;
+    RelWalSkip *walskip;
+
+    if (!walSkipHash)
+        return;
+
+    /* We expect that we don't have walSkipHash in almost all cases */
+    hash_seq_init(&status, walSkipHash);
+
+    while ((walskip = hash_seq_search(&status)) != NULL)
+    {
+        if (walskip->create_sxid == mySubid)
+        {
+            /*
+             * The entry was created in this subxact. Remove it on abort, or
+             * on commit after invalidation.
+             */
+            if (!isCommit || walskip->invalidate_sxid == mySubid)
+                hash_search(walSkipHash, &walskip->relnode,
+                            HASH_REMOVE, NULL);
+            /* Treat committing valid entry as creation by the parent. */
+            else if (walskip->invalidate_sxid == InvalidSubTransactionId)
+                walskip->create_sxid = parentSubid;
+        }
+        else if (walskip->invalidate_sxid == mySubid)
+        {
+            /*
+             * This entry was created elsewhere then invalidated by this
+             * subxact. Treat commit as invalidation by the parent. Otherwise
+             * cancel invalidation.
+             */
+            if (isCommit)
+                walskip->invalidate_sxid = parentSubid;
+            else
+                walskip->invalidate_sxid = InvalidSubTransactionId;
+        }
+    }
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *
@@ -535,7 +1047,7 @@ PostPrepare_smgr(void)
  * Reassign all items in the pending-deletes list to the parent transaction.
  */
 void
-AtSubCommit_smgr(void)
+AtSubCommit_smgr(SubTransactionId mySubid, SubTransactionId parentSubid)
 {
     int            nestLevel = GetCurrentTransactionNestLevel();
     PendingRelDelete *pending;
@@ -545,6 +1057,9 @@ AtSubCommit_smgr(void)
         if (pending->nestLevel >= nestLevel)
             pending->nestLevel = nestLevel - 1;
     }
+
+    /* Remove invalidated WAL skip in this subtransaction */
+    smgrProcessWALSkipInval(true, mySubid, parentSubid);
 }
 
 /*
@@ -555,9 +1070,12 @@ AtSubCommit_smgr(void)
  * subtransaction will not commit.
  */
 void
-AtSubAbort_smgr(void)
+AtSubAbort_smgr(SubTransactionId mySubid, SubTransactionId parentSubid)
 {
     smgrDoPendingDeletes(false);
+
+    /* Remove invalidated WAL skip in this subtransaction */
+    smgrProcessWALSkipInval(false, mySubid, parentSubid);
 }
 
 void
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e842f9152b..013eb203f4 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -12452,8 +12452,7 @@ index_copy_data(Relation rel, RelFileNode newrnode)
     RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
 
     /* copy main fork */
-    RelationCopyStorage(rel->rd_smgr, dstrel, MAIN_FORKNUM,
-                        rel->rd_rel->relpersistence);
+    RelationCopyStorage(rel, dstrel, MAIN_FORKNUM);
 
     /* copy those extra forks that exist */
     for (ForkNumber forkNum = MAIN_FORKNUM + 1;
@@ -12471,8 +12470,7 @@ index_copy_data(Relation rel, RelFileNode newrnode)
                 (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
                  forkNum == INIT_FORKNUM))
                 log_smgrcreate(&newrnode, forkNum);
-            RelationCopyStorage(rel->rd_smgr, dstrel, forkNum,
-                                rel->rd_rel->relpersistence);
+            RelationCopyStorage(rel, dstrel, forkNum);
         }
     }
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 887023fc8a..0c6598d9af 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
             BufferAccessStrategy strategy,
             bool *foundPtr);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int    rnode_comparator(const void *p1, const void *p2);
@@ -3153,20 +3154,40 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
     /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int            i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3183,7 +3204,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3213,18 +3234,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 64f3c2e887..f06d55a8fe 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -75,6 +75,7 @@
 #include "partitioning/partdesc.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/array.h"
@@ -5644,6 +5645,8 @@ load_relcache_init_file(bool shared)
         rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
         rel->rd_amcache = NULL;
         MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
+        rel->rd_nowalskip = false;
+        rel->rd_walskip = NULL;
 
         /*
          * Recompute lock and physical addressing info.  This is needed in
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 882dc65c89..83fee7dbfe 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,8 +23,14 @@ extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
-extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
-                                ForkNumber forkNum, char relpersistence);
+extern void RelationCopyStorage(Relation srcrel, SMgrRelation dst,
+                                ForkNumber forkNum);
+extern bool BufferNeedsWAL(Relation rel, Buffer buf);
+extern bool BlockNeedsWAL(Relation rel, BlockNumber blkno);
+extern void RecordWALSkipping(Relation rel);
+extern void RecordPendingSync(Relation rel, SMgrRelation srel,
+                              ForkNumber forknum);
+extern void RelationInvalidateWALSkip(Relation rel);
 
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
@@ -32,8 +38,11 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
  */
 extern void smgrDoPendingDeletes(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
-extern void AtSubCommit_smgr(void);
-extern void AtSubAbort_smgr(void);
+extern void smgrFinishBulkInsert(bool isCommit);
+extern void AtSubCommit_smgr(SubTransactionId mySubid,
+                             SubTransactionId parentSubid);
+extern void AtSubAbort_smgr(SubTransactionId mySubid,
+                             SubTransactionId parentSubid);
 extern void PostPrepare_smgr(void);
 
 #endif                            /* STORAGE_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c5826f691d..8a9ea041dd 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                 ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                    bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                        ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 89a7fbf73a..0adc2aba06 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -198,6 +198,13 @@ typedef struct RelationData
 
     /* use "struct" here to avoid needing to include pgstat.h: */
     struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+    /*
+     * rd_nowalskip is true if this relation is known not to skip WAL.
+     * Otherwise we need to ask smgr for an entry if rd_walskip is NULL.
+     */
+    bool                rd_nowalskip;
+    struct RelWalSkip   *rd_walskip;
 } RelationData;
 
 
-- 
2.16.3

From 5f0b1c61b7f73b08000a5b4288662b13e6fe51f4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 2 Apr 2019 13:29:23 +0900
Subject: [PATCH 6/7] Fix WAL skipping feature.

This patch replaces WAL-skipping means from HEAP_INSERT_SKIP_WAL to
the new infrastructure.
---
 src/backend/access/heap/heapam.c         | 133 ++++++++++++++++++++++++-------
 src/backend/access/heap/heapam_handler.c |  87 +++++++++++++++-----
 src/backend/access/heap/pruneheap.c      |   3 +-
 src/backend/access/heap/rewriteheap.c    |  28 ++-----
 src/backend/access/heap/vacuumlazy.c     |   6 +-
 src/backend/access/heap/visibilitymap.c  |   3 +-
 src/backend/commands/cluster.c           |  27 +++++++
 src/backend/commands/copy.c              |  15 +++-
 src/backend/commands/createas.c          |   7 +-
 src/backend/commands/matview.c           |   7 +-
 src/backend/commands/tablecmds.c         |   8 +-
 src/include/access/rewriteheap.h         |   2 +-
 12 files changed, 237 insertions(+), 89 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 223be30eb3..ae70798b3c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -28,6 +28,27 @@
  *      the POSTGRES heap access method used for all POSTGRES
  *      relations.
  *
+ * WAL CONSIDERATIONS
+ *      All heap operations are normally WAL-logged. but there are a few
+ *      exceptions. Temporary and unlogged relations never need to be
+ *      WAL-logged, but we can also skip WAL-logging for a table that was
+ *      created in the same transaction, if we don't need WAL for PITR or WAL
+ *      archival purposes (i.e. if wal_level=minimal), and we fsync() the file
+ *      to disk at COMMIT instead.
+ *
+ *      The same-relation optimization is not employed automatically on all
+ *      updates to a table that was created in the same transaction, because for
+ *      a small number of changes, it's cheaper to just create the WAL records
+ *      than fsync()ing the whole relation at COMMIT. It is only worthwhile for
+ *      (presumably) large operations like COPY, CLUSTER, or VACUUM FULL. Use
+ *      table_relation_register_sync() to initiate such an operation; it will
+ *      cause any subsequent updates to the table to skip WAL-logging, if
+ *      possible, and cause the heap to be synced to disk at COMMIT.
+ *
+ *      To make that work, all modifications to heap must use
+ *      BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ *      for the given block.
+ *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -51,6 +72,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -1948,7 +1970,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         XLogRecPtr    recptr;
 
@@ -2065,7 +2087,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     int            ndone;
     PGAlignedBlock scratch;
     Page        page;
-    bool        needwal;
     Size        saveFreeSpace;
     bool        need_tuple_data = RelationIsLogicallyLogged(relation);
     bool        need_cids = RelationIsAccessibleInLogicalDecoding(relation);
@@ -2073,7 +2094,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -2122,6 +2142,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
         Buffer        vmbuffer = InvalidBuffer;
         bool        all_visible_cleared = false;
         int            nthispage;
+        bool        needwal;
 
         CHECK_FOR_INTERRUPTS();
 
@@ -2133,6 +2154,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
                                            InvalidBuffer, options, bistate,
                                            &vmbuffer, NULL);
         page = BufferGetPage(buffer);
+        needwal = BufferNeedsWAL(relation, buffer);
 
         /* NO EREPORT(ERROR) from here till changes are logged */
         START_CRIT_SECTION();
@@ -2681,7 +2703,7 @@ l1:
      * NB: heap_abort_speculative() uses the same xlog record and replay
      * routines.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         XLogRecPtr    recptr;
 
@@ -2820,6 +2842,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
                 vmbuffer = InvalidBuffer,
                 vmbuffer_new = InvalidBuffer;
     bool        need_toast;
+    bool        oldbuf_needs_wal,
+                newbuf_needs_wal;
     Size        newtupsize,
                 pagefree;
     bool        have_tuple_lock = false;
@@ -3371,7 +3395,7 @@ l2:
 
         MarkBufferDirty(buffer);
 
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             xl_heap_lock xlrec;
             XLogRecPtr    recptr;
@@ -3585,26 +3609,74 @@ l2:
         MarkBufferDirty(newbuf);
     MarkBufferDirty(buffer);
 
-    /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    /*
+     *  XLOG stuff
+     *
+     * Emit heap-update log. When wal_level = minimal, we may emit insert or
+     * delete record according to wal-optimization.
+     */
+    oldbuf_needs_wal = BufferNeedsWAL(relation, buffer);
+
+    if (newbuf == buffer)
+        newbuf_needs_wal = oldbuf_needs_wal;
+    else
+        newbuf_needs_wal = BufferNeedsWAL(relation, newbuf);
+
+    if (oldbuf_needs_wal || newbuf_needs_wal)
     {
         XLogRecPtr    recptr;
 
         /*
          * For logical decoding we need combocids to properly decode the
-         * catalog.
+         * catalog. Both oldbuf_needs_wal and newbuf_needs_wal must be true
+         * when logical decoding is active.
          */
         if (RelationIsAccessibleInLogicalDecoding(relation))
         {
+            Assert(oldbuf_needs_wal && newbuf_needs_wal);
+
             log_heap_new_cid(relation, &oldtup);
             log_heap_new_cid(relation, heaptup);
         }
 
-        recptr = log_heap_update(relation, buffer,
-                                 newbuf, &oldtup, heaptup,
-                                 old_key_tuple,
-                                 all_visible_cleared,
-                                 all_visible_cleared_new);
+        /*
+         * Insert log record. When we are not running WAL-skipping, always use
+         * update log. Otherwise use delete or insert log instead when only
+         * one of the two buffers needs WAL-logging. If this were a
+         * HOT-update, redoing the WAL record would result in a broken
+         * hot-chain. However, that never happens because updates complete on
+         * a single page always use log_update.
+         *
+         * Using delete or insert log in place of udpate log leads to
+         * inconsistent series of WAL records. But note that WAL-skipping
+         * happens only when we are updating a tuple in a relation that has
+         * been create in the same transaction. Once commited, the WAL records
+         * recovers the same state of the relation as the synced state at the
+         * commit. Or the maybe-broken relation due to a crash before commit
+         * will be removed in recovery.
+         */
+        if (oldbuf_needs_wal && newbuf_needs_wal)
+            recptr = log_heap_update(relation, buffer, newbuf,
+                                     &oldtup, heaptup,
+                                     old_key_tuple,
+                                     all_visible_cleared,
+                                     all_visible_cleared_new);
+        else if (oldbuf_needs_wal)
+            recptr = log_heap_delete(relation, buffer, &oldtup, old_key_tuple,
+                                     xmax_old_tuple, false,
+                                     all_visible_cleared);
+        else
+        {
+            /*
+             * Coming here means that the old tuple is invisible and
+             * inoperable to another transaction. So xmax_new_tuple is
+             * expected to be InvalidTransactionId here.
+             */
+            Assert (xmax_new_tuple == InvalidTransactionId);
+            recptr = log_heap_insert(relation, buffer, newtup,
+                                     0, all_visible_cleared_new);
+        }
+
         if (newbuf != buffer)
         {
             PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4482,7 +4554,7 @@ failed:
      * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
      * entries for everything anyway.)
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, *buffer))
     {
         xl_heap_lock xlrec;
         XLogRecPtr    recptr;
@@ -5234,7 +5306,7 @@ l4:
         MarkBufferDirty(buf);
 
         /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf))
         {
             xl_heap_lock_updated xlrec;
             XLogRecPtr    recptr;
@@ -5394,7 +5466,7 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
     htup->t_ctid = *tid;
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_confirm xlrec;
         XLogRecPtr    recptr;
@@ -5526,7 +5598,7 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
      * The WAL records generated here match heap_delete().  The same recovery
      * routines are used.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -5635,7 +5707,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_inplace xlrec;
         XLogRecPtr    recptr;
@@ -7045,8 +7117,8 @@ log_heap_clean(Relation reln, Buffer buffer,
     xl_heap_clean xlrec;
     XLogRecPtr    recptr;
 
-    /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    /* Caller should not call me on non-WAL-logged buffers */
+    Assert(BufferNeedsWAL(reln, buffer));
 
     xlrec.latestRemovedXid = latestRemovedXid;
     xlrec.nredirected = nredirected;
@@ -7093,8 +7165,8 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
     xl_heap_freeze_page xlrec;
     XLogRecPtr    recptr;
 
-    /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    /* Caller should not call me on non-WAL-logged buffers */
+    Assert(BufferNeedsWAL(reln, buffer));
     /* nor when there are no tuples to freeze */
     Assert(ntuples > 0);
 
@@ -7309,8 +7381,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
     bool        init;
     int            bufflags;
 
-    /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    /* Caller should not call me unless both buffers need WAL-logging */
+    Assert(BufferNeedsWAL(reln, newbuf) && BufferNeedsWAL(reln, oldbuf));
 
     XLogBeginInsert();
 
@@ -8914,9 +8986,16 @@ heap2_redo(XLogReaderState *record)
  *    heap_sync        - sync a heap, for use when no WAL has been written
  *
  * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. table_relation_register_sync() should
+ * be used for that purpose instead.
  *
  * Indexes are not touched.  (Currently, index operations associated with
  * the commands that use this are WAL-logged and so do not need fsync.
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index add0d65f81..0c763f3a33 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -58,6 +58,8 @@ static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
                        OffsetNumber tupoffset);
 
 static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
+static void heapam_relation_register_walskip(Relation rel);
+static void heapam_relation_invalidate_walskip(Relation rel);
 
 static const TableAmRoutine heapam_methods;
 
@@ -543,14 +545,10 @@ tuple_lock_retry:
 }
 
 static void
-heapam_finish_bulk_insert(Relation relation, int options)
+heapam_finish_bulk_insert(RelFileNode rnode, ForkNumber forkNum)
 {
-    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway / don't go through tableam)
-     */
-    if (options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(relation);
+    /* Sync the file immedately */
+    smgrimmedsync(smgropen(rnode, InvalidBackendId), forkNum);
 }
 
 
@@ -618,6 +616,12 @@ heapam_relation_copy_data(Relation rel, RelFileNode newrnode)
     dstrel = smgropen(newrnode, rel->rd_backend);
     RelationOpenSmgr(rel);
 
+    /*
+     * Register WAL-skipping for the relation. WAL-logging is skipped and sync
+     * the file at commit if the AM supports the feature.
+     */
+    table_relation_register_walskip(rel);
+
     /*
      * Create and copy all forks of the relation, and schedule unlinking of
      * old physical files.
@@ -628,8 +632,7 @@ heapam_relation_copy_data(Relation rel, RelFileNode newrnode)
     RelationCreateStorage(newrnode, rel->rd_rel->relpersistence);
 
     /* copy main fork */
-    RelationCopyStorage(rel->rd_smgr, dstrel, MAIN_FORKNUM,
-                        rel->rd_rel->relpersistence);
+    RelationCopyStorage(rel, dstrel, MAIN_FORKNUM);
 
     /* copy those extra forks that exist */
     for (ForkNumber forkNum = MAIN_FORKNUM + 1;
@@ -647,8 +650,7 @@ heapam_relation_copy_data(Relation rel, RelFileNode newrnode)
                 (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
                  forkNum == INIT_FORKNUM))
                 log_smgrcreate(&newrnode, forkNum);
-            RelationCopyStorage(rel->rd_smgr, dstrel, forkNum,
-                                rel->rd_rel->relpersistence);
+            RelationCopyStorage(rel, dstrel, forkNum);
         }
     }
 
@@ -672,7 +674,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     IndexScanDesc indexScan;
     TableScanDesc tableScan;
     HeapScanDesc heapScan;
-    bool        use_wal;
     bool        is_system_catalog;
     Tuplesortstate *tuplesort;
     TupleDesc    oldTupDesc = RelationGetDescr(OldHeap);
@@ -686,15 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     /* Remember if it's a system catalog */
     is_system_catalog = IsSystemRelation(OldHeap);
 
-    /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a WAL-logged rel.
-     */
-    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
-    /* use_wal off requires smgr_targblock be initially invalid */
-    Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
-
     /* Preallocate values/isnull arrays */
     natts = newTupDesc->natts;
     values = (Datum *) palloc(natts * sizeof(Datum));
@@ -702,7 +694,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
     /* Initialize the rewrite operation */
     rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, FreezeXid,
-                                 MultiXactCutoff, use_wal);
+                                 MultiXactCutoff);
 
 
     /* Set up sorting if wanted */
@@ -948,6 +940,55 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     pfree(isnull);
 }
 
+/*
+ *    heapam_relation_register_walskip - register a heap to be WAL-skipped then
+ *                                       synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file. This makes
+ * note of the current size of the relation, and ensures that when the
+ * relation is extended, any changes to the new blocks in the heap, in the
+ * same transaction, will not be WAL-logged. Instead, the heap contents are
+ * flushed to disk at commit.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+static void
+heapam_relation_register_walskip(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RecordWALSkipping(rel);
+    if (OidIsValid(rel->rd_rel->reltoastrelid))
+    {
+        Relation    toastrel;
+
+        toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+        RecordWALSkipping(toastrel);
+        heap_close(toastrel, AccessShareLock);
+    }
+
+    return;
+}
+
+/*
+ *    heapam_relation_invalidate_walskip    - invalidate registered WAL skipping
+ *
+ *  After some file-replacing operations like CLUSTER, the old file no longe
+ *  needs to be synced to disk. This function invalidates the registered
+ *  WAL-skipping on the current relfilenode of the relation.
+ */
+static void
+heapam_relation_invalidate_walskip(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RelationInvalidateWALSkip(rel);
+}
+
 static bool
 heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
                                BufferAccessStrategy bstrategy)
@@ -2531,6 +2572,8 @@ static const TableAmRoutine heapam_methods = {
     .relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
     .relation_copy_data = heapam_relation_copy_data,
     .relation_copy_for_cluster = heapam_relation_copy_for_cluster,
+    .relation_register_walskip = heapam_relation_register_walskip,
+    .relation_invalidate_walskip = heapam_relation_invalidate_walskip,
     .relation_vacuum = heap_vacuum_rel,
     .scan_analyze_next_block = heapam_scan_analyze_next_block,
     .scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index a3e51922d8..a05659b168 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
 #include "access/htup_details.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -258,7 +259,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
         /*
          * Emit a WAL HEAP_CLEAN record showing what we did
          */
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
         {
             XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bce4274362..494f7fcd41 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -116,6 +116,7 @@
 #include "access/xloginsert.h"
 
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 
 #include "lib/ilist.h"
 
@@ -144,7 +145,6 @@ typedef struct RewriteStateData
     Page        rs_buffer;        /* page currently being built */
     BlockNumber rs_blockno;        /* block where page will go */
     bool        rs_buffer_valid;    /* T if any tuples in buffer */
-    bool        rs_use_wal;        /* must we WAL-log inserts? */
     bool        rs_logical_rewrite; /* do we need to do logical rewriting */
     TransactionId rs_oldest_xmin;    /* oldest xmin used by caller to determine
                                      * tuple visibility */
@@ -238,15 +238,13 @@ static void logical_end_heap_rewrite(RewriteState state);
  * oldest_xmin    xid used by the caller to determine which tuples are dead
  * freeze_xid    xid before which tuples will be frozen
  * min_multi    multixact before which multis will be removed
- * use_wal        should the inserts to the new heap be WAL-logged?
  *
  * Returns an opaque RewriteState, allocated in current memory context,
  * to be used in subsequent calls to the other functions.
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-                   TransactionId freeze_xid, MultiXactId cutoff_multi,
-                   bool use_wal)
+                   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
     RewriteState state;
     MemoryContext rw_cxt;
@@ -271,7 +269,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     /* new_heap needn't be empty, just locked */
     state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
     state->rs_buffer_valid = false;
-    state->rs_use_wal = use_wal;
     state->rs_oldest_xmin = oldest_xmin;
     state->rs_freeze_xid = freeze_xid;
     state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +327,7 @@ end_heap_rewrite(RewriteState state)
     /* Write the last page, if any */
     if (state->rs_buffer_valid)
     {
-        if (state->rs_use_wal)
+        if (BlockNeedsWAL(state->rs_new_rel, state->rs_blockno))
             log_newpage(&state->rs_new_rel->rd_node,
                         MAIN_FORKNUM,
                         state->rs_blockno,
@@ -344,19 +341,7 @@ end_heap_rewrite(RewriteState state)
                    (char *) state->rs_buffer, true);
     }
 
-    /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.
-     *
-     * It's obvious that we must do this when not WAL-logging. It's less
-     * obvious that we have to do it even if we did WAL-log the pages. The
-     * reason is the same as in tablecmds.c's copy_relation_data(): we're
-     * writing data that's not in shared buffers, and so a CHECKPOINT
-     * occurring during the rewriteheap operation won't have fsync'd data we
-     * wrote before the checkpoint.
-     */
-    if (RelationNeedsWAL(state->rs_new_rel))
-        heap_sync(state->rs_new_rel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     logical_end_heap_rewrite(state);
 
@@ -654,9 +639,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
@@ -695,7 +677,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
             /* Doesn't fit, so write out the existing page */
 
             /* XLOG stuff */
-            if (state->rs_use_wal)
+            if (BlockNeedsWAL(state->rs_new_rel, state->rs_blockno))
                 log_newpage(&state->rs_new_rel->rd_node,
                             MAIN_FORKNUM,
                             state->rs_blockno,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index c9d83128d5..3d8d01b10f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -959,7 +959,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
                  * page has been previously WAL-logged, and if not, do that
                  * now.
                  */
-                if (RelationNeedsWAL(onerel) &&
+                if (BufferNeedsWAL(onerel, buf) &&
                     PageGetLSN(page) == InvalidXLogRecPtr)
                     log_newpage_buffer(buf, true);
 
@@ -1233,7 +1233,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
             }
 
             /* Now WAL-log freezing if necessary */
-            if (RelationNeedsWAL(onerel))
+            if (BufferNeedsWAL(onerel, buf))
             {
                 XLogRecPtr    recptr;
 
@@ -1644,7 +1644,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(onerel))
+    if (BufferNeedsWAL(onerel, buffer))
     {
         XLogRecPtr    recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 64dfe06b26..1f5f7d92dd 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
@@ -276,7 +277,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
         map[mapByte] |= (flags << mapOffset);
         MarkBufferDirty(vmBuf);
 
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, heapBuf))
         {
             if (XLogRecPtrIsInvalid(recptr))
             {
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 4f4be1efbf..b5db26fda5 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -612,6 +612,18 @@ rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose)
                                relpersistence,
                                AccessExclusiveLock);
 
+    /*
+     * If wal_level is minimal, we skip WAL-logging even for WAL-logging
+     * relations. The filenode is synced at commit.
+     */
+    if (!XLogIsNeeded())
+    {
+        /* make_new_heap doesn't lock OIDNewHeap */
+        Relation newheap = table_open(OIDNewHeap, AccessShareLock);
+        table_relation_register_walskip(newheap);
+        table_close(newheap, AccessShareLock);
+    }
+
     /* Copy the heap data into the new table in the desired order */
     copy_table_data(OIDNewHeap, tableOid, indexOid, verbose,
                    &swap_toast_by_content, &frozenXid, &cutoffMulti);
@@ -1355,6 +1367,21 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
     /* Zero out possible results from swapped_relation_files */
     memset(mapped_tables, 0, sizeof(mapped_tables));
 
+    /*
+     * Unregister useless pending file-sync. table_relation_unregister_sync
+     * relies on a premise that relation cache has the correct relfilenode and
+     * related members. After swap_relation_files, the relcache entry for the
+     * heaps gets inconsistent with pg_class entry so we should do this before
+     * the call.
+     */
+    if (!XLogIsNeeded())
+    {
+        Relation oldheap = table_open(OIDOldHeap, AccessShareLock);
+
+        table_relation_invalidate_walskip(oldheap);
+        table_close(oldheap, AccessShareLock);
+    }
+
     /*
      * Swap the contents of the heap relations (including any toast tables).
      * Also set old heap's relfrozenxid to frozenXid.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index c39218f8db..046acc9fbf 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2762,9 +2762,13 @@ CopyFrom(CopyState cstate)
         (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
          cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
     {
-        ti_options |= TABLE_INSERT_SKIP_FSM;
+        /*
+         * We can skip WAL-logging the insertions, unless PITR or streaming
+         * replication is in use. We can skip the FSM in any case.
+         */
         if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
+            table_relation_register_walskip(cstate->rel);
+        ti_options |= TABLE_INSERT_SKIP_FSM;
     }
 
     /*
@@ -3369,7 +3373,12 @@ CopyFrom(CopyState cstate)
 
     FreeExecutorState(estate);
 
-    table_finish_bulk_insert(cstate->rel, ti_options);
+    /*
+     * If we skipped writing WAL, then we will sync the heap at the end of
+     * the transaction. (We used to do it here, but it was later found out
+     * that to be safe, we must also avoid WAL-logging any subsequent
+     * actions on the pages we skipped WAL for). Indexes always use WAL.
+     */
 
     return processed;
 }
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 43c2fa9124..8b73654413 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+    if (!XLogIsNeeded())
+        table_relation_register_walskip(intoRelationDesc);
+    myState->ti_options = HEAP_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,7 +605,7 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    table_finish_bulk_insert(myState->rel, myState->ti_options);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close rel, but keep lock until commit */
     table_close(myState->rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 2aac63296b..33b7bc4c16 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -462,9 +462,10 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
     if (!XLogIsNeeded())
-        myState->ti_options |= TABLE_INSERT_SKIP_WAL;
+        table_relation_register_walskip(transientrel);
+    myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
+
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,7 +510,7 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    table_finish_bulk_insert(myState->transientrel, myState->ti_options);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close transientrel, but keep lock until commit */
     table_close(myState->transientrel, NoLock);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 013eb203f4..85555f87fb 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4728,7 +4728,11 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 
         ti_options = TABLE_INSERT_SKIP_FSM;
         if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
+        {
+            /* Forget old relation's registerd sync */
+            table_relation_invalidate_walskip(oldrel);
+            table_relation_register_walskip(newrel);
+        }
     }
     else
     {
@@ -5012,7 +5016,7 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
     {
         FreeBulkInsertState(bistate);
 
-        table_finish_bulk_insert(newrel, ti_options);
+        /* If we skipped writing WAL, then it will be done at commit. */
 
         table_close(newrel, NoLock);
     }
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 6006249d96..64efecf48b 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
                    TransactionId OldestXmin, TransactionId FreezeXid,
-                   MultiXactId MultiXactCutoff, bool use_wal);
+                   MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
                    HeapTuple newTuple);
-- 
2.16.3

From e3d5ca858c56678bb0ee6fbd9d9e89bef17667bc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 2 Apr 2019 13:31:33 +0900
Subject: [PATCH 7/7] Remove TABLE/HEAP_INSERT_SKIP_WAL

Remove no-longer-used symbol TABLE/HEAP_INSERT_SKIP_WAL.
---
 src/include/access/heapam.h  |  3 +--
 src/include/access/tableam.h | 11 +++--------
 2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 77e5e603b0..f632e2758d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,11 +29,10 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
-#define HEAP_INSERT_SPECULATIVE 0x0010
+#define HEAP_INSERT_SPECULATIVE 0x0008
 
 typedef struct BulkInsertStateData *BulkInsertState;
 struct TupleTableSlot;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 38a00d8823..9840bf0258 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -103,10 +103,9 @@ typedef struct TM_FailureData
 } TM_FailureData;
 
 /* "options" flag bits for table_insert */
-#define TABLE_INSERT_SKIP_WAL        0x0001
-#define TABLE_INSERT_SKIP_FSM        0x0002
-#define TABLE_INSERT_FROZEN            0x0004
-#define TABLE_INSERT_NO_LOGICAL        0x0008
+#define TABLE_INSERT_SKIP_FSM        0x0001
+#define TABLE_INSERT_FROZEN            0x0002
+#define TABLE_INSERT_NO_LOGICAL        0x0004
 
 /* flag bits fortable_lock_tuple */
 /* Follow tuples whose update is in progress if lock modes don't conflict  */
@@ -1025,10 +1024,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * behaviour of the AM. Several options might be ignored by AMs not supporting
  * them.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple will not
- * necessarily logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.  It's
-- 
2.16.3

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

13 мая 2019 г., 03:37:05

On Sun, Mar 31, 2019 at 03:31:58PM -0700, Noah Misch wrote:
> On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
> > I also liked the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi
> > last paragraph, and I suspect it would have been no harder to back-patch.  I
> > wonder if it would have been simpler and better, but I'm not asking anyone to
> > investigate that.
> 
> Now I am asking for that.  Would anyone like to try implementing that other
> design, to see how much simpler it would be?

Anyone?  I've been deferring review of v10 and v11 in hopes of seeing the
above-described patch first.

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

14 мая 2019 г., 07:59:10

Hello.

At Sun, 12 May 2019 17:37:05 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190513003705.GA1202614@rfd.leadboat.com>
> On Sun, Mar 31, 2019 at 03:31:58PM -0700, Noah Misch wrote:
> > On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
> > > I also liked the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi
> > > last paragraph, and I suspect it would have been no harder to back-patch.  I
> > > wonder if it would have been simpler and better, but I'm not asking anyone to
> > > investigate that.
> > 
> > Now I am asking for that.  Would anyone like to try implementing that other
> > design, to see how much simpler it would be?

Yeah, I think it is a bit too-complex for the value. But I think
it is the best way as far as we keep reusing a file on
truncation of the whole file.

> Anyone?  I've been deferring review of v10 and v11 in hopes of seeing the
> above-described patch first.

The siginificant portion of the complexity in this patch comes
from need to behave differently per block according to remebered
logged and truncated block numbers.

0005:
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.

If this consideration holds and given the optimizations on
WAL-skip and truncation, there's no way to avoid the per-block
behavior as far as we are allowing mixture of
logged-modifications and WAL-skipped COPY on the same relation
within a transaction.

We could avoid the per-block behavior change by making the
wal-inhibition per-relation basis. That will reduce the patch
size by the amount of BufferNeedsWALs and log_heap_update, but
not that large.

 inhibit wal-skipping after any wal-logged modifications in the relation.
 inhibit wal-logging after any wal-skipped modifications in the relation.
 wal-skipped relations are synced at commit-time.
 truncation of wal-skipped relation creates a new relfilenode.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

17 мая 2019 г., 09:50:50

On Tue, May 14, 2019 at 01:59:10PM +0900, Kyotaro HORIGUCHI wrote:
> At Sun, 12 May 2019 17:37:05 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190513003705.GA1202614@rfd.leadboat.com>
> > On Sun, Mar 31, 2019 at 03:31:58PM -0700, Noah Misch wrote:
> > > On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
> > > > I also liked the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi
> > > > last paragraph, and I suspect it would have been no harder to back-patch.  I
> > > > wonder if it would have been simpler and better, but I'm not asking anyone to
> > > > investigate that.
> > > 
> > > Now I am asking for that.  Would anyone like to try implementing that other
> > > design, to see how much simpler it would be?
> 
> Yeah, I think it is a bit too-complex for the value. But I think
> it is the best way as far as we keep reusing a file on
> truncation of the whole file.

The design of v11-0006-Fix-WAL-skipping-feature.patch doesn't, in general,
work for WAL records touching more than one buffer.  For heapam, that patch
works around this problem by emitting XLOG_HEAP_INSERT or XLOG_HEAP_DELETE
when we'd normally emit XLOG_HEAP_UPDATE.  As a result, post-crash-recovery
heap page bits differ from the bits present when we don't crash.  Though I'm
85% confident this does not introduce a bug today, this is fragile.  That is
the main complexity I wish to avoid.

I suspect the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi last
paragraph will be simpler, not more complex.  In the implementation I'm
envisioning, smgrDoPendingDeletes() would change name, perhaps to
AtEOXact_Storage().  For every relfilenode it does not delete, it would ensure
durability by syncing (for large nodes) or by WAL-logging each page (for small
nodes).  RelationNeedsWAL() would return false whenever the applicable
relfilenode appears in pendingDeletes.  Access methods would remove their
smgrimmedsync() calls, but they would otherwise not change.  Would anyone like
to try implementing that?

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

20 мая 2019 г., 09:54:30

Hello.

At Thu, 16 May 2019 23:50:50 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190517065050.GA1298884@rfd.leadboat.com>
> On Tue, May 14, 2019 at 01:59:10PM +0900, Kyotaro HORIGUCHI wrote:
> > At Sun, 12 May 2019 17:37:05 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190513003705.GA1202614@rfd.leadboat.com>
> > > On Sun, Mar 31, 2019 at 03:31:58PM -0700, Noah Misch wrote:
> > > > On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
> > > > > I also liked the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi
> > > > > last paragraph, and I suspect it would have been no harder to back-patch.  I
> > > > > wonder if it would have been simpler and better, but I'm not asking anyone to
> > > > > investigate that.
> > > > 
> > > > Now I am asking for that.  Would anyone like to try implementing that other
> > > > design, to see how much simpler it would be?
> > 
> > Yeah, I think it is a bit too-complex for the value. But I think
> > it is the best way as far as we keep reusing a file on
> > truncation of the whole file.
> 
> The design of v11-0006-Fix-WAL-skipping-feature.patch doesn't, in general,
> work for WAL records touching more than one buffer.  For heapam, that patch
> works around this problem by emitting XLOG_HEAP_INSERT or XLOG_HEAP_DELETE
> when we'd normally emit XLOG_HEAP_UPDATE.  As a result, post-crash-recovery
> heap page bits differ from the bits present when we don't crash.  Though I'm
> 85% confident this does not introduce a bug today, this is fragile.  That is
> the main complexity I wish to avoid.

Ok, I see your point. The same issue happens on index pages more
aggressively. I didn't allow wal-skipping on indexes for the
reason.

> I suspect the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi last
> paragraph will be simpler, not more complex.  In the implementation I'm
> envisioning, smgrDoPendingDeletes() would change name, perhaps to
> AtEOXact_Storage().  For every relfilenode it does not delete, it would ensure
> durability by syncing (for large nodes) or by WAL-logging each page (for small
> nodes).  RelationNeedsWAL() would return false whenever the applicable
> relfilenode appears in pendingDeletes.  Access methods would remove their
> smgrimmedsync() calls, but they would otherwise not change.  Would anyone like
> to try implementing that?

Following this direction, the attached PoC works *at least for*
the wal_optimization TAP tests, but doing pending flush not in
smgr but in relcache. This is extending skip-wal feature to
indexes. And makes the old 0002 patch on nbtree useless.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From ebca88dea9f9458cbd58f15e370ff3fc8fbd371b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/2] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 291 ++++++++++++++++++++++++++++++++
 1 file changed, 291 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..4fa8be728e
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+    my($node, $test_name) = @_;
+
+    my $db_oid = $node->safe_psql('postgres',
+       "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+    my $prefix = "base/$db_oid/";
+    my $filepaths_referenced = $node->safe_psql('postgres', "
+       SELECT pg_relation_filepath(oid) FROM pg_class
+       WHERE reltablespace = 0 and relpersistence <> 't' and
+       pg_relation_filepath(oid) IS NOT NULL;");
+    is_deeply([sort(map { "$prefix$_" }
+                    grep(/^[0-9]+$/,
+                         slurp_dir($node->data_dir . "/$prefix")))],
+              [sort split /\n/, $filepaths_referenced],
+              $test_name);
+    return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    # Primary needs to have wal_level = minimal here
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+    $node->start;
+
+    # Setup
+    my $tablespace_dir = $node->basedir . '/tablespace_other';
+    mkdir ($tablespace_dir);
+    $tablespace_dir = TestLib::real_dir($tablespace_dir);
+    $node->safe_psql('postgres',
+       "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+    # Test direct truncation optimization.  No tuples
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test1 (id serial PRIMARY KEY);
+        TRUNCATE test1;
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+    is($result, qq(0),
+       "wal_level = $wal_level, optimized truncation with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2 (id serial PRIMARY KEY);
+        INSERT INTO test2 VALUES (DEFAULT);
+        TRUNCATE test2;
+        INSERT INTO test2 VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with inserted table");
+
+    # Data file for COPY query in follow-up tests.
+    my $basedir = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after the
+    # truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3;
+        COPY test3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+    is($result, qq(3),
+       "wal_level = $wal_level, optimized truncation with copied table");
+
+    # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+        COPY test3a FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # in different subtransaction patterns
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a2;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; RELEASE s;
+        COPY test3a2 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a3;
+        SAVEPOINT s;
+            ALTER TABLE test3a3 SET TABLESPACE other;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            ROLLBACK TO s2;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            RELEASE s2;
+        ROLLBACK TO s;
+        COPY test3a3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        COPY test3b FROM '$copy_file' DELIMITER ',';  -- set sync_above
+        UPDATE test3b SET id2 = id2 + 1;
+        DELETE FROM test3b;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+    is($result, qq(0),
+       "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+    # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test4;
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COPY test4 FROM '$copy_file' DELIMITER ',';
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+    is($result, qq(5),
+       "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test5 VALUES (DEFAULT, 1);
+        COPY test5 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER test6_before_row_insert
+          BEFORE INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+        CREATE TRIGGER test6_after_row_insert
+          AFTER INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+        COPY test6 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+    is($result, qq(9),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER test7_before_stat_truncate
+          BEFORE TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+        CREATE TRIGGER test7_after_stat_truncate
+          AFTER TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+        INSERT INTO test7 VALUES (DEFAULT, 1);
+        TRUNCATE test7;
+        COPY test7 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test redo of temp table creation.
+    $node->safe_psql('postgres', "
+        CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+    $node->stop('immediate');
+    $node->start;
+
+    check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.16.3

From 3859609090a274fc1ba59964f3819d19217bd8ef Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 20 May 2019 15:38:59 +0900
Subject: [PATCH 2/2] Fix WAL skipping feature

This patch shows a PoC of how change WAL-skipping feature to avoid
table corruption caused by mixing wal-logged and wal-skipped
operations.
---
 src/backend/access/heap/heapam.c         |  4 ++--
 src/backend/access/heap/heapam_handler.c |  7 +------
 src/backend/access/heap/rewriteheap.c    |  3 ---
 src/backend/access/transam/xact.c        |  6 ++++++
 src/backend/commands/copy.c              |  4 ----
 src/backend/commands/createas.c          |  3 +--
 src/backend/commands/tablecmds.c         |  2 --
 src/backend/utils/cache/relcache.c       | 22 ++++++++++++++++++++++
 src/include/access/heapam.h              |  1 -
 src/include/utils/rel.h                  |  3 ++-
 src/include/utils/relcache.h             |  1 +
 11 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 19d2c529d8..dda76c8736 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1950,7 +1950,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (RelationNeedsWAL(relation))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2133,7 +2133,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+    needwal = RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8d8161fd97..f4af981a35 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -560,12 +560,7 @@ tuple_lock_retry:
 static void
 heapam_finish_bulk_insert(Relation relation, int options)
 {
-    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway / don't go through tableam)
-     */
-    if (options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(relation);
+    /* heapam doesn't need do this */
 }
 
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bce4274362..1ac77f7c14 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -654,9 +654,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 20feeec327..fb35992a13 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2133,6 +2133,9 @@ CommitTransaction(void)
     /* Commit updates to the relation map --- do this as late as possible */
     AtEOXact_RelationMap(true, is_parallel_worker);
 
+    /* Perform pending flush */
+    AtEOXact_DoPendingFlush();
+
     /*
      * set the current transaction state information appropriately during
      * commit processing
@@ -2349,6 +2352,9 @@ PrepareTransaction(void)
      */
     PreCommit_CheckForSerializationFailure();
 
+    /* Perform pending flush */
+    AtEOXact_DoPendingFlush();
+
     /* NOTIFY will be handled below */
 
     /*
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 6ffc3a62f6..9bae04b8a7 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2761,11 +2761,7 @@ CopyFrom(CopyState cstate)
     if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
         (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
          cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
-    {
         ti_options |= TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
-    }
 
     /*
      * Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 43c2fa9124..83e5f9220f 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+    myState->ti_options = TABLE_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index bfcf9472d7..b686497443 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4741,8 +4741,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         ti_options = TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
     }
     else
     {
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index d0f6f715e6..10fd405171 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2913,6 +2913,28 @@ RememberToFreeTupleDescAtEOX(TupleDesc td)
     EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
 }
 
+void
+AtEOXact_DoPendingFlush()
+{
+    HASH_SEQ_STATUS status;
+    RelIdCacheEnt *idhentry;
+
+    if (!RelationIdCache)
+        return;
+
+    hash_seq_init(&status, RelationIdCache);
+    while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+    {
+        Relation rel = idhentry->reldesc;
+        if (RELATION_IS_LOCAL(rel) && !XLogIsNeeded() && rel->rd_smgr)
+        {
+            FlushRelationBuffers(rel);
+            smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
+        }
+    }
+}
+
+
 /*
  * AtEOXact_RelationCache
  *
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 62aaa08eff..0fb7d86bf2 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d7f33abce3..41ab634ff5 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -514,7 +514,8 @@ typedef struct ViewOptions
  *        True if relation needs WAL.
  */
 #define RelationNeedsWAL(relation) \
-    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+     !(RELATION_IS_LOCAL(relation) && !XLogIsNeeded()))
 
 /*
  * RelationUsesLocalBuffers
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 364495a5f0..cd9b1a6f68 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -123,6 +123,7 @@ extern void RelationCloseSmgrByOid(Oid relationId);
 extern void AtEOXact_RelationCache(bool isCommit);
 extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid,
                           SubTransactionId parentSubid);
+extern void AtEOXact_DoPendingFlush(void);
 
 /*
  * Routines to help manage rebuilding of relcache init files
-- 
2.16.3

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

21 мая 2019 г., 15:29:48

Hello.

At Mon, 20 May 2019 15:54:30 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190520.155430.215084510.horiguchi.kyotaro@lab.ntt.co.jp>
> > I suspect the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi last
> > paragraph will be simpler, not more complex.  In the implementation I'm
> > envisioning, smgrDoPendingDeletes() would change name, perhaps to
> > AtEOXact_Storage().  For every relfilenode it does not delete, it would ensure
> > durability by syncing (for large nodes) or by WAL-logging each page (for small
> > nodes).  RelationNeedsWAL() would return false whenever the applicable
> > relfilenode appears in pendingDeletes.  Access methods would remove their
> > smgrimmedsync() calls, but they would otherwise not change.  Would anyone like
> > to try implementing that?
> 
> Following this direction, the attached PoC works *at least for*
> the wal_optimization TAP tests, but doing pending flush not in
> smgr but in relcache. This is extending skip-wal feature to
> indexes. And makes the old 0002 patch on nbtree useless.

This is a tidier version of the patch.

- Passes regression tests including 018_wal_optimize.pl

- Move the substantial work to table/index AMs.

  Each AM can decide whether to support WAL skip or not.
  Currently heap and nbtree support it.

- The timing of sync is moved from AtEOXact to PreCommit. This is
  because heap_sync() needs xact state = INPROGRESS.

- matview and cluster is broken, since swapping to new
  relfilenode doesn't change rd_newRelfilenodeSubid. I'll address
  that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 680462288cb82da23c19a02239787fc1ea08cdde Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/2] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 291 ++++++++++++++++++++++++++++++++
 1 file changed, 291 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..4fa8be728e
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+    my($node, $test_name) = @_;
+
+    my $db_oid = $node->safe_psql('postgres',
+       "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+    my $prefix = "base/$db_oid/";
+    my $filepaths_referenced = $node->safe_psql('postgres', "
+       SELECT pg_relation_filepath(oid) FROM pg_class
+       WHERE reltablespace = 0 and relpersistence <> 't' and
+       pg_relation_filepath(oid) IS NOT NULL;");
+    is_deeply([sort(map { "$prefix$_" }
+                    grep(/^[0-9]+$/,
+                         slurp_dir($node->data_dir . "/$prefix")))],
+              [sort split /\n/, $filepaths_referenced],
+              $test_name);
+    return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    # Primary needs to have wal_level = minimal here
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+    $node->start;
+
+    # Setup
+    my $tablespace_dir = $node->basedir . '/tablespace_other';
+    mkdir ($tablespace_dir);
+    $tablespace_dir = TestLib::real_dir($tablespace_dir);
+    $node->safe_psql('postgres',
+       "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+    # Test direct truncation optimization.  No tuples
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test1 (id serial PRIMARY KEY);
+        TRUNCATE test1;
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+    is($result, qq(0),
+       "wal_level = $wal_level, optimized truncation with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2 (id serial PRIMARY KEY);
+        INSERT INTO test2 VALUES (DEFAULT);
+        TRUNCATE test2;
+        INSERT INTO test2 VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with inserted table");
+
+    # Data file for COPY query in follow-up tests.
+    my $basedir = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after the
+    # truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3;
+        COPY test3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+    is($result, qq(3),
+       "wal_level = $wal_level, optimized truncation with copied table");
+
+    # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+        COPY test3a FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # in different subtransaction patterns
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a2;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; RELEASE s;
+        COPY test3a2 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a3;
+        SAVEPOINT s;
+            ALTER TABLE test3a3 SET TABLESPACE other;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            ROLLBACK TO s2;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            RELEASE s2;
+        ROLLBACK TO s;
+        COPY test3a3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        COPY test3b FROM '$copy_file' DELIMITER ',';  -- set sync_above
+        UPDATE test3b SET id2 = id2 + 1;
+        DELETE FROM test3b;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+    is($result, qq(0),
+       "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+    # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test4;
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COPY test4 FROM '$copy_file' DELIMITER ',';
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+    is($result, qq(5),
+       "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test5 VALUES (DEFAULT, 1);
+        COPY test5 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER test6_before_row_insert
+          BEFORE INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+        CREATE TRIGGER test6_after_row_insert
+          AFTER INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+        COPY test6 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+    is($result, qq(9),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER test7_before_stat_truncate
+          BEFORE TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+        CREATE TRIGGER test7_after_stat_truncate
+          AFTER TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+        INSERT INTO test7 VALUES (DEFAULT, 1);
+        TRUNCATE test7;
+        COPY test7 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test redo of temp table creation.
+    $node->safe_psql('postgres', "
+        CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+    $node->stop('immediate');
+    $node->start;
+
+    check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.16.3

From 75b90a8020275af6ee5e6ee5a4433c5582bd9148 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 20 May 2019 15:38:59 +0900
Subject: [PATCH 2/2] Fix WAL skipping feature

WAL-skipping operations mixed with WAL-logged operations can lead to
database corruption after a crash. This patch changes the WAL-skipping
feature so that no data modifcation is WAL-logged at all then sync
such relations at commit.
---
 src/backend/access/brin/brin.c           |  2 +
 src/backend/access/gin/ginutil.c         |  2 +
 src/backend/access/gist/gist.c           |  2 +
 src/backend/access/hash/hash.c           |  2 +
 src/backend/access/heap/heapam.c         |  8 +--
 src/backend/access/heap/heapam_handler.c | 15 +++---
 src/backend/access/heap/rewriteheap.c    |  3 --
 src/backend/access/index/indexam.c       | 16 ++++++
 src/backend/access/nbtree/nbtree.c       | 13 +++++
 src/backend/access/transam/xact.c        |  6 +++
 src/backend/commands/copy.c              |  6 ---
 src/backend/commands/createas.c          |  5 +-
 src/backend/commands/matview.c           |  4 --
 src/backend/commands/tablecmds.c         |  4 --
 src/backend/utils/cache/relcache.c       | 87 ++++++++++++++++++++++++++++++++
 src/include/access/amapi.h               |  8 +++
 src/include/access/genam.h               |  1 +
 src/include/access/heapam.h              |  1 -
 src/include/access/nbtree.h              |  1 +
 src/include/access/tableam.h             | 36 +++++++------
 src/include/utils/rel.h                  | 21 +++++++-
 src/include/utils/relcache.h             |  1 +
 22 files changed, 188 insertions(+), 56 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index aba234c0af..681520852f 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -125,6 +125,8 @@ brinhandler(PG_FUNCTION_ARGS)
     amroutine->aminitparallelscan = NULL;
     amroutine->amparallelrescan = NULL;
 
+    amroutine->amatcommitsync = NULL;
+
     PG_RETURN_POINTER(amroutine);
 }
 
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cf9699ad18..f4f0eebec5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -77,6 +77,8 @@ ginhandler(PG_FUNCTION_ARGS)
     amroutine->aminitparallelscan = NULL;
     amroutine->amparallelrescan = NULL;
 
+    amroutine->amatcommitsync = NULL;
+
     PG_RETURN_POINTER(amroutine);
 }
 
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index d70a138f54..3a23e7c4b2 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -99,6 +99,8 @@ gisthandler(PG_FUNCTION_ARGS)
     amroutine->aminitparallelscan = NULL;
     amroutine->amparallelrescan = NULL;
 
+    amroutine->amatcommitsync = NULL;
+
     PG_RETURN_POINTER(amroutine);
 }
 
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 048e40e46f..3fa8262319 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -98,6 +98,8 @@ hashhandler(PG_FUNCTION_ARGS)
     amroutine->aminitparallelscan = NULL;
     amroutine->amparallelrescan = NULL;
 
+    amroutine->amatcommitsync = NULL;
+
     PG_RETURN_POINTER(amroutine);
 }
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 19d2c529d8..7f78122b81 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1950,7 +1950,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (RelationNeedsWAL(relation))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2133,7 +2133,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+    needwal = RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -8906,10 +8906,6 @@ heap2_redo(XLogReaderState *record)
 void
 heap_sync(Relation rel)
 {
-    /* non-WAL-logged tables never need fsync */
-    if (!RelationNeedsWAL(rel))
-        return;
-
     /* main heap */
     FlushRelationBuffers(rel);
     /* FlushRelationBuffers will have opened rd_smgr */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8d8161fd97..a2e1464845 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -557,15 +557,14 @@ tuple_lock_retry:
     return result;
 }
 
+/* ------------------------------------------------------------------------
+ * WAL-skipping related routine
+ * ------------------------------------------------------------------------
+ */
 static void
-heapam_finish_bulk_insert(Relation relation, int options)
+heapam_at_commit_sync(Relation relation)
 {
-    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway / don't go through tableam)
-     */
-    if (options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(relation);
+    heap_sync(relation);
 }
 
 
@@ -2573,7 +2572,7 @@ static const TableAmRoutine heapam_methods = {
     .tuple_delete = heapam_tuple_delete,
     .tuple_update = heapam_tuple_update,
     .tuple_lock = heapam_tuple_lock,
-    .finish_bulk_insert = heapam_finish_bulk_insert,
+    .at_commit_sync = heapam_at_commit_sync,
 
     .tuple_fetch_row_version = heapam_fetch_row_version,
     .tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bce4274362..1ac77f7c14 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -654,9 +654,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 0fc9139bad..1d089603b7 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
  *        index_can_return    - does index support index-only scans?
  *        index_getprocid - get a support procedure OID
  *        index_getprocinfo - get a support procedure's lookup info
+ *        index_at_commit_sync - perform at_commit_sync
  *
  * NOTES
  *        This file contains the index_ routines which used
@@ -837,6 +838,21 @@ index_getprocinfo(Relation irel,
     return locinfo;
 }
 
+/* ----------------
+ *        index_at_commit_sync
+ *
+ *        This routine perfoms at-commit sync of index storage.  This is called
+ *        when permanent index created in the current transaction is committed.
+ *        ----------------
+ */
+void
+index_at_commit_sync(Relation irel)
+{
+    Assert(irel->rd_indam != NULL && irel->rd_indam->amatcommitsync != NULL);
+    
+    irel->rd_indam->amatcommitsync(irel);
+}
+
 /* ----------------
  *        index_store_float8_orderby_distances
  *
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 02fb352b94..39377f35eb 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -147,6 +147,8 @@ bthandler(PG_FUNCTION_ARGS)
     amroutine->aminitparallelscan = btinitparallelscan;
     amroutine->amparallelrescan = btparallelrescan;
 
+    amroutine->amatcommitsync = btatcommitsync;
+
     PG_RETURN_POINTER(amroutine);
 }
 
@@ -1385,3 +1387,14 @@ btcanreturn(Relation index, int attno)
 {
     return true;
 }
+
+/*
+ *    btatcommitsync() -- Perform at-commit sync of WAL-skipped index
+ */
+void
+btatcommitsync(Relation index)
+{
+    FlushRelationBuffers(index);
+    smgrimmedsync(index->rd_smgr, MAIN_FORKNUM);
+}
+
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 20feeec327..bc38a53195 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2120,6 +2120,9 @@ CommitTransaction(void)
     if (!is_parallel_worker)
         PreCommit_CheckForSerializationFailure();
 
+    /* Sync WAL-skipped relations */
+    PreCommit_RelationSync();
+
     /*
      * Insert notifications sent by NOTIFY commands into the queue.  This
      * should be late in the pre-commit sequence to minimize time spent
@@ -2395,6 +2398,9 @@ PrepareTransaction(void)
                 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                  errmsg("cannot PREPARE a transaction that has manipulated logical replication workers")));
 
+    /* Sync WAL-skipped relations */
+    PreCommit_RelationSync();
+
     /* Prevent cancel/die interrupt while cleaning up */
     HOLD_INTERRUPTS();
 
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 5f81aa57d4..a25c82438e 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2761,11 +2761,7 @@ CopyFrom(CopyState cstate)
     if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
         (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
          cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
-    {
         ti_options |= TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
-    }
 
     /*
      * Optimize if new relfilenode was created in this subxact or one of its
@@ -3364,8 +3360,6 @@ CopyFrom(CopyState cstate)
 
     FreeExecutorState(estate);
 
-    table_finish_bulk_insert(cstate->rel, ti_options);
-
     return processed;
 }
 
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 43c2fa9124..859b869b0d 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+    myState->ti_options = TABLE_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    table_finish_bulk_insert(myState->rel, myState->ti_options);
-
     /* close rel, but keep lock until commit */
     table_close(myState->rel, NoLock);
     myState->rel = NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 99bf3c29f2..c84edd0db0 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * replication is in use. We can skip the FSM in any case.
      */
     myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
-    if (!XLogIsNeeded())
-        myState->ti_options |= TABLE_INSERT_SKIP_WAL;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    table_finish_bulk_insert(myState->transientrel, myState->ti_options);
-
     /* close transientrel, but keep lock until commit */
     table_close(myState->transientrel, NoLock);
     myState->transientrel = NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index bfcf9472d7..75f11a327d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4741,8 +4741,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         ti_options = TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
     }
     else
     {
@@ -5026,8 +5024,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
     {
         FreeBulkInsertState(bistate);
 
-        table_finish_bulk_insert(newrel, ti_options);
-
         table_close(newrel, NoLock);
     }
 }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index d0f6f715e6..4bffbfff5d 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1512,6 +1512,9 @@ RelationInitIndexAccessInfo(Relation relation)
     relation->rd_exclprocs = NULL;
     relation->rd_exclstrats = NULL;
     relation->rd_amcache = NULL;
+
+    if (relation->rd_indam->amatcommitsync != NULL)
+        relation->rd_can_skipwal = true;
 }
 
 /*
@@ -1781,6 +1784,9 @@ RelationInitTableAccessMethod(Relation relation)
      * Now we can fetch the table AM's API struct
      */
     InitTableAmRoutine(relation);
+
+    if (relation->rd_tableam && relation->rd_tableam->at_commit_sync)
+        relation->rd_can_skipwal = true;
 }
 
 /*
@@ -2913,6 +2919,73 @@ RememberToFreeTupleDescAtEOX(TupleDesc td)
     EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
 }
 
+/*
+ * PreComimt_RelationSync
+ *
+ *    Sync relations that were WAL-skipped in this transaction .
+ *
+ * AMs may have skipped WAL-logging for relations created in the current
+ * transaction. This let such relations be synced.  This operation can only be
+ * perfomed while transaction status is INPROGRESS so it is separated from
+ * AtEOXact_RelationCache.
+ */
+void
+PreCommit_RelationSync(void)
+{
+    HASH_SEQ_STATUS status;
+    RelIdCacheEnt *idhentry;
+    int            i;
+
+    /* See AtEOXact_RelationCache for details on eoxact_list */
+    if (eoxact_list_overflowed)
+    {
+        hash_seq_init(&status, RelationIdCache);
+        while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+        {
+            Relation rel = idhentry->reldesc;
+
+            if (!RelationNeedsAtCommitSync(rel))
+                continue;
+
+            if (rel->rd_tableam != NULL)
+                table_at_commit_sync(rel);
+            else
+            {
+                Assert (rel->rd_indam != NULL);
+                table_at_commit_sync(rel);
+            }                
+        }
+    }
+    else
+    {
+        for (i = 0; i < eoxact_list_len; i++)
+        {
+            Relation rel;
+
+            idhentry = (RelIdCacheEnt *) hash_search(RelationIdCache,
+                                                     (void *) &eoxact_list[i],
+                                                     HASH_FIND,
+                                                     NULL);
+
+            if (idhentry == NULL)
+                continue;
+
+            rel = idhentry->reldesc;
+
+            if (!RelationNeedsAtCommitSync(rel))
+                continue;
+
+            if (rel->rd_tableam != NULL)
+                table_at_commit_sync(rel);
+            else
+            {
+                Assert (rel->rd_indam != NULL);
+                table_at_commit_sync(rel);
+            }
+        }
+    }
+}
+
 /*
  * AtEOXact_RelationCache
  *
@@ -3032,7 +3105,21 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
     if (relation->rd_createSubid != InvalidSubTransactionId)
     {
         if (isCommit)
+        {
+            /*
+             * While wal_level=minimal, we have skipped WAL-logging on
+             * persistent relations created in this transaction. Sync that
+             * tables out before they become publicly accessible.
+             */
+            if (!XLogIsNeeded() && relation->rd_smgr &&
+                relation->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+            {
+                FlushRelationBuffers(relation);
+                smgrimmedsync(relation->rd_smgr, MAIN_FORKNUM);
+            }
+
             relation->rd_createSubid = InvalidSubTransactionId;
+        }
         else if (RelationHasReferenceCountZero(relation))
         {
             RelationClearRelation(relation, false);
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 09a7404267..fc6981d98a 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -156,6 +156,11 @@ typedef void (*aminitparallelscan_function) (void *target);
 /* (re)start parallel index scan */
 typedef void (*amparallelrescan_function) (IndexScanDesc scan);
 
+/* sync relation at commit */
+typedef void (*amatcommitsync_function) (Relation indexRelation);
+
+    /* interface function to support WAL-skipping feature */
+    
 /*
  * API struct for an index AM.  Note this must be stored in a single palloc'd
  * chunk of memory.
@@ -230,6 +235,9 @@ typedef struct IndexAmRoutine
     amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
     aminitparallelscan_function aminitparallelscan; /* can be NULL */
     amparallelrescan_function amparallelrescan; /* can be NULL */
+
+    /* interface function to support WAL-skipping feature */
+    amatcommitsync_function amatcommitsync; /* can be NULL */;
 } IndexAmRoutine;
 
 
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 9717183ef2..b225fd622e 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -177,6 +177,7 @@ extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
                 uint16 procnum);
 extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
                   uint16 procnum);
+extern void index_at_commit_sync(Relation irel);
 extern void index_store_float8_orderby_distances(IndexScanDesc scan,
                                      Oid *orderByTypes, double *distances,
                                      bool recheckOrderBy);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 62aaa08eff..0fb7d86bf2 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 6c1acd4855..1d042e89b5 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -717,6 +717,7 @@ extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info,
 extern IndexBulkDeleteResult *btvacuumcleanup(IndexVacuumInfo *info,
                 IndexBulkDeleteResult *stats);
 extern bool btcanreturn(Relation index, int attno);
+extern void btatcommitsync(Relation index);
 
 /*
  * prototypes for internal functions in nbtree.c
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 06eae2337a..90254cb278 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -409,19 +409,15 @@ typedef struct TableAmRoutine
                                TM_FailureData *tmfd);
 
     /*
-     * Perform operations necessary to complete insertions made via
-     * tuple_insert and multi_insert with a BulkInsertState specified. This
-     * may for example be used to flush the relation, when the
-     * TABLE_INSERT_SKIP_WAL option was used.
+     * Sync relation at commit-time if needed.
      *
-     * Typically callers of tuple_insert and multi_insert will just pass all
-     * the flags that apply to them, and each AM has to decide which of them
-     * make sense for it, and then only take actions in finish_bulk_insert for
-     * those flags, and ignore others.
+     *  A table AM may skip WAL-logging for relations created in the current
+     *  transaction. This routine is called commit-time and the table AM
+     *  must flush buffer and sync the underlying storage.
      *
      * Optional callback.
      */
-    void        (*finish_bulk_insert) (Relation rel, int options);
+    void        (*at_commit_sync) (Relation rel);
 
 
     /* ------------------------------------------------------------------------
@@ -1104,8 +1100,7 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  *
  *
  * The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
  *
  * On return the slot's tts_tid and tts_tableOid are updated to reflect the
  * insertion. But note that any toasting of fields within the slot is NOT
@@ -1300,20 +1295,23 @@ table_lock_tuple(Relation rel, ItemPointer tid, Snapshot snapshot,
 }
 
 /*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
+ * Sync relation at commit-time if needed.
+ *
+ *  A table AM that defines this interface can allow derived objects created
+ *  in the current transaction to skip WAL-logging. This routine is called
+ *  commit-time and the table AM must flush buffer and sync the underlying
+ *  storage.
+ *
+ * Optional callback.
  */
 static inline void
-table_finish_bulk_insert(Relation rel, int options)
+table_at_commit_sync(Relation rel)
 {
     /* optional callback */
-    if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
-        rel->rd_tableam->finish_bulk_insert(rel, options);
+    if (rel->rd_tableam && rel->rd_tableam->at_commit_sync)
+        rel->rd_tableam->at_commit_sync(rel);
 }
 
-
 /* ------------------------------------------------------------------------
  * DDL related functionality.
  * ------------------------------------------------------------------------
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d7f33abce3..c09fd84a1c 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -64,6 +64,9 @@ typedef struct RelationData
                                  * rd_replidindex) */
     bool        rd_statvalid;    /* is rd_statlist valid? */
 
+    /* Some relations cane comit WAL-logging on certain condition. */
+    bool        rd_can_skipwal; /* can skip WAL-logging?  */
+
     /*
      * rd_createSubid is the ID of the highest subtransaction the rel has
      * survived into; or zero if the rel was not created in the current top
@@ -512,9 +515,25 @@ typedef struct ViewOptions
 /*
  * RelationNeedsWAL
  *        True if relation needs WAL.
+ *
+ * If underlying table AM has at_commit_sync interface, returns false if
+ * wal_level = minimal and this relation is created in the current transaction
  */
 #define RelationNeedsWAL(relation) \
-    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+     (!relation->rd_can_skipwal ||                                       \
+      !(RELATION_IS_LOCAL(relation) && !XLogIsNeeded())))
+
+/*
+ * RelationNeedAtCommitSync
+ *      True if relation needs WAL needs on-commit sync
+ */
+#define RelationNeedsAtCommitSync(relation) \
+    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&    \
+     relation->rd_can_skipwal &&                                        \
+     (RELATION_IS_LOCAL(relation) ||                                    \
+      relation->rd_newRelfilenodeSubid != InvalidBlockNumber)            \
+     && !XLogIsNeeded()))
 
 /*
  * RelationUsesLocalBuffers
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 364495a5f0..07c4cfa565 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -120,6 +120,7 @@ extern void RelationCacheInvalidate(void);
 
 extern void RelationCloseSmgrByOid(Oid relationId);
 
+extern void PreCommit_RelationSync(void);
 extern void AtEOXact_RelationCache(bool isCommit);
 extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid,
                           SubTransactionId parentSubid);
-- 
2.16.3

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

23 мая 2019 г., 10:10:35

Attached is a new version.

At Tue, 21 May 2019 21:29:48 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190521.212948.34357392.horiguchi.kyotaro@lab.ntt.co.jp>

> At Mon, 20 May 2019 15:54:30 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in<20190520.155430.215084510.horiguchi.kyotaro@lab.ntt.co.jp>
 
> > > I suspect the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi last
> > > paragraph will be simpler, not more complex.  In the implementation I'm
> > > envisioning, smgrDoPendingDeletes() would change name, perhaps to
> > > AtEOXact_Storage().  For every relfilenode it does not delete, it would ensure
> > > durability by syncing (for large nodes) or by WAL-logging each page (for small
> > > nodes).  RelationNeedsWAL() would return false whenever the applicable
> > > relfilenode appears in pendingDeletes.  Access methods would remove their
> > > smgrimmedsync() calls, but they would otherwise not change.  Would anyone like
> > > to try implementing that?
> > 
> > Following this direction, the attached PoC works *at least for*
> > the wal_optimization TAP tests, but doing pending flush not in
> > smgr but in relcache. This is extending skip-wal feature to
> > indexes. And makes the old 0002 patch on nbtree useless.
> 
> This is a tidier version of the patch.
> 
> - Passes regression tests including 018_wal_optimize.pl
> 
> - Move the substantial work to table/index AMs.
> 
>   Each AM can decide whether to support WAL skip or not.
>   Currently heap and nbtree support it.
> 
> - The timing of sync is moved from AtEOXact to PreCommit. This is
>   because heap_sync() needs xact state = INPROGRESS.
> 
> - matview and cluster is broken, since swapping to new
>   relfilenode doesn't change rd_newRelfilenodeSubid. I'll address
>   that.

cluster/matview are fixed.

A obstacle to fix them was the unreliability of
newRelfilenodeSubid.  As mentioned in the comment of
RelationData, newRelfilenodeSubid may dissapear by certain
sequence of commands.

In the attched v14, I added "rd_firstRelfilenodeSubid", which
stores the subtransaction id where the first relfilenode
replacementin the current transaction. It suivives any sequence
of commands, including one mentioned in CopyFrom's comment (which
I removed by this patch).

With the attached patch, on relations based on table/index AMs
that supports WAL-skipping, WAL-logging is eliminated if the
relation is created in the current transaction, or relfilenode is
replaced in the current transaction. At-commit file sync is
surely performed. (Only Heap and Btree support it.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 0430cf502bc8d04f3e71cc69a748a9a035706cb6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/2] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 291 ++++++++++++++++++++++++++++++++
 1 file changed, 291 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..4fa8be728e
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+    my($node, $test_name) = @_;
+
+    my $db_oid = $node->safe_psql('postgres',
+       "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+    my $prefix = "base/$db_oid/";
+    my $filepaths_referenced = $node->safe_psql('postgres', "
+       SELECT pg_relation_filepath(oid) FROM pg_class
+       WHERE reltablespace = 0 and relpersistence <> 't' and
+       pg_relation_filepath(oid) IS NOT NULL;");
+    is_deeply([sort(map { "$prefix$_" }
+                    grep(/^[0-9]+$/,
+                         slurp_dir($node->data_dir . "/$prefix")))],
+              [sort split /\n/, $filepaths_referenced],
+              $test_name);
+    return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    # Primary needs to have wal_level = minimal here
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+    $node->start;
+
+    # Setup
+    my $tablespace_dir = $node->basedir . '/tablespace_other';
+    mkdir ($tablespace_dir);
+    $tablespace_dir = TestLib::real_dir($tablespace_dir);
+    $node->safe_psql('postgres',
+       "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+    # Test direct truncation optimization.  No tuples
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test1 (id serial PRIMARY KEY);
+        TRUNCATE test1;
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+    is($result, qq(0),
+       "wal_level = $wal_level, optimized truncation with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2 (id serial PRIMARY KEY);
+        INSERT INTO test2 VALUES (DEFAULT);
+        TRUNCATE test2;
+        INSERT INTO test2 VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with inserted table");
+
+    # Data file for COPY query in follow-up tests.
+    my $basedir = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after the
+    # truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3;
+        COPY test3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+    is($result, qq(3),
+       "wal_level = $wal_level, optimized truncation with copied table");
+
+    # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+        COPY test3a FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # in different subtransaction patterns
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a2;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; RELEASE s;
+        COPY test3a2 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a3;
+        SAVEPOINT s;
+            ALTER TABLE test3a3 SET TABLESPACE other;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            ROLLBACK TO s2;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            RELEASE s2;
+        ROLLBACK TO s;
+        COPY test3a3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        COPY test3b FROM '$copy_file' DELIMITER ',';  -- set sync_above
+        UPDATE test3b SET id2 = id2 + 1;
+        DELETE FROM test3b;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+    is($result, qq(0),
+       "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+    # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test4;
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COPY test4 FROM '$copy_file' DELIMITER ',';
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+    is($result, qq(5),
+       "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test5 VALUES (DEFAULT, 1);
+        COPY test5 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER test6_before_row_insert
+          BEFORE INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+        CREATE TRIGGER test6_after_row_insert
+          AFTER INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+        COPY test6 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+    is($result, qq(9),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER test7_before_stat_truncate
+          BEFORE TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+        CREATE TRIGGER test7_after_stat_truncate
+          AFTER TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+        INSERT INTO test7 VALUES (DEFAULT, 1);
+        TRUNCATE test7;
+        COPY test7 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test redo of temp table creation.
+    $node->safe_psql('postgres', "
+        CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+    $node->stop('immediate');
+    $node->start;
+
+    check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.16.3

From effbb1cdc777e0612a51682dd41f0f46b7881798 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 20 May 2019 15:38:59 +0900
Subject: [PATCH 2/2] Fix WAL skipping feature

WAL-skipping operations mixed with WAL-logged operations can lead to
database corruption after a crash. This patch changes the WAL-skipping
feature so that no data modifcation is WAL-logged at all then sync
such relations at commit.
---
 src/backend/access/brin/brin.c           |   2 +
 src/backend/access/gin/ginutil.c         |   2 +
 src/backend/access/gist/gist.c           |   2 +
 src/backend/access/hash/hash.c           |   2 +
 src/backend/access/heap/heapam.c         |   8 +-
 src/backend/access/heap/heapam_handler.c |  24 ++----
 src/backend/access/heap/rewriteheap.c    |  12 +--
 src/backend/access/index/indexam.c       |  18 +++++
 src/backend/access/nbtree/nbtree.c       |  13 ++++
 src/backend/access/transam/xact.c        |   6 ++
 src/backend/commands/cluster.c           |  29 ++++++++
 src/backend/commands/copy.c              |  38 ++--------
 src/backend/commands/createas.c          |   5 +-
 src/backend/commands/matview.c           |   4 -
 src/backend/commands/tablecmds.c         |  10 +--
 src/backend/utils/cache/relcache.c       | 123 ++++++++++++++++++++++++++++++-
 src/include/access/amapi.h               |   6 ++
 src/include/access/genam.h               |   1 +
 src/include/access/heapam.h              |   1 -
 src/include/access/nbtree.h              |   1 +
 src/include/access/rewriteheap.h         |   2 +-
 src/include/access/tableam.h             |  47 ++++++------
 src/include/utils/rel.h                  |  35 ++++++++-
 src/include/utils/relcache.h             |   4 +
 24 files changed, 289 insertions(+), 106 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ae7b729edd..4b48f44949 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -125,6 +125,8 @@ brinhandler(PG_FUNCTION_ARGS)
     amroutine->aminitparallelscan = NULL;
     amroutine->amparallelrescan = NULL;
 
+    amroutine->amatcommitsync = NULL;
+
     PG_RETURN_POINTER(amroutine);
 }
 
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cf9699ad18..f4f0eebec5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -77,6 +77,8 @@ ginhandler(PG_FUNCTION_ARGS)
     amroutine->aminitparallelscan = NULL;
     amroutine->amparallelrescan = NULL;
 
+    amroutine->amatcommitsync = NULL;
+
     PG_RETURN_POINTER(amroutine);
 }
 
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 45c00aaa87..ebaf4495b8 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -99,6 +99,8 @@ gisthandler(PG_FUNCTION_ARGS)
     amroutine->aminitparallelscan = NULL;
     amroutine->amparallelrescan = NULL;
 
+    amroutine->amatcommitsync = NULL;
+
     PG_RETURN_POINTER(amroutine);
 }
 
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e9f2c84af1..ce7ac58204 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -98,6 +98,8 @@ hashhandler(PG_FUNCTION_ARGS)
     amroutine->aminitparallelscan = NULL;
     amroutine->amparallelrescan = NULL;
 
+    amroutine->amatcommitsync = NULL;
+
     PG_RETURN_POINTER(amroutine);
 }
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6c342635e8..642e7d0cc5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1950,7 +1950,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (RelationNeedsWAL(relation))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2133,7 +2133,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+    needwal = RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -8906,10 +8906,6 @@ heap2_redo(XLogReaderState *record)
 void
 heap_sync(Relation rel)
 {
-    /* non-WAL-logged tables never need fsync */
-    if (!RelationNeedsWAL(rel))
-        return;
-
     /* main heap */
     FlushRelationBuffers(rel);
     /* FlushRelationBuffers will have opened rd_smgr */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a4a28e88ec..17126e599b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -559,15 +559,14 @@ tuple_lock_retry:
     return result;
 }
 
+/* ------------------------------------------------------------------------
+ * WAL-skipping related routine
+ * ------------------------------------------------------------------------
+ */
 static void
-heapam_finish_bulk_insert(Relation relation, int options)
+heapam_at_commit_sync(Relation relation)
 {
-    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway / don't go through tableam)
-     */
-    if (options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(relation);
+    heap_sync(relation);
 }
 
 
@@ -702,7 +701,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     IndexScanDesc indexScan;
     TableScanDesc tableScan;
     HeapScanDesc heapScan;
-    bool        use_wal;
     bool        is_system_catalog;
     Tuplesortstate *tuplesort;
     TupleDesc    oldTupDesc = RelationGetDescr(OldHeap);
@@ -716,12 +714,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     /* Remember if it's a system catalog */
     is_system_catalog = IsSystemRelation(OldHeap);
 
-    /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a WAL-logged rel.
-     */
-    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
     /* use_wal off requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
 
@@ -732,7 +724,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
     /* Initialize the rewrite operation */
     rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
-                                 *multi_cutoff, use_wal);
+                                 *multi_cutoff);
 
 
     /* Set up sorting if wanted */
@@ -2626,7 +2618,7 @@ static const TableAmRoutine heapam_methods = {
     .tuple_delete = heapam_tuple_delete,
     .tuple_update = heapam_tuple_update,
     .tuple_lock = heapam_tuple_lock,
-    .finish_bulk_insert = heapam_finish_bulk_insert,
+    .at_commit_sync = heapam_at_commit_sync,
 
     .tuple_fetch_row_version = heapam_fetch_row_version,
     .tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 131ec7b8d7..617eec582b 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -144,7 +144,6 @@ typedef struct RewriteStateData
     Page        rs_buffer;        /* page currently being built */
     BlockNumber rs_blockno;        /* block where page will go */
     bool        rs_buffer_valid;    /* T if any tuples in buffer */
-    bool        rs_use_wal;        /* must we WAL-log inserts? */
     bool        rs_logical_rewrite; /* do we need to do logical rewriting */
     TransactionId rs_oldest_xmin;    /* oldest xmin used by caller to determine
                                      * tuple visibility */
@@ -245,8 +244,7 @@ static void logical_end_heap_rewrite(RewriteState state);
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-                   TransactionId freeze_xid, MultiXactId cutoff_multi,
-                   bool use_wal)
+                   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
     RewriteState state;
     MemoryContext rw_cxt;
@@ -271,7 +269,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     /* new_heap needn't be empty, just locked */
     state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
     state->rs_buffer_valid = false;
-    state->rs_use_wal = use_wal;
     state->rs_oldest_xmin = oldest_xmin;
     state->rs_freeze_xid = freeze_xid;
     state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +327,7 @@ end_heap_rewrite(RewriteState state)
     /* Write the last page, if any */
     if (state->rs_buffer_valid)
     {
-        if (state->rs_use_wal)
+        if (RelationNeedsWAL(state->rs_new_rel))
             log_newpage(&state->rs_new_rel->rd_node,
                         MAIN_FORKNUM,
                         state->rs_blockno,
@@ -654,9 +651,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int            options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
@@ -695,7 +689,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
             /* Doesn't fit, so write out the existing page */
 
             /* XLOG stuff */
-            if (state->rs_use_wal)
+            if (RelationNeedsWAL(state->rs_new_rel))
                 log_newpage(&state->rs_new_rel->rd_node,
                             MAIN_FORKNUM,
                             state->rs_blockno,
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index aefdd2916d..ade721a383 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
  *        index_can_return    - does index support index-only scans?
  *        index_getprocid - get a support procedure OID
  *        index_getprocinfo - get a support procedure's lookup info
+ *        index_at_commit_sync - perform at_commit_sync
  *
  * NOTES
  *        This file contains the index_ routines which used
@@ -837,6 +838,23 @@ index_getprocinfo(Relation irel,
     return locinfo;
 }
 
+/* ----------------
+ *        index_at_commit_sync
+ *
+ *  An index AM that defines this interface can allow derived objects created
+ *  in the current transaction to skip WAL-logging. This routine is called
+ *  commit-time and the AM must flush buffer and sync the underlying storage.
+ *
+ *  Optional interface
+ *  ----------------
+ */
+void
+index_at_commit_sync(Relation irel)
+{
+    if (irel->rd_indam && irel->rd_indam->amatcommitsync)
+        irel->rd_indam->amatcommitsync(irel);
+}
+
 /* ----------------
  *        index_store_float8_orderby_distances
  *
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 85e54ac44b..695b058b85 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -147,6 +147,8 @@ bthandler(PG_FUNCTION_ARGS)
     amroutine->aminitparallelscan = btinitparallelscan;
     amroutine->amparallelrescan = btparallelrescan;
 
+    amroutine->amatcommitsync = btatcommitsync;
+
     PG_RETURN_POINTER(amroutine);
 }
 
@@ -1385,3 +1387,14 @@ btcanreturn(Relation index, int attno)
 {
     return true;
 }
+
+/*
+ *    btatcommitsync() -- Perform at-commit sync of WAL-skipped index
+ */
+void
+btatcommitsync(Relation index)
+{
+    FlushRelationBuffers(index);
+    smgrimmedsync(index->rd_smgr, MAIN_FORKNUM);
+}
+
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index f1108ccc8b..0670985bc2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2120,6 +2120,9 @@ CommitTransaction(void)
     if (!is_parallel_worker)
         PreCommit_CheckForSerializationFailure();
 
+    /* Sync WAL-skipped relations */
+    PreCommit_RelationSync();
+
     /*
      * Insert notifications sent by NOTIFY commands into the queue.  This
      * should be late in the pre-commit sequence to minimize time spent
@@ -2395,6 +2398,9 @@ PrepareTransaction(void)
                 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                  errmsg("cannot PREPARE a transaction that has manipulated logical replication workers")));
 
+    /* Sync WAL-skipped relations */
+    PreCommit_RelationSync();
+
     /* Prevent cancel/die interrupt while cleaning up */
     HOLD_INTERRUPTS();
 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index ebaec4f8dd..504a04104f 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1034,12 +1034,41 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
 
     if (OidIsValid(relfilenode1) && OidIsValid(relfilenode2))
     {
+        Relation rel1;
+        Relation rel2;
+
         /*
          * Normal non-mapped relations: swap relfilenodes, reltablespaces,
          * relpersistence
          */
         Assert(!target_is_pg_class);
 
+        /* Update creation subid hints of relcache */
+        rel1 = relation_open(r1, ExclusiveLock);
+        rel2 = relation_open(r2, ExclusiveLock);
+
+        /*
+         * New relation's relfilenode is created in the current transaction
+         * and used as old ralation's new relfilenode. So its
+         * newRelfilenodeSubid as new relation's createSubid. We don't fix
+         * rel2 since it would be deleted soon.
+         */
+        Assert(rel2->rd_createSubid != InvalidSubTransactionId);
+        rel1->rd_newRelfilenodeSubid = rel2->rd_createSubid;
+
+        /* record the first relfilenode change in the current transaction */
+        if (rel1->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+        {
+            rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
+            /* Flag the old relation as needing eoxact cleanup */
+            RelationEOXactListAdd(rel1);
+        }
+
+        relation_close(rel1, ExclusiveLock);
+        relation_close(rel2, ExclusiveLock);
+
+        /* swap relfilenodes, reltablespaces, relpersistence */
         swaptemp = relform1->relfilenode;
         relform1->relfilenode = relform2->relfilenode;
         relform2->relfilenode = swaptemp;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b00891ffd2..77608c09c3 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2720,28 +2720,9 @@ CopyFrom(CopyState cstate)
      * If it does commit, we'll have done the table_finish_bulk_insert() at
      * the bottom of this routine first.
      *
-     * As mentioned in comments in utils/rel.h, the in-same-transaction test
-     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
-     * can be cleared before the end of the transaction. The exact case is
-     * when a relation sets a new relfilenode twice in same transaction, yet
-     * the second one fails in an aborted subtransaction, e.g.
-     *
-     * BEGIN;
-     * TRUNCATE t;
-     * SAVEPOINT save;
-     * TRUNCATE t;
-     * ROLLBACK TO save;
-     * COPY ...
-     *
-     * Also, if the target file is new-in-transaction, we assume that checking
-     * FSM for free space is a waste of time, even if we must use WAL because
-     * of archiving.  This could possibly be wrong, but it's unlikely.
-     *
-     * The comments for table_insert and RelationGetBufferForTuple specify that
-     * skipping WAL logging is only safe if we ensure that our tuples do not
-     * go into pages containing tuples from any other transactions --- but this
-     * must be the case if we have a new table or new relfilenode, so we need
-     * no additional work to enforce that.
+     * If the target file is new-in-transaction, we assume that checking FSM
+     * for free space is a waste of time, even if we must use WAL because of
+     * archiving.  This could possibly be wrong, but it's unlikely.
      *
      * We currently don't support this optimization if the COPY target is a
      * partitioned table as we currently only lazily initialize partition
@@ -2757,15 +2738,14 @@ CopyFrom(CopyState cstate)
      * are not supported as per the description above.
      *----------
      */
-    /* createSubid is creation check, newRelfilenodeSubid is truncation check */
+    /*
+     * createSubid is creation check, firstRelfilenodeSubid is truncation and
+     * cluster check. Partitioned table doesn't have storage.
+     */
     if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
         (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
-         cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
-    {
+         cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
         ti_options |= TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
-    }
 
     /*
      * Optimize if new relfilenode was created in this subxact or one of its
@@ -3364,8 +3344,6 @@ CopyFrom(CopyState cstate)
 
     FreeExecutorState(estate);
 
-    table_finish_bulk_insert(cstate->rel, ti_options);
-
     return processed;
 }
 
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 43c2fa9124..859b869b0d 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+    myState->ti_options = TABLE_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    table_finish_bulk_insert(myState->rel, myState->ti_options);
-
     /* close rel, but keep lock until commit */
     table_close(myState->rel, NoLock);
     myState->rel = NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index dc2940cd4e..583c542121 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * replication is in use. We can skip the FSM in any case.
      */
     myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
-    if (!XLogIsNeeded())
-        myState->ti_options |= TABLE_INSERT_SKIP_WAL;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    table_finish_bulk_insert(myState->transientrel, myState->ti_options);
-
     /* close transientrel, but keep lock until commit */
     table_close(myState->transientrel, NoLock);
     myState->transientrel = NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 602a8dbd1c..f63662f4ed 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4733,9 +4733,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 
     /*
      * Prepare a BulkInsertState and options for table_insert. Because we're
-     * building a new heap, we can skip WAL-logging and fsync it to disk at
-     * the end instead (unless WAL-logging is required for archiving or
-     * streaming replication). The FSM is empty too, so don't bother using it.
+     * building a new heap, the underlying table AM can skip WAL-logging and
+     * fsync the relation to disk at the end of the current transaction
+     * instead. The FSM is empty too, so don't bother using it.
      */
     if (newrel)
     {
@@ -4743,8 +4743,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         ti_options = TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
     }
     else
     {
@@ -5028,8 +5026,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
     {
         FreeBulkInsertState(bistate);
 
-        table_finish_bulk_insert(newrel, ti_options);
-
         table_close(newrel, NoLock);
     }
 }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 2b992d7832..cd418c5f80 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -177,6 +177,13 @@ static bool eoxact_list_overflowed = false;
             eoxact_list_overflowed = true; \
     } while (0)
 
+/* Function version of the macro above */
+void
+RelationEOXactListAdd(Relation rel)
+{
+    EOXactListAdd(rel);
+}
+
 /*
  * EOXactTupleDescArray stores TupleDescs that (might) need AtEOXact
  * cleanup work.  The array expands as needed; there is no hashtable because
@@ -263,6 +270,7 @@ static void RelationReloadIndexInfo(Relation relation);
 static void RelationReloadNailed(Relation relation);
 static void RelationFlushRelation(Relation relation);
 static void RememberToFreeTupleDescAtEOX(TupleDesc td);
+static void PreCommit_SyncOneRelation(Relation relation);
 static void AtEOXact_cleanup(Relation relation, bool isCommit);
 static void AtEOSubXact_cleanup(Relation relation, bool isCommit,
                                 SubTransactionId mySubid, SubTransactionId parentSubid);
@@ -1512,6 +1520,10 @@ RelationInitIndexAccessInfo(Relation relation)
     relation->rd_exclprocs = NULL;
     relation->rd_exclstrats = NULL;
     relation->rd_amcache = NULL;
+
+    /* set AM-type-independent WAL-skip flag if this am supports it */
+    if (relation->rd_indam->amatcommitsync != NULL)
+        relation->rd_can_skipwal = true;
 }
 
 /*
@@ -1781,6 +1793,10 @@ RelationInitTableAccessMethod(Relation relation)
      * Now we can fetch the table AM's API struct
      */
     InitTableAmRoutine(relation);
+
+    /* set AM-type-independent WAL-skip flag if this am supports it */
+    if (relation->rd_tableam && relation->rd_tableam->at_commit_sync)
+        relation->rd_can_skipwal = true;
 }
 
 /*
@@ -2594,6 +2610,7 @@ RelationClearRelation(Relation relation, bool rebuild)
         /* creation sub-XIDs must be preserved */
         SWAPFIELD(SubTransactionId, rd_createSubid);
         SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+        SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
         /* un-swap rd_rel pointers, swap contents instead */
         SWAPFIELD(Form_pg_class, rd_rel);
         /* ... but actually, we don't have to update newrel->rd_rel */
@@ -2661,7 +2678,7 @@ static void
 RelationFlushRelation(Relation relation)
 {
     if (relation->rd_createSubid != InvalidSubTransactionId ||
-        relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
     {
         /*
          * New relcache entries are always rebuilt, not flushed; else we'd
@@ -2801,7 +2818,7 @@ RelationCacheInvalidate(void)
          * pending invalidations.
          */
         if (relation->rd_createSubid != InvalidSubTransactionId ||
-            relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+            relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
             continue;
 
         relcacheInvalsReceived++;
@@ -2913,6 +2930,93 @@ RememberToFreeTupleDescAtEOX(TupleDesc td)
     EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
 }
 
+/*
+ * PreCommit_RelationSync
+ *
+ *    Sync relations that were WAL-skipped in this transaction .
+ *
+ * Access method may have skipped WAL-logging for relations created in the
+ * current transaction. Such relations need to be synced at top-transaction's
+ * commit.  The operation requires active transaction state, so separately
+ * performed from AtEOXact_RelationCache.
+ */
+void
+PreCommit_RelationSync(void)
+{
+    HASH_SEQ_STATUS status;
+    RelIdCacheEnt *idhentry;
+    int            i;
+
+    /* See AtEOXact_RelationCache about eoxact_list */
+    if (eoxact_list_overflowed)
+    {
+        hash_seq_init(&status, RelationIdCache);
+        while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+            PreCommit_SyncOneRelation(idhentry->reldesc);
+    }
+    else
+    {
+        for (i = 0; i < eoxact_list_len; i++)
+        {
+            idhentry = (RelIdCacheEnt *) hash_search(RelationIdCache,
+                                                     (void *) &eoxact_list[i],
+                                                     HASH_FIND,
+                                                     NULL);
+
+            if (idhentry != NULL)
+                PreCommit_SyncOneRelation(idhentry->reldesc);
+        }
+    }
+}
+
+/*
+ * PreCommit_SyncOneRelation
+ *
+ *    Sync one relation if needed
+ *
+ * NB: this processing must be idempotent, because EOXactListAdd() doesn't
+ * bother to prevent duplicate entries in eoxact_list[].
+ */
+static void
+PreCommit_SyncOneRelation(Relation relation)
+{
+    HeapTuple reltup;
+    Form_pg_class relform;
+
+    /* return immediately if no need for sync */
+    if (!RelationNeedsAtCommitSync(relation))
+        return;
+
+    /*
+     * We are about to sync a WAL-skipped relation. The relfilenode here is
+     * wrong if the last sub transaction that created new relfilenode was
+     * aborted.
+     */
+    if (relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId &&
+        relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+    {
+        reltup = SearchSysCache1(RELOID, ObjectIdGetDatum(relation->rd_id));
+        if (!HeapTupleIsValid(reltup))
+            elog(ERROR, "cache lookup failed for relation %u", relation->rd_id);
+        relform = (Form_pg_class) GETSTRUCT(reltup);
+        relation->rd_rel->relfilenode = relform->relfilenode;
+        relation->rd_node.relNode = relform->relfilenode;
+        ReleaseSysCache(reltup);
+    }
+
+    if (relation->rd_tableam != NULL)
+        table_at_commit_sync(relation);
+    else
+    {
+        Assert (relation->rd_indam != NULL);
+        table_at_commit_sync(relation);
+    }
+
+    /* We have synced the files, forget about relfilenode change */
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+}
+
 /*
  * AtEOXact_RelationCache
  *
@@ -3058,6 +3162,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
      * Likewise, reset the hint about the relfilenode being new.
      */
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 }
 
 /*
@@ -3149,7 +3254,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
     }
 
     /*
-     * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+     * Likewise, update or drop any new-relfilenode-in-subtransaction hints.
      */
     if (relation->rd_newRelfilenodeSubid == mySubid)
     {
@@ -3158,6 +3263,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         else
             relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     }
+
+    if (relation->rd_firstRelfilenodeSubid == mySubid)
+    {
+        if (isCommit)
+            relation->rd_firstRelfilenodeSubid = parentSubid;
+        else
+            relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    }
 }
 
 
@@ -3440,6 +3553,10 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
      */
     RelationDropStorage(relation);
 
+    /* Record the subxid where the first relfilenode change happen */
+    if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
     /*
      * Create storage for the main fork of the new relfilenode.  If it's a
      * table-like object, call into the table AM to do so, which'll also
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6e3db06eed..75159d10d4 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -156,6 +156,9 @@ typedef void (*aminitparallelscan_function) (void *target);
 /* (re)start parallel index scan */
 typedef void (*amparallelrescan_function) (IndexScanDesc scan);
 
+/* sync relation at commit after skipping WAL-logging */
+typedef void (*amatcommitsync_function) (Relation indexRelation);
+    
 /*
  * API struct for an index AM.  Note this must be stored in a single palloc'd
  * chunk of memory.
@@ -230,6 +233,9 @@ typedef struct IndexAmRoutine
     amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
     aminitparallelscan_function aminitparallelscan; /* can be NULL */
     amparallelrescan_function amparallelrescan; /* can be NULL */
+
+    /* interface function to do at-commit sync after skipping WAL-logging */
+    amatcommitsync_function amatcommitsync; /* can be NULL */;
 } IndexAmRoutine;
 
 
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8c053be2ca..8e661edfdd 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -177,6 +177,7 @@ extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
                                     uint16 procnum);
 extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
                                    uint16 procnum);
+extern void index_at_commit_sync(Relation irel);
 extern void index_store_float8_orderby_distances(IndexScanDesc scan,
                                                  Oid *orderByTypes, double *distances,
                                                  bool recheckOrderBy);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b88bd8a4d7..187c668878 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a3583f225b..f33d2b38b5 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -717,6 +717,7 @@ extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info,
 extern IndexBulkDeleteResult *btvacuumcleanup(IndexVacuumInfo *info,
                                               IndexBulkDeleteResult *stats);
 extern bool btcanreturn(Relation index, int attno);
+extern void btatcommitsync(Relation index);
 
 /*
  * prototypes for internal functions in nbtree.c
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
                                        TransactionId OldestXmin, TransactionId FreezeXid,
-                                       MultiXactId MultiXactCutoff, bool use_wal);
+                                       MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
                                HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 6f1cd382d8..759a1e806d 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -409,19 +409,15 @@ typedef struct TableAmRoutine
                                TM_FailureData *tmfd);
 
     /*
-     * Perform operations necessary to complete insertions made via
-     * tuple_insert and multi_insert with a BulkInsertState specified. This
-     * may for example be used to flush the relation, when the
-     * TABLE_INSERT_SKIP_WAL option was used.
+     * Sync relation at commit-time after skipping WAL-logging.
      *
-     * Typically callers of tuple_insert and multi_insert will just pass all
-     * the flags that apply to them, and each AM has to decide which of them
-     * make sense for it, and then only take actions in finish_bulk_insert for
-     * those flags, and ignore others.
+     *  A table AM may skip WAL-logging for relations created in the current
+     *  transaction. This routine is called commit-time and the table AM
+     *  must flush buffer and sync the underlying storage.
      *
      * Optional callback.
      */
-    void        (*finish_bulk_insert) (Relation rel, int options);
+    void        (*at_commit_sync) (Relation rel);
 
 
     /* ------------------------------------------------------------------------
@@ -1089,10 +1085,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * behaviour of the AM. Several options might be ignored by AMs not supporting
  * them.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.  It's
@@ -1112,10 +1104,12 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * Note that most of these options will be applied when inserting into the
  * heap's TOAST table, too, if the tuple requires any out-of-line data.
  *
+ * The core function RelationNeedsWAL() considers skipping WAL-logging on
+ * relations created in-transaction or truncated when the AM provides
+ * at_commit_sync interface.
  *
  * The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
  *
  * On return the slot's tts_tid and tts_tableOid are updated to reflect the
  * insertion. But note that any toasting of fields within the slot is NOT
@@ -1205,6 +1199,8 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  * delete it.  Failure return codes are TM_SelfModified, TM_Updated, and
  * TM_BeingModified (the last only possible if wait == false).
  *
+ * See table_insert about skipping WAL-logging feature.
+ *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
  * t_xmax, and, if possible, and, if possible, t_cmax.  See comments for
  * struct TM_FailureData for additional info.
@@ -1249,6 +1245,8 @@ table_delete(Relation rel, ItemPointer tid, CommandId cid,
  * update was done.  However, any TOAST changes in the new tuple's
  * data are not reflected into *newtup.
  *
+ * See table_insert about skipping WAL-logging feature.
+ *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
  * t_xmax, and, if possible, t_cmax.  See comments for struct TM_FailureData
  * for additional info.
@@ -1310,20 +1308,23 @@ table_lock_tuple(Relation rel, ItemPointer tid, Snapshot snapshot,
 }
 
 /*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
+ * Sync relation at commit-time if needed.
+ *
+ *  A table AM that defines this interface can allow derived objects created
+ *  in the current transaction to skip WAL-logging. This routine is called
+ *  commit-time and the table AM must flush buffer and sync the underlying
+ *  storage.
+ *
+ * Optional callback.
  */
 static inline void
-table_finish_bulk_insert(Relation rel, int options)
+table_at_commit_sync(Relation rel)
 {
     /* optional callback */
-    if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
-        rel->rd_tableam->finish_bulk_insert(rel, options);
+    if (rel->rd_tableam && rel->rd_tableam->at_commit_sync)
+        rel->rd_tableam->at_commit_sync(rel);
 }
 
-
 /* ------------------------------------------------------------------------
  * DDL related functionality.
  * ------------------------------------------------------------------------
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d7f33abce3..6a3ef80575 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -63,6 +63,7 @@ typedef struct RelationData
     bool        rd_indexvalid;    /* is rd_indexlist valid? (also rd_pkindex and
                                  * rd_replidindex) */
     bool        rd_statvalid;    /* is rd_statlist valid? */
+    bool        rd_can_skipwal; /* underlying AM allow WAL-logging?  */
 
     /*
      * rd_createSubid is the ID of the highest subtransaction the rel has
@@ -76,10 +77,17 @@ typedef struct RelationData
      * transaction, with one of them occurring in a subsequently aborted
      * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
      * ROLLBACK TO save; -- rd_newRelfilenode is now forgotten
+     * rd_firstRelfilenodeSubid is the ID of the hightest subtransaction the
+     * relfilenode change has took place first in the current
+     * transaction. This won't be forgotten as newRelfilenodeSubid is. A valid
+     * OID means that the currently active relfilenode is transaction-local
+     * and no-need for WAL-logging.
      */
     SubTransactionId rd_createSubid;    /* rel was created in current xact */
     SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
                                                  * current xact */
+    SubTransactionId rd_firstRelfilenodeSubid;    /* new relfilenode assigned
+                                                 * first in current xact */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
@@ -512,9 +520,32 @@ typedef struct ViewOptions
 /*
  * RelationNeedsWAL
  *        True if relation needs WAL.
+ *
+ * If underlying AM supports WAL-skipping feature, returns false if wal_level
+ * = minimal and this relation is created or truncated in the current
+ * transaction.
  */
-#define RelationNeedsWAL(relation) \
-    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+#define RelationNeedsWAL(relation)                                        \
+    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&    \
+     (!relation->rd_can_skipwal ||                                        \
+      XLogIsNeeded() ||                                                    \
+      (relation->rd_createSubid == InvalidSubTransactionId &&            \
+       relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
+
+/*
+ * RelationNeedsAtCommitSync
+ *      True if relation needs at-commit sync
+ *
+ * This macro is used in few places but written here because it is tightly
+ * related with RelationNeedsWAL() above. We don't need to sync local or temp
+ * relations.
+ */
+#define RelationNeedsAtCommitSync(relation) \
+    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&    \
+     !(!relation->rd_can_skipwal ||                                        \
+       XLogIsNeeded() ||                                                \
+       (relation->rd_createSubid == InvalidSubTransactionId &&            \
+        relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
 
 /*
  * RelationUsesLocalBuffers
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index d9c10ffcba..b681d3afb2 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -120,6 +120,7 @@ extern void RelationCacheInvalidate(void);
 
 extern void RelationCloseSmgrByOid(Oid relationId);
 
+extern void PreCommit_RelationSync(void);
 extern void AtEOXact_RelationCache(bool isCommit);
 extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid,
                                       SubTransactionId parentSubid);
@@ -138,4 +139,7 @@ extern bool criticalRelcachesBuilt;
 /* should be used only by relcache.c and postinit.c */
 extern bool criticalSharedRelcachesBuilt;
 
+/* add rel to eoxact cleanup list */
+void RelationEOXactListAdd(Relation rel);
+
 #endif                            /* RELCACHE_H */
-- 
2.16.3

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

25 мая 2019 г., 05:33:32

On Mon, May 20, 2019 at 03:54:30PM +0900, Kyotaro HORIGUCHI wrote:
> Following this direction, the attached PoC works *at least for*
> the wal_optimization TAP tests, but doing pending flush not in
> smgr but in relcache.

This task, syncing files created in the current transaction, is not the kind
of task normally assigned to a cache.  We already have a module, storage.c,
that maintains state about files created in the current transaction.  Why did
you use relcache instead of storage.c?

On Tue, May 21, 2019 at 09:29:48PM +0900, Kyotaro HORIGUCHI wrote:
> This is a tidier version of the patch.

> - Move the substantial work to table/index AMs.
> 
>   Each AM can decide whether to support WAL skip or not.
>   Currently heap and nbtree support it.

Why would an AM find it important to disable WAL skip?

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro HORIGUCHI

Дата:

27 мая 2019 г., 08:08:26

Thanks for the comment!

At Fri, 24 May 2019 19:33:32 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190525023332.GE1624191@rfd.leadboat.com>
> On Mon, May 20, 2019 at 03:54:30PM +0900, Kyotaro HORIGUCHI wrote:
> > Following this direction, the attached PoC works *at least for*
> > the wal_optimization TAP tests, but doing pending flush not in
> > smgr but in relcache.
> 
> This task, syncing files created in the current transaction, is not the kind
> of task normally assigned to a cache.  We already have a module, storage.c,
> that maintains state about files created in the current transaction.  Why did
> you use relcache instead of storage.c?

The reason was at-commit sync needs buffer flush beforehand. But
FlushRelationBufferWithoutRelCache() in v11 can do
that. storage.c is reasonable as the place.

> On Tue, May 21, 2019 at 09:29:48PM +0900, Kyotaro HORIGUCHI wrote:
> > This is a tidier version of the patch.
> 
> > - Move the substantial work to table/index AMs.
> > 
> >   Each AM can decide whether to support WAL skip or not.
> >   Currently heap and nbtree support it.
> 
> Why would an AM find it important to disable WAL skip?

The reason is currently it's AM's responsibility to decide
whether to skip WAL or not.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

28 мая 2019 г., 02:02:25

On Mon, May 27, 2019 at 02:08:26PM +0900, Kyotaro HORIGUCHI wrote:
> At Fri, 24 May 2019 19:33:32 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190525023332.GE1624191@rfd.leadboat.com>
> > On Mon, May 20, 2019 at 03:54:30PM +0900, Kyotaro HORIGUCHI wrote:
> > > Following this direction, the attached PoC works *at least for*
> > > the wal_optimization TAP tests, but doing pending flush not in
> > > smgr but in relcache.
> > 
> > This task, syncing files created in the current transaction, is not the kind
> > of task normally assigned to a cache.  We already have a module, storage.c,
> > that maintains state about files created in the current transaction.  Why did
> > you use relcache instead of storage.c?
> 
> The reason was at-commit sync needs buffer flush beforehand. But
> FlushRelationBufferWithoutRelCache() in v11 can do
> that. storage.c is reasonable as the place.

Okay.  I do want this to work in 9.5 and later, but I'm not aware of a reason
relcache.c would be a better code location in older branches.  Unless you
think of a reason to prefer relcache.c, please use storage.c.

> > On Tue, May 21, 2019 at 09:29:48PM +0900, Kyotaro HORIGUCHI wrote:
> > > This is a tidier version of the patch.
> > 
> > > - Move the substantial work to table/index AMs.
> > > 
> > >   Each AM can decide whether to support WAL skip or not.
> > >   Currently heap and nbtree support it.
> > 
> > Why would an AM find it important to disable WAL skip?
> 
> The reason is currently it's AM's responsibility to decide
> whether to skip WAL or not.

I see.  Skipping the sync would be a mere optimization; no AM would require it
for correctness.  An AM might want RelationNeedsWAL() to keep returning true
despite the sync happening, perhaps because it persists data somewhere other
than the forks of pg_class.relfilenode.  Since the index and table APIs
already assume one relfilenode captures all persistent data, I'm not seeing a
use case for an AM overriding this behavior.  Let's take away the AM's
responsibility for this decision, making the system simpler.  A future patch
could let AM code decide, if someone find a real-world use case for
AM-specific logic around when to skip WAL.

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Amit Kapila

Дата:

29 июня 2019 г., 02:46:34

On Tue, May 28, 2019 at 4:33 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Mon, May 27, 2019 at 02:08:26PM +0900, Kyotaro HORIGUCHI wrote:
> > At Fri, 24 May 2019 19:33:32 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190525023332.GE1624191@rfd.leadboat.com>
> > > On Mon, May 20, 2019 at 03:54:30PM +0900, Kyotaro HORIGUCHI wrote:
> > > > Following this direction, the attached PoC works *at least for*
> > > > the wal_optimization TAP tests, but doing pending flush not in
> > > > smgr but in relcache.
> > >
> > > This task, syncing files created in the current transaction, is not the kind
> > > of task normally assigned to a cache.  We already have a module, storage.c,
> > > that maintains state about files created in the current transaction.  Why did
> > > you use relcache instead of storage.c?
> >
> > The reason was at-commit sync needs buffer flush beforehand. But
> > FlushRelationBufferWithoutRelCache() in v11 can do
> > that. storage.c is reasonable as the place.
>
> Okay.  I do want this to work in 9.5 and later, but I'm not aware of a reason
> relcache.c would be a better code location in older branches.  Unless you
> think of a reason to prefer relcache.c, please use storage.c.
>
> > > On Tue, May 21, 2019 at 09:29:48PM +0900, Kyotaro HORIGUCHI wrote:
> > > > This is a tidier version of the patch.
> > >
> > > > - Move the substantial work to table/index AMs.
> > > >
> > > >   Each AM can decide whether to support WAL skip or not.
> > > >   Currently heap and nbtree support it.
> > >
> > > Why would an AM find it important to disable WAL skip?
> >
> > The reason is currently it's AM's responsibility to decide
> > whether to skip WAL or not.
>
> I see.  Skipping the sync would be a mere optimization; no AM would require it
> for correctness.  An AM might want RelationNeedsWAL() to keep returning true
> despite the sync happening, perhaps because it persists data somewhere other
> than the forks of pg_class.relfilenode.  Since the index and table APIs
> already assume one relfilenode captures all persistent data, I'm not seeing a
> use case for an AM overriding this behavior.  Let's take away the AM's
> responsibility for this decision, making the system simpler.  A future patch
> could let AM code decide, if someone find a real-world use case for
> AM-specific logic around when to skip WAL.
>

It seems there is some feedback for this patch and the CF is going to
start in 2 days.  Are you planning to work on this patch for next CF,
if not then it is better to bump this?  It is not a good idea to see
the patch in "waiting on author" in the beginning of CF unless the
author is actively working on the patch and is going to produce a
version in next few days.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

10 июля 2019 г., 07:19:14

Hello. Rebased the patch to master(bd56cd75d2).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From ac52e2c1c56a96c1745149ff4220a3a116d6c811 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/3] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 291 ++++++++++++++++++++++++++++++++
 1 file changed, 291 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..4fa8be728e
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+    my($node, $test_name) = @_;
+
+    my $db_oid = $node->safe_psql('postgres',
+       "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+    my $prefix = "base/$db_oid/";
+    my $filepaths_referenced = $node->safe_psql('postgres', "
+       SELECT pg_relation_filepath(oid) FROM pg_class
+       WHERE reltablespace = 0 and relpersistence <> 't' and
+       pg_relation_filepath(oid) IS NOT NULL;");
+    is_deeply([sort(map { "$prefix$_" }
+                    grep(/^[0-9]+$/,
+                         slurp_dir($node->data_dir . "/$prefix")))],
+              [sort split /\n/, $filepaths_referenced],
+              $test_name);
+    return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    # Primary needs to have wal_level = minimal here
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+    $node->start;
+
+    # Setup
+    my $tablespace_dir = $node->basedir . '/tablespace_other';
+    mkdir ($tablespace_dir);
+    $tablespace_dir = TestLib::real_dir($tablespace_dir);
+    $node->safe_psql('postgres',
+       "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+    # Test direct truncation optimization.  No tuples
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test1 (id serial PRIMARY KEY);
+        TRUNCATE test1;
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+    is($result, qq(0),
+       "wal_level = $wal_level, optimized truncation with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2 (id serial PRIMARY KEY);
+        INSERT INTO test2 VALUES (DEFAULT);
+        TRUNCATE test2;
+        INSERT INTO test2 VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with inserted table");
+
+    # Data file for COPY query in follow-up tests.
+    my $basedir = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after the
+    # truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3;
+        COPY test3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+    is($result, qq(3),
+       "wal_level = $wal_level, optimized truncation with copied table");
+
+    # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+        COPY test3a FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # in different subtransaction patterns
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a2;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; RELEASE s;
+        COPY test3a2 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a3;
+        SAVEPOINT s;
+            ALTER TABLE test3a3 SET TABLESPACE other;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            ROLLBACK TO s2;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            RELEASE s2;
+        ROLLBACK TO s;
+        COPY test3a3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        COPY test3b FROM '$copy_file' DELIMITER ',';  -- set sync_above
+        UPDATE test3b SET id2 = id2 + 1;
+        DELETE FROM test3b;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+    is($result, qq(0),
+       "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+    # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test4;
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COPY test4 FROM '$copy_file' DELIMITER ',';
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+    is($result, qq(5),
+       "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test5 VALUES (DEFAULT, 1);
+        COPY test5 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER test6_before_row_insert
+          BEFORE INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+        CREATE TRIGGER test6_after_row_insert
+          AFTER INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+        COPY test6 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+    is($result, qq(9),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER test7_before_stat_truncate
+          BEFORE TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+        CREATE TRIGGER test7_after_stat_truncate
+          AFTER TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+        INSERT INTO test7 VALUES (DEFAULT, 1);
+        TRUNCATE test7;
+        COPY test7 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test redo of temp table creation.
+    $node->safe_psql('postgres', "
+        CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+    $node->stop('immediate');
+    $node->start;
+
+    check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.16.3

From 4363a50092dc8aa536b24582a3160f4f47c85349 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@oss.ntt.co.jp>
Date: Mon, 27 May 2019 16:06:30 +0900
Subject: [PATCH 2/3] Fix WAL skipping feature

WAL-skipping operations mixed with WAL-logged operations can lead to
database corruption after a crash. This patch changes the WAL-skipping
feature so that no data modifcation is WAL-logged at all then sync
such relations at commit.
---
 src/backend/access/heap/heapam.c         |  4 +-
 src/backend/access/heap/heapam_handler.c | 22 +----------
 src/backend/access/heap/rewriteheap.c    | 13 ++-----
 src/backend/catalog/storage.c            | 64 +++++++++++++++++++++++++-------
 src/backend/commands/cluster.c           | 24 ++++++++++++
 src/backend/commands/copy.c              | 38 ++++---------------
 src/backend/commands/createas.c          |  5 +--
 src/backend/commands/matview.c           |  4 --
 src/backend/commands/tablecmds.c         | 10 ++---
 src/backend/storage/buffer/bufmgr.c      | 33 +++++++++++-----
 src/backend/utils/cache/relcache.c       | 16 ++++++--
 src/include/access/heapam.h              |  1 -
 src/include/access/rewriteheap.h         |  2 +-
 src/include/access/tableam.h             | 41 ++------------------
 src/include/storage/bufmgr.h             |  1 +
 src/include/utils/rel.h                  | 17 ++++++++-
 16 files changed, 148 insertions(+), 147 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d768b9b061..eca98fb063 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1941,7 +1941,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (RelationNeedsWAL(relation))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2124,7 +2124,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+    needwal = RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 09bc6fe98a..b9554f6064 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -556,18 +556,6 @@ tuple_lock_retry:
     return result;
 }
 
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
-    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway / don't go through tableam)
-     */
-    if (options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(relation);
-}
-
-
 /* ------------------------------------------------------------------------
  * DDL related callbacks for heap AM.
  * ------------------------------------------------------------------------
@@ -699,7 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     IndexScanDesc indexScan;
     TableScanDesc tableScan;
     HeapScanDesc heapScan;
-    bool        use_wal;
     bool        is_system_catalog;
     Tuplesortstate *tuplesort;
     TupleDesc    oldTupDesc = RelationGetDescr(OldHeap);
@@ -713,12 +700,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     /* Remember if it's a system catalog */
     is_system_catalog = IsSystemRelation(OldHeap);
 
-    /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a WAL-logged rel.
-     */
-    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
     /* use_wal off requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
 
@@ -729,7 +710,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
     /* Initialize the rewrite operation */
     rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
-                                 *multi_cutoff, use_wal);
+                                 *multi_cutoff);
 
 
     /* Set up sorting if wanted */
@@ -2517,7 +2498,6 @@ static const TableAmRoutine heapam_methods = {
     .tuple_delete = heapam_tuple_delete,
     .tuple_update = heapam_tuple_update,
     .tuple_lock = heapam_tuple_lock,
-    .finish_bulk_insert = heapam_finish_bulk_insert,
 
     .tuple_fetch_row_version = heapam_fetch_row_version,
     .tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 72a448ad31..992d4b9880 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -144,7 +144,6 @@ typedef struct RewriteStateData
     Page        rs_buffer;        /* page currently being built */
     BlockNumber rs_blockno;        /* block where page will go */
     bool        rs_buffer_valid;    /* T if any tuples in buffer */
-    bool        rs_use_wal;        /* must we WAL-log inserts? */
     bool        rs_logical_rewrite; /* do we need to do logical rewriting */
     TransactionId rs_oldest_xmin;    /* oldest xmin used by caller to determine
                                      * tuple visibility */
@@ -238,15 +237,13 @@ static void logical_end_heap_rewrite(RewriteState state);
  * oldest_xmin    xid used by the caller to determine which tuples are dead
  * freeze_xid    xid before which tuples will be frozen
  * min_multi    multixact before which multis will be removed
- * use_wal        should the inserts to the new heap be WAL-logged?
  *
  * Returns an opaque RewriteState, allocated in current memory context,
  * to be used in subsequent calls to the other functions.
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-                   TransactionId freeze_xid, MultiXactId cutoff_multi,
-                   bool use_wal)
+                   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
     RewriteState state;
     MemoryContext rw_cxt;
@@ -271,7 +268,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     /* new_heap needn't be empty, just locked */
     state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
     state->rs_buffer_valid = false;
-    state->rs_use_wal = use_wal;
     state->rs_oldest_xmin = oldest_xmin;
     state->rs_freeze_xid = freeze_xid;
     state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +326,7 @@ end_heap_rewrite(RewriteState state)
     /* Write the last page, if any */
     if (state->rs_buffer_valid)
     {
-        if (state->rs_use_wal)
+        if (RelationNeedsWAL(state->rs_new_rel))
             log_newpage(&state->rs_new_rel->rd_node,
                         MAIN_FORKNUM,
                         state->rs_blockno,
@@ -654,9 +650,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int            options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
@@ -695,7 +688,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
             /* Doesn't fit, so write out the existing page */
 
             /* XLOG stuff */
-            if (state->rs_use_wal)
+            if (RelationNeedsWAL(state->rs_new_rel))
                 log_newpage(&state->rs_new_rel->rd_node,
                             MAIN_FORKNUM,
                             state->rs_blockno,
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 3cc886f7fe..e4bcdc390f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -57,7 +57,8 @@ typedef struct PendingRelDelete
 {
     RelFileNode relnode;        /* relation that may need to be deleted */
     BackendId    backend;        /* InvalidBackendId if not a temp rel */
-    bool        atCommit;        /* T=delete at commit; F=delete at abort */
+    bool        atCommit;        /* T=work at commit; F=work at abort */
+    bool        dosync;            /* T=work is sync; F=work is delete */
     int            nestLevel;        /* xact nesting level of request */
     struct PendingRelDelete *next;    /* linked-list link */
 } PendingRelDelete;
@@ -114,10 +115,29 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     pending->relnode = rnode;
     pending->backend = backend;
     pending->atCommit = false;    /* delete if abort */
+    pending->dosync = false;
     pending->nestLevel = GetCurrentTransactionNestLevel();
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
+    /*
+     * We are going to skip WAL-logging for storage of persistent relations
+     * created in the current transaction when wal_level = minimal. The
+     * relation needs to be synced at commit.
+     */
+    if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+    {
+        pending = (PendingRelDelete *)
+            MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+        pending->relnode = rnode;
+        pending->backend = backend;
+        pending->atCommit = true;
+        pending->dosync = true;
+        pending->nestLevel = GetCurrentTransactionNestLevel();
+        pending->next = pendingDeletes;
+        pendingDeletes = pending;
+    }
+
     return srel;
 }
 
@@ -155,6 +175,7 @@ RelationDropStorage(Relation rel)
     pending->relnode = rel->rd_node;
     pending->backend = rel->rd_backend;
     pending->atCommit = true;    /* delete if commit */
+    pending->dosync = false;
     pending->nestLevel = GetCurrentTransactionNestLevel();
     pending->next = pendingDeletes;
     pendingDeletes = pending;
@@ -428,21 +449,34 @@ smgrDoPendingDeletes(bool isCommit)
             {
                 SMgrRelation srel;
 
-                srel = smgropen(pending->relnode, pending->backend);
-
-                /* allocate the initial array, or extend it, if needed */
-                if (maxrels == 0)
+                if (pending->dosync)
                 {
-                    maxrels = 8;
-                    srels = palloc(sizeof(SMgrRelation) * maxrels);
+                    /* Perform pending sync of WAL-skipped relation */
+                    FlushRelationBuffersWithoutRelcache(pending->relnode,
+                                                        false);
+                    srel = smgropen(pending->relnode, pending->backend);
+                    smgrimmedsync(srel, MAIN_FORKNUM);
+                    smgrclose(srel);
                 }
-                else if (maxrels <= nrels)
+                else
                 {
-                    maxrels *= 2;
-                    srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
-                }
+                    /* Collect pending deletions */
+                    srel = smgropen(pending->relnode, pending->backend);
 
-                srels[nrels++] = srel;
+                    /* allocate the initial array, or extend it, if needed */
+                    if (maxrels == 0)
+                    {
+                        maxrels = 8;
+                        srels = palloc(sizeof(SMgrRelation) * maxrels);
+                    }
+                    else if (maxrels <= nrels)
+                    {
+                        maxrels *= 2;
+                        srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+                    }
+
+                    srels[nrels++] = srel;
+                }
             }
             /* must explicitly free the list entry */
             pfree(pending);
@@ -489,8 +523,9 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     nrels = 0;
     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
     {
+        /* Pending syncs are excluded */
         if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
-            && pending->backend == InvalidBackendId)
+            && pending->backend == InvalidBackendId && !pending->dosync)
             nrels++;
     }
     if (nrels == 0)
@@ -502,8 +537,9 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     *ptr = rptr;
     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
     {
+        /* Pending syncs are excluded */
         if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
-            && pending->backend == InvalidBackendId)
+            && pending->backend == InvalidBackendId && !pending->dosync)
         {
             *rptr = pending->relnode;
             rptr++;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index ebaec4f8dd..6fc9d7d64e 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1034,12 +1034,36 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
 
     if (OidIsValid(relfilenode1) && OidIsValid(relfilenode2))
     {
+        Relation rel1;
+        Relation rel2;
+
         /*
          * Normal non-mapped relations: swap relfilenodes, reltablespaces,
          * relpersistence
          */
         Assert(!target_is_pg_class);
 
+        /* Update creation subid hints of relcache */
+        rel1 = relation_open(r1, ExclusiveLock);
+        rel2 = relation_open(r2, ExclusiveLock);
+
+        /*
+         * New relation's relfilenode is created in the current transaction
+         * and used as old ralation's new relfilenode. So its
+         * newRelfilenodeSubid as new relation's createSubid. We don't fix
+         * rel2 since it would be deleted soon.
+         */
+        Assert(rel2->rd_createSubid != InvalidSubTransactionId);
+        rel1->rd_newRelfilenodeSubid = rel2->rd_createSubid;
+
+        /* record the first relfilenode change in the current transaction */
+        if (rel1->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+            rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
+        relation_close(rel1, ExclusiveLock);
+        relation_close(rel2, ExclusiveLock);
+
+        /* swap relfilenodes, reltablespaces, relpersistence */
         swaptemp = relform1->relfilenode;
         relform1->relfilenode = relform2->relfilenode;
         relform2->relfilenode = swaptemp;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f1161f0fee..f4beff0001 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2722,28 +2722,9 @@ CopyFrom(CopyState cstate)
      * If it does commit, we'll have done the table_finish_bulk_insert() at
      * the bottom of this routine first.
      *
-     * As mentioned in comments in utils/rel.h, the in-same-transaction test
-     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
-     * can be cleared before the end of the transaction. The exact case is
-     * when a relation sets a new relfilenode twice in same transaction, yet
-     * the second one fails in an aborted subtransaction, e.g.
-     *
-     * BEGIN;
-     * TRUNCATE t;
-     * SAVEPOINT save;
-     * TRUNCATE t;
-     * ROLLBACK TO save;
-     * COPY ...
-     *
-     * Also, if the target file is new-in-transaction, we assume that checking
-     * FSM for free space is a waste of time, even if we must use WAL because
-     * of archiving.  This could possibly be wrong, but it's unlikely.
-     *
-     * The comments for table_tuple_insert and RelationGetBufferForTuple
-     * specify that skipping WAL logging is only safe if we ensure that our
-     * tuples do not go into pages containing tuples from any other
-     * transactions --- but this must be the case if we have a new table or
-     * new relfilenode, so we need no additional work to enforce that.
+     * If the target file is new-in-transaction, we assume that checking FSM
+     * for free space is a waste of time, even if we must use WAL because of
+     * archiving.  This could possibly be wrong, but it's unlikely.
      *
      * We currently don't support this optimization if the COPY target is a
      * partitioned table as we currently only lazily initialize partition
@@ -2759,15 +2740,14 @@ CopyFrom(CopyState cstate)
      * are not supported as per the description above.
      *----------
      */
-    /* createSubid is creation check, newRelfilenodeSubid is truncation check */
+    /*
+     * createSubid is creation check, firstRelfilenodeSubid is truncation and
+     * cluster check. Partitioned table doesn't have storage.
+     */
     if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
         (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
-         cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
-    {
+         cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
         ti_options |= TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
-    }
 
     /*
      * Optimize if new relfilenode was created in this subxact or one of its
@@ -3366,8 +3346,6 @@ CopyFrom(CopyState cstate)
 
     FreeExecutorState(estate);
 
-    table_finish_bulk_insert(cstate->rel, ti_options);
-
     return processed;
 }
 
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 4c1d909d38..39ebd73691 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+    myState->ti_options = TABLE_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    table_finish_bulk_insert(myState->rel, myState->ti_options);
-
     /* close rel, but keep lock until commit */
     table_close(myState->rel, NoLock);
     myState->rel = NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 537d0e8cef..1c854dcebf 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * replication is in use. We can skip the FSM in any case.
      */
     myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
-    if (!XLogIsNeeded())
-        myState->ti_options |= TABLE_INSERT_SKIP_WAL;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    table_finish_bulk_insert(myState->transientrel, myState->ti_options);
-
     /* close transientrel, but keep lock until commit */
     table_close(myState->transientrel, NoLock);
     myState->transientrel = NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 0f1a9f0e54..ac7336ef58 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4761,9 +4761,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 
     /*
      * Prepare a BulkInsertState and options for table_tuple_insert. Because
-     * we're building a new heap, we can skip WAL-logging and fsync it to disk
-     * at the end instead (unless WAL-logging is required for archiving or
-     * streaming replication). The FSM is empty too, so don't bother using it.
+     * we're building a new heap, the underlying table AM can skip WAL-logging
+     * and fsync the relation to disk at the end of the current transaction
+     * instead. The FSM is empty too, so don't bother using it.
      */
     if (newrel)
     {
@@ -4771,8 +4771,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         ti_options = TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
     }
     else
     {
@@ -5057,8 +5055,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
     {
         FreeBulkInsertState(bistate);
 
-        table_finish_bulk_insert(newrel, ti_options);
-
         table_close(newrel, NoLock);
     }
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7332e6b590..280fdf8080 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -171,6 +171,7 @@ static HTAB *PrivateRefCountHash = NULL;
 static int32 PrivateRefCountOverflowed = 0;
 static uint32 PrivateRefCountClock = 0;
 static PrivateRefCountEntry *ReservedRefCountEntry = NULL;
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 
 static void ReservePrivateRefCountEntry(void);
 static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
@@ -3190,20 +3191,32 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
-    /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3220,7 +3233,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3250,18 +3263,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 2b992d7832..812bfadb40 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2661,7 +2661,7 @@ static void
 RelationFlushRelation(Relation relation)
 {
     if (relation->rd_createSubid != InvalidSubTransactionId ||
-        relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
     {
         /*
          * New relcache entries are always rebuilt, not flushed; else we'd
@@ -2801,7 +2801,7 @@ RelationCacheInvalidate(void)
          * pending invalidations.
          */
         if (relation->rd_createSubid != InvalidSubTransactionId ||
-            relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+            relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
             continue;
 
         relcacheInvalsReceived++;
@@ -3058,6 +3058,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
      * Likewise, reset the hint about the relfilenode being new.
      */
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 }
 
 /*
@@ -3149,7 +3150,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
     }
 
     /*
-     * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+     * Likewise, update or drop any new-relfilenode-in-subtransaction hints.
      */
     if (relation->rd_newRelfilenodeSubid == mySubid)
     {
@@ -3158,6 +3159,15 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         else
             relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     }
+
+
+    if (relation->rd_firstRelfilenodeSubid == mySubid)
+    {
+        if (isCommit)
+            relation->rd_firstRelfilenodeSubid = parentSubid;
+        else
+            relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    }
 }
 
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6bc9..80c2e1bafc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
                                        TransactionId OldestXmin, TransactionId FreezeXid,
-                                       MultiXactId MultiXactCutoff, bool use_wal);
+                                       MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
                                HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index c2b0481e7e..ac0e981acb 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -407,22 +407,6 @@ typedef struct TableAmRoutine
                                uint8 flags,
                                TM_FailureData *tmfd);
 
-    /*
-     * Perform operations necessary to complete insertions made via
-     * tuple_insert and multi_insert with a BulkInsertState specified. This
-     * may for example be used to flush the relation, when the
-     * TABLE_INSERT_SKIP_WAL option was used.
-     *
-     * Typically callers of tuple_insert and multi_insert will just pass all
-     * the flags that apply to them, and each AM has to decide which of them
-     * make sense for it, and then only take actions in finish_bulk_insert for
-     * those flags, and ignore others.
-     *
-     * Optional callback.
-     */
-    void        (*finish_bulk_insert) (Relation rel, int options);
-
-
     /* ------------------------------------------------------------------------
      * DDL related functionality.
      * ------------------------------------------------------------------------
@@ -1088,10 +1072,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * behaviour of the AM. Several options might be ignored by AMs not supporting
  * them.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.  It's
@@ -1111,10 +1091,8 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * Note that most of these options will be applied when inserting into the
  * heap's TOAST table, too, if the tuple requires any out-of-line data.
  *
- *
  * The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
  *
  * On return the slot's tts_tid and tts_tableOid are updated to reflect the
  * insertion. But note that any toasting of fields within the slot is NOT
@@ -1249,6 +1227,8 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  * update was done.  However, any TOAST changes in the new tuple's
  * data are not reflected into *newtup.
  *
+ * See table_insert about skipping WAL-logging feature.
+ *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
  * t_xmax, and, if possible, t_cmax.  See comments for struct TM_FailureData
  * for additional info.
@@ -1309,21 +1289,6 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
                                        flags, tmfd);
 }
 
-/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
- */
-static inline void
-table_finish_bulk_insert(Relation rel, int options)
-{
-    /* optional callback */
-    if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
-        rel->rd_tableam->finish_bulk_insert(rel, options);
-}
-
-
 /* ------------------------------------------------------------------------
  * DDL related functionality.
  * ------------------------------------------------------------------------
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 509f4b7ef1..ace5f5a2ae 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                                    ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                                    ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d35b4a5061..5cbb5a7b27 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -76,10 +76,17 @@ typedef struct RelationData
      * transaction, with one of them occurring in a subsequently aborted
      * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
      * ROLLBACK TO save; -- rd_newRelfilenode is now forgotten
+     * rd_firstRelfilenodeSubid is the ID of the hightest subtransaction the
+     * relfilenode change has took place first in the current
+     * transaction. This won't be forgotten as newRelfilenodeSubid is. A valid
+     * OID means that the currently active relfilenode is transaction-local
+     * and no-need for WAL-logging.
      */
     SubTransactionId rd_createSubid;    /* rel was created in current xact */
     SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
                                                  * current xact */
+    SubTransactionId rd_firstRelfilenodeSubid;    /* new relfilenode assigned
+                                                 * first in current xact */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
@@ -512,9 +519,15 @@ typedef struct ViewOptions
 /*
  * RelationNeedsWAL
  *        True if relation needs WAL.
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.
  */
-#define RelationNeedsWAL(relation) \
-    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+#define RelationNeedsWAL(relation)                                        \
+    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&    \
+     (XLogIsNeeded() ||                                                    \
+      (relation->rd_createSubid == InvalidSubTransactionId &&            \
+       relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
 
 /*
  * RelationUsesLocalBuffers
-- 
2.16.3

From 63fc1a432f20e99df6f081bc6af640bf6907879c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@oss.ntt.co.jp>
Date: Wed, 29 May 2019 23:03:22 +0900
Subject: [PATCH 3/3] Rename smgrDoPendingDeletes to smgrDoPendingOperations

The function longer does only deletions but also syncs. Rename the
function to refect that. smgrGetPendingDeletes is not renamed since it
does not change behavior.
---
 src/backend/access/transam/xact.c |  4 +--
 src/backend/catalog/storage.c     | 57 ++++++++++++++++++++-------------------
 src/include/catalog/storage.h     |  2 +-
 3 files changed, 32 insertions(+), 31 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d7930c077d..cc0c43b2dd 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2228,7 +2228,7 @@ CommitTransaction(void)
      * Other backends will observe the attendant catalog changes and not
      * attempt to access affected files.
      */
-    smgrDoPendingDeletes(true);
+    smgrDoPendingOperations(true);
 
     AtCommit_Notify();
     AtEOXact_GUC(true, 1);
@@ -2716,7 +2716,7 @@ AbortTransaction(void)
         ResourceOwnerRelease(TopTransactionResourceOwner,
                              RESOURCE_RELEASE_AFTER_LOCKS,
                              false, true);
-        smgrDoPendingDeletes(false);
+        smgrDoPendingOperations(false);
 
         AtEOXact_GUC(false, 1);
         AtEOXact_SPI(false);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index e4bcdc390f..6ebe75aa37 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -53,17 +53,17 @@
  * but I'm being paranoid.
  */
 
-typedef struct PendingRelDelete
+typedef struct PendingRelOps
 {
     RelFileNode relnode;        /* relation that may need to be deleted */
     BackendId    backend;        /* InvalidBackendId if not a temp rel */
     bool        atCommit;        /* T=work at commit; F=work at abort */
     bool        dosync;            /* T=work is sync; F=work is delete */
     int            nestLevel;        /* xact nesting level of request */
-    struct PendingRelDelete *next;    /* linked-list link */
-} PendingRelDelete;
+    struct PendingRelOps *next;    /* linked-list link */
+} PendingRelOps;
 
-static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOps *pendingDeletes = NULL; /* head of linked list */
 
 /*
  * RelationCreateStorage
@@ -79,7 +79,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 SMgrRelation
 RelationCreateStorage(RelFileNode rnode, char relpersistence)
 {
-    PendingRelDelete *pending;
+    PendingRelOps *pending;
     SMgrRelation srel;
     BackendId    backend;
     bool        needs_wal;
@@ -110,8 +110,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
         log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
 
     /* Add the relation to the list of stuff to delete at abort */
-    pending = (PendingRelDelete *)
-        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+    pending = (PendingRelOps *)
+        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOps));
     pending->relnode = rnode;
     pending->backend = backend;
     pending->atCommit = false;    /* delete if abort */
@@ -127,8 +127,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
      */
     if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
     {
-        pending = (PendingRelDelete *)
-            MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+        pending = (PendingRelOps *)
+            MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOps));
         pending->relnode = rnode;
         pending->backend = backend;
         pending->atCommit = true;
@@ -167,11 +167,11 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
 void
 RelationDropStorage(Relation rel)
 {
-    PendingRelDelete *pending;
+    PendingRelOps *pending;
 
     /* Add the relation to the list of stuff to delete at commit */
-    pending = (PendingRelDelete *)
-        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+    pending = (PendingRelOps *)
+        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOps));
     pending->relnode = rel->rd_node;
     pending->backend = rel->rd_backend;
     pending->atCommit = true;    /* delete if commit */
@@ -185,9 +185,9 @@ RelationDropStorage(Relation rel)
      * present in the pending-delete list twice, once with atCommit true and
      * once with atCommit false.  Hence, it will be physically deleted at end
      * of xact in either case (and the other entry will be ignored by
-     * smgrDoPendingDeletes, so no error will occur).  We could instead remove
-     * the existing list entry and delete the physical file immediately, but
-     * for now I'll keep the logic simple.
+     * smgrDoPendingOperations, so no error will occur).  We could instead
+     * remove the existing list entry and delete the physical file
+     * immediately, but for now I'll keep the logic simple.
      */
 
     RelationCloseSmgr(rel);
@@ -213,9 +213,9 @@ RelationDropStorage(Relation rel)
 void
 RelationPreserveStorage(RelFileNode rnode, bool atCommit)
 {
-    PendingRelDelete *pending;
-    PendingRelDelete *prev;
-    PendingRelDelete *next;
+    PendingRelOps *pending;
+    PendingRelOps *prev;
+    PendingRelOps *next;
 
     prev = NULL;
     for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -406,7 +406,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 }
 
 /*
- *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
+ *    smgrDoPendingOperations() -- Take care of relation deletes and syncs at
+ *        end of xact.
  *
  * This also runs when aborting a subxact; we want to clean up a failed
  * subxact immediately.
@@ -417,12 +418,12 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
  * already recovered the physical storage.
  */
 void
-smgrDoPendingDeletes(bool isCommit)
+smgrDoPendingOperations(bool isCommit)
 {
     int            nestLevel = GetCurrentTransactionNestLevel();
-    PendingRelDelete *pending;
-    PendingRelDelete *prev;
-    PendingRelDelete *next;
+    PendingRelOps *pending;
+    PendingRelOps *prev;
+    PendingRelOps *next;
     int            nrels = 0,
                 i = 0,
                 maxrels = 0;
@@ -518,7 +519,7 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     int            nestLevel = GetCurrentTransactionNestLevel();
     int            nrels;
     RelFileNode *rptr;
-    PendingRelDelete *pending;
+    PendingRelOps *pending;
 
     nrels = 0;
     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
@@ -558,8 +559,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
 void
 PostPrepare_smgr(void)
 {
-    PendingRelDelete *pending;
-    PendingRelDelete *next;
+    PendingRelOps *pending;
+    PendingRelOps *next;
 
     for (pending = pendingDeletes; pending != NULL; pending = next)
     {
@@ -580,7 +581,7 @@ void
 AtSubCommit_smgr(void)
 {
     int            nestLevel = GetCurrentTransactionNestLevel();
-    PendingRelDelete *pending;
+    PendingRelOps *pending;
 
     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
     {
@@ -599,7 +600,7 @@ AtSubCommit_smgr(void)
 void
 AtSubAbort_smgr(void)
 {
-    smgrDoPendingDeletes(false);
+    smgrDoPendingOperations(false);
 }
 
 void
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3579d3f3eb..43836cf11c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -30,7 +30,7 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
-extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingOperations(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
-- 
2.16.3

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

12 июля 2019 г., 04:03:35

On Wed, Jul 10, 2019 at 01:19:14PM +0900, Kyotaro Horiguchi wrote:
> Hello. Rebased the patch to master(bd56cd75d2).

It looks like you did more than just a rebase, because this v16 no longer
modifies many files that v14 did modify.  (That's probably good, since you had
pending review comments.)  What other changes did you make?

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

12 июля 2019 г., 11:30:41

Many message seem lost during moving to new environmet..
I'm digging the archive but coudn't find the message for v15..

At Thu, 11 Jul 2019 18:03:35 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190712010335.GB1610889@rfd.leadboat.com>
> On Wed, Jul 10, 2019 at 01:19:14PM +0900, Kyotaro Horiguchi wrote:
> > Hello. Rebased the patch to master(bd56cd75d2).
> 
> It looks like you did more than just a rebase, because this v16 no longer
> modifies many files that v14 did modify.  (That's probably good, since you had
> pending review comments.)  What other changes did you make?

Yeah.. Maybe I forgot to send pre-v15 or v16 before rebasing.

v14: WAL-logging is controled by AMs and syncing at commit is
    controled according to the behavior.  At-commit sync is still
    controlled per-relation basis, which means it must be
    processed before transaction state becomes TRNAS_COMMIT. So
    it needs to be separated into PreCommit_RelationSync() from
    AtEOXact_RelationCache().

v15: The biggest change is that at-commit sync is changed to smgr
   basis. At-commit sync is programmed at creation of a storage
   file (RelationCreateStorage), and smgrDoPendingDelete(or
   smgrDoPendingOperations after rename) runs syncs.  AM are no
   longer involved and all permanent relations are WAL-skipped at
   all in the creation transaction while wal_level=minimal.

   All storages created for a relation are once synced then
   removed at commit.

v16: rebased.

The v16 seems no longer works so I'll send further rebased version.

Sorry for the late reply and confusion..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

12 июля 2019 г., 11:37:25

At Fri, 12 Jul 2019 17:30:41 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
<20190712.173041.236938840.horikyota.ntt@gmail.com>
> The v16 seems no longer works so I'll send further rebased version.

It's just by renaming of TestLib::real_dir to perl2host.
This is rebased version v17.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 9bcd4acb14c5cef2d4bdf20c9be8c86597a9cf7c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/3] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 291 ++++++++++++++++++++++++++++++++
 1 file changed, 291 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..b26cd8efd5
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+    my($node, $test_name) = @_;
+
+    my $db_oid = $node->safe_psql('postgres',
+       "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+    my $prefix = "base/$db_oid/";
+    my $filepaths_referenced = $node->safe_psql('postgres', "
+       SELECT pg_relation_filepath(oid) FROM pg_class
+       WHERE reltablespace = 0 and relpersistence <> 't' and
+       pg_relation_filepath(oid) IS NOT NULL;");
+    is_deeply([sort(map { "$prefix$_" }
+                    grep(/^[0-9]+$/,
+                         slurp_dir($node->data_dir . "/$prefix")))],
+              [sort split /\n/, $filepaths_referenced],
+              $test_name);
+    return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    # Primary needs to have wal_level = minimal here
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+    $node->start;
+
+    # Setup
+    my $tablespace_dir = $node->basedir . '/tablespace_other';
+    mkdir ($tablespace_dir);
+    $tablespace_dir = TestLib::perl2host($tablespace_dir);
+    $node->safe_psql('postgres',
+       "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+    # Test direct truncation optimization.  No tuples
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test1 (id serial PRIMARY KEY);
+        TRUNCATE test1;
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+    is($result, qq(0),
+       "wal_level = $wal_level, optimized truncation with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2 (id serial PRIMARY KEY);
+        INSERT INTO test2 VALUES (DEFAULT);
+        TRUNCATE test2;
+        INSERT INTO test2 VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with inserted table");
+
+    # Data file for COPY query in follow-up tests.
+    my $basedir = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after the
+    # truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3;
+        COPY test3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+    is($result, qq(3),
+       "wal_level = $wal_level, optimized truncation with copied table");
+
+    # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+        COPY test3a FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # in different subtransaction patterns
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a2;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; RELEASE s;
+        COPY test3a2 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a3;
+        SAVEPOINT s;
+            ALTER TABLE test3a3 SET TABLESPACE other;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            ROLLBACK TO s2;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            RELEASE s2;
+        ROLLBACK TO s;
+        COPY test3a3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        COPY test3b FROM '$copy_file' DELIMITER ',';  -- set sync_above
+        UPDATE test3b SET id2 = id2 + 1;
+        DELETE FROM test3b;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+    is($result, qq(0),
+       "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+    # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test4;
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COPY test4 FROM '$copy_file' DELIMITER ',';
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+    is($result, qq(5),
+       "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test5 VALUES (DEFAULT, 1);
+        COPY test5 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER test6_before_row_insert
+          BEFORE INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+        CREATE TRIGGER test6_after_row_insert
+          AFTER INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+        COPY test6 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+    is($result, qq(9),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER test7_before_stat_truncate
+          BEFORE TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+        CREATE TRIGGER test7_after_stat_truncate
+          AFTER TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+        INSERT INTO test7 VALUES (DEFAULT, 1);
+        TRUNCATE test7;
+        COPY test7 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test redo of temp table creation.
+    $node->safe_psql('postgres', "
+        CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+    $node->stop('immediate');
+    $node->start;
+
+    check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.16.3

From 5d56e218b7771b3277d3aa97145dea16fdd48dbc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@oss.ntt.co.jp>
Date: Mon, 27 May 2019 16:06:30 +0900
Subject: [PATCH 2/3] Fix WAL skipping feature

WAL-skipping operations mixed with WAL-logged operations can lead to
database corruption after a crash. This patch changes the WAL-skipping
feature so that no data modifcation is WAL-logged at all then sync
such relations at commit.
---
 src/backend/access/heap/heapam.c         |  4 +-
 src/backend/access/heap/heapam_handler.c | 22 +----------
 src/backend/access/heap/rewriteheap.c    | 13 ++-----
 src/backend/catalog/storage.c            | 64 +++++++++++++++++++++++++-------
 src/backend/commands/cluster.c           | 24 ++++++++++++
 src/backend/commands/copy.c              | 39 ++++---------------
 src/backend/commands/createas.c          |  5 +--
 src/backend/commands/matview.c           |  4 --
 src/backend/commands/tablecmds.c         | 10 ++---
 src/backend/storage/buffer/bufmgr.c      | 33 +++++++++++-----
 src/backend/utils/cache/relcache.c       | 16 ++++++--
 src/include/access/heapam.h              |  1 -
 src/include/access/rewriteheap.h         |  2 +-
 src/include/access/tableam.h             | 41 ++------------------
 src/include/storage/bufmgr.h             |  1 +
 src/include/utils/rel.h                  | 17 ++++++++-
 16 files changed, 148 insertions(+), 148 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d768b9b061..eca98fb063 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1941,7 +1941,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (RelationNeedsWAL(relation))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2124,7 +2124,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+    needwal = RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 09bc6fe98a..b9554f6064 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -556,18 +556,6 @@ tuple_lock_retry:
     return result;
 }
 
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
-    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway / don't go through tableam)
-     */
-    if (options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(relation);
-}
-
-
 /* ------------------------------------------------------------------------
  * DDL related callbacks for heap AM.
  * ------------------------------------------------------------------------
@@ -699,7 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     IndexScanDesc indexScan;
     TableScanDesc tableScan;
     HeapScanDesc heapScan;
-    bool        use_wal;
     bool        is_system_catalog;
     Tuplesortstate *tuplesort;
     TupleDesc    oldTupDesc = RelationGetDescr(OldHeap);
@@ -713,12 +700,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     /* Remember if it's a system catalog */
     is_system_catalog = IsSystemRelation(OldHeap);
 
-    /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a WAL-logged rel.
-     */
-    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
     /* use_wal off requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
 
@@ -729,7 +710,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
     /* Initialize the rewrite operation */
     rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
-                                 *multi_cutoff, use_wal);
+                                 *multi_cutoff);
 
 
     /* Set up sorting if wanted */
@@ -2517,7 +2498,6 @@ static const TableAmRoutine heapam_methods = {
     .tuple_delete = heapam_tuple_delete,
     .tuple_update = heapam_tuple_update,
     .tuple_lock = heapam_tuple_lock,
-    .finish_bulk_insert = heapam_finish_bulk_insert,
 
     .tuple_fetch_row_version = heapam_fetch_row_version,
     .tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 72a448ad31..992d4b9880 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -144,7 +144,6 @@ typedef struct RewriteStateData
     Page        rs_buffer;        /* page currently being built */
     BlockNumber rs_blockno;        /* block where page will go */
     bool        rs_buffer_valid;    /* T if any tuples in buffer */
-    bool        rs_use_wal;        /* must we WAL-log inserts? */
     bool        rs_logical_rewrite; /* do we need to do logical rewriting */
     TransactionId rs_oldest_xmin;    /* oldest xmin used by caller to determine
                                      * tuple visibility */
@@ -238,15 +237,13 @@ static void logical_end_heap_rewrite(RewriteState state);
  * oldest_xmin    xid used by the caller to determine which tuples are dead
  * freeze_xid    xid before which tuples will be frozen
  * min_multi    multixact before which multis will be removed
- * use_wal        should the inserts to the new heap be WAL-logged?
  *
  * Returns an opaque RewriteState, allocated in current memory context,
  * to be used in subsequent calls to the other functions.
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-                   TransactionId freeze_xid, MultiXactId cutoff_multi,
-                   bool use_wal)
+                   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
     RewriteState state;
     MemoryContext rw_cxt;
@@ -271,7 +268,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     /* new_heap needn't be empty, just locked */
     state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
     state->rs_buffer_valid = false;
-    state->rs_use_wal = use_wal;
     state->rs_oldest_xmin = oldest_xmin;
     state->rs_freeze_xid = freeze_xid;
     state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +326,7 @@ end_heap_rewrite(RewriteState state)
     /* Write the last page, if any */
     if (state->rs_buffer_valid)
     {
-        if (state->rs_use_wal)
+        if (RelationNeedsWAL(state->rs_new_rel))
             log_newpage(&state->rs_new_rel->rd_node,
                         MAIN_FORKNUM,
                         state->rs_blockno,
@@ -654,9 +650,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int            options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
@@ -695,7 +688,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
             /* Doesn't fit, so write out the existing page */
 
             /* XLOG stuff */
-            if (state->rs_use_wal)
+            if (RelationNeedsWAL(state->rs_new_rel))
                 log_newpage(&state->rs_new_rel->rd_node,
                             MAIN_FORKNUM,
                             state->rs_blockno,
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 3cc886f7fe..e4bcdc390f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -57,7 +57,8 @@ typedef struct PendingRelDelete
 {
     RelFileNode relnode;        /* relation that may need to be deleted */
     BackendId    backend;        /* InvalidBackendId if not a temp rel */
-    bool        atCommit;        /* T=delete at commit; F=delete at abort */
+    bool        atCommit;        /* T=work at commit; F=work at abort */
+    bool        dosync;            /* T=work is sync; F=work is delete */
     int            nestLevel;        /* xact nesting level of request */
     struct PendingRelDelete *next;    /* linked-list link */
 } PendingRelDelete;
@@ -114,10 +115,29 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     pending->relnode = rnode;
     pending->backend = backend;
     pending->atCommit = false;    /* delete if abort */
+    pending->dosync = false;
     pending->nestLevel = GetCurrentTransactionNestLevel();
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
+    /*
+     * We are going to skip WAL-logging for storage of persistent relations
+     * created in the current transaction when wal_level = minimal. The
+     * relation needs to be synced at commit.
+     */
+    if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+    {
+        pending = (PendingRelDelete *)
+            MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+        pending->relnode = rnode;
+        pending->backend = backend;
+        pending->atCommit = true;
+        pending->dosync = true;
+        pending->nestLevel = GetCurrentTransactionNestLevel();
+        pending->next = pendingDeletes;
+        pendingDeletes = pending;
+    }
+
     return srel;
 }
 
@@ -155,6 +175,7 @@ RelationDropStorage(Relation rel)
     pending->relnode = rel->rd_node;
     pending->backend = rel->rd_backend;
     pending->atCommit = true;    /* delete if commit */
+    pending->dosync = false;
     pending->nestLevel = GetCurrentTransactionNestLevel();
     pending->next = pendingDeletes;
     pendingDeletes = pending;
@@ -428,21 +449,34 @@ smgrDoPendingDeletes(bool isCommit)
             {
                 SMgrRelation srel;
 
-                srel = smgropen(pending->relnode, pending->backend);
-
-                /* allocate the initial array, or extend it, if needed */
-                if (maxrels == 0)
+                if (pending->dosync)
                 {
-                    maxrels = 8;
-                    srels = palloc(sizeof(SMgrRelation) * maxrels);
+                    /* Perform pending sync of WAL-skipped relation */
+                    FlushRelationBuffersWithoutRelcache(pending->relnode,
+                                                        false);
+                    srel = smgropen(pending->relnode, pending->backend);
+                    smgrimmedsync(srel, MAIN_FORKNUM);
+                    smgrclose(srel);
                 }
-                else if (maxrels <= nrels)
+                else
                 {
-                    maxrels *= 2;
-                    srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
-                }
+                    /* Collect pending deletions */
+                    srel = smgropen(pending->relnode, pending->backend);
 
-                srels[nrels++] = srel;
+                    /* allocate the initial array, or extend it, if needed */
+                    if (maxrels == 0)
+                    {
+                        maxrels = 8;
+                        srels = palloc(sizeof(SMgrRelation) * maxrels);
+                    }
+                    else if (maxrels <= nrels)
+                    {
+                        maxrels *= 2;
+                        srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+                    }
+
+                    srels[nrels++] = srel;
+                }
             }
             /* must explicitly free the list entry */
             pfree(pending);
@@ -489,8 +523,9 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     nrels = 0;
     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
     {
+        /* Pending syncs are excluded */
         if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
-            && pending->backend == InvalidBackendId)
+            && pending->backend == InvalidBackendId && !pending->dosync)
             nrels++;
     }
     if (nrels == 0)
@@ -502,8 +537,9 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     *ptr = rptr;
     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
     {
+        /* Pending syncs are excluded */
         if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
-            && pending->backend == InvalidBackendId)
+            && pending->backend == InvalidBackendId && !pending->dosync)
         {
             *rptr = pending->relnode;
             rptr++;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index ebaec4f8dd..6fc9d7d64e 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1034,12 +1034,36 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
 
     if (OidIsValid(relfilenode1) && OidIsValid(relfilenode2))
     {
+        Relation rel1;
+        Relation rel2;
+
         /*
          * Normal non-mapped relations: swap relfilenodes, reltablespaces,
          * relpersistence
          */
         Assert(!target_is_pg_class);
 
+        /* Update creation subid hints of relcache */
+        rel1 = relation_open(r1, ExclusiveLock);
+        rel2 = relation_open(r2, ExclusiveLock);
+
+        /*
+         * New relation's relfilenode is created in the current transaction
+         * and used as old ralation's new relfilenode. So its
+         * newRelfilenodeSubid as new relation's createSubid. We don't fix
+         * rel2 since it would be deleted soon.
+         */
+        Assert(rel2->rd_createSubid != InvalidSubTransactionId);
+        rel1->rd_newRelfilenodeSubid = rel2->rd_createSubid;
+
+        /* record the first relfilenode change in the current transaction */
+        if (rel1->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+            rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
+        relation_close(rel1, ExclusiveLock);
+        relation_close(rel2, ExclusiveLock);
+
+        /* swap relfilenodes, reltablespaces, relpersistence */
         swaptemp = relform1->relfilenode;
         relform1->relfilenode = relform2->relfilenode;
         relform2->relfilenode = swaptemp;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 4f04d122c3..f02efd59fc 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2535,9 +2535,6 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
     for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++)
         ExecDropSingleTupleTableSlot(buffer->slots[i]);
 
-    table_finish_bulk_insert(buffer->resultRelInfo->ri_RelationDesc,
-                             miinfo->ti_options);
-
     pfree(buffer);
 }
 
@@ -2726,28 +2723,9 @@ CopyFrom(CopyState cstate)
      * If it does commit, we'll have done the table_finish_bulk_insert() at
      * the bottom of this routine first.
      *
-     * As mentioned in comments in utils/rel.h, the in-same-transaction test
-     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
-     * can be cleared before the end of the transaction. The exact case is
-     * when a relation sets a new relfilenode twice in same transaction, yet
-     * the second one fails in an aborted subtransaction, e.g.
-     *
-     * BEGIN;
-     * TRUNCATE t;
-     * SAVEPOINT save;
-     * TRUNCATE t;
-     * ROLLBACK TO save;
-     * COPY ...
-     *
-     * Also, if the target file is new-in-transaction, we assume that checking
-     * FSM for free space is a waste of time, even if we must use WAL because
-     * of archiving.  This could possibly be wrong, but it's unlikely.
-     *
-     * The comments for table_tuple_insert and RelationGetBufferForTuple
-     * specify that skipping WAL logging is only safe if we ensure that our
-     * tuples do not go into pages containing tuples from any other
-     * transactions --- but this must be the case if we have a new table or
-     * new relfilenode, so we need no additional work to enforce that.
+     * If the target file is new-in-transaction, we assume that checking FSM
+     * for free space is a waste of time, even if we must use WAL because of
+     * archiving.  This could possibly be wrong, but it's unlikely.
      *
      * We currently don't support this optimization if the COPY target is a
      * partitioned table as we currently only lazily initialize partition
@@ -2763,15 +2741,14 @@ CopyFrom(CopyState cstate)
      * are not supported as per the description above.
      *----------
      */
-    /* createSubid is creation check, newRelfilenodeSubid is truncation check */
+    /*
+     * createSubid is creation check, firstRelfilenodeSubid is truncation and
+     * cluster check. Partitioned table doesn't have storage.
+     */
     if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
         (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
-         cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
-    {
+         cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
         ti_options |= TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
-    }
 
     /*
      * Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 4c1d909d38..39ebd73691 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+    myState->ti_options = TABLE_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    table_finish_bulk_insert(myState->rel, myState->ti_options);
-
     /* close rel, but keep lock until commit */
     table_close(myState->rel, NoLock);
     myState->rel = NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 537d0e8cef..1c854dcebf 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * replication is in use. We can skip the FSM in any case.
      */
     myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
-    if (!XLogIsNeeded())
-        myState->ti_options |= TABLE_INSERT_SKIP_WAL;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    table_finish_bulk_insert(myState->transientrel, myState->ti_options);
-
     /* close transientrel, but keep lock until commit */
     table_close(myState->transientrel, NoLock);
     myState->transientrel = NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 0f1a9f0e54..ac7336ef58 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4761,9 +4761,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 
     /*
      * Prepare a BulkInsertState and options for table_tuple_insert. Because
-     * we're building a new heap, we can skip WAL-logging and fsync it to disk
-     * at the end instead (unless WAL-logging is required for archiving or
-     * streaming replication). The FSM is empty too, so don't bother using it.
+     * we're building a new heap, the underlying table AM can skip WAL-logging
+     * and fsync the relation to disk at the end of the current transaction
+     * instead. The FSM is empty too, so don't bother using it.
      */
     if (newrel)
     {
@@ -4771,8 +4771,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         ti_options = TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
     }
     else
     {
@@ -5057,8 +5055,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
     {
         FreeBulkInsertState(bistate);
 
-        table_finish_bulk_insert(newrel, ti_options);
-
         table_close(newrel, NoLock);
     }
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7332e6b590..280fdf8080 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -171,6 +171,7 @@ static HTAB *PrivateRefCountHash = NULL;
 static int32 PrivateRefCountOverflowed = 0;
 static uint32 PrivateRefCountClock = 0;
 static PrivateRefCountEntry *ReservedRefCountEntry = NULL;
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 
 static void ReservePrivateRefCountEntry(void);
 static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
@@ -3190,20 +3191,32 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
-    /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3220,7 +3233,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3250,18 +3263,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 2b992d7832..812bfadb40 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2661,7 +2661,7 @@ static void
 RelationFlushRelation(Relation relation)
 {
     if (relation->rd_createSubid != InvalidSubTransactionId ||
-        relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
     {
         /*
          * New relcache entries are always rebuilt, not flushed; else we'd
@@ -2801,7 +2801,7 @@ RelationCacheInvalidate(void)
          * pending invalidations.
          */
         if (relation->rd_createSubid != InvalidSubTransactionId ||
-            relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+            relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
             continue;
 
         relcacheInvalsReceived++;
@@ -3058,6 +3058,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
      * Likewise, reset the hint about the relfilenode being new.
      */
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 }
 
 /*
@@ -3149,7 +3150,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
     }
 
     /*
-     * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+     * Likewise, update or drop any new-relfilenode-in-subtransaction hints.
      */
     if (relation->rd_newRelfilenodeSubid == mySubid)
     {
@@ -3158,6 +3159,15 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         else
             relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     }
+
+
+    if (relation->rd_firstRelfilenodeSubid == mySubid)
+    {
+        if (isCommit)
+            relation->rd_firstRelfilenodeSubid = parentSubid;
+        else
+            relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    }
 }
 
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6bc9..80c2e1bafc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
                                        TransactionId OldestXmin, TransactionId FreezeXid,
-                                       MultiXactId MultiXactCutoff, bool use_wal);
+                                       MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
                                HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index c2b0481e7e..ac0e981acb 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -407,22 +407,6 @@ typedef struct TableAmRoutine
                                uint8 flags,
                                TM_FailureData *tmfd);
 
-    /*
-     * Perform operations necessary to complete insertions made via
-     * tuple_insert and multi_insert with a BulkInsertState specified. This
-     * may for example be used to flush the relation, when the
-     * TABLE_INSERT_SKIP_WAL option was used.
-     *
-     * Typically callers of tuple_insert and multi_insert will just pass all
-     * the flags that apply to them, and each AM has to decide which of them
-     * make sense for it, and then only take actions in finish_bulk_insert for
-     * those flags, and ignore others.
-     *
-     * Optional callback.
-     */
-    void        (*finish_bulk_insert) (Relation rel, int options);
-
-
     /* ------------------------------------------------------------------------
      * DDL related functionality.
      * ------------------------------------------------------------------------
@@ -1088,10 +1072,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * behaviour of the AM. Several options might be ignored by AMs not supporting
  * them.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.  It's
@@ -1111,10 +1091,8 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * Note that most of these options will be applied when inserting into the
  * heap's TOAST table, too, if the tuple requires any out-of-line data.
  *
- *
  * The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
  *
  * On return the slot's tts_tid and tts_tableOid are updated to reflect the
  * insertion. But note that any toasting of fields within the slot is NOT
@@ -1249,6 +1227,8 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  * update was done.  However, any TOAST changes in the new tuple's
  * data are not reflected into *newtup.
  *
+ * See table_insert about skipping WAL-logging feature.
+ *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
  * t_xmax, and, if possible, t_cmax.  See comments for struct TM_FailureData
  * for additional info.
@@ -1309,21 +1289,6 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
                                        flags, tmfd);
 }
 
-/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
- */
-static inline void
-table_finish_bulk_insert(Relation rel, int options)
-{
-    /* optional callback */
-    if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
-        rel->rd_tableam->finish_bulk_insert(rel, options);
-}
-
-
 /* ------------------------------------------------------------------------
  * DDL related functionality.
  * ------------------------------------------------------------------------
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 509f4b7ef1..ace5f5a2ae 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                                    ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                                    ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d35b4a5061..5cbb5a7b27 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -76,10 +76,17 @@ typedef struct RelationData
      * transaction, with one of them occurring in a subsequently aborted
      * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
      * ROLLBACK TO save; -- rd_newRelfilenode is now forgotten
+     * rd_firstRelfilenodeSubid is the ID of the hightest subtransaction the
+     * relfilenode change has took place first in the current
+     * transaction. This won't be forgotten as newRelfilenodeSubid is. A valid
+     * OID means that the currently active relfilenode is transaction-local
+     * and no-need for WAL-logging.
      */
     SubTransactionId rd_createSubid;    /* rel was created in current xact */
     SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
                                                  * current xact */
+    SubTransactionId rd_firstRelfilenodeSubid;    /* new relfilenode assigned
+                                                 * first in current xact */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
@@ -512,9 +519,15 @@ typedef struct ViewOptions
 /*
  * RelationNeedsWAL
  *        True if relation needs WAL.
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.
  */
-#define RelationNeedsWAL(relation) \
-    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+#define RelationNeedsWAL(relation)                                        \
+    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&    \
+     (XLogIsNeeded() ||                                                    \
+      (relation->rd_createSubid == InvalidSubTransactionId &&            \
+       relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
 
 /*
  * RelationUsesLocalBuffers
-- 
2.16.3

From 264bb593502db35ab8dbd7ddd505d2e729807293 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@oss.ntt.co.jp>
Date: Wed, 29 May 2019 23:03:22 +0900
Subject: [PATCH 3/3] Rename smgrDoPendingDeletes to smgrDoPendingOperations

The function longer does only deletions but also syncs. Rename the
function to refect that. smgrGetPendingDeletes is not renamed since it
does not change behavior.
---
 src/backend/access/transam/xact.c |  4 +--
 src/backend/catalog/storage.c     | 57 ++++++++++++++++++++-------------------
 src/include/catalog/storage.h     |  2 +-
 3 files changed, 32 insertions(+), 31 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d7930c077d..cc0c43b2dd 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2228,7 +2228,7 @@ CommitTransaction(void)
      * Other backends will observe the attendant catalog changes and not
      * attempt to access affected files.
      */
-    smgrDoPendingDeletes(true);
+    smgrDoPendingOperations(true);
 
     AtCommit_Notify();
     AtEOXact_GUC(true, 1);
@@ -2716,7 +2716,7 @@ AbortTransaction(void)
         ResourceOwnerRelease(TopTransactionResourceOwner,
                              RESOURCE_RELEASE_AFTER_LOCKS,
                              false, true);
-        smgrDoPendingDeletes(false);
+        smgrDoPendingOperations(false);
 
         AtEOXact_GUC(false, 1);
         AtEOXact_SPI(false);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index e4bcdc390f..6ebe75aa37 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -53,17 +53,17 @@
  * but I'm being paranoid.
  */
 
-typedef struct PendingRelDelete
+typedef struct PendingRelOps
 {
     RelFileNode relnode;        /* relation that may need to be deleted */
     BackendId    backend;        /* InvalidBackendId if not a temp rel */
     bool        atCommit;        /* T=work at commit; F=work at abort */
     bool        dosync;            /* T=work is sync; F=work is delete */
     int            nestLevel;        /* xact nesting level of request */
-    struct PendingRelDelete *next;    /* linked-list link */
-} PendingRelDelete;
+    struct PendingRelOps *next;    /* linked-list link */
+} PendingRelOps;
 
-static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOps *pendingDeletes = NULL; /* head of linked list */
 
 /*
  * RelationCreateStorage
@@ -79,7 +79,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 SMgrRelation
 RelationCreateStorage(RelFileNode rnode, char relpersistence)
 {
-    PendingRelDelete *pending;
+    PendingRelOps *pending;
     SMgrRelation srel;
     BackendId    backend;
     bool        needs_wal;
@@ -110,8 +110,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
         log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
 
     /* Add the relation to the list of stuff to delete at abort */
-    pending = (PendingRelDelete *)
-        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+    pending = (PendingRelOps *)
+        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOps));
     pending->relnode = rnode;
     pending->backend = backend;
     pending->atCommit = false;    /* delete if abort */
@@ -127,8 +127,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
      */
     if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
     {
-        pending = (PendingRelDelete *)
-            MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+        pending = (PendingRelOps *)
+            MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOps));
         pending->relnode = rnode;
         pending->backend = backend;
         pending->atCommit = true;
@@ -167,11 +167,11 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
 void
 RelationDropStorage(Relation rel)
 {
-    PendingRelDelete *pending;
+    PendingRelOps *pending;
 
     /* Add the relation to the list of stuff to delete at commit */
-    pending = (PendingRelDelete *)
-        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+    pending = (PendingRelOps *)
+        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOps));
     pending->relnode = rel->rd_node;
     pending->backend = rel->rd_backend;
     pending->atCommit = true;    /* delete if commit */
@@ -185,9 +185,9 @@ RelationDropStorage(Relation rel)
      * present in the pending-delete list twice, once with atCommit true and
      * once with atCommit false.  Hence, it will be physically deleted at end
      * of xact in either case (and the other entry will be ignored by
-     * smgrDoPendingDeletes, so no error will occur).  We could instead remove
-     * the existing list entry and delete the physical file immediately, but
-     * for now I'll keep the logic simple.
+     * smgrDoPendingOperations, so no error will occur).  We could instead
+     * remove the existing list entry and delete the physical file
+     * immediately, but for now I'll keep the logic simple.
      */
 
     RelationCloseSmgr(rel);
@@ -213,9 +213,9 @@ RelationDropStorage(Relation rel)
 void
 RelationPreserveStorage(RelFileNode rnode, bool atCommit)
 {
-    PendingRelDelete *pending;
-    PendingRelDelete *prev;
-    PendingRelDelete *next;
+    PendingRelOps *pending;
+    PendingRelOps *prev;
+    PendingRelOps *next;
 
     prev = NULL;
     for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -406,7 +406,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 }
 
 /*
- *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
+ *    smgrDoPendingOperations() -- Take care of relation deletes and syncs at
+ *        end of xact.
  *
  * This also runs when aborting a subxact; we want to clean up a failed
  * subxact immediately.
@@ -417,12 +418,12 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
  * already recovered the physical storage.
  */
 void
-smgrDoPendingDeletes(bool isCommit)
+smgrDoPendingOperations(bool isCommit)
 {
     int            nestLevel = GetCurrentTransactionNestLevel();
-    PendingRelDelete *pending;
-    PendingRelDelete *prev;
-    PendingRelDelete *next;
+    PendingRelOps *pending;
+    PendingRelOps *prev;
+    PendingRelOps *next;
     int            nrels = 0,
                 i = 0,
                 maxrels = 0;
@@ -518,7 +519,7 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     int            nestLevel = GetCurrentTransactionNestLevel();
     int            nrels;
     RelFileNode *rptr;
-    PendingRelDelete *pending;
+    PendingRelOps *pending;
 
     nrels = 0;
     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
@@ -558,8 +559,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
 void
 PostPrepare_smgr(void)
 {
-    PendingRelDelete *pending;
-    PendingRelDelete *next;
+    PendingRelOps *pending;
+    PendingRelOps *next;
 
     for (pending = pendingDeletes; pending != NULL; pending = next)
     {
@@ -580,7 +581,7 @@ void
 AtSubCommit_smgr(void)
 {
     int            nestLevel = GetCurrentTransactionNestLevel();
-    PendingRelDelete *pending;
+    PendingRelOps *pending;
 
     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
     {
@@ -599,7 +600,7 @@ AtSubCommit_smgr(void)
 void
 AtSubAbort_smgr(void)
 {
-    smgrDoPendingDeletes(false);
+    smgrDoPendingOperations(false);
 }
 
 void
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3579d3f3eb..43836cf11c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -30,7 +30,7 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
-extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingOperations(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
-- 
2.16.3

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

25 июля 2019 г., 04:39:36

I found that CF-bot complaining on this.

Seems that some comment fixes by the recent 21039555cd are the
cause.

No substantial change have been made by this rebasing.

regards.

On Fri, Jul 12, 2019 at 5:37 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Fri, 12 Jul 2019 17:30:41 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
<20190712.173041.236938840.horikyota.ntt@gmail.com>
> > The v16 seems no longer works so I'll send further rebased version.
>
> It's just by renaming of TestLib::real_dir to perl2host.
> This is rebased version v17.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center



--
Kyotaro Horiguchi
NTT Open Source Software Center

For two-phase commit, PrepareTransaction() needs to execute pending syncs.

On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
>      /* Remember if it's a system catalog */
>      is_system_catalog = IsSystemRelation(OldHeap);
>  
> -    /*
> -     * We need to log the copied data in WAL iff WAL archiving/streaming is
> -     * enabled AND it's a WAL-logged rel.
> -     */
> -    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
> -
>      /* use_wal off requires smgr_targblock be initially invalid */
>      Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);

Since you're deleting the use_wal variable, update that last comment.

> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c
> @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
>              {
>                  SMgrRelation srel;
>  
> -                srel = smgropen(pending->relnode, pending->backend);
> -
> -                /* allocate the initial array, or extend it, if needed */
> -                if (maxrels == 0)
> +                if (pending->dosync)
>                  {
> -                    maxrels = 8;
> -                    srels = palloc(sizeof(SMgrRelation) * maxrels);
> +                    /* Perform pending sync of WAL-skipped relation */
> +                    FlushRelationBuffersWithoutRelcache(pending->relnode,
> +                                                        false);
> +                    srel = smgropen(pending->relnode, pending->backend);
> +                    smgrimmedsync(srel, MAIN_FORKNUM);

This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL for
FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There may be
no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
false due to this code, use RelationNeedsWAL() for multiple forks, and then
not actually sync all forks.

The https://postgr.es/m/559FA0BA.3080808@iki.fi design had another component
not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the relation,
or if it's smaller than some threshold, WAL-log the contents of the whole file
at that point."  Please write the part to WAL-log the contents of small files
instead of syncing them.

> --- a/src/backend/commands/copy.c
> +++ b/src/backend/commands/copy.c
> @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
>       * If it does commit, we'll have done the table_finish_bulk_insert() at
>       * the bottom of this routine first.
>       *
> -     * As mentioned in comments in utils/rel.h, the in-same-transaction test
> -     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
> -     * can be cleared before the end of the transaction. The exact case is
> -     * when a relation sets a new relfilenode twice in same transaction, yet
> -     * the second one fails in an aborted subtransaction, e.g.
> -     *
> -     * BEGIN;
> -     * TRUNCATE t;
> -     * SAVEPOINT save;
> -     * TRUNCATE t;
> -     * ROLLBACK TO save;
> -     * COPY ...

The comment material being deleted is still correct, so don't delete it.
Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug.  The
attached patch adds an assertion that RelationNeedsWAL() and the
pendingDeletes array have the same opinion about the relfilenode, and it
expands a test case to fail that assertion.

> --- a/src/include/utils/rel.h
> +++ b/src/include/utils/rel.h
> @@ -74,11 +74,13 @@ typedef struct RelationData
>      SubTransactionId rd_createSubid;    /* rel was created in current xact */
>      SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
>                                                   * current xact */
> +    SubTransactionId rd_firstRelfilenodeSubid;    /* new relfilenode assigned
> +                                                 * first in current xact */

In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit
all the lines printed.  Many bits of code need to look at all three,
e.g. RelationClose().  This field needs to be 100% reliable.  In other words,
it must equal InvalidSubTransactionId if and only if the relfilenode matches
the relfilenode that would be in place if the top transaction rolled back.

nm

Вложения

wal-optimize-noah-tests-v2.patch

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

19 августа 2019 г., 12:59:59

Thank you for taking time.

At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190818035230.GB3021338@rfd.leadboat.com>
> For two-phase commit, PrepareTransaction() needs to execute pending syncs.

Now TwoPhaseFileHeader has two new members for (commit-time)
pending syncs. Pending-syncs are useless on wal-replay, but that
is needed for commit-prepared.


> On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> > --- a/src/backend/access/heap/heapam_handler.c
> > +++ b/src/backend/access/heap/heapam_handler.c
> > @@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,    
...
> > -    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
> > -
> >      /* use_wal off requires smgr_targblock be initially invalid */
> >      Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
> 
> Since you're deleting the use_wal variable, update that last comment.

Oops. Rewrote it.

> > --- a/src/backend/catalog/storage.c
> > +++ b/src/backend/catalog/storage.c
> > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
...
> > +                    smgrimmedsync(srel, MAIN_FORKNUM);
> 
> This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL for
> FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There may be
> no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
> false due to this code, use RelationNeedsWAL() for multiple forks, and then
> not actually sync all forks.

I agree that all forks needs syncing, but FSM and VM are checking
RelationNeedsWAL(modified). To make sure, are you suggesting to
sync all forks instead of emitting WAL for them, or suggesting
that VM and FSM to emit WALs even when the modified
RelationNeedsWAL returns false (+ sync all forks)?

> The https://postgr.es/m/559FA0BA.3080808@iki.fi design had another component
> not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the relation,
> or if it's smaller than some threshold, WAL-log the contents of the whole file
> at that point."  Please write the part to WAL-log the contents of small files
> instead of syncing them.

I'm not sure the point of the behavior. I suppose that the "log"
is a sequence of new_page records. It also needs to be synced and
it is always larger than the file to be synced. I can't think of
an appropriate threshold without the point.

> > --- a/src/backend/commands/copy.c
> > +++ b/src/backend/commands/copy.c
> > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
> >       * If it does commit, we'll have done the table_finish_bulk_insert() at
> >       * the bottom of this routine first.
> >       *
> > -     * As mentioned in comments in utils/rel.h, the in-same-transaction test
> > -     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
> > -     * can be cleared before the end of the transaction. The exact case is
> > -     * when a relation sets a new relfilenode twice in same transaction, yet
> > -     * the second one fails in an aborted subtransaction, e.g.
> > -     *
> > -     * BEGIN;
> > -     * TRUNCATE t;
> > -     * SAVEPOINT save;
> > -     * TRUNCATE t;
> > -     * ROLLBACK TO save;
> > -     * COPY ...
> 
> The comment material being deleted is still correct, so don't delete it.
> Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug.  The
> attached patch adds an assertion that RelationNeedsWAL() and the
> pendingDeletes array have the same opinion about the relfilenode, and it
> expands a test case to fail that assertion.

(Un?)Fortunately, that doesn't fail.. (with rebased version on
the recent master) I'll recheck that tomorrow.

> > --- a/src/include/utils/rel.h
> > +++ b/src/include/utils/rel.h
> > @@ -74,11 +74,13 @@ typedef struct RelationData
> >      SubTransactionId rd_createSubid;    /* rel was created in current xact */
> >      SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
> >                                                   * current xact */
> > +    SubTransactionId rd_firstRelfilenodeSubid;    /* new relfilenode assigned
> > +                                                 * first in current xact */
> 
> In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit
> all the lines printed.  Many bits of code need to look at all three,
> e.g. RelationClose().

Agreed. I'll recheck that.

>  This field needs to be 100% reliable.  In other words,
> it must equal InvalidSubTransactionId if and only if the relfilenode matches
> the relfilenode that would be in place if the top transaction rolled back.

I don't get this. I think the variable moves as you suggested. It
is handled same way with fd_new* in AtEOSubXact_cleanup but the
difference is in assignment but rollback. rd_fist* won't change
after the first assignment so rollback of the subid means
relfilenode is also rolled back to the initial value at the
beginning of the top transaction.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

20 августа 2019 г., 09:03:14

On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190818035230.GB3021338@rfd.leadboat.com>
> > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
> 
> Now TwoPhaseFileHeader has two new members for (commit-time)
> pending syncs. Pending-syncs are useless on wal-replay, but that
> is needed for commit-prepared.

There's no need to modify TwoPhaseFileHeader or the COMMIT PREPARED sql
command, which is far too late to be syncing new relation files.  (A crash may
have already destroyed their data.)  PrepareTransaction(), which implements
the PREPARE TRANSACTION command, is the right place for these syncs.

A failure in these new syncs needs to prevent the transaction from being
marked committed.  Hence, in CommitTransaction(), these new syncs need to
happen after the last step that could create assign a new relfilenode and
before RecordTransactionCommit().  I suspect it's best to do it after
PreCommit_on_commit_actions() and before AtEOXact_LargeObject().

> > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> > > --- a/src/backend/catalog/storage.c
> > > +++ b/src/backend/catalog/storage.c
> > > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
> ...
> > > +                    smgrimmedsync(srel, MAIN_FORKNUM);
> > 
> > This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL for
> > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There may be
> > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
> > false due to this code, use RelationNeedsWAL() for multiple forks, and then
> > not actually sync all forks.
> 
> I agree that all forks needs syncing, but FSM and VM are checking
> RelationNeedsWAL(modified). To make sure, are you suggesting to
> sync all forks instead of emitting WAL for them, or suggesting
> that VM and FSM to emit WALs even when the modified
> RelationNeedsWAL returns false (+ sync all forks)?

I hadn't thought that far.  What do you think is best?

> > The https://postgr.es/m/559FA0BA.3080808@iki.fi design had another component
> > not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the relation,
> > or if it's smaller than some threshold, WAL-log the contents of the whole file
> > at that point."  Please write the part to WAL-log the contents of small files
> > instead of syncing them.
> 
> I'm not sure the point of the behavior. I suppose that the "log"
> is a sequence of new_page records. It also needs to be synced and
> it is always larger than the file to be synced. I can't think of
> an appropriate threshold without the point.

Yes, it would be a sequence of new-page records.  FlushRelationBuffers() locks
every buffer header containing a buffer of the current database.  The belief
has been that writing one page to xlog is cheaper than FlushRelationBuffers()
in a busy system with large shared_buffers.

> > > --- a/src/backend/commands/copy.c
> > > +++ b/src/backend/commands/copy.c
> > > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
> > >       * If it does commit, we'll have done the table_finish_bulk_insert() at
> > >       * the bottom of this routine first.
> > >       *
> > > -     * As mentioned in comments in utils/rel.h, the in-same-transaction test
> > > -     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
> > > -     * can be cleared before the end of the transaction. The exact case is
> > > -     * when a relation sets a new relfilenode twice in same transaction, yet
> > > -     * the second one fails in an aborted subtransaction, e.g.
> > > -     *
> > > -     * BEGIN;
> > > -     * TRUNCATE t;
> > > -     * SAVEPOINT save;
> > > -     * TRUNCATE t;
> > > -     * ROLLBACK TO save;
> > > -     * COPY ...
> > 
> > The comment material being deleted is still correct, so don't delete it.
> > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug.  The
> > attached patch adds an assertion that RelationNeedsWAL() and the
> > pendingDeletes array have the same opinion about the relfilenode, and it
> > expands a test case to fail that assertion.
> 
> (Un?)Fortunately, that doesn't fail.. (with rebased version on
> the recent master) I'll recheck that tomorrow.

Did you build with --enable-cassert?

> > > --- a/src/include/utils/rel.h
> > > +++ b/src/include/utils/rel.h
> > > @@ -74,11 +74,13 @@ typedef struct RelationData
> > >      SubTransactionId rd_createSubid;    /* rel was created in current xact */
> > >      SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
> > >                                                   * current xact */
> > > +    SubTransactionId rd_firstRelfilenodeSubid;    /* new relfilenode assigned
> > > +                                                 * first in current xact */

> >  This field needs to be 100% reliable.  In other words,
> > it must equal InvalidSubTransactionId if and only if the relfilenode matches
> > the relfilenode that would be in place if the top transaction rolled back.
> 
> I don't get this. I think the variable moves as you suggested. It
> is handled same way with fd_new* in AtEOSubXact_cleanup but the
> difference is in assignment but rollback. rd_fist* won't change
> after the first assignment so rollback of the subid means
> relfilenode is also rolled back to the initial value at the
> beginning of the top transaction.

$ git grep -n 'rd_firstRelfilenodeSubid = '
src/backend/commands/cluster.c:1061:            rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
src/backend/utils/cache/relcache.c:3067:    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
src/backend/utils/cache/relcache.c:3173:            relation->rd_firstRelfilenodeSubid = parentSubid;
src/backend/utils/cache/relcache.c:3175:            relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;

swap_relation_files() is the only place initializing this field.  Many paths
that assign a new relfilenode will never call swap_relation_files().

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

20 августа 2019 г., 11:17:57

Hello.

At Mon, 19 Aug 2019 18:59:59 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
<20190819.185959.118543656.horikyota.ntt@gmail.com>
> > The comment material being deleted is still correct, so don't delete it.
> > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug.  The
> > attached patch adds an assertion that RelationNeedsWAL() and the
> > pendingDeletes array have the same opinion about the relfilenode, and it
> > expands a test case to fail that assertion.
> 
> (Un?)Fortunately, that doesn't fail.. (with rebased version on
> the recent master) I'll recheck that tomorrow.

I saw the assertion failure.  It's a part of intended
behavior. In this patch, relcache doesn't hold the whole history
of relfilenodes so we cannot remove useless pending syncs
perfectly. On the other hand they are harmless except that they
cause extra sync of files that are removed immediately. So I
choosed that once registered pending syncs are not removed.

If we want consistency here, we need to record creator subxid in
PendingRelOps (PendingRelDelete) struct and rather large work at
subtransaction end.

> > > --- a/src/include/utils/rel.h
> > > +++ b/src/include/utils/rel.h
> > > @@ -74,11 +74,13 @@ typedef struct RelationData
> > >      SubTransactionId rd_createSubid;    /* rel was created in current xact */
> > >      SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
> > >                                                   * current xact */
> > > +    SubTransactionId rd_firstRelfilenodeSubid;    /* new relfilenode assigned
> > > +                                                 * first in current xact */
> > 
> > In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit
> > all the lines printed.  Many bits of code need to look at all three,
> > e.g. RelationClose().
> 
> Agreed. I'll recheck that.
> 
> >  This field needs to be 100% reliable.  In other words,
> > it must equal InvalidSubTransactionId if and only if the relfilenode matches
> > the relfilenode that would be in place if the top transaction rolled back.
> 
> I don't get this. I think the variable moves as you suggested. It
> is handled same way with fd_new* in AtEOSubXact_cleanup but the
> difference is in assignment but rollback. rd_fist* won't change
> after the first assignment so rollback of the subid means
> relfilenode is also rolled back to the initial value at the
> beginning of the top transaction.

So I'll add this in the next version to see how it looks.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

21 августа 2019 г., 10:32:38

Hello. New version is attached.

At Mon, 19 Aug 2019 18:59:59 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
<20190819.185959.118543656.horikyota.ntt@gmail.com>
> Thank you for taking time.
> 
> At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190818035230.GB3021338@rfd.leadboat.com>
> > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
 
Now TwoPhaseFileHeader has two new members for pending syncs. It
is useless on wal-replay, but that is needed for commit-prepared.

> > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> > > --- a/src/backend/access/heap/heapam_handler.c
> > > +++ b/src/backend/access/heap/heapam_handler.c
> > > @@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,    
> ...
> > > -    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
> > > -
> > >      /* use_wal off requires smgr_targblock be initially invalid */
> > >      Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
> > 
> > Since you're deleting the use_wal variable, update that last comment.

Oops! Rewrote it.

> > > --- a/src/backend/catalog/storage.c
> > > +++ b/src/backend/catalog/storage.c
> > > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
> ...
> > > +                    smgrimmedsync(srel, MAIN_FORKNUM);
> > 
> > This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL for
> > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There may be
> > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
> > false due to this code, use RelationNeedsWAL() for multiple forks, and then
> > not actually sync all forks.
> 
> I agree that all forks needs syncing, but FSM and VM are checking
> RelationNeedsWAL(modified). To make sure, are you suggesting to
> sync all forks instead of emitting WAL for them, or suggesting
> that VM and FSM to emit WALs even when the modified
> RelationNeedsWAL returns false (+ sync all forks)?

All forks are synced and have no WALs emitted (as before) in the
attached version 19. FSM and VM are not changed.

> > The https://postgr.es/m/559FA0BA.3080808@iki.fi design had another component
> > not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the relation,
> > or if it's smaller than some threshold, WAL-log the contents of the whole file
> > at that point."  Please write the part to WAL-log the contents of small files
> > instead of syncing them.
> 
> I'm not sure the point of the behavior. I suppose that the "log"
> is a sequence of new_page records. It also needs to be synced and
> it is always larger than the file to be synced. I can't think of
> an appropriate threshold without the point.

This is not included in this version. I'll continue to consider
this.

> > > --- a/src/backend/commands/copy.c
> > > +++ b/src/backend/commands/copy.c
> > > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
> > >       * If it does commit, we'll have done the table_finish_bulk_insert() at
> > >       * the bottom of this routine first.
> > >       *
> > > -     * As mentioned in comments in utils/rel.h, the in-same-transaction test
> > > -     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
> > > -     * can be cleared before the end of the transaction. The exact case is
> > > -     * when a relation sets a new relfilenode twice in same transaction, yet
> > > -     * the second one fails in an aborted subtransaction, e.g.
> > > -     *
> > > -     * BEGIN;
> > > -     * TRUNCATE t;
> > > -     * SAVEPOINT save;
> > > -     * TRUNCATE t;
> > > -     * ROLLBACK TO save;
> > > -     * COPY ...
> > 
> > The comment material being deleted is still correct, so don't delete it.

The code is changed to use rd_firstRelfilenodeSubid instead of
rd_firstRelfilenodeSubid which has the issue mentioned in the
deleted section. So this is right but irrelevant to the code
here. The same thing is written in the comment in RelationData.

(In short, not reverted)

> > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug.  The
> > attached patch adds an assertion that RelationNeedsWAL() and the
> > pendingDeletes array have the same opinion about the relfilenode, and it
> > expands a test case to fail that assertion.
..
> > In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit
> > all the lines printed.  Many bits of code need to look at all three,
> > e.g. RelationClose().

I forgot to maintain rd_firstRelfilenode in many places and the
assertion failure no longer happens after I fixed it. Opposite to
my previous mail, of course useless pending entries are removed
at subtransction abort and no needless syncs happen in that
meaning. But another type of useless sync was seen with the
previous version 18.

(In short fixed.)


> >  This field needs to be 100% reliable.  In other words,
> > it must equal InvalidSubTransactionId if and only if the relfilenode matches
> > the relfilenode that would be in place if the top transaction rolled back.

Sorry, I confused this with another similar behavior of the
previous version 18, where files are synced even if it is to be
removed immediately at commit. In this version
smgrDoPendingOperations doesn't sync to-be-deleted files.

While checking this, I found that smgrDoPendingDeletes is making
unnecessary call to smgrclose() which lead server to crash while
deleting files. I removed it.


Please find the new version attached.

Changes:

- Rebased to f8cf524da1.

- Fixed prepare transaction. test2a catches this.
  (twophase.c)

- Fixed a comment in heapam_relation_copy_for_cluster.

- All forks are synced. (smgrDoPendingDeletes/Operations, SyncRelationFiles)

- Fixed handling of rd_firstRelfilenodeSubid.
  (RelationBuildLocalRelation, RelationSetNewRelfilenode,
   load_relcache_init_file) 

- Prevent to-be-deleted files from syncing. (smgrDoPendingDeletes/Operations)

- Fixed a crash bug caused by smgrclose() in smgrDoPendingOperations.

Minor changes:

- Renamed: PendingRelOps => PendingRelOp
- Type changed: bool PendingRelOp.dosync => PendingOpType PendingRelOp.op

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From b4144d7e1f1fb22f4387e3af9d37a29b68c9795f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/3] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 312 ++++++++++++++++++++++++++++++++
 1 file changed, 312 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..b041121745
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,312 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 26;
+
+sub check_orphan_relfilenodes
+{
+    my($node, $test_name) = @_;
+
+    my $db_oid = $node->safe_psql('postgres',
+       "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+    my $prefix = "base/$db_oid/";
+    my $filepaths_referenced = $node->safe_psql('postgres', "
+       SELECT pg_relation_filepath(oid) FROM pg_class
+       WHERE reltablespace = 0 and relpersistence <> 't' and
+       pg_relation_filepath(oid) IS NOT NULL;");
+    is_deeply([sort(map { "$prefix$_" }
+                    grep(/^[0-9]+$/,
+                         slurp_dir($node->data_dir . "/$prefix")))],
+              [sort split /\n/, $filepaths_referenced],
+              $test_name);
+    return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    # Primary needs to have wal_level = minimal here
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+max_prepared_transactions = 1
+));
+    $node->start;
+
+    # Setup
+    my $tablespace_dir = $node->basedir . '/tablespace_other';
+    mkdir ($tablespace_dir);
+    $tablespace_dir = TestLib::perl2host($tablespace_dir);
+    $node->safe_psql('postgres',
+       "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+    # Test direct truncation optimization.  No tuples
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test1 (id serial PRIMARY KEY);
+        TRUNCATE test1;
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+    is($result, qq(0),
+       "wal_level = $wal_level, optimized truncation with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2 (id serial PRIMARY KEY);
+        INSERT INTO test2 VALUES (DEFAULT);
+        TRUNCATE test2;
+        INSERT INTO test2 VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with inserted table");
+
+
+    # Same for prepared transaction
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2a (id serial PRIMARY KEY);
+        INSERT INTO test2a VALUES (DEFAULT);
+        TRUNCATE test2a;
+        INSERT INTO test2a VALUES (DEFAULT);
+        PREPARE TRANSACTION 't';
+        COMMIT PREPARED 't';");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2a;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with prepared transaction");
+
+
+    # Data file for COPY query in follow-up tests.
+    my $basedir = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after the
+    # truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3;
+        COPY test3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+    is($result, qq(3),
+       "wal_level = $wal_level, optimized truncation with copied table");
+
+    # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+        COPY test3a FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # in different subtransaction patterns
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a2;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; RELEASE s;
+        COPY test3a2 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a3;
+        SAVEPOINT s;
+            ALTER TABLE test3a3 SET TABLESPACE other;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            ROLLBACK TO s2;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            RELEASE s2;
+        ROLLBACK TO s;
+        COPY test3a3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        COPY test3b FROM '$copy_file' DELIMITER ',';  -- set sync_above
+        UPDATE test3b SET id2 = id2 + 1;
+        DELETE FROM test3b;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+    is($result, qq(0),
+       "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+    # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test4;
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COPY test4 FROM '$copy_file' DELIMITER ',';
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+    is($result, qq(5),
+       "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test5 VALUES (DEFAULT, 1);
+        COPY test5 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER test6_before_row_insert
+          BEFORE INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+        CREATE TRIGGER test6_after_row_insert
+          AFTER INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+        COPY test6 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+    is($result, qq(9),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER test7_before_stat_truncate
+          BEFORE TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+        CREATE TRIGGER test7_after_stat_truncate
+          AFTER TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+        INSERT INTO test7 VALUES (DEFAULT, 1);
+        TRUNCATE test7;
+        COPY test7 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test redo of temp table creation.
+    $node->safe_psql('postgres', "
+        CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+    $node->stop('immediate');
+    $node->start;
+
+    check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.16.3

From d62a337281024c1f9df09596e62724057b02cdfb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 21 Aug 2019 13:57:00 +0900
Subject: [PATCH 2/3] Fix WAL skipping feature

WAL-skipping operations mixed with WAL-logged operations can lead to
database corruption after a crash. This patch changes the WAL-skipping
feature so that no data modification is WAL-logged at all then sync
such relations at commit.
---
 src/backend/access/heap/heapam.c         |   4 +-
 src/backend/access/heap/heapam_handler.c |  22 +----
 src/backend/access/heap/rewriteheap.c    |  13 +--
 src/backend/access/transam/twophase.c    |  23 ++++-
 src/backend/catalog/storage.c            | 158 ++++++++++++++++++++++++++-----
 src/backend/commands/cluster.c           |  24 +++++
 src/backend/commands/copy.c              |  39 ++------
 src/backend/commands/createas.c          |   5 +-
 src/backend/commands/matview.c           |   4 -
 src/backend/commands/tablecmds.c         |  10 +-
 src/backend/storage/buffer/bufmgr.c      |  33 +++++--
 src/backend/storage/smgr/md.c            |  30 ++++++
 src/backend/utils/cache/relcache.c       |  28 ++++--
 src/include/access/heapam.h              |   1 -
 src/include/access/rewriteheap.h         |   2 +-
 src/include/access/tableam.h             |  40 +-------
 src/include/catalog/storage.h            |   8 ++
 src/include/storage/bufmgr.h             |   1 +
 src/include/storage/md.h                 |   1 +
 src/include/utils/rel.h                  |  17 +++-
 20 files changed, 300 insertions(+), 163 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index cb811d345a..ef18b61c55 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1936,7 +1936,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (RelationNeedsWAL(relation))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2119,7 +2119,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+    needwal = RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index f1ff01e8cb..27f414a361 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -558,18 +558,6 @@ tuple_lock_retry:
     return result;
 }
 
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
-    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway / don't go through tableam)
-     */
-    if (options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(relation);
-}
-
-
 /* ------------------------------------------------------------------------
  * DDL related callbacks for heap AM.
  * ------------------------------------------------------------------------
@@ -701,7 +689,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     IndexScanDesc indexScan;
     TableScanDesc tableScan;
     HeapScanDesc heapScan;
-    bool        use_wal;
     bool        is_system_catalog;
     Tuplesortstate *tuplesort;
     TupleDesc    oldTupDesc = RelationGetDescr(OldHeap);
@@ -716,12 +703,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     is_system_catalog = IsSystemRelation(OldHeap);
 
     /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a WAL-logged rel.
+     * smgr_targblock must be initially invalid if we are to skip WAL logging
      */
-    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
-    /* use_wal off requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
 
     /* Preallocate values/isnull arrays */
@@ -731,7 +714,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
     /* Initialize the rewrite operation */
     rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
-                                 *multi_cutoff, use_wal);
+                                 *multi_cutoff);
 
 
     /* Set up sorting if wanted */
@@ -2519,7 +2502,6 @@ static const TableAmRoutine heapam_methods = {
     .tuple_delete = heapam_tuple_delete,
     .tuple_update = heapam_tuple_update,
     .tuple_lock = heapam_tuple_lock,
-    .finish_bulk_insert = heapam_finish_bulk_insert,
 
     .tuple_fetch_row_version = heapam_fetch_row_version,
     .tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index a17508a82f..9e0d7295af 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -144,7 +144,6 @@ typedef struct RewriteStateData
     Page        rs_buffer;        /* page currently being built */
     BlockNumber rs_blockno;        /* block where page will go */
     bool        rs_buffer_valid;    /* T if any tuples in buffer */
-    bool        rs_use_wal;        /* must we WAL-log inserts? */
     bool        rs_logical_rewrite; /* do we need to do logical rewriting */
     TransactionId rs_oldest_xmin;    /* oldest xmin used by caller to determine
                                      * tuple visibility */
@@ -238,15 +237,13 @@ static void logical_end_heap_rewrite(RewriteState state);
  * oldest_xmin    xid used by the caller to determine which tuples are dead
  * freeze_xid    xid before which tuples will be frozen
  * cutoff_multi    multixact before which multis will be removed
- * use_wal        should the inserts to the new heap be WAL-logged?
  *
  * Returns an opaque RewriteState, allocated in current memory context,
  * to be used in subsequent calls to the other functions.
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-                   TransactionId freeze_xid, MultiXactId cutoff_multi,
-                   bool use_wal)
+                   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
     RewriteState state;
     MemoryContext rw_cxt;
@@ -271,7 +268,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     /* new_heap needn't be empty, just locked */
     state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
     state->rs_buffer_valid = false;
-    state->rs_use_wal = use_wal;
     state->rs_oldest_xmin = oldest_xmin;
     state->rs_freeze_xid = freeze_xid;
     state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +326,7 @@ end_heap_rewrite(RewriteState state)
     /* Write the last page, if any */
     if (state->rs_buffer_valid)
     {
-        if (state->rs_use_wal)
+        if (RelationNeedsWAL(state->rs_new_rel))
             log_newpage(&state->rs_new_rel->rd_node,
                         MAIN_FORKNUM,
                         state->rs_blockno,
@@ -654,9 +650,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int            options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
@@ -695,7 +688,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
             /* Doesn't fit, so write out the existing page */
 
             /* XLOG stuff */
-            if (state->rs_use_wal)
+            if (RelationNeedsWAL(state->rs_new_rel))
                 log_newpage(&state->rs_new_rel->rd_node,
                             MAIN_FORKNUM,
                             state->rs_blockno,
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 477709bbc2..e3512fc415 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -921,6 +921,7 @@ typedef struct TwoPhaseFileHeader
     Oid            owner;            /* user running the transaction */
     int32        nsubxacts;        /* number of following subxact XIDs */
     int32        ncommitrels;    /* number of delete-on-commit rels */
+    int32        npendsyncrels;    /* number of sync-on-commit rels */
     int32        nabortrels;        /* number of delete-on-abort rels */
     int32        ninvalmsgs;        /* number of cache invalidation messages */
     bool        initfileinval;    /* does relcache init file need invalidation? */
@@ -1009,6 +1010,7 @@ StartPrepare(GlobalTransaction gxact)
     TwoPhaseFileHeader hdr;
     TransactionId *children;
     RelFileNode *commitrels;
+    RelFileNode *pendsyncrels;
     RelFileNode *abortrels;
     SharedInvalidationMessage *invalmsgs;
 
@@ -1034,6 +1036,7 @@ StartPrepare(GlobalTransaction gxact)
     hdr.owner = gxact->owner;
     hdr.nsubxacts = xactGetCommittedChildren(&children);
     hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels);
+    hdr.npendsyncrels = smgrGetPendingSyncs(true, &pendsyncrels);
     hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels);
     hdr.ninvalmsgs = xactGetCommittedInvalidationMessages(&invalmsgs,
                                                           &hdr.initfileinval);
@@ -1057,6 +1060,11 @@ StartPrepare(GlobalTransaction gxact)
         save_state_data(commitrels, hdr.ncommitrels * sizeof(RelFileNode));
         pfree(commitrels);
     }
+    if (hdr.npendsyncrels > 0)
+    {
+        save_state_data(pendsyncrels, hdr.npendsyncrels * sizeof(RelFileNode));
+        pfree(pendsyncrels);
+    }
     if (hdr.nabortrels > 0)
     {
         save_state_data(abortrels, hdr.nabortrels * sizeof(RelFileNode));
@@ -1464,6 +1472,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
     TransactionId latestXid;
     TransactionId *children;
     RelFileNode *commitrels;
+    RelFileNode *pendsyncrels;
     RelFileNode *abortrels;
     RelFileNode *delrels;
     int            ndelrels;
@@ -1499,6 +1508,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
     children = (TransactionId *) bufptr;
     bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
     commitrels = (RelFileNode *) bufptr;
+    bufptr += MAXALIGN(hdr->npendsyncrels * sizeof(RelFileNode));
+    pendsyncrels = (RelFileNode *) bufptr;
     bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
     abortrels = (RelFileNode *) bufptr;
     bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
@@ -1544,9 +1555,9 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
     gxact->valid = false;
 
     /*
-     * We have to remove any files that were supposed to be dropped. For
-     * consistency with the regular xact.c code paths, must do this before
-     * releasing locks, so do it before running the callbacks.
+     * We have to sync or remove any files that were supposed to be done
+     * so. For consistency with the regular xact.c code paths, must do this
+     * before releasing locks, so do it before running the callbacks.
      *
      * NB: this code knows that we couldn't be dropping any temp rels ...
      */
@@ -1554,11 +1565,17 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
     {
         delrels = commitrels;
         ndelrels = hdr->ncommitrels;
+
+        /* Make sure files supposed to be synced are synced */
+        SyncRelationFiles(pendsyncrels, hdr->npendsyncrels);
     }
     else
     {
         delrels = abortrels;
         ndelrels = hdr->nabortrels;
+
+        /* We don't have an at-abort pending sync */
+        Assert(pendsyncrels == 0);
     }
 
     /* Make sure files supposed to be dropped are dropped */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 3cc886f7fe..354a74c27c 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -30,6 +30,7 @@
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -53,11 +54,13 @@
  * but I'm being paranoid.
  */
 
+/* entry type of pendingDeletes */
 typedef struct PendingRelDelete
 {
     RelFileNode relnode;        /* relation that may need to be deleted */
     BackendId    backend;        /* InvalidBackendId if not a temp rel */
-    bool        atCommit;        /* T=delete at commit; F=delete at abort */
+    bool        atCommit;        /* T=work at commit; F=work at abort */
+    PendingOpType    op;            /* type of operation to do */
     int            nestLevel;        /* xact nesting level of request */
     struct PendingRelDelete *next;    /* linked-list link */
 } PendingRelDelete;
@@ -114,10 +117,29 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     pending->relnode = rnode;
     pending->backend = backend;
     pending->atCommit = false;    /* delete if abort */
+    pending->op = PENDING_DELETE;
     pending->nestLevel = GetCurrentTransactionNestLevel();
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
+    /*
+     * We are going to skip WAL-logging for storage of persistent relations
+     * created in the current transaction when wal_level = minimal. The
+     * relation needs to be synced at commit.
+     */
+    if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+    {
+        pending = (PendingRelDelete *)
+            MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+        pending->relnode = rnode;
+        pending->backend = backend;
+        pending->atCommit = true;
+        pending->op = PENDING_SYNC;
+        pending->nestLevel = GetCurrentTransactionNestLevel();
+        pending->next = pendingDeletes;
+        pendingDeletes = pending;
+    }
+
     return srel;
 }
 
@@ -155,6 +177,7 @@ RelationDropStorage(Relation rel)
     pending->relnode = rel->rd_node;
     pending->backend = rel->rd_backend;
     pending->atCommit = true;    /* delete if commit */
+    pending->op = PENDING_DELETE;
     pending->nestLevel = GetCurrentTransactionNestLevel();
     pending->next = pendingDeletes;
     pendingDeletes = pending;
@@ -201,7 +224,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
     {
         next = pending->next;
         if (RelFileNodeEquals(rnode, pending->relnode)
-            && pending->atCommit == atCommit)
+            && pending->atCommit == atCommit
+            && pending->op == PENDING_DELETE)
         {
             /* unlink and delete list entry */
             if (prev)
@@ -406,6 +430,7 @@ smgrDoPendingDeletes(bool isCommit)
                 i = 0,
                 maxrels = 0;
     SMgrRelation *srels = NULL;
+    struct HTAB *synchash = NULL;
 
     prev = NULL;
     for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -428,21 +453,50 @@ smgrDoPendingDeletes(bool isCommit)
             {
                 SMgrRelation srel;
 
-                srel = smgropen(pending->relnode, pending->backend);
-
-                /* allocate the initial array, or extend it, if needed */
-                if (maxrels == 0)
+                if (pending->op == PENDING_SYNC)
                 {
-                    maxrels = 8;
-                    srels = palloc(sizeof(SMgrRelation) * maxrels);
-                }
-                else if (maxrels <= nrels)
-                {
-                    maxrels *= 2;
-                    srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
-                }
+                    /* We don't have abort-time pending syncs */
+                    Assert(isCommit);
 
-                srels[nrels++] = srel;
+                    /* Create hash if not yet */
+                    if (synchash == NULL)
+                    {
+                        HASHCTL hash_ctl;
+
+                        memset(&hash_ctl, 0, sizeof(hash_ctl));
+                        hash_ctl.keysize = sizeof(SMgrRelation*);
+                        hash_ctl.entrysize = sizeof(SMgrRelation*);
+                        hash_ctl.hcxt = CurrentMemoryContext;
+                        synchash =
+                            hash_create("pending sync hash", 8,
+                                        &hash_ctl,
+                                        HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+                    }
+
+                    /* Collect pending syncs */
+                    srel = smgropen(pending->relnode, pending->backend);
+                    (void) hash_search(synchash, (void *) &srel,
+                                       HASH_ENTER, NULL);
+                }
+                else
+                {
+                    /* Collect pending deletions */
+                    srel = smgropen(pending->relnode, pending->backend);
+
+                    /* allocate the initial array, or extend it, if needed */
+                    if (maxrels == 0)
+                    {
+                        maxrels = 8;
+                        srels = palloc(sizeof(SMgrRelation) * maxrels);
+                    }
+                    else if (maxrels <= nrels)
+                    {
+                        maxrels *= 2;
+                        srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+                    }
+
+                    srels[nrels++] = srel;
+                }
             }
             /* must explicitly free the list entry */
             pfree(pending);
@@ -450,6 +504,43 @@ smgrDoPendingDeletes(bool isCommit)
         }
     }
 
+    /* Sync only files that are not to be removed. */
+    if (synchash)
+    {
+        HASH_SEQ_STATUS hstat;
+        SMgrRelation *psrel;
+
+        /* remove to-be-removed files from synchash */
+        if (nrels > 0)
+        {
+            int i;
+            bool found;
+
+            for (i = 0 ; i < nrels ; i++)
+                (void) hash_search(synchash, (void *) &(srels[i]),
+                                   HASH_REMOVE, &found);
+        }
+
+        /* sync survuvied files */
+        hash_seq_init(&hstat, synchash);
+        while ((psrel = (SMgrRelation *) hash_seq_search(&hstat)) != NULL)
+        {
+            ForkNumber fork;
+
+            /* Perform pending sync of WAL-skipped relation */
+            FlushRelationBuffersWithoutRelcache((*psrel)->smgr_rnode.node,
+                                                false);
+            for (fork = 0; fork <= MAX_FORKNUM; fork++)
+            {
+                if (smgrexists(*psrel, fork))
+                    smgrimmedsync(*psrel, fork);
+            }
+        }
+
+        hash_destroy(synchash);
+        synchash = NULL;
+    }
+
     if (nrels > 0)
     {
         smgrdounlinkall(srels, nrels, false);
@@ -462,11 +553,12 @@ smgrDoPendingDeletes(bool isCommit)
 }
 
 /*
- * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
+ * smgrGetPendingOperations() -- Get a list of non-temp relations to be
+ *                                 deleted or synced.
  *
- * The return value is the number of relations scheduled for termination.
- * *ptr is set to point to a freshly-palloc'd array of RelFileNodes.
- * If there are no relations to be deleted, *ptr is set to NULL.
+ * The return value is the number of relations scheduled for the operation
+ * specified by op. *ptr is set to point to a freshly-palloc'd array of
+ * RelFileNodes.  If there are no matching relations, *ptr is set to NULL.
  *
  * Only non-temporary relations are included in the returned list.  This is OK
  * because the list is used only in contexts where temporary relations don't
@@ -475,11 +567,11 @@ smgrDoPendingDeletes(bool isCommit)
  * (and all temporary files will be zapped if we restart anyway, so no need
  * for redo to do it also).
  *
- * Note that the list does not include anything scheduled for termination
- * by upper-level transactions.
+ * Note that the list does not include anything scheduled by upper-level
+ * transactions.
  */
-int
-smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
+static inline int
+smgrGetPendingOperations(PendingOpType op, bool forCommit, RelFileNode **ptr)
 {
     int            nestLevel = GetCurrentTransactionNestLevel();
     int            nrels;
@@ -490,7 +582,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
     {
         if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
-            && pending->backend == InvalidBackendId)
+            && pending->backend == InvalidBackendId
+            && pending->op == op)
             nrels++;
     }
     if (nrels == 0)
@@ -503,7 +596,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
     {
         if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
-            && pending->backend == InvalidBackendId)
+            && pending->backend == InvalidBackendId
+            && pending->op == op)
         {
             *rptr = pending->relnode;
             rptr++;
@@ -512,6 +606,20 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     return nrels;
 }
 
+/* Returns list of pending deletes, see smgrGetPendingOperations for details */
+int
+smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
+{
+    return smgrGetPendingOperations(PENDING_DELETE, forCommit, ptr);
+}
+
+/* Returns list of pending syncs, see smgrGetPendingOperations for details */
+int
+smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr)
+{
+    return smgrGetPendingOperations(PENDING_SYNC, forCommit, ptr);
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 28985a07ec..f665ee8358 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1034,12 +1034,36 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
 
     if (OidIsValid(relfilenode1) && OidIsValid(relfilenode2))
     {
+        Relation rel1;
+        Relation rel2;
+
         /*
          * Normal non-mapped relations: swap relfilenodes, reltablespaces,
          * relpersistence
          */
         Assert(!target_is_pg_class);
 
+        /* Update creation subid hints of relcache */
+        rel1 = relation_open(r1, ExclusiveLock);
+        rel2 = relation_open(r2, ExclusiveLock);
+
+        /*
+         * New relation's relfilenode is created in the current transaction
+         * and used as old ralation's new relfilenode. So its
+         * newRelfilenodeSubid as new relation's createSubid. We don't fix
+         * rel2 since it would be deleted soon.
+         */
+        Assert(rel2->rd_createSubid != InvalidSubTransactionId);
+        rel1->rd_newRelfilenodeSubid = rel2->rd_createSubid;
+
+        /* record the first relfilenode change in the current transaction */
+        if (rel1->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+            rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
+        relation_close(rel1, ExclusiveLock);
+        relation_close(rel2, ExclusiveLock);
+
+        /* swap relfilenodes, reltablespaces, relpersistence */
         swaptemp = relform1->relfilenode;
         relform1->relfilenode = relform2->relfilenode;
         relform2->relfilenode = swaptemp;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3aeef30b28..3ce04f7efc 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2534,9 +2534,6 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
     for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++)
         ExecDropSingleTupleTableSlot(buffer->slots[i]);
 
-    table_finish_bulk_insert(buffer->resultRelInfo->ri_RelationDesc,
-                             miinfo->ti_options);
-
     pfree(buffer);
 }
 
@@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
      * If it does commit, we'll have done the table_finish_bulk_insert() at
      * the bottom of this routine first.
      *
-     * As mentioned in comments in utils/rel.h, the in-same-transaction test
-     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
-     * can be cleared before the end of the transaction. The exact case is
-     * when a relation sets a new relfilenode twice in same transaction, yet
-     * the second one fails in an aborted subtransaction, e.g.
-     *
-     * BEGIN;
-     * TRUNCATE t;
-     * SAVEPOINT save;
-     * TRUNCATE t;
-     * ROLLBACK TO save;
-     * COPY ...
-     *
-     * Also, if the target file is new-in-transaction, we assume that checking
-     * FSM for free space is a waste of time, even if we must use WAL because
-     * of archiving.  This could possibly be wrong, but it's unlikely.
-     *
-     * The comments for table_tuple_insert and RelationGetBufferForTuple
-     * specify that skipping WAL logging is only safe if we ensure that our
-     * tuples do not go into pages containing tuples from any other
-     * transactions --- but this must be the case if we have a new table or
-     * new relfilenode, so we need no additional work to enforce that.
+     * If the target file is new-in-transaction, we assume that checking FSM
+     * for free space is a waste of time, even if we must use WAL because of
+     * archiving.  This could possibly be wrong, but it's unlikely.
      *
      * We currently don't support this optimization if the COPY target is a
      * partitioned table as we currently only lazily initialize partition
@@ -2762,15 +2740,14 @@ CopyFrom(CopyState cstate)
      * are not supported as per the description above.
      *----------
      */
-    /* createSubid is creation check, newRelfilenodeSubid is truncation check */
+    /*
+     * createSubid is creation check, firstRelfilenodeSubid is truncation and
+     * cluster check. Partitioned table doesn't have storage.
+     */
     if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
         (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
-         cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
-    {
+         cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
         ti_options |= TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
-    }
 
     /*
      * Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index b7d220699f..8a91d946e3 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+    myState->ti_options = TABLE_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    table_finish_bulk_insert(myState->rel, myState->ti_options);
-
     /* close rel, but keep lock until commit */
     table_close(myState->rel, NoLock);
     myState->rel = NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 537d0e8cef..1c854dcebf 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * replication is in use. We can skip the FSM in any case.
      */
     myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
-    if (!XLogIsNeeded())
-        myState->ti_options |= TABLE_INSERT_SKIP_WAL;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    table_finish_bulk_insert(myState->transientrel, myState->ti_options);
-
     /* close transientrel, but keep lock until commit */
     table_close(myState->transientrel, NoLock);
     myState->transientrel = NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index cceefbdd49..2468b178cb 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4762,9 +4762,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 
     /*
      * Prepare a BulkInsertState and options for table_tuple_insert. Because
-     * we're building a new heap, we can skip WAL-logging and fsync it to disk
-     * at the end instead (unless WAL-logging is required for archiving or
-     * streaming replication). The FSM is empty too, so don't bother using it.
+     * we're building a new heap, the underlying table AM can skip WAL-logging
+     * and smgr will sync the relation to disk at the end of the current
+     * transaction instead. The FSM is empty too, so don't bother using it.
      */
     if (newrel)
     {
@@ -4772,8 +4772,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         ti_options = TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
     }
     else
     {
@@ -5058,8 +5056,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
     {
         FreeBulkInsertState(bistate);
 
-        table_finish_bulk_insert(newrel, ti_options);
-
         table_close(newrel, NoLock);
     }
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6f3a402854..41ff6da9d9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -171,6 +171,7 @@ static HTAB *PrivateRefCountHash = NULL;
 static int32 PrivateRefCountOverflowed = 0;
 static uint32 PrivateRefCountClock = 0;
 static PrivateRefCountEntry *ReservedRefCountEntry = NULL;
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 
 static void ReservePrivateRefCountEntry(void);
 static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
@@ -3191,20 +3192,32 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
-    /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3221,7 +3234,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3251,18 +3264,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 07f3c93d3f..514c6098e6 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -994,6 +994,36 @@ ForgetDatabaseSyncRequests(Oid dbid)
     RegisterSyncRequest(&tag, SYNC_FILTER_REQUEST, true /* retryOnError */ );
 }
 
+/*
+ * SyncRelationFiles -- sync files of all given relations
+ *
+ * This function is assumed to be called only when skipping WAL-logging and
+ * emits no xlog records.
+ */
+void
+SyncRelationFiles(RelFileNode *syncrels, int nsyncrels)
+{
+    int            i;
+
+    for (i = 0; i < nsyncrels; i++)
+    {
+        SMgrRelation srel;
+        ForkNumber    fork;
+
+        /* sync all existing forks of the relation */
+        FlushRelationBuffersWithoutRelcache(syncrels[i], false);
+        srel = smgropen(syncrels[i], InvalidBackendId);
+
+        for (fork = 0; fork <= MAX_FORKNUM; fork++)
+        {
+            if (smgrexists(srel, fork))
+                smgrimmedsync(srel, fork);
+        }
+
+        smgrclose(srel);
+    }
+}
+
 /*
  * DropRelationFiles -- drop files of all given relations
  */
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 248860758c..147babb6b5 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1096,6 +1096,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
     relation->rd_isnailed = false;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     switch (relation->rd_rel->relpersistence)
     {
         case RELPERSISTENCE_UNLOGGED:
@@ -1829,6 +1830,7 @@ formrdesc(const char *relationName, Oid relationReltype,
     relation->rd_isnailed = true;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     relation->rd_backend = InvalidBackendId;
     relation->rd_islocaltemp = false;
 
@@ -2094,7 +2096,7 @@ RelationClose(Relation relation)
 #ifdef RELCACHE_FORCE_RELEASE
     if (RelationHasReferenceCountZero(relation) &&
         relation->rd_createSubid == InvalidSubTransactionId &&
-        relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
         RelationClearRelation(relation, false);
 #endif
 }
@@ -2510,8 +2512,8 @@ RelationClearRelation(Relation relation, bool rebuild)
          * problem.
          *
          * When rebuilding an open relcache entry, we must preserve ref count,
-         * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state.  Also
-         * attempt to preserve the pg_class entry (rd_rel), tupledesc,
+         * rd_createSubid/rd_new/firstRelfilenodeSubid, and rd_toastoid state.
+         * Also attempt to preserve the pg_class entry (rd_rel), tupledesc,
          * rewrite-rule, partition key, and partition descriptor substructures
          * in place, because various places assume that these structures won't
          * move while they are working with an open relcache entry.  (Note:
@@ -2600,6 +2602,7 @@ RelationClearRelation(Relation relation, bool rebuild)
         /* creation sub-XIDs must be preserved */
         SWAPFIELD(SubTransactionId, rd_createSubid);
         SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+        SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
         /* un-swap rd_rel pointers, swap contents instead */
         SWAPFIELD(Form_pg_class, rd_rel);
         /* ... but actually, we don't have to update newrel->rd_rel */
@@ -2667,7 +2670,7 @@ static void
 RelationFlushRelation(Relation relation)
 {
     if (relation->rd_createSubid != InvalidSubTransactionId ||
-        relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
     {
         /*
          * New relcache entries are always rebuilt, not flushed; else we'd
@@ -2807,7 +2810,7 @@ RelationCacheInvalidate(void)
          * pending invalidations.
          */
         if (relation->rd_createSubid != InvalidSubTransactionId ||
-            relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+            relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
             continue;
 
         relcacheInvalsReceived++;
@@ -3064,6 +3067,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
      * Likewise, reset the hint about the relfilenode being new.
      */
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 }
 
 /*
@@ -3155,7 +3159,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
     }
 
     /*
-     * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+     * Likewise, update or drop any new-relfilenode-in-subtransaction hints.
      */
     if (relation->rd_newRelfilenodeSubid == mySubid)
     {
@@ -3164,6 +3168,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         else
             relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     }
+
+    if (relation->rd_firstRelfilenodeSubid == mySubid)
+    {
+        if (isCommit)
+            relation->rd_firstRelfilenodeSubid = parentSubid;
+        else
+            relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    }
 }
 
 
@@ -3253,6 +3265,7 @@ RelationBuildLocalRelation(const char *relname,
     /* it's being created in this transaction */
     rel->rd_createSubid = GetCurrentSubTransactionId();
     rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 
     /*
      * create a new tuple descriptor from the one passed in.  We do this
@@ -3556,6 +3569,8 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
      * operations on the rel in the same transaction.
      */
     relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+    if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
 
     /* Flag relation as needing eoxact cleanup (to remove the hint) */
     EOXactListAdd(relation);
@@ -5592,6 +5607,7 @@ load_relcache_init_file(bool shared)
         rel->rd_fkeylist = NIL;
         rel->rd_createSubid = InvalidSubTransactionId;
         rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+        rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
         rel->rd_amcache = NULL;
         MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6bc9..80c2e1bafc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
                                        TransactionId OldestXmin, TransactionId FreezeXid,
-                                       MultiXactId MultiXactCutoff, bool use_wal);
+                                       MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
                                HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7f81703b78..b652cd6cef 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -407,22 +407,6 @@ typedef struct TableAmRoutine
                                uint8 flags,
                                TM_FailureData *tmfd);
 
-    /*
-     * Perform operations necessary to complete insertions made via
-     * tuple_insert and multi_insert with a BulkInsertState specified. This
-     * may for example be used to flush the relation, when the
-     * TABLE_INSERT_SKIP_WAL option was used.
-     *
-     * Typically callers of tuple_insert and multi_insert will just pass all
-     * the flags that apply to them, and each AM has to decide which of them
-     * make sense for it, and then only take actions in finish_bulk_insert for
-     * those flags, and ignore others.
-     *
-     * Optional callback.
-     */
-    void        (*finish_bulk_insert) (Relation rel, int options);
-
-
     /* ------------------------------------------------------------------------
      * DDL related functionality.
      * ------------------------------------------------------------------------
@@ -1087,10 +1071,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * The options bitmask allows the caller to specify options that may change the
  * behaviour of the AM. The AM will ignore options that it does not support.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.
@@ -1112,8 +1092,7 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * heap's TOAST table, too, if the tuple requires any out-of-line data.
  *
  * The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
  *
  * On return the slot's tts_tid and tts_tableOid are updated to reflect the
  * insertion. But note that any toasting of fields within the slot is NOT
@@ -1248,6 +1227,8 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  * update was done.  However, any TOAST changes in the new tuple's
  * data are not reflected into *newtup.
  *
+ * See table_insert about skipping WAL-logging feature.
+ *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
  * t_xmax, and, if possible, t_cmax.  See comments for struct TM_FailureData
  * for additional info.
@@ -1308,21 +1289,6 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
                                        flags, tmfd);
 }
 
-/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
- */
-static inline void
-table_finish_bulk_insert(Relation rel, int options)
-{
-    /* optional callback */
-    if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
-        rel->rd_tableam->finish_bulk_insert(rel, options);
-}
-
-
 /* ------------------------------------------------------------------------
  * DDL related functionality.
  * ------------------------------------------------------------------------
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3579d3f3eb..1de6f1655c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,6 +19,13 @@
 #include "storage/smgr.h"
 #include "utils/relcache.h"
 
+/* enum for operation type of PendingDelete entries */
+typedef enum PendingOpType
+{
+    PENDING_DELETE,
+    PENDING_SYNC
+} PendingOpType;
+
 extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
@@ -32,6 +39,7 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
  */
 extern void smgrDoPendingDeletes(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern int    smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
 extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 509f4b7ef1..ace5f5a2ae 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                                    ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                                    ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index c0f05e23ff..2bb2947bdb 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -42,6 +42,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 
 extern void ForgetDatabaseSyncRequests(Oid dbid);
+extern void SyncRelationFiles(RelFileNode *syncrels, int nsyncrels);
 extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
 
 /* md sync callbacks */
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index c5d36680a2..f372dc2086 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -75,10 +75,17 @@ typedef struct RelationData
      * transaction, with one of them occurring in a subsequently aborted
      * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
      * ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+     * rd_firstRelfilenodeSubid is the ID of the first subtransaction the
+     * relfilenode change has took place in the current transaction. Unlike
+     * newRelfilenodeSubid, this won't be accidentially forgotten. A valid OID
+     * means that the currently active relfilenode is transaction-local and we
+     * sync the relation at commit instead of WAL-logging.
      */
     SubTransactionId rd_createSubid;    /* rel was created in current xact */
     SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
                                                  * current xact */
+    SubTransactionId rd_firstRelfilenodeSubid;    /* new relfilenode assigned
+                                                 * first in current xact */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
@@ -514,9 +521,15 @@ typedef struct ViewOptions
 /*
  * RelationNeedsWAL
  *        True if relation needs WAL.
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.
  */
-#define RelationNeedsWAL(relation) \
-    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+#define RelationNeedsWAL(relation)                                        \
+    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&    \
+     (XLogIsNeeded() ||                                                    \
+      (relation->rd_createSubid == InvalidSubTransactionId &&            \
+       relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
 
 /*
  * RelationUsesLocalBuffers
-- 
2.16.3

From 6f6b87ef06e26ad8222f5900f8e3b146d2f18cba Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@oss.ntt.co.jp>
Date: Wed, 29 May 2019 23:03:22 +0900
Subject: [PATCH 3/3] Rename smgrDoPendingDeletes to smgrDoPendingOperations

The function longer does only deletions but also syncs. Rename the
function to reflect that. smgrGetPendingDeletes is not renamed since it
does not change behavior.
---
 src/backend/access/transam/xact.c |  4 +-
 src/backend/catalog/storage.c     | 91 ++++++++++++++++++++-------------------
 src/include/catalog/storage.h     |  2 +-
 3 files changed, 49 insertions(+), 48 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index f594d33e7a..0123fb0f7f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2228,7 +2228,7 @@ CommitTransaction(void)
      * Other backends will observe the attendant catalog changes and not
      * attempt to access affected files.
      */
-    smgrDoPendingDeletes(true);
+    smgrDoPendingOperations(true);
 
     AtCommit_Notify();
     AtEOXact_GUC(true, 1);
@@ -2716,7 +2716,7 @@ AbortTransaction(void)
         ResourceOwnerRelease(TopTransactionResourceOwner,
                              RESOURCE_RELEASE_AFTER_LOCKS,
                              false, true);
-        smgrDoPendingDeletes(false);
+        smgrDoPendingOperations(false);
 
         AtEOXact_GUC(false, 1);
         AtEOXact_SPI(false);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 354a74c27c..544ef3aa55 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -54,18 +54,18 @@
  * but I'm being paranoid.
  */
 
-/* entry type of pendingDeletes */
-typedef struct PendingRelDelete
+/* entry type of pendingOperations */
+typedef struct PendingRelOp
 {
     RelFileNode relnode;        /* relation that may need to be deleted */
     BackendId    backend;        /* InvalidBackendId if not a temp rel */
     bool        atCommit;        /* T=work at commit; F=work at abort */
     PendingOpType    op;            /* type of operation to do */
     int            nestLevel;        /* xact nesting level of request */
-    struct PendingRelDelete *next;    /* linked-list link */
-} PendingRelDelete;
+    struct PendingRelOp *next;    /* linked-list link */
+} PendingRelOp;
 
-static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOp *pendingOperations = NULL; /* head of linked list */
 
 /*
  * RelationCreateStorage
@@ -81,7 +81,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 SMgrRelation
 RelationCreateStorage(RelFileNode rnode, char relpersistence)
 {
-    PendingRelDelete *pending;
+    PendingRelOp *pending;
     SMgrRelation srel;
     BackendId    backend;
     bool        needs_wal;
@@ -112,15 +112,15 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
         log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
 
     /* Add the relation to the list of stuff to delete at abort */
-    pending = (PendingRelDelete *)
-        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+    pending = (PendingRelOp *)
+        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
     pending->relnode = rnode;
     pending->backend = backend;
     pending->atCommit = false;    /* delete if abort */
     pending->op = PENDING_DELETE;
     pending->nestLevel = GetCurrentTransactionNestLevel();
-    pending->next = pendingDeletes;
-    pendingDeletes = pending;
+    pending->next = pendingOperations;
+    pendingOperations = pending;
 
     /*
      * We are going to skip WAL-logging for storage of persistent relations
@@ -129,15 +129,15 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
      */
     if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
     {
-        pending = (PendingRelDelete *)
-            MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+        pending = (PendingRelOp *)
+            MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
         pending->relnode = rnode;
         pending->backend = backend;
         pending->atCommit = true;
         pending->op = PENDING_SYNC;
         pending->nestLevel = GetCurrentTransactionNestLevel();
-        pending->next = pendingDeletes;
-        pendingDeletes = pending;
+        pending->next = pendingOperations;
+        pendingOperations = pending;
     }
 
     return srel;
@@ -169,27 +169,27 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
 void
 RelationDropStorage(Relation rel)
 {
-    PendingRelDelete *pending;
+    PendingRelOp *pending;
 
     /* Add the relation to the list of stuff to delete at commit */
-    pending = (PendingRelDelete *)
-        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+    pending = (PendingRelOp *)
+        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
     pending->relnode = rel->rd_node;
     pending->backend = rel->rd_backend;
     pending->atCommit = true;    /* delete if commit */
     pending->op = PENDING_DELETE;
     pending->nestLevel = GetCurrentTransactionNestLevel();
-    pending->next = pendingDeletes;
-    pendingDeletes = pending;
+    pending->next = pendingOperations;
+    pendingOperations = pending;
 
     /*
      * NOTE: if the relation was created in this transaction, it will now be
      * present in the pending-delete list twice, once with atCommit true and
      * once with atCommit false.  Hence, it will be physically deleted at end
      * of xact in either case (and the other entry will be ignored by
-     * smgrDoPendingDeletes, so no error will occur).  We could instead remove
-     * the existing list entry and delete the physical file immediately, but
-     * for now I'll keep the logic simple.
+     * smgrDoPendingOperations, so no error will occur).  We could instead
+     * remove the existing list entry and delete the physical file
+     * immediately, but for now I'll keep the logic simple.
      */
 
     RelationCloseSmgr(rel);
@@ -215,12 +215,12 @@ RelationDropStorage(Relation rel)
 void
 RelationPreserveStorage(RelFileNode rnode, bool atCommit)
 {
-    PendingRelDelete *pending;
-    PendingRelDelete *prev;
-    PendingRelDelete *next;
+    PendingRelOp *pending;
+    PendingRelOp *prev;
+    PendingRelOp *next;
 
     prev = NULL;
-    for (pending = pendingDeletes; pending != NULL; pending = next)
+    for (pending = pendingOperations; pending != NULL; pending = next)
     {
         next = pending->next;
         if (RelFileNodeEquals(rnode, pending->relnode)
@@ -231,7 +231,7 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
             if (prev)
                 prev->next = next;
             else
-                pendingDeletes = next;
+                pendingOperations = next;
             pfree(pending);
             /* prev does not change */
         }
@@ -409,7 +409,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 }
 
 /*
- *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
+ *    smgrDoPendingOperations() -- Take care of relation deletes and syncs at
+ *                                 end of xact.
  *
  * This also runs when aborting a subxact; we want to clean up a failed
  * subxact immediately.
@@ -420,12 +421,12 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
  * already recovered the physical storage.
  */
 void
-smgrDoPendingDeletes(bool isCommit)
+smgrDoPendingOperations(bool isCommit)
 {
     int            nestLevel = GetCurrentTransactionNestLevel();
-    PendingRelDelete *pending;
-    PendingRelDelete *prev;
-    PendingRelDelete *next;
+    PendingRelOp *pending;
+    PendingRelOp *prev;
+    PendingRelOp *next;
     int            nrels = 0,
                 i = 0,
                 maxrels = 0;
@@ -433,7 +434,7 @@ smgrDoPendingDeletes(bool isCommit)
     struct HTAB *synchash = NULL;
 
     prev = NULL;
-    for (pending = pendingDeletes; pending != NULL; pending = next)
+    for (pending = pendingOperations; pending != NULL; pending = next)
     {
         next = pending->next;
         if (pending->nestLevel < nestLevel)
@@ -447,7 +448,7 @@ smgrDoPendingDeletes(bool isCommit)
             if (prev)
                 prev->next = next;
             else
-                pendingDeletes = next;
+                pendingOperations = next;
             /* do deletion if called for */
             if (pending->atCommit == isCommit)
             {
@@ -576,10 +577,10 @@ smgrGetPendingOperations(PendingOpType op, bool forCommit, RelFileNode **ptr)
     int            nestLevel = GetCurrentTransactionNestLevel();
     int            nrels;
     RelFileNode *rptr;
-    PendingRelDelete *pending;
+    PendingRelOp *pending;
 
     nrels = 0;
-    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    for (pending = pendingOperations; pending != NULL; pending = pending->next)
     {
         if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
             && pending->backend == InvalidBackendId
@@ -593,7 +594,7 @@ smgrGetPendingOperations(PendingOpType op, bool forCommit, RelFileNode **ptr)
     }
     rptr = (RelFileNode *) palloc(nrels * sizeof(RelFileNode));
     *ptr = rptr;
-    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    for (pending = pendingOperations; pending != NULL; pending = pending->next)
     {
         if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
             && pending->backend == InvalidBackendId
@@ -630,13 +631,13 @@ smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr)
 void
 PostPrepare_smgr(void)
 {
-    PendingRelDelete *pending;
-    PendingRelDelete *next;
+    PendingRelOp *pending;
+    PendingRelOp *next;
 
-    for (pending = pendingDeletes; pending != NULL; pending = next)
+    for (pending = pendingOperations; pending != NULL; pending = next)
     {
         next = pending->next;
-        pendingDeletes = next;
+        pendingOperations = next;
         /* must explicitly free the list entry */
         pfree(pending);
     }
@@ -646,15 +647,15 @@ PostPrepare_smgr(void)
 /*
  * AtSubCommit_smgr() --- Take care of subtransaction commit.
  *
- * Reassign all items in the pending-deletes list to the parent transaction.
+ * Reassign all items in the pending-operations list to the parent transaction.
  */
 void
 AtSubCommit_smgr(void)
 {
     int            nestLevel = GetCurrentTransactionNestLevel();
-    PendingRelDelete *pending;
+    PendingRelOp *pending;
 
-    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    for (pending = pendingOperations; pending != NULL; pending = pending->next)
     {
         if (pending->nestLevel >= nestLevel)
             pending->nestLevel = nestLevel - 1;
@@ -671,7 +672,7 @@ AtSubCommit_smgr(void)
 void
 AtSubAbort_smgr(void)
 {
-    smgrDoPendingDeletes(false);
+    smgrDoPendingOperations(false);
 }
 
 void
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 1de6f1655c..dcb3bc4b69 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -37,7 +37,7 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
-extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingOperations(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
 extern int    smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr);
 extern void AtSubCommit_smgr(void);
-- 
2.16.3

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

22 августа 2019 г., 07:08:09

On Wed, Aug 21, 2019 at 04:32:38PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 19 Aug 2019 18:59:59 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
<20190819.185959.118543656.horikyota.ntt@gmail.com>
> > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190818035230.GB3021338@rfd.leadboat.com>
> > > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
>  
> Now TwoPhaseFileHeader has two new members for pending syncs. It
> is useless on wal-replay, but that is needed for commit-prepared.

Syncs need to happen in PrepareTransaction(), not in commit-prepared.  I wrote
about that in https://postgr.es/m/20190820060314.GA3086296@rfd.leadboat.com

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

22 августа 2019 г., 15:06:06

Hello.

At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190820060314.GA3086296@rfd.leadboat.com>
> On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190818035230.GB3021338@rfd.leadboat.com>
> > > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
> > 
> > Now TwoPhaseFileHeader has two new members for (commit-time)
> > pending syncs. Pending-syncs are useless on wal-replay, but that
> > is needed for commit-prepared.
> 
> There's no need to modify TwoPhaseFileHeader or the COMMIT PREPARED sql
> command, which is far too late to be syncing new relation files.  (A crash may
> have already destroyed their data.)  PrepareTransaction(), which implements
> the PREPARE TRANSACTION command, is the right place for these syncs.
> 
> A failure in these new syncs needs to prevent the transaction from being
> marked committed.  Hence, in CommitTransaction(), these new syncs need to

Agreed.

> happen after the last step that could create assign a new relfilenode and
> before RecordTransactionCommit().  I suspect it's best to do it after
> PreCommit_on_commit_actions() and before AtEOXact_LargeObject().

I don't find an obvious problem there. Since pending deletes and
pending syncs are separately processed, I'm planning to make a
separate list for syncs from deletes.

> > > This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL for
> > > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There may be
> > > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
> > > false due to this code, use RelationNeedsWAL() for multiple forks, and then
> > > not actually sync all forks.
> > 
> > I agree that all forks needs syncing, but FSM and VM are checking
> > RelationNeedsWAL(modified). To make sure, are you suggesting to
> > sync all forks instead of emitting WAL for them, or suggesting
> > that VM and FSM to emit WALs even when the modified
> > RelationNeedsWAL returns false (+ sync all forks)?
> 
> I hadn't thought that far.  What do you think is best?

As in the latest patch, sync ALL forks then no WALs. We could
skip syncing FSM but I'm not sure it's work doing.


> > > The https://postgr.es/m/559FA0BA.3080808@iki.fi design had another component
> > > not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the relation,
> > > or if it's smaller than some threshold, WAL-log the contents of the whole file
> > > at that point."  Please write the part to WAL-log the contents of small files
> > > instead of syncing them.
> > 
> > I'm not sure the point of the behavior. I suppose that the "log"
> > is a sequence of new_page records. It also needs to be synced and
> > it is always larger than the file to be synced. I can't think of
> > an appropriate threshold without the point.
> 
> Yes, it would be a sequence of new-page records.  FlushRelationBuffers() locks
> every buffer header containing a buffer of the current database.  The belief
> has been that writing one page to xlog is cheaper than FlushRelationBuffers()
> in a busy system with large shared_buffers.

I'm at a loss.. The decision between WAL and sync is made at
commit time, when we no longer have a pin on a buffer. When
emitting WAL, opposite to the assumption, lock needs to be
re-acquired for every page to emit log_new_page. What is worse,
we may need to reload evicted buffers.  If the file has been
CopyFrom'ed, ring buffer strategy makes the situnation farther
worse. That doesn't seem cheap at all..

If there were any chance on WAL for smaller files here, it would
be on the files smaller than the ring size of bulk-write
strategy(16MB).

If we pick up every buffer page of the file instead of scanning
through all buffers, that makes things worse by conflicts on
partition locks.

Any thoughts?



# Sorry time's up today.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

26 августа 2019 г., 08:08:43

On Thu, Aug 22, 2019 at 09:06:06PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190820060314.GA3086296@rfd.leadboat.com>
> > On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> > > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190818035230.GB3021338@rfd.leadboat.com>
> > > > The https://postgr.es/m/559FA0BA.3080808@iki.fi design had another component
> > > > not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the relation,
> > > > or if it's smaller than some threshold, WAL-log the contents of the whole file
> > > > at that point."  Please write the part to WAL-log the contents of small files
> > > > instead of syncing them.
> > > 
> > > I'm not sure the point of the behavior. I suppose that the "log"
> > > is a sequence of new_page records. It also needs to be synced and
> > > it is always larger than the file to be synced. I can't think of
> > > an appropriate threshold without the point.
> > 
> > Yes, it would be a sequence of new-page records.  FlushRelationBuffers() locks
> > every buffer header containing a buffer of the current database.  The belief
> > has been that writing one page to xlog is cheaper than FlushRelationBuffers()
> > in a busy system with large shared_buffers.
> 
> I'm at a loss.. The decision between WAL and sync is made at
> commit time, when we no longer have a pin on a buffer. When
> emitting WAL, opposite to the assumption, lock needs to be
> re-acquired for every page to emit log_new_page. What is worse,
> we may need to reload evicted buffers.  If the file has been
> CopyFrom'ed, ring buffer strategy makes the situnation farther
> worse. That doesn't seem cheap at all..

Consider a one-page relfilenode.  Doing all the things you list for a single
page may be cheaper than locking millions of buffer headers.

> If there were any chance on WAL for smaller files here, it would
> be on the files smaller than the ring size of bulk-write
> strategy(16MB).

Like you, I expect the optimal threshold is less than 16MB, though you should
benchmark to see.  Under the ideal threshold, when a transaction creates a new
relfilenode just smaller than the threshold, that transaction will be somewhat
slower than it would be if the threshold were zero.  Locking every buffer
header causes a distributed slow-down for other queries, and protecting the
latency of non-DDL queries is typically more useful than accelerating
TRUNCATE, CREATE TABLE, etc.  Writing more WAL also slows down other queries;
beyond a certain relfilenode size, the extra WAL harms non-DDL queries more
than the buffer scan harms them.  That's about where the threshold should be.

This should be GUC-controlled, especially since this is back-patch material.
We won't necessarily pick the best value on the first attempt, and the best
value could depend on factors like the filesystem, the storage hardware, and
the database's latency goals.  One could define the GUC as an absolute size
(e.g. 1MB) or as a ratio of shared_buffers (e.g. GUC value of 0.001 means the
threshold is 1MB when shared_buffers is 1GB).  I'm not sure which is better.

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

27 августа 2019 г., 09:49:32

Hello.

At Sun, 25 Aug 2019 22:08:43 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190826050843.GB3153606@rfd.leadboat.com>
noah> On Thu, Aug 22, 2019 at 09:06:06PM +0900, Kyotaro Horiguchi wrote:
noah> > At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190820060314.GA3086296@rfd.leadboat.com>
> > > On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> > > > I'm not sure the point of the behavior. I suppose that the "log"
> > > > is a sequence of new_page records. It also needs to be synced and
> > > > it is always larger than the file to be synced. I can't think of
> > > > an appropriate threshold without the point.
> > > 
> > > Yes, it would be a sequence of new-page records.  FlushRelationBuffers() locks
> > > every buffer header containing a buffer of the current database.  The belief
> > > has been that writing one page to xlog is cheaper than FlushRelationBuffers()
> > > in a busy system with large shared_buffers.
> > 
> > I'm at a loss.. The decision between WAL and sync is made at
> > commit time, when we no longer have a pin on a buffer. When
> > emitting WAL, opposite to the assumption, lock needs to be
> > re-acquired for every page to emit log_new_page. What is worse,
> > we may need to reload evicted buffers.  If the file has been
> > CopyFrom'ed, ring buffer strategy makes the situnation farther
> > worse. That doesn't seem cheap at all..
> 
> Consider a one-page relfilenode.  Doing all the things you list for a single
> page may be cheaper than locking millions of buffer headers.

If I understand you correctly, I would say that *all* buffers
that don't belong to in-transaction-created files are skipped
before taking locks. No lock conflict happens with other
backends.

FlushRelationBuffers uses double-checked-locking as follows:

FlushRelationBuffers_common():
..
  if(!islocal) {
    for (i for all buffers) {
      if (RelFileNodeEquals(bufHder->tag.rnode, rnode)) {
        LockBufHdr(bufHdr);
        if (RelFileNodeEquals(bufHder->tag.rnode, rnode) && valid & dirty) {
          PinBuffer_Locked(bubHder);
          LWLockAcquire();
          FlushBuffer();

128GB shared buffers contain 16M buffers. On my
perhaps-Windows-Vista-era box, such loop takes 15ms. (Since it
has only 6GB, the test is ignoring the effect of cache that comes
from the difference of the buffer size). (attached 1)

With WAL-emitting we find every buffers of the file using buffer
hash, we suffer partition locks instead of the 15ms of local
latency. That seems worse.

> > If there were any chance on WAL for smaller files here, it would
> > be on the files smaller than the ring size of bulk-write
> > strategy(16MB).
> 
> Like you, I expect the optimal threshold is less than 16MB, though you should
> benchmark to see.  Under the ideal threshold, when a transaction creates a new
> relfilenode just smaller than the threshold, that transaction will be somewhat
> slower than it would be if the threshold were zero.  Locking every buffer

I looked closer on this.

For a 16MB file, the cost of write-fsyncing cost is almost the
same to that of WAL-emitting cost. It was about 200 ms on the
Vista-era machine with non-performant rotating magnetic disks
with xfs. (attached 2, 3) Although write-fsyncing of relation
file makes no lock conflict with other backends, WAL-emitting
delays other backends' commits at most by that milliseconds.

In summary, the characteristics of the two methods on a 16MB file
are as the follows.

File write:
 - 15ms of buffer scan without locks (@128GB shared buffer)

 + no hash search for a buffer

 = take locks on all buffers only of the file one by one (to write)

 + plus 200ms of write-fdatasync (of whole the relation file),
    which doesn't conflict with other backends. (except via CPU
    time slots and IO bandwidth.)

WAL write : 
 + no buffer scan

 - 2048 times (16M/8k) of partition lock on finding every buffer
   for the target file, which can conflict with other backends.

 = take locks on all buffers only of the file one by one (to take FPW)

 - plus 200ms of open(create)-write-fdatasync (of a WAL file (of
   default size)), which can delay commits on other backends at
   most by that duration.

> header causes a distributed slow-down for other queries, and protecting the
> latency of non-DDL queries is typically more useful than accelerating
> TRUNCATE, CREATE TABLE, etc.  Writing more WAL also slows down other queries;
> beyond a certain relfilenode size, the extra WAL harms non-DDL queries more
> than the buffer scan harms them.  That's about where the threshold should be.

If the discussion above is correct, we shouldn't use WAL-write
even for files around 16MB. For smaller shared_buffers and file
size, the delays are:

Scan all buffers takes:
  15  ms for 128GB shared_buffers
   4.5ms for 32GB shared_buffers

fdatasync takes:
  200 ms for  16MB/sync
   51 ms for   1MB/sync
   46 ms for 512kB/sync
   40 ms for 256kB/sync
   37 ms for 128kB/sync
   35 ms for <64kB/sync

It seems reasonable for 5400rpm disks. The threashold seems 64kB
on my configuration. It can differ by configuration but I think
not so largely. (I'm not sure about SSD or in-memory
filesystems.)

So for smaller than 64kB files:

File write:
 -- 15ms of buffer scan without locks
 +  no hash search for a buffer
 =  plus 35 ms of write-fdatasync

WAL write : 
 ++ no buffer scan
 -  one partition lock on finding every buffer for the target
    file, which can conflict with other backends. (but ignorable.)
 =  plus 35 ms of (open(create)-)write-fdatasync

It's possible that WAL records with smaller size is needless of
time for its own sync. This is the most obvious gain by WAL
emitting. considring 5-15ms of buffer scanning time, 256 or 512
kilobytes are the candidate default threshold but it would be
safe to use 64kB. 

> This should be GUC-controlled, especially since this is back-patch material.

Is this size of patch back-patchable?

> We won't necessarily pick the best value on the first attempt, and the best
> value could depend on factors like the filesystem, the storage hardware, and
> the database's latency goals.  One could define the GUC as an absolute size
> (e.g. 1MB) or as a ratio of shared_buffers (e.g. GUC value of 0.001 means the
> threshold is 1MB when shared_buffers is 1GB).  I'm not sure which is better.

I'm not sure whether the knob shows apparent performance gain and
whether we can offer the criteria to identify the proper
value. But I'll add this feature with a GUC
effective_io_block_size defaults to 64kB as the threshold in the
next version. (The name and default value are arguable, of course.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

27 августа 2019 г., 09:59:11

At Tue, 27 Aug 2019 15:49:32 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
<20190827.154932.250364935.horikyota.ntt@gmail.com>
> 128GB shared buffers contain 16M buffers. On my
> perhaps-Windows-Vista-era box, such loop takes 15ms. (Since it
> has only 6GB, the test is ignoring the effect of cache that comes
> from the difference of the buffer size). (attached 1)
...
> For a 16MB file, the cost of write-fsyncing cost is almost the
> same to that of WAL-emitting cost. It was about 200 ms on the
> Vista-era machine with non-performant rotating magnetic disks
> with xfs. (attached 2, 3) Although write-fsyncing of relation
> file makes no lock conflict with other backends, WAL-emitting
> delays other backends' commits at most by that milliseconds.

FWIW, the attached are the programs I used to take the numbers.

testloop.c: to take time to loop over buffers in FlushRelationBuffers

testfile.c: to take time to sync a heap file. (means one file for the size)

testfile2.c: to take time to emit a wal record. (means 16MB per file)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

typedef struct RelFileNode
{
  unsigned int spc;
  unsigned int db;
  unsigned int rel;
} RelFileNode;

typedef struct Buffer
{
  RelFileNode rnode;
} Buffer;

//#define NBUFFERS ((int)((128.0 * 1024 * 1024 * 1024) / (8.0 * 1024)))
#define NBUFFERS ((int)((32.0 * 1024 * 1024 * 1024) / (8.0 * 1024)))
int main(void) {
  int i;
  RelFileNode t = {1,2,3};
  Buffer *bufs = (Buffer *) malloc(sizeof(Buffer) * NBUFFERS);
  struct timeval st, ed;
  int matches = 0, unmatches = 0;
  Buffer *b;

  for (i = 0 ; i < NBUFFERS ; i++) {
    bufs[i].rnode.spc = random() * 100;
    bufs[i].rnode.db = random() * 100;
    bufs[i].rnode.rel = random() * 10000;
  }

  /* start measuring */
  gettimeofday(&st, NULL);

  b = bufs;
  for (i = 0 ; i < NBUFFERS ; i++) {
    if (b->rnode.spc == t.spc && b->rnode.db == t.db && b->rnode.rel == t.rel)
      matches++;
    else
      unmatches++;

    b++;
  }
  gettimeofday(&ed, NULL);

  printf("%lf ms for %d loops, matches %d, unmatches %d\n",
         (double)((ed.tv_sec - st.tv_sec) * 1000.0 +
                  (ed.tv_usec - st.tv_usec) / 1000.0),
         i, matches, unmatches);

  return 0;
}
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/time.h>
#include <fcntl.h>

//#define FILE_SIZE (16 * 1024 * 1024)
//#define LOOPS 100

#define FILE_SIZE (64 * 1024)
#define LOOPS 1000

//#define FILE_SIZE (8 * 1024)
//#define LOOPS 1000

//#define FILE_SIZE (1 * 1024)
//#define LOOPS 1000

//#define FILE_SIZE (512)
//#define LOOPS 1000

//#define FILE_SIZE (128)
//#define LOOPS 1000

char buf[FILE_SIZE];
char fname[256];

int main(void) {
  int i, j;
  int fd = -1;
  struct timeval st, ed;
  double accum = 0.0;
  int bufperfile = (int)((16.0 * 1024 * 1024) / FILE_SIZE);

  for (i = 0 ; i < LOOPS ; i++) {
    snprintf(fname, 256, "test%03d.file", i);
    unlink(fname); // ignore errors
  }

  for (i = 0 ; i < LOOPS ; i++) {
    for (j = 0 ; j < FILE_SIZE ; j++)
      buf[j] = random()* 256;

    if (i % bufperfile == 0) {
      if (fd >= 0)
        close(fd);

      snprintf(fname, 256, "test%03d.file", i / bufperfile);
      fd = open(fname, O_CREAT | O_RDWR, 0644);
      if (fd < 0) {
        fprintf(stderr, "open error: %m\n");
        exit(1);
      }
      memset(buf, 0, sizeof(buf));
      if (write(fd, buf, sizeof(buf)) < 0) {
        fprintf(stderr, "init write error: %m\n");
        exit(1);
      }
      if (fsync(fd) < 0) {
        fprintf(stderr, "init fsync error: %m\n");
        exit(1);
      }
      if (lseek(fd, 0, SEEK_SET) < 0) {
        fprintf(stderr, "init lseek error: %m\n");
        exit(1);
      }
      
    }

    gettimeofday(&st, NULL);
    if (write(fd, buf, FILE_SIZE) < 0) {
      fprintf(stderr, "write error: %m\n");
      exit(1);
    }
    if (fdatasync(fd) < 0) {
      fprintf(stderr, "fdatasync error: %m\n");
      exit(1);
    }
    gettimeofday(&ed, NULL);

    accum += (double)((ed.tv_sec - st.tv_sec) * 1000.0 +
                      (ed.tv_usec - st.tv_usec) / 1000.0);
  }

  printf("%.2lf ms for %d %dkB-records (%d MB), %.2lf ms per %dkB)\n",
         accum, i, FILE_SIZE / 1024, i * FILE_SIZE, accum / i, FILE_SIZE / 1024);

  return 0;
}

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <fcntl.h>

//#define FILE_SIZE (16 * 1024 * 1024)
//#define LOOPS 100

//#define FILE_SIZE (8 * 1024)
//#define LOOPS 1000

//#define FILE_SIZE (1 * 1024)
//#define LOOPS 1000

//#define FILE_SIZE (512)
//#define LOOPS 1000

#define FILE_SIZE (128)
#define LOOPS 1000

char buf[FILE_SIZE];

int main(void) {
  int i;
  int fd = -1;
  double accum = 0.0;
  struct timeval st, ed;

  for (i = 0 ; i < LOOPS ; i++) {
    char fname[256];
    snprintf(fname, 256, "test%03d.file", i);
    unlink(fname); // ignore errors
  }

  for (i = 0 ; i < LOOPS ; i++) {
    char fname[256];
    int j;

    snprintf(fname, 256, "test%03d.file", i);

    for (j = 0 ; j < FILE_SIZE ; j++)
      buf[j] = random()* 256;

    if (fd >= 0)
      close(fd);

    gettimeofday(&st, NULL);
    fd = open(fname, O_CREAT | O_RDWR, 0644);
    if (fd < 0) {
      fprintf(stderr, "open error: %m\n");
      exit(1);
    }

    if (write(fd, buf, FILE_SIZE) < 0) {
      fprintf(stderr, "write error: %m\n");
      exit(1);
    }
    if (fdatasync(fd) < 0) {
      fprintf(stderr, "fdatasync error: %m\n");
      exit(1);
    }

    if (lseek(fd, 0, SEEK_SET) < 0) {
      fprintf(stderr, "lseek error: %m\n");
      exit(1);
    }
    gettimeofday(&ed, NULL);

    accum += (double)((ed.tv_sec - st.tv_sec) * 1000.0 +
                      (ed.tv_usec - st.tv_usec) / 1000.0);
  }

  printf("%.2lf ms for %d %dkB-files (%d MB), %.2lf ms per %dkB)\n",
         accum, i, FILE_SIZE / 1024, i * FILE_SIZE, accum / i, FILE_SIZE / 1024);

  return 0;
}

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

28 августа 2019 г., 09:42:10

Hello, Noah.

At Tue, 27 Aug 2019 15:49:32 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
<20190827.154932.250364935.horikyota.ntt@gmail.com>
> I'm not sure whether the knob shows apparent performance gain and
> whether we can offer the criteria to identify the proper
> value. But I'll add this feature with a GUC
> effective_io_block_size defaults to 64kB as the threshold in the
> next version. (The name and default value are arguable, of course.)

This is a new version of the patch based on the discussion.

The differences from v19 are the follows.

- Removed the new stuff in two-phase.c.

  The action on PREPARE TRANSACTION is now taken in
  PrepareTransaction(). Instead of storing pending syncs in
  two-phase files, the function immediately syncs all files that
  can survive the transaction end. (twophase.c, xact.c)

- Separate pendingSyncs from pendingDeletes.

  pendingSyncs gets handled differently from pendingDeletes so it
  is separated.

- Let smgrDoPendingSyncs() to avoid performing fsync on
  to-be-deleted files.

  In previous versions the function syncs all recorded files even
  if it is being deleted.  Since we use WAL-logging as the
  alternative of fsync now, performance gets more significance
g  than before.  Thus this version avoids uesless fsyncs.

- Use log_newpage instead of fsync for small tables.

  As in the discussion up-thread, I think I understand how
  WAL-logging works better than fsync.  smgrDoPendingSync issues
  log_newpage for all blocks in the table smaller than the GUC
  variable "effective_io_block_size".  I found
  log_newpage_range() that does exact what is needed here but it
  requires Relation that is no available there.  I removed an
  assertion in CreateFakeRelcacheEntry so that it works while
  non-recovery mode.

- Rebased and fixed some bugs.

I'm trying to measure performance difference on WAL/fsync.


By the way, smgrDoPendingDelete is called from CommitTransaction
and AbortTransaction directlry, and from AbortSubTransaction via
AtSubAbort_smgr(), which calls only smgrDoPendingDeletes() and is
called only from AbortSubTransaction. I think these should be
unified either way.  Any opinions?

CommitTransaction()
  + msgrDoPendingDelete()

AbortTransaction()
  + msgrDoPendingDelete()

AbortSubTransactoin()
  AtSubAbort_smgr()
   + msgrDoPendingDelete()

# Looking around, the prefixes AtEOact/PreCommit/AtAbort don't
# seem to be used keeping a principle.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 83deb772808cdd3afdb44a7630656cc827adfe33 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/4] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 312 ++++++++++++++++++++++++++++++++
 1 file changed, 312 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..b041121745
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,312 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 26;
+
+sub check_orphan_relfilenodes
+{
+    my($node, $test_name) = @_;
+
+    my $db_oid = $node->safe_psql('postgres',
+       "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+    my $prefix = "base/$db_oid/";
+    my $filepaths_referenced = $node->safe_psql('postgres', "
+       SELECT pg_relation_filepath(oid) FROM pg_class
+       WHERE reltablespace = 0 and relpersistence <> 't' and
+       pg_relation_filepath(oid) IS NOT NULL;");
+    is_deeply([sort(map { "$prefix$_" }
+                    grep(/^[0-9]+$/,
+                         slurp_dir($node->data_dir . "/$prefix")))],
+              [sort split /\n/, $filepaths_referenced],
+              $test_name);
+    return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    # Primary needs to have wal_level = minimal here
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+max_prepared_transactions = 1
+));
+    $node->start;
+
+    # Setup
+    my $tablespace_dir = $node->basedir . '/tablespace_other';
+    mkdir ($tablespace_dir);
+    $tablespace_dir = TestLib::perl2host($tablespace_dir);
+    $node->safe_psql('postgres',
+       "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+    # Test direct truncation optimization.  No tuples
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test1 (id serial PRIMARY KEY);
+        TRUNCATE test1;
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+    is($result, qq(0),
+       "wal_level = $wal_level, optimized truncation with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2 (id serial PRIMARY KEY);
+        INSERT INTO test2 VALUES (DEFAULT);
+        TRUNCATE test2;
+        INSERT INTO test2 VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with inserted table");
+
+
+    # Same for prepared transaction
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2a (id serial PRIMARY KEY);
+        INSERT INTO test2a VALUES (DEFAULT);
+        TRUNCATE test2a;
+        INSERT INTO test2a VALUES (DEFAULT);
+        PREPARE TRANSACTION 't';
+        COMMIT PREPARED 't';");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2a;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with prepared transaction");
+
+
+    # Data file for COPY query in follow-up tests.
+    my $basedir = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after the
+    # truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3;
+        COPY test3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+    is($result, qq(3),
+       "wal_level = $wal_level, optimized truncation with copied table");
+
+    # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+        COPY test3a FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # in different subtransaction patterns
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a2;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; RELEASE s;
+        COPY test3a2 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a3;
+        SAVEPOINT s;
+            ALTER TABLE test3a3 SET TABLESPACE other;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            ROLLBACK TO s2;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            RELEASE s2;
+        ROLLBACK TO s;
+        COPY test3a3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        COPY test3b FROM '$copy_file' DELIMITER ',';  -- set sync_above
+        UPDATE test3b SET id2 = id2 + 1;
+        DELETE FROM test3b;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+    is($result, qq(0),
+       "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+    # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test4;
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COPY test4 FROM '$copy_file' DELIMITER ',';
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+    is($result, qq(5),
+       "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test5 VALUES (DEFAULT, 1);
+        COPY test5 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER test6_before_row_insert
+          BEFORE INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+        CREATE TRIGGER test6_after_row_insert
+          AFTER INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+        COPY test6 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+    is($result, qq(9),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER test7_before_stat_truncate
+          BEFORE TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+        CREATE TRIGGER test7_after_stat_truncate
+          AFTER TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+        INSERT INTO test7 VALUES (DEFAULT, 1);
+        TRUNCATE test7;
+        COPY test7 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test redo of temp table creation.
+    $node->safe_psql('postgres', "
+        CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+    $node->stop('immediate');
+    $node->start;
+
+    check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.16.3

From e0650491226a689120d19060ad5da0917f7d3bd6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 21 Aug 2019 13:57:00 +0900
Subject: [PATCH 2/4] Fix WAL skipping feature

WAL-skipping operations mixed with WAL-logged operations can lead to
database corruption after a crash. This patch changes the WAL-skipping
feature so that no data modification is WAL-logged at all then sync
such relations at commit.
---
 src/backend/access/heap/heapam.c         |   4 +-
 src/backend/access/heap/heapam_handler.c |  22 +--
 src/backend/access/heap/rewriteheap.c    |  13 +-
 src/backend/access/transam/xact.c        |  17 ++
 src/backend/access/transam/xlogutils.c   |  11 +-
 src/backend/catalog/storage.c            | 295 +++++++++++++++++++++++++++----
 src/backend/commands/cluster.c           |  24 +++
 src/backend/commands/copy.c              |  39 +---
 src/backend/commands/createas.c          |   5 +-
 src/backend/commands/matview.c           |   4 -
 src/backend/commands/tablecmds.c         |  10 +-
 src/backend/storage/buffer/bufmgr.c      |  41 +++--
 src/backend/storage/smgr/md.c            |  30 ++++
 src/backend/utils/cache/relcache.c       |  28 ++-
 src/backend/utils/misc/guc.c             |  13 ++
 src/include/access/heapam.h              |   1 -
 src/include/access/rewriteheap.h         |   2 +-
 src/include/access/tableam.h             |  40 +----
 src/include/catalog/storage.h            |  12 ++
 src/include/storage/bufmgr.h             |   1 +
 src/include/storage/md.h                 |   1 +
 src/include/utils/rel.h                  |  17 +-
 22 files changed, 455 insertions(+), 175 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index cb811d345a..ef18b61c55 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1936,7 +1936,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (RelationNeedsWAL(relation))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2119,7 +2119,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+    needwal = RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index f1ff01e8cb..27f414a361 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -558,18 +558,6 @@ tuple_lock_retry:
     return result;
 }
 
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
-    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway / don't go through tableam)
-     */
-    if (options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(relation);
-}
-
-
 /* ------------------------------------------------------------------------
  * DDL related callbacks for heap AM.
  * ------------------------------------------------------------------------
@@ -701,7 +689,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     IndexScanDesc indexScan;
     TableScanDesc tableScan;
     HeapScanDesc heapScan;
-    bool        use_wal;
     bool        is_system_catalog;
     Tuplesortstate *tuplesort;
     TupleDesc    oldTupDesc = RelationGetDescr(OldHeap);
@@ -716,12 +703,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     is_system_catalog = IsSystemRelation(OldHeap);
 
     /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a WAL-logged rel.
+     * smgr_targblock must be initially invalid if we are to skip WAL logging
      */
-    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
-    /* use_wal off requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
 
     /* Preallocate values/isnull arrays */
@@ -731,7 +714,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
     /* Initialize the rewrite operation */
     rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
-                                 *multi_cutoff, use_wal);
+                                 *multi_cutoff);
 
 
     /* Set up sorting if wanted */
@@ -2519,7 +2502,6 @@ static const TableAmRoutine heapam_methods = {
     .tuple_delete = heapam_tuple_delete,
     .tuple_update = heapam_tuple_update,
     .tuple_lock = heapam_tuple_lock,
-    .finish_bulk_insert = heapam_finish_bulk_insert,
 
     .tuple_fetch_row_version = heapam_fetch_row_version,
     .tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index a17508a82f..9e0d7295af 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -144,7 +144,6 @@ typedef struct RewriteStateData
     Page        rs_buffer;        /* page currently being built */
     BlockNumber rs_blockno;        /* block where page will go */
     bool        rs_buffer_valid;    /* T if any tuples in buffer */
-    bool        rs_use_wal;        /* must we WAL-log inserts? */
     bool        rs_logical_rewrite; /* do we need to do logical rewriting */
     TransactionId rs_oldest_xmin;    /* oldest xmin used by caller to determine
                                      * tuple visibility */
@@ -238,15 +237,13 @@ static void logical_end_heap_rewrite(RewriteState state);
  * oldest_xmin    xid used by the caller to determine which tuples are dead
  * freeze_xid    xid before which tuples will be frozen
  * cutoff_multi    multixact before which multis will be removed
- * use_wal        should the inserts to the new heap be WAL-logged?
  *
  * Returns an opaque RewriteState, allocated in current memory context,
  * to be used in subsequent calls to the other functions.
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-                   TransactionId freeze_xid, MultiXactId cutoff_multi,
-                   bool use_wal)
+                   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
     RewriteState state;
     MemoryContext rw_cxt;
@@ -271,7 +268,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     /* new_heap needn't be empty, just locked */
     state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
     state->rs_buffer_valid = false;
-    state->rs_use_wal = use_wal;
     state->rs_oldest_xmin = oldest_xmin;
     state->rs_freeze_xid = freeze_xid;
     state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +326,7 @@ end_heap_rewrite(RewriteState state)
     /* Write the last page, if any */
     if (state->rs_buffer_valid)
     {
-        if (state->rs_use_wal)
+        if (RelationNeedsWAL(state->rs_new_rel))
             log_newpage(&state->rs_new_rel->rd_node,
                         MAIN_FORKNUM,
                         state->rs_blockno,
@@ -654,9 +650,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int            options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
@@ -695,7 +688,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
             /* Doesn't fit, so write out the existing page */
 
             /* XLOG stuff */
-            if (state->rs_use_wal)
+            if (RelationNeedsWAL(state->rs_new_rel))
                 log_newpage(&state->rs_new_rel->rd_node,
                             MAIN_FORKNUM,
                             state->rs_blockno,
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index f594d33e7a..1c4b264947 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2107,6 +2107,13 @@ CommitTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Synchronize files that are created and not WAL-logged during this
+     * transaction. This must happen before emitting commit record so that we
+     * don't see committed-but-broken files after a crash.
+     */
+    smgrDoPendingSyncs(true, false);
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
@@ -2339,6 +2346,14 @@ PrepareTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Sync all WAL-skipped files now. Some of them may be deleted at
+     * transaction end but we don't bother store that information in PREPARE
+     * record or two-phase files. Like commit, we should sync WAL-skipped
+     * files before emitting PREPARE record. See CommitTransaction().
+     */
+    smgrDoPendingSyncs(true, true);
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
@@ -2657,6 +2672,7 @@ AbortTransaction(void)
      */
     AfterTriggerEndXact(false); /* 'false' means it's abort */
     AtAbort_Portals();
+    smgrDoPendingSyncs(false, false);
     AtEOXact_LargeObject(false);
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
@@ -4941,6 +4957,7 @@ AbortSubTransaction(void)
                            s->parent->curTransactionOwner);
         AtEOSubXact_LargeObject(false, s->subTransactionId,
                                 s->parent->subTransactionId);
+        smgrDoPendingSyncs(false, false);
         AtSubAbort_Notify();
 
         /* Advertise the fact that we aborted in pg_xact. */
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 1fc39333f1..ff7dba429a 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -544,6 +544,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry;
  * fields related to physical storage, like rd_rel, are initialized, so the
  * fake entry is only usable in low-level operations like ReadBuffer().
  *
+ * This is also used for syncing WAL-skipped files.
+ *
  * Caller must free the returned entry with FreeFakeRelcacheEntry().
  */
 Relation
@@ -552,18 +554,19 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
     FakeRelCacheEntry fakeentry;
     Relation    rel;
 
-    Assert(InRecovery);
-
     /* Allocate the Relation struct and all related space in one block. */
     fakeentry = palloc0(sizeof(FakeRelCacheEntryData));
     rel = (Relation) fakeentry;
 
     rel->rd_rel = &fakeentry->pgc;
     rel->rd_node = rnode;
-    /* We will never be working with temp rels during recovery */
+    /*
+     * We will never be working with temp rels during recovery or syncing
+     * WAL-skpped files.
+     */
     rel->rd_backend = InvalidBackendId;
 
-    /* It must be a permanent table if we're in recovery. */
+    /* It must be a permanent table here */
     rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT;
 
     /* We don't know the name of the relation; use relfilenode instead */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 3cc886f7fe..43926ecaba 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -30,9 +30,13 @@
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+/* GUC variables */
+int    effective_io_block_size = 64; /* threshold of WAL-skipping in kilobytes */
+
 /*
  * We keep a list of all relations (represented as RelFileNode values)
  * that have been created or deleted in the current transaction.  When
@@ -53,16 +57,17 @@
  * but I'm being paranoid.
  */
 
-typedef struct PendingRelDelete
+typedef struct PendingRelOp
 {
     RelFileNode relnode;        /* relation that may need to be deleted */
     BackendId    backend;        /* InvalidBackendId if not a temp rel */
-    bool        atCommit;        /* T=delete at commit; F=delete at abort */
+    bool        atCommit;        /* T=work at commit; F=work at abort */
     int            nestLevel;        /* xact nesting level of request */
-    struct PendingRelDelete *next;    /* linked-list link */
-} PendingRelDelete;
+    struct PendingRelOp *next;    /* linked-list link */
+} PendingRelOp;
 
-static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOp *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOp *pendingSyncs = NULL; /* head of linked list */
 
 /*
  * RelationCreateStorage
@@ -78,7 +83,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 SMgrRelation
 RelationCreateStorage(RelFileNode rnode, char relpersistence)
 {
-    PendingRelDelete *pending;
+    PendingRelOp *pending;
     SMgrRelation srel;
     BackendId    backend;
     bool        needs_wal;
@@ -109,8 +114,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
         log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
 
     /* Add the relation to the list of stuff to delete at abort */
-    pending = (PendingRelDelete *)
-        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+    pending = (PendingRelOp *)
+        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
     pending->relnode = rnode;
     pending->backend = backend;
     pending->atCommit = false;    /* delete if abort */
@@ -118,6 +123,25 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
+    /*
+     * When wal_level = minimal, we are going to skip WAL-logging for storage
+     * of persistent relations created in the current transaction. The
+     * relation needs to be synced at commit.
+     */
+    if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+    {
+        int nestLevel = GetCurrentTransactionNestLevel();
+
+        pending = (PendingRelOp *)
+            MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
+        pending->relnode = rnode;
+        pending->backend = backend;
+        pending->atCommit = true;
+        pending->nestLevel = nestLevel;
+        pending->next = pendingSyncs;
+        pendingSyncs = pending;
+    }
+
     return srel;
 }
 
@@ -147,11 +171,11 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
 void
 RelationDropStorage(Relation rel)
 {
-    PendingRelDelete *pending;
+    PendingRelOp *pending;
 
     /* Add the relation to the list of stuff to delete at commit */
-    pending = (PendingRelDelete *)
-        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+    pending = (PendingRelOp *)
+        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
     pending->relnode = rel->rd_node;
     pending->backend = rel->rd_backend;
     pending->atCommit = true;    /* delete if commit */
@@ -192,9 +216,9 @@ RelationDropStorage(Relation rel)
 void
 RelationPreserveStorage(RelFileNode rnode, bool atCommit)
 {
-    PendingRelDelete *pending;
-    PendingRelDelete *prev;
-    PendingRelDelete *next;
+    PendingRelOp *pending;
+    PendingRelOp *prev;
+    PendingRelOp *next;
 
     prev = NULL;
     for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -399,9 +423,9 @@ void
 smgrDoPendingDeletes(bool isCommit)
 {
     int            nestLevel = GetCurrentTransactionNestLevel();
-    PendingRelDelete *pending;
-    PendingRelDelete *prev;
-    PendingRelDelete *next;
+    PendingRelOp *pending;
+    PendingRelOp *prev;
+    PendingRelOp *next;
     int            nrels = 0,
                 i = 0,
                 maxrels = 0;
@@ -462,11 +486,195 @@ smgrDoPendingDeletes(bool isCommit)
 }
 
 /*
- * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
+ *    smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
  *
- * The return value is the number of relations scheduled for termination.
- * *ptr is set to point to a freshly-palloc'd array of RelFileNodes.
- * If there are no relations to be deleted, *ptr is set to NULL.
+ * This should be called before smgrDoPendingDeletes() at every subtransaction
+ * end. Also this should be called before emitting WAL record so that sync
+ * failure prevents commit.
+ *
+ * If sync_all is true, syncs all files including that are scheduled to be
+ * deleted.
+ */
+void
+smgrDoPendingSyncs(bool isCommit, bool sync_all)
+{
+    int            nestLevel = GetCurrentTransactionNestLevel();
+    PendingRelOp *pending;
+    PendingRelOp *prev;
+    PendingRelOp *next;
+    SMgrRelation srel = NULL;
+    ForkNumber fork;
+    BlockNumber nblocks[MAX_FORKNUM + 1];
+    BlockNumber total_blocks = 0;
+    HTAB    *delhash = NULL;
+
+    /* Return if nothing to be synced in this nestlevel */
+    if (!pendingSyncs || pendingSyncs->nestLevel < nestLevel)
+        return;
+
+    Assert (pendingSyncs->nestLevel <= nestLevel);
+    Assert (pendingSyncs->backend == InvalidBackendId);
+
+    /*
+     * If sync_all is false, pending syncs on the relation that are to be
+     * deleted in this transaction-end should be ignored. Collect pending
+     * deletes that will happen in the following call to
+     * smgrDoPendingDeletes().
+     */
+    if (!sync_all)
+    {
+        for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+        {
+            bool found PG_USED_FOR_ASSERTS_ONLY;
+
+            if (pending->nestLevel < pendingSyncs->nestLevel ||
+                pending->atCommit != isCommit)
+                continue;
+
+            /* create the hash if not yet */
+            if (delhash == NULL)
+            {
+                HASHCTL hash_ctl;
+
+                memset(&hash_ctl, 0, sizeof(hash_ctl));
+                hash_ctl.keysize = sizeof(RelFileNode);
+                hash_ctl.entrysize = sizeof(RelFileNode);
+                hash_ctl.hcxt = CurrentMemoryContext;
+                delhash =
+                    hash_create("pending del temporary hash", 8, &hash_ctl,
+                                HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+            }
+
+            (void) hash_search(delhash, (void *) &(pending->relnode),
+                               HASH_ENTER, &found);
+            Assert(!found);
+        }
+    }
+
+    /* Loop over pendingSyncs */
+    prev = NULL;
+    for (pending = pendingSyncs; pending != NULL; pending = next)
+    {
+        bool to_be_removed = (!isCommit); /* don't sync if aborted */
+
+        next = pending->next;
+
+        /* outer-level entries should not be processed yet */
+        if (pending->nestLevel < nestLevel)
+        {
+            prev = pending;
+            continue;
+        }
+
+        /* don't sync relnodes that is being deleted */
+        if (delhash && !to_be_removed)
+            hash_search(delhash, (void *) &pending->relnode,
+                        HASH_FIND, &to_be_removed);
+
+        /* remove the entry if no longer useful */
+        if (to_be_removed)
+        {
+            if (prev)
+                prev->next = next;
+            else
+                pendingSyncs = next;
+            pfree(pending);
+            continue;
+        }
+
+        /* actual sync happens at the end of top transaction */
+        if (nestLevel > 1)
+        {
+            prev = pending;
+            continue;
+        }
+
+        /* Now the time to sync the rnode */
+        srel = smgropen(pendingSyncs->relnode, pendingSyncs->backend);
+
+        /*
+         * We emit newpage WAL records for smaller size of relations.
+         *
+         * Small WAL records have a chance to be emitted at once along with
+         * other backends' WAL records. We emit WAL records instead of syncing
+         * for files that are smaller than a certain threshold expecting
+         * faster commit. The threshold is defined by the GUC
+         * effective_io_block_size.
+         */
+        for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+        {
+            /* FSM doesn't need WAL nor sync */
+            if (fork != FSM_FORKNUM && smgrexists(srel, fork))
+            {
+                BlockNumber n = smgrnblocks(srel, fork);
+
+                /* we shouldn't come here for unlogged relations */
+                Assert(fork != INIT_FORKNUM);
+
+                nblocks[fork] = n;
+                total_blocks += n;
+            }
+            else
+                nblocks[fork] = InvalidBlockNumber;
+        }
+
+        /*
+         * Sync file or emit WAL record for the file according to the total
+         * size.
+         */
+        if (total_blocks * BLCKSZ >= effective_io_block_size * 1024)
+        {
+            /* Flush all buffers then sync the file */
+            FlushRelationBuffersWithoutRelcache(srel->smgr_rnode.node, false);
+
+            for (fork = 0; fork <= MAX_FORKNUM; fork++)
+            {
+                if (smgrexists(srel, fork))
+                    smgrimmedsync(srel, fork);
+            }
+        }
+        else
+        {
+            /*
+             * Emit WAL records for all blocks. Some of the blocks might have
+             * been synced or evicted, but We don't bother checking that. The
+             * file is small enough.
+             */
+            for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+            {
+                bool   page_std = (fork == MAIN_FORKNUM);
+                int    n        = nblocks[fork];
+                Relation rel;
+
+                if (!BlockNumberIsValid(n))
+                    continue;
+
+                /* Emit WAL for the whole file */
+                rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node);
+                log_newpage_range(rel, fork, 0, n, page_std);
+                FreeFakeRelcacheEntry(rel);
+            }
+        }
+
+        /* done remove from list */
+        if (prev)
+            prev->next = next;
+        else
+            pendingSyncs = next;
+        pfree(pending);
+    }
+
+    if (delhash)
+        hash_destroy(delhash);
+}
+
+/*
+ * smgrGetPendingOperations() -- Get a list of non-temp relations to be
+ *                                 deleted or synced.
+ *
+ * The return value is the number of relations scheduled in the given
+ * list. *ptr is set to point to a freshly-palloc'd array of RelFileNodes.  If
+ * there are no matching relations, *ptr is set to NULL.
  *
  * Only non-temporary relations are included in the returned list.  This is OK
  * because the list is used only in contexts where temporary relations don't
@@ -475,19 +683,19 @@ smgrDoPendingDeletes(bool isCommit)
  * (and all temporary files will be zapped if we restart anyway, so no need
  * for redo to do it also).
  *
- * Note that the list does not include anything scheduled for termination
- * by upper-level transactions.
+ * Note that the list does not include anything scheduled by upper-level
+ * transactions.
  */
-int
-smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
+static inline int
+smgrGetPendingOperations(PendingRelOp *list, bool forCommit, RelFileNode **ptr)
 {
     int            nestLevel = GetCurrentTransactionNestLevel();
     int            nrels;
     RelFileNode *rptr;
-    PendingRelDelete *pending;
+    PendingRelOp *pending;
 
     nrels = 0;
-    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    for (pending = list; pending != NULL; pending = pending->next)
     {
         if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
             && pending->backend == InvalidBackendId)
@@ -500,7 +708,7 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     }
     rptr = (RelFileNode *) palloc(nrels * sizeof(RelFileNode));
     *ptr = rptr;
-    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    for (pending = list; pending != NULL; pending = pending->next)
     {
         if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
             && pending->backend == InvalidBackendId)
@@ -512,6 +720,20 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     return nrels;
 }
 
+/* Returns list of pending deletes, see smgrGetPendingOperations for details */
+int
+smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
+{
+    return smgrGetPendingOperations(pendingDeletes, forCommit, ptr);
+}
+
+/* Returns list of pending syncs, see smgrGetPendingOperations for details */
+int
+smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr)
+{
+    return smgrGetPendingOperations(pendingSyncs, forCommit, ptr);
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *
@@ -522,8 +744,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
 void
 PostPrepare_smgr(void)
 {
-    PendingRelDelete *pending;
-    PendingRelDelete *next;
+    PendingRelOp *pending;
+    PendingRelOp *next;
 
     for (pending = pendingDeletes; pending != NULL; pending = next)
     {
@@ -532,25 +754,34 @@ PostPrepare_smgr(void)
         /* must explicitly free the list entry */
         pfree(pending);
     }
+
+    /* We shouldn't have an entry in pendingSyncs */
+    Assert(pendingSyncs == NULL);
 }
 
 
 /*
  * AtSubCommit_smgr() --- Take care of subtransaction commit.
  *
- * Reassign all items in the pending-deletes list to the parent transaction.
+ * Reassign all items in the pending-operations list to the parent transaction.
  */
 void
 AtSubCommit_smgr(void)
 {
     int            nestLevel = GetCurrentTransactionNestLevel();
-    PendingRelDelete *pending;
+    PendingRelOp *pending;
 
     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
     {
         if (pending->nestLevel >= nestLevel)
             pending->nestLevel = nestLevel - 1;
     }
+
+    for (pending = pendingSyncs; pending != NULL; pending = pending->next)
+    {
+        if (pending->nestLevel >= nestLevel)
+            pending->nestLevel = nestLevel - 1;
+    }
 }
 
 /*
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 28985a07ec..f665ee8358 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1034,12 +1034,36 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
 
     if (OidIsValid(relfilenode1) && OidIsValid(relfilenode2))
     {
+        Relation rel1;
+        Relation rel2;
+
         /*
          * Normal non-mapped relations: swap relfilenodes, reltablespaces,
          * relpersistence
          */
         Assert(!target_is_pg_class);
 
+        /* Update creation subid hints of relcache */
+        rel1 = relation_open(r1, ExclusiveLock);
+        rel2 = relation_open(r2, ExclusiveLock);
+
+        /*
+         * New relation's relfilenode is created in the current transaction
+         * and used as old ralation's new relfilenode. So its
+         * newRelfilenodeSubid as new relation's createSubid. We don't fix
+         * rel2 since it would be deleted soon.
+         */
+        Assert(rel2->rd_createSubid != InvalidSubTransactionId);
+        rel1->rd_newRelfilenodeSubid = rel2->rd_createSubid;
+
+        /* record the first relfilenode change in the current transaction */
+        if (rel1->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+            rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
+        relation_close(rel1, ExclusiveLock);
+        relation_close(rel2, ExclusiveLock);
+
+        /* swap relfilenodes, reltablespaces, relpersistence */
         swaptemp = relform1->relfilenode;
         relform1->relfilenode = relform2->relfilenode;
         relform2->relfilenode = swaptemp;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3aeef30b28..3ce04f7efc 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2534,9 +2534,6 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
     for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++)
         ExecDropSingleTupleTableSlot(buffer->slots[i]);
 
-    table_finish_bulk_insert(buffer->resultRelInfo->ri_RelationDesc,
-                             miinfo->ti_options);
-
     pfree(buffer);
 }
 
@@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
      * If it does commit, we'll have done the table_finish_bulk_insert() at
      * the bottom of this routine first.
      *
-     * As mentioned in comments in utils/rel.h, the in-same-transaction test
-     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
-     * can be cleared before the end of the transaction. The exact case is
-     * when a relation sets a new relfilenode twice in same transaction, yet
-     * the second one fails in an aborted subtransaction, e.g.
-     *
-     * BEGIN;
-     * TRUNCATE t;
-     * SAVEPOINT save;
-     * TRUNCATE t;
-     * ROLLBACK TO save;
-     * COPY ...
-     *
-     * Also, if the target file is new-in-transaction, we assume that checking
-     * FSM for free space is a waste of time, even if we must use WAL because
-     * of archiving.  This could possibly be wrong, but it's unlikely.
-     *
-     * The comments for table_tuple_insert and RelationGetBufferForTuple
-     * specify that skipping WAL logging is only safe if we ensure that our
-     * tuples do not go into pages containing tuples from any other
-     * transactions --- but this must be the case if we have a new table or
-     * new relfilenode, so we need no additional work to enforce that.
+     * If the target file is new-in-transaction, we assume that checking FSM
+     * for free space is a waste of time, even if we must use WAL because of
+     * archiving.  This could possibly be wrong, but it's unlikely.
      *
      * We currently don't support this optimization if the COPY target is a
      * partitioned table as we currently only lazily initialize partition
@@ -2762,15 +2740,14 @@ CopyFrom(CopyState cstate)
      * are not supported as per the description above.
      *----------
      */
-    /* createSubid is creation check, newRelfilenodeSubid is truncation check */
+    /*
+     * createSubid is creation check, firstRelfilenodeSubid is truncation and
+     * cluster check. Partitioned table doesn't have storage.
+     */
     if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
         (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
-         cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
-    {
+         cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
         ti_options |= TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
-    }
 
     /*
      * Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index b7d220699f..8a91d946e3 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+    myState->ti_options = TABLE_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    table_finish_bulk_insert(myState->rel, myState->ti_options);
-
     /* close rel, but keep lock until commit */
     table_close(myState->rel, NoLock);
     myState->rel = NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 537d0e8cef..1c854dcebf 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * replication is in use. We can skip the FSM in any case.
      */
     myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
-    if (!XLogIsNeeded())
-        myState->ti_options |= TABLE_INSERT_SKIP_WAL;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    table_finish_bulk_insert(myState->transientrel, myState->ti_options);
-
     /* close transientrel, but keep lock until commit */
     table_close(myState->transientrel, NoLock);
     myState->transientrel = NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index cceefbdd49..2468b178cb 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4762,9 +4762,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 
     /*
      * Prepare a BulkInsertState and options for table_tuple_insert. Because
-     * we're building a new heap, we can skip WAL-logging and fsync it to disk
-     * at the end instead (unless WAL-logging is required for archiving or
-     * streaming replication). The FSM is empty too, so don't bother using it.
+     * we're building a new heap, the underlying table AM can skip WAL-logging
+     * and smgr will sync the relation to disk at the end of the current
+     * transaction instead. The FSM is empty too, so don't bother using it.
      */
     if (newrel)
     {
@@ -4772,8 +4772,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         ti_options = TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
     }
     else
     {
@@ -5058,8 +5056,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
     {
         FreeBulkInsertState(bistate);
 
-        table_finish_bulk_insert(newrel, ti_options);
-
         table_close(newrel, NoLock);
     }
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6f3a402854..55c122b3a7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -171,6 +171,7 @@ static HTAB *PrivateRefCountHash = NULL;
 static int32 PrivateRefCountOverflowed = 0;
 static uint32 PrivateRefCountClock = 0;
 static PrivateRefCountEntry *ReservedRefCountEntry = NULL;
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 
 static void ReservePrivateRefCountEntry(void);
 static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
@@ -675,10 +676,10 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *        a relcache entry for the relation.
  *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * NB: At present, this function may only be used on permanent relations,
+ * which is OK, because we only use it during XLOG replay and processing
+ * pending syncs.  If in the future we want to use it on temporary or unlogged
+ * relations, we could pass additional parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
@@ -3191,20 +3192,32 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
-    /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3221,7 +3234,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3251,18 +3264,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 07f3c93d3f..514c6098e6 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -994,6 +994,36 @@ ForgetDatabaseSyncRequests(Oid dbid)
     RegisterSyncRequest(&tag, SYNC_FILTER_REQUEST, true /* retryOnError */ );
 }
 
+/*
+ * SyncRelationFiles -- sync files of all given relations
+ *
+ * This function is assumed to be called only when skipping WAL-logging and
+ * emits no xlog records.
+ */
+void
+SyncRelationFiles(RelFileNode *syncrels, int nsyncrels)
+{
+    int            i;
+
+    for (i = 0; i < nsyncrels; i++)
+    {
+        SMgrRelation srel;
+        ForkNumber    fork;
+
+        /* sync all existing forks of the relation */
+        FlushRelationBuffersWithoutRelcache(syncrels[i], false);
+        srel = smgropen(syncrels[i], InvalidBackendId);
+
+        for (fork = 0; fork <= MAX_FORKNUM; fork++)
+        {
+            if (smgrexists(srel, fork))
+                smgrimmedsync(srel, fork);
+        }
+
+        smgrclose(srel);
+    }
+}
+
 /*
  * DropRelationFiles -- drop files of all given relations
  */
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 248860758c..147babb6b5 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1096,6 +1096,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
     relation->rd_isnailed = false;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     switch (relation->rd_rel->relpersistence)
     {
         case RELPERSISTENCE_UNLOGGED:
@@ -1829,6 +1830,7 @@ formrdesc(const char *relationName, Oid relationReltype,
     relation->rd_isnailed = true;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     relation->rd_backend = InvalidBackendId;
     relation->rd_islocaltemp = false;
 
@@ -2094,7 +2096,7 @@ RelationClose(Relation relation)
 #ifdef RELCACHE_FORCE_RELEASE
     if (RelationHasReferenceCountZero(relation) &&
         relation->rd_createSubid == InvalidSubTransactionId &&
-        relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
         RelationClearRelation(relation, false);
 #endif
 }
@@ -2510,8 +2512,8 @@ RelationClearRelation(Relation relation, bool rebuild)
          * problem.
          *
          * When rebuilding an open relcache entry, we must preserve ref count,
-         * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state.  Also
-         * attempt to preserve the pg_class entry (rd_rel), tupledesc,
+         * rd_createSubid/rd_new/firstRelfilenodeSubid, and rd_toastoid state.
+         * Also attempt to preserve the pg_class entry (rd_rel), tupledesc,
          * rewrite-rule, partition key, and partition descriptor substructures
          * in place, because various places assume that these structures won't
          * move while they are working with an open relcache entry.  (Note:
@@ -2600,6 +2602,7 @@ RelationClearRelation(Relation relation, bool rebuild)
         /* creation sub-XIDs must be preserved */
         SWAPFIELD(SubTransactionId, rd_createSubid);
         SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+        SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
         /* un-swap rd_rel pointers, swap contents instead */
         SWAPFIELD(Form_pg_class, rd_rel);
         /* ... but actually, we don't have to update newrel->rd_rel */
@@ -2667,7 +2670,7 @@ static void
 RelationFlushRelation(Relation relation)
 {
     if (relation->rd_createSubid != InvalidSubTransactionId ||
-        relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
     {
         /*
          * New relcache entries are always rebuilt, not flushed; else we'd
@@ -2807,7 +2810,7 @@ RelationCacheInvalidate(void)
          * pending invalidations.
          */
         if (relation->rd_createSubid != InvalidSubTransactionId ||
-            relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+            relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
             continue;
 
         relcacheInvalsReceived++;
@@ -3064,6 +3067,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
      * Likewise, reset the hint about the relfilenode being new.
      */
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 }
 
 /*
@@ -3155,7 +3159,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
     }
 
     /*
-     * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+     * Likewise, update or drop any new-relfilenode-in-subtransaction hints.
      */
     if (relation->rd_newRelfilenodeSubid == mySubid)
     {
@@ -3164,6 +3168,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         else
             relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     }
+
+    if (relation->rd_firstRelfilenodeSubid == mySubid)
+    {
+        if (isCommit)
+            relation->rd_firstRelfilenodeSubid = parentSubid;
+        else
+            relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    }
 }
 
 
@@ -3253,6 +3265,7 @@ RelationBuildLocalRelation(const char *relname,
     /* it's being created in this transaction */
     rel->rd_createSubid = GetCurrentSubTransactionId();
     rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 
     /*
      * create a new tuple descriptor from the one passed in.  We do this
@@ -3556,6 +3569,8 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
      * operations on the rel in the same transaction.
      */
     relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+    if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
 
     /* Flag relation as needing eoxact cleanup (to remove the hint) */
     EOXactListAdd(relation);
@@ -5592,6 +5607,7 @@ load_relcache_init_file(bool shared)
         rel->rd_fkeylist = NIL;
         rel->rd_createSubid = InvalidSubTransactionId;
         rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+        rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
         rel->rd_amcache = NULL;
         MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 90ffd89339..1e4fc256fc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -36,6 +36,7 @@
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
+#include "catalog/storage.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
 #include "commands/user.h"
@@ -2774,6 +2775,18 @@ static struct config_int ConfigureNamesInt[] =
         check_effective_io_concurrency, assign_effective_io_concurrency, NULL
     },
 
+    {
+        {"effective_io_block_size", PGC_USERSET, RESOURCES_DISK,
+            gettext_noop("Size of file that can be fsync'ed in the minimum required duration."),
+            gettext_noop("For rotating magnetic disks, it is around the size of a track or sylinder."),
+            GUC_UNIT_KB
+        },
+        &effective_io_block_size,
+        64,
+        0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
     {
         {"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
             gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6bc9..80c2e1bafc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
                                        TransactionId OldestXmin, TransactionId FreezeXid,
-                                       MultiXactId MultiXactCutoff, bool use_wal);
+                                       MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
                                HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7f81703b78..b652cd6cef 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -407,22 +407,6 @@ typedef struct TableAmRoutine
                                uint8 flags,
                                TM_FailureData *tmfd);
 
-    /*
-     * Perform operations necessary to complete insertions made via
-     * tuple_insert and multi_insert with a BulkInsertState specified. This
-     * may for example be used to flush the relation, when the
-     * TABLE_INSERT_SKIP_WAL option was used.
-     *
-     * Typically callers of tuple_insert and multi_insert will just pass all
-     * the flags that apply to them, and each AM has to decide which of them
-     * make sense for it, and then only take actions in finish_bulk_insert for
-     * those flags, and ignore others.
-     *
-     * Optional callback.
-     */
-    void        (*finish_bulk_insert) (Relation rel, int options);
-
-
     /* ------------------------------------------------------------------------
      * DDL related functionality.
      * ------------------------------------------------------------------------
@@ -1087,10 +1071,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * The options bitmask allows the caller to specify options that may change the
  * behaviour of the AM. The AM will ignore options that it does not support.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.
@@ -1112,8 +1092,7 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * heap's TOAST table, too, if the tuple requires any out-of-line data.
  *
  * The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
  *
  * On return the slot's tts_tid and tts_tableOid are updated to reflect the
  * insertion. But note that any toasting of fields within the slot is NOT
@@ -1248,6 +1227,8 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  * update was done.  However, any TOAST changes in the new tuple's
  * data are not reflected into *newtup.
  *
+ * See table_insert about skipping WAL-logging feature.
+ *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
  * t_xmax, and, if possible, t_cmax.  See comments for struct TM_FailureData
  * for additional info.
@@ -1308,21 +1289,6 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
                                        flags, tmfd);
 }
 
-/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
- */
-static inline void
-table_finish_bulk_insert(Relation rel, int options)
-{
-    /* optional callback */
-    if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
-        rel->rd_tableam->finish_bulk_insert(rel, options);
-}
-
-
 /* ------------------------------------------------------------------------
  * DDL related functionality.
  * ------------------------------------------------------------------------
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3579d3f3eb..1c1cf5d252 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,6 +19,16 @@
 #include "storage/smgr.h"
 #include "utils/relcache.h"
 
+/* enum for operation type of PendingDelete entries */
+typedef enum PendingOpType
+{
+    PENDING_DELETE,
+    PENDING_SYNC
+} PendingOpType;
+
+/* GUC variables */
+extern int    effective_io_block_size; /* threshold for WAL-skipping */
+
 extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
@@ -31,7 +41,9 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingSyncs(bool isCommit, bool sync_all);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern int    smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
 extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 509f4b7ef1..ace5f5a2ae 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                                    ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                                    ForkNumber forkNum, BlockNumber firstDelBlock);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index c0f05e23ff..2bb2947bdb 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -42,6 +42,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 
 extern void ForgetDatabaseSyncRequests(Oid dbid);
+extern void SyncRelationFiles(RelFileNode *syncrels, int nsyncrels);
 extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
 
 /* md sync callbacks */
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index c5d36680a2..f372dc2086 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -75,10 +75,17 @@ typedef struct RelationData
      * transaction, with one of them occurring in a subsequently aborted
      * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
      * ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+     * rd_firstRelfilenodeSubid is the ID of the first subtransaction the
+     * relfilenode change has took place in the current transaction. Unlike
+     * newRelfilenodeSubid, this won't be accidentially forgotten. A valid OID
+     * means that the currently active relfilenode is transaction-local and we
+     * sync the relation at commit instead of WAL-logging.
      */
     SubTransactionId rd_createSubid;    /* rel was created in current xact */
     SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
                                                  * current xact */
+    SubTransactionId rd_firstRelfilenodeSubid;    /* new relfilenode assigned
+                                                 * first in current xact */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
@@ -514,9 +521,15 @@ typedef struct ViewOptions
 /*
  * RelationNeedsWAL
  *        True if relation needs WAL.
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.
  */
-#define RelationNeedsWAL(relation) \
-    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+#define RelationNeedsWAL(relation)                                        \
+    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&    \
+     (XLogIsNeeded() ||                                                    \
+      (relation->rd_createSubid == InvalidSubTransactionId &&            \
+       relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
 
 /*
  * RelationUsesLocalBuffers
-- 
2.16.3

From cce02653f263211b1c777c3aac4d25423035a68d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 28 Aug 2019 14:05:30 +0900
Subject: [PATCH 3/4] Documentation for effective_io_block_size

---
 doc/src/sgml/config.sgml | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 89284dc5c0..2d38d897ca 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1832,6 +1832,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-effective-io-block-size" xreflabel="effective_io_block_size">
+      <term><varname>effective_io_block_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>effective_io_block_size</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the expected maximum size of a file for which <function>fsync</function> returns in the minimum
requiredduration. It is approximately the size of a track or sylinder for magnetic disks.
 
+        The value is specified in kilobytes and the default is <literal>64</literal> kilobytes.
+       </para>
+       <para>
+        When <xref linkend="guc-wal-level"/> is <literal>minimal</literal>,
+        WAL-logging is skipped for tables created in-trasaction.  If a table
+        is smaller than that size at commit, it is WAL-logged instead of
+        issueing <function>fsync</function> on it.
+
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
 
-- 
2.16.3

From b31533b895a3b239339aeb466d6f1abc0a1a4669 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 28 Aug 2019 14:12:18 +0900
Subject: [PATCH 4/4] Additional test for new GUC setting.

This patchset adds new GUC variable effective_io_block_size that
controls wheter WAL-skipped tables are finally WAL-logged or
fcync'ed. All of the TAP test performs WAL-logging so this adds an
item that performs file sync.
---
 src/test/recovery/t/018_wal_optimize.pl | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
index b041121745..95063ab131 100644
--- a/src/test/recovery/t/018_wal_optimize.pl
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -11,7 +11,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 26;
+use Test::More tests => 28;
 
 sub check_orphan_relfilenodes
 {
@@ -102,7 +102,23 @@ max_prepared_transactions = 1
     $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2a;");
     is($result, qq(1),
        "wal_level = $wal_level, optimized truncation with prepared transaction");
+    # Same for file sync mode
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        SET effective_io_block_size to 0;
+        BEGIN;
+        CREATE TABLE test2b (id serial PRIMARY KEY);
+        INSERT INTO test2b VALUES (DEFAULT);
+        TRUNCATE test2b;
+        INSERT INTO test2b VALUES (DEFAULT);
+        COMMIT;");
 
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2b;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with file-sync");
 
     # Data file for COPY query in follow-up tests.
     my $basedir = $node->basedir;
-- 
2.16.3

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Alvaro Herrera

Дата:

03 сентября 2019 г., 00:15:00

I have updated this patch's status to "needs review", since v20 has not
received any comments yet.

Noah, you're listed as committer for this patch.  Are you still on the
hook for getting it done during the v13 timeframe?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

03 сентября 2019 г., 01:28:51

On Mon, Sep 02, 2019 at 05:15:00PM -0400, Alvaro Herrera wrote:
> I have updated this patch's status to "needs review", since v20 has not
> received any comments yet.
> 
> Noah, you're listed as committer for this patch.  Are you still on the
> hook for getting it done during the v13 timeframe?

Yes, assuming "getting it done" = "getting the CF entry to state other than
Needs Review".

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

10 сентября 2019 г., 14:45:17

[Casual readers with opinions on GUC naming: consider skipping to the end.]

MarkBufferDirtyHint() writes WAL even when rd_firstRelfilenodeSubid or
rd_createSubid is set; see attached test case.  It needs to skip WAL whenever
RelationNeedsWAL() returns false.

On Tue, Aug 27, 2019 at 03:49:32PM +0900, Kyotaro Horiguchi wrote:
> At Sun, 25 Aug 2019 22:08:43 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190826050843.GB3153606@rfd.leadboat.com>
> > Consider a one-page relfilenode.  Doing all the things you list for a single
> > page may be cheaper than locking millions of buffer headers.
> 
> If I understand you correctly, I would say that *all* buffers
> that don't belong to in-transaction-created files are skipped
> before taking locks. No lock conflict happens with other
> backends.
> 
> FlushRelationBuffers uses double-checked-locking as follows:

I had misread the code; you're right.

> > This should be GUC-controlled, especially since this is back-patch material.
> 
> Is this size of patch back-patchable?

Its size is not an obstacle.  It's not ideal to back-patch such a user-visible
performance change, but it would be worse to leave back branches able to
corrupt data during recovery.

On Wed, Aug 28, 2019 at 03:42:10PM +0900, Kyotaro Horiguchi wrote:
> - Use log_newpage instead of fsync for small tables.

> I'm trying to measure performance difference on WAL/fsync.

I would measure it with simultaneous pgbench instances:

1. DDL pgbench instance repeatedly creates and drops a table of X kilobytes,
   using --rate to make this happen a fixed number of times per second.
2. Regular pgbench instance runs the built-in script at maximum qps.

For each X, try one test run with effective_io_block_size = X-1 and one with
effective_io_block_size = X.  If the regular pgbench instance gets materially
higher qps with effective_io_block_size = X-1, the ideal default is <X.
Otherwise, the ideal default is >=X.

> +     <varlistentry id="guc-effective-io-block-size" xreflabel="effective_io_block_size">
> +      <term><varname>effective_io_block_size</varname> (<type>integer</type>)
> +      <indexterm>
> +       <primary><varname>effective_io_block_size</varname> configuration parameter</primary>
> +      </indexterm>
> +      </term>
> +      <listitem>
> +       <para>
> +        Specifies the expected maximum size of a file for which <function>fsync</function> returns in the minimum
requiredduration. It is approximately the size of a track or sylinder for magnetic disks.

> +        The value is specified in kilobytes and the default is <literal>64</literal> kilobytes.
> +       </para>
> +       <para>
> +        When <xref linkend="guc-wal-level"/> is <literal>minimal</literal>,
> +        WAL-logging is skipped for tables created in-trasaction.  If a table
> +        is smaller than that size at commit, it is WAL-logged instead of
> +        issueing <function>fsync</function> on it.
> +
> +       </para>
> +      </listitem>
> +     </varlistentry>

Cylinder and track sizes are obsolete as user-visible concepts.  (They're not
constant for a given drive, and I think modern disks provide no way to read
the relevant parameters.)  I like the name "wal_skip_threshold", and my second
choice would be "wal_skip_min_size".  Possibly documented as follows:

  When wal_level is minimal and a transaction commits after creating or
  rewriting a permanent table, materialized view, or index, this setting
  determines how to persist the new data.  If the data is smaller than this
  setting, write it to the WAL log; otherwise, use an fsync of the data file.
  Depending on the properties of your storage, raising or lowering this value
  might help if such commits are slowing concurrent transactions.  The default
  is 64 kilobytes (64kB).

Any other opinions on the GUC name?

Вложения

wal-optimize-noah-tests-v3.patch

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

25 октября 2019 г., 07:12:51

Hello. Thanks for the comment.

# Sorry in advance for possilbe breaking the thread.

> MarkBufferDirtyHint() writes WAL even when rd_firstRelfilenodeSubid or
> rd_createSubid is set; see attached test case.  It needs to skip WAL whenever
> RelationNeedsWAL() returns false.

Thanks for pointing out that. And the test patch helped me very much.

Most of callers can tell that to the function, but SetHintBits()
cannot easily. Rather I think we shouldn't even try to do
that. Instead, In the attached, MarkBufferDirtyHint() asks storage.c
for sync-pending state of the relfilenode for the buffer. In the
attached patch (0003) RelFileNodeSkippingWAL loops over pendingSyncs
but it is called only at the time FPW is added so I believe it doesn't
affect performance so much. However, we can use hash for pendingSyncs
instead of liked list. Anyway the change is in its own file
v21-0003-Fix-MarkBufferDirtyHint.patch, which will be merged into
0002.

AFAICS all XLogInsert is guarded by RelationNeedsWAL() or in the
non-wal_minimal code paths.

> Cylinder and track sizes are obsolete as user-visible concepts.  (They're not
> onstant for a given drive, and I think modern disks provide no way to read
> the relevant parameters.)  I like the name "wal_skip_threshold", and my second

I strongly agree. Thanks for the draft. I used it as-is. I don't come
up with an appropriate second description of the GUC so I just removed
it.

# it was "For rotating magnetic disks, it is around the size of a
# track or sylinder."

> the relevant parameters.)  I like the name "wal_skip_threshold", and
> my second choice would be "wal_skip_min_size".  Possibly documented
> as follows:
..
> Any other opinions on the GUC name?

I prefer the first candidate. I already used the terminology in
storage.c and the name fits more to the context.

> * We emit newpage WAL records for smaller size of relations.
> *
> * Small WAL records have a chance to be emitted at once along with
> * other backends' WAL records. We emit WAL records instead of syncing
> * for files that are smaller than a certain threshold expecting faster
- * commit. The threshold is defined by the GUC effective_io_block_size.
+ * commit. The threshold is defined by the GUC wal_skip_threshold.

The attached are:

- v21-0001-TAP-test-for-copy-truncation-optimization.patch
  same as v20

- v21-0002-Fix-WAL-skipping-feature.patch
  GUC name changed.

- v21-0003-Fix-MarkBufferDirtyHint.patch
  PoC of fixing the function. will be merged into 0002. (New)

- v21-0004-Documentation-for-wal_skip_threshold.patch
  GUC name and description changed. (Previous 0003)

- v21-0005-Additional-test-for-new-GUC-setting.patch
  including adjusted version of wal-optimize-noah-tests-v3.patch
  Maybe test names need further adjustment. (Previous 0004)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 34149545942480d8dcc1cc587f40091b19b5aa39 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH v21 1/5] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 312 ++++++++++++++++++++++++
 1 file changed, 312 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..b041121745
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,312 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 26;
+
+sub check_orphan_relfilenodes
+{
+    my($node, $test_name) = @_;
+
+    my $db_oid = $node->safe_psql('postgres',
+       "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+    my $prefix = "base/$db_oid/";
+    my $filepaths_referenced = $node->safe_psql('postgres', "
+       SELECT pg_relation_filepath(oid) FROM pg_class
+       WHERE reltablespace = 0 and relpersistence <> 't' and
+       pg_relation_filepath(oid) IS NOT NULL;");
+    is_deeply([sort(map { "$prefix$_" }
+                    grep(/^[0-9]+$/,
+                         slurp_dir($node->data_dir . "/$prefix")))],
+              [sort split /\n/, $filepaths_referenced],
+              $test_name);
+    return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    # Primary needs to have wal_level = minimal here
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+max_prepared_transactions = 1
+));
+    $node->start;
+
+    # Setup
+    my $tablespace_dir = $node->basedir . '/tablespace_other';
+    mkdir ($tablespace_dir);
+    $tablespace_dir = TestLib::perl2host($tablespace_dir);
+    $node->safe_psql('postgres',
+       "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+    # Test direct truncation optimization.  No tuples
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test1 (id serial PRIMARY KEY);
+        TRUNCATE test1;
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+    is($result, qq(0),
+       "wal_level = $wal_level, optimized truncation with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2 (id serial PRIMARY KEY);
+        INSERT INTO test2 VALUES (DEFAULT);
+        TRUNCATE test2;
+        INSERT INTO test2 VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with inserted table");
+
+
+    # Same for prepared transaction
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test2a (id serial PRIMARY KEY);
+        INSERT INTO test2a VALUES (DEFAULT);
+        TRUNCATE test2a;
+        INSERT INTO test2a VALUES (DEFAULT);
+        PREPARE TRANSACTION 't';
+        COMMIT PREPARED 't';");
+
+    $node->stop('immediate');
+    $node->start;
+
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2a;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with prepared transaction");
+
+
+    # Data file for COPY query in follow-up tests.
+    my $basedir = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after the
+    # truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3;
+        COPY test3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+    is($result, qq(3),
+       "wal_level = $wal_level, optimized truncation with copied table");
+
+    # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+        COPY test3a FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # in different subtransaction patterns
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a2;
+        SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; RELEASE s;
+        COPY test3a2 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE test3a3;
+        SAVEPOINT s;
+            ALTER TABLE test3a3 SET TABLESPACE other;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            ROLLBACK TO s2;
+            SAVEPOINT s2;
+                ALTER TABLE test3a3 SET TABLESPACE pg_default;
+            RELEASE s2;
+        ROLLBACK TO s;
+        COPY test3a3 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+    is($result, qq(3),
+       "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+    # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        COPY test3b FROM '$copy_file' DELIMITER ',';  -- set sync_above
+        UPDATE test3b SET id2 = id2 + 1;
+        DELETE FROM test3b;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+    is($result, qq(0),
+       "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+    # Test truncation with inserted tuples using both INSERT and COPY. Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE test4;
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COPY test4 FROM '$copy_file' DELIMITER ',';
+        INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+    is($result, qq(5),
+       "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+        INSERT INTO test5 VALUES (DEFAULT, 1);
+        COPY test5 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER test6_before_row_insert
+          BEFORE INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+        CREATE TRIGGER test6_after_row_insert
+          AFTER INSERT ON test6
+          FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+        COPY test6 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+    is($result, qq(9),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER test7_before_stat_truncate
+          BEFORE TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+        CREATE TRIGGER test7_after_stat_truncate
+          AFTER TRUNCATE ON test7
+          FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+        INSERT INTO test7 VALUES (DEFAULT, 1);
+        TRUNCATE test7;
+        COPY test7 FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+    is($result, qq(4),
+       "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+    # Test redo of temp table creation.
+    $node->safe_psql('postgres', "
+        CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+    $node->stop('immediate');
+    $node->start;
+
+    check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.23.0

From e297f55d0d9215d9e828ec32dc0ebadb8e04bb2c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 25 Oct 2019 12:07:09 +0900
Subject: [PATCH v21 2/5] Fix WAL skipping feature

WAL-skipping operations mixed with WAL-logged operations can lead to
database corruption after a crash. This patch changes the WAL-skipping
feature so that no data modification is WAL-logged at all then sync
such relations at commit.
---
 src/backend/access/heap/heapam.c         |   4 +-
 src/backend/access/heap/heapam_handler.c |  22 +-
 src/backend/access/heap/rewriteheap.c    |  13 +-
 src/backend/access/transam/xact.c        |  17 ++
 src/backend/access/transam/xlogutils.c   |  11 +-
 src/backend/catalog/storage.c            | 294 ++++++++++++++++++++---
 src/backend/commands/cluster.c           |  24 ++
 src/backend/commands/copy.c              |  39 +--
 src/backend/commands/createas.c          |   5 +-
 src/backend/commands/matview.c           |   4 -
 src/backend/commands/tablecmds.c         |  10 +-
 src/backend/storage/buffer/bufmgr.c      |  41 ++--
 src/backend/storage/smgr/md.c            |  30 +++
 src/backend/utils/cache/relcache.c       |  28 ++-
 src/backend/utils/misc/guc.c             |  13 +
 src/include/access/heapam.h              |   1 -
 src/include/access/rewriteheap.h         |   2 +-
 src/include/access/tableam.h             |  40 +--
 src/include/catalog/storage.h            |  12 +
 src/include/storage/bufmgr.h             |   1 +
 src/include/storage/md.h                 |   1 +
 src/include/utils/rel.h                  |  19 +-
 22 files changed, 455 insertions(+), 176 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0128bb34ef..a7ead9405a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1936,7 +1936,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (RelationNeedsWAL(relation))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2119,7 +2119,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+    needwal = RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2dd8821fac..0871df7730 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -558,18 +558,6 @@ tuple_lock_retry:
     return result;
 }
 
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
-    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway / don't go through tableam)
-     */
-    if (options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(relation);
-}
-
-
 /* ------------------------------------------------------------------------
  * DDL related callbacks for heap AM.
  * ------------------------------------------------------------------------
@@ -701,7 +689,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     IndexScanDesc indexScan;
     TableScanDesc tableScan;
     HeapScanDesc heapScan;
-    bool        use_wal;
     bool        is_system_catalog;
     Tuplesortstate *tuplesort;
     TupleDesc    oldTupDesc = RelationGetDescr(OldHeap);
@@ -716,12 +703,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     is_system_catalog = IsSystemRelation(OldHeap);
 
     /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a WAL-logged rel.
+     * smgr_targblock must be initially invalid if we are to skip WAL logging
      */
-    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
-    /* use_wal off requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
 
     /* Preallocate values/isnull arrays */
@@ -731,7 +714,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
     /* Initialize the rewrite operation */
     rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
-                                 *multi_cutoff, use_wal);
+                                 *multi_cutoff);
 
 
     /* Set up sorting if wanted */
@@ -2519,7 +2502,6 @@ static const TableAmRoutine heapam_methods = {
     .tuple_delete = heapam_tuple_delete,
     .tuple_update = heapam_tuple_update,
     .tuple_lock = heapam_tuple_lock,
-    .finish_bulk_insert = heapam_finish_bulk_insert,
 
     .tuple_fetch_row_version = heapam_fetch_row_version,
     .tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d41dbcf5f7..9b757cacf4 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -144,7 +144,6 @@ typedef struct RewriteStateData
     Page        rs_buffer;        /* page currently being built */
     BlockNumber rs_blockno;        /* block where page will go */
     bool        rs_buffer_valid;    /* T if any tuples in buffer */
-    bool        rs_use_wal;        /* must we WAL-log inserts? */
     bool        rs_logical_rewrite; /* do we need to do logical rewriting */
     TransactionId rs_oldest_xmin;    /* oldest xmin used by caller to determine
                                      * tuple visibility */
@@ -238,15 +237,13 @@ static void logical_end_heap_rewrite(RewriteState state);
  * oldest_xmin    xid used by the caller to determine which tuples are dead
  * freeze_xid    xid before which tuples will be frozen
  * cutoff_multi    multixact before which multis will be removed
- * use_wal        should the inserts to the new heap be WAL-logged?
  *
  * Returns an opaque RewriteState, allocated in current memory context,
  * to be used in subsequent calls to the other functions.
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-                   TransactionId freeze_xid, MultiXactId cutoff_multi,
-                   bool use_wal)
+                   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
     RewriteState state;
     MemoryContext rw_cxt;
@@ -271,7 +268,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     /* new_heap needn't be empty, just locked */
     state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
     state->rs_buffer_valid = false;
-    state->rs_use_wal = use_wal;
     state->rs_oldest_xmin = oldest_xmin;
     state->rs_freeze_xid = freeze_xid;
     state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +326,7 @@ end_heap_rewrite(RewriteState state)
     /* Write the last page, if any */
     if (state->rs_buffer_valid)
     {
-        if (state->rs_use_wal)
+        if (RelationNeedsWAL(state->rs_new_rel))
             log_newpage(&state->rs_new_rel->rd_node,
                         MAIN_FORKNUM,
                         state->rs_blockno,
@@ -654,9 +650,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int            options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
@@ -695,7 +688,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
             /* Doesn't fit, so write out the existing page */
 
             /* XLOG stuff */
-            if (state->rs_use_wal)
+            if (RelationNeedsWAL(state->rs_new_rel))
                 log_newpage(&state->rs_new_rel->rd_node,
                             MAIN_FORKNUM,
                             state->rs_blockno,
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index fc55fa6d53..59d65bc214 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2107,6 +2107,13 @@ CommitTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Synchronize files that are created and not WAL-logged during this
+     * transaction. This must happen before emitting commit record so that we
+     * don't see committed-but-broken files after a crash.
+     */
+    smgrDoPendingSyncs(true, false);
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
@@ -2339,6 +2346,14 @@ PrepareTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Sync all WAL-skipped files now. Some of them may be deleted at
+     * transaction end but we don't bother store that information in PREPARE
+     * record or two-phase files. Like commit, we should sync WAL-skipped
+     * files before emitting PREPARE record. See CommitTransaction().
+     */
+    smgrDoPendingSyncs(true, true);
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
@@ -2657,6 +2672,7 @@ AbortTransaction(void)
      */
     AfterTriggerEndXact(false); /* 'false' means it's abort */
     AtAbort_Portals();
+    smgrDoPendingSyncs(false, false);
     AtEOXact_LargeObject(false);
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
@@ -4964,6 +4980,7 @@ AbortSubTransaction(void)
                            s->parent->curTransactionOwner);
         AtEOSubXact_LargeObject(false, s->subTransactionId,
                                 s->parent->subTransactionId);
+        smgrDoPendingSyncs(false, false);
         AtSubAbort_Notify();
 
         /* Advertise the fact that we aborted in pg_xact. */
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5f1e5ba75d..fc296abf91 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -544,6 +544,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry;
  * fields related to physical storage, like rd_rel, are initialized, so the
  * fake entry is only usable in low-level operations like ReadBuffer().
  *
+ * This is also used for syncing WAL-skipped files.
+ *
  * Caller must free the returned entry with FreeFakeRelcacheEntry().
  */
 Relation
@@ -552,18 +554,19 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
     FakeRelCacheEntry fakeentry;
     Relation    rel;
 
-    Assert(InRecovery);
-
     /* Allocate the Relation struct and all related space in one block. */
     fakeentry = palloc0(sizeof(FakeRelCacheEntryData));
     rel = (Relation) fakeentry;
 
     rel->rd_rel = &fakeentry->pgc;
     rel->rd_node = rnode;
-    /* We will never be working with temp rels during recovery */
+    /*
+     * We will never be working with temp rels during recovery or syncing
+     * WAL-skpped files.
+     */
     rel->rd_backend = InvalidBackendId;
 
-    /* It must be a permanent table if we're in recovery. */
+    /* It must be a permanent table here */
     rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT;
 
     /* We don't know the name of the relation; use relfilenode instead */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 625af8d49a..806f235a24 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -30,9 +30,13 @@
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+/* GUC variables */
+int    wal_skip_threshold = 64; /* threshold of WAL-skipping in kilobytes */
+
 /*
  * We keep a list of all relations (represented as RelFileNode values)
  * that have been created or deleted in the current transaction.  When
@@ -53,16 +57,17 @@
  * but I'm being paranoid.
  */
 
-typedef struct PendingRelDelete
+typedef struct PendingRelOp
 {
     RelFileNode relnode;        /* relation that may need to be deleted */
     BackendId    backend;        /* InvalidBackendId if not a temp rel */
-    bool        atCommit;        /* T=delete at commit; F=delete at abort */
+    bool        atCommit;        /* T=work at commit; F=work at abort */
     int            nestLevel;        /* xact nesting level of request */
-    struct PendingRelDelete *next;    /* linked-list link */
-} PendingRelDelete;
+    struct PendingRelOp *next;    /* linked-list link */
+} PendingRelOp;
 
-static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOp *pendingDeletes = NULL; /* head of linked list */
+static PendingRelOp *pendingSyncs = NULL; /* head of linked list */
 
 /*
  * RelationCreateStorage
@@ -78,7 +83,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 SMgrRelation
 RelationCreateStorage(RelFileNode rnode, char relpersistence)
 {
-    PendingRelDelete *pending;
+    PendingRelOp *pending;
     SMgrRelation srel;
     BackendId    backend;
     bool        needs_wal;
@@ -109,8 +114,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
         log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
 
     /* Add the relation to the list of stuff to delete at abort */
-    pending = (PendingRelDelete *)
-        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+    pending = (PendingRelOp *)
+        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
     pending->relnode = rnode;
     pending->backend = backend;
     pending->atCommit = false;    /* delete if abort */
@@ -118,6 +123,25 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
+    /*
+     * When wal_level = minimal, we are going to skip WAL-logging for storage
+     * of persistent relations created in the current transaction. The
+     * relation needs to be synced at commit.
+     */
+    if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+    {
+        int nestLevel = GetCurrentTransactionNestLevel();
+
+        pending = (PendingRelOp *)
+            MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
+        pending->relnode = rnode;
+        pending->backend = backend;
+        pending->atCommit = true;
+        pending->nestLevel = nestLevel;
+        pending->next = pendingSyncs;
+        pendingSyncs = pending;
+    }
+
     return srel;
 }
 
@@ -147,11 +171,11 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
 void
 RelationDropStorage(Relation rel)
 {
-    PendingRelDelete *pending;
+    PendingRelOp *pending;
 
     /* Add the relation to the list of stuff to delete at commit */
-    pending = (PendingRelDelete *)
-        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+    pending = (PendingRelOp *)
+        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp));
     pending->relnode = rel->rd_node;
     pending->backend = rel->rd_backend;
     pending->atCommit = true;    /* delete if commit */
@@ -192,9 +216,9 @@ RelationDropStorage(Relation rel)
 void
 RelationPreserveStorage(RelFileNode rnode, bool atCommit)
 {
-    PendingRelDelete *pending;
-    PendingRelDelete *prev;
-    PendingRelDelete *next;
+    PendingRelOp *pending;
+    PendingRelOp *prev;
+    PendingRelOp *next;
 
     prev = NULL;
     for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -431,9 +455,9 @@ void
 smgrDoPendingDeletes(bool isCommit)
 {
     int            nestLevel = GetCurrentTransactionNestLevel();
-    PendingRelDelete *pending;
-    PendingRelDelete *prev;
-    PendingRelDelete *next;
+    PendingRelOp *pending;
+    PendingRelOp *prev;
+    PendingRelOp *next;
     int            nrels = 0,
                 i = 0,
                 maxrels = 0;
@@ -494,11 +518,194 @@ smgrDoPendingDeletes(bool isCommit)
 }
 
 /*
- * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
+ *    smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
+ *
+ * This should be called before smgrDoPendingDeletes() at every subtransaction
+ * end. Also this should be called before emitting WAL record so that sync
+ * failure prevents commit.
+ *
+ * If sync_all is true, syncs all files including that are scheduled to be
+ * deleted.
+ */
+void
+smgrDoPendingSyncs(bool isCommit, bool sync_all)
+{
+    int            nestLevel = GetCurrentTransactionNestLevel();
+    PendingRelOp *pending;
+    PendingRelOp *prev;
+    PendingRelOp *next;
+    SMgrRelation srel = NULL;
+    ForkNumber fork;
+    BlockNumber nblocks[MAX_FORKNUM + 1];
+    BlockNumber total_blocks = 0;
+    HTAB    *delhash = NULL;
+
+    /* Return if nothing to be synced in this nestlevel */
+    if (!pendingSyncs || pendingSyncs->nestLevel < nestLevel)
+        return;
+
+    Assert (pendingSyncs->nestLevel <= nestLevel);
+    Assert (pendingSyncs->backend == InvalidBackendId);
+
+    /*
+     * If sync_all is false, pending syncs on the relation that are to be
+     * deleted in this transaction-end should be ignored. Collect pending
+     * deletes that will happen in the following call to
+     * smgrDoPendingDeletes().
+     */
+    if (!sync_all)
+    {
+        for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+        {
+            bool found PG_USED_FOR_ASSERTS_ONLY;
+
+            if (pending->nestLevel < pendingSyncs->nestLevel ||
+                pending->atCommit != isCommit)
+                continue;
+
+            /* create the hash if not yet */
+            if (delhash == NULL)
+            {
+                HASHCTL hash_ctl;
+
+                memset(&hash_ctl, 0, sizeof(hash_ctl));
+                hash_ctl.keysize = sizeof(RelFileNode);
+                hash_ctl.entrysize = sizeof(RelFileNode);
+                hash_ctl.hcxt = CurrentMemoryContext;
+                delhash =
+                    hash_create("pending del temporary hash", 8, &hash_ctl,
+                                HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+            }
+
+            (void) hash_search(delhash, (void *) &(pending->relnode),
+                               HASH_ENTER, &found);
+            Assert(!found);
+        }
+    }
+
+    /* Loop over pendingSyncs */
+    prev = NULL;
+    for (pending = pendingSyncs; pending != NULL; pending = next)
+    {
+        bool to_be_removed = (!isCommit); /* don't sync if aborted */
+
+        next = pending->next;
+
+        /* outer-level entries should not be processed yet */
+        if (pending->nestLevel < nestLevel)
+        {
+            prev = pending;
+            continue;
+        }
+
+        /* don't sync relnodes that is being deleted */
+        if (delhash && !to_be_removed)
+            hash_search(delhash, (void *) &pending->relnode,
+                        HASH_FIND, &to_be_removed);
+
+        /* remove the entry if no longer useful */
+        if (to_be_removed)
+        {
+            if (prev)
+                prev->next = next;
+            else
+                pendingSyncs = next;
+            pfree(pending);
+            continue;
+        }
+
+        /* actual sync happens at the end of top transaction */
+        if (nestLevel > 1)
+        {
+            prev = pending;
+            continue;
+        }
+
+        /* Now the time to sync the rnode */
+        srel = smgropen(pendingSyncs->relnode, pendingSyncs->backend);
+
+        /*
+         * We emit newpage WAL records for smaller size of relations.
+         *
+         * Small WAL records have a chance to be emitted at once along with
+         * other backends' WAL records. We emit WAL records instead of syncing
+         * for files that are smaller than a certain threshold expecting faster
+         * commit. The threshold is defined by the GUC wal_skip_threshold.
+         */
+        for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+        {
+            /* FSM doesn't need WAL nor sync */
+            if (fork != FSM_FORKNUM && smgrexists(srel, fork))
+            {
+                BlockNumber n = smgrnblocks(srel, fork);
+
+                /* we shouldn't come here for unlogged relations */
+                Assert(fork != INIT_FORKNUM);
+
+                nblocks[fork] = n;
+                total_blocks += n;
+            }
+            else
+                nblocks[fork] = InvalidBlockNumber;
+        }
+
+        /*
+         * Sync file or emit WAL record for the file according to the total
+         * size.
+         */
+        if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
+        {
+            /* Flush all buffers then sync the file */
+            FlushRelationBuffersWithoutRelcache(srel->smgr_rnode.node, false);
+
+            for (fork = 0; fork <= MAX_FORKNUM; fork++)
+            {
+                if (smgrexists(srel, fork))
+                    smgrimmedsync(srel, fork);
+            }
+        }
+        else
+        {
+            /*
+             * Emit WAL records for all blocks. Some of the blocks might have
+             * been synced or evicted, but We don't bother checking that. The
+             * file is small enough.
+             */
+            for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+            {
+                bool   page_std = (fork == MAIN_FORKNUM);
+                int    n        = nblocks[fork];
+                Relation rel;
+
+                if (!BlockNumberIsValid(n))
+                    continue;
+
+                /* Emit WAL for the whole file */
+                rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node);
+                log_newpage_range(rel, fork, 0, n, page_std);
+                FreeFakeRelcacheEntry(rel);
+            }
+        }
+
+        /* done remove from list */
+        if (prev)
+            prev->next = next;
+        else
+            pendingSyncs = next;
+        pfree(pending);
+    }
+
+    if (delhash)
+        hash_destroy(delhash);
+}
+
+/*
+ * smgrGetPendingOperations() -- Get a list of non-temp relations to be
+ *                                 deleted or synced.
  *
- * The return value is the number of relations scheduled for termination.
- * *ptr is set to point to a freshly-palloc'd array of RelFileNodes.
- * If there are no relations to be deleted, *ptr is set to NULL.
+ * The return value is the number of relations scheduled in the given
+ * list. *ptr is set to point to a freshly-palloc'd array of RelFileNodes.  If
+ * there are no matching relations, *ptr is set to NULL.
  *
  * Only non-temporary relations are included in the returned list.  This is OK
  * because the list is used only in contexts where temporary relations don't
@@ -507,19 +714,19 @@ smgrDoPendingDeletes(bool isCommit)
  * (and all temporary files will be zapped if we restart anyway, so no need
  * for redo to do it also).
  *
- * Note that the list does not include anything scheduled for termination
- * by upper-level transactions.
+ * Note that the list does not include anything scheduled by upper-level
+ * transactions.
  */
-int
-smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
+static inline int
+smgrGetPendingOperations(PendingRelOp *list, bool forCommit, RelFileNode **ptr)
 {
     int            nestLevel = GetCurrentTransactionNestLevel();
     int            nrels;
     RelFileNode *rptr;
-    PendingRelDelete *pending;
+    PendingRelOp *pending;
 
     nrels = 0;
-    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    for (pending = list; pending != NULL; pending = pending->next)
     {
         if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
             && pending->backend == InvalidBackendId)
@@ -532,7 +739,7 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     }
     rptr = (RelFileNode *) palloc(nrels * sizeof(RelFileNode));
     *ptr = rptr;
-    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    for (pending = list; pending != NULL; pending = pending->next)
     {
         if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
             && pending->backend == InvalidBackendId)
@@ -544,6 +751,20 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     return nrels;
 }
 
+/* Returns list of pending deletes, see smgrGetPendingOperations for details */
+int
+smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
+{
+    return smgrGetPendingOperations(pendingDeletes, forCommit, ptr);
+}
+
+/* Returns list of pending syncs, see smgrGetPendingOperations for details */
+int
+smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr)
+{
+    return smgrGetPendingOperations(pendingSyncs, forCommit, ptr);
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *
@@ -554,8 +775,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
 void
 PostPrepare_smgr(void)
 {
-    PendingRelDelete *pending;
-    PendingRelDelete *next;
+    PendingRelOp *pending;
+    PendingRelOp *next;
 
     for (pending = pendingDeletes; pending != NULL; pending = next)
     {
@@ -564,25 +785,34 @@ PostPrepare_smgr(void)
         /* must explicitly free the list entry */
         pfree(pending);
     }
+
+    /* We shouldn't have an entry in pendingSyncs */
+    Assert(pendingSyncs == NULL);
 }
 
 
 /*
  * AtSubCommit_smgr() --- Take care of subtransaction commit.
  *
- * Reassign all items in the pending-deletes list to the parent transaction.
+ * Reassign all items in the pending-operations list to the parent transaction.
  */
 void
 AtSubCommit_smgr(void)
 {
     int            nestLevel = GetCurrentTransactionNestLevel();
-    PendingRelDelete *pending;
+    PendingRelOp *pending;
 
     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
     {
         if (pending->nestLevel >= nestLevel)
             pending->nestLevel = nestLevel - 1;
     }
+
+    for (pending = pendingSyncs; pending != NULL; pending = pending->next)
+    {
+        if (pending->nestLevel >= nestLevel)
+            pending->nestLevel = nestLevel - 1;
+    }
 }
 
 /*
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index a23128d7a0..fba44de88a 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1034,12 +1034,36 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
 
     if (OidIsValid(relfilenode1) && OidIsValid(relfilenode2))
     {
+        Relation rel1;
+        Relation rel2;
+
         /*
          * Normal non-mapped relations: swap relfilenodes, reltablespaces,
          * relpersistence
          */
         Assert(!target_is_pg_class);
 
+        /* Update creation subid hints of relcache */
+        rel1 = relation_open(r1, ExclusiveLock);
+        rel2 = relation_open(r2, ExclusiveLock);
+
+        /*
+         * New relation's relfilenode is created in the current transaction
+         * and used as old ralation's new relfilenode. So its
+         * newRelfilenodeSubid as new relation's createSubid. We don't fix
+         * rel2 since it would be deleted soon.
+         */
+        Assert(rel2->rd_createSubid != InvalidSubTransactionId);
+        rel1->rd_newRelfilenodeSubid = rel2->rd_createSubid;
+
+        /* record the first relfilenode change in the current transaction */
+        if (rel1->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+            rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
+        relation_close(rel1, ExclusiveLock);
+        relation_close(rel2, ExclusiveLock);
+
+        /* swap relfilenodes, reltablespaces, relpersistence */
         swaptemp = relform1->relfilenode;
         relform1->relfilenode = relform2->relfilenode;
         relform2->relfilenode = swaptemp;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3aeef30b28..3ce04f7efc 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2534,9 +2534,6 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
     for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++)
         ExecDropSingleTupleTableSlot(buffer->slots[i]);
 
-    table_finish_bulk_insert(buffer->resultRelInfo->ri_RelationDesc,
-                             miinfo->ti_options);
-
     pfree(buffer);
 }
 
@@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
      * If it does commit, we'll have done the table_finish_bulk_insert() at
      * the bottom of this routine first.
      *
-     * As mentioned in comments in utils/rel.h, the in-same-transaction test
-     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
-     * can be cleared before the end of the transaction. The exact case is
-     * when a relation sets a new relfilenode twice in same transaction, yet
-     * the second one fails in an aborted subtransaction, e.g.
-     *
-     * BEGIN;
-     * TRUNCATE t;
-     * SAVEPOINT save;
-     * TRUNCATE t;
-     * ROLLBACK TO save;
-     * COPY ...
-     *
-     * Also, if the target file is new-in-transaction, we assume that checking
-     * FSM for free space is a waste of time, even if we must use WAL because
-     * of archiving.  This could possibly be wrong, but it's unlikely.
-     *
-     * The comments for table_tuple_insert and RelationGetBufferForTuple
-     * specify that skipping WAL logging is only safe if we ensure that our
-     * tuples do not go into pages containing tuples from any other
-     * transactions --- but this must be the case if we have a new table or
-     * new relfilenode, so we need no additional work to enforce that.
+     * If the target file is new-in-transaction, we assume that checking FSM
+     * for free space is a waste of time, even if we must use WAL because of
+     * archiving.  This could possibly be wrong, but it's unlikely.
      *
      * We currently don't support this optimization if the COPY target is a
      * partitioned table as we currently only lazily initialize partition
@@ -2762,15 +2740,14 @@ CopyFrom(CopyState cstate)
      * are not supported as per the description above.
      *----------
      */
-    /* createSubid is creation check, newRelfilenodeSubid is truncation check */
+    /*
+     * createSubid is creation check, firstRelfilenodeSubid is truncation and
+     * cluster check. Partitioned table doesn't have storage.
+     */
     if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
         (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
-         cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
-    {
+         cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
         ti_options |= TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
-    }
 
     /*
      * Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index b7d220699f..8a91d946e3 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+    myState->ti_options = TABLE_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    table_finish_bulk_insert(myState->rel, myState->ti_options);
-
     /* close rel, but keep lock until commit */
     table_close(myState->rel, NoLock);
     myState->rel = NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 537d0e8cef..1c854dcebf 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * replication is in use. We can skip the FSM in any case.
      */
     myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
-    if (!XLogIsNeeded())
-        myState->ti_options |= TABLE_INSERT_SKIP_WAL;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self)
 
     FreeBulkInsertState(myState->bistate);
 
-    table_finish_bulk_insert(myState->transientrel, myState->ti_options);
-
     /* close transientrel, but keep lock until commit */
     table_close(myState->transientrel, NoLock);
     myState->transientrel = NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8d25d14772..54c8b0fb04 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4764,9 +4764,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 
     /*
      * Prepare a BulkInsertState and options for table_tuple_insert. Because
-     * we're building a new heap, we can skip WAL-logging and fsync it to disk
-     * at the end instead (unless WAL-logging is required for archiving or
-     * streaming replication). The FSM is empty too, so don't bother using it.
+     * we're building a new heap, the underlying table AM can skip WAL-logging
+     * and smgr will sync the relation to disk at the end of the current
+     * transaction instead. The FSM is empty too, so don't bother using it.
      */
     if (newrel)
     {
@@ -4774,8 +4774,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         ti_options = TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
     }
     else
     {
@@ -5070,8 +5068,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
     {
         FreeBulkInsertState(bistate);
 
-        table_finish_bulk_insert(newrel, ti_options);
-
         table_close(newrel, NoLock);
     }
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 483f705305..827626b330 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -171,6 +171,7 @@ static HTAB *PrivateRefCountHash = NULL;
 static int32 PrivateRefCountOverflowed = 0;
 static uint32 PrivateRefCountClock = 0;
 static PrivateRefCountEntry *ReservedRefCountEntry = NULL;
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 
 static void ReservePrivateRefCountEntry(void);
 static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
@@ -675,10 +676,10 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *        a relcache entry for the relation.
  *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
+ * NB: At present, this function may only be used on permanent relations,
+ * which is OK, because we only use it during XLOG replay and processing
+ * pending syncs.  If in the future we want to use it on temporary or unlogged
+ * relations, we could pass additional parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
@@ -3203,20 +3204,32 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
-    /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3233,7 +3246,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3263,18 +3276,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 07f3c93d3f..514c6098e6 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -994,6 +994,36 @@ ForgetDatabaseSyncRequests(Oid dbid)
     RegisterSyncRequest(&tag, SYNC_FILTER_REQUEST, true /* retryOnError */ );
 }
 
+/*
+ * SyncRelationFiles -- sync files of all given relations
+ *
+ * This function is assumed to be called only when skipping WAL-logging and
+ * emits no xlog records.
+ */
+void
+SyncRelationFiles(RelFileNode *syncrels, int nsyncrels)
+{
+    int            i;
+
+    for (i = 0; i < nsyncrels; i++)
+    {
+        SMgrRelation srel;
+        ForkNumber    fork;
+
+        /* sync all existing forks of the relation */
+        FlushRelationBuffersWithoutRelcache(syncrels[i], false);
+        srel = smgropen(syncrels[i], InvalidBackendId);
+
+        for (fork = 0; fork <= MAX_FORKNUM; fork++)
+        {
+            if (smgrexists(srel, fork))
+                smgrimmedsync(srel, fork);
+        }
+
+        smgrclose(srel);
+    }
+}
+
 /*
  * DropRelationFiles -- drop files of all given relations
  */
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 585dcee5db..892462873f 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1096,6 +1096,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
     relation->rd_isnailed = false;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     switch (relation->rd_rel->relpersistence)
     {
         case RELPERSISTENCE_UNLOGGED:
@@ -1829,6 +1830,7 @@ formrdesc(const char *relationName, Oid relationReltype,
     relation->rd_isnailed = true;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     relation->rd_backend = InvalidBackendId;
     relation->rd_islocaltemp = false;
 
@@ -2094,7 +2096,7 @@ RelationClose(Relation relation)
 #ifdef RELCACHE_FORCE_RELEASE
     if (RelationHasReferenceCountZero(relation) &&
         relation->rd_createSubid == InvalidSubTransactionId &&
-        relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
         RelationClearRelation(relation, false);
 #endif
 }
@@ -2510,8 +2512,8 @@ RelationClearRelation(Relation relation, bool rebuild)
          * problem.
          *
          * When rebuilding an open relcache entry, we must preserve ref count,
-         * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state.  Also
-         * attempt to preserve the pg_class entry (rd_rel), tupledesc,
+         * rd_createSubid/rd_new/firstRelfilenodeSubid, and rd_toastoid state.
+         * Also attempt to preserve the pg_class entry (rd_rel), tupledesc,
          * rewrite-rule, partition key, and partition descriptor substructures
          * in place, because various places assume that these structures won't
          * move while they are working with an open relcache entry.  (Note:
@@ -2600,6 +2602,7 @@ RelationClearRelation(Relation relation, bool rebuild)
         /* creation sub-XIDs must be preserved */
         SWAPFIELD(SubTransactionId, rd_createSubid);
         SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+        SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
         /* un-swap rd_rel pointers, swap contents instead */
         SWAPFIELD(Form_pg_class, rd_rel);
         /* ... but actually, we don't have to update newrel->rd_rel */
@@ -2667,7 +2670,7 @@ static void
 RelationFlushRelation(Relation relation)
 {
     if (relation->rd_createSubid != InvalidSubTransactionId ||
-        relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
     {
         /*
          * New relcache entries are always rebuilt, not flushed; else we'd
@@ -2807,7 +2810,7 @@ RelationCacheInvalidate(void)
          * pending invalidations.
          */
         if (relation->rd_createSubid != InvalidSubTransactionId ||
-            relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+            relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
             continue;
 
         relcacheInvalsReceived++;
@@ -3064,6 +3067,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
      * Likewise, reset the hint about the relfilenode being new.
      */
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 }
 
 /*
@@ -3155,7 +3159,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
     }
 
     /*
-     * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+     * Likewise, update or drop any new-relfilenode-in-subtransaction hints.
      */
     if (relation->rd_newRelfilenodeSubid == mySubid)
     {
@@ -3164,6 +3168,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         else
             relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     }
+
+    if (relation->rd_firstRelfilenodeSubid == mySubid)
+    {
+        if (isCommit)
+            relation->rd_firstRelfilenodeSubid = parentSubid;
+        else
+            relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    }
 }
 
 
@@ -3253,6 +3265,7 @@ RelationBuildLocalRelation(const char *relname,
     /* it's being created in this transaction */
     rel->rd_createSubid = GetCurrentSubTransactionId();
     rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 
     /*
      * create a new tuple descriptor from the one passed in.  We do this
@@ -3556,6 +3569,8 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
      * operations on the rel in the same transaction.
      */
     relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+    if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
 
     /* Flag relation as needing eoxact cleanup (to remove the hint) */
     EOXactListAdd(relation);
@@ -5592,6 +5607,7 @@ load_relcache_init_file(bool shared)
         rel->rd_fkeylist = NIL;
         rel->rd_createSubid = InvalidSubTransactionId;
         rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+        rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
         rel->rd_amcache = NULL;
         MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 31a5ef0474..559f96a6dc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -36,6 +36,7 @@
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
+#include "catalog/storage.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
 #include "commands/user.h"
@@ -2774,6 +2775,18 @@ static struct config_int ConfigureNamesInt[] =
         check_effective_io_concurrency, assign_effective_io_concurrency, NULL
     },
 
+    {
+        {"wal_skip_threshold", PGC_USERSET, RESOURCES_DISK,
+            gettext_noop("Size of file that can be fsync'ed in the minimum required duration."),
+            NULL,
+            GUC_UNIT_KB
+        },
+        &wal_skip_threshold,
+        64,
+        0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
     {
         {"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
             gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6bc9..80c2e1bafc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
                                        TransactionId OldestXmin, TransactionId FreezeXid,
-                                       MultiXactId MultiXactCutoff, bool use_wal);
+                                       MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
                                HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7f81703b78..b652cd6cef 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -407,22 +407,6 @@ typedef struct TableAmRoutine
                                uint8 flags,
                                TM_FailureData *tmfd);
 
-    /*
-     * Perform operations necessary to complete insertions made via
-     * tuple_insert and multi_insert with a BulkInsertState specified. This
-     * may for example be used to flush the relation, when the
-     * TABLE_INSERT_SKIP_WAL option was used.
-     *
-     * Typically callers of tuple_insert and multi_insert will just pass all
-     * the flags that apply to them, and each AM has to decide which of them
-     * make sense for it, and then only take actions in finish_bulk_insert for
-     * those flags, and ignore others.
-     *
-     * Optional callback.
-     */
-    void        (*finish_bulk_insert) (Relation rel, int options);
-
-
     /* ------------------------------------------------------------------------
      * DDL related functionality.
      * ------------------------------------------------------------------------
@@ -1087,10 +1071,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * The options bitmask allows the caller to specify options that may change the
  * behaviour of the AM. The AM will ignore options that it does not support.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.
@@ -1112,8 +1092,7 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * heap's TOAST table, too, if the tuple requires any out-of-line data.
  *
  * The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
  *
  * On return the slot's tts_tid and tts_tableOid are updated to reflect the
  * insertion. But note that any toasting of fields within the slot is NOT
@@ -1248,6 +1227,8 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  * update was done.  However, any TOAST changes in the new tuple's
  * data are not reflected into *newtup.
  *
+ * See table_insert about skipping WAL-logging feature.
+ *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
  * t_xmax, and, if possible, t_cmax.  See comments for struct TM_FailureData
  * for additional info.
@@ -1308,21 +1289,6 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
                                        flags, tmfd);
 }
 
-/*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
- */
-static inline void
-table_finish_bulk_insert(Relation rel, int options)
-{
-    /* optional callback */
-    if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
-        rel->rd_tableam->finish_bulk_insert(rel, options);
-}
-
-
 /* ------------------------------------------------------------------------
  * DDL related functionality.
  * ------------------------------------------------------------------------
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3579d3f3eb..24e71651c3 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,6 +19,16 @@
 #include "storage/smgr.h"
 #include "utils/relcache.h"
 
+/* enum for operation type of PendingDelete entries */
+typedef enum PendingOpType
+{
+    PENDING_DELETE,
+    PENDING_SYNC
+} PendingOpType;
+
+/* GUC variables */
+extern int    wal_skip_threshold; /* threshold for WAL-skipping */
+
 extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
@@ -31,7 +41,9 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingSyncs(bool isCommit, bool sync_all);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern int    smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
 extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 17b97f7e38..f31a36de17 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -189,6 +189,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                                    ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                                    int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index c0f05e23ff..2bb2947bdb 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -42,6 +42,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 
 extern void ForgetDatabaseSyncRequests(Oid dbid);
+extern void SyncRelationFiles(RelFileNode *syncrels, int nsyncrels);
 extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
 
 /* md sync callbacks */
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a5cf804f9f..b2062efa63 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -75,10 +75,17 @@ typedef struct RelationData
      * transaction, with one of them occurring in a subsequently aborted
      * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
      * ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+     * rd_firstRelfilenodeSubid is the ID of the first subtransaction the
+     * relfilenode change has took place in the current transaction. Unlike
+     * newRelfilenodeSubid, this won't be accidentially forgotten. A valid OID
+     * means that the currently active relfilenode is transaction-local and we
+     * sync the relation at commit instead of WAL-logging.
      */
     SubTransactionId rd_createSubid;    /* rel was created in current xact */
     SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
                                                  * current xact */
+    SubTransactionId rd_firstRelfilenodeSubid;    /* new relfilenode assigned
+                                                 * first in current xact */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
@@ -517,9 +524,15 @@ typedef struct ViewOptions
 /*
  * RelationNeedsWAL
  *        True if relation needs WAL.
- */
-#define RelationNeedsWAL(relation) \
-    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.
+ */
+#define RelationNeedsWAL(relation)                                        \
+    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&    \
+     (XLogIsNeeded() ||                                                    \
+      (relation->rd_createSubid == InvalidSubTransactionId &&            \
+       relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
 
 /*
  * RelationUsesLocalBuffers
-- 
2.23.0

From 96ad8bd4537e5055509ec9fdbbef502b52f136b5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 25 Oct 2019 12:07:52 +0900
Subject: [PATCH v21 3/5] Fix MarkBufferDirtyHint

---
 src/backend/catalog/storage.c       | 17 +++++++++++++++++
 src/backend/storage/buffer/bufmgr.c |  7 +++++++
 src/include/catalog/storage.h       |  1 +
 3 files changed, 25 insertions(+)

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 806f235a24..6d5a3d53e7 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -440,6 +440,23 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
         smgrimmedsync(dst, forkNum);
 }
 
+/*
+ * RelFileNodeSkippingWAL - check if this relfilenode needs WAL
+ */
+bool
+RelFileNodeSkippingWAL(RelFileNode rnode)
+{
+    PendingRelOp *pending;
+
+    for (pending = pendingSyncs ; pending != NULL ; pending = pending->next)
+    {
+        if (RelFileNodeEquals(pending->relnode, rnode))
+            return true;
+    }
+
+    return false;
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 827626b330..06ec7cc186 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3506,6 +3506,13 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
             if (RecoveryInProgress())
                 return;
 
+            /*
+             * Skip WAL logging if this buffer belongs to a relation that is
+             * skipping WAL-logging.
+             */
+            if (RelFileNodeSkippingWAL(bufHdr->tag.rnode))
+                return;
+
             /*
              * If the block is already dirty because we either made a change
              * or set a hint already, then we don't need to write a full page
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 24e71651c3..eb2666e001 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -35,6 +35,7 @@ extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
 extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
                                 ForkNumber forkNum, char relpersistence);
+extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
 
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
-- 
2.23.0

From 6ad62905d8a256c3531c9225bdb3212c45f5faff Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 28 Aug 2019 14:05:30 +0900
Subject: [PATCH v21 4/5] Documentation for wal_skip_threshold

---
 doc/src/sgml/config.sgml | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 886632ff43..f928c5aa0b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1833,6 +1833,32 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-skip-min_size" xreflabel="wal_skip_threshold">
+      <term><varname>wal_skip_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_skip_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+         When wal_level is minimal and a transaction commits after creating or
+         rewriting a permanent table, materialized view, or index, this
+         setting determines how to persist the new data.  If the data is
+         smaller than this setting, write it to the WAL log; otherwise, use an
+         fsync of the data file.  Depending on the properties of your storage,
+         raising or lowering this value might help if such commits are slowing
+         concurrent transactions.  The default is 64 kilobytes (64kB).
+       </para>
+       <para>
+        When <xref linkend="guc-wal-level"/> is <literal>minimal</literal>,
+        WAL-logging is skipped for tables created in-trasaction.  If a table
+        is smaller than that size at commit, it is WAL-logged instead of
+        issueing <function>fsync</function> on it.
+
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
 
-- 
2.23.0

From 67baae223a93bc4c9827e1c8d99a040a058ad6ad Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 28 Aug 2019 14:12:18 +0900
Subject: [PATCH v21 5/5] Additional test for new GUC setting.

This patchset adds new GUC variable effective_io_block_size that
controls wheter WAL-skipped tables are finally WAL-logged or
fcync'ed. All of the TAP test performs WAL-logging so this adds an
item that performs file sync.
---
 src/test/recovery/t/018_wal_optimize.pl | 38 ++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
index b041121745..ba9185e2ba 100644
--- a/src/test/recovery/t/018_wal_optimize.pl
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -11,7 +11,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 26;
+use Test::More tests => 32;
 
 sub check_orphan_relfilenodes
 {
@@ -43,6 +43,8 @@ sub run_wal_optimize
     $node->append_conf('postgresql.conf', qq(
 wal_level = $wal_level
 max_prepared_transactions = 1
+wal_log_hints = on
+wal_skip_threshold = 0
 ));
     $node->start;
 
@@ -102,7 +104,23 @@ max_prepared_transactions = 1
     $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2a;");
     is($result, qq(1),
        "wal_level = $wal_level, optimized truncation with prepared transaction");
+    # Same for file sync mode
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql('postgres', "
+        SET wal_skip_threshold to 0;
+        BEGIN;
+        CREATE TABLE test2b (id serial PRIMARY KEY);
+        INSERT INTO test2b VALUES (DEFAULT);
+        TRUNCATE test2b;
+        INSERT INTO test2b VALUES (DEFAULT);
+        COMMIT;");
+
+    $node->stop('immediate');
+    $node->start;
 
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2b;");
+    is($result, qq(1),
+       "wal_level = $wal_level, optimized truncation with file-sync");
 
     # Data file for COPY query in follow-up tests.
     my $basedir = $node->basedir;
@@ -178,6 +196,24 @@ max_prepared_transactions = 1
     is($result, qq(3),
        "wal_level = $wal_level, SET TABLESPACE in subtransaction");
 
+    $node->safe_psql('postgres', "
+        BEGIN;
+        CREATE TABLE test3a5 (c int PRIMARY KEY);
+        SAVEPOINT q; INSERT INTO test3a5 VALUES (1); ROLLBACK TO q;
+        CHECKPOINT;
+        INSERT INTO test3a5 VALUES (1);  -- set index hint bit
+        INSERT INTO test3a5 VALUES (2);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->psql('postgres', );
+    my($ret, $stdout, $stderr) = $node->psql(
+        'postgres', "INSERT INTO test3a5 VALUES (2);");
+    is($ret, qq(3),
+       "wal_level = $wal_level, unique index LP_DEAD");
+    like($stderr, qr/violates unique/,
+       "wal_level = $wal_level, unique index LP_DEAD message");
+
     # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
     $node->safe_psql('postgres', "
         BEGIN;
-- 
2.23.0

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

25 октября 2019 г., 07:46:14

Ugh!

2019年10月25日(金) 13:13 Kyotaro Horiguchi <horikyota.ntt@gmail.com>:

that. Instead, In the attached, MarkBufferDirtyHint() asks storage.c
for sync-pending state of the relfilenode for the buffer. In the
attached patch (0003)
regards.

It's wrong that it also skips chnging flags.

I"ll fix it soon

Kyotaro Horiguchi

NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

25 октября 2019 г., 16:20:32

On Fri, Oct 25, 2019 at 1:13 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> Hello. Thanks for the comment.
>
> # Sorry in advance for possilbe breaking the thread.
>
> > MarkBufferDirtyHint() writes WAL even when rd_firstRelfilenodeSubid or
> > rd_createSubid is set; see attached test case.  It needs to skip WAL whenever
> > RelationNeedsWAL() returns false.
>
> Thanks for pointing out that. And the test patch helped me very much.
>
> Most of callers can tell that to the function, but SetHintBits()
> cannot easily. Rather I think we shouldn't even try to do
> that. Instead, In the attached, MarkBufferDirtyHint() asks storage.c
> for sync-pending state of the relfilenode for the buffer. In the
> attached patch (0003) RelFileNodeSkippingWAL loops over pendingSyncs
> but it is called only at the time FPW is added so I believe it doesn't
> affect performance so much. However, we can use hash for pendingSyncs
> instead of liked list. Anyway the change is in its own file
> v21-0003-Fix-MarkBufferDirtyHint.patch, which will be merged into
> 0002.
>
> AFAICS all XLogInsert is guarded by RelationNeedsWAL() or in the
> non-wal_minimal code paths.
>
> > Cylinder and track sizes are obsolete as user-visible concepts.  (They're not
> > onstant for a given drive, and I think modern disks provide no way to read
> > the relevant parameters.)  I like the name "wal_skip_threshold", and my second
>
> I strongly agree. Thanks for the draft. I used it as-is. I don't come
> up with an appropriate second description of the GUC so I just removed
> it.
>
> # it was "For rotating magnetic disks, it is around the size of a
> # track or sylinder."
>
> > the relevant parameters.)  I like the name "wal_skip_threshold", and
> > my second choice would be "wal_skip_min_size".  Possibly documented
> > as follows:
> ..
> > Any other opinions on the GUC name?
>
> I prefer the first candidate. I already used the terminology in
> storage.c and the name fits more to the context.
>
> > * We emit newpage WAL records for smaller size of relations.
> > *
> > * Small WAL records have a chance to be emitted at once along with
> > * other backends' WAL records. We emit WAL records instead of syncing
> > * for files that are smaller than a certain threshold expecting faster
> - * commit. The threshold is defined by the GUC effective_io_block_size.
> + * commit. The threshold is defined by the GUC wal_skip_threshold.

> It's wrong that it also skips changing flags.
> I"ll fix it soon

This is the fixed verison v22.

The attached are:

- v22-0001-TAP-test-for-copy-truncation-optimization.patch
  Same as v20, 21

- v22-0002-Fix-WAL-skipping-feature.patch
  GUC name changed. Same as v21.

- v22-0003-Fix-MarkBufferDirtyHint.patch
  PoC of fixing the function. will be merged into 0002. (New in v21,
fixed in v22)

- v21-0004-Documentation-for-wal_skip_threshold.patch
  GUC name and description changed. (Previous 0003, same as v21)

- v21-0005-Additional-test-for-new-GUC-setting.patch
  including adjusted version of wal-optimize-noah-tests-v3.patch
  Maybe test names need further adjustment. (Previous 0004, same as v21)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> I started pre-commit editing on 2019-10-28, and comment+README updates have
> been the largest part of that.  I'll check my edits against the things you
> list here, and I'll share on-list before committing.  I've now marked the CF
> entry Ready for Committer.

Having dedicated many days to that, I am attaching v24nm.  I know of two
remaining defects:

=== Defect 1: gistGetFakeLSN()

When I modified pg_regress.c to use wal_level=minimal for all suites,
src/test/isolation/specs/predicate-gist.spec failed the assertion in
gistGetFakeLSN().  One could reproduce the problem just by running this
sequence in psql:

          begin;
          create table gist_point_tbl(id int4, p point);
          create index gist_pointidx on gist_point_tbl using gist(p);
          insert into gist_point_tbl (id, p)
          select g, point(g*10, g*10) from generate_series(1, 1000) g;

I've included a wrong-in-general hack to make the test pass.  I see two main
options for fixing this:

(a) Introduce an empty WAL record that reserves an LSN and has no other
effect.  Make GiST use that for permanent relations that are skipping WAL.
Further optimizations are possible.  For example, we could use a backend-local
counter (like the one gistGetFakeLSN() uses for temp relations) until the
counter is greater a recent real LSN.  That optimization is probably too
clever, though it would make the new WAL record almost never appear.

(b) Exempt GiST from most WAL skipping.  GiST index build could still skip
WAL, but it would do its own smgrimmedsync() in addition to the one done at
commit.  Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly
other AM-independent code that skips WAL.

Overall, I like the cleanliness of (a).  The main argument for (b) is that it
ensures we have all the features to opt-out of WAL skipping, which could be
useful for out-of-tree index access methods.  (I think we currently have the
features for a tableam to do so, but not for an indexam to do so.)  Overall, I
lean toward (a).  Any other ideas or preferences?

=== Defect 2: repetitive work when syncing many relations

For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
sophisticated about optimizing the shared buffers scan.  Commit 279628a
introduced that, in 2013.  I think smgrDoPendingSyncs() should do likewise, to
further reduce the chance of causing performance regressions.  (One could,
however, work around the problem by raising wal_skip_threshold.)  Kyotaro, if
you agree, could you modify v24nm to implement that?

Notable changes in v24nm:

- Wrote section "Skipping WAL for New RelFileNode" in
  src/backend/access/transam/README to be the main source concerning the new
  coding rules.

- Updated numerous comments and doc sections.

- Eliminated the pendingSyncs list in favor of a "sync" field in
  pendingDeletes.  I mostly did this to eliminate the possibility of the lists
  getting out of sync.  This removed considerable parallel code for managing a
  second list at end-of-xact.  We now call smgrDoPendingSyncs() only when
  committing or preparing a top-level transaction.

- Whenever code sets an rd_*Subid field of a Relation, it must call
  EOXactListAdd().  swap_relation_files() was not doing so, so the field
  remained set during the next transaction.  I introduced
  RelationAssumeNewRelfilenode() to handle both tasks, and I located the call
  so it also affects the mapped relation case.

- In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild,
  rd_createSubid remained set.  (That happened before this patch, but it has
  been harmless.)  I fixed this in heap_create().

- Made smgrDoPendingSyncs() stop exempting FSM_FORKNUM.  A sync is necessary
  when checksums are enabled.  Observe the precedent that
  RelationCopyStorage() has not been exempting FSM_FORKNUM.

- Pass log_newpage_range() a "false" for page_std, for the same reason
  RelationCopyStorage() does.

- log_newpage_range() ignored its forkNum and page_std arguments, so we logged
  the wrong data for non-main forks.  Before this patch, callers always passed
  MAIN_FORKNUM and "true", hence the lack of complaints.

- Restored table_finish_bulk_insert(), though heapam no longer provides a
  callback.  The API is still well-defined, and other table AMs might have use
  for it.  Removing it feels like a separate proposal.

- Removed TABLE_INSERT_SKIP_WAL.  Any out-of-tree code using it should revisit
  itself in light of this patch.

- Fixed smgrDoPendingSyncs() to reinitialize total_blocks for each relation;
  it was overcounting.

- Made us skip WAL after SET TABLESPACE, like we do after CLUSTER.

- Moved the wal_skip_threshold docs from "Resource Consumption" -> "Disk" to
  "Write Ahead Log" -> "Settings", between similar settings
  wal_writer_flush_after and commit_delay.  The other place I considered was
  "Resource Consumption" -> "Asynchronous Behavior", due to the similarity of
  backend_flush_after.

- Gave each test a unique name.  Changed test table names to be descriptive,
  e.g. test7 became trunc_trig.

- Squashed all patches into one.  Split patches are good when one could
  reasonably choose to push a subset, but that didn't apply here.  I wouldn't
  push a GUC implementation without its documentation.  Since the tests fail
  without the main bug fix, I wouldn't push tests separately.

By the way, based on the comment at zheap_prepare_insert(), I expect zheap
will exempt itself from skipping WAL.  It may stop calling RelationNeedsWAL()
and instead test for RELPERSISTENCE_PERMANENT.

nm

Вложения

skip-wal-v24nm.patch

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

20 ноября 2019 г., 09:05:46

I'm in the benchmarking week..

Thanks for reviewing!.

At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah@leadboat.com> wrote in 
> On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> > I started pre-commit editing on 2019-10-28, and comment+README updates have
> > been the largest part of that.  I'll check my edits against the things you
> > list here, and I'll share on-list before committing.  I've now marked the CF
> > entry Ready for Committer.

I'll look into that soon.

By the way, before finalize this, I'd like to share the result of a
brief benchmarking.

First, I measured the direct effect of WAL skipping.
I measured the time required to do the following sequence for the
COMMIT-FPW-WAL case and COMMIT-fsync case. WAL and heap files are on
non-server spec HDD.

  BEGIN;
  TRUNCATE t;
  INSERT INTO t (SELECT a FROM generate_series(1, n) a);
  COMMIT;

REPLICA means the time with wal_level = replica
SYNC    means the time with wal_level = minimal and force file sync.
WAL     means the time with wal_level = minimal and force commit-WAL.
pages is the number of pages of the table.
(REPLICA comes from run.sh 1, SYNC/WAL comes from run.sh 2)

pages REPLICA    SYNC      WAL
    1:   144 ms   683 ms   217 ms
    3:   303 ms   995 ms   385 ms
    5:   271 ms  1007 ms   217 ms
   10:   157 ms  1043 ms   224 ms
   17:   189 ms  1007 ms   193 ms
   31:   202 ms  1091 ms   230 ms
   56:   265 ms  1175 ms   226 ms
  100:   510 ms  1307 ms   270 ms
  177:   790 ms  1523 ms   524 ms
  316:  1827 ms  1643 ms   719 ms
  562:  1904 ms  2109 ms  1148 ms
 1000:  3060 ms  2979 ms  2113 ms
 1778:  6077 ms  3945 ms  3618 ms
 3162: 13038 ms  7078 ms  6734 ms

There was a crossing point around 3000 pages. (bench1() finds that by
bisecting, run.sh 3).


With multiple sessions, the crossing point  but does not go so
small.

10 processes (run.pl 4 10) The numbers in parentheses are WAL[n]/WAL[n-1].
pages    SYNC     WAL
  316:  8436 ms  4694 ms
  562: 12067 ms  9627 ms (x2.1) # WAL wins
 1000: 19154 ms 43262 ms (x4.5) # SYNC wins. WAL's slope becomes steep.
 1778: 32495 ms 63863 ms (x1.4)

100 processes (run.pl 4 100)
pages    SYNC     WAL
   10: 13275 ms  1868 ms 
   17: 15919 ms  4438 ms (x2.3)
   31: 17063 ms  6431 ms (x1.5)
   56: 23193 ms 14276 ms (x2.2)  # WAL wins
  100: 35220 ms 67843 ms (x4.8)  # SYNC wins. WAL's slope becomes steep.

With 10 pgbench sessions.
pages   SYNC     WAL     
    1:   915 ms   301 ms
    3:  1634 ms   508 ms
    5:  1634 ms   293ms
   10:  1671 ms  1043 ms
   17:  1600 ms   333 ms
   31:  1864 ms   314 ms
   56:  1562 ms   448 ms
  100:  1538 ms   394 ms
  177:  1697 ms  1047 ms
  316:  3074 ms  1788 ms
  562:  3306 ms  1245 ms
 1000:  3440 ms  2182 ms
 1778:  5064 ms  6464 ms  # WAL's slope becomes steep
 3162:  8675 ms  8165 ms


I don't think the result of 100 processes is meaningful, so excluding
the result a candidate for wal_skip_threshold can be 1000.

Thoughts? The attached is the benchmark script.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#! /usr/bin/perl

use strict;
use IPC::Open2;
use Time::HiRes qw (gettimeofday tv_interval);

my $tupperpage = 226;

my @time = ();

sub bench {
    my ($header, $nprocs, $ntups, $threshold) = @_;
    my @result = ();
    my @rds = ();
    
    for (my $ip = 0 ; $ip < $nprocs ; $ip++)
    {
        pipe(my $rd, my $wr);
        $rds[$ip] = $rd;
        
        my $pid = fork();

        die "fork failed: $!\n" if ($pid < 0);
        if ($pid == 0)
        {
            close($rd);
            
            my $pid = open2(my $psqlrd, my $psqlwr, "psql postgres");
            print $psqlwr "SET wal_skip_threshold to $threshold;\n";
            print $psqlwr "DROP TABLE IF EXISTS t$ip;";
            print $psqlwr "CREATE TABLE t$ip (a int);\n";

            my @st = gettimeofday();
            for (my $i = 0 ; $i < 10 ; $i++)
            {
                print $psqlwr "BEGIN;";
                print $psqlwr "TRUNCATE t$ip;";
                print $psqlwr "INSERT INTO t$ip (SELECT a FROM generate_series(1, $ntups) a);";
                print $psqlwr "COMMIT;";
            }
            close($psqlwr);
            waitpid($pid, 0);

            print $wr $ip, " ", 1000 * tv_interval(\@st, [gettimeofday]), "\n";
            exit;
        }
        close($wr);
    }

    my $rpid;
    while (($rpid = wait()) == 0) {}

    my $sum = 0;
    for (my $ip = 0 ; $ip < $nprocs ; $ip++)
    {
        my $ret = readline($rds[$ip]);
        die "format? $ret\n" if ($ret !~ /^([0-9]+) ([0-9.]+)$/);

        $sum += $2;
    }

    printf "$header: procs $nprocs: time %.0f\n", $sum / $nprocs;
}

sub log10 { return log($_[0]) / log(10); }

# benchmark for wal_level = replica, the third parameter of bench
# doesn't affect
sub bench1
{
    print "benchmark for wal_level = replica\n";
    for (my $s = 0 ; $s <= 4 ; $s += 0.25)
    {
        my $ss = int(10 ** $s);
        bench("size $ss", 1, $ss * $tupperpage, $ss * 8);
    }
}

# benchmark for wal_level = minimal.
sub bench2
{
    print "benchmark for wal_level = minimal\n";
    for (my $s = 0 ; $s <= 3.5 ; $s += 0.25)
    {
        my $ss = int(10 ** $s);
        bench("size $ss: SYNC ", 1, $ss * $tupperpage, $ss * 8);
        bench("size $ss: WAL  ", 1, $ss * $tupperpage, ($ss + 1) * 8);
    }
}

# find crossing point of WAL and SYNC by bisecting
sub bench3
{
    print "find crossing point of WAL and SYNC by bisecting\n";
    bench("SYNC: size 0", 1, 1, 8);
    bench("WAL : size 0", 1, 1, 16);
    my $s = 1;
    my $st = 10000;
    while (1)
    {
        my $ts = bench("SYNC: size $s", $tupperpage * $s, $s * 8);
        my $tw = bench("WAL : size $s", $tupperpage * $s, ($s + 1) * 8);

        if ($st < 1.0){
            print "DONE\n";
            exit(0);
        }
        if ($ts > $tw)
        {
            $s += $st; $st /= 2;
        }
        else
        {
            $s -= $st; $st /= 2;
        }
    }
}

# benchmark with multiple processes
sub bench4
{
    my $nprocs = $ARGV[1];
    
    print "benchmark for wal_level = minimal, $nprocs processes\n";
    
    for (my $s = 1.0 ; $s <= 3.5 ; $s += 0.25)
    {
        my $ss = int(10 ** $s);
        bench("size $ss: SYNC ", $nprocs, $ss * $tupperpage, $ss * 8);
        bench("size $ss: WAL  ", $nprocs, $ss * $tupperpage, ($ss + 1) * 8);
    }
}


bench1() if ($ARGV[0] == 1);
bench2() if ($ARGV[0] == 2);
bench3() if ($ARGV[0] == 3);
bench4() if ($ARGV[0] == 4);

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

20 ноября 2019 г., 11:31:43

At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah@leadboat.com> wrote in 
> On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> > I started pre-commit editing on 2019-10-28, and comment+README updates have
> > been the largest part of that.  I'll check my edits against the things you
> > list here, and I'll share on-list before committing.  I've now marked the CF
> > entry Ready for Committer.

I looked the version.

> Notable changes in v24nm:
> 
> - Wrote section "Skipping WAL for New RelFileNode" in
>   src/backend/access/transam/README to be the main source concerning the new
>   coding rules.

Thanks for writing this.

+Prefer to do the same in future access methods.  However, two other approaches
+can work.  First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync().  Second, an access method can opt to write WAL
+unconditionally for permanent relations.  When using the second method, do not
+call RelationCopyStorage(), which skips WAL.

Even using these methods, TransactionCommit flushes out buffers then
sync files again. Isn't a description something like the following
needed?

===
Even an access method switched a in-transaction created relfilenode to
WAL-writing, Commit(Prepare)Transaction flushed all buffers for the
file then smgrimmedsync() the file.
===

> - Updated numerous comments and doc sections.
> 
> - Eliminated the pendingSyncs list in favor of a "sync" field in
>   pendingDeletes.  I mostly did this to eliminate the possibility of the lists
>   getting out of sync.  This removed considerable parallel code for managing a
>   second list at end-of-xact.  We now call smgrDoPendingSyncs() only when
>   committing or preparing a top-level transaction.

Mmm. Right. The second list was a trace of older versions, maybe that
needed additional works at rollback. Actually as of v23 the function
syncs no files at rollback. It is wiser to merging the two.

> - Whenever code sets an rd_*Subid field of a Relation, it must call
>   EOXactListAdd().  swap_relation_files() was not doing so, so the field
>   remained set during the next transaction.  I introduced
>   RelationAssumeNewRelfilenode() to handle both tasks, and I located the call
>   so it also affects the mapped relation case.

Ugh.. Thanks for pointing out. By the way

+    /*
+     * Recognize that rel1's relfilenode (swapped from rel2) is new in this
+     * subtransaction. Since the next step for rel2 is deletion, don't bother
+     * recording the newness of its relfilenode.
+     */
+    rel1 = relation_open(r1, AccessExclusiveLock);
+    RelationAssumeNewRelfilenode(rel1);

It cannot be accessed from other sessions. Theoretically it doesn't
need a lock but NoLock cannot be used there since there's a path that
doesn't take lock on the relation. But AEL seems too strong and it
causes unecessary side effect. Couldn't we use weaker locks?

... Time is up. I'll continue looking this.

regards.

> - In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild,
>   rd_createSubid remained set.  (That happened before this patch, but it has
>   been harmless.)  I fixed this in heap_create().
> 
> - Made smgrDoPendingSyncs() stop exempting FSM_FORKNUM.  A sync is necessary
>   when checksums are enabled.  Observe the precedent that
>   RelationCopyStorage() has not been exempting FSM_FORKNUM.
> 
> - Pass log_newpage_range() a "false" for page_std, for the same reason
>   RelationCopyStorage() does.
> 
> - log_newpage_range() ignored its forkNum and page_std arguments, so we logged
>   the wrong data for non-main forks.  Before this patch, callers always passed
>   MAIN_FORKNUM and "true", hence the lack of complaints.
> 
> - Restored table_finish_bulk_insert(), though heapam no longer provides a
>   callback.  The API is still well-defined, and other table AMs might have use
>   for it.  Removing it feels like a separate proposal.
> 
> - Removed TABLE_INSERT_SKIP_WAL.  Any out-of-tree code using it should revisit
>   itself in light of this patch.
> 
> - Fixed smgrDoPendingSyncs() to reinitialize total_blocks for each relation;
>   it was overcounting.
> 
> - Made us skip WAL after SET TABLESPACE, like we do after CLUSTER.
> 
> - Moved the wal_skip_threshold docs from "Resource Consumption" -> "Disk" to
>   "Write Ahead Log" -> "Settings", between similar settings
>   wal_writer_flush_after and commit_delay.  The other place I considered was
>   "Resource Consumption" -> "Asynchronous Behavior", due to the similarity of
>   backend_flush_after.
> 
> - Gave each test a unique name.  Changed test table names to be descriptive,
>   e.g. test7 became trunc_trig.
> 
> - Squashed all patches into one.  Split patches are good when one could
>   reasonably choose to push a subset, but that didn't apply here.  I wouldn't
>   push a GUC implementation without its documentation.  Since the tests fail
>   without the main bug fix, I wouldn't push tests separately.
> 
> By the way, based on the comment at zheap_prepare_insert(), I expect zheap
> will exempt itself from skipping WAL.  It may stop calling RelationNeedsWAL()
> and instead test for RELPERSISTENCE_PERMANENT.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

21 ноября 2019 г., 10:01:07

I should have replied this first.

At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah@leadboat.com> wrote in 
> On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> > I started pre-commit editing on 2019-10-28, and comment+README updates have
> > been the largest part of that.  I'll check my edits against the things you
> > list here, and I'll share on-list before committing.  I've now marked the CF
> > entry Ready for Committer.
> 
> Having dedicated many days to that, I am attaching v24nm.  I know of two
> remaining defects:
> 
> === Defect 1: gistGetFakeLSN()
> 
> When I modified pg_regress.c to use wal_level=minimal for all suites,
> src/test/isolation/specs/predicate-gist.spec failed the assertion in
> gistGetFakeLSN().  One could reproduce the problem just by running this
> sequence in psql:
> 
>           begin;
>           create table gist_point_tbl(id int4, p point);
>           create index gist_pointidx on gist_point_tbl using gist(p);
>           insert into gist_point_tbl (id, p)
>           select g, point(g*10, g*10) from generate_series(1, 1000) g;
> 
> I've included a wrong-in-general hack to make the test pass.  I see two main
> options for fixing this:
> 
> (a) Introduce an empty WAL record that reserves an LSN and has no other
> effect.  Make GiST use that for permanent relations that are skipping WAL.
> Further optimizations are possible.  For example, we could use a backend-local
> counter (like the one gistGetFakeLSN() uses for temp relations) until the
> counter is greater a recent real LSN.  That optimization is probably too
> clever, though it would make the new WAL record almost never appear.
> 
> (b) Exempt GiST from most WAL skipping.  GiST index build could still skip
> WAL, but it would do its own smgrimmedsync() in addition to the one done at
> commit.  Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
> RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly
> other AM-independent code that skips WAL.
> 
> Overall, I like the cleanliness of (a).  The main argument for (b) is that it
> ensures we have all the features to opt-out of WAL skipping, which could be
> useful for out-of-tree index access methods.  (I think we currently have the
> features for a tableam to do so, but not for an indexam to do so.)  Overall, I
> lean toward (a).  Any other ideas or preferences?

I don't like (b), too.

What we need there is any sequential numbers for page LSN but really
compatible with real LSN. Couldn't we use GetXLogInsertRecPtr() in the
case?  Or, I'm not sure but I suppose that nothing happenes when
UNLOGGED GiST index gets turned into LOGGED one.

Rewriting table like SET LOGGED will work but not realistic.

> === Defect 2: repetitive work when syncing many relations
> 
> For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
> smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
> sophisticated about optimizing the shared buffers scan.  Commit 279628a
> introduced that, in 2013.  I think smgrDoPendingSyncs() should do likewise, to
> further reduce the chance of causing performance regressions.  (One could,
> however, work around the problem by raising wal_skip_threshold.)  Kyotaro, if
> you agree, could you modify v24nm to implement that?

Seems reasonable. Please wait a minite.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 66c52d6dd6..387b1f7d18 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1017,8 +1017,7 @@ gistGetFakeLSN(Relation rel)
      * XXX before commit fix this.  This is not correct for
      * RELPERSISTENCE_PERMANENT, but it suffices to make tests pass.
      */
-    if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP
-        || rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+    if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
     {
         /*
          * Temporary relations are only accessible in our session, so a simple
@@ -1026,6 +1025,15 @@ gistGetFakeLSN(Relation rel)
          */
         return counter++;
     }
+    else if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+    {
+        /*
+         * Even though we are skipping WAL-logging of a permanent relations,
+         * the LSN must be a real one because WAL-logging starts after commit.
+         */
+        Assert(!RelationNeedsWAL(rel));
+        return GetXLogInsertRecPtr();
+    }
     else
     {
         /*

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

21 ноября 2019 г., 10:11:23

Wow.. This is embarrassing.. *^^*.

At Thu, 21 Nov 2019 16:01:07 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> I should have replied this first.
> 
> At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah@leadboat.com> wrote in 
> > On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> > > I started pre-commit editing on 2019-10-28, and comment+README updates have
> > > been the largest part of that.  I'll check my edits against the things you
> > > list here, and I'll share on-list before committing.  I've now marked the CF
> > > entry Ready for Committer.
> > 
> > Having dedicated many days to that, I am attaching v24nm.  I know of two
> > remaining defects:
> > 
> > === Defect 1: gistGetFakeLSN()
> > 
> > When I modified pg_regress.c to use wal_level=minimal for all suites,
> > src/test/isolation/specs/predicate-gist.spec failed the assertion in
> > gistGetFakeLSN().  One could reproduce the problem just by running this
> > sequence in psql:
> > 
> >           begin;
> >           create table gist_point_tbl(id int4, p point);
> >           create index gist_pointidx on gist_point_tbl using gist(p);
> >           insert into gist_point_tbl (id, p)
> >           select g, point(g*10, g*10) from generate_series(1, 1000) g;
> > 
> > I've included a wrong-in-general hack to make the test pass.  I see two main
> > options for fixing this:
> > 
> > (a) Introduce an empty WAL record that reserves an LSN and has no other
> > effect.  Make GiST use that for permanent relations that are skipping WAL.
> > Further optimizations are possible.  For example, we could use a backend-local
> > counter (like the one gistGetFakeLSN() uses for temp relations) until the
> > counter is greater a recent real LSN.  That optimization is probably too
> > clever, though it would make the new WAL record almost never appear.
> > 
> > (b) Exempt GiST from most WAL skipping.  GiST index build could still skip
> > WAL, but it would do its own smgrimmedsync() in addition to the one done at
> > commit.  Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
> > RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly
> > other AM-independent code that skips WAL.
> > 
> > Overall, I like the cleanliness of (a).  The main argument for (b) is that it
> > ensures we have all the features to opt-out of WAL skipping, which could be
> > useful for out-of-tree index access methods.  (I think we currently have the
> > features for a tableam to do so, but not for an indexam to do so.)  Overall, I
> > lean toward (a).  Any other ideas or preferences?
> 
> I don't like (b), too.
> 
> What we need there is any sequential numbers for page LSN but really
> compatible with real LSN. Couldn't we use GetXLogInsertRecPtr() in the

> case?  Or, I'm not sure but I suppose that nothing happenes when
> UNLOGGED GiST index gets turned into LOGGED one.

Yes, I just forgot to remove these lines when writing the following.

> Rewriting table like SET LOGGED will work but not realistic.
> 
> > === Defect 2: repetitive work when syncing many relations
> > 
> > For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
> > smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
> > sophisticated about optimizing the shared buffers scan.  Commit 279628a
> > introduced that, in 2013.  I think smgrDoPendingSyncs() should do likewise, to
> > further reduce the chance of causing performance regressions.  (One could,
> > however, work around the problem by raising wal_skip_threshold.)  Kyotaro, if
> > you agree, could you modify v24nm to implement that?
> 
> Seems reasonable. Please wait a minite.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

21 ноября 2019 г., 13:48:58

At Thu, 21 Nov 2019 16:01:07 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
> > smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
> > sophisticated about optimizing the shared buffers scan.  Commit 279628a
> > introduced that, in 2013.  I think smgrDoPendingSyncs() should do likewise, to
> Seems reasonable. Please wait a minite.

This is the first cut of that. This makes the function FlushRelationBuffersWithoutRelcache useless, which was
introducedin this work. The first patch reverts it, then the second patch adds the bulk sync feature.
 

The new function FlushRelFileNodesAllBuffers, differently from
DropRelFileNodesAllBuffers, takes SMgrRelation which is required by
FlushBuffer(). So it takes somewhat tricky way, where type
SMgrSortArray pointer to which is compatible with RelFileNode is used.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From c51b44734d88fb19b568c4c0240848c8be2b7cf4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 19:28:35 +0900
Subject: [PATCH 1/2] Revert FlushRelationBuffersWithoutRelcache.

Succeeding patch makes the function useless and the function is no
longer useful globally. Revert it.
---
 src/backend/storage/buffer/bufmgr.c | 27 ++++++++++-----------------
 src/include/storage/bufmgr.h        |  2 --
 2 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 746ce477fc..67bbb26cae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3203,27 +3203,20 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    RelationOpenSmgr(rel);
-
-    FlushRelationBuffersWithoutRelcache(rel->rd_smgr,
-                                        RelationUsesLocalBuffers(rel));
-}
-
-void
-FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
-{
-    RelFileNode rnode = smgr->smgr_rnode.node;
-    int i;
+    int            i;
     BufferDesc *bufHdr;
 
-    if (islocal)
+    /* Open rel at the smgr level if not already done */
+    RelationOpenSmgr(rel);
+
+    if (RelationUsesLocalBuffers(rel))
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3240,7 +3233,7 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(smgr,
+                smgrwrite(rel->rd_smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3270,18 +3263,18 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, smgr);
+            FlushBuffer(bufHdr, rel->rd_smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 8097d5ab22..8cd1cf25d9 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -192,8 +192,6 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                                    ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
-extern void FlushRelationBuffersWithoutRelcache(struct SMgrRelationData *smgr,
-                                                bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                                    int nforks, BlockNumber *firstDelBlock);
-- 
2.23.0

From 882731fcf063269d0bf85c57f23c83b9570e5df5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 19:33:18 +0900
Subject: [PATCH 2/2] Improve the performance of relation syncs.

We can improve performance of syncing multiple files at once in the
same way as b41669118. This reduces the number of scans on the whole
shared_bufffers from the number of synced relations to one.
---
 src/backend/catalog/storage.c       |  28 +++++--
 src/backend/storage/buffer/bufmgr.c | 113 ++++++++++++++++++++++++++++
 src/backend/storage/smgr/smgr.c     |  38 +++++++++-
 src/include/storage/bufmgr.h        |   1 +
 src/include/storage/smgr.h          |   1 +
 5 files changed, 174 insertions(+), 7 deletions(-)

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 51c233dac6..65811b2a9e 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -533,6 +533,9 @@ smgrDoPendingSyncs(void)
 {
     PendingRelDelete *pending;
     HTAB    *delhash = NULL;
+    int            nrels = 0,
+                maxrels = 0;
+    SMgrRelation *srels = NULL;
 
     if (XLogIsNeeded())
         return;  /* no relation can use this */
@@ -573,7 +576,7 @@ smgrDoPendingSyncs(void)
 
     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
     {
-        bool to_be_removed = false; /* don't sync if aborted */
+        bool to_be_removed = false;
         ForkNumber fork;
         BlockNumber nblocks[MAX_FORKNUM + 1];
         BlockNumber total_blocks = 0;
@@ -623,14 +626,21 @@ smgrDoPendingSyncs(void)
          */
         if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
         {
-            /* Flush all buffers then sync the file */
-            FlushRelationBuffersWithoutRelcache(srel, false);
+            /* relations to sync are passed to smgrdosyncall at once */
 
-            for (fork = 0; fork <= MAX_FORKNUM; fork++)
+            /* allocate the initial array, or extend it, if needed */
+            if (maxrels == 0)
             {
-                if (smgrexists(srel, fork))
-                    smgrimmedsync(srel, fork);
+                maxrels = 8;
+                srels = palloc(sizeof(SMgrRelation) * maxrels);
             }
+            else if (maxrels <= nrels)
+            {
+                maxrels *= 2;
+                srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+            }
+
+            srels[nrels++] = srel;
         }
         else
         {
@@ -658,6 +668,12 @@ smgrDoPendingSyncs(void)
 
     if (delhash)
         hash_destroy(delhash);
+
+    if (nrels > 0)
+    {
+        smgrdosyncall(srels, nrels);
+        pfree(srels);
+    }
 }
 
 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 67bbb26cae..56314653ae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -105,6 +105,19 @@ typedef struct CkptTsStatus
     int            index;
 } CkptTsStatus;
 
+/*
+ * Type for array used to sort SMgrRelations
+ *
+ * FlushRelFileNodesAllBuffers shares the same comparator function with
+ * DropRelFileNodeBuffers. Pointer to this struct and RelFileNode must
+ * be compatible.
+ */
+typedef struct SMgrSortArray
+{
+    RelFileNode        rnode;    /* This must be the first member */
+    SMgrRelation    srel;
+} SMgrSortArray;
+
 /* GUC variables */
 bool        zero_damaged_pages = false;
 int            bgwriter_lru_maxpages = 100;
@@ -3283,6 +3296,106 @@ FlushRelationBuffers(Relation rel)
     }
 }
 
+/* ---------------------------------------------------------------------
+ *        FlushRelFileNodesAllBuffers
+ *
+ *        This function flushes out the buffer pool all the pages of all
+ *      forks of the specified smgr relations.  It's equivalent to
+ *      calling FlushRelationBuffers once per fork per relation, but the
+ *      parameter is not Relation but SMgrRelation
+ * --------------------------------------------------------------------
+ */
+void
+FlushRelFileNodesAllBuffers(SMgrRelation *smgrs, int nrels)
+{
+    int            i;
+    SMgrSortArray *srels;
+    bool        use_bsearch;
+
+    if (nrels == 0)
+        return;
+
+    /* fill-in array for qsort */
+    srels = palloc(sizeof(SMgrSortArray) * nrels);
+
+    for (i = 0 ; i < nrels ; i++)
+    {
+        Assert (!RelFileNodeBackendIsTemp(smgrs[i]->smgr_rnode));
+
+        srels[i].rnode = smgrs[i]->smgr_rnode.node;
+        srels[i].srel  = smgrs[i];
+    }
+
+    /*
+     * Save the bsearch overhead for low number of relations to
+     * sync. See DropRelFileNodesAllBuffers for details. The name DROP_*
+     * is for historical reasons.
+     */
+    use_bsearch = nrels > DROP_RELS_BSEARCH_THRESHOLD;
+
+    /* sort the list of SMgrRelations if necessary */
+    if (use_bsearch)
+        pg_qsort(srels, nrels, sizeof(SMgrSortArray), rnode_comparator);
+
+    /* Make sure we can handle the pin inside the loop */
+    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+    for (i = 0; i < NBuffers; i++)
+    {
+        SMgrSortArray *srelent = NULL;
+        BufferDesc *bufHdr = GetBufferDescriptor(i);
+        uint32        buf_state;
+
+        /*
+         * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
+         * and saves some cycles.
+         */
+
+        if (!use_bsearch)
+        {
+            int            j;
+
+            for (j = 0; j < nrels; j++)
+            {
+                if (RelFileNodeEquals(bufHdr->tag.rnode, srels[j].rnode))
+                {
+                    srelent = &srels[j];
+                    break;
+                }
+            }
+
+        }
+        else
+        {
+            srelent = bsearch((const void *) &(bufHdr->tag.rnode),
+                              srels, nrels, sizeof(SMgrSortArray),
+                              rnode_comparator);
+        }
+
+        /* buffer doesn't belong to any of the given relfilenodes; skip it */
+        if (srelent == NULL)
+            continue;
+
+        /* Ensure there's a free array slot for PinBuffer_Locked */
+        ReservePrivateRefCountEntry();
+
+        buf_state = LockBufHdr(bufHdr);
+        if (RelFileNodeEquals(bufHdr->tag.rnode, srelent->rnode) &&
+            (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+        {
+            PinBuffer_Locked(bufHdr);
+            LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+            FlushBuffer(bufHdr, srelent->srel);
+            LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+            UnpinBuffer(bufHdr, true);
+        }
+        else
+            UnlockBufHdr(bufHdr, buf_state);
+    }
+
+    pfree(srels);
+}
+
 /* ---------------------------------------------------------------------
  *        FlushDatabaseBuffers
  *
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index b50c69b438..f79f2df40f 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -388,6 +388,43 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
     smgrsw[which].smgr_unlink(rnode, InvalidForkNumber, isRedo);
 }
 
+/*
+ *    smgrdosyncall() -- Immediately sync all forks of all given relations
+ *
+ *        All forks of all given relations are syncd out to the store.
+ *
+ *        This is equivalent to flusing all buffers FlushRelationBuffers for each
+ *        smgr relation then calling smgrimmedsync for all forks of each smgr
+ *        relation, but it's significantly quicker so should be preferred when
+ *        possible.
+ */
+void
+smgrdosyncall(SMgrRelation *rels, int nrels)
+{
+    int            i = 0;
+    ForkNumber    forknum;
+
+    if (nrels == 0)
+        return;
+
+    /* We need to flush all buffers for the relations before sync. */
+    FlushRelFileNodesAllBuffers(rels, nrels);
+
+    /*
+     * Sync the physical file(s).
+     */
+    for (i = 0; i < nrels; i++)
+    {
+        int            which = rels[i]->smgr_which;
+
+        for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+        {
+            if (smgrsw[which].smgr_exists(rels[i], forknum))
+                smgrsw[which].smgr_immedsync(rels[i], forknum);
+        }
+    }
+}
+
 /*
  *    smgrdounlinkall() -- Immediately unlink all forks of all given relations
  *
@@ -469,7 +506,6 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     pfree(rnodes);
 }
 
-
 /*
  *    smgrextend() -- Add a new block to a file.
  *
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 8cd1cf25d9..3f85e8c6fe 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -195,6 +195,7 @@ extern void FlushRelationBuffers(Relation rel);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                                    int nforks, BlockNumber *firstDelBlock);
+extern void FlushRelFileNodesAllBuffers(struct SMgrRelationData **smgrs, int nrels);
 extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes);
 extern void DropDatabaseBuffers(Oid dbid);
 
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 1543d8d870..31a5ecd059 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -89,6 +89,7 @@ extern void smgrcloseall(void);
 extern void smgrclosenode(RelFileNodeBackend rnode);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdounlink(SMgrRelation reln, bool isRedo);
+extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
                        BlockNumber blocknum, char *buffer, bool skipFsync);
-- 
2.23.0

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Peter Eisentraut

Дата:

22 ноября 2019 г., 15:21:31

On 2019-11-05 22:16, Robert Haas wrote:
> First, I'd like to restate my understanding of the problem just to see
> whether I've got the right idea and whether we're all on the same
> page. When wal_level=minimal, we sometimes try to skip WAL logging on
> newly-created relations in favor of fsync-ing the relation at commit
> time.

How useful is this behavior, relative to all the effort required?

Even if the benefit is significant, how many users can accept running 
with wal_level=minimal and thus without replication or efficient backups?

Is there perhaps an alternative approach involving unlogged tables to 
get a similar performance benefit?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

23 ноября 2019 г., 19:35:09

On Fri, Nov 22, 2019 at 01:21:31PM +0100, Peter Eisentraut wrote:
> On 2019-11-05 22:16, Robert Haas wrote:
> >First, I'd like to restate my understanding of the problem just to see
> >whether I've got the right idea and whether we're all on the same
> >page. When wal_level=minimal, we sometimes try to skip WAL logging on
> >newly-created relations in favor of fsync-ing the relation at commit
> >time.
> 
> How useful is this behavior, relative to all the effort required?
> 
> Even if the benefit is significant, how many users can accept running with
> wal_level=minimal and thus without replication or efficient backups?

That longstanding optimization is too useful to remove, but likely not useful
enough to add today if we didn't already have it.  The initial-data-load use
case remains plausible.  I can also imagine using wal_level=minimal for data
warehouse applications where one can quickly rebuild from the authoritative
data.

> Is there perhaps an alternative approach involving unlogged tables to get a
> similar performance benefit?

At wal_level=replica, it seems inevitable that ALTER TABLE SET LOGGED will
need to WAL-log the table contents.  I suppose we could keep wal_level=minimal
and change its only difference from wal_level=replica to be that ALTER TABLE
SET LOGGED skips WAL.  Currently, ALTER TABLE SET LOGGED also rewrites the
table; that would need to change.  I'd want to add ALTER INDEX SET LOGGED,
too.  After all that, users would need to modify their applications.  Overall,
it's possible, but it's not a clear win over the status quo.

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

24 ноября 2019 г., 00:21:36

On Wed, Nov 20, 2019 at 03:05:46PM +0900, Kyotaro Horiguchi wrote:
> By the way, before finalize this, I'd like to share the result of a
> brief benchmarking.

What non-default settings did you use?  Please give the output of this or a
similar command:

  select name, setting from pg_settings where setting <> boot_val;

If you run more benchmarks and weren't already using wal_buffers=16MB, I
recommend using it.

> With 10 pgbench sessions.
> pages   SYNC     WAL     
>     1:   915 ms   301 ms
>     3:  1634 ms   508 ms
>     5:  1634 ms   293ms
>    10:  1671 ms  1043 ms
>    17:  1600 ms   333 ms
>    31:  1864 ms   314 ms
>    56:  1562 ms   448 ms
>   100:  1538 ms   394 ms
>   177:  1697 ms  1047 ms
>   316:  3074 ms  1788 ms
>   562:  3306 ms  1245 ms
>  1000:  3440 ms  2182 ms
>  1778:  5064 ms  6464 ms  # WAL's slope becomes steep
>  3162:  8675 ms  8165 ms

For picking a default wal_skip_threshold, it would have been more informative
to see how this changes pgbench latency statistics.  Some people want DDL to
be fast, but more people want DDL not to reduce the performance of concurrent
non-DDL.  This benchmark procedure may help:

1. Determine $DDL_COUNT, a number of DDL transactions that take about one
   minute when done via syncs.
2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
3. Wait 10s.
4. Start one DDL backend that runs $DDL_COUNT transactions.
5. Save DDL start timestamp, DDL end timestamp, and pgbench output.

I would compare pgbench tps and latency between the seconds when DDL is and is
not running.  As you did in earlier tests, I would repeat it using various
page counts, with and without sync.

On Wed, Nov 20, 2019 at 05:31:43PM +0900, Kyotaro Horiguchi wrote:
> +Prefer to do the same in future access methods.  However, two other approaches
> +can work.  First, an access method can irreversibly transition a given fork
> +from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
> +smgrimmedsync().  Second, an access method can opt to write WAL
> +unconditionally for permanent relations.  When using the second method, do not
> +call RelationCopyStorage(), which skips WAL.
> 
> Even using these methods, TransactionCommit flushes out buffers then
> sync files again. Isn't a description something like the following
> needed?
> 
> ===
> Even an access method switched a in-transaction created relfilenode to
> WAL-writing, Commit(Prepare)Transaction flushed all buffers for the
> file then smgrimmedsync() the file.
> ===

It is enough that the text says to prefer the approach that core access
methods use.  The extra flush and sync when using a non-preferred approach
wastes some performance, but it is otherwise harmless.

> +    rel1 = relation_open(r1, AccessExclusiveLock);
> +    RelationAssumeNewRelfilenode(rel1);
> 
> It cannot be accessed from other sessions. Theoretically it doesn't
> need a lock but NoLock cannot be used there since there's a path that
> doesn't take lock on the relation. But AEL seems too strong and it
> causes unecessary side effect. Couldn't we use weaker locks?

We could use NoLock.  I assumed we already hold AccessExclusiveLock, in which
case this has no side effects.

On Thu, Nov 21, 2019 at 04:01:07PM +0900, Kyotaro Horiguchi wrote:
> At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah@leadboat.com> wrote in 
> > === Defect 1: gistGetFakeLSN()
> > 
> > When I modified pg_regress.c to use wal_level=minimal for all suites,
> > src/test/isolation/specs/predicate-gist.spec failed the assertion in
> > gistGetFakeLSN().  One could reproduce the problem just by running this
> > sequence in psql:
> > 
> >           begin;
> >           create table gist_point_tbl(id int4, p point);
> >           create index gist_pointidx on gist_point_tbl using gist(p);
> >           insert into gist_point_tbl (id, p)
> >           select g, point(g*10, g*10) from generate_series(1, 1000) g;
> > 
> > I've included a wrong-in-general hack to make the test pass.  I see two main
> > options for fixing this:
> > 
> > (a) Introduce an empty WAL record that reserves an LSN and has no other
> > effect.  Make GiST use that for permanent relations that are skipping WAL.
> > Further optimizations are possible.  For example, we could use a backend-local
> > counter (like the one gistGetFakeLSN() uses for temp relations) until the
> > counter is greater a recent real LSN.  That optimization is probably too
> > clever, though it would make the new WAL record almost never appear.
> > 
> > (b) Exempt GiST from most WAL skipping.  GiST index build could still skip
> > WAL, but it would do its own smgrimmedsync() in addition to the one done at
> > commit.  Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
> > RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly
> > other AM-independent code that skips WAL.
> > 
> > Overall, I like the cleanliness of (a).  The main argument for (b) is that it
> > ensures we have all the features to opt-out of WAL skipping, which could be
> > useful for out-of-tree index access methods.  (I think we currently have the
> > features for a tableam to do so, but not for an indexam to do so.)  Overall, I
> > lean toward (a).  Any other ideas or preferences?
> 
> I don't like (b), too.
> 
> What we need there is any sequential numbers for page LSN but really
> compatible with real LSN. Couldn't we use GetXLogInsertRecPtr() in the
> case?

No.  If nothing is inserting WAL, GetXLogInsertRecPtr() does not increase.
GiST pages need an increasing LSN value.


I noticed an additional defect:

BEGIN;
CREATE TABLE t (c) AS SELECT 1;
CHECKPOINT; -- write and fsync the table's one page
TRUNCATE t; -- no WAL
COMMIT; -- no FPI, just the commit record

If we crash after the COMMIT and before the next fsync or OS-elected sync of
the table's file, the table will stay on disk with its pre-TRUNCATE content.

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

24 ноября 2019 г., 15:53:16

On Sat, Nov 23, 2019 at 11:35:09AM -0500, Noah Misch wrote:
> That longstanding optimization is too useful to remove, but likely not useful
> enough to add today if we didn't already have it.  The initial-data-load use
> case remains plausible.  I can also imagine using wal_level=minimal for data
> warehouse applications where one can quickly rebuild from the authoritative
> data.

I can easily imagine cases where a user would like to use the benefit
of the optimization for an initial data load, and afterwards update
wal_level to replica so as they avoid the initial WAL burst which
serves no real purpose.  So the first argument is pretty strong IMO,
the second much less.
--
Michael

Вложения

signature.asc

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

25 ноября 2019 г., 05:08:54

At Sat, 23 Nov 2019 16:21:36 -0500, Noah Misch <noah@leadboat.com> wrote in 
> On Wed, Nov 20, 2019 at 03:05:46PM +0900, Kyotaro Horiguchi wrote:
> > By the way, before finalize this, I'd like to share the result of a
> > brief benchmarking.
> 
> What non-default settings did you use?  Please give the output of this or a
> similar command:

Only wal_level=minimal and max_wal_senders=0.

>   select name, setting from pg_settings where setting <> boot_val;
> 
> If you run more benchmarks and weren't already using wal_buffers=16MB, I
> recommend using it.

Roger.

> > With 10 pgbench sessions.
> > pages   SYNC     WAL     
> >     1:   915 ms   301 ms
> >     3:  1634 ms   508 ms
> >     5:  1634 ms   293ms
> >    10:  1671 ms  1043 ms
> >    17:  1600 ms   333 ms
> >    31:  1864 ms   314 ms
> >    56:  1562 ms   448 ms
> >   100:  1538 ms   394 ms
> >   177:  1697 ms  1047 ms
> >   316:  3074 ms  1788 ms
> >   562:  3306 ms  1245 ms
> >  1000:  3440 ms  2182 ms
> >  1778:  5064 ms  6464 ms  # WAL's slope becomes steep
> >  3162:  8675 ms  8165 ms
> 
> For picking a default wal_skip_threshold, it would have been more informative
> to see how this changes pgbench latency statistics.  Some people want DDL to
> be fast, but more people want DDL not to reduce the performance of concurrent
> non-DDL.  This benchmark procedure may help:
> 
> 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
>    minute when done via syncs.
> 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> 3. Wait 10s.
> 4. Start one DDL backend that runs $DDL_COUNT transactions.
> 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
> 
> I would compare pgbench tps and latency between the seconds when DDL is and is
> not running.  As you did in earlier tests, I would repeat it using various
> page counts, with and without sync.

I understood the "DDL" is not pure DDLs but a kind of
define-then-load, like "CREATE TABLE AS" , "CREATE TABLE" then "COPY
FROM".

> On Wed, Nov 20, 2019 at 05:31:43PM +0900, Kyotaro Horiguchi wrote:
> > +Prefer to do the same in future access methods.  However, two other approaches
> > +can work.  First, an access method can irreversibly transition a given fork
> > +from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
> > +smgrimmedsync().  Second, an access method can opt to write WAL
> > +unconditionally for permanent relations.  When using the second method, do not
> > +call RelationCopyStorage(), which skips WAL.
> > 
> > Even using these methods, TransactionCommit flushes out buffers then
> > sync files again. Isn't a description something like the following
> > needed?
> > 
> > ===
> > Even an access method switched a in-transaction created relfilenode to
> > WAL-writing, Commit(Prepare)Transaction flushed all buffers for the
> > file then smgrimmedsync() the file.
> > ===
> 
> It is enough that the text says to prefer the approach that core access
> methods use.  The extra flush and sync when using a non-preferred approach
> wastes some performance, but it is otherwise harmless.

Ah, right and I agreed.

> > +    rel1 = relation_open(r1, AccessExclusiveLock);
> > +    RelationAssumeNewRelfilenode(rel1);
> > 
> > It cannot be accessed from other sessions. Theoretically it doesn't
> > need a lock but NoLock cannot be used there since there's a path that
> > doesn't take lock on the relation. But AEL seems too strong and it
> > causes unecessary side effect. Couldn't we use weaker locks?
> 
> We could use NoLock.  I assumed we already hold AccessExclusiveLock, in which
> case this has no side effects.

I forgot that this optimization is used only in non-replication
configuragion. So I agree that AEL doesn't have no side
effect.

> On Thu, Nov 21, 2019 at 04:01:07PM +0900, Kyotaro Horiguchi wrote:
> > At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah@leadboat.com> wrote in 
> > > === Defect 1: gistGetFakeLSN()
> > > 
> > > When I modified pg_regress.c to use wal_level=minimal for all suites,
> > > src/test/isolation/specs/predicate-gist.spec failed the assertion in
> > > gistGetFakeLSN().  One could reproduce the problem just by running this
> > > sequence in psql:
> > > 
> > >           begin;
> > >           create table gist_point_tbl(id int4, p point);
> > >           create index gist_pointidx on gist_point_tbl using gist(p);
> > >           insert into gist_point_tbl (id, p)
> > >           select g, point(g*10, g*10) from generate_series(1, 1000) g;
> > > 
> > > I've included a wrong-in-general hack to make the test pass.  I see two main
> > > options for fixing this:
> > > 
> > > (a) Introduce an empty WAL record that reserves an LSN and has no other
> > > effect.  Make GiST use that for permanent relations that are skipping WAL.
> > > Further optimizations are possible.  For example, we could use a backend-local
> > > counter (like the one gistGetFakeLSN() uses for temp relations) until the
> > > counter is greater a recent real LSN.  That optimization is probably too
> > > clever, though it would make the new WAL record almost never appear.
> > > 
> > > (b) Exempt GiST from most WAL skipping.  GiST index build could still skip
> > > WAL, but it would do its own smgrimmedsync() in addition to the one done at
> > > commit.  Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
> > > RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly
> > > other AM-independent code that skips WAL.
> > > 
> > > Overall, I like the cleanliness of (a).  The main argument for (b) is that it
> > > ensures we have all the features to opt-out of WAL skipping, which could be
> > > useful for out-of-tree index access methods.  (I think we currently have the
> > > features for a tableam to do so, but not for an indexam to do so.)  Overall, I
> > > lean toward (a).  Any other ideas or preferences?
> > 
> > I don't like (b), too.
> > 
> > What we need there is any sequential numbers for page LSN but really
> > compatible with real LSN. Couldn't we use GetXLogInsertRecPtr() in the
> > case?
> 
> No.  If nothing is inserting WAL, GetXLogInsertRecPtr() does not increase.
> GiST pages need an increasing LSN value.

Sorry, I noticed that after the mail went out. I agree to (a) and will
do that.

> I noticed an additional defect:
> 
> BEGIN;
> CREATE TABLE t (c) AS SELECT 1;
> CHECKPOINT; -- write and fsync the table's one page
> TRUNCATE t; -- no WAL
> COMMIT; -- no FPI, just the commit record
> 
> If we crash after the COMMIT and before the next fsync or OS-elected sync of
> the table's file, the table will stay on disk with its pre-TRUNCATE content.

The TRUNCATE replaces relfilenode in the catalog and the pre-TRUNCATE
content wouldn't be seen after COMMIT.  Since the file has no pages,
it's right that no FPI emitted. What we should make sure the empty
file's metadata is synced out. But I think that kind of failure
shoudn't happen on modern file systems. If we don't rely on such
behavior, we can make sure thhat by turning the zero-pages case from
WAL into file sync. I'll do that in the next version.

I'll post the next version as a single patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

25 ноября 2019 г., 06:08:39

On Mon, Nov 25, 2019 at 11:08:54AM +0900, Kyotaro Horiguchi wrote:
> At Sat, 23 Nov 2019 16:21:36 -0500, Noah Misch <noah@leadboat.com> wrote in 
> > This benchmark procedure may help:
> > 
> > 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> >    minute when done via syncs.
> > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > 3. Wait 10s.
> > 4. Start one DDL backend that runs $DDL_COUNT transactions.
> > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
> > 
> > I would compare pgbench tps and latency between the seconds when DDL is and is
> > not running.  As you did in earlier tests, I would repeat it using various
> > page counts, with and without sync.
> 
> I understood the "DDL" is not pure DDLs but a kind of
> define-then-load, like "CREATE TABLE AS" , "CREATE TABLE" then "COPY
> FROM".

When I wrote "DDL", I meant the four-command transaction that you already used
in benchmarks.

> > I noticed an additional defect:
> > 
> > BEGIN;
> > CREATE TABLE t (c) AS SELECT 1;
> > CHECKPOINT; -- write and fsync the table's one page
> > TRUNCATE t; -- no WAL
> > COMMIT; -- no FPI, just the commit record
> > 
> > If we crash after the COMMIT and before the next fsync or OS-elected sync of
> > the table's file, the table will stay on disk with its pre-TRUNCATE content.
> 
> The TRUNCATE replaces relfilenode in the catalog

No, it does not.  Since the relation is new in the transaction, the TRUNCATE
uses the heap_truncate_one_rel() strategy.

> Since the file has no pages, it's right that no FPI emitted.

Correct.

> If we don't rely on such
> behavior, we can make sure thhat by turning the zero-pages case from
> WAL into file sync. I'll do that in the next version.

The zero-pages case is not special.  Here's an example of the problem with a
nonzero size:

BEGIN;
CREATE TABLE t (c) AS SELECT * FROM generate_series(1,100000);
CHECKPOINT; -- write and fsync the table's many pages
TRUNCATE t; -- no WAL
INSERT INTO t VALUES (0); -- no WAL
COMMIT; -- FPI for one page; nothing removes the additional pages

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Robert Haas

Дата:

25 ноября 2019 г., 23:58:14

On Sat, Nov 23, 2019 at 4:21 PM Noah Misch <noah@leadboat.com> wrote:
> I noticed an additional defect:
>
> BEGIN;
> CREATE TABLE t (c) AS SELECT 1;
> CHECKPOINT; -- write and fsync the table's one page
> TRUNCATE t; -- no WAL
> COMMIT; -- no FPI, just the commit record
>
> If we crash after the COMMIT and before the next fsync or OS-elected sync of
> the table's file, the table will stay on disk with its pre-TRUNCATE content.

Shouldn't the TRUNCATE be triggering an fsync() to happen before
COMMIT is permitted to complete? You'd have the same problem if the
TRUNCATE were replaced by INSERT, unless fsync() happens in that case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

26 ноября 2019 г., 00:50:25

On Mon, Nov 25, 2019 at 03:58:14PM -0500, Robert Haas wrote:
> On Sat, Nov 23, 2019 at 4:21 PM Noah Misch <noah@leadboat.com> wrote:
> > I noticed an additional defect:
> >
> > BEGIN;
> > CREATE TABLE t (c) AS SELECT 1;
> > CHECKPOINT; -- write and fsync the table's one page
> > TRUNCATE t; -- no WAL
> > COMMIT; -- no FPI, just the commit record
> >
> > If we crash after the COMMIT and before the next fsync or OS-elected sync of
> > the table's file, the table will stay on disk with its pre-TRUNCATE content.
> 
> Shouldn't the TRUNCATE be triggering an fsync() to happen before
> COMMIT is permitted to complete?

With wal_skip_threshold=0, you do get an fsync().  The patch tries to avoid
at-commit fsync of small files by WAL-logging file contents instead.  However,
the patch doesn't WAL-log enough to handle files that decreased in size.

> You'd have the same problem if the
> TRUNCATE were replaced by INSERT, unless fsync() happens in that case.

I think an insert would be fine.  You'd get an FPI record for the relation's
one page, which fully reproduces the relation.

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

26 ноября 2019 г., 15:37:52

At Sun, 24 Nov 2019 22:08:39 -0500, Noah Misch <noah@leadboat.com> wrote in 
> On Mon, Nov 25, 2019 at 11:08:54AM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 23 Nov 2019 16:21:36 -0500, Noah Misch <noah@leadboat.com> wrote in 
> > > I noticed an additional defect:
> > > 
> > > BEGIN;
> > > CREATE TABLE t (c) AS SELECT 1;
> > > CHECKPOINT; -- write and fsync the table's one page
> > > TRUNCATE t; -- no WAL
> > > COMMIT; -- no FPI, just the commit record
> > > 
> > > If we crash after the COMMIT and before the next fsync or OS-elected sync of
> > > the table's file, the table will stay on disk with its pre-TRUNCATE content.
> > 
> > The TRUNCATE replaces relfilenode in the catalog
> 
> No, it does not.  Since the relation is new in the transaction, the TRUNCATE
> uses the heap_truncate_one_rel() strategy.
..
> The zero-pages case is not special.  Here's an example of the problem with a
> nonzero size:

I got it. That is, if the file has had blocks beyond the size at
commit, we should sync the file even if it is small enough. It nees to
track beore-trunction size as this patch used to have.

pendingSyncHash is resurrected to do truncate-size tracking. That
information cannot be stored in SMgrRelation, which will be dissapper
by invalidation, or Relation, which is not available in storage layer.
smgrDoPendingDeletes is needed to be called at aboft again to clean up
useless hash. I'm not sure the exact cause but
AssertPendingSyncs_RelationCache() fails at abort (so it is not called
at abort).

smgrDoPendingSyncs and RelFileNodeSkippingWAL() become simpler by
using the hash.

Is is not fully checked. I didn't merged and mesured performance yet,
but I post the status-quo patch for now.

- v25-0001-version-nm.patch

Noah's v24 patch.

- v25-0002-Revert-FlushRelationBuffersWithoutRelcache.patch

Remove useless function (added by this patch..).

- v25-0003-Improve-the-performance-of-relation-syncs.patch

Make smgrDoPendingSyncs scan shared buffer once.

v25-0004-Adjust-gistGetFakeLSN.patch

Amendment for gistGetFakeLSN. This uses GetXLogInsertRecPtr as long as
it is different from the previous call and emits dummy WAL if we need
a new LSN. Since other than switch_wal record cannot be empty so the
dummy WAL has an integer content for now.

v25-0005-Sync-files-shrinked-by-truncation.patch

Amendment for the truncation problem.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 86d7c2dee819b1171f0a02c56e4cda065c64246f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v25 1/5] version nm

---
 doc/src/sgml/config.sgml                 |  43 +++--
 doc/src/sgml/perform.sgml                |  47 ++----
 src/backend/access/gist/gistutil.c       |   7 +-
 src/backend/access/heap/heapam.c         |  45 +-----
 src/backend/access/heap/heapam_handler.c |  22 +--
 src/backend/access/heap/rewriteheap.c    |  21 +--
 src/backend/access/nbtree/nbtsort.c      |  41 ++---
 src/backend/access/transam/README        |  47 +++++-
 src/backend/access/transam/xact.c        |  14 ++
 src/backend/access/transam/xloginsert.c  |  10 +-
 src/backend/access/transam/xlogutils.c   |  17 +-
 src/backend/catalog/heap.c               |   4 +
 src/backend/catalog/storage.c            | 198 +++++++++++++++++++++--
 src/backend/commands/cluster.c           |  11 ++
 src/backend/commands/copy.c              |  58 +------
 src/backend/commands/createas.c          |  11 +-
 src/backend/commands/matview.c           |  12 +-
 src/backend/commands/tablecmds.c         |  11 +-
 src/backend/storage/buffer/bufmgr.c      |  37 +++--
 src/backend/storage/smgr/md.c            |   9 +-
 src/backend/utils/cache/relcache.c       | 122 ++++++++++----
 src/backend/utils/misc/guc.c             |  13 ++
 src/include/access/heapam.h              |   3 -
 src/include/access/rewriteheap.h         |   2 +-
 src/include/access/tableam.h             |  18 +--
 src/include/catalog/storage.h            |   5 +
 src/include/storage/bufmgr.h             |   5 +
 src/include/utils/rel.h                  |  57 +++++--
 src/include/utils/relcache.h             |   8 +-
 src/test/regress/pg_regress.c            |   2 +
 30 files changed, 551 insertions(+), 349 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d4d1fe45cc..d0f7dbd7d7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2483,21 +2483,14 @@ include_dir 'conf.d'
         levels.  This parameter can only be set at server start.
        </para>
        <para>
-        In <literal>minimal</literal> level, WAL-logging of some bulk
-        operations can be safely skipped, which can make those
-        operations much faster (see <xref linkend="populate-pitr"/>).
-        Operations in which this optimization can be applied include:
-        <simplelist>
-         <member><command>CREATE TABLE AS</command></member>
-         <member><command>CREATE INDEX</command></member>
-         <member><command>CLUSTER</command></member>
-         <member><command>COPY</command> into tables that were created or truncated in the same
-         transaction</member>
-        </simplelist>
-        But minimal WAL does not contain enough information to reconstruct the
-        data from a base backup and the WAL logs, so <literal>replica</literal> or
-        higher must be used to enable WAL archiving
-        (<xref linkend="guc-archive-mode"/>) and streaming replication.
+        In <literal>minimal</literal> level, no information is logged for
+        tables or indexes for the remainder of a transaction that creates or
+        truncates them.  This can make bulk operations much faster (see
+        <xref linkend="populate-pitr"/>).  But minimal WAL does not contain
+        enough information to reconstruct the data from a base backup and the
+        WAL logs, so <literal>replica</literal> or higher must be used to
+        enable WAL archiving (<xref linkend="guc-archive-mode"/>) and
+        streaming replication.
        </para>
        <para>
         In <literal>logical</literal> level, the same information is logged as
@@ -2889,6 +2882,26 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-skip-threshold" xreflabel="wal_skip_threshold">
+      <term><varname>wal_skip_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_skip_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When <varname>wal_level</varname> is <literal>minimal</literal> and a
+        transaction commits after creating or rewriting a permanent table,
+        materialized view, or index, this setting determines how to persist
+        the new data.  If the data is smaller than this setting, write it to
+        the WAL log; otherwise, use an fsync of the data file.  Depending on
+        the properties of your storage, raising or lowering this value might
+        help if such commits are slowing concurrent transactions.  The default
+        is 64 kilobytes (<literal>64kB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-commit-delay" xreflabel="commit_delay">
       <term><varname>commit_delay</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 715aff63c8..fcc60173fb 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1605,8 +1605,8 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
     needs to be written, because in case of an error, the files
     containing the newly loaded data will be removed anyway.
     However, this consideration only applies when
-    <xref linkend="guc-wal-level"/> is <literal>minimal</literal> for
-    non-partitioned tables as all commands must write WAL otherwise.
+    <xref linkend="guc-wal-level"/> is <literal>minimal</literal>
+    as all commands must write WAL otherwise.
    </para>
 
   </sect2>
@@ -1706,42 +1706,13 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
    </para>
 
    <para>
-    Aside from avoiding the time for the archiver or WAL sender to
-    process the WAL data,
-    doing this will actually make certain commands faster, because they
-    are designed not to write WAL at all if <varname>wal_level</varname>
-    is <literal>minimal</literal>.  (They can guarantee crash safety more cheaply
-    by doing an <function>fsync</function> at the end than by writing WAL.)
-    This applies to the following commands:
-    <itemizedlist>
-     <listitem>
-      <para>
-       <command>CREATE TABLE AS SELECT</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>CREATE INDEX</command> (and variants such as
-       <command>ALTER TABLE ADD PRIMARY KEY</command>)
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>ALTER TABLE SET TABLESPACE</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>CLUSTER</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>COPY FROM</command>, when the target table has been
-       created or truncated earlier in the same transaction
-      </para>
-     </listitem>
-    </itemizedlist>
+    Aside from avoiding the time for the archiver or WAL sender to process the
+    WAL data, doing this will actually make certain commands faster, because
+    they do not to write WAL at all if <varname>wal_level</varname>
+    is <literal>minimal</literal> and the current subtransaction (or top-level
+    transaction) created or truncated the table or index they change.  (They
+    can guarantee crash safety more cheaply by doing
+    an <function>fsync</function> at the end than by writing WAL.)
    </para>
   </sect2>
 
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 553a6d67b1..66c52d6dd6 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1013,7 +1013,12 @@ gistGetFakeLSN(Relation rel)
 {
     static XLogRecPtr counter = FirstNormalUnloggedLSN;
 
-    if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
+    /*
+     * XXX before commit fix this.  This is not correct for
+     * RELPERSISTENCE_PERMANENT, but it suffices to make tests pass.
+     */
+    if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP
+        || rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
     {
         /*
          * Temporary relations are only accessible in our session, so a simple
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0128bb34ef..be19c34cbd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -21,7 +21,6 @@
  *        heap_multi_insert - insert multiple tuples into a relation
  *        heap_delete        - delete a tuple from a relation
  *        heap_update        - replace a tuple in a relation with another tuple
- *        heap_sync        - sync heap, for when no WAL has been written
  *
  * NOTES
  *      This file contains the heap_ routines which implement
@@ -1936,7 +1935,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (RelationNeedsWAL(relation))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2119,7 +2118,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+    needwal = RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -8920,46 +8919,6 @@ heap2_redo(XLogReaderState *record)
     }
 }
 
-/*
- *    heap_sync        - sync a heap, for use when no WAL has been written
- *
- * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
- *
- * Indexes are not touched.  (Currently, index operations associated with
- * the commands that use this are WAL-logged and so do not need fsync.
- * That behavior might change someday, but in any case it's likely that
- * any fsync decisions required would be per-index and hence not appropriate
- * to be done here.)
- */
-void
-heap_sync(Relation rel)
-{
-    /* non-WAL-logged tables never need fsync */
-    if (!RelationNeedsWAL(rel))
-        return;
-
-    /* main heap */
-    FlushRelationBuffers(rel);
-    /* FlushRelationBuffers will have opened rd_smgr */
-    smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
-
-    /* FSM is not critical, don't bother syncing it */
-
-    /* toast heap, if any */
-    if (OidIsValid(rel->rd_rel->reltoastrelid))
-    {
-        Relation    toastrel;
-
-        toastrel = table_open(rel->rd_rel->reltoastrelid, AccessShareLock);
-        FlushRelationBuffers(toastrel);
-        smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM);
-        table_close(toastrel, AccessShareLock);
-    }
-}
-
 /*
  * Mask a heap page before performing consistency checks on it.
  */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 92073fec54..07fe717faa 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -555,17 +555,6 @@ tuple_lock_retry:
     return result;
 }
 
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
-    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway / don't go through tableam)
-     */
-    if (options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(relation);
-}
-
 
 /* ------------------------------------------------------------------------
  * DDL related callbacks for heap AM.
@@ -698,7 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     IndexScanDesc indexScan;
     TableScanDesc tableScan;
     HeapScanDesc heapScan;
-    bool        use_wal;
     bool        is_system_catalog;
     Tuplesortstate *tuplesort;
     TupleDesc    oldTupDesc = RelationGetDescr(OldHeap);
@@ -713,12 +701,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     is_system_catalog = IsSystemRelation(OldHeap);
 
     /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a WAL-logged rel.
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
      */
-    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
-    /* use_wal off requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
 
     /* Preallocate values/isnull arrays */
@@ -728,7 +713,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
     /* Initialize the rewrite operation */
     rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
-                                 *multi_cutoff, use_wal);
+                                 *multi_cutoff);
 
 
     /* Set up sorting if wanted */
@@ -2515,7 +2500,6 @@ static const TableAmRoutine heapam_methods = {
     .tuple_delete = heapam_tuple_delete,
     .tuple_update = heapam_tuple_update,
     .tuple_lock = heapam_tuple_lock,
-    .finish_bulk_insert = heapam_finish_bulk_insert,
 
     .tuple_fetch_row_version = heapam_fetch_row_version,
     .tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d285b1f390..3e564838fa 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -136,7 +136,6 @@ typedef struct RewriteStateData
     Page        rs_buffer;        /* page currently being built */
     BlockNumber rs_blockno;        /* block where page will go */
     bool        rs_buffer_valid;    /* T if any tuples in buffer */
-    bool        rs_use_wal;        /* must we WAL-log inserts? */
     bool        rs_logical_rewrite; /* do we need to do logical rewriting */
     TransactionId rs_oldest_xmin;    /* oldest xmin used by caller to determine
                                      * tuple visibility */
@@ -230,15 +229,13 @@ static void logical_end_heap_rewrite(RewriteState state);
  * oldest_xmin    xid used by the caller to determine which tuples are dead
  * freeze_xid    xid before which tuples will be frozen
  * cutoff_multi    multixact before which multis will be removed
- * use_wal        should the inserts to the new heap be WAL-logged?
  *
  * Returns an opaque RewriteState, allocated in current memory context,
  * to be used in subsequent calls to the other functions.
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-                   TransactionId freeze_xid, MultiXactId cutoff_multi,
-                   bool use_wal)
+                   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
     RewriteState state;
     MemoryContext rw_cxt;
@@ -263,7 +260,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     /* new_heap needn't be empty, just locked */
     state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
     state->rs_buffer_valid = false;
-    state->rs_use_wal = use_wal;
     state->rs_oldest_xmin = oldest_xmin;
     state->rs_freeze_xid = freeze_xid;
     state->rs_cutoff_multi = cutoff_multi;
@@ -322,7 +318,7 @@ end_heap_rewrite(RewriteState state)
     /* Write the last page, if any */
     if (state->rs_buffer_valid)
     {
-        if (state->rs_use_wal)
+        if (RelationNeedsWAL(state->rs_new_rel))
             log_newpage(&state->rs_new_rel->rd_node,
                         MAIN_FORKNUM,
                         state->rs_blockno,
@@ -337,18 +333,14 @@ end_heap_rewrite(RewriteState state)
     }
 
     /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.
-     *
-     * It's obvious that we must do this when not WAL-logging. It's less
-     * obvious that we have to do it even if we did WAL-log the pages. The
+     * When we WAL-logged rel pages, we must nonetheless fsync them.  The
      * reason is the same as in storage.c's RelationCopyStorage(): we're
      * writing data that's not in shared buffers, and so a CHECKPOINT
      * occurring during the rewriteheap operation won't have fsync'd data we
      * wrote before the checkpoint.
      */
     if (RelationNeedsWAL(state->rs_new_rel))
-        heap_sync(state->rs_new_rel);
+        smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
 
     logical_end_heap_rewrite(state);
 
@@ -646,9 +638,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int            options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
@@ -687,7 +676,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
             /* Doesn't fit, so write out the existing page */
 
             /* XLOG stuff */
-            if (state->rs_use_wal)
+            if (RelationNeedsWAL(state->rs_new_rel))
                 log_newpage(&state->rs_new_rel->rd_node,
                             MAIN_FORKNUM,
                             state->rs_blockno,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 1dd39a9535..b61692aefc 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -31,18 +31,6 @@
  * them.  They will need to be re-read into shared buffers on first use after
  * the build finishes.
  *
- * Since the index will never be used unless it is completely built,
- * from a crash-recovery point of view there is no need to WAL-log the
- * steps of the build.  After completing the index build, we can just sync
- * the whole file to disk using smgrimmedsync() before exiting this module.
- * This can be seen to be sufficient for crash recovery by considering that
- * it's effectively equivalent to what would happen if a CHECKPOINT occurred
- * just after the index build.  However, it is clearly not sufficient if the
- * DBA is using the WAL log for PITR or replication purposes, since another
- * machine would not be able to reconstruct the index from WAL.  Therefore,
- * we log the completed index pages to WAL if and only if WAL archiving is
- * active.
- *
  * This code isn't concerned about the FSM at all. The caller is responsible
  * for initializing that.
  *
@@ -563,12 +551,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
     wstate.heap = btspool->heap;
     wstate.index = btspool->index;
     wstate.inskey = _bt_mkscankey(wstate.index, NULL);
-
-    /*
-     * We need to log index creation in WAL iff WAL archiving/streaming is
-     * enabled UNLESS the index isn't WAL-logged anyway.
-     */
-    wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
+    wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
 
     /* reserve the metapage */
     wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
@@ -1265,21 +1248,15 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
     _bt_uppershutdown(wstate, state);
 
     /*
-     * If the index is WAL-logged, we must fsync it down to disk before it's
-     * safe to commit the transaction.  (For a non-WAL-logged index we don't
-     * care since the index will be uninteresting after a crash anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the build. It's
-     * less obvious that we have to do it even if we did WAL-log the index
-     * pages.  The reason is that since we're building outside shared buffers,
-     * a CHECKPOINT occurring during the build has no way to flush the
-     * previously written data to disk (indeed it won't know the index even
-     * exists).  A crash later on would replay WAL from the checkpoint,
-     * therefore it wouldn't replay our earlier WAL entries. If we do not
-     * fsync those pages here, they might still not be on disk when the crash
-     * occurs.
+     * When we WAL-logged index pages, we must nonetheless fsync index files.
+     * Since we're building outside shared buffers, a CHECKPOINT occurring
+     * during the build has no way to flush the previously written data to
+     * disk (indeed it won't know the index even exists).  A crash later on
+     * would replay WAL from the checkpoint, therefore it wouldn't replay our
+     * earlier WAL entries. If we do not fsync those pages here, they might
+     * still not be on disk when the crash occurs.
      */
-    if (RelationNeedsWAL(wstate->index))
+    if (wstate->btws_use_wal)
     {
         RelationOpenSmgr(wstate->index);
         smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index b5a2cb2de8..641809cfda 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -717,6 +717,40 @@ then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
 
+Skipping WAL for New RelFileNode
+--------------------------------
+
+Under wal_level=minimal, if a change modifies a relfilenode that
+RollbackAndReleaseCurrentSubTransaction() would unlink, in-tree access methods
+write no WAL for that change.  For any access method, CommitTransaction()
+writes and fsyncs affected blocks before recording the commit.  This skipping
+is mandatory; if a WAL-writing change preceded a WAL-skipping change for the
+same block, REDO could overwrite the WAL-skipping change.  Code that writes
+WAL without calling RelationNeedsWAL() must check for this case.
+
+If skipping were not mandatory, a related problem would arise.  Suppose, under
+full_page_writes=off, a WAL-writing change follows a WAL-skipping change.
+When a WAL record contains no full-page image, REDO expects the page to match
+its contents from just before record insertion.  A WAL-skipping change may not
+reach disk at all, violating REDO's expectation.
+
+Prefer to do the same in future access methods.  However, two other approaches
+can work.  First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync().  Second, an access method can opt to write WAL
+unconditionally for permanent relations.  When using the second method, do not
+call RelationCopyStorage(), which skips WAL.
+
+This applies only to WAL records whose replay would modify bytes stored in the
+new relfilenode.  It does not apply to other records about the relfilenode,
+such as XLOG_SMGR_CREATE.  Because it operates at the level of individual
+relfilenodes, RelationNeedsWAL() can differ for tightly-coupled relations.
+Consider "CREATE TABLE t (); BEGIN; ALTER TABLE t ADD c text; ..." in which
+ALTER TABLE adds a TOAST relation.  The TOAST relation will skip WAL, while
+the table owning it will not.  ALTER TABLE SET TABLESPACE will cause a table
+to skip WAL, but that won't affect its indexes.
+
+
 Asynchronous Commit
 -------------------
 
@@ -820,13 +854,12 @@ Changes to a temp table are not WAL-logged, hence could reach disk in
 advance of T1's commit, but we don't care since temp table contents don't
 survive crashes anyway.
 
-Database writes made via any of the paths we have introduced to avoid WAL
-overhead for bulk updates are also safe.  In these cases it's entirely
-possible for the data to reach disk before T1's commit, because T1 will
-fsync it down to disk without any sort of interlock, as soon as it finishes
-the bulk update.  However, all these paths are designed to write data that
-no other transaction can see until after T1 commits.  The situation is thus
-not different from ordinary WAL-logged updates.
+Database writes that skip WAL for new relfilenodes are also safe.  In these
+cases it's entirely possible for the data to reach disk before T1's commit,
+because T1 will fsync it down to disk without any sort of interlock.  However,
+all these paths are designed to write data that no other transaction can see
+until after T1 commits.  The situation is thus not different from ordinary
+WAL-logged updates.
 
 Transaction Emulation during Recovery
 -------------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5c0d0f2af0..750f95c482 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2109,6 +2109,13 @@ CommitTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Synchronize files that are created and not WAL-logged during this
+     * transaction. This must happen before AtEOXact_RelationMap(), so that we
+     * don't see committed-but-broken files after a crash.
+     */
+    smgrDoPendingSyncs();
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
@@ -2342,6 +2349,13 @@ PrepareTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Synchronize files that are created and not WAL-logged during this
+     * transaction. This must happen before EndPrepare(), so that we don't see
+     * committed-but-broken files after a crash and COMMIT PREPARED.
+     */
+    smgrDoPendingSyncs();
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index aa9dca0036..dda1dea08b 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1043,8 +1043,13 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
                   BlockNumber startblk, BlockNumber endblk,
                   bool page_std)
 {
+    int            flags;
     BlockNumber blkno;
 
+    flags = REGBUF_FORCE_IMAGE;
+    if (page_std)
+        flags |= REGBUF_STANDARD;
+
     /*
      * Iterate over all the pages in the range. They are collected into
      * batches of XLR_MAX_BLOCK_ID pages, and a single WAL-record is written
@@ -1066,7 +1071,8 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
         nbufs = 0;
         while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
         {
-            Buffer        buf = ReadBuffer(rel, blkno);
+            Buffer        buf = ReadBufferExtended(rel, forkNum, blkno,
+                                                 RBM_NORMAL, NULL);
 
             LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -1088,7 +1094,7 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
         START_CRIT_SECTION();
         for (i = 0; i < nbufs; i++)
         {
-            XLogRegisterBuffer(i, bufpack[i], REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+            XLogRegisterBuffer(i, bufpack[i], flags);
             MarkBufferDirty(bufpack[i]);
         }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 446760ed6e..9561e30b08 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -544,6 +544,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry;
  * fields related to physical storage, like rd_rel, are initialized, so the
  * fake entry is only usable in low-level operations like ReadBuffer().
  *
+ * This is also used for syncing WAL-skipped files.
+ *
  * Caller must free the returned entry with FreeFakeRelcacheEntry().
  */
 Relation
@@ -552,18 +554,19 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
     FakeRelCacheEntry fakeentry;
     Relation    rel;
 
-    Assert(InRecovery);
-
     /* Allocate the Relation struct and all related space in one block. */
     fakeentry = palloc0(sizeof(FakeRelCacheEntryData));
     rel = (Relation) fakeentry;
 
     rel->rd_rel = &fakeentry->pgc;
     rel->rd_node = rnode;
-    /* We will never be working with temp rels during recovery */
+    /*
+     * We will never be working with temp rels during recovery or while
+     * syncing WAL-skipped files.
+     */
     rel->rd_backend = InvalidBackendId;
 
-    /* It must be a permanent table if we're in recovery. */
+    /* It must be a permanent table here */
     rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT;
 
     /* We don't know the name of the relation; use relfilenode instead */
@@ -572,9 +575,9 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
     /*
      * We set up the lockRelId in case anything tries to lock the dummy
      * relation.  Note that this is fairly bogus since relNode may be
-     * different from the relation's OID.  It shouldn't really matter though,
-     * since we are presumably running by ourselves and can't have any lock
-     * conflicts ...
+     * different from the relation's OID.  It shouldn't really matter though.
+     * In recovery, we are running by ourselves and can't have any lock
+     * conflicts.  While syncing, we already hold AccessExclusiveLock.
      */
     rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
     rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index b7bcdd9d0f..293ea9a9dd 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -440,6 +440,10 @@ heap_create(const char *relname,
                 break;
         }
     }
+    else
+    {
+        rel->rd_createSubid = InvalidSubTransactionId;
+    }
 
     return rel;
 }
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 056ea3d5d3..51c233dac6 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,9 +29,13 @@
 #include "miscadmin.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+/* GUC variables */
+int    wal_skip_threshold = 64;  /* in kilobytes */
+
 /*
  * We keep a list of all relations (represented as RelFileNode values)
  * that have been created or deleted in the current transaction.  When
@@ -58,6 +62,7 @@ typedef struct PendingRelDelete
     BackendId    backend;        /* InvalidBackendId if not a temp rel */
     bool        atCommit;        /* T=delete at commit; F=delete at abort */
     int            nestLevel;        /* xact nesting level of request */
+    bool        sync;            /* whether to fsync at commit */
     struct PendingRelDelete *next;    /* linked-list link */
 } PendingRelDelete;
 
@@ -114,6 +119,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     pending->backend = backend;
     pending->atCommit = false;    /* delete if abort */
     pending->nestLevel = GetCurrentTransactionNestLevel();
+    pending->sync =
+        relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded();
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
@@ -155,6 +162,7 @@ RelationDropStorage(Relation rel)
     pending->backend = rel->rd_backend;
     pending->atCommit = true;    /* delete if commit */
     pending->nestLevel = GetCurrentTransactionNestLevel();
+    pending->sync = false;
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
@@ -355,7 +363,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 
     /*
      * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a permanent relation.
+     * enabled AND it's a permanent relation.  This gives the same answer as
+     * "RelationNeedsWAL(rel) || copying_initfork", because we know the
+     * current operation created a new relfilenode.
      */
     use_wal = XLogIsNeeded() &&
         (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
@@ -397,24 +407,43 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
     }
 
     /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.  (For a temp or
-     * unlogged rel we don't care since the data will be gone after a crash
-     * anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the copy. It's
-     * less obvious that we have to do it even if we did WAL-log the copied
-     * pages. The reason is that since we're copying outside shared buffers, a
-     * CHECKPOINT occurring during the copy has no way to flush the previously
-     * written data to disk (indeed it won't know the new rel even exists).  A
-     * crash later on would replay WAL from the checkpoint, therefore it
-     * wouldn't replay our earlier WAL entries. If we do not fsync those pages
-     * here, they might still not be on disk when the crash occurs.
+     * When we WAL-logged rel pages, we must nonetheless fsync them.  The
+     * reason is that since we're copying outside shared buffers, a CHECKPOINT
+     * occurring during the copy has no way to flush the previously written
+     * data to disk (indeed it won't know the new rel even exists).  A crash
+     * later on would replay WAL from the checkpoint, therefore it wouldn't
+     * replay our earlier WAL entries. If we do not fsync those pages here,
+     * they might still not be on disk when the crash occurs.
      */
-    if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
+    if (use_wal || copying_initfork)
         smgrimmedsync(dst, forkNum);
 }
 
+/*
+ * RelFileNodeSkippingWAL - check if a BM_PERMANENT relfilenode is using WAL
+ *
+ *   Changes of certain relfilenodes must not write WAL; see "Skipping WAL for
+ *   New RelFileNode" in src/backend/access/transam/README.  Though it is
+ *   known from Relation efficiently, this function is intended for the code
+ *   paths not having access to Relation.
+ */
+bool
+RelFileNodeSkippingWAL(RelFileNode rnode)
+{
+    PendingRelDelete *pending;
+
+    if (XLogIsNeeded())
+        return false;  /* no permanent relfilenode skips WAL */
+
+    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    {
+        if (RelFileNodeEquals(pending->relnode, rnode) && pending->sync)
+            return true;
+    }
+
+    return false;
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -492,6 +521,145 @@ smgrDoPendingDeletes(bool isCommit)
     }
 }
 
+/*
+ *    smgrDoPendingSyncs() -- Take care of relation syncs at commit.
+ *
+ * This should be called before smgrDoPendingDeletes() at every commit or
+ * prepare. Also this should be called before emitting WAL record so that sync
+ * failure prevents commit.
+ */
+void
+smgrDoPendingSyncs(void)
+{
+    PendingRelDelete *pending;
+    HTAB    *delhash = NULL;
+
+    if (XLogIsNeeded())
+        return;  /* no relation can use this */
+
+    Assert(GetCurrentTransactionNestLevel() == 1);
+    AssertPendingSyncs_RelationCache();
+
+    /*
+     * Pending syncs on the relation that are to be deleted in this
+     * transaction-end should be ignored. Collect pending deletes that will
+     * happen in the following call to smgrDoPendingDeletes().
+     */
+    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    {
+        bool found PG_USED_FOR_ASSERTS_ONLY;
+
+        if (!pending->atCommit)
+            continue;
+
+        /* create the hash if not yet */
+        if (delhash == NULL)
+        {
+            HASHCTL hash_ctl;
+
+            memset(&hash_ctl, 0, sizeof(hash_ctl));
+            hash_ctl.keysize = sizeof(RelFileNode);
+            hash_ctl.entrysize = sizeof(RelFileNode);
+            hash_ctl.hcxt = CurrentMemoryContext;
+            delhash =
+                hash_create("pending del temporary hash", 8, &hash_ctl,
+                            HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+        }
+
+        (void) hash_search(delhash, (void *) &pending->relnode,
+                           HASH_ENTER, &found);
+        Assert(!found);
+    }
+
+    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    {
+        bool to_be_removed = false; /* don't sync if aborted */
+        ForkNumber fork;
+        BlockNumber nblocks[MAX_FORKNUM + 1];
+        BlockNumber total_blocks = 0;
+        SMgrRelation srel;
+
+        if (!pending->sync)
+            continue;
+        Assert(!pending->atCommit);
+
+        /* don't sync relnodes that is being deleted */
+        if (delhash)
+            hash_search(delhash, (void *) &pending->relnode,
+                        HASH_FIND, &to_be_removed);
+        if (to_be_removed)
+            continue;
+
+        /* Now the time to sync the rnode */
+        srel = smgropen(pending->relnode, pending->backend);
+
+        /*
+         * We emit newpage WAL records for smaller relations.
+         *
+         * Small WAL records have a chance to be emitted along with other
+         * backends' WAL records. We emit WAL records instead of syncing for
+         * files that are smaller than a certain threshold, expecting faster
+         * commit. The threshold is defined by the GUC wal_skip_threshold.
+         */
+        for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+        {
+            if (smgrexists(srel, fork))
+            {
+                BlockNumber n = smgrnblocks(srel, fork);
+
+                /* we shouldn't come here for unlogged relations */
+                Assert(fork != INIT_FORKNUM);
+
+                nblocks[fork] = n;
+                total_blocks += n;
+            }
+            else
+                nblocks[fork] = InvalidBlockNumber;
+        }
+
+        /*
+         * Sync file or emit WAL record for the file according to the total
+         * size.
+         */
+        if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
+        {
+            /* Flush all buffers then sync the file */
+            FlushRelationBuffersWithoutRelcache(srel, false);
+
+            for (fork = 0; fork <= MAX_FORKNUM; fork++)
+            {
+                if (smgrexists(srel, fork))
+                    smgrimmedsync(srel, fork);
+            }
+        }
+        else
+        {
+            /* Emit WAL records for all blocks. The file is small enough. */
+            for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+            {
+                int    n        = nblocks[fork];
+                Relation rel;
+
+                if (!BlockNumberIsValid(n))
+                    continue;
+
+                /*
+                 * Emit WAL for the whole file.  Unfortunately we don't know
+                 * what kind of a page this is, so we have to log the full
+                 * page including any unused space.  ReadBufferExtended()
+                 * counts some pgstat events; unfortunately, we discard them.
+                 */
+                rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node);
+                log_newpage_range(rel, fork, 0, n, false);
+                FreeFakeRelcacheEntry(rel);
+            }
+        }
+    }
+
+    if (delhash)
+        hash_destroy(delhash);
+}
+
 /*
  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
  *
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index b8c349f245..093fff8c5c 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1014,6 +1014,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
                 relfilenode2;
     Oid            swaptemp;
     char        swptmpchr;
+    Relation    rel1;
 
     /* We need writable copies of both pg_class tuples. */
     relRelation = table_open(RelationRelationId, RowExclusiveLock);
@@ -1039,6 +1040,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
          */
         Assert(!target_is_pg_class);
 
+        /* swap relfilenodes, reltablespaces, relpersistence */
         swaptemp = relform1->relfilenode;
         relform1->relfilenode = relform2->relfilenode;
         relform2->relfilenode = swaptemp;
@@ -1173,6 +1175,15 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
         CacheInvalidateRelcacheByTuple(reltup2);
     }
 
+    /*
+     * Recognize that rel1's relfilenode (swapped from rel2) is new in this
+     * subtransaction. Since the next step for rel2 is deletion, don't bother
+     * recording the newness of its relfilenode.
+     */
+    rel1 = relation_open(r1, AccessExclusiveLock);
+    RelationAssumeNewRelfilenode(rel1);
+    relation_close(rel1, NoLock);
+
     /*
      * Post alter hook for modified relations. The change to r2 is always
      * internal, but r1 depends on the invocation context.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 42a147b67d..607e2558a3 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2711,63 +2711,15 @@ CopyFrom(CopyState cstate)
                             RelationGetRelationName(cstate->rel))));
     }
 
-    /*----------
-     * Check to see if we can avoid writing WAL
-     *
-     * If archive logging/streaming is not enabled *and* either
-     *    - table was created in same transaction as this COPY
-     *    - data is being written to relfilenode created in this transaction
-     * then we can skip writing WAL.  It's safe because if the transaction
-     * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the table_finish_bulk_insert() at
-     * the bottom of this routine first.
-     *
-     * As mentioned in comments in utils/rel.h, the in-same-transaction test
-     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
-     * can be cleared before the end of the transaction. The exact case is
-     * when a relation sets a new relfilenode twice in same transaction, yet
-     * the second one fails in an aborted subtransaction, e.g.
-     *
-     * BEGIN;
-     * TRUNCATE t;
-     * SAVEPOINT save;
-     * TRUNCATE t;
-     * ROLLBACK TO save;
-     * COPY ...
-     *
-     * Also, if the target file is new-in-transaction, we assume that checking
-     * FSM for free space is a waste of time, even if we must use WAL because
-     * of archiving.  This could possibly be wrong, but it's unlikely.
-     *
-     * The comments for table_tuple_insert and RelationGetBufferForTuple
-     * specify that skipping WAL logging is only safe if we ensure that our
-     * tuples do not go into pages containing tuples from any other
-     * transactions --- but this must be the case if we have a new table or
-     * new relfilenode, so we need no additional work to enforce that.
-     *
-     * We currently don't support this optimization if the COPY target is a
-     * partitioned table as we currently only lazily initialize partition
-     * information when routing the first tuple to the partition.  We cannot
-     * know at this stage if we can perform this optimization.  It should be
-     * possible to improve on this, but it does mean maintaining heap insert
-     * option flags per partition and setting them when we first open the
-     * partition.
-     *
-     * This optimization is not supported for relation types which do not
-     * have any physical storage, with foreign tables and views using
-     * INSTEAD OF triggers entering in this category.  Partitioned tables
-     * are not supported as per the description above.
-     *----------
+    /*
+     * If the target file is new-in-transaction, we assume that checking FSM
+     * for free space is a waste of time.  This could possibly be wrong, but
+     * it's unlikely.
      */
-    /* createSubid is creation check, newRelfilenodeSubid is truncation check */
     if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
         (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
-         cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
-    {
+         cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
         ti_options |= TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
-    }
 
     /*
      * Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 2bf7083719..20225dc62f 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -552,16 +552,13 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
     myState->rel = intoRelationDesc;
     myState->reladdr = intoRelationAddr;
     myState->output_cid = GetCurrentCommandId(true);
+    myState->ti_options = TABLE_INSERT_SKIP_FSM;
+    myState->bistate = GetBulkInsertState();
 
     /*
-     * We can skip WAL-logging the insertions, unless PITR or streaming
-     * replication is in use. We can skip the FSM in any case.
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
-    myState->bistate = GetBulkInsertState();
-
-    /* Not using WAL requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(intoRelationDesc) == InvalidBlockNumber);
 }
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 537d0e8cef..ae809c9801 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -457,17 +457,13 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->transientrel = transientrel;
     myState->output_cid = GetCurrentCommandId(true);
-
-    /*
-     * We can skip WAL-logging the insertions, unless PITR or streaming
-     * replication is in use. We can skip the FSM in any case.
-     */
     myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
-    if (!XLogIsNeeded())
-        myState->ti_options |= TABLE_INSERT_SKIP_WAL;
     myState->bistate = GetBulkInsertState();
 
-    /* Not using WAL requires smgr_targblock be initially invalid */
+    /*
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
+     */
     Assert(RelationGetTargetBlock(transientrel) == InvalidBlockNumber);
 }
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 5440eb9015..0e2f5f4259 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4770,19 +4770,14 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         newrel = NULL;
 
     /*
-     * Prepare a BulkInsertState and options for table_tuple_insert. Because
-     * we're building a new heap, we can skip WAL-logging and fsync it to disk
-     * at the end instead (unless WAL-logging is required for archiving or
-     * streaming replication). The FSM is empty too, so don't bother using it.
+     * Prepare a BulkInsertState and options for table_tuple_insert.  The FSM
+     * is empty, so don't bother using it.
      */
     if (newrel)
     {
         mycid = GetCurrentCommandId(true);
         bistate = GetBulkInsertState();
-
         ti_options = TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
     }
     else
     {
@@ -12462,6 +12457,8 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     table_close(pg_class, RowExclusiveLock);
 
+    RelationAssumeNewRelfilenode(rel);
+
     relation_close(rel, NoLock);
 
     /* Make sure the reltablespace change is visible */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7ad10736d5..746ce477fc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3203,20 +3203,27 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
-    /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffersWithoutRelcache(rel->rd_smgr,
+                                        RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3233,7 +3240,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3263,18 +3270,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
@@ -3484,13 +3491,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
             (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
         {
             /*
-             * If we're in recovery we cannot dirty a page because of a hint.
-             * We can set the hint, just not dirty the page as a result so the
-             * hint is lost when we evict the page or shutdown.
+             * If we must not write WAL, due to a relfilenode-specific
+             * condition or being in recovery, don't dirty the page.  We can
+             * set the hint, just not dirty the page as a result so the hint
+             * is lost when we evict the page or shutdown.
              *
              * See src/backend/storage/page/README for longer discussion.
              */
-            if (RecoveryInProgress())
+            if (RecoveryInProgress() ||
+                RelFileNodeSkippingWAL(bufHdr->tag.rnode))
                 return;
 
             /*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 8a9eaf6430..1d408c339c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -248,11 +248,10 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
  * During replay, we would delete the file and then recreate it, which is fine
  * if the contents of the file were repopulated by subsequent WAL entries.
  * But if we didn't WAL-log insertions, but instead relied on fsyncing the
- * file after populating it (as for instance CLUSTER and CREATE INDEX do),
- * the contents of the file would be lost forever.  By leaving the empty file
- * until after the next checkpoint, we prevent reassignment of the relfilenode
- * number until it's safe, because relfilenode assignment skips over any
- * existing file.
+ * file after populating it (as we do at wal_level=minimal), the contents of
+ * the file would be lost forever.  By leaving the empty file until after the
+ * next checkpoint, we prevent reassignment of the relfilenode number until
+ * it's safe, because relfilenode assignment skips over any existing file.
  *
  * We do not need to go through this dance for temp relations, though, because
  * we never make WAL entries for temp rels, and so a temp rel poses no threat
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index ad1ff01b32..f3831f0077 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -262,6 +262,9 @@ static void RelationReloadIndexInfo(Relation relation);
 static void RelationReloadNailed(Relation relation);
 static void RelationFlushRelation(Relation relation);
 static void RememberToFreeTupleDescAtEOX(TupleDesc td);
+#ifdef USE_ASSERT_CHECKING
+static void AssertPendingSyncConsistency(Relation relation);
+#endif
 static void AtEOXact_cleanup(Relation relation, bool isCommit);
 static void AtEOSubXact_cleanup(Relation relation, bool isCommit,
                                 SubTransactionId mySubid, SubTransactionId parentSubid);
@@ -1095,6 +1098,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
     relation->rd_isnailed = false;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     switch (relation->rd_rel->relpersistence)
     {
         case RELPERSISTENCE_UNLOGGED:
@@ -1828,6 +1832,7 @@ formrdesc(const char *relationName, Oid relationReltype,
     relation->rd_isnailed = true;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     relation->rd_backend = InvalidBackendId;
     relation->rd_islocaltemp = false;
 
@@ -2035,6 +2040,12 @@ RelationIdGetRelation(Oid relationId)
     rd = RelationBuildDesc(relationId, true);
     if (RelationIsValid(rd))
         RelationIncrementReferenceCount(rd);
+
+#ifdef USE_ASSERT_CHECKING
+    if (!XLogIsNeeded() && RelationIsValid(rd))
+        AssertPendingSyncConsistency(rd);
+#endif
+
     return rd;
 }
 
@@ -2093,7 +2104,7 @@ RelationClose(Relation relation)
 #ifdef RELCACHE_FORCE_RELEASE
     if (RelationHasReferenceCountZero(relation) &&
         relation->rd_createSubid == InvalidSubTransactionId &&
-        relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
         RelationClearRelation(relation, false);
 #endif
 }
@@ -2509,13 +2520,13 @@ RelationClearRelation(Relation relation, bool rebuild)
          * problem.
          *
          * When rebuilding an open relcache entry, we must preserve ref count,
-         * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state.  Also
-         * attempt to preserve the pg_class entry (rd_rel), tupledesc,
-         * rewrite-rule, partition key, and partition descriptor substructures
-         * in place, because various places assume that these structures won't
-         * move while they are working with an open relcache entry.  (Note:
-         * the refcount mechanism for tupledescs might someday allow us to
-         * remove this hack for the tupledesc.)
+         * rd_*Subid, and rd_toastoid state.  Also attempt to preserve the
+         * pg_class entry (rd_rel), tupledesc, rewrite-rule, partition key,
+         * and partition descriptor substructures in place, because various
+         * places assume that these structures won't move while they are
+         * working with an open relcache entry.  (Note:  the refcount
+         * mechanism for tupledescs might someday allow us to remove this hack
+         * for the tupledesc.)
          *
          * Note that this process does not touch CurrentResourceOwner; which
          * is good because whatever ref counts the entry may have do not
@@ -2599,6 +2610,7 @@ RelationClearRelation(Relation relation, bool rebuild)
         /* creation sub-XIDs must be preserved */
         SWAPFIELD(SubTransactionId, rd_createSubid);
         SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+        SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
         /* un-swap rd_rel pointers, swap contents instead */
         SWAPFIELD(Form_pg_class, rd_rel);
         /* ... but actually, we don't have to update newrel->rd_rel */
@@ -2666,7 +2678,7 @@ static void
 RelationFlushRelation(Relation relation)
 {
     if (relation->rd_createSubid != InvalidSubTransactionId ||
-        relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
     {
         /*
          * New relcache entries are always rebuilt, not flushed; else we'd
@@ -2751,11 +2763,10 @@ RelationCacheInvalidateEntry(Oid relationId)
  *     relation cache and re-read relation mapping data.
  *
  *     This is currently used only to recover from SI message buffer overflow,
- *     so we do not touch new-in-transaction relations; they cannot be targets
- *     of cross-backend SI updates (and our own updates now go through a
- *     separate linked list that isn't limited by the SI message buffer size).
- *     Likewise, we need not discard new-relfilenode-in-transaction hints,
- *     since any invalidation of those would be a local event.
+ *     so we do not touch relations having new-in-transaction relfilenodes; they
+ *     cannot be targets of cross-backend SI updates (and our own updates now go
+ *     through a separate linked list that isn't limited by the SI message
+ *     buffer size).
  *
  *     We do this in two phases: the first pass deletes deletable items, and
  *     the second one rebuilds the rebuildable items.  This is essential for
@@ -2806,7 +2817,7 @@ RelationCacheInvalidate(void)
          * pending invalidations.
          */
         if (relation->rd_createSubid != InvalidSubTransactionId ||
-            relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+            relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
             continue;
 
         relcacheInvalsReceived++;
@@ -2918,6 +2929,40 @@ RememberToFreeTupleDescAtEOX(TupleDesc td)
     EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
 }
 
+#ifdef USE_ASSERT_CHECKING
+static void
+AssertPendingSyncConsistency(Relation relation)
+{
+    bool relcache_verdict =
+        relation->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&
+        ((relation->rd_createSubid != InvalidSubTransactionId &&
+          RELKIND_HAS_STORAGE(relation->rd_rel->relkind)) ||
+         relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId);
+    Assert(relcache_verdict == RelFileNodeSkippingWAL(relation->rd_node));
+}
+
+/*
+ * AssertPendingSyncs_RelationCache
+ *
+ *    Assert that relcache.c and storage.c agree on whether to skip WAL.
+ *
+ * This consistently detects relcache.c skipping WAL while storage.c is not
+ * skipping WAL.  It often fails to detect the reverse error, because
+ * invalidation will have destroyed the relcache entry.  It will detect the
+ * reverse error if something opens the relation after the DDL.
+ */
+void
+AssertPendingSyncs_RelationCache(void)
+{
+    HASH_SEQ_STATUS status;
+    RelIdCacheEnt *idhentry;
+
+    hash_seq_init(&status, RelationIdCache);
+    while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+        AssertPendingSyncConsistency(idhentry->reldesc);
+}
+#endif
+
 /*
  * AtEOXact_RelationCache
  *
@@ -3029,10 +3074,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
      *
      * During commit, reset the flag to zero, since we are now out of the
      * creating transaction.  During abort, simply delete the relcache entry
-     * --- it isn't interesting any longer.  (NOTE: if we have forgotten the
-     * new-ness of a new relation due to a forced cache flush, the entry will
-     * get deleted anyway by shared-cache-inval processing of the aborted
-     * pg_class insertion.)
+     * --- it isn't interesting any longer.
      */
     if (relation->rd_createSubid != InvalidSubTransactionId)
     {
@@ -3060,9 +3102,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
     }
 
     /*
-     * Likewise, reset the hint about the relfilenode being new.
+     * Likewise, reset any record of the relfilenode being new.
      */
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 }
 
 /*
@@ -3154,7 +3197,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
     }
 
     /*
-     * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+     * Likewise, update or drop any new-relfilenode-in-subtransaction.
      */
     if (relation->rd_newRelfilenodeSubid == mySubid)
     {
@@ -3163,6 +3206,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         else
             relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     }
+
+    if (relation->rd_firstRelfilenodeSubid == mySubid)
+    {
+        if (isCommit)
+            relation->rd_firstRelfilenodeSubid = parentSubid;
+        else
+            relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    }
 }
 
 
@@ -3252,6 +3303,7 @@ RelationBuildLocalRelation(const char *relname,
     /* it's being created in this transaction */
     rel->rd_createSubid = GetCurrentSubTransactionId();
     rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 
     /*
      * create a new tuple descriptor from the one passed in.  We do this
@@ -3549,14 +3601,29 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
      */
     CommandCounterIncrement();
 
-    /*
-     * Mark the rel as having been given a new relfilenode in the current
-     * (sub) transaction.  This is a hint that can be used to optimize later
-     * operations on the rel in the same transaction.
-     */
+    RelationAssumeNewRelfilenode(relation);
+}
+
+/*
+ * RelationAssumeNewRelfilenode
+ *
+ * Code that modifies pg_class.reltablespace or pg_class.relfilenode must call
+ * this.  The call shall precede any code that might insert WAL records whose
+ * replay would modify bytes in the new RelFileNode, and the call shall follow
+ * any WAL modifying bytes in the prior RelFileNode.  See struct RelationData.
+ * Ideally, call this as near as possible to the CommandCounterIncrement()
+ * that makes the pg_class change visible (before it or after it); that
+ * minimizes the chance of future development adding a forbidden WAL insertion
+ * between RelationAssumeNewRelfilenode() and CommandCounterIncrement().
+ */
+void
+RelationAssumeNewRelfilenode(Relation relation)
+{
     relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+    if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
 
-    /* Flag relation as needing eoxact cleanup (to remove the hint) */
+    /* Flag relation as needing eoxact cleanup (to clear these fields) */
     EOXactListAdd(relation);
 }
 
@@ -5591,6 +5658,7 @@ load_relcache_init_file(bool shared)
         rel->rd_fkeylist = NIL;
         rel->rd_createSubid = InvalidSubTransactionId;
         rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+        rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
         rel->rd_amcache = NULL;
         MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ba4edde71a..eecaf398c5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -36,6 +36,7 @@
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
+#include "catalog/storage.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
 #include "commands/trigger.h"
@@ -2651,6 +2652,18 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"wal_skip_threshold", PGC_USERSET, RESOURCES_DISK,
+            gettext_noop("Size of new file to fsync instead of writing WAL."),
+            NULL,
+            GUC_UNIT_KB
+        },
+        &wal_skip_threshold,
+        64,
+        0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
     {
         {"max_wal_senders", PGC_POSTMASTER, REPLICATION_SENDING,
             gettext_noop("Sets the maximum number of simultaneously running WAL sender processes."),
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6bc9..22916e8e0e 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
@@ -166,8 +165,6 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                                HeapTuple tup);
 
-extern void heap_sync(Relation relation);
-
 extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
                                                          ItemPointerData *items,
                                                          int nitems);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
                                        TransactionId OldestXmin, TransactionId FreezeXid,
-                                       MultiXactId MultiXactCutoff, bool use_wal);
+                                       MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
                                HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 64022917e2..aca88d0620 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -127,7 +127,7 @@ typedef struct TM_FailureData
 } TM_FailureData;
 
 /* "options" flag bits for table_tuple_insert */
-#define TABLE_INSERT_SKIP_WAL        0x0001
+/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
 #define TABLE_INSERT_SKIP_FSM        0x0002
 #define TABLE_INSERT_FROZEN            0x0004
 #define TABLE_INSERT_NO_LOGICAL        0x0008
@@ -409,9 +409,8 @@ typedef struct TableAmRoutine
 
     /*
      * Perform operations necessary to complete insertions made via
-     * tuple_insert and multi_insert with a BulkInsertState specified. This
-     * may for example be used to flush the relation, when the
-     * TABLE_INSERT_SKIP_WAL option was used.
+     * tuple_insert and multi_insert with a BulkInsertState specified. In-tree
+     * access methods ceased to use this.
      *
      * Typically callers of tuple_insert and multi_insert will just pass all
      * the flags that apply to them, and each AM has to decide which of them
@@ -1087,10 +1086,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * The options bitmask allows the caller to specify options that may change the
  * behaviour of the AM. The AM will ignore options that it does not support.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.
@@ -1309,10 +1304,9 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
 }
 
 /*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
+ * Perform operations necessary to complete insertions made via tuple_insert
+ * and multi_insert with a BulkInsertState specified. In-tree access methods
+ * ceased to use this.
  */
 static inline void
 table_finish_bulk_insert(Relation rel, int options)
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3579d3f3eb..108115a023 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,18 +19,23 @@
 #include "storage/smgr.h"
 #include "utils/relcache.h"
 
+/* GUC variables */
+extern int    wal_skip_threshold;
+
 extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
 extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
                                 ForkNumber forkNum, char relpersistence);
+extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
 
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingSyncs(void);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 17b97f7e38..8097d5ab22 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -49,6 +49,9 @@ typedef enum
 /* forward declared, to avoid having to expose buf_internals.h here */
 struct WritebackContext;
 
+/* forward declared, to avoid including smgr.h here */
+struct SMgrRelationData;
+
 /* in globals.c ... this duplicates miscadmin.h */
 extern PGDLLIMPORT int NBuffers;
 
@@ -189,6 +192,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                                    ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(struct SMgrRelationData *smgr,
+                                                bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                                    int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 31d8a1a10e..9db3d23897 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -63,22 +63,40 @@ typedef struct RelationData
                                  * rd_replidindex) */
     bool        rd_statvalid;    /* is rd_statlist valid? */
 
-    /*
+    /*----------
      * rd_createSubid is the ID of the highest subtransaction the rel has
-     * survived into; or zero if the rel was not created in the current top
-     * transaction.  This can be now be relied on, whereas previously it could
-     * be "forgotten" in earlier releases. Likewise, rd_newRelfilenodeSubid is
-     * the ID of the highest subtransaction the relfilenode change has
-     * survived into, or zero if not changed in the current transaction (or we
-     * have forgotten changing it). rd_newRelfilenodeSubid can be forgotten
-     * when a relation has multiple new relfilenodes within a single
-     * transaction, with one of them occurring in a subsequently aborted
-     * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
-     * ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+     * survived into or zero if the rel was not created in the current top
+     * transaction.  rd_firstRelfilenodeSubid is the ID of the highest
+     * subtransaction an rd_node change has survived into or zero if rd_node
+     * matches the value it had at the start of the current top transaction.
+     * (Rolling back the subtransaction that rd_firstRelfilenodeSubid denotes
+     * would restore rd_node to the value it had at the start of the current
+     * top transaction.  Rolling back any lower subtransaction would not.)
+     * Their accuracy is critical to RelationNeedsWAL().
+     *
+     * rd_newRelfilenodeSubid is the ID of the highest subtransaction the
+     * most-recent relfilenode change has survived into or zero if not changed
+     * in the current transaction (or we have forgotten changing it).  This
+     * field is accurate when non-zero, but it can be zero when a relation has
+     * multiple new relfilenodes within a single transaction, with one of them
+     * occurring in a subsequently aborted subtransaction, e.g.
+     *        BEGIN;
+     *        TRUNCATE t;
+     *        SAVEPOINT save;
+     *        TRUNCATE t;
+     *        ROLLBACK TO save;
+     *        -- rd_newRelfilenodeSubid is now forgotten
+     *
+     * These fields are read-only outside relcache.c.  Other files trigger
+     * rd_node changes by updating pg_class.reltablespace and/or
+     * pg_class.relfilenode.  They must call RelationAssumeNewRelfilenode() to
+     * update these fields.
      */
     SubTransactionId rd_createSubid;    /* rel was created in current xact */
-    SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
-                                                 * current xact */
+    SubTransactionId rd_newRelfilenodeSubid;    /* highest subxact changing
+                                                 * rd_node to current value */
+    SubTransactionId rd_firstRelfilenodeSubid;    /* highest subxact changing
+                                                 * rd_node to any value */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
@@ -520,9 +538,16 @@ typedef struct ViewOptions
 /*
  * RelationNeedsWAL
  *        True if relation needs WAL.
- */
-#define RelationNeedsWAL(relation) \
-    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.  See "Skipping WAL for New
+ * RelFileNode" in src/backend/access/transam/README.
+ */
+#define RelationNeedsWAL(relation)                                        \
+    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&    \
+     (XLogIsNeeded() ||                                                    \
+      (relation->rd_createSubid == InvalidSubTransactionId &&            \
+       relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
 
 /*
  * RelationUsesLocalBuffers
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 2f2ace35b0..d3e8348c1b 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -105,9 +105,10 @@ extern Relation RelationBuildLocalRelation(const char *relname,
                                            char relkind);
 
 /*
- * Routine to manage assignment of new relfilenode to a relation
+ * Routines to manage assignment of new relfilenode to a relation
  */
 extern void RelationSetNewRelfilenode(Relation relation, char persistence);
+extern void RelationAssumeNewRelfilenode(Relation relation);
 
 /*
  * Routines for flushing/rebuilding relcache entries in various scenarios
@@ -120,6 +121,11 @@ extern void RelationCacheInvalidate(void);
 
 extern void RelationCloseSmgrByOid(Oid relationId);
 
+#ifdef USE_ASSERT_CHECKING
+extern void AssertPendingSyncs_RelationCache(void);
+#else
+#define AssertPendingSyncs_RelationCache() do {} while (0)
+#endif
 extern void AtEOXact_RelationCache(bool isCommit);
 extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid,
                                       SubTransactionId parentSubid);
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 297b8fbd6f..1ddde3ecce 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2354,6 +2354,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
         fputs("log_lock_waits = on\n", pg_conf);
         fputs("log_temp_files = 128kB\n", pg_conf);
         fputs("max_prepared_transactions = 2\n", pg_conf);
+        fputs("wal_level = minimal\n", pg_conf); /* XXX before commit remove */
+        fputs("max_wal_senders = 0\n", pg_conf);
 
         for (sl = temp_configs; sl != NULL; sl = sl->next)
         {
-- 
2.23.0

From 630f770a77f1cf57a3d9c805ab154a2e31f2134e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 19:28:35 +0900
Subject: [PATCH v25 2/5] Revert FlushRelationBuffersWithoutRelcache.

Succeeding patch makes the function useless and the function is no
longer useful globally. Revert it.
---
 src/backend/storage/buffer/bufmgr.c | 27 ++++++++++-----------------
 src/include/storage/bufmgr.h        |  2 --
 2 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 746ce477fc..67bbb26cae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3203,27 +3203,20 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    RelationOpenSmgr(rel);
-
-    FlushRelationBuffersWithoutRelcache(rel->rd_smgr,
-                                        RelationUsesLocalBuffers(rel));
-}
-
-void
-FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
-{
-    RelFileNode rnode = smgr->smgr_rnode.node;
-    int i;
+    int            i;
     BufferDesc *bufHdr;
 
-    if (islocal)
+    /* Open rel at the smgr level if not already done */
+    RelationOpenSmgr(rel);
+
+    if (RelationUsesLocalBuffers(rel))
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3240,7 +3233,7 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(smgr,
+                smgrwrite(rel->rd_smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3270,18 +3263,18 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, smgr);
+            FlushBuffer(bufHdr, rel->rd_smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 8097d5ab22..8cd1cf25d9 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -192,8 +192,6 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                                    ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
-extern void FlushRelationBuffersWithoutRelcache(struct SMgrRelationData *smgr,
-                                                bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                                    int nforks, BlockNumber *firstDelBlock);
-- 
2.23.0

From 12409838ef6eee0e35dd2730bda19bbb9f889931 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 19:33:18 +0900
Subject: [PATCH v25 3/5] Improve the performance of relation syncs.

We can improve performance of syncing multiple files at once in the
same way as b41669118. This reduces the number of scans on the whole
shared_bufffers from the number of synced relations to one.
---
 src/backend/catalog/storage.c       |  28 +++++--
 src/backend/storage/buffer/bufmgr.c | 113 ++++++++++++++++++++++++++++
 src/backend/storage/smgr/smgr.c     |  38 +++++++++-
 src/include/storage/bufmgr.h        |   1 +
 src/include/storage/smgr.h          |   1 +
 5 files changed, 174 insertions(+), 7 deletions(-)

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 51c233dac6..65811b2a9e 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -533,6 +533,9 @@ smgrDoPendingSyncs(void)
 {
     PendingRelDelete *pending;
     HTAB    *delhash = NULL;
+    int            nrels = 0,
+                maxrels = 0;
+    SMgrRelation *srels = NULL;
 
     if (XLogIsNeeded())
         return;  /* no relation can use this */
@@ -573,7 +576,7 @@ smgrDoPendingSyncs(void)
 
     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
     {
-        bool to_be_removed = false; /* don't sync if aborted */
+        bool to_be_removed = false;
         ForkNumber fork;
         BlockNumber nblocks[MAX_FORKNUM + 1];
         BlockNumber total_blocks = 0;
@@ -623,14 +626,21 @@ smgrDoPendingSyncs(void)
          */
         if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
         {
-            /* Flush all buffers then sync the file */
-            FlushRelationBuffersWithoutRelcache(srel, false);
+            /* relations to sync are passed to smgrdosyncall at once */
 
-            for (fork = 0; fork <= MAX_FORKNUM; fork++)
+            /* allocate the initial array, or extend it, if needed */
+            if (maxrels == 0)
             {
-                if (smgrexists(srel, fork))
-                    smgrimmedsync(srel, fork);
+                maxrels = 8;
+                srels = palloc(sizeof(SMgrRelation) * maxrels);
             }
+            else if (maxrels <= nrels)
+            {
+                maxrels *= 2;
+                srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+            }
+
+            srels[nrels++] = srel;
         }
         else
         {
@@ -658,6 +668,12 @@ smgrDoPendingSyncs(void)
 
     if (delhash)
         hash_destroy(delhash);
+
+    if (nrels > 0)
+    {
+        smgrdosyncall(srels, nrels);
+        pfree(srels);
+    }
 }
 
 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 67bbb26cae..56314653ae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -105,6 +105,19 @@ typedef struct CkptTsStatus
     int            index;
 } CkptTsStatus;
 
+/*
+ * Type for array used to sort SMgrRelations
+ *
+ * FlushRelFileNodesAllBuffers shares the same comparator function with
+ * DropRelFileNodeBuffers. Pointer to this struct and RelFileNode must
+ * be compatible.
+ */
+typedef struct SMgrSortArray
+{
+    RelFileNode        rnode;    /* This must be the first member */
+    SMgrRelation    srel;
+} SMgrSortArray;
+
 /* GUC variables */
 bool        zero_damaged_pages = false;
 int            bgwriter_lru_maxpages = 100;
@@ -3283,6 +3296,106 @@ FlushRelationBuffers(Relation rel)
     }
 }
 
+/* ---------------------------------------------------------------------
+ *        FlushRelFileNodesAllBuffers
+ *
+ *        This function flushes out the buffer pool all the pages of all
+ *      forks of the specified smgr relations.  It's equivalent to
+ *      calling FlushRelationBuffers once per fork per relation, but the
+ *      parameter is not Relation but SMgrRelation
+ * --------------------------------------------------------------------
+ */
+void
+FlushRelFileNodesAllBuffers(SMgrRelation *smgrs, int nrels)
+{
+    int            i;
+    SMgrSortArray *srels;
+    bool        use_bsearch;
+
+    if (nrels == 0)
+        return;
+
+    /* fill-in array for qsort */
+    srels = palloc(sizeof(SMgrSortArray) * nrels);
+
+    for (i = 0 ; i < nrels ; i++)
+    {
+        Assert (!RelFileNodeBackendIsTemp(smgrs[i]->smgr_rnode));
+
+        srels[i].rnode = smgrs[i]->smgr_rnode.node;
+        srels[i].srel  = smgrs[i];
+    }
+
+    /*
+     * Save the bsearch overhead for low number of relations to
+     * sync. See DropRelFileNodesAllBuffers for details. The name DROP_*
+     * is for historical reasons.
+     */
+    use_bsearch = nrels > DROP_RELS_BSEARCH_THRESHOLD;
+
+    /* sort the list of SMgrRelations if necessary */
+    if (use_bsearch)
+        pg_qsort(srels, nrels, sizeof(SMgrSortArray), rnode_comparator);
+
+    /* Make sure we can handle the pin inside the loop */
+    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+    for (i = 0; i < NBuffers; i++)
+    {
+        SMgrSortArray *srelent = NULL;
+        BufferDesc *bufHdr = GetBufferDescriptor(i);
+        uint32        buf_state;
+
+        /*
+         * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
+         * and saves some cycles.
+         */
+
+        if (!use_bsearch)
+        {
+            int            j;
+
+            for (j = 0; j < nrels; j++)
+            {
+                if (RelFileNodeEquals(bufHdr->tag.rnode, srels[j].rnode))
+                {
+                    srelent = &srels[j];
+                    break;
+                }
+            }
+
+        }
+        else
+        {
+            srelent = bsearch((const void *) &(bufHdr->tag.rnode),
+                              srels, nrels, sizeof(SMgrSortArray),
+                              rnode_comparator);
+        }
+
+        /* buffer doesn't belong to any of the given relfilenodes; skip it */
+        if (srelent == NULL)
+            continue;
+
+        /* Ensure there's a free array slot for PinBuffer_Locked */
+        ReservePrivateRefCountEntry();
+
+        buf_state = LockBufHdr(bufHdr);
+        if (RelFileNodeEquals(bufHdr->tag.rnode, srelent->rnode) &&
+            (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+        {
+            PinBuffer_Locked(bufHdr);
+            LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+            FlushBuffer(bufHdr, srelent->srel);
+            LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+            UnpinBuffer(bufHdr, true);
+        }
+        else
+            UnlockBufHdr(bufHdr, buf_state);
+    }
+
+    pfree(srels);
+}
+
 /* ---------------------------------------------------------------------
  *        FlushDatabaseBuffers
  *
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index b50c69b438..f79f2df40f 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -388,6 +388,43 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
     smgrsw[which].smgr_unlink(rnode, InvalidForkNumber, isRedo);
 }
 
+/*
+ *    smgrdosyncall() -- Immediately sync all forks of all given relations
+ *
+ *        All forks of all given relations are syncd out to the store.
+ *
+ *        This is equivalent to flusing all buffers FlushRelationBuffers for each
+ *        smgr relation then calling smgrimmedsync for all forks of each smgr
+ *        relation, but it's significantly quicker so should be preferred when
+ *        possible.
+ */
+void
+smgrdosyncall(SMgrRelation *rels, int nrels)
+{
+    int            i = 0;
+    ForkNumber    forknum;
+
+    if (nrels == 0)
+        return;
+
+    /* We need to flush all buffers for the relations before sync. */
+    FlushRelFileNodesAllBuffers(rels, nrels);
+
+    /*
+     * Sync the physical file(s).
+     */
+    for (i = 0; i < nrels; i++)
+    {
+        int            which = rels[i]->smgr_which;
+
+        for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+        {
+            if (smgrsw[which].smgr_exists(rels[i], forknum))
+                smgrsw[which].smgr_immedsync(rels[i], forknum);
+        }
+    }
+}
+
 /*
  *    smgrdounlinkall() -- Immediately unlink all forks of all given relations
  *
@@ -469,7 +506,6 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     pfree(rnodes);
 }
 
-
 /*
  *    smgrextend() -- Add a new block to a file.
  *
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 8cd1cf25d9..3f85e8c6fe 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -195,6 +195,7 @@ extern void FlushRelationBuffers(Relation rel);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                                    int nforks, BlockNumber *firstDelBlock);
+extern void FlushRelFileNodesAllBuffers(struct SMgrRelationData **smgrs, int nrels);
 extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes);
 extern void DropDatabaseBuffers(Oid dbid);
 
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 1543d8d870..31a5ecd059 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -89,6 +89,7 @@ extern void smgrcloseall(void);
 extern void smgrclosenode(RelFileNodeBackend rnode);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdounlink(SMgrRelation reln, bool isRedo);
+extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
                        BlockNumber blocknum, char *buffer, bool skipFsync);
-- 
2.23.0

From 9a47b1faaae7c5e12596cc172dcb1f37e2fc971a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 16:12:03 +0900
Subject: [PATCH v25 4/5] Adjust gistGetFakeLSN()

GiST needs to set page LSN to monotically incresing numbers on updates
even if it is not WAL-logged at all.  We use a simple counter for
UNLOGGESD/TEMP relations but the number must be smaller than the LSN
at the next commit for WAL-skipped relations. WAL-insertione pointer
works in major cases but we sometimes need to emit a WAL record to
generate an unique LSN for update. This patch adds a new WAL record
kind XLOG_GIST_ASSIGN_LSN, which conveys no substantial content and
emits it if needed.
---
 src/backend/access/gist/gistutil.c     | 30 +++++++++++++++++++-------
 src/backend/access/gist/gistxlog.c     | 17 +++++++++++++++
 src/backend/access/rmgrdesc/gistdesc.c |  5 +++++
 src/include/access/gist_private.h      |  2 ++
 src/include/access/gistxlog.h          |  1 +
 5 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 66c52d6dd6..eebc1a9647 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1011,21 +1011,35 @@ gistproperty(Oid index_oid, int attno,
 XLogRecPtr
 gistGetFakeLSN(Relation rel)
 {
-    static XLogRecPtr counter = FirstNormalUnloggedLSN;
-
-    /*
-     * XXX before commit fix this.  This is not correct for
-     * RELPERSISTENCE_PERMANENT, but it suffices to make tests pass.
-     */
-    if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP
-        || rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+    if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
     {
+        static XLogRecPtr counter = FirstNormalUnloggedLSN;
         /*
          * Temporary relations are only accessible in our session, so a simple
          * backend-local counter will do.
          */
         return counter++;
     }
+    else if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+    {
+        /*
+         * WAL-logging on this relation will start after commit, so the LSN
+         * must be distinct numbers smaller than the LSN at the next
+         * commit. Emit a dummy WAL record if insert-LSN hasn't advanced after
+         * the last call.
+         */
+        static XLogRecPtr lastlsn = InvalidXLogRecPtr;
+        XLogRecPtr currlsn = GetXLogInsertRecPtr();
+
+        Assert(!RelationNeedsWAL(rel));
+
+        /* No need of an actual record if we alredy have a distinct LSN */
+        if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
+            currlsn = gistXLogAssignLSN();
+
+        lastlsn = currlsn;
+        return currlsn;
+    }
     else
     {
         /*
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 3b28f54646..cc63c17aba 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -449,6 +449,9 @@ gist_redo(XLogReaderState *record)
         case XLOG_GIST_PAGE_DELETE:
             gistRedoPageDelete(record);
             break;
+        case XLOG_GIST_ASSIGN_LSN:
+            /* nop. See gistGetFakeLSN(). */
+            break;
         default:
             elog(PANIC, "gist_redo: unknown op code %u", info);
     }
@@ -592,6 +595,20 @@ gistXLogPageDelete(Buffer buffer, FullTransactionId xid,
     return recptr;
 }
 
+/*
+ * Write an empty XLOG record to assign a distinct LSN.
+ */
+XLogRecPtr
+gistXLogAssignLSN(void)
+{
+    int dummy = 0;
+
+    XLogBeginInsert();
+    XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
+    XLogRegisterData((char*) &dummy, sizeof(dummy));
+    return XLogInsert(RM_GIST_ID, XLOG_GIST_ASSIGN_LSN);
+}
+
 /*
  * Write XLOG record about reuse of a deleted page.
  */
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index eccb6fd942..48cda40ac0 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -80,6 +80,9 @@ gist_desc(StringInfo buf, XLogReaderState *record)
         case XLOG_GIST_PAGE_DELETE:
             out_gistxlogPageDelete(buf, (gistxlogPageDelete *) rec);
             break;
+        case XLOG_GIST_ASSIGN_LSN:
+            /* No details to write out */
+            break;
     }
 }
 
@@ -104,6 +107,8 @@ gist_identify(uint8 info)
             break;
         case XLOG_GIST_PAGE_DELETE:
             id = "PAGE_DELETE";
+        case XLOG_GIST_ASSIGN_LSN:
+            id = "ASSIGN_LSN";
             break;
     }
 
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index a409975db1..3455dd242d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -455,6 +455,8 @@ extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
                                 BlockNumber origrlink, GistNSN oldnsn,
                                 Buffer leftchild, bool markfollowright);
 
+extern XLogRecPtr gistXLogAssignLSN(void);
+
 /* gistget.c */
 extern bool gistgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index e44922d915..1eae06c0fb 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -26,6 +26,7 @@
  /* #define XLOG_GIST_INSERT_COMPLETE     0x40 */    /* not used anymore */
  /* #define XLOG_GIST_CREATE_INDEX         0x50 */    /* not used anymore */
 #define XLOG_GIST_PAGE_DELETE        0x60
+#define XLOG_GIST_ASSIGN_LSN        0x70    /* nop, assign an new LSN */
 
 /*
  * Backup Blk 0: updated page.
-- 
2.23.0

From 656b739e60f5c07e4eb91ec2ba016abf1db39e69 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 26 Nov 2019 21:25:09 +0900
Subject: [PATCH v25 5/5] Sync files shrinked by truncation

If truncation made a WAL-skipped file become smaller at commit than
the maximum size during the transaction, the file must not be
at-commit-WAL-logged and must be synced.
---
 src/backend/access/transam/xact.c  |   5 +-
 src/backend/catalog/storage.c      | 155 ++++++++++++++++++-----------
 src/backend/utils/cache/relcache.c |   1 +
 src/include/catalog/storage.h      |   2 +-
 4 files changed, 102 insertions(+), 61 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 750f95c482..f681cd3a23 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2114,7 +2114,7 @@ CommitTransaction(void)
      * transaction. This must happen before AtEOXact_RelationMap(), so that we
      * don't see committed-but-broken files after a crash.
      */
-    smgrDoPendingSyncs();
+    smgrDoPendingSyncs(true);
 
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
@@ -2354,7 +2354,7 @@ PrepareTransaction(void)
      * transaction. This must happen before EndPrepare(), so that we don't see
      * committed-but-broken files after a crash and COMMIT PREPARED.
      */
-    smgrDoPendingSyncs();
+    smgrDoPendingSyncs(true);
 
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
@@ -2674,6 +2674,7 @@ AbortTransaction(void)
      */
     AfterTriggerEndXact(false); /* 'false' means it's abort */
     AtAbort_Portals();
+    smgrDoPendingSyncs(false);
     AtEOXact_LargeObject(false);
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 65811b2a9e..ea499490b8 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -62,11 +62,17 @@ typedef struct PendingRelDelete
     BackendId    backend;        /* InvalidBackendId if not a temp rel */
     bool        atCommit;        /* T=delete at commit; F=delete at abort */
     int            nestLevel;        /* xact nesting level of request */
-    bool        sync;            /* whether to fsync at commit */
     struct PendingRelDelete *next;    /* linked-list link */
 } PendingRelDelete;
 
+typedef struct pendingSync
+{
+    RelFileNode rnode;
+    BlockNumber max_truncated;
+} pendingSync;
+
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+HTAB *pendingSyncHash = NULL;
 
 /*
  * RelationCreateStorage
@@ -119,11 +125,39 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     pending->backend = backend;
     pending->atCommit = false;    /* delete if abort */
     pending->nestLevel = GetCurrentTransactionNestLevel();
-    pending->sync =
-        relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded();
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
+    /*
+     * If the relation needs at-commit sync, we also need to track the maximum
+     * unsynced truncated block used to decide whether we can WAL-logging or we
+     * must sync the file in smgrDoPendingSyncs.
+     */
+    if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+    {
+        pendingSync *pending;
+        bool         found;
+
+        /* we sync only permanent relations */
+        Assert(backend == InvalidBackendId);
+
+        if (!pendingSyncHash)
+        {
+            HASHCTL ctl;
+
+            ctl.keysize = sizeof(RelFileNode);
+            ctl.entrysize =  sizeof(pendingSync);
+            ctl.hcxt = TopTransactionContext;
+            pendingSyncHash =
+                hash_create("max truncatd block hash",
+                            16, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+        }
+
+        pending = hash_search(pendingSyncHash, &rnode, HASH_ENTER, &found);
+        Assert(!found);
+        pending->max_truncated = InvalidBlockNumber;
+    }
+
     return srel;
 }
 
@@ -162,7 +196,6 @@ RelationDropStorage(Relation rel)
     pending->backend = rel->rd_backend;
     pending->atCommit = true;    /* delete if commit */
     pending->nestLevel = GetCurrentTransactionNestLevel();
-    pending->sync = false;
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
@@ -320,6 +353,21 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
         if (fsm || vm)
             XLogFlush(lsn);
     }
+    else if (pendingSyncHash)
+    {
+        pendingSync *pending;
+
+        pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
+                              HASH_FIND, NULL);
+        if (pending)
+        {
+            BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
+
+            if (!BlockNumberIsValid(pending->max_truncated) ||
+                pending->max_truncated < nblocks)
+                pending->max_truncated = nblocks;
+        }
+    }
 
     /* Do the real work to truncate relation forks */
     smgrtruncate(rel->rd_smgr, forks, nforks, blocks);
@@ -430,18 +478,17 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 bool
 RelFileNodeSkippingWAL(RelFileNode rnode)
 {
-    PendingRelDelete *pending;
-
     if (XLogIsNeeded())
         return false;  /* no permanent relfilenode skips WAL */
 
-    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
-    {
-        if (RelFileNodeEquals(pending->relnode, rnode) && pending->sync)
-            return true;
-    }
+    if (!pendingSyncHash)
+        return false;  /* we don't have a to-be-synced relation */
 
-    return false;
+    /* the relation is not tracked as to-be-synced */
+    if (hash_search(pendingSyncHash, &rnode, HASH_FIND, NULL) == NULL)
+        return false;
+
+    return true;
 }
 
 /*
@@ -529,72 +576,60 @@ smgrDoPendingDeletes(bool isCommit)
  * failure prevents commit.
  */
 void
-smgrDoPendingSyncs(void)
+smgrDoPendingSyncs(bool isCommit)
 {
     PendingRelDelete *pending;
-    HTAB    *delhash = NULL;
     int            nrels = 0,
                 maxrels = 0;
     SMgrRelation *srels = NULL;
+    HASH_SEQ_STATUS scan;
+    pendingSync *pendingsync;
 
     if (XLogIsNeeded())
         return;  /* no relation can use this */
 
     Assert(GetCurrentTransactionNestLevel() == 1);
+
+    if (!pendingSyncHash)
+        return; /* no relation needs sync */
+
+    /* Just throw away all pending syncs if any at rollback */
+    if (!isCommit)
+    {
+        if (pendingSyncHash)
+        {
+            hash_destroy(pendingSyncHash);
+            pendingSyncHash = NULL;
+        }
+        return;
+    }
+
     AssertPendingSyncs_RelationCache();
 
     /*
      * Pending syncs on the relation that are to be deleted in this
-     * transaction-end should be ignored. Collect pending deletes that will
-     * happen in the following call to smgrDoPendingDeletes().
+     * transaction-end should be ignored. Remove sync hash entries entries for
+     * relations that will be deleted in the following call to
+     * smgrDoPendingDeletes().
      */
     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
     {
-        bool found PG_USED_FOR_ASSERTS_ONLY;
-
         if (!pending->atCommit)
             continue;
 
-        /* create the hash if not yet */
-        if (delhash == NULL)
-        {
-            HASHCTL hash_ctl;
-
-            memset(&hash_ctl, 0, sizeof(hash_ctl));
-            hash_ctl.keysize = sizeof(RelFileNode);
-            hash_ctl.entrysize = sizeof(RelFileNode);
-            hash_ctl.hcxt = CurrentMemoryContext;
-            delhash =
-                hash_create("pending del temporary hash", 8, &hash_ctl,
-                            HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-        }
-
-        (void) hash_search(delhash, (void *) &pending->relnode,
-                           HASH_ENTER, &found);
-        Assert(!found);
+        (void) hash_search(pendingSyncHash, (void *) &pending->relnode,
+                           HASH_REMOVE, NULL);
     }
 
-    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    hash_seq_init(&scan, pendingSyncHash);
+    while ((pendingsync = (pendingSync *) hash_seq_search(&scan)))
     {
-        bool to_be_removed = false;
-        ForkNumber fork;
-        BlockNumber nblocks[MAX_FORKNUM + 1];
-        BlockNumber total_blocks = 0;
-        SMgrRelation srel;
-
-        if (!pending->sync)
-            continue;
-        Assert(!pending->atCommit);
-
-        /* don't sync relnodes that is being deleted */
-        if (delhash)
-            hash_search(delhash, (void *) &pending->relnode,
-                        HASH_FIND, &to_be_removed);
-        if (to_be_removed)
-            continue;
+        ForkNumber        fork;
+        BlockNumber        nblocks[MAX_FORKNUM + 1];
+        BlockNumber        total_blocks = 0;
+        SMgrRelation    srel;
 
-        /* Now the time to sync the rnode */
-        srel = smgropen(pending->relnode, pending->backend);
+        srel = smgropen(pendingsync->rnode, InvalidBackendId);
 
         /*
          * We emit newpage WAL records for smaller relations.
@@ -622,9 +657,12 @@ smgrDoPendingSyncs(void)
 
         /*
          * Sync file or emit WAL record for the file according to the total
-         * size.
+         * size. Do file sync if the size is larger than the threshold or
+         * truncates may have left blocks beyond the current size.
          */
-        if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
+        if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024 ||
+            (BlockNumberIsValid(pendingsync->max_truncated) &&
+             smgrnblocks(srel, MAIN_FORKNUM) < pendingsync->max_truncated))
         {
             /* relations to sync are passed to smgrdosyncall at once */
 
@@ -666,8 +704,9 @@ smgrDoPendingSyncs(void)
         }
     }
 
-    if (delhash)
-        hash_destroy(delhash);
+    Assert (pendingSyncHash);
+    hash_destroy(pendingSyncHash);
+    pendingSyncHash = NULL;
 
     if (nrels > 0)
     {
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index f3831f0077..ea11ceb4d3 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3619,6 +3619,7 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
 void
 RelationAssumeNewRelfilenode(Relation relation)
 {
+    elog(LOG, "ASSUME: %d", relation->rd_node.relNode);
     relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
     if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
         relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 108115a023..bf076657e7 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -35,7 +35,7 @@ extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
-extern void smgrDoPendingSyncs(void);
+extern void smgrDoPendingSyncs(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
-- 
2.23.0

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

28 ноября 2019 г., 14:56:20

At Tue, 26 Nov 2019 21:37:52 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail> Is is not fully checked. I didn't
mergedand mesured performance yet,
 
> but I post the status-quo patch for now.

It was actually inconsistency caused by swap_relation_files.

1. rd_createSubid of relcache for r2 is not turned off. This prevents
   the relcache entry from flushed. Commit processes pendingSyncs and
   leaves the relcache entry with rd_createSubid != Invalid. It is
   inconsistency.

2. relation_open(r1) returns a relcache entry with its relfilenode has
   the old value (relfilenode1) since command counter has not been
   incremented. On the other hand if it is incremented just before,
   AssertPendingSyncConsistency() aborts because of the inconsistency
   between relfilenode and rd_firstRel*.

As the result, I returned to think that we need to modify both
relcache entries with right relfilenode.

I once thought that taking AEL in the function has no side effect but
the code path is executed also when wal_level = replica or higher. And
as I mentioned upthread, we can even get there without taking any lock
on r1 or sometimes ShareLock. So upgrading to AEL emits Standby/LOCK
WAL and propagates to standby. After all I'd like to take the weakest
lock (AccessShareLock) there.

The attached is the new version of the patch.

- v26-0001-version-nm24.patch
  Same with v24

- v26-0002-change-swap_relation_files.patch

 Changes to swap_relation_files as mentioned above.

- v26-0003-Improve-the-performance-of-relation-syncs.patch

 Do multiple pending syncs by one shared_buffers scanning.

- v26-0004-Revert-FlushRelationBuffersWithoutRelcache.patch

 v26-0003 makes the function useless. Remove it.

- v26-0005-Fix-gistGetFakeLSN.patch

 gistGetFakeLSN fix.

- v26-0006-Sync-files-shrinked-by-truncation.patch

 Fix the problem of commit-time-FPI after truncation after checkpoint.
 I'm not sure this is the right direction but pendingSyncHash is
 removed from pendingDeletes list again.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From ee96bb1e14969823eab79ab1531d68e8aadc1915 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v26 1/6] version nm24

Noah Misch's version 24.
---
 doc/src/sgml/config.sgml                 |  43 +++--
 doc/src/sgml/perform.sgml                |  47 ++----
 src/backend/access/gist/gistutil.c       |   7 +-
 src/backend/access/heap/heapam.c         |  45 +-----
 src/backend/access/heap/heapam_handler.c |  22 +--
 src/backend/access/heap/rewriteheap.c    |  21 +--
 src/backend/access/nbtree/nbtsort.c      |  41 ++---
 src/backend/access/transam/README        |  47 +++++-
 src/backend/access/transam/xact.c        |  14 ++
 src/backend/access/transam/xloginsert.c  |  10 +-
 src/backend/access/transam/xlogutils.c   |  17 +-
 src/backend/catalog/heap.c               |   4 +
 src/backend/catalog/storage.c            | 198 +++++++++++++++++++++--
 src/backend/commands/cluster.c           |  11 ++
 src/backend/commands/copy.c              |  58 +------
 src/backend/commands/createas.c          |  11 +-
 src/backend/commands/matview.c           |  12 +-
 src/backend/commands/tablecmds.c         |  11 +-
 src/backend/storage/buffer/bufmgr.c      |  37 +++--
 src/backend/storage/smgr/md.c            |   9 +-
 src/backend/utils/cache/relcache.c       | 122 ++++++++++----
 src/backend/utils/misc/guc.c             |  13 ++
 src/include/access/heapam.h              |   3 -
 src/include/access/rewriteheap.h         |   2 +-
 src/include/access/tableam.h             |  18 +--
 src/include/catalog/storage.h            |   5 +
 src/include/storage/bufmgr.h             |   5 +
 src/include/utils/rel.h                  |  57 +++++--
 src/include/utils/relcache.h             |   8 +-
 src/test/regress/pg_regress.c            |   2 +
 30 files changed, 551 insertions(+), 349 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d4d1fe45cc..d0f7dbd7d7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2483,21 +2483,14 @@ include_dir 'conf.d'
         levels.  This parameter can only be set at server start.
        </para>
        <para>
-        In <literal>minimal</literal> level, WAL-logging of some bulk
-        operations can be safely skipped, which can make those
-        operations much faster (see <xref linkend="populate-pitr"/>).
-        Operations in which this optimization can be applied include:
-        <simplelist>
-         <member><command>CREATE TABLE AS</command></member>
-         <member><command>CREATE INDEX</command></member>
-         <member><command>CLUSTER</command></member>
-         <member><command>COPY</command> into tables that were created or truncated in the same
-         transaction</member>
-        </simplelist>
-        But minimal WAL does not contain enough information to reconstruct the
-        data from a base backup and the WAL logs, so <literal>replica</literal> or
-        higher must be used to enable WAL archiving
-        (<xref linkend="guc-archive-mode"/>) and streaming replication.
+        In <literal>minimal</literal> level, no information is logged for
+        tables or indexes for the remainder of a transaction that creates or
+        truncates them.  This can make bulk operations much faster (see
+        <xref linkend="populate-pitr"/>).  But minimal WAL does not contain
+        enough information to reconstruct the data from a base backup and the
+        WAL logs, so <literal>replica</literal> or higher must be used to
+        enable WAL archiving (<xref linkend="guc-archive-mode"/>) and
+        streaming replication.
        </para>
        <para>
         In <literal>logical</literal> level, the same information is logged as
@@ -2889,6 +2882,26 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-skip-threshold" xreflabel="wal_skip_threshold">
+      <term><varname>wal_skip_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_skip_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When <varname>wal_level</varname> is <literal>minimal</literal> and a
+        transaction commits after creating or rewriting a permanent table,
+        materialized view, or index, this setting determines how to persist
+        the new data.  If the data is smaller than this setting, write it to
+        the WAL log; otherwise, use an fsync of the data file.  Depending on
+        the properties of your storage, raising or lowering this value might
+        help if such commits are slowing concurrent transactions.  The default
+        is 64 kilobytes (<literal>64kB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-commit-delay" xreflabel="commit_delay">
       <term><varname>commit_delay</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 715aff63c8..fcc60173fb 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1605,8 +1605,8 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
     needs to be written, because in case of an error, the files
     containing the newly loaded data will be removed anyway.
     However, this consideration only applies when
-    <xref linkend="guc-wal-level"/> is <literal>minimal</literal> for
-    non-partitioned tables as all commands must write WAL otherwise.
+    <xref linkend="guc-wal-level"/> is <literal>minimal</literal>
+    as all commands must write WAL otherwise.
    </para>
 
   </sect2>
@@ -1706,42 +1706,13 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
    </para>
 
    <para>
-    Aside from avoiding the time for the archiver or WAL sender to
-    process the WAL data,
-    doing this will actually make certain commands faster, because they
-    are designed not to write WAL at all if <varname>wal_level</varname>
-    is <literal>minimal</literal>.  (They can guarantee crash safety more cheaply
-    by doing an <function>fsync</function> at the end than by writing WAL.)
-    This applies to the following commands:
-    <itemizedlist>
-     <listitem>
-      <para>
-       <command>CREATE TABLE AS SELECT</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>CREATE INDEX</command> (and variants such as
-       <command>ALTER TABLE ADD PRIMARY KEY</command>)
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>ALTER TABLE SET TABLESPACE</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>CLUSTER</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>COPY FROM</command>, when the target table has been
-       created or truncated earlier in the same transaction
-      </para>
-     </listitem>
-    </itemizedlist>
+    Aside from avoiding the time for the archiver or WAL sender to process the
+    WAL data, doing this will actually make certain commands faster, because
+    they do not to write WAL at all if <varname>wal_level</varname>
+    is <literal>minimal</literal> and the current subtransaction (or top-level
+    transaction) created or truncated the table or index they change.  (They
+    can guarantee crash safety more cheaply by doing
+    an <function>fsync</function> at the end than by writing WAL.)
    </para>
   </sect2>
 
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 553a6d67b1..66c52d6dd6 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1013,7 +1013,12 @@ gistGetFakeLSN(Relation rel)
 {
     static XLogRecPtr counter = FirstNormalUnloggedLSN;
 
-    if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
+    /*
+     * XXX before commit fix this.  This is not correct for
+     * RELPERSISTENCE_PERMANENT, but it suffices to make tests pass.
+     */
+    if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP
+        || rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
     {
         /*
          * Temporary relations are only accessible in our session, so a simple
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0128bb34ef..be19c34cbd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -21,7 +21,6 @@
  *        heap_multi_insert - insert multiple tuples into a relation
  *        heap_delete        - delete a tuple from a relation
  *        heap_update        - replace a tuple in a relation with another tuple
- *        heap_sync        - sync heap, for when no WAL has been written
  *
  * NOTES
  *      This file contains the heap_ routines which implement
@@ -1936,7 +1935,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (RelationNeedsWAL(relation))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2119,7 +2118,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+    needwal = RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -8920,46 +8919,6 @@ heap2_redo(XLogReaderState *record)
     }
 }
 
-/*
- *    heap_sync        - sync a heap, for use when no WAL has been written
- *
- * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
- *
- * Indexes are not touched.  (Currently, index operations associated with
- * the commands that use this are WAL-logged and so do not need fsync.
- * That behavior might change someday, but in any case it's likely that
- * any fsync decisions required would be per-index and hence not appropriate
- * to be done here.)
- */
-void
-heap_sync(Relation rel)
-{
-    /* non-WAL-logged tables never need fsync */
-    if (!RelationNeedsWAL(rel))
-        return;
-
-    /* main heap */
-    FlushRelationBuffers(rel);
-    /* FlushRelationBuffers will have opened rd_smgr */
-    smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
-
-    /* FSM is not critical, don't bother syncing it */
-
-    /* toast heap, if any */
-    if (OidIsValid(rel->rd_rel->reltoastrelid))
-    {
-        Relation    toastrel;
-
-        toastrel = table_open(rel->rd_rel->reltoastrelid, AccessShareLock);
-        FlushRelationBuffers(toastrel);
-        smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM);
-        table_close(toastrel, AccessShareLock);
-    }
-}
-
 /*
  * Mask a heap page before performing consistency checks on it.
  */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 92073fec54..07fe717faa 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -555,17 +555,6 @@ tuple_lock_retry:
     return result;
 }
 
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
-    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway / don't go through tableam)
-     */
-    if (options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(relation);
-}
-
 
 /* ------------------------------------------------------------------------
  * DDL related callbacks for heap AM.
@@ -698,7 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     IndexScanDesc indexScan;
     TableScanDesc tableScan;
     HeapScanDesc heapScan;
-    bool        use_wal;
     bool        is_system_catalog;
     Tuplesortstate *tuplesort;
     TupleDesc    oldTupDesc = RelationGetDescr(OldHeap);
@@ -713,12 +701,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     is_system_catalog = IsSystemRelation(OldHeap);
 
     /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a WAL-logged rel.
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
      */
-    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
-    /* use_wal off requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
 
     /* Preallocate values/isnull arrays */
@@ -728,7 +713,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
     /* Initialize the rewrite operation */
     rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
-                                 *multi_cutoff, use_wal);
+                                 *multi_cutoff);
 
 
     /* Set up sorting if wanted */
@@ -2515,7 +2500,6 @@ static const TableAmRoutine heapam_methods = {
     .tuple_delete = heapam_tuple_delete,
     .tuple_update = heapam_tuple_update,
     .tuple_lock = heapam_tuple_lock,
-    .finish_bulk_insert = heapam_finish_bulk_insert,
 
     .tuple_fetch_row_version = heapam_fetch_row_version,
     .tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d285b1f390..3e564838fa 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -136,7 +136,6 @@ typedef struct RewriteStateData
     Page        rs_buffer;        /* page currently being built */
     BlockNumber rs_blockno;        /* block where page will go */
     bool        rs_buffer_valid;    /* T if any tuples in buffer */
-    bool        rs_use_wal;        /* must we WAL-log inserts? */
     bool        rs_logical_rewrite; /* do we need to do logical rewriting */
     TransactionId rs_oldest_xmin;    /* oldest xmin used by caller to determine
                                      * tuple visibility */
@@ -230,15 +229,13 @@ static void logical_end_heap_rewrite(RewriteState state);
  * oldest_xmin    xid used by the caller to determine which tuples are dead
  * freeze_xid    xid before which tuples will be frozen
  * cutoff_multi    multixact before which multis will be removed
- * use_wal        should the inserts to the new heap be WAL-logged?
  *
  * Returns an opaque RewriteState, allocated in current memory context,
  * to be used in subsequent calls to the other functions.
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-                   TransactionId freeze_xid, MultiXactId cutoff_multi,
-                   bool use_wal)
+                   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
     RewriteState state;
     MemoryContext rw_cxt;
@@ -263,7 +260,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     /* new_heap needn't be empty, just locked */
     state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
     state->rs_buffer_valid = false;
-    state->rs_use_wal = use_wal;
     state->rs_oldest_xmin = oldest_xmin;
     state->rs_freeze_xid = freeze_xid;
     state->rs_cutoff_multi = cutoff_multi;
@@ -322,7 +318,7 @@ end_heap_rewrite(RewriteState state)
     /* Write the last page, if any */
     if (state->rs_buffer_valid)
     {
-        if (state->rs_use_wal)
+        if (RelationNeedsWAL(state->rs_new_rel))
             log_newpage(&state->rs_new_rel->rd_node,
                         MAIN_FORKNUM,
                         state->rs_blockno,
@@ -337,18 +333,14 @@ end_heap_rewrite(RewriteState state)
     }
 
     /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.
-     *
-     * It's obvious that we must do this when not WAL-logging. It's less
-     * obvious that we have to do it even if we did WAL-log the pages. The
+     * When we WAL-logged rel pages, we must nonetheless fsync them.  The
      * reason is the same as in storage.c's RelationCopyStorage(): we're
      * writing data that's not in shared buffers, and so a CHECKPOINT
      * occurring during the rewriteheap operation won't have fsync'd data we
      * wrote before the checkpoint.
      */
     if (RelationNeedsWAL(state->rs_new_rel))
-        heap_sync(state->rs_new_rel);
+        smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
 
     logical_end_heap_rewrite(state);
 
@@ -646,9 +638,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int            options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
@@ -687,7 +676,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
             /* Doesn't fit, so write out the existing page */
 
             /* XLOG stuff */
-            if (state->rs_use_wal)
+            if (RelationNeedsWAL(state->rs_new_rel))
                 log_newpage(&state->rs_new_rel->rd_node,
                             MAIN_FORKNUM,
                             state->rs_blockno,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 1dd39a9535..b61692aefc 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -31,18 +31,6 @@
  * them.  They will need to be re-read into shared buffers on first use after
  * the build finishes.
  *
- * Since the index will never be used unless it is completely built,
- * from a crash-recovery point of view there is no need to WAL-log the
- * steps of the build.  After completing the index build, we can just sync
- * the whole file to disk using smgrimmedsync() before exiting this module.
- * This can be seen to be sufficient for crash recovery by considering that
- * it's effectively equivalent to what would happen if a CHECKPOINT occurred
- * just after the index build.  However, it is clearly not sufficient if the
- * DBA is using the WAL log for PITR or replication purposes, since another
- * machine would not be able to reconstruct the index from WAL.  Therefore,
- * we log the completed index pages to WAL if and only if WAL archiving is
- * active.
- *
  * This code isn't concerned about the FSM at all. The caller is responsible
  * for initializing that.
  *
@@ -563,12 +551,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
     wstate.heap = btspool->heap;
     wstate.index = btspool->index;
     wstate.inskey = _bt_mkscankey(wstate.index, NULL);
-
-    /*
-     * We need to log index creation in WAL iff WAL archiving/streaming is
-     * enabled UNLESS the index isn't WAL-logged anyway.
-     */
-    wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
+    wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
 
     /* reserve the metapage */
     wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
@@ -1265,21 +1248,15 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
     _bt_uppershutdown(wstate, state);
 
     /*
-     * If the index is WAL-logged, we must fsync it down to disk before it's
-     * safe to commit the transaction.  (For a non-WAL-logged index we don't
-     * care since the index will be uninteresting after a crash anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the build. It's
-     * less obvious that we have to do it even if we did WAL-log the index
-     * pages.  The reason is that since we're building outside shared buffers,
-     * a CHECKPOINT occurring during the build has no way to flush the
-     * previously written data to disk (indeed it won't know the index even
-     * exists).  A crash later on would replay WAL from the checkpoint,
-     * therefore it wouldn't replay our earlier WAL entries. If we do not
-     * fsync those pages here, they might still not be on disk when the crash
-     * occurs.
+     * When we WAL-logged index pages, we must nonetheless fsync index files.
+     * Since we're building outside shared buffers, a CHECKPOINT occurring
+     * during the build has no way to flush the previously written data to
+     * disk (indeed it won't know the index even exists).  A crash later on
+     * would replay WAL from the checkpoint, therefore it wouldn't replay our
+     * earlier WAL entries. If we do not fsync those pages here, they might
+     * still not be on disk when the crash occurs.
      */
-    if (RelationNeedsWAL(wstate->index))
+    if (wstate->btws_use_wal)
     {
         RelationOpenSmgr(wstate->index);
         smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index b5a2cb2de8..641809cfda 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -717,6 +717,40 @@ then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
 
+Skipping WAL for New RelFileNode
+--------------------------------
+
+Under wal_level=minimal, if a change modifies a relfilenode that
+RollbackAndReleaseCurrentSubTransaction() would unlink, in-tree access methods
+write no WAL for that change.  For any access method, CommitTransaction()
+writes and fsyncs affected blocks before recording the commit.  This skipping
+is mandatory; if a WAL-writing change preceded a WAL-skipping change for the
+same block, REDO could overwrite the WAL-skipping change.  Code that writes
+WAL without calling RelationNeedsWAL() must check for this case.
+
+If skipping were not mandatory, a related problem would arise.  Suppose, under
+full_page_writes=off, a WAL-writing change follows a WAL-skipping change.
+When a WAL record contains no full-page image, REDO expects the page to match
+its contents from just before record insertion.  A WAL-skipping change may not
+reach disk at all, violating REDO's expectation.
+
+Prefer to do the same in future access methods.  However, two other approaches
+can work.  First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync().  Second, an access method can opt to write WAL
+unconditionally for permanent relations.  When using the second method, do not
+call RelationCopyStorage(), which skips WAL.
+
+This applies only to WAL records whose replay would modify bytes stored in the
+new relfilenode.  It does not apply to other records about the relfilenode,
+such as XLOG_SMGR_CREATE.  Because it operates at the level of individual
+relfilenodes, RelationNeedsWAL() can differ for tightly-coupled relations.
+Consider "CREATE TABLE t (); BEGIN; ALTER TABLE t ADD c text; ..." in which
+ALTER TABLE adds a TOAST relation.  The TOAST relation will skip WAL, while
+the table owning it will not.  ALTER TABLE SET TABLESPACE will cause a table
+to skip WAL, but that won't affect its indexes.
+
+
 Asynchronous Commit
 -------------------
 
@@ -820,13 +854,12 @@ Changes to a temp table are not WAL-logged, hence could reach disk in
 advance of T1's commit, but we don't care since temp table contents don't
 survive crashes anyway.
 
-Database writes made via any of the paths we have introduced to avoid WAL
-overhead for bulk updates are also safe.  In these cases it's entirely
-possible for the data to reach disk before T1's commit, because T1 will
-fsync it down to disk without any sort of interlock, as soon as it finishes
-the bulk update.  However, all these paths are designed to write data that
-no other transaction can see until after T1 commits.  The situation is thus
-not different from ordinary WAL-logged updates.
+Database writes that skip WAL for new relfilenodes are also safe.  In these
+cases it's entirely possible for the data to reach disk before T1's commit,
+because T1 will fsync it down to disk without any sort of interlock.  However,
+all these paths are designed to write data that no other transaction can see
+until after T1 commits.  The situation is thus not different from ordinary
+WAL-logged updates.
 
 Transaction Emulation during Recovery
 -------------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5c0d0f2af0..750f95c482 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2109,6 +2109,13 @@ CommitTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Synchronize files that are created and not WAL-logged during this
+     * transaction. This must happen before AtEOXact_RelationMap(), so that we
+     * don't see committed-but-broken files after a crash.
+     */
+    smgrDoPendingSyncs();
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
@@ -2342,6 +2349,13 @@ PrepareTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Synchronize files that are created and not WAL-logged during this
+     * transaction. This must happen before EndPrepare(), so that we don't see
+     * committed-but-broken files after a crash and COMMIT PREPARED.
+     */
+    smgrDoPendingSyncs();
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index aa9dca0036..dda1dea08b 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1043,8 +1043,13 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
                   BlockNumber startblk, BlockNumber endblk,
                   bool page_std)
 {
+    int            flags;
     BlockNumber blkno;
 
+    flags = REGBUF_FORCE_IMAGE;
+    if (page_std)
+        flags |= REGBUF_STANDARD;
+
     /*
      * Iterate over all the pages in the range. They are collected into
      * batches of XLR_MAX_BLOCK_ID pages, and a single WAL-record is written
@@ -1066,7 +1071,8 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
         nbufs = 0;
         while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
         {
-            Buffer        buf = ReadBuffer(rel, blkno);
+            Buffer        buf = ReadBufferExtended(rel, forkNum, blkno,
+                                                 RBM_NORMAL, NULL);
 
             LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -1088,7 +1094,7 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
         START_CRIT_SECTION();
         for (i = 0; i < nbufs; i++)
         {
-            XLogRegisterBuffer(i, bufpack[i], REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+            XLogRegisterBuffer(i, bufpack[i], flags);
             MarkBufferDirty(bufpack[i]);
         }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 446760ed6e..9561e30b08 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -544,6 +544,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry;
  * fields related to physical storage, like rd_rel, are initialized, so the
  * fake entry is only usable in low-level operations like ReadBuffer().
  *
+ * This is also used for syncing WAL-skipped files.
+ *
  * Caller must free the returned entry with FreeFakeRelcacheEntry().
  */
 Relation
@@ -552,18 +554,19 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
     FakeRelCacheEntry fakeentry;
     Relation    rel;
 
-    Assert(InRecovery);
-
     /* Allocate the Relation struct and all related space in one block. */
     fakeentry = palloc0(sizeof(FakeRelCacheEntryData));
     rel = (Relation) fakeentry;
 
     rel->rd_rel = &fakeentry->pgc;
     rel->rd_node = rnode;
-    /* We will never be working with temp rels during recovery */
+    /*
+     * We will never be working with temp rels during recovery or while
+     * syncing WAL-skipped files.
+     */
     rel->rd_backend = InvalidBackendId;
 
-    /* It must be a permanent table if we're in recovery. */
+    /* It must be a permanent table here */
     rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT;
 
     /* We don't know the name of the relation; use relfilenode instead */
@@ -572,9 +575,9 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
     /*
      * We set up the lockRelId in case anything tries to lock the dummy
      * relation.  Note that this is fairly bogus since relNode may be
-     * different from the relation's OID.  It shouldn't really matter though,
-     * since we are presumably running by ourselves and can't have any lock
-     * conflicts ...
+     * different from the relation's OID.  It shouldn't really matter though.
+     * In recovery, we are running by ourselves and can't have any lock
+     * conflicts.  While syncing, we already hold AccessExclusiveLock.
      */
     rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
     rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index b7bcdd9d0f..293ea9a9dd 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -440,6 +440,10 @@ heap_create(const char *relname,
                 break;
         }
     }
+    else
+    {
+        rel->rd_createSubid = InvalidSubTransactionId;
+    }
 
     return rel;
 }
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 056ea3d5d3..51c233dac6 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,9 +29,13 @@
 #include "miscadmin.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+/* GUC variables */
+int    wal_skip_threshold = 64;  /* in kilobytes */
+
 /*
  * We keep a list of all relations (represented as RelFileNode values)
  * that have been created or deleted in the current transaction.  When
@@ -58,6 +62,7 @@ typedef struct PendingRelDelete
     BackendId    backend;        /* InvalidBackendId if not a temp rel */
     bool        atCommit;        /* T=delete at commit; F=delete at abort */
     int            nestLevel;        /* xact nesting level of request */
+    bool        sync;            /* whether to fsync at commit */
     struct PendingRelDelete *next;    /* linked-list link */
 } PendingRelDelete;
 
@@ -114,6 +119,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     pending->backend = backend;
     pending->atCommit = false;    /* delete if abort */
     pending->nestLevel = GetCurrentTransactionNestLevel();
+    pending->sync =
+        relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded();
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
@@ -155,6 +162,7 @@ RelationDropStorage(Relation rel)
     pending->backend = rel->rd_backend;
     pending->atCommit = true;    /* delete if commit */
     pending->nestLevel = GetCurrentTransactionNestLevel();
+    pending->sync = false;
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
@@ -355,7 +363,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 
     /*
      * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a permanent relation.
+     * enabled AND it's a permanent relation.  This gives the same answer as
+     * "RelationNeedsWAL(rel) || copying_initfork", because we know the
+     * current operation created a new relfilenode.
      */
     use_wal = XLogIsNeeded() &&
         (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
@@ -397,24 +407,43 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
     }
 
     /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.  (For a temp or
-     * unlogged rel we don't care since the data will be gone after a crash
-     * anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the copy. It's
-     * less obvious that we have to do it even if we did WAL-log the copied
-     * pages. The reason is that since we're copying outside shared buffers, a
-     * CHECKPOINT occurring during the copy has no way to flush the previously
-     * written data to disk (indeed it won't know the new rel even exists).  A
-     * crash later on would replay WAL from the checkpoint, therefore it
-     * wouldn't replay our earlier WAL entries. If we do not fsync those pages
-     * here, they might still not be on disk when the crash occurs.
+     * When we WAL-logged rel pages, we must nonetheless fsync them.  The
+     * reason is that since we're copying outside shared buffers, a CHECKPOINT
+     * occurring during the copy has no way to flush the previously written
+     * data to disk (indeed it won't know the new rel even exists).  A crash
+     * later on would replay WAL from the checkpoint, therefore it wouldn't
+     * replay our earlier WAL entries. If we do not fsync those pages here,
+     * they might still not be on disk when the crash occurs.
      */
-    if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
+    if (use_wal || copying_initfork)
         smgrimmedsync(dst, forkNum);
 }
 
+/*
+ * RelFileNodeSkippingWAL - check if a BM_PERMANENT relfilenode is using WAL
+ *
+ *   Changes of certain relfilenodes must not write WAL; see "Skipping WAL for
+ *   New RelFileNode" in src/backend/access/transam/README.  Though it is
+ *   known from Relation efficiently, this function is intended for the code
+ *   paths not having access to Relation.
+ */
+bool
+RelFileNodeSkippingWAL(RelFileNode rnode)
+{
+    PendingRelDelete *pending;
+
+    if (XLogIsNeeded())
+        return false;  /* no permanent relfilenode skips WAL */
+
+    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    {
+        if (RelFileNodeEquals(pending->relnode, rnode) && pending->sync)
+            return true;
+    }
+
+    return false;
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -492,6 +521,145 @@ smgrDoPendingDeletes(bool isCommit)
     }
 }
 
+/*
+ *    smgrDoPendingSyncs() -- Take care of relation syncs at commit.
+ *
+ * This should be called before smgrDoPendingDeletes() at every commit or
+ * prepare. Also this should be called before emitting WAL record so that sync
+ * failure prevents commit.
+ */
+void
+smgrDoPendingSyncs(void)
+{
+    PendingRelDelete *pending;
+    HTAB    *delhash = NULL;
+
+    if (XLogIsNeeded())
+        return;  /* no relation can use this */
+
+    Assert(GetCurrentTransactionNestLevel() == 1);
+    AssertPendingSyncs_RelationCache();
+
+    /*
+     * Pending syncs on the relation that are to be deleted in this
+     * transaction-end should be ignored. Collect pending deletes that will
+     * happen in the following call to smgrDoPendingDeletes().
+     */
+    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    {
+        bool found PG_USED_FOR_ASSERTS_ONLY;
+
+        if (!pending->atCommit)
+            continue;
+
+        /* create the hash if not yet */
+        if (delhash == NULL)
+        {
+            HASHCTL hash_ctl;
+
+            memset(&hash_ctl, 0, sizeof(hash_ctl));
+            hash_ctl.keysize = sizeof(RelFileNode);
+            hash_ctl.entrysize = sizeof(RelFileNode);
+            hash_ctl.hcxt = CurrentMemoryContext;
+            delhash =
+                hash_create("pending del temporary hash", 8, &hash_ctl,
+                            HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+        }
+
+        (void) hash_search(delhash, (void *) &pending->relnode,
+                           HASH_ENTER, &found);
+        Assert(!found);
+    }
+
+    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    {
+        bool to_be_removed = false; /* don't sync if aborted */
+        ForkNumber fork;
+        BlockNumber nblocks[MAX_FORKNUM + 1];
+        BlockNumber total_blocks = 0;
+        SMgrRelation srel;
+
+        if (!pending->sync)
+            continue;
+        Assert(!pending->atCommit);
+
+        /* don't sync relnodes that is being deleted */
+        if (delhash)
+            hash_search(delhash, (void *) &pending->relnode,
+                        HASH_FIND, &to_be_removed);
+        if (to_be_removed)
+            continue;
+
+        /* Now the time to sync the rnode */
+        srel = smgropen(pending->relnode, pending->backend);
+
+        /*
+         * We emit newpage WAL records for smaller relations.
+         *
+         * Small WAL records have a chance to be emitted along with other
+         * backends' WAL records. We emit WAL records instead of syncing for
+         * files that are smaller than a certain threshold, expecting faster
+         * commit. The threshold is defined by the GUC wal_skip_threshold.
+         */
+        for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+        {
+            if (smgrexists(srel, fork))
+            {
+                BlockNumber n = smgrnblocks(srel, fork);
+
+                /* we shouldn't come here for unlogged relations */
+                Assert(fork != INIT_FORKNUM);
+
+                nblocks[fork] = n;
+                total_blocks += n;
+            }
+            else
+                nblocks[fork] = InvalidBlockNumber;
+        }
+
+        /*
+         * Sync file or emit WAL record for the file according to the total
+         * size.
+         */
+        if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
+        {
+            /* Flush all buffers then sync the file */
+            FlushRelationBuffersWithoutRelcache(srel, false);
+
+            for (fork = 0; fork <= MAX_FORKNUM; fork++)
+            {
+                if (smgrexists(srel, fork))
+                    smgrimmedsync(srel, fork);
+            }
+        }
+        else
+        {
+            /* Emit WAL records for all blocks. The file is small enough. */
+            for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
+            {
+                int    n        = nblocks[fork];
+                Relation rel;
+
+                if (!BlockNumberIsValid(n))
+                    continue;
+
+                /*
+                 * Emit WAL for the whole file.  Unfortunately we don't know
+                 * what kind of a page this is, so we have to log the full
+                 * page including any unused space.  ReadBufferExtended()
+                 * counts some pgstat events; unfortunately, we discard them.
+                 */
+                rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node);
+                log_newpage_range(rel, fork, 0, n, false);
+                FreeFakeRelcacheEntry(rel);
+            }
+        }
+    }
+
+    if (delhash)
+        hash_destroy(delhash);
+}
+
 /*
  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
  *
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index b8c349f245..093fff8c5c 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1014,6 +1014,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
                 relfilenode2;
     Oid            swaptemp;
     char        swptmpchr;
+    Relation    rel1;
 
     /* We need writable copies of both pg_class tuples. */
     relRelation = table_open(RelationRelationId, RowExclusiveLock);
@@ -1039,6 +1040,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
          */
         Assert(!target_is_pg_class);
 
+        /* swap relfilenodes, reltablespaces, relpersistence */
         swaptemp = relform1->relfilenode;
         relform1->relfilenode = relform2->relfilenode;
         relform2->relfilenode = swaptemp;
@@ -1173,6 +1175,15 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
         CacheInvalidateRelcacheByTuple(reltup2);
     }
 
+    /*
+     * Recognize that rel1's relfilenode (swapped from rel2) is new in this
+     * subtransaction. Since the next step for rel2 is deletion, don't bother
+     * recording the newness of its relfilenode.
+     */
+    rel1 = relation_open(r1, AccessExclusiveLock);
+    RelationAssumeNewRelfilenode(rel1);
+    relation_close(rel1, NoLock);
+
     /*
      * Post alter hook for modified relations. The change to r2 is always
      * internal, but r1 depends on the invocation context.
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 42a147b67d..607e2558a3 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2711,63 +2711,15 @@ CopyFrom(CopyState cstate)
                             RelationGetRelationName(cstate->rel))));
     }
 
-    /*----------
-     * Check to see if we can avoid writing WAL
-     *
-     * If archive logging/streaming is not enabled *and* either
-     *    - table was created in same transaction as this COPY
-     *    - data is being written to relfilenode created in this transaction
-     * then we can skip writing WAL.  It's safe because if the transaction
-     * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the table_finish_bulk_insert() at
-     * the bottom of this routine first.
-     *
-     * As mentioned in comments in utils/rel.h, the in-same-transaction test
-     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
-     * can be cleared before the end of the transaction. The exact case is
-     * when a relation sets a new relfilenode twice in same transaction, yet
-     * the second one fails in an aborted subtransaction, e.g.
-     *
-     * BEGIN;
-     * TRUNCATE t;
-     * SAVEPOINT save;
-     * TRUNCATE t;
-     * ROLLBACK TO save;
-     * COPY ...
-     *
-     * Also, if the target file is new-in-transaction, we assume that checking
-     * FSM for free space is a waste of time, even if we must use WAL because
-     * of archiving.  This could possibly be wrong, but it's unlikely.
-     *
-     * The comments for table_tuple_insert and RelationGetBufferForTuple
-     * specify that skipping WAL logging is only safe if we ensure that our
-     * tuples do not go into pages containing tuples from any other
-     * transactions --- but this must be the case if we have a new table or
-     * new relfilenode, so we need no additional work to enforce that.
-     *
-     * We currently don't support this optimization if the COPY target is a
-     * partitioned table as we currently only lazily initialize partition
-     * information when routing the first tuple to the partition.  We cannot
-     * know at this stage if we can perform this optimization.  It should be
-     * possible to improve on this, but it does mean maintaining heap insert
-     * option flags per partition and setting them when we first open the
-     * partition.
-     *
-     * This optimization is not supported for relation types which do not
-     * have any physical storage, with foreign tables and views using
-     * INSTEAD OF triggers entering in this category.  Partitioned tables
-     * are not supported as per the description above.
-     *----------
+    /*
+     * If the target file is new-in-transaction, we assume that checking FSM
+     * for free space is a waste of time.  This could possibly be wrong, but
+     * it's unlikely.
      */
-    /* createSubid is creation check, newRelfilenodeSubid is truncation check */
     if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
         (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
-         cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
-    {
+         cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
         ti_options |= TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
-    }
 
     /*
      * Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 2bf7083719..20225dc62f 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -552,16 +552,13 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
     myState->rel = intoRelationDesc;
     myState->reladdr = intoRelationAddr;
     myState->output_cid = GetCurrentCommandId(true);
+    myState->ti_options = TABLE_INSERT_SKIP_FSM;
+    myState->bistate = GetBulkInsertState();
 
     /*
-     * We can skip WAL-logging the insertions, unless PITR or streaming
-     * replication is in use. We can skip the FSM in any case.
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
-    myState->bistate = GetBulkInsertState();
-
-    /* Not using WAL requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(intoRelationDesc) == InvalidBlockNumber);
 }
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 537d0e8cef..ae809c9801 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -457,17 +457,13 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->transientrel = transientrel;
     myState->output_cid = GetCurrentCommandId(true);
-
-    /*
-     * We can skip WAL-logging the insertions, unless PITR or streaming
-     * replication is in use. We can skip the FSM in any case.
-     */
     myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
-    if (!XLogIsNeeded())
-        myState->ti_options |= TABLE_INSERT_SKIP_WAL;
     myState->bistate = GetBulkInsertState();
 
-    /* Not using WAL requires smgr_targblock be initially invalid */
+    /*
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
+     */
     Assert(RelationGetTargetBlock(transientrel) == InvalidBlockNumber);
 }
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 5440eb9015..0e2f5f4259 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4770,19 +4770,14 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         newrel = NULL;
 
     /*
-     * Prepare a BulkInsertState and options for table_tuple_insert. Because
-     * we're building a new heap, we can skip WAL-logging and fsync it to disk
-     * at the end instead (unless WAL-logging is required for archiving or
-     * streaming replication). The FSM is empty too, so don't bother using it.
+     * Prepare a BulkInsertState and options for table_tuple_insert.  The FSM
+     * is empty, so don't bother using it.
      */
     if (newrel)
     {
         mycid = GetCurrentCommandId(true);
         bistate = GetBulkInsertState();
-
         ti_options = TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
     }
     else
     {
@@ -12462,6 +12457,8 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     table_close(pg_class, RowExclusiveLock);
 
+    RelationAssumeNewRelfilenode(rel);
+
     relation_close(rel, NoLock);
 
     /* Make sure the reltablespace change is visible */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7ad10736d5..746ce477fc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3203,20 +3203,27 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
-    /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffersWithoutRelcache(rel->rd_smgr,
+                                        RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3233,7 +3240,7 @@ FlushRelationBuffers(Relation rel)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3263,18 +3270,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
@@ -3484,13 +3491,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
             (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
         {
             /*
-             * If we're in recovery we cannot dirty a page because of a hint.
-             * We can set the hint, just not dirty the page as a result so the
-             * hint is lost when we evict the page or shutdown.
+             * If we must not write WAL, due to a relfilenode-specific
+             * condition or being in recovery, don't dirty the page.  We can
+             * set the hint, just not dirty the page as a result so the hint
+             * is lost when we evict the page or shutdown.
              *
              * See src/backend/storage/page/README for longer discussion.
              */
-            if (RecoveryInProgress())
+            if (RecoveryInProgress() ||
+                RelFileNodeSkippingWAL(bufHdr->tag.rnode))
                 return;
 
             /*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 8a9eaf6430..1d408c339c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -248,11 +248,10 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
  * During replay, we would delete the file and then recreate it, which is fine
  * if the contents of the file were repopulated by subsequent WAL entries.
  * But if we didn't WAL-log insertions, but instead relied on fsyncing the
- * file after populating it (as for instance CLUSTER and CREATE INDEX do),
- * the contents of the file would be lost forever.  By leaving the empty file
- * until after the next checkpoint, we prevent reassignment of the relfilenode
- * number until it's safe, because relfilenode assignment skips over any
- * existing file.
+ * file after populating it (as we do at wal_level=minimal), the contents of
+ * the file would be lost forever.  By leaving the empty file until after the
+ * next checkpoint, we prevent reassignment of the relfilenode number until
+ * it's safe, because relfilenode assignment skips over any existing file.
  *
  * We do not need to go through this dance for temp relations, though, because
  * we never make WAL entries for temp rels, and so a temp rel poses no threat
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index ad1ff01b32..f3831f0077 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -262,6 +262,9 @@ static void RelationReloadIndexInfo(Relation relation);
 static void RelationReloadNailed(Relation relation);
 static void RelationFlushRelation(Relation relation);
 static void RememberToFreeTupleDescAtEOX(TupleDesc td);
+#ifdef USE_ASSERT_CHECKING
+static void AssertPendingSyncConsistency(Relation relation);
+#endif
 static void AtEOXact_cleanup(Relation relation, bool isCommit);
 static void AtEOSubXact_cleanup(Relation relation, bool isCommit,
                                 SubTransactionId mySubid, SubTransactionId parentSubid);
@@ -1095,6 +1098,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
     relation->rd_isnailed = false;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     switch (relation->rd_rel->relpersistence)
     {
         case RELPERSISTENCE_UNLOGGED:
@@ -1828,6 +1832,7 @@ formrdesc(const char *relationName, Oid relationReltype,
     relation->rd_isnailed = true;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     relation->rd_backend = InvalidBackendId;
     relation->rd_islocaltemp = false;
 
@@ -2035,6 +2040,12 @@ RelationIdGetRelation(Oid relationId)
     rd = RelationBuildDesc(relationId, true);
     if (RelationIsValid(rd))
         RelationIncrementReferenceCount(rd);
+
+#ifdef USE_ASSERT_CHECKING
+    if (!XLogIsNeeded() && RelationIsValid(rd))
+        AssertPendingSyncConsistency(rd);
+#endif
+
     return rd;
 }
 
@@ -2093,7 +2104,7 @@ RelationClose(Relation relation)
 #ifdef RELCACHE_FORCE_RELEASE
     if (RelationHasReferenceCountZero(relation) &&
         relation->rd_createSubid == InvalidSubTransactionId &&
-        relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
         RelationClearRelation(relation, false);
 #endif
 }
@@ -2509,13 +2520,13 @@ RelationClearRelation(Relation relation, bool rebuild)
          * problem.
          *
          * When rebuilding an open relcache entry, we must preserve ref count,
-         * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state.  Also
-         * attempt to preserve the pg_class entry (rd_rel), tupledesc,
-         * rewrite-rule, partition key, and partition descriptor substructures
-         * in place, because various places assume that these structures won't
-         * move while they are working with an open relcache entry.  (Note:
-         * the refcount mechanism for tupledescs might someday allow us to
-         * remove this hack for the tupledesc.)
+         * rd_*Subid, and rd_toastoid state.  Also attempt to preserve the
+         * pg_class entry (rd_rel), tupledesc, rewrite-rule, partition key,
+         * and partition descriptor substructures in place, because various
+         * places assume that these structures won't move while they are
+         * working with an open relcache entry.  (Note:  the refcount
+         * mechanism for tupledescs might someday allow us to remove this hack
+         * for the tupledesc.)
          *
          * Note that this process does not touch CurrentResourceOwner; which
          * is good because whatever ref counts the entry may have do not
@@ -2599,6 +2610,7 @@ RelationClearRelation(Relation relation, bool rebuild)
         /* creation sub-XIDs must be preserved */
         SWAPFIELD(SubTransactionId, rd_createSubid);
         SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+        SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
         /* un-swap rd_rel pointers, swap contents instead */
         SWAPFIELD(Form_pg_class, rd_rel);
         /* ... but actually, we don't have to update newrel->rd_rel */
@@ -2666,7 +2678,7 @@ static void
 RelationFlushRelation(Relation relation)
 {
     if (relation->rd_createSubid != InvalidSubTransactionId ||
-        relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
     {
         /*
          * New relcache entries are always rebuilt, not flushed; else we'd
@@ -2751,11 +2763,10 @@ RelationCacheInvalidateEntry(Oid relationId)
  *     relation cache and re-read relation mapping data.
  *
  *     This is currently used only to recover from SI message buffer overflow,
- *     so we do not touch new-in-transaction relations; they cannot be targets
- *     of cross-backend SI updates (and our own updates now go through a
- *     separate linked list that isn't limited by the SI message buffer size).
- *     Likewise, we need not discard new-relfilenode-in-transaction hints,
- *     since any invalidation of those would be a local event.
+ *     so we do not touch relations having new-in-transaction relfilenodes; they
+ *     cannot be targets of cross-backend SI updates (and our own updates now go
+ *     through a separate linked list that isn't limited by the SI message
+ *     buffer size).
  *
  *     We do this in two phases: the first pass deletes deletable items, and
  *     the second one rebuilds the rebuildable items.  This is essential for
@@ -2806,7 +2817,7 @@ RelationCacheInvalidate(void)
          * pending invalidations.
          */
         if (relation->rd_createSubid != InvalidSubTransactionId ||
-            relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+            relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
             continue;
 
         relcacheInvalsReceived++;
@@ -2918,6 +2929,40 @@ RememberToFreeTupleDescAtEOX(TupleDesc td)
     EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
 }
 
+#ifdef USE_ASSERT_CHECKING
+static void
+AssertPendingSyncConsistency(Relation relation)
+{
+    bool relcache_verdict =
+        relation->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&
+        ((relation->rd_createSubid != InvalidSubTransactionId &&
+          RELKIND_HAS_STORAGE(relation->rd_rel->relkind)) ||
+         relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId);
+    Assert(relcache_verdict == RelFileNodeSkippingWAL(relation->rd_node));
+}
+
+/*
+ * AssertPendingSyncs_RelationCache
+ *
+ *    Assert that relcache.c and storage.c agree on whether to skip WAL.
+ *
+ * This consistently detects relcache.c skipping WAL while storage.c is not
+ * skipping WAL.  It often fails to detect the reverse error, because
+ * invalidation will have destroyed the relcache entry.  It will detect the
+ * reverse error if something opens the relation after the DDL.
+ */
+void
+AssertPendingSyncs_RelationCache(void)
+{
+    HASH_SEQ_STATUS status;
+    RelIdCacheEnt *idhentry;
+
+    hash_seq_init(&status, RelationIdCache);
+    while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+        AssertPendingSyncConsistency(idhentry->reldesc);
+}
+#endif
+
 /*
  * AtEOXact_RelationCache
  *
@@ -3029,10 +3074,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
      *
      * During commit, reset the flag to zero, since we are now out of the
      * creating transaction.  During abort, simply delete the relcache entry
-     * --- it isn't interesting any longer.  (NOTE: if we have forgotten the
-     * new-ness of a new relation due to a forced cache flush, the entry will
-     * get deleted anyway by shared-cache-inval processing of the aborted
-     * pg_class insertion.)
+     * --- it isn't interesting any longer.
      */
     if (relation->rd_createSubid != InvalidSubTransactionId)
     {
@@ -3060,9 +3102,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
     }
 
     /*
-     * Likewise, reset the hint about the relfilenode being new.
+     * Likewise, reset any record of the relfilenode being new.
      */
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 }
 
 /*
@@ -3154,7 +3197,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
     }
 
     /*
-     * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+     * Likewise, update or drop any new-relfilenode-in-subtransaction.
      */
     if (relation->rd_newRelfilenodeSubid == mySubid)
     {
@@ -3163,6 +3206,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         else
             relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     }
+
+    if (relation->rd_firstRelfilenodeSubid == mySubid)
+    {
+        if (isCommit)
+            relation->rd_firstRelfilenodeSubid = parentSubid;
+        else
+            relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    }
 }
 
 
@@ -3252,6 +3303,7 @@ RelationBuildLocalRelation(const char *relname,
     /* it's being created in this transaction */
     rel->rd_createSubid = GetCurrentSubTransactionId();
     rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 
     /*
      * create a new tuple descriptor from the one passed in.  We do this
@@ -3549,14 +3601,29 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
      */
     CommandCounterIncrement();
 
-    /*
-     * Mark the rel as having been given a new relfilenode in the current
-     * (sub) transaction.  This is a hint that can be used to optimize later
-     * operations on the rel in the same transaction.
-     */
+    RelationAssumeNewRelfilenode(relation);
+}
+
+/*
+ * RelationAssumeNewRelfilenode
+ *
+ * Code that modifies pg_class.reltablespace or pg_class.relfilenode must call
+ * this.  The call shall precede any code that might insert WAL records whose
+ * replay would modify bytes in the new RelFileNode, and the call shall follow
+ * any WAL modifying bytes in the prior RelFileNode.  See struct RelationData.
+ * Ideally, call this as near as possible to the CommandCounterIncrement()
+ * that makes the pg_class change visible (before it or after it); that
+ * minimizes the chance of future development adding a forbidden WAL insertion
+ * between RelationAssumeNewRelfilenode() and CommandCounterIncrement().
+ */
+void
+RelationAssumeNewRelfilenode(Relation relation)
+{
     relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+    if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
 
-    /* Flag relation as needing eoxact cleanup (to remove the hint) */
+    /* Flag relation as needing eoxact cleanup (to clear these fields) */
     EOXactListAdd(relation);
 }
 
@@ -5591,6 +5658,7 @@ load_relcache_init_file(bool shared)
         rel->rd_fkeylist = NIL;
         rel->rd_createSubid = InvalidSubTransactionId;
         rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+        rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
         rel->rd_amcache = NULL;
         MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ba4edde71a..eecaf398c5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -36,6 +36,7 @@
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
+#include "catalog/storage.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
 #include "commands/trigger.h"
@@ -2651,6 +2652,18 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"wal_skip_threshold", PGC_USERSET, RESOURCES_DISK,
+            gettext_noop("Size of new file to fsync instead of writing WAL."),
+            NULL,
+            GUC_UNIT_KB
+        },
+        &wal_skip_threshold,
+        64,
+        0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
     {
         {"max_wal_senders", PGC_POSTMASTER, REPLICATION_SENDING,
             gettext_noop("Sets the maximum number of simultaneously running WAL sender processes."),
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6bc9..22916e8e0e 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
@@ -166,8 +165,6 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                                HeapTuple tup);
 
-extern void heap_sync(Relation relation);
-
 extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
                                                          ItemPointerData *items,
                                                          int nitems);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
                                        TransactionId OldestXmin, TransactionId FreezeXid,
-                                       MultiXactId MultiXactCutoff, bool use_wal);
+                                       MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
                                HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 64022917e2..aca88d0620 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -127,7 +127,7 @@ typedef struct TM_FailureData
 } TM_FailureData;
 
 /* "options" flag bits for table_tuple_insert */
-#define TABLE_INSERT_SKIP_WAL        0x0001
+/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
 #define TABLE_INSERT_SKIP_FSM        0x0002
 #define TABLE_INSERT_FROZEN            0x0004
 #define TABLE_INSERT_NO_LOGICAL        0x0008
@@ -409,9 +409,8 @@ typedef struct TableAmRoutine
 
     /*
      * Perform operations necessary to complete insertions made via
-     * tuple_insert and multi_insert with a BulkInsertState specified. This
-     * may for example be used to flush the relation, when the
-     * TABLE_INSERT_SKIP_WAL option was used.
+     * tuple_insert and multi_insert with a BulkInsertState specified. In-tree
+     * access methods ceased to use this.
      *
      * Typically callers of tuple_insert and multi_insert will just pass all
      * the flags that apply to them, and each AM has to decide which of them
@@ -1087,10 +1086,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * The options bitmask allows the caller to specify options that may change the
  * behaviour of the AM. The AM will ignore options that it does not support.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.
@@ -1309,10 +1304,9 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
 }
 
 /*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
+ * Perform operations necessary to complete insertions made via tuple_insert
+ * and multi_insert with a BulkInsertState specified. In-tree access methods
+ * ceased to use this.
  */
 static inline void
 table_finish_bulk_insert(Relation rel, int options)
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3579d3f3eb..108115a023 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,18 +19,23 @@
 #include "storage/smgr.h"
 #include "utils/relcache.h"
 
+/* GUC variables */
+extern int    wal_skip_threshold;
+
 extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
 extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
                                 ForkNumber forkNum, char relpersistence);
+extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
 
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingSyncs(void);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 17b97f7e38..8097d5ab22 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -49,6 +49,9 @@ typedef enum
 /* forward declared, to avoid having to expose buf_internals.h here */
 struct WritebackContext;
 
+/* forward declared, to avoid including smgr.h here */
+struct SMgrRelationData;
+
 /* in globals.c ... this duplicates miscadmin.h */
 extern PGDLLIMPORT int NBuffers;
 
@@ -189,6 +192,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                                    ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelcache(struct SMgrRelationData *smgr,
+                                                bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                                    int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 31d8a1a10e..9db3d23897 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -63,22 +63,40 @@ typedef struct RelationData
                                  * rd_replidindex) */
     bool        rd_statvalid;    /* is rd_statlist valid? */
 
-    /*
+    /*----------
      * rd_createSubid is the ID of the highest subtransaction the rel has
-     * survived into; or zero if the rel was not created in the current top
-     * transaction.  This can be now be relied on, whereas previously it could
-     * be "forgotten" in earlier releases. Likewise, rd_newRelfilenodeSubid is
-     * the ID of the highest subtransaction the relfilenode change has
-     * survived into, or zero if not changed in the current transaction (or we
-     * have forgotten changing it). rd_newRelfilenodeSubid can be forgotten
-     * when a relation has multiple new relfilenodes within a single
-     * transaction, with one of them occurring in a subsequently aborted
-     * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
-     * ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+     * survived into or zero if the rel was not created in the current top
+     * transaction.  rd_firstRelfilenodeSubid is the ID of the highest
+     * subtransaction an rd_node change has survived into or zero if rd_node
+     * matches the value it had at the start of the current top transaction.
+     * (Rolling back the subtransaction that rd_firstRelfilenodeSubid denotes
+     * would restore rd_node to the value it had at the start of the current
+     * top transaction.  Rolling back any lower subtransaction would not.)
+     * Their accuracy is critical to RelationNeedsWAL().
+     *
+     * rd_newRelfilenodeSubid is the ID of the highest subtransaction the
+     * most-recent relfilenode change has survived into or zero if not changed
+     * in the current transaction (or we have forgotten changing it).  This
+     * field is accurate when non-zero, but it can be zero when a relation has
+     * multiple new relfilenodes within a single transaction, with one of them
+     * occurring in a subsequently aborted subtransaction, e.g.
+     *        BEGIN;
+     *        TRUNCATE t;
+     *        SAVEPOINT save;
+     *        TRUNCATE t;
+     *        ROLLBACK TO save;
+     *        -- rd_newRelfilenodeSubid is now forgotten
+     *
+     * These fields are read-only outside relcache.c.  Other files trigger
+     * rd_node changes by updating pg_class.reltablespace and/or
+     * pg_class.relfilenode.  They must call RelationAssumeNewRelfilenode() to
+     * update these fields.
      */
     SubTransactionId rd_createSubid;    /* rel was created in current xact */
-    SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
-                                                 * current xact */
+    SubTransactionId rd_newRelfilenodeSubid;    /* highest subxact changing
+                                                 * rd_node to current value */
+    SubTransactionId rd_firstRelfilenodeSubid;    /* highest subxact changing
+                                                 * rd_node to any value */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
@@ -520,9 +538,16 @@ typedef struct ViewOptions
 /*
  * RelationNeedsWAL
  *        True if relation needs WAL.
- */
-#define RelationNeedsWAL(relation) \
-    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.  See "Skipping WAL for New
+ * RelFileNode" in src/backend/access/transam/README.
+ */
+#define RelationNeedsWAL(relation)                                        \
+    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&    \
+     (XLogIsNeeded() ||                                                    \
+      (relation->rd_createSubid == InvalidSubTransactionId &&            \
+       relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
 
 /*
  * RelationUsesLocalBuffers
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 2f2ace35b0..d3e8348c1b 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -105,9 +105,10 @@ extern Relation RelationBuildLocalRelation(const char *relname,
                                            char relkind);
 
 /*
- * Routine to manage assignment of new relfilenode to a relation
+ * Routines to manage assignment of new relfilenode to a relation
  */
 extern void RelationSetNewRelfilenode(Relation relation, char persistence);
+extern void RelationAssumeNewRelfilenode(Relation relation);
 
 /*
  * Routines for flushing/rebuilding relcache entries in various scenarios
@@ -120,6 +121,11 @@ extern void RelationCacheInvalidate(void);
 
 extern void RelationCloseSmgrByOid(Oid relationId);
 
+#ifdef USE_ASSERT_CHECKING
+extern void AssertPendingSyncs_RelationCache(void);
+#else
+#define AssertPendingSyncs_RelationCache() do {} while (0)
+#endif
 extern void AtEOXact_RelationCache(bool isCommit);
 extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid,
                                       SubTransactionId parentSubid);
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 297b8fbd6f..1ddde3ecce 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2354,6 +2354,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
         fputs("log_lock_waits = on\n", pg_conf);
         fputs("log_temp_files = 128kB\n", pg_conf);
         fputs("max_prepared_transactions = 2\n", pg_conf);
+        fputs("wal_level = minimal\n", pg_conf); /* XXX before commit remove */
+        fputs("max_wal_senders = 0\n", pg_conf);
 
         for (sl = temp_configs; sl != NULL; sl = sl->next)
         {
-- 
2.23.0

From 6b69e19bdae8b282a75ebf373573cdb96adeef06 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 27 Nov 2019 07:38:46 -0500
Subject: [PATCH v26 2/6] change swap_relation_files

The current patch doesn't adjust the new relation in
swap_relation_files. This inhibits the relcache from
invalidated. Adjust relache of the new relfilenode.
Change lock level for relcache adjusting.
---
 src/backend/commands/cluster.c | 28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 093fff8c5c..af7733eef4 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1015,6 +1015,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
     Oid            swaptemp;
     char        swptmpchr;
     Relation    rel1;
+    Relation    rel2;
 
     /* We need writable copies of both pg_class tuples. */
     relRelation = table_open(RelationRelationId, RowExclusiveLock);
@@ -1177,12 +1178,31 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
 
     /*
      * Recognize that rel1's relfilenode (swapped from rel2) is new in this
-     * subtransaction. Since the next step for rel2 is deletion, don't bother
-     * recording the newness of its relfilenode.
+     * subtransaction. However the next step for rel2 is deletion, we need to
+     * turn off the newness of its relfilenode, that allows the relcache to be
+     * flushed. Requried lock must be held before getting here so we take
+     * AccessShareLock in case no lock is acquired. Since command counter is
+     * not advanced the relcache entries has the contens before the above
+     * updates. We don't bother incrementing it and swap their contents
+     * directly.
+     */
+    rel1 = relation_open(r1, AccessShareLock);
+    rel2 = relation_open(r2, AccessShareLock);
+
+    /* swap relfilenodes */
+    rel1->rd_node.relNode = relfilenode2;
+    rel2->rd_node.relNode = relfilenode1;
+
+    /*
+     * Adjust newness flags. relfilenode2 is already added to EOXact array so
+     * we don't need to do that again here. We assume the new file is created
+     * in the current subtransaction.
      */
-    rel1 = relation_open(r1, AccessExclusiveLock);
     RelationAssumeNewRelfilenode(rel1);
-    relation_close(rel1, NoLock);
+    rel2->rd_createSubid = InvalidSubTransactionId;
+
+    relation_close(rel1, AccessShareLock);
+    relation_close(rel2, AccessShareLock);
 
     /*
      * Post alter hook for modified relations. The change to r2 is always
-- 
2.23.0

From 061e02878dcb3e2a6a54afb591dfec2f3ef88550 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 19:33:18 +0900
Subject: [PATCH v26 3/6] Improve the performance of relation syncs.

We can improve performance of syncing multiple files at once in the
same way as b41669118. This reduces the number of scans on the whole
shared_bufffers from the number of synced relations to one.
---
 src/backend/catalog/storage.c       |  28 +++++--
 src/backend/storage/buffer/bufmgr.c | 113 ++++++++++++++++++++++++++++
 src/backend/storage/smgr/smgr.c     |  37 +++++++++
 src/include/storage/bufmgr.h        |   1 +
 src/include/storage/smgr.h          |   1 +
 5 files changed, 174 insertions(+), 6 deletions(-)

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 51c233dac6..65811b2a9e 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -533,6 +533,9 @@ smgrDoPendingSyncs(void)
 {
     PendingRelDelete *pending;
     HTAB    *delhash = NULL;
+    int            nrels = 0,
+                maxrels = 0;
+    SMgrRelation *srels = NULL;
 
     if (XLogIsNeeded())
         return;  /* no relation can use this */
@@ -573,7 +576,7 @@ smgrDoPendingSyncs(void)
 
     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
     {
-        bool to_be_removed = false; /* don't sync if aborted */
+        bool to_be_removed = false;
         ForkNumber fork;
         BlockNumber nblocks[MAX_FORKNUM + 1];
         BlockNumber total_blocks = 0;
@@ -623,14 +626,21 @@ smgrDoPendingSyncs(void)
          */
         if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
         {
-            /* Flush all buffers then sync the file */
-            FlushRelationBuffersWithoutRelcache(srel, false);
+            /* relations to sync are passed to smgrdosyncall at once */
 
-            for (fork = 0; fork <= MAX_FORKNUM; fork++)
+            /* allocate the initial array, or extend it, if needed */
+            if (maxrels == 0)
             {
-                if (smgrexists(srel, fork))
-                    smgrimmedsync(srel, fork);
+                maxrels = 8;
+                srels = palloc(sizeof(SMgrRelation) * maxrels);
             }
+            else if (maxrels <= nrels)
+            {
+                maxrels *= 2;
+                srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+            }
+
+            srels[nrels++] = srel;
         }
         else
         {
@@ -658,6 +668,12 @@ smgrDoPendingSyncs(void)
 
     if (delhash)
         hash_destroy(delhash);
+
+    if (nrels > 0)
+    {
+        smgrdosyncall(srels, nrels);
+        pfree(srels);
+    }
 }
 
 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 746ce477fc..e0c0b825e9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -105,6 +105,19 @@ typedef struct CkptTsStatus
     int            index;
 } CkptTsStatus;
 
+/*
+ * Type for array used to sort SMgrRelations
+ *
+ * FlushRelFileNodesAllBuffers shares the same comparator function with
+ * DropRelFileNodeBuffers. Pointer to this struct and RelFileNode must
+ * be compatible.
+ */
+typedef struct SMgrSortArray
+{
+    RelFileNode        rnode;    /* This must be the first member */
+    SMgrRelation    srel;
+} SMgrSortArray;
+
 /* GUC variables */
 bool        zero_damaged_pages = false;
 int            bgwriter_lru_maxpages = 100;
@@ -3290,6 +3303,106 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
     }
 }
 
+/* ---------------------------------------------------------------------
+ *        FlushRelFileNodesAllBuffers
+ *
+ *        This function flushes out the buffer pool all the pages of all
+ *      forks of the specified smgr relations.  It's equivalent to
+ *      calling FlushRelationBuffers once per fork per relation, but the
+ *      parameter is not Relation but SMgrRelation
+ * --------------------------------------------------------------------
+ */
+void
+FlushRelFileNodesAllBuffers(SMgrRelation *smgrs, int nrels)
+{
+    int            i;
+    SMgrSortArray *srels;
+    bool        use_bsearch;
+
+    if (nrels == 0)
+        return;
+
+    /* fill-in array for qsort */
+    srels = palloc(sizeof(SMgrSortArray) * nrels);
+
+    for (i = 0 ; i < nrels ; i++)
+    {
+        Assert (!RelFileNodeBackendIsTemp(smgrs[i]->smgr_rnode));
+
+        srels[i].rnode = smgrs[i]->smgr_rnode.node;
+        srels[i].srel  = smgrs[i];
+    }
+
+    /*
+     * Save the bsearch overhead for low number of relations to
+     * sync. See DropRelFileNodesAllBuffers for details. The name DROP_*
+     * is for historical reasons.
+     */
+    use_bsearch = nrels > DROP_RELS_BSEARCH_THRESHOLD;
+
+    /* sort the list of SMgrRelations if necessary */
+    if (use_bsearch)
+        pg_qsort(srels, nrels, sizeof(SMgrSortArray), rnode_comparator);
+
+    /* Make sure we can handle the pin inside the loop */
+    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+    for (i = 0; i < NBuffers; i++)
+    {
+        SMgrSortArray *srelent = NULL;
+        BufferDesc *bufHdr = GetBufferDescriptor(i);
+        uint32        buf_state;
+
+        /*
+         * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
+         * and saves some cycles.
+         */
+
+        if (!use_bsearch)
+        {
+            int            j;
+
+            for (j = 0; j < nrels; j++)
+            {
+                if (RelFileNodeEquals(bufHdr->tag.rnode, srels[j].rnode))
+                {
+                    srelent = &srels[j];
+                    break;
+                }
+            }
+
+        }
+        else
+        {
+            srelent = bsearch((const void *) &(bufHdr->tag.rnode),
+                              srels, nrels, sizeof(SMgrSortArray),
+                              rnode_comparator);
+        }
+
+        /* buffer doesn't belong to any of the given relfilenodes; skip it */
+        if (srelent == NULL)
+            continue;
+
+        /* Ensure there's a free array slot for PinBuffer_Locked */
+        ReservePrivateRefCountEntry();
+
+        buf_state = LockBufHdr(bufHdr);
+        if (RelFileNodeEquals(bufHdr->tag.rnode, srelent->rnode) &&
+            (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+        {
+            PinBuffer_Locked(bufHdr);
+            LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+            FlushBuffer(bufHdr, srelent->srel);
+            LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+            UnpinBuffer(bufHdr, true);
+        }
+        else
+            UnlockBufHdr(bufHdr, buf_state);
+    }
+
+    pfree(srels);
+}
+
 /* ---------------------------------------------------------------------
  *        FlushDatabaseBuffers
  *
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index b50c69b438..191b52ab43 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -388,6 +388,43 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
     smgrsw[which].smgr_unlink(rnode, InvalidForkNumber, isRedo);
 }
 
+/*
+ *    smgrdosyncall() -- Immediately sync all forks of all given relations
+ *
+ *        All forks of all given relations are syncd out to the store.
+ *
+ *        This is equivalent to flusing all buffers FlushRelationBuffers for each
+ *        smgr relation then calling smgrimmedsync for all forks of each smgr
+ *        relation, but it's significantly quicker so should be preferred when
+ *        possible.
+ */
+void
+smgrdosyncall(SMgrRelation *rels, int nrels)
+{
+    int            i = 0;
+    ForkNumber    forknum;
+
+    if (nrels == 0)
+        return;
+
+    /* We need to flush all buffers for the relations before sync. */
+    FlushRelFileNodesAllBuffers(rels, nrels);
+
+    /*
+     * Sync the physical file(s).
+     */
+    for (i = 0; i < nrels; i++)
+    {
+        int            which = rels[i]->smgr_which;
+
+        for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+        {
+            if (smgrsw[which].smgr_exists(rels[i], forknum))
+                smgrsw[which].smgr_immedsync(rels[i], forknum);
+        }
+    }
+}
+
 /*
  *    smgrdounlinkall() -- Immediately unlink all forks of all given relations
  *
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 8097d5ab22..558bac7e05 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -197,6 +197,7 @@ extern void FlushRelationBuffersWithoutRelcache(struct SMgrRelationData *smgr,
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                                    int nforks, BlockNumber *firstDelBlock);
+extern void FlushRelFileNodesAllBuffers(struct SMgrRelationData **smgrs, int nrels);
 extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes);
 extern void DropDatabaseBuffers(Oid dbid);
 
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 1543d8d870..31a5ecd059 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -89,6 +89,7 @@ extern void smgrcloseall(void);
 extern void smgrclosenode(RelFileNodeBackend rnode);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdounlink(SMgrRelation reln, bool isRedo);
+extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
                        BlockNumber blocknum, char *buffer, bool skipFsync);
-- 
2.23.0

From 25aa85b8b0c0b329de6b84942759797bfc912461 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 19:28:35 +0900
Subject: [PATCH v26 4/6] Revert FlushRelationBuffersWithoutRelcache.

The previous patch makes the function useless. Revert it.
---
 src/backend/storage/buffer/bufmgr.c | 27 ++++++++++-----------------
 src/include/storage/bufmgr.h        |  2 --
 2 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e0c0b825e9..56314653ae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3216,27 +3216,20 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    RelationOpenSmgr(rel);
-
-    FlushRelationBuffersWithoutRelcache(rel->rd_smgr,
-                                        RelationUsesLocalBuffers(rel));
-}
-
-void
-FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
-{
-    RelFileNode rnode = smgr->smgr_rnode.node;
-    int i;
+    int            i;
     BufferDesc *bufHdr;
 
-    if (islocal)
+    /* Open rel at the smgr level if not already done */
+    RelationOpenSmgr(rel);
+
+    if (RelationUsesLocalBuffers(rel))
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
@@ -3253,7 +3246,7 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
 
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(smgr,
+                smgrwrite(rel->rd_smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3283,18 +3276,18 @@ FlushRelationBuffersWithoutRelcache(SMgrRelation smgr, bool islocal)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, smgr);
+            FlushBuffer(bufHdr, rel->rd_smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 558bac7e05..3f85e8c6fe 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -192,8 +192,6 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                                    ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
-extern void FlushRelationBuffersWithoutRelcache(struct SMgrRelationData *smgr,
-                                                bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                                    int nforks, BlockNumber *firstDelBlock);
-- 
2.23.0

From 29af080eb433af96baf0e64de0dcbded7a128263 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 16:12:03 +0900
Subject: [PATCH v26 5/6] Fix gistGetFakeLSN()

GiST needs to set page LSN to monotically incresing numbers on updates
even if it is not WAL-logged at all.  We use a simple counter for
UNLOGGESD/TEMP relations but the number must be smaller than the LSN
at the next commit for WAL-skipped relations. WAL-insertione pointer
works in major cases but we sometimes need to emit a WAL record to
generate an unique LSN for update. This patch adds a new WAL record
kind XLOG_GIST_ASSIGN_LSN, which conveys no substantial content and
emits it if needed.
---
 src/backend/access/gist/gistutil.c     | 38 ++++++++++++++++++--------
 src/backend/access/gist/gistxlog.c     | 21 ++++++++++++++
 src/backend/access/rmgrdesc/gistdesc.c |  5 ++++
 src/include/access/gist_private.h      |  2 ++
 src/include/access/gistxlog.h          |  1 +
 5 files changed, 56 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 66c52d6dd6..8347673c5e 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1004,28 +1004,44 @@ gistproperty(Oid index_oid, int attno,
 }
 
 /*
- * Temporary and unlogged GiST indexes are not WAL-logged, but we need LSNs
- * to detect concurrent page splits anyway. This function provides a fake
- * sequence of LSNs for that purpose.
+ * Temporary, unlogged GiST and WAL-skipped indexes are not WAL-logged, but we
+ * need LSNs to detect concurrent page splits anyway. This function provides a
+ * fake sequence of LSNs for that purpose.
  */
 XLogRecPtr
 gistGetFakeLSN(Relation rel)
 {
-    static XLogRecPtr counter = FirstNormalUnloggedLSN;
-
-    /*
-     * XXX before commit fix this.  This is not correct for
-     * RELPERSISTENCE_PERMANENT, but it suffices to make tests pass.
-     */
-    if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP
-        || rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+    if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
     {
         /*
          * Temporary relations are only accessible in our session, so a simple
          * backend-local counter will do.
          */
+        static XLogRecPtr counter = FirstNormalUnloggedLSN;
+
         return counter++;
     }
+    else if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+    {
+        /*
+         * WAL-logging on this relation will start after commit, so the LSN
+         * must be distinct numbers smaller than the LSN at the next
+         * commit. Emit a dummy WAL record if insert-LSN hasn't advanced after
+         * the last call.
+         */
+        static XLogRecPtr lastlsn = InvalidXLogRecPtr;
+        XLogRecPtr currlsn = GetXLogInsertRecPtr();
+
+        /* Shouldn't be called for WAL-logging relations */
+        Assert(!RelationNeedsWAL(rel));
+
+        /* No need for an actual record if we alredy have a distinct LSN */
+        if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
+            currlsn = gistXLogAssignLSN();
+
+        lastlsn = currlsn;
+        return currlsn;
+    }
     else
     {
         /*
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 3b28f54646..ce17bc9dc3 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -449,6 +449,9 @@ gist_redo(XLogReaderState *record)
         case XLOG_GIST_PAGE_DELETE:
             gistRedoPageDelete(record);
             break;
+        case XLOG_GIST_ASSIGN_LSN:
+            /* nop. See gistGetFakeLSN(). */
+            break;
         default:
             elog(PANIC, "gist_redo: unknown op code %u", info);
     }
@@ -592,6 +595,24 @@ gistXLogPageDelete(Buffer buffer, FullTransactionId xid,
     return recptr;
 }
 
+/*
+ * Write an empty XLOG record to assign a distinct LSN.
+ */
+XLogRecPtr
+gistXLogAssignLSN(void)
+{
+    int dummy = 0;
+
+    /*
+     * Records other than SWITCH_WAL must have content. We use an integer 0 to
+     * follow the restriction.
+     */
+    XLogBeginInsert();
+    XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
+    XLogRegisterData((char*) &dummy, sizeof(dummy));
+    return XLogInsert(RM_GIST_ID, XLOG_GIST_ASSIGN_LSN);
+}
+
 /*
  * Write XLOG record about reuse of a deleted page.
  */
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index eccb6fd942..48cda40ac0 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -80,6 +80,9 @@ gist_desc(StringInfo buf, XLogReaderState *record)
         case XLOG_GIST_PAGE_DELETE:
             out_gistxlogPageDelete(buf, (gistxlogPageDelete *) rec);
             break;
+        case XLOG_GIST_ASSIGN_LSN:
+            /* No details to write out */
+            break;
     }
 }
 
@@ -104,6 +107,8 @@ gist_identify(uint8 info)
             break;
         case XLOG_GIST_PAGE_DELETE:
             id = "PAGE_DELETE";
+        case XLOG_GIST_ASSIGN_LSN:
+            id = "ASSIGN_LSN";
             break;
     }
 
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index a409975db1..3455dd242d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -455,6 +455,8 @@ extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
                                 BlockNumber origrlink, GistNSN oldnsn,
                                 Buffer leftchild, bool markfollowright);
 
+extern XLogRecPtr gistXLogAssignLSN(void);
+
 /* gistget.c */
 extern bool gistgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index e44922d915..1eae06c0fb 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -26,6 +26,7 @@
  /* #define XLOG_GIST_INSERT_COMPLETE     0x40 */    /* not used anymore */
  /* #define XLOG_GIST_CREATE_INDEX         0x50 */    /* not used anymore */
 #define XLOG_GIST_PAGE_DELETE        0x60
+#define XLOG_GIST_ASSIGN_LSN        0x70    /* nop, assign an new LSN */
 
 /*
  * Backup Blk 0: updated page.
-- 
2.23.0

From 70d8236c375c6dc115e6023707b8a53a28f0b872 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 26 Nov 2019 21:25:09 +0900
Subject: [PATCH v26 6/6] Sync files shrinked by truncation

If truncation made a WAL-skipped file become smaller at commit than
the maximum size during the transaction, the file must not be
at-commit-WAL-logged and must be synced.
---
 src/backend/access/transam/xact.c |   5 +-
 src/backend/catalog/storage.c     | 161 +++++++++++++++++++-----------
 src/include/catalog/storage.h     |   2 +-
 3 files changed, 106 insertions(+), 62 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 750f95c482..f681cd3a23 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2114,7 +2114,7 @@ CommitTransaction(void)
      * transaction. This must happen before AtEOXact_RelationMap(), so that we
      * don't see committed-but-broken files after a crash.
      */
-    smgrDoPendingSyncs();
+    smgrDoPendingSyncs(true);
 
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
@@ -2354,7 +2354,7 @@ PrepareTransaction(void)
      * transaction. This must happen before EndPrepare(), so that we don't see
      * committed-but-broken files after a crash and COMMIT PREPARED.
      */
-    smgrDoPendingSyncs();
+    smgrDoPendingSyncs(true);
 
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
@@ -2674,6 +2674,7 @@ AbortTransaction(void)
      */
     AfterTriggerEndXact(false); /* 'false' means it's abort */
     AtAbort_Portals();
+    smgrDoPendingSyncs(false);
     AtEOXact_LargeObject(false);
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 65811b2a9e..aa68c77d44 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -62,11 +62,17 @@ typedef struct PendingRelDelete
     BackendId    backend;        /* InvalidBackendId if not a temp rel */
     bool        atCommit;        /* T=delete at commit; F=delete at abort */
     int            nestLevel;        /* xact nesting level of request */
-    bool        sync;            /* whether to fsync at commit */
     struct PendingRelDelete *next;    /* linked-list link */
 } PendingRelDelete;
 
+typedef struct pendingSync
+{
+    RelFileNode rnode;
+    BlockNumber max_truncated;
+} pendingSync;
+
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+HTAB *pendingSyncHash = NULL;
 
 /*
  * RelationCreateStorage
@@ -119,11 +125,39 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     pending->backend = backend;
     pending->atCommit = false;    /* delete if abort */
     pending->nestLevel = GetCurrentTransactionNestLevel();
-    pending->sync =
-        relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded();
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
+    /*
+     * If the relation needs at-commit sync, we also need to track the maximum
+     * unsynced truncated block used to decide whether we can WAL-logging or we
+     * must sync the file in smgrDoPendingSyncs.
+     */
+    if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+    {
+        pendingSync *pending;
+        bool         found;
+
+        /* we sync only permanent relations */
+        Assert(backend == InvalidBackendId);
+
+        if (!pendingSyncHash)
+        {
+            HASHCTL ctl;
+
+            ctl.keysize = sizeof(RelFileNode);
+            ctl.entrysize =  sizeof(pendingSync);
+            ctl.hcxt = TopTransactionContext;
+            pendingSyncHash =
+                hash_create("max truncatd block hash",
+                            16, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+        }
+
+        pending = hash_search(pendingSyncHash, &rnode, HASH_ENTER, &found);
+        Assert(!found);
+        pending->max_truncated = InvalidBlockNumber;
+    }
+
     return srel;
 }
 
@@ -162,7 +196,6 @@ RelationDropStorage(Relation rel)
     pending->backend = rel->rd_backend;
     pending->atCommit = true;    /* delete if commit */
     pending->nestLevel = GetCurrentTransactionNestLevel();
-    pending->sync = false;
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
@@ -320,6 +353,22 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
         if (fsm || vm)
             XLogFlush(lsn);
     }
+    else if (pendingSyncHash)
+    {
+        pendingSync *pending;
+
+        /* Record largest maybe-unsynced block of files under tracking  */
+        pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
+                              HASH_FIND, NULL);
+        if (pending)
+        {
+            BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
+
+            if (!BlockNumberIsValid(pending->max_truncated) ||
+                pending->max_truncated < nblocks)
+                pending->max_truncated = nblocks;
+        }
+    }
 
     /* Do the real work to truncate relation forks */
     smgrtruncate(rel->rd_smgr, forks, nforks, blocks);
@@ -430,18 +479,17 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 bool
 RelFileNodeSkippingWAL(RelFileNode rnode)
 {
-    PendingRelDelete *pending;
-
     if (XLogIsNeeded())
         return false;  /* no permanent relfilenode skips WAL */
 
-    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
-    {
-        if (RelFileNodeEquals(pending->relnode, rnode) && pending->sync)
-            return true;
-    }
+    if (!pendingSyncHash)
+        return false;  /* we don't have a to-be-synced relation */
 
-    return false;
+    /* the relation is not tracked as to-be-synced */
+    if (hash_search(pendingSyncHash, &rnode, HASH_FIND, NULL) == NULL)
+        return false;
+
+    return true;
 }
 
 /*
@@ -529,13 +577,14 @@ smgrDoPendingDeletes(bool isCommit)
  * failure prevents commit.
  */
 void
-smgrDoPendingSyncs(void)
+smgrDoPendingSyncs(bool isCommit)
 {
     PendingRelDelete *pending;
-    HTAB    *delhash = NULL;
     int            nrels = 0,
                 maxrels = 0;
     SMgrRelation *srels = NULL;
+    HASH_SEQ_STATUS scan;
+    pendingSync *pendingsync;
 
     if (XLogIsNeeded())
         return;  /* no relation can use this */
@@ -543,58 +592,44 @@ smgrDoPendingSyncs(void)
     Assert(GetCurrentTransactionNestLevel() == 1);
     AssertPendingSyncs_RelationCache();
 
+    if (!pendingSyncHash)
+        return; /* no relation needs sync */
+
+    /* Just throw away all pending syncs if any at rollback */
+    if (!isCommit)
+    {
+        if (pendingSyncHash)
+        {
+            hash_destroy(pendingSyncHash);
+            pendingSyncHash = NULL;
+        }
+        return;
+    }
+
     /*
      * Pending syncs on the relation that are to be deleted in this
-     * transaction-end should be ignored. Collect pending deletes that will
-     * happen in the following call to smgrDoPendingDeletes().
+     * transaction-end should be ignored. Remove sync hash entries entries for
+     * relations that will be deleted in the following call to
+     * smgrDoPendingDeletes().
      */
     for (pending = pendingDeletes; pending != NULL; pending = pending->next)
     {
-        bool found PG_USED_FOR_ASSERTS_ONLY;
-
         if (!pending->atCommit)
             continue;
 
-        /* create the hash if not yet */
-        if (delhash == NULL)
-        {
-            HASHCTL hash_ctl;
-
-            memset(&hash_ctl, 0, sizeof(hash_ctl));
-            hash_ctl.keysize = sizeof(RelFileNode);
-            hash_ctl.entrysize = sizeof(RelFileNode);
-            hash_ctl.hcxt = CurrentMemoryContext;
-            delhash =
-                hash_create("pending del temporary hash", 8, &hash_ctl,
-                            HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-        }
-
-        (void) hash_search(delhash, (void *) &pending->relnode,
-                           HASH_ENTER, &found);
-        Assert(!found);
+        (void) hash_search(pendingSyncHash, (void *) &pending->relnode,
+                           HASH_REMOVE, NULL);
     }
 
-    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    hash_seq_init(&scan, pendingSyncHash);
+    while ((pendingsync = (pendingSync *) hash_seq_search(&scan)))
     {
-        bool to_be_removed = false;
-        ForkNumber fork;
-        BlockNumber nblocks[MAX_FORKNUM + 1];
-        BlockNumber total_blocks = 0;
-        SMgrRelation srel;
-
-        if (!pending->sync)
-            continue;
-        Assert(!pending->atCommit);
-
-        /* don't sync relnodes that is being deleted */
-        if (delhash)
-            hash_search(delhash, (void *) &pending->relnode,
-                        HASH_FIND, &to_be_removed);
-        if (to_be_removed)
-            continue;
+        ForkNumber        fork;
+        BlockNumber        nblocks[MAX_FORKNUM + 1];
+        BlockNumber        total_blocks = 0;
+        SMgrRelation    srel;
 
-        /* Now the time to sync the rnode */
-        srel = smgropen(pending->relnode, pending->backend);
+        srel = smgropen(pendingsync->rnode, InvalidBackendId);
 
         /*
          * We emit newpage WAL records for smaller relations.
@@ -622,9 +657,12 @@ smgrDoPendingSyncs(void)
 
         /*
          * Sync file or emit WAL record for the file according to the total
-         * size.
+         * size. Do file sync if the size is larger than the threshold or
+         * truncates may have left blocks beyond the current size.
          */
-        if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024)
+        if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024 ||
+            (BlockNumberIsValid(pendingsync->max_truncated) &&
+             smgrnblocks(srel, MAIN_FORKNUM) < pendingsync->max_truncated))
         {
             /* relations to sync are passed to smgrdosyncall at once */
 
@@ -644,7 +682,11 @@ smgrDoPendingSyncs(void)
         }
         else
         {
-            /* Emit WAL records for all blocks. The file is small enough. */
+            /*
+             * Emit WAL records for all blocks.  We don't emit
+             * XLOG_SMGR_TRUNCATE record because the past truncations haven't
+             * left unlogged pages here.
+             */
             for (fork = 0 ; fork <= MAX_FORKNUM ; fork++)
             {
                 int    n        = nblocks[fork];
@@ -666,8 +708,9 @@ smgrDoPendingSyncs(void)
         }
     }
 
-    if (delhash)
-        hash_destroy(delhash);
+    Assert (pendingSyncHash);
+    hash_destroy(pendingSyncHash);
+    pendingSyncHash = NULL;
 
     if (nrels > 0)
     {
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 108115a023..bf076657e7 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -35,7 +35,7 @@ extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
-extern void smgrDoPendingSyncs(void);
+extern void smgrDoPendingSyncs(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
-- 
2.23.0

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

28 ноября 2019 г., 15:35:08

I measured the performance with the latest patch set.

> 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
>    minute when done via syncs.
> 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> 3. Wait 10s.
> 4. Start one DDL backend that runs $DDL_COUNT transactions.
> 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.

I did the following benchmarking.

1. Initialize bench database

  $ pgbench -i -s 20

2. Start server with wal_level = replica (all other variables are not
changed) then run the attached ./bench.sh

  $ ./bench.sh <count> <pages> <mode>

where count is the number of repetition, pages is the number of pages
to write in a run, and mode is "s" (sync) or "w"(WAL). The <mode>
doesn't affect if wal_level = replica. The script shows the following
result.

| before: tps 240.2, lat 44.087 ms (29 samples)
| during: tps 109.1, lat 114.887 ms (14 samples)
| after : tps 269.9, lat 39.557 ms (107 samples)
| DDL time = 13965 ms
| # transaction type: <builtin: TPC-B (sort of)>

before: mean numbers before "the DDL" starts.
during: mean numbers while "the DDL" is running.
after : mean numbers after "the DDL" ends.
DDL time: the time took to run "the DDL".

3. Restart server with wal_level = replica then run the bench.sh
twice.

  $ ./bench.sh <count> <pages> s
  $ ./bench.sh <count> <pages> w


Finally I got three graphs. (attached 1, 2, 3. PNGs)

* Graph 1 - The affect of the DDL on pgbench's TPS

 The virtical axis means "during TPS" / "before TPS" in %. Larger is
 better. The horizontal axis means the table pages size.
 
 Replica and Minimal-sync are almost flat.  Minimal-WAL getting worse
 as table size increases. 500 pages seems to be the crosspoint.


* Graph 2 - The affect of the DDL on pgbench's latency.

 The virtical axis means "during-letency" / "before-latency" in
 %. Smaller is better. Like TPS but more quickly WAL-latency gets
 worse as table size increases. The crosspoint seems to be 300 pages
 or so.


* Graph 3 - The affect of pgbench's work load on DDL runtime.

 The virtical axis means "time the DDL takes to run with pgbench" /
 "time the DDL to run solely". Smaller is better. Replica and
 Minimal-SYNC shows similar tendency. On Minimal-WAL the DDL runs
 quite fast with small tables. The crosspoint seems to be about 2500
 pages.

Seeing this, I became to be worry that the optimization might give far
smaller advantage than expected. Putting aside that, it seems to me
that the default value for the threshold would be 500-1000, same as
the previous benchmark showed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Вложения

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

29 ноября 2019 г., 01:23:19

On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote:
> I measured the performance with the latest patch set.
> 
> > 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> >    minute when done via syncs.
> > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > 3. Wait 10s.
> > 4. Start one DDL backend that runs $DDL_COUNT transactions.
> > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.

If you have the raw data requested in (5), please share them here so folks
have the option to reproduce your graphs and calculations.

> I did the following benchmarking.
> 
> 1. Initialize bench database
> 
>   $ pgbench -i -s 20
> 
> 2. Start server with wal_level = replica (all other variables are not
> changed) then run the attached ./bench.sh

The bench.sh attachment was missing; please attach it.  Please give the output
of this command:

  select name, setting from pg_settings where setting <> boot_val;

> 3. Restart server with wal_level = replica then run the bench.sh
> twice.

I assume this is wal_level=minimal, not wal_level=replica.

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

03 декабря 2019 г., 14:51:46

Hello.

At Thu, 28 Nov 2019 17:23:19 -0500, Noah Misch <noah@leadboat.com> wrote in 
> On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote:
> > I measured the performance with the latest patch set.
> > 
> > > 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> > >    minute when done via syncs.
> > > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > > 3. Wait 10s.
> > > 4. Start one DDL backend that runs $DDL_COUNT transactions.
> > > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
> 
> If you have the raw data requested in (5), please share them here so folks
> have the option to reproduce your graphs and calculations.

Sorry, I forgot to attach the scripts. The raw data was vanished into
unstable connection and the steps was quite crude. I prioritized on
showing some numbers at the time. I revised the scripts into more
automated way and will take numbers again.

> > > 2. Start server with wal_level = replica (all other variables are not
> > > changed) then run the attached ./bench.sh
> > 
> > The bench.sh attachment was missing; please attach it.  Please give the output
> > of this command:
> > 
> >   select name, setting from pg_settings where setting <> boot_val;

(I intentionally show all the results..)
=# select name, setting from pg_settings where setting<> boot_val;
            name            |      setting       
----------------------------+--------------------
 application_name           | psql
 archive_command            | (disabled)
 client_encoding            | UTF8
 data_directory_mode        | 0700
 default_text_search_config | pg_catalog.english
 lc_collate                 | en_US.UTF-8
 lc_ctype                   | en_US.UTF-8
 lc_messages                | en_US.UTF-8
 lc_monetary                | en_US.UTF-8
 lc_numeric                 | en_US.UTF-8
 lc_time                    | en_US.UTF-8
 log_checkpoints            | on
 log_file_mode              | 0600
 log_timezone               | Asia/Tokyo
 max_stack_depth            | 2048
 max_wal_senders            | 0
 max_wal_size               | 10240
 server_encoding            | UTF8
 shared_buffers             | 16384
 TimeZone                   | Asia/Tokyo
 unix_socket_permissions    | 0777
 wal_buffers                | 512
 wal_level                  | minimal
(23 rows)

The result for "replica" setting in the benchmark script are used as
base numbers (or the denominator of the percentages).

> > 3. Restart server with wal_level = replica then run the bench.sh
> > twice.
> 
> I assume this is wal_level=minimal, not wal_level=replica.

Oops! It's wrong I ran once with replica, then twice with minimal.


Anyway, I revised the benchmarking scripts and attached them.  The
parameters written in benchmain.sh were decided as ./bench2.pl 5
<count> <pages> s with wal_level=minimal server takes around 60
seconds.

I'll send the complete data tomorrow (in JST). The attached f.txt is
the result of preliminary test only with pages=100 and 250 (with HDD).

The attached files are:
  benchmain.sh    - main script
  bench2.sh       - run a benchmark with a single set of parameters
  bench1.pl       - behchmark client program
  summarize.pl    - script to summarize benchmain.sh's output
  f.txt.gz        - result only for pages=100, DDL count = 2200 (not 2250)

How to run:

$ /..unpatched_path../initdb -D <unpatched_datadir>
 (wal_level=replica, max_wal_senders=0, log_checkpoints=yes, max_wal_size=10GB)
$ /..patched_path../initdb -D <patched_datadir>
 (wal_level=minimal, max_wal_senders=0, log_checkpoints=yes, max_wal_size=10GB)
$./benchmain.sh > <result_file>   # output raw data
$./summarize.pl [-v] < <result_file>   # show summary


With the attached f.txt, summarize.pl gives the following output.
WAL wins with the that pages.

$ cat f.txt | ./summarize.pl
## params: wal_level=replica mode=none pages=100 count=353 scale=20
(% are relative to "before")
before: tps  262.3 (100.0%), lat    39.840 ms (100.0%) (29 samples)
during: tps  120.7 ( 46.0%), lat   112.508 ms (282.4%) (35 samples)
 after: tps  106.3 ( 40.5%), lat   163.492 ms (410.4%) (86 samples)
DDL time:  34883 ms ( 100.0% relative to mode=none)
## params: wal_level=minimal mode=sync pages=100 count=353 scale=20
(% are relative to "before")
before: tps  226.3 (100.0%), lat    48.091 ms (100.0%) (29 samples)
during: tps   83.0 ( 36.7%), lat   184.942 ms (384.6%) (100 samples)
 after: tps   82.6 ( 36.5%), lat   196.863 ms (409.4%) (21 samples)
DDL time:  99239 ms ( 284.5% relative to mode=none)
## params: wal_level=minimal mode=WAL pages=100 count=353 scale=20
(% are relative to "before")
before: tps  240.3 (100.0%), lat    44.686 ms (100.0%) (29 samples)
during: tps  129.6 ( 53.9%), lat   113.585 ms (254.2%) (31 samples)
 after: tps  124.5 ( 51.8%), lat   141.992 ms (317.8%) (90 samples)
DDL time:  30392 ms (  87.1% relative to mode=none)
## params: wal_level=replica mode=none pages=250 count=258 scale=20
(% are relative to "before")
before: tps  266.3 (100.0%), lat    45.884 ms (100.0%) (29 samples)
during: tps   87.9 ( 33.0%), lat   148.433 ms (323.5%) (54 samples)
 after: tps  105.6 ( 39.6%), lat   153.216 ms (333.9%) (67 samples)
DDL time:  53176 ms ( 100.0% relative to mode=none)
## params: wal_level=minimal mode=sync pages=250 count=258 scale=20
(% are relative to "before")
before: tps  225.1 (100.0%), lat    47.705 ms (100.0%) (29 samples)
during: tps   93.7 ( 41.6%), lat   143.231 ms (300.2%) (83 samples)
 after: tps   93.8 ( 41.7%), lat   186.097 ms (390.1%) (38 samples)
DDL time:  82104 ms ( 154.4% relative to mode=none)
## params: wal_level=minimal mode=WAL pages=250 count=258 scale=20
(% are relative to "before")
before: tps  230.2 (100.0%), lat    48.472 ms (100.0%) (29 samples)
during: tps   90.3 ( 39.2%), lat   183.365 ms (378.3%) (48 samples)
 after: tps  123.9 ( 53.8%), lat   131.129 ms (270.5%) (73 samples)
DDL time:  47660 ms (  89.6% relative to mode=none)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#! /usr/bin/bash

PGBENCH_SCALE=20
export PATH=/home/horiguti/bin/pgsql_pendsync/bin:$PATH
function run_one_param()
{
    # bench2.sh <ddl_count> <insert_pages> <method (s:sync, w:WAL, n:not-set)> <pgbench scale>
    ./bench2.sh $1 $2 n ${PGBENCH_SCALE} 2>&1
    ./bench2.sh $1 $2 s ${PGBENCH_SCALE} 2>&1
    ./bench2.sh $1 $2 w ${PGBENCH_SCALE} 2>&1
}

killall -9 postgres
sleep 1

#run_one_param <ddl_count> <insert_pages>

# On slow HDD
export PGDATA=/home/horiguti/storage/hdd/data/data_pendsync
run_one_param  353   100
run_one_param  258   250
run_one_param  185   500
run_one_param  118  1000
run_one_param   58  2500
run_one_param   32  5000
run_one_param   18 10000

# On M.2 SSD
export PGDATA=/home/horiguti/storage/ssd/data/data_pendsync
run_one_param 2250   100
run_one_param 1162   250
run_one_param  564   500
run_one_param  297  1000
run_one_param  123  2500
run_one_param   63  5000
run_one_param   32 10000
#! /usr/bin/bash

if [ "$PGDATA" == "" ]; then
    echo "\$PGDATA should be set"
    exit;
fi;

rm -r $PGDATA/*

initdb
if [ "$3" == "n" ]; then
    wal_level=replica
else
    wal_level=minimal
fi    
cat <<EOF >> $PGDATA/postgresql.conf
wal_level=$wal_level
max_wal_senders=0
log_checkpoints=yes
max_wal_size = 10GB
EOF
    
binary=`which postgres`
scale=$4
echo "## params: count=$1 pages=$2 mode=$3 binary=$binary scale=$scale wal_level=$wal_level"
pg_ctl stop -m f
pg_ctl start 2>&1
pgbench -i -s ${scale}
psql -c 'checkpoint;'
((sleep 30; echo "START"; ./bench1.pl 5 $1 $2 $3; echo "END") & pgbench -rP1 --progress-timestamp -T150 -c10 -j10)
2>&1
pg_ctl stop -m i
#! /usr/bin/perl

use strict;
use IPC::Open2;
use Time::HiRes qw (gettimeofday tv_interval);

my $tupperpage = 226;
my $large_size = 100000000;
my @time = ();

sub bench {
    my ($header, $nprocs, $ntups, $threshold, $ddlcount) = @_;
    my @result = ();
    my @rds = ();
    
    for (my $ip = 0 ; $ip < $nprocs ; $ip++)
    {
        pipe(my $rd, my $wr);
        $rds[$ip] = $rd;
        
        my $pid = fork();

        die "fork failed: $!\n" if ($pid < 0);
        if ($pid == 0)
        {
            close($rd);
            
            my $pid = open2(my $psqlrd, my $psqlwr, "psql postgres > /dev/null");
            if ($threshold >= 0) {
                print $psqlwr "SET wal_skip_threshold to $threshold;\n";
            }
            print $psqlwr "DROP TABLE IF EXISTS t$ip;";
            print $psqlwr "CREATE TABLE t$ip (a int);\n";

            my @st = gettimeofday();
            for (my $i = 0 ; $i < $ddlcount ; $i++)
            {
                print $psqlwr "BEGIN;";
                print $psqlwr "TRUNCATE t$ip;";
                print $psqlwr "INSERT INTO t$ip (SELECT a FROM generate_series(1, $ntups) a);";
                print $psqlwr "COMMIT;";
            }
            close($psqlwr);
            waitpid($pid, 0);

            print $wr $ip, " ", 1000 * tv_interval(\@st, [gettimeofday]), "\n";
            exit;
        }
        close($wr);
    }

    my $rpid;
    while (($rpid = wait()) == 0) {}

    my $sum = 0;
    for (my $ip = 0 ; $ip < $nprocs ; $ip++)
    {
        my $ret = readline($rds[$ip]);
        die "format? $ret\n" if ($ret !~ /^([0-9]+) ([0-9.]+)$/);

        $sum += $2;
    }

    printf "$header: procs $nprocs: time %.0f\n", $sum / $nprocs;
}


sub log10 { return log($_[0]) / log(10); }

# benchmark for wal_level = replica, the third parameter of bench
# doesn't affect
sub bench1
{
    my $ddlcount = 5;
    $ddlcount = $ARGV[1] if ($#ARGV > 0);

    print "benchmark for wal_level = replica\n";
    for (my $s = 0 ; $s <= 4 ; $s += 0.25)
    {
        my $ss = int(10 ** $s);
        bench("size $ss", 1, $ss * $tupperpage, 0, $ddlcount);
    }
}

# benchmark for wal_level = minimal.
sub bench2
{
    my $ddlcount = 5;
    $ddlcount = $ARGV[1] if ($#ARGV > 0);

    print "benchmark for wal_level = minimal\n";
    for (my $s = 0 ; $s <= 4.5 ; $s += 0.25)
    {
        my $ss = int(10 ** $s);
        bench("size $ss: SYNC ", 1, $ss * $tupperpage,           0, $ddlcount);
        bench("size $ss: WAL  ", 1, $ss * $tupperpage, $large_size, $ddlcount);
    }
}

# find crossing point of WAL and SYNC by bisecting
sub bench3
{
    my $ddlcount = 5;
    $ddlcount = $ARGV[1] if ($#ARGV > 0);

    print "find crossing point of WAL and SYNC by bisecting\n";
    bench("SYNC: size 0", 1, 1, 8);
    bench("WAL : size 0", 1, 1, 16);
    my $s = 1;
    my $st = 10000;
    while (1)
    {
        my $ts = bench("SYNC: size $s",
                       $tupperpage * $s,           0, $ddlcount);
        my $tw = bench("WAL : size $s",
                       $tupperpage * $s, $large_size, $ddlcount);

        if ($st < 1.0){
            print "DONE\n";
            exit(0);
        }
        if ($ts > $tw)
        {
            $s += $st; $st /= 2;
        }
        else
        {
            $s -= $st; $st /= 2;
        }
    }
}

# benchmark with multiple processes
sub bench4
{
    my $ddlcount = 5;
    my $nprocs = 10;

    $nprocs = $ARGV[1] if ($#ARGV > 0);
    $ddlcount = $ARGV[2] if ($#ARGV > 1);
    
    print "benchmark for wal_level = minimal, $nprocs processes\n";
    print "bench 4: nprocs = $nprocs, DDL count = $ddlcount\n";
    
    for (my $s = 1.0 ; $s <= 3.5 ; $s += 0.25)
    {
        my $ss = int(10 ** $s);
        bench("pages $ss: SYNC ", $nprocs, $ss * $tupperpage,           0, 5);
        bench("pages $ss: WAL  ", $nprocs, $ss * $tupperpage, $large_size, 5);
    }
}

sub bench5
{
    my $ddlcount = 5;
    my $pages = 100;
    my $mode = "s";
    my $threshold = 0;

    $ddlcount = $ARGV[1] if ($#ARGV > 0);
    $pages = $ARGV[2] if ($#ARGV > 1);
    $mode = $ARGV[3] if ($#ARGV > 2);
    if ($mode eq 's') {
        $threshold = 0;
    } elsif ($mode eq 's') {
        $threshold = $large_size;
    } elsif ($mode eq 'n') {
        $threshold = -1;
    } elsif ($mode eq 'w') {
        $threshold = $large_size;
    } else {
        die "mode must be s, w or n\n";
    }
    

    print "bench 5: mode = $mode, DDL count = $ddlcount, pages = $pages\n";
    bench("size $pages", 1, $pages * $tupperpage, $threshold, $ddlcount);
}

bench1() if ($ARGV[0] == 1);
bench2() if ($ARGV[0] == 2);
bench3() if ($ARGV[0] == 3);
bench4() if ($ARGV[0] == 4);
bench5() if ($ARGV[0] == 5);




#! /usr/bin/perl

use strict;
use POSIX qw(floor);

my $state = 0;
my $wal_level = '';
my $pages = 0;
my $binary = '';
my $scale = 0;
my $paramcount = 0;
my $mode = '';
my $sumtps = 0;
my $sumlat = 0;
my $count = 0;
my $trig = 0;
my $title = "(undef)";
my $ddltime = 0;
my @lines = ();
my $trailstr = '';
my $verbose = 1 if ($ARGV[0] eq '-v');
my %tps = ();
my %lat = ();
my %ddltime = ();
my %modestr=("n", "none", "s", "sync", "w", "WAL");

while (<STDIN>) {
    push(@lines, $_);
    chomp;
    next if (/^(END|START)$/);
    next if (/NOTICE:  /);

#    print "$state: $_\n";
    if ($state == 0) {
        if (/^## params: count=([0-9.]+) pages=([0-9.]+) mode=(.) binary=([^ ]+) scale=([0-9.]+) wal_level=([a-z]+)/)
{
            $paramcount = $1;
            $pages = $2;
            $mode = $3;
            $binary = $4;
            $scale = $5;
            $wal_level = $6;
            my $modestr = $modestr{$mode};

            print "## params: wal_level=$wal_level mode=$modestr pages=$pages count=$paramcount scale=$scale\n";
            print "(% are relative to \"before\")\n";
            $state = 1;
            next;
        } else {
            next;
        }
    } elsif ($state == 1) {
        if (/^starting vacuum/) {
            $state = 2;
        }
        next;
    } elsif ($state == 2) {
        if (/^bench.*/) {
            $trig = 1;
            $title = "before";
            $state = 3;
        }
    } elsif ($state == 3) {
        if (/^size ([0-9]+): procs ([0-9]+): time ([0-9]+)$/) {
            $ddltime{$mode} = $3;
            $trig = 1;
            $title = "during";
            $trailstr = '';
            $state = 4;
        }
    } elsif ($state == 4) {
        if (/^transaction type: /) {
            $trig = 1;
            $title = "after";
            $trailstr = "# $_\n";
            $state = 5;
        }
    } elsif ($state == 5) {
        if (!/^statement latencies /) {
            $trailstr .= "# $_\n";
            next;
        }
        printf "DDL time: %6.0f ms (%6.1f%% relative to mode=%s)\n",
            $ddltime{$mode},
            floor(1000.0 * $ddltime{$mode} / $ddltime{n} + 0.5) / 10,
            $modestr{n};
        $trailstr .= "# $_\n";
        $state = 6;
        next;
    } elsif ($state == 6) {
        if (/^ {8}/) {
            $trailstr.= "# $_\n";
            next;
        }
        if ($verbose) {
            print $trailstr;
        }
        $state = 0;
        next;
    }

    if ($trig) {
        die "count 0?\n" if ($count == 0);
        $tps{$title} = $sumtps / $count;
        $lat{$title} = $sumlat / $count;
        printf "%6s: tps %6.1f (%5.1f%%), lat %9.3f ms (%5.1f%%) (%d samples)\n",
            $title,
            $tps{$title}, floor(1000.0 * $tps{$title} / $tps{before} + 0.5)/10,
            $lat{$title}, floor(1000.0 * $lat{$title} / $lat{before} + 0.5)/10,
            $count;
        $sumtps = $sumlat = $count = 0;
        $trig = 0;
        next;
    }

    if (!/^progress: ([0-9.]+) s, ([0-9.]+) tps, lat ([0-9.]+) ms stddev ([0-9.]+|NaN)$/) {
        last;
    }
    $sumtps += $2;
    $sumlat += $3;
    $count++;
}

if ($state != 0) {
    print "Wrong state after EOF: state = $state\n";
    print "=====================================\n";
    foreach (-10 .. -1) {
        printf "%d: %s", ($. + $_), $lines[$. + $_];
    }
    print "=====================================\n";
    exit(1);
}

die "uncounted lines?\n" if ($count > 0);

Вложения

f.txt.gz

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

04 декабря 2019 г., 07:47:35

At Tue, 03 Dec 2019 20:51:46 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> I'll send the complete data tomorrow (in JST). The attached f.txt is
> the result of preliminary test only with pages=100 and 250 (with HDD).

The attached files are the latest set of the test scripts and the result:
  benchmark_scripts.tar.gz
     benchmain.sh    - main script
     bench2.sh       - run a benchmark with a single set of parameters
     bench1.pl       - behchmark client program
     summarize.pl    - script to summarize benchmain.sh's output
     graph.xlsx         - MS-Excel file for the graph below.
  result.txt.gz      - raw result of benchmain.sh
  summary.txt.gz     - cooked result by summarize.pl -s
  graph.png          - graphs

summarize.pl [-v|-s|-d]
   -s: print summary table for spreadsheets (TSV)
   -v: show pgbench summary
   -d: debug print

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

At Mon, 9 Dec 2019 10:56:40 -0500, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Mon, Dec 9, 2019 at 4:04 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > Yeah, only 0.5GB of shared_buffers makes the default value of
> > wal_buffers reach to the heaven. I think I can take numbers on that
> > condition. (I doubt that it's meaningful if I increase only
> > wal_buffers manually.)
> 
> Heaven seems a bit exalted, but I think we really only have a formula
> because somebody might have really small shared_buffers for some
> reason and be unhappy about us gobbling up a comparatively large
> amount of memory for WAL buffers. The current limit means that normal
> installations get what they need without manual tuning, and small
> installations - where performance presumably sucks anyway for other
> reasons - keep a small memory footprint.

True. I meant the ceiling of defaultly tuned value, and the larger
value may work on the larger system.

Anyway, I ran the benchmark with
shared_buffers=1GB/wal_buffers=16MB(defalut). pgbench -s 20 uses 256MB
of storage so all of them can be loaded on shared memory.

The attached graph shows larger benefit in TPS drop and latency
increase for HDD. The DDL pages at the corsspoint between commit-FPW
and commit-sync moves from roughly 300 to 200 in TPS and latency, and
1000 to 600 in DDL runtime. If we can rely on the two graphs, 500 (or
512) pages seems to be the most promising candidate for the default
value of wal_skip_threshold.
regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

By improving AssertPendingSyncs_RelationCache() and by testing with
-DRELCACHE_FORCE_RELEASE, I now know of three defects in the attached v30nm.
Would you fix these?

=== Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO

A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
when running "make check" under wal_level=minimal.  I test this way:

printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' >$PWD/minimal.conf
make check TEMP_CONFIG=$PWD/minimal.conf

Self-contained demonstration:
  begin;
  create table t (c int);
  savepoint q; drop table t; rollback to q;  -- forgets table is skipping wal
  commit;  -- assertion failure

=== Defect 2: Forgets to skip WAL due to oversimplification in heap_create()

In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, we need
to transfer WAL-skipped state to the new index relation.  Before v24nm, the
new index relation skipped WAL unconditionally.  Since v24nm, the new index
relation never skips WAL.  I've added a test to alter_table.sql that reveals
this problem under wal_level=minimal.

=== Defect 3: storage.c checks size decrease of MAIN_FORKNUM only

storage.c tracks only MAIN_FORKNUM in pendingsync->max_truncated.  Is it
possible for MAIN_FORKNUM to have a net size increase while FSM_FORKNUM has a
net size decrease?  I haven't tested, but this sequence seems possible:

  TRUNCATE
    reduces MAIN_FORKNUM from 100 blocks to 0 blocks
    reduces FSM_FORKNUM from 3 blocks to 0 blocks
  COPY
    raises MAIN_FORKNUM from 0 blocks to 110 blocks
    does not change FSM_FORKNUM
  COMMIT
    should fsync, but wrongly chooses log_newpage_range() approach

If that's indeed a problem, beside the obvious option of tracking every fork's
max_truncated, we could convert max_truncated to a bool and use fsync anytime
the relation experienced an mdtruncate().  (While FSM_FORKNUM is not critical
for database operations, the choice to subject it to checksums entails
protecting it here.)  If that's not a problem, would you explain?

=== Non-defect notes

Once you have a correct patch, would you run check-world with
-DCLOBBER_CACHE_ALWAYS?  That may reveal additional defects.  It may take a
day or more, but that's fine.

The new smgrimmedsync() calls are potentially fragile, because they sometimes
target a file of a dropped relation.  However, the mdexists() test prevents
anything bad from happening.  No change is necessary.  Example:

  SET wal_skip_threshold = 0;
  BEGIN;
  SAVEPOINT q;
  CREATE TABLE t (c) AS SELECT 1;
  ROLLBACK TO q;  -- truncates the relfilenode
  CHECKPOINT;  -- unlinks the relfilenode
  COMMIT;  -- calls mdexists() on the relfilenode

=== Notable changes in v30nm

- Changed "wal_skip_threshold * 1024" to an expression that can't overflow.
  Without this, wal_skip_threshold=1TB behaved like wal_skip_threshold=0.

- Changed AssertPendingSyncs_RelationCache() to open all relations on which
  the transaction holds locks.  This permits detection of cases where
  RelationNeedsWAL() returns true but storage.c will sync the relation.

  Removed the assertions from RelationIdGetRelation().  Using
  "-DRELCACHE_FORCE_RELEASE" made them fail for usage patterns that aren't
  actually problematic, since invalidation updates rd_node while other code
  updates rd_firstRelfilenodeSubid.  This is not a significant loss, now that
  AssertPendingSyncs_RelationCache() opens relations.  (I considered making
  the update of rd_firstRelfilenodeSubid more like rd_node, where we store it
  somewhere until the next CommandCounterIncrement(), which would make it
  actually affect RelationNeedsWAL().  That might have been better in general,
  but it felt complex without clear benefits.)

  Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did.  Making
  that work no matter what does ereport(ERROR) would be tricky and low-value.

- Extracted the RelationTruncate() changes into new function
  RelationPreTruncate(), so table access methods that can't use
  RelationTruncate() have another way to request that work.

- Changed wal_skip_threshold default to 2MB.  My second preference was for
  4MB.  In your data, 2MB and 4MB had similar performance at optimal
  wal_buffers, but 4MB performed worse at low wal_buffers.

- Reverted most post-v24nm changes to swap_relation_files().  Under
  "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
  rel1->rd_node.relNode update.  Clearing rel2->rd_createSubid is not right if
  we're running CLUSTER for the second time in one transaction.  I used
  relation_open(r1, NoLock) instead of AccessShareLock, because we deserve an
  assertion failure if we hold no lock at that point.

- Change toast_get_valid_index() to retain locks until end of transaction.
  When I adopted relation_open(r1, NoLock) in swap_relation_files(), that
  revealed that we retain no lock on the TOAST index.

- Ran pgindent and perltidy.  Updated some comments and names.

On Mon, Dec 09, 2019 at 06:04:06PM +0900, Kyotaro Horiguchi wrote:
> Anyway the default value ought to be defined based on the default
> configuration.

PostgreSQL does not follow that principle.  Settings that change permanent
resource consumption, such as wal_buffers, have small defaults.  Settings that
don't change permanent resource consumption can have defaults that favor a
well-tuned system.

Вложения

skip-wal-v30nm.patch

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

26 декабря 2019 г., 06:46:39

Thank you for the findings.

At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch <noah@leadboat.com> wrote in 
> By improving AssertPendingSyncs_RelationCache() and by testing with
> -DRELCACHE_FORCE_RELEASE, I now know of three defects in the attached v30nm.
> Would you fix these?

I'd like to do that, please give me som time.

> === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO
> 
> A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
> when running "make check" under wal_level=minimal.  I test this way:
> 
> printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' >$PWD/minimal.conf
> make check TEMP_CONFIG=$PWD/minimal.conf
> 
> Self-contained demonstration:
>   begin;
>   create table t (c int);
>   savepoint q; drop table t; rollback to q;  -- forgets table is skipping wal
>   commit;  -- assertion failure
> 
> 
> === Defect 2: Forgets to skip WAL due to oversimplification in heap_create()
> 
> In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, we need
> to transfer WAL-skipped state to the new index relation.  Before v24nm, the
> new index relation skipped WAL unconditionally.  Since v24nm, the new index
> relation never skips WAL.  I've added a test to alter_table.sql that reveals
> this problem under wal_level=minimal.
> 
> 
> === Defect 3: storage.c checks size decrease of MAIN_FORKNUM only
> 
> storage.c tracks only MAIN_FORKNUM in pendingsync->max_truncated.  Is it
> possible for MAIN_FORKNUM to have a net size increase while FSM_FORKNUM has a
> net size decrease?  I haven't tested, but this sequence seems possible:
> 
>   TRUNCATE
>     reduces MAIN_FORKNUM from 100 blocks to 0 blocks
>     reduces FSM_FORKNUM from 3 blocks to 0 blocks
>   COPY
>     raises MAIN_FORKNUM from 0 blocks to 110 blocks
>     does not change FSM_FORKNUM
>   COMMIT
>     should fsync, but wrongly chooses log_newpage_range() approach
> 
> If that's indeed a problem, beside the obvious option of tracking every fork's
> max_truncated, we could convert max_truncated to a bool and use fsync anytime
> the relation experienced an mdtruncate().  (While FSM_FORKNUM is not critical
> for database operations, the choice to subject it to checksums entails
> protecting it here.)  If that's not a problem, would you explain?
> 


> === Non-defect notes
> 
> Once you have a correct patch, would you run check-world with
> -DCLOBBER_CACHE_ALWAYS?  That may reveal additional defects.  It may take a
> day or more, but that's fine.

Sure.

> The new smgrimmedsync() calls are potentially fragile, because they sometimes
> target a file of a dropped relation.  However, the mdexists() test prevents
> anything bad from happening.  No change is necessary.  Example:
> 
>   SET wal_skip_threshold = 0;
>   BEGIN;
>   SAVEPOINT q;
>   CREATE TABLE t (c) AS SELECT 1;
>   ROLLBACK TO q;  -- truncates the relfilenode
>   CHECKPOINT;  -- unlinks the relfilenode
>   COMMIT;  -- calls mdexists() on the relfilenode
> 
> 
> === Notable changes in v30nm
> 
> - Changed "wal_skip_threshold * 1024" to an expression that can't overflow.
>   Without this, wal_skip_threshold=1TB behaved like wal_skip_threshold=0.

Ahh.., I wrongly understood that MAX_KILOBYTES inhibits that
setting. work_mem and maintenance_work_mem are casted to double or
long before calculation. In this case it's enough that calculation
unit becomes kilobytes instad of bytes.

> - Changed AssertPendingSyncs_RelationCache() to open all relations on which
>   the transaction holds locks.  This permits detection of cases where
>   RelationNeedsWAL() returns true but storage.c will sync the relation.
>
>   Removed the assertions from RelationIdGetRelation().  Using
>   "-DRELCACHE_FORCE_RELEASE" made them fail for usage patterns that aren't
>   actually problematic, since invalidation updates rd_node while other code
>   updates rd_firstRelfilenodeSubid.  This is not a significant loss, now that
>   AssertPendingSyncs_RelationCache() opens relations.  (I considered making
>   the update of rd_firstRelfilenodeSubid more like rd_node, where we store it
>   somewhere until the next CommandCounterIncrement(), which would make it
>   actually affect RelationNeedsWAL().  That might have been better in general,
>   but it felt complex without clear benefits.)
> 
>   Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did.  Making
>   that work no matter what does ereport(ERROR) would be tricky and low-value.

Right about ereport, but I'm not sure remove the whole assertion from abort.

> - Extracted the RelationTruncate() changes into new function
>   RelationPreTruncate(), so table access methods that can't use
>   RelationTruncate() have another way to request that work.

Sounds reasonable. Also the new behavior of max_truncated looks fine.

> - Changed wal_skip_threshold default to 2MB.  My second preference was for
>   4MB.  In your data, 2MB and 4MB had similar performance at optimal
>   wal_buffers, but 4MB performed worse at low wal_buffers.

That's fine with me.

> - Reverted most post-v24nm changes to swap_relation_files().  Under
>   "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
>   rel1->rd_node.relNode update.  Clearing rel2->rd_createSubid is not right if
>   we're running CLUSTER for the second time in one transaction.  I used

I don't agree to that. As I think I have mentioned upthread, rel2 is
wrongly marked as "new in this tranction" at that time, which hinders
the opportunity of removal and such entries wrongly persist for the
backend life and causes problems. (That was found by abort-time
AssertPendingSyncs_RelationCache()..)

>   relation_open(r1, NoLock) instead of AccessShareLock, because we deserve an
>   assertion failure if we hold no lock at that point.

I agree to that.

> - Change toast_get_valid_index() to retain locks until end of transaction.
>   When I adopted relation_open(r1, NoLock) in swap_relation_files(), that
>   revealed that we retain no lock on the TOAST index.

Sounds more reasonable than open_relation(AnyLock) in swap_relation_files.

> - Ran pgindent and perltidy.  Updated some comments and names.
> 
> On Mon, Dec 09, 2019 at 06:04:06PM +0900, Kyotaro Horiguchi wrote:
> > Anyway the default value ought to be defined based on the default
> > configuration.
> 
> PostgreSQL does not follow that principle.  Settings that change permanent
> resource consumption, such as wal_buffers, have small defaults.  Settings that
> don't change permanent resource consumption can have defaults that favor a
> well-tuned system.

I think I understand that, actually 4MB was too large, though.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

26 декабря 2019 г., 07:22:04

On Thu, Dec 26, 2019 at 12:46:39PM +0900, Kyotaro Horiguchi wrote:
> At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch <noah@leadboat.com> wrote in 
> >   Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did.  Making
> >   that work no matter what does ereport(ERROR) would be tricky and low-value.
> 
> Right about ereport, but I'm not sure remove the whole assertion from abort.

You may think of a useful assert location that lacks the problems of asserting
at abort.  For example, I considered asserting in PortalRunMulti() and
PortalRun(), just after each command, if still in a transaction.

> > - Reverted most post-v24nm changes to swap_relation_files().  Under
> >   "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
> >   rel1->rd_node.relNode update.  Clearing rel2->rd_createSubid is not right if
> >   we're running CLUSTER for the second time in one transaction.  I used
> 
> I don't agree to that. As I think I have mentioned upthread, rel2 is
> wrongly marked as "new in this tranction" at that time, which hinders
> the opportunity of removal and such entries wrongly persist for the
> backend life and causes problems. (That was found by abort-time
> AssertPendingSyncs_RelationCache()..)

I can't reproduce rel2's relcache entry wrongly persisting for the life of a
backend.  If that were happening, I would expect repeating a CLUSTER command N
times to increase hash_get_num_entries(RelationIdCache) by at least N.  I
tried that, but hash_get_num_entries(RelationIdCache) did not increase.  In a
non-assert build, how can I reproduce problems caused by incorrect
rd_createSubid on rel2?

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

26 декабря 2019 г., 08:45:26

At Thu, 26 Dec 2019 12:46:39 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > - Reverted most post-v24nm changes to swap_relation_files().  Under
> >   "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
> >   rel1->rd_node.relNode update.  Clearing rel2->rd_createSubid is not right if
> >   we're running CLUSTER for the second time in one transaction.  I used
> 
> I don't agree to that. As I think I have mentioned upthread, rel2 is
> wrongly marked as "new in this tranction" at that time, which hinders
> the opportunity of removal and such entries wrongly persist for the
> backend life and causes problems. (That was found by abort-time
> AssertPendingSyncs_RelationCache()..)

I played with the new version for a while and I don't see such a
problem. I don't recall cleary what I saw the time I thought I saw a
problem but I changed my mind to agree to that. It's far reasonable
and clearer as long as it works correctly.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

26 декабря 2019 г., 12:03:21

Hello, Noah.

At Wed, 25 Dec 2019 20:22:04 -0800, Noah Misch <noah@leadboat.com> wrote in 
> On Thu, Dec 26, 2019 at 12:46:39PM +0900, Kyotaro Horiguchi wrote:
> > At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch <noah@leadboat.com> wrote in 
> > >   Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did.  Making
> > >   that work no matter what does ereport(ERROR) would be tricky and low-value.
> > 
> > Right about ereport, but I'm not sure remove the whole assertion from abort.
> 
> You may think of a useful assert location that lacks the problems of asserting
> at abort.  For example, I considered asserting in PortalRunMulti() and
> PortalRun(), just after each command, if still in a transaction.

Thanks for the suggestion. I'll consider that

> > > - Reverted most post-v24nm changes to swap_relation_files().  Under
> > >   "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
> > >   rel1->rd_node.relNode update.  Clearing rel2->rd_createSubid is not right if
> > >   we're running CLUSTER for the second time in one transaction.  I used
> > 
> > I don't agree to that. As I think I have mentioned upthread, rel2 is
> > wrongly marked as "new in this tranction" at that time, which hinders
> > the opportunity of removal and such entries wrongly persist for the
> > backend life and causes problems. (That was found by abort-time
> > AssertPendingSyncs_RelationCache()..)
> 
> I can't reproduce rel2's relcache entry wrongly persisting for the life of a
> backend.  If that were happening, I would expect repeating a CLUSTER command N
> times to increase hash_get_num_entries(RelationIdCache) by at least N.  I
> tried that, but hash_get_num_entries(RelationIdCache) did not increase.  In a
> non-assert build, how can I reproduce problems caused by incorrect
> rd_createSubid on rel2?

As wrote in the another mail. I don't see such a problem and agree to
the removal.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

14 января 2020 г., 13:35:22

Hello, this is a fix for the defect 1 of 3.

At Thu, 26 Dec 2019 12:46:39 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Thank you for the findings.
> 
> At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch <noah@leadboat.com> wrote in 
> > By improving AssertPendingSyncs_RelationCache() and by testing with
> > -DRELCACHE_FORCE_RELEASE, I now know of three defects in the attached v30nm.
> > Would you fix these?
> 
> I'd like to do that, please give me som time.
> 
> > === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO
> > 
> > A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
> > when running "make check" under wal_level=minimal.  I test this way:
> > 
> > printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' >$PWD/minimal.conf
> > make check TEMP_CONFIG=$PWD/minimal.conf
> > 
> > Self-contained demonstration:
> >   begin;
> >   create table t (c int);
> >   savepoint q; drop table t; rollback to q;  -- forgets table is skipping wal
> >   commit;  -- assertion failure

This is complex than expected. The DROP TABLE unconditionally removed
relcache entry. To fix that, I tried to use rd_isinvalid but it failed
because there's a state that a relcache invalid but the corresponding
catalog entry is alive.

In the attached patch 0002, I added a boolean in relcache that
indicates that the relation is already removed in catalog but not
committed. I needed to ignore invalid relcache entries in
AssertPendingSyncs_RelationCache but I think it is the right thing to
do.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 75ce09ae56227f9b87b1e9fcae1cad016857344c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH 1/2] Rework WAL-skipping optimization

While wal_level=minimal we omit WAL-logging for certain some
operations on relfilenodes that are created in the current
transaction. The files are fsynced at commit. The machinery
accelerates bulk-insertion operations but it fails in certain sequence
of operations and a crash just after commit may leave broken table
files.

This patch overhauls the machinery so that WAL-loggings on all
operations are omitted for such relfilenodes. This patch also
introduces a new feature that small files are emitted as a WAL record
instead of syncing. The new GUC variable wal_skip_threshold controls
the threshold.
---
 doc/src/sgml/config.sgml                    |  43 ++--
 doc/src/sgml/perform.sgml                   |  47 +---
 src/backend/access/common/toast_internals.c |   4 +-
 src/backend/access/gist/gistutil.c          |  31 ++-
 src/backend/access/gist/gistxlog.c          |  21 ++
 src/backend/access/heap/heapam.c            |  45 +---
 src/backend/access/heap/heapam_handler.c    |  22 +-
 src/backend/access/heap/rewriteheap.c       |  21 +-
 src/backend/access/nbtree/nbtsort.c         |  41 +---
 src/backend/access/rmgrdesc/gistdesc.c      |   5 +
 src/backend/access/transam/README           |  45 +++-
 src/backend/access/transam/xact.c           |  15 ++
 src/backend/access/transam/xloginsert.c     |  10 +-
 src/backend/access/transam/xlogutils.c      |  18 +-
 src/backend/catalog/heap.c                  |   4 +
 src/backend/catalog/storage.c               | 248 ++++++++++++++++++--
 src/backend/commands/cluster.c              |  12 +-
 src/backend/commands/copy.c                 |  58 +----
 src/backend/commands/createas.c             |  11 +-
 src/backend/commands/matview.c              |  12 +-
 src/backend/commands/tablecmds.c            |  11 +-
 src/backend/storage/buffer/bufmgr.c         | 125 +++++++++-
 src/backend/storage/lmgr/lock.c             |  12 +
 src/backend/storage/smgr/md.c               |  36 ++-
 src/backend/storage/smgr/smgr.c             |  35 +++
 src/backend/utils/cache/relcache.c          | 159 ++++++++++---
 src/backend/utils/misc/guc.c                |  13 +
 src/include/access/gist_private.h           |   2 +
 src/include/access/gistxlog.h               |   1 +
 src/include/access/heapam.h                 |   3 -
 src/include/access/rewriteheap.h            |   2 +-
 src/include/access/tableam.h                |  15 +-
 src/include/catalog/storage.h               |   6 +
 src/include/storage/bufmgr.h                |   4 +
 src/include/storage/lock.h                  |   3 +
 src/include/storage/smgr.h                  |   1 +
 src/include/utils/rel.h                     |  57 +++--
 src/include/utils/relcache.h                |   8 +-
 src/test/regress/expected/alter_table.out   |   6 +
 src/test/regress/sql/alter_table.sql        |   7 +
 40 files changed, 868 insertions(+), 351 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d1c90282f..12d07b11e4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2481,21 +2481,14 @@ include_dir 'conf.d'
         levels.  This parameter can only be set at server start.
        </para>
        <para>
-        In <literal>minimal</literal> level, WAL-logging of some bulk
-        operations can be safely skipped, which can make those
-        operations much faster (see <xref linkend="populate-pitr"/>).
-        Operations in which this optimization can be applied include:
-        <simplelist>
-         <member><command>CREATE TABLE AS</command></member>
-         <member><command>CREATE INDEX</command></member>
-         <member><command>CLUSTER</command></member>
-         <member><command>COPY</command> into tables that were created or truncated in the same
-         transaction</member>
-        </simplelist>
-        But minimal WAL does not contain enough information to reconstruct the
-        data from a base backup and the WAL logs, so <literal>replica</literal> or
-        higher must be used to enable WAL archiving
-        (<xref linkend="guc-archive-mode"/>) and streaming replication.
+        In <literal>minimal</literal> level, no information is logged for
+        tables or indexes for the remainder of a transaction that creates or
+        truncates them.  This can make bulk operations much faster (see
+        <xref linkend="populate-pitr"/>).  But minimal WAL does not contain
+        enough information to reconstruct the data from a base backup and the
+        WAL logs, so <literal>replica</literal> or higher must be used to
+        enable WAL archiving (<xref linkend="guc-archive-mode"/>) and
+        streaming replication.
        </para>
        <para>
         In <literal>logical</literal> level, the same information is logged as
@@ -2887,6 +2880,26 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-skip-threshold" xreflabel="wal_skip_threshold">
+      <term><varname>wal_skip_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_skip_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When <varname>wal_level</varname> is <literal>minimal</literal> and a
+        transaction commits after creating or rewriting a permanent table,
+        materialized view, or index, this setting determines how to persist
+        the new data.  If the data is smaller than this setting, write it to
+        the WAL log; otherwise, use an fsync of the data file.  Depending on
+        the properties of your storage, raising or lowering this value might
+        help if such commits are slowing concurrent transactions.  The default
+        is two megabytes (<literal>2MB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-commit-delay" xreflabel="commit_delay">
       <term><varname>commit_delay</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 0f61b0995d..12fda690fa 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1606,8 +1606,8 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
     needs to be written, because in case of an error, the files
     containing the newly loaded data will be removed anyway.
     However, this consideration only applies when
-    <xref linkend="guc-wal-level"/> is <literal>minimal</literal> for
-    non-partitioned tables as all commands must write WAL otherwise.
+    <xref linkend="guc-wal-level"/> is <literal>minimal</literal>
+    as all commands must write WAL otherwise.
    </para>
 
   </sect2>
@@ -1707,42 +1707,13 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
    </para>
 
    <para>
-    Aside from avoiding the time for the archiver or WAL sender to
-    process the WAL data,
-    doing this will actually make certain commands faster, because they
-    are designed not to write WAL at all if <varname>wal_level</varname>
-    is <literal>minimal</literal>.  (They can guarantee crash safety more cheaply
-    by doing an <function>fsync</function> at the end than by writing WAL.)
-    This applies to the following commands:
-    <itemizedlist>
-     <listitem>
-      <para>
-       <command>CREATE TABLE AS SELECT</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>CREATE INDEX</command> (and variants such as
-       <command>ALTER TABLE ADD PRIMARY KEY</command>)
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>ALTER TABLE SET TABLESPACE</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>CLUSTER</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>COPY FROM</command>, when the target table has been
-       created or truncated earlier in the same transaction
-      </para>
-     </listitem>
-    </itemizedlist>
+    Aside from avoiding the time for the archiver or WAL sender to process the
+    WAL data, doing this will actually make certain commands faster, because
+    they do not to write WAL at all if <varname>wal_level</varname>
+    is <literal>minimal</literal> and the current subtransaction (or top-level
+    transaction) created or truncated the table or index they change.  (They
+    can guarantee crash safety more cheaply by doing
+    an <function>fsync</function> at the end than by writing WAL.)
    </para>
   </sect2>
 
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 65801a2a84..25a81e5ec6 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -528,8 +528,8 @@ toast_get_valid_index(Oid toastoid, LOCKMODE lock)
     validIndexOid = RelationGetRelid(toastidxs[validIndex]);
 
     /* Close the toast relation and all its indexes */
-    toast_close_indexes(toastidxs, num_indexes, lock);
-    table_close(toastrel, lock);
+    toast_close_indexes(toastidxs, num_indexes, NoLock);
+    table_close(toastrel, NoLock);
 
     return validIndexOid;
 }
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index dd975b164c..765329bbcd 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1004,23 +1004,44 @@ gistproperty(Oid index_oid, int attno,
 }
 
 /*
- * Temporary and unlogged GiST indexes are not WAL-logged, but we need LSNs
- * to detect concurrent page splits anyway. This function provides a fake
- * sequence of LSNs for that purpose.
+ * Some indexes are not WAL-logged, but we need LSNs to detect concurrent page
+ * splits anyway. This function provides a fake sequence of LSNs for that
+ * purpose.
  */
 XLogRecPtr
 gistGetFakeLSN(Relation rel)
 {
-    static XLogRecPtr counter = FirstNormalUnloggedLSN;
-
     if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
     {
         /*
          * Temporary relations are only accessible in our session, so a simple
          * backend-local counter will do.
          */
+        static XLogRecPtr counter = FirstNormalUnloggedLSN;
+
         return counter++;
     }
+    else if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+    {
+        /*
+         * WAL-logging on this relation will start after commit, so its LSNs
+         * must be distinct numbers smaller than the LSN at the next commit.
+         * Emit a dummy WAL record if insert-LSN hasn't advanced after the
+         * last call.
+         */
+        static XLogRecPtr lastlsn = InvalidXLogRecPtr;
+        XLogRecPtr    currlsn = GetXLogInsertRecPtr();
+
+        /* Shouldn't be called for WAL-logging relations */
+        Assert(!RelationNeedsWAL(rel));
+
+        /* No need for an actual record if we already have a distinct LSN */
+        if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
+            currlsn = gistXLogAssignLSN();
+
+        lastlsn = currlsn;
+        return currlsn;
+    }
     else
     {
         /*
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index d3f3a7b803..b60dba052f 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -449,6 +449,9 @@ gist_redo(XLogReaderState *record)
         case XLOG_GIST_PAGE_DELETE:
             gistRedoPageDelete(record);
             break;
+        case XLOG_GIST_ASSIGN_LSN:
+            /* nop. See gistGetFakeLSN(). */
+            break;
         default:
             elog(PANIC, "gist_redo: unknown op code %u", info);
     }
@@ -592,6 +595,24 @@ gistXLogPageDelete(Buffer buffer, FullTransactionId xid,
     return recptr;
 }
 
+/*
+ * Write an empty XLOG record to assign a distinct LSN.
+ */
+XLogRecPtr
+gistXLogAssignLSN(void)
+{
+    int            dummy = 0;
+
+    /*
+     * Records other than SWITCH_WAL must have content. We use an integer 0 to
+     * follow the restriction.
+     */
+    XLogBeginInsert();
+    XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
+    XLogRegisterData((char *) &dummy, sizeof(dummy));
+    return XLogInsert(RM_GIST_ID, XLOG_GIST_ASSIGN_LSN);
+}
+
 /*
  * Write XLOG record about reuse of a deleted page.
  */
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5ddb6e85e9..92d9da23f7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -21,7 +21,6 @@
  *        heap_multi_insert - insert multiple tuples into a relation
  *        heap_delete        - delete a tuple from a relation
  *        heap_update        - replace a tuple in a relation with another tuple
- *        heap_sync        - sync heap, for when no WAL has been written
  *
  * NOTES
  *      This file contains the heap_ routines which implement
@@ -1936,7 +1935,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (RelationNeedsWAL(relation))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2119,7 +2118,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+    needwal = RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -8920,46 +8919,6 @@ heap2_redo(XLogReaderState *record)
     }
 }
 
-/*
- *    heap_sync        - sync a heap, for use when no WAL has been written
- *
- * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
- *
- * Indexes are not touched.  (Currently, index operations associated with
- * the commands that use this are WAL-logged and so do not need fsync.
- * That behavior might change someday, but in any case it's likely that
- * any fsync decisions required would be per-index and hence not appropriate
- * to be done here.)
- */
-void
-heap_sync(Relation rel)
-{
-    /* non-WAL-logged tables never need fsync */
-    if (!RelationNeedsWAL(rel))
-        return;
-
-    /* main heap */
-    FlushRelationBuffers(rel);
-    /* FlushRelationBuffers will have opened rd_smgr */
-    smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
-
-    /* FSM is not critical, don't bother syncing it */
-
-    /* toast heap, if any */
-    if (OidIsValid(rel->rd_rel->reltoastrelid))
-    {
-        Relation    toastrel;
-
-        toastrel = table_open(rel->rd_rel->reltoastrelid, AccessShareLock);
-        FlushRelationBuffers(toastrel);
-        smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM);
-        table_close(toastrel, AccessShareLock);
-    }
-}
-
 /*
  * Mask a heap page before performing consistency checks on it.
  */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 1f6f6d0ea9..14f939d6b1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -555,17 +555,6 @@ tuple_lock_retry:
     return result;
 }
 
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
-    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway / don't go through tableam)
-     */
-    if (options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(relation);
-}
-
 
 /* ------------------------------------------------------------------------
  * DDL related callbacks for heap AM.
@@ -698,7 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     IndexScanDesc indexScan;
     TableScanDesc tableScan;
     HeapScanDesc heapScan;
-    bool        use_wal;
     bool        is_system_catalog;
     Tuplesortstate *tuplesort;
     TupleDesc    oldTupDesc = RelationGetDescr(OldHeap);
@@ -713,12 +701,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     is_system_catalog = IsSystemRelation(OldHeap);
 
     /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a WAL-logged rel.
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
      */
-    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
-    /* use_wal off requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
 
     /* Preallocate values/isnull arrays */
@@ -728,7 +713,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
     /* Initialize the rewrite operation */
     rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
-                                 *multi_cutoff, use_wal);
+                                 *multi_cutoff);
 
 
     /* Set up sorting if wanted */
@@ -2524,7 +2509,6 @@ static const TableAmRoutine heapam_methods = {
     .tuple_delete = heapam_tuple_delete,
     .tuple_update = heapam_tuple_update,
     .tuple_lock = heapam_tuple_lock,
-    .finish_bulk_insert = heapam_finish_bulk_insert,
 
     .tuple_fetch_row_version = heapam_fetch_row_version,
     .tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 5869922ff8..ba4dab2ba6 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -136,7 +136,6 @@ typedef struct RewriteStateData
     Page        rs_buffer;        /* page currently being built */
     BlockNumber rs_blockno;        /* block where page will go */
     bool        rs_buffer_valid;    /* T if any tuples in buffer */
-    bool        rs_use_wal;        /* must we WAL-log inserts? */
     bool        rs_logical_rewrite; /* do we need to do logical rewriting */
     TransactionId rs_oldest_xmin;    /* oldest xmin used by caller to determine
                                      * tuple visibility */
@@ -230,15 +229,13 @@ static void logical_end_heap_rewrite(RewriteState state);
  * oldest_xmin    xid used by the caller to determine which tuples are dead
  * freeze_xid    xid before which tuples will be frozen
  * cutoff_multi    multixact before which multis will be removed
- * use_wal        should the inserts to the new heap be WAL-logged?
  *
  * Returns an opaque RewriteState, allocated in current memory context,
  * to be used in subsequent calls to the other functions.
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-                   TransactionId freeze_xid, MultiXactId cutoff_multi,
-                   bool use_wal)
+                   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
     RewriteState state;
     MemoryContext rw_cxt;
@@ -263,7 +260,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     /* new_heap needn't be empty, just locked */
     state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
     state->rs_buffer_valid = false;
-    state->rs_use_wal = use_wal;
     state->rs_oldest_xmin = oldest_xmin;
     state->rs_freeze_xid = freeze_xid;
     state->rs_cutoff_multi = cutoff_multi;
@@ -322,7 +318,7 @@ end_heap_rewrite(RewriteState state)
     /* Write the last page, if any */
     if (state->rs_buffer_valid)
     {
-        if (state->rs_use_wal)
+        if (RelationNeedsWAL(state->rs_new_rel))
             log_newpage(&state->rs_new_rel->rd_node,
                         MAIN_FORKNUM,
                         state->rs_blockno,
@@ -337,18 +333,14 @@ end_heap_rewrite(RewriteState state)
     }
 
     /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.
-     *
-     * It's obvious that we must do this when not WAL-logging. It's less
-     * obvious that we have to do it even if we did WAL-log the pages. The
+     * When we WAL-logged rel pages, we must nonetheless fsync them.  The
      * reason is the same as in storage.c's RelationCopyStorage(): we're
      * writing data that's not in shared buffers, and so a CHECKPOINT
      * occurring during the rewriteheap operation won't have fsync'd data we
      * wrote before the checkpoint.
      */
     if (RelationNeedsWAL(state->rs_new_rel))
-        heap_sync(state->rs_new_rel);
+        smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
 
     logical_end_heap_rewrite(state);
 
@@ -646,9 +638,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int            options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
@@ -687,7 +676,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
             /* Doesn't fit, so write out the existing page */
 
             /* XLOG stuff */
-            if (state->rs_use_wal)
+            if (RelationNeedsWAL(state->rs_new_rel))
                 log_newpage(&state->rs_new_rel->rd_node,
                             MAIN_FORKNUM,
                             state->rs_blockno,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f163491d60..77f03ad4fe 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -31,18 +31,6 @@
  * them.  They will need to be re-read into shared buffers on first use after
  * the build finishes.
  *
- * Since the index will never be used unless it is completely built,
- * from a crash-recovery point of view there is no need to WAL-log the
- * steps of the build.  After completing the index build, we can just sync
- * the whole file to disk using smgrimmedsync() before exiting this module.
- * This can be seen to be sufficient for crash recovery by considering that
- * it's effectively equivalent to what would happen if a CHECKPOINT occurred
- * just after the index build.  However, it is clearly not sufficient if the
- * DBA is using the WAL log for PITR or replication purposes, since another
- * machine would not be able to reconstruct the index from WAL.  Therefore,
- * we log the completed index pages to WAL if and only if WAL archiving is
- * active.
- *
  * This code isn't concerned about the FSM at all. The caller is responsible
  * for initializing that.
  *
@@ -563,12 +551,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
     wstate.heap = btspool->heap;
     wstate.index = btspool->index;
     wstate.inskey = _bt_mkscankey(wstate.index, NULL);
-
-    /*
-     * We need to log index creation in WAL iff WAL archiving/streaming is
-     * enabled UNLESS the index isn't WAL-logged anyway.
-     */
-    wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
+    wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
 
     /* reserve the metapage */
     wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
@@ -1265,21 +1248,15 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
     _bt_uppershutdown(wstate, state);
 
     /*
-     * If the index is WAL-logged, we must fsync it down to disk before it's
-     * safe to commit the transaction.  (For a non-WAL-logged index we don't
-     * care since the index will be uninteresting after a crash anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the build. It's
-     * less obvious that we have to do it even if we did WAL-log the index
-     * pages.  The reason is that since we're building outside shared buffers,
-     * a CHECKPOINT occurring during the build has no way to flush the
-     * previously written data to disk (indeed it won't know the index even
-     * exists).  A crash later on would replay WAL from the checkpoint,
-     * therefore it wouldn't replay our earlier WAL entries. If we do not
-     * fsync those pages here, they might still not be on disk when the crash
-     * occurs.
+     * When we WAL-logged index pages, we must nonetheless fsync index files.
+     * Since we're building outside shared buffers, a CHECKPOINT occurring
+     * during the build has no way to flush the previously written data to
+     * disk (indeed it won't know the index even exists).  A crash later on
+     * would replay WAL from the checkpoint, therefore it wouldn't replay our
+     * earlier WAL entries. If we do not fsync those pages here, they might
+     * still not be on disk when the crash occurs.
      */
-    if (RelationNeedsWAL(wstate->index))
+    if (wstate->btws_use_wal)
     {
         RelationOpenSmgr(wstate->index);
         smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index 3377367e12..cfcc8885ea 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -80,6 +80,9 @@ gist_desc(StringInfo buf, XLogReaderState *record)
         case XLOG_GIST_PAGE_DELETE:
             out_gistxlogPageDelete(buf, (gistxlogPageDelete *) rec);
             break;
+        case XLOG_GIST_ASSIGN_LSN:
+            /* No details to write out */
+            break;
     }
 }
 
@@ -104,6 +107,8 @@ gist_identify(uint8 info)
             break;
         case XLOG_GIST_PAGE_DELETE:
             id = "PAGE_DELETE";
+        case XLOG_GIST_ASSIGN_LSN:
+            id = "ASSIGN_LSN";
             break;
     }
 
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index b5a2cb2de8..eb9aac5fd3 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -717,6 +717,38 @@ then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
 
+Skipping WAL for New RelFileNode
+--------------------------------
+
+Under wal_level=minimal, if a change modifies a relfilenode that ROLLBACK
+would unlink, in-tree access methods write no WAL for that change.  Code that
+writes WAL without calling RelationNeedsWAL() must check for this case.  This
+skipping is mandatory.  If a WAL-writing change preceded a WAL-skipping change
+for the same block, REDO could overwrite the WAL-skipping change.  If a
+WAL-writing change followed a WAL-skipping change for the same block, a
+related problem would arise.  When a WAL record contains no full-page image,
+REDO expects the page to match its contents from just before record insertion.
+A WAL-skipping change may not reach disk at all, violating REDO's expectation
+under full_page_writes=off.  For any access method, CommitTransaction() writes
+and fsyncs affected blocks before recording the commit.
+
+Prefer to do the same in future access methods.  However, two other approaches
+can work.  First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync().  Second, an access method can opt to write WAL
+unconditionally for permanent relations.  Under these approaches, the access
+method callbacks must not call functions that react to RelationNeedsWAL().
+
+This applies only to WAL records whose replay would modify bytes stored in the
+new relfilenode.  It does not apply to other records about the relfilenode,
+such as XLOG_SMGR_CREATE.  Because it operates at the level of individual
+relfilenodes, RelationNeedsWAL() can differ for tightly-coupled relations.
+Consider "CREATE TABLE t (); BEGIN; ALTER TABLE t ADD c text; ..." in which
+ALTER TABLE adds a TOAST relation.  The TOAST relation will skip WAL, while
+the table owning it will not.  ALTER TABLE SET TABLESPACE will cause a table
+to skip WAL, but that won't affect its indexes.
+
+
 Asynchronous Commit
 -------------------
 
@@ -820,13 +852,12 @@ Changes to a temp table are not WAL-logged, hence could reach disk in
 advance of T1's commit, but we don't care since temp table contents don't
 survive crashes anyway.
 
-Database writes made via any of the paths we have introduced to avoid WAL
-overhead for bulk updates are also safe.  In these cases it's entirely
-possible for the data to reach disk before T1's commit, because T1 will
-fsync it down to disk without any sort of interlock, as soon as it finishes
-the bulk update.  However, all these paths are designed to write data that
-no other transaction can see until after T1 commits.  The situation is thus
-not different from ordinary WAL-logged updates.
+Database writes that skip WAL for new relfilenodes are also safe.  In these
+cases it's entirely possible for the data to reach disk before T1's commit,
+because T1 will fsync it down to disk without any sort of interlock.  However,
+all these paths are designed to write data that no other transaction can see
+until after T1 commits.  The situation is thus not different from ordinary
+WAL-logged updates.
 
 Transaction Emulation during Recovery
 -------------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 017f03b6d8..118f9d521c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2109,6 +2109,13 @@ CommitTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Synchronize files that are created and not WAL-logged during this
+     * transaction. This must happen before AtEOXact_RelationMap(), so that we
+     * don't see committed-but-broken files after a crash.
+     */
+    smgrDoPendingSyncs(true);
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
@@ -2342,6 +2349,13 @@ PrepareTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Synchronize files that are created and not WAL-logged during this
+     * transaction. This must happen before EndPrepare(), so that we don't see
+     * committed-but-broken files after a crash and COMMIT PREPARED.
+     */
+    smgrDoPendingSyncs(true);
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
@@ -2660,6 +2674,7 @@ AbortTransaction(void)
      */
     AfterTriggerEndXact(false); /* 'false' means it's abort */
     AtAbort_Portals();
+    smgrDoPendingSyncs(false);
     AtEOXact_LargeObject(false);
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2fa0a7f667..a618dec776 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1043,8 +1043,13 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
                   BlockNumber startblk, BlockNumber endblk,
                   bool page_std)
 {
+    int            flags;
     BlockNumber blkno;
 
+    flags = REGBUF_FORCE_IMAGE;
+    if (page_std)
+        flags |= REGBUF_STANDARD;
+
     /*
      * Iterate over all the pages in the range. They are collected into
      * batches of XLR_MAX_BLOCK_ID pages, and a single WAL-record is written
@@ -1066,7 +1071,8 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
         nbufs = 0;
         while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
         {
-            Buffer        buf = ReadBuffer(rel, blkno);
+            Buffer        buf = ReadBufferExtended(rel, forkNum, blkno,
+                                                 RBM_NORMAL, NULL);
 
             LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -1088,7 +1094,7 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
         START_CRIT_SECTION();
         for (i = 0; i < nbufs; i++)
         {
-            XLogRegisterBuffer(i, bufpack[i], REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+            XLogRegisterBuffer(i, bufpack[i], flags);
             MarkBufferDirty(bufpack[i]);
         }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b55c383370..2bbce46041 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -544,6 +544,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry;
  * fields related to physical storage, like rd_rel, are initialized, so the
  * fake entry is only usable in low-level operations like ReadBuffer().
  *
+ * This is also used for syncing WAL-skipped files.
+ *
  * Caller must free the returned entry with FreeFakeRelcacheEntry().
  */
 Relation
@@ -552,18 +554,20 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
     FakeRelCacheEntry fakeentry;
     Relation    rel;
 
-    Assert(InRecovery);
-
     /* Allocate the Relation struct and all related space in one block. */
     fakeentry = palloc0(sizeof(FakeRelCacheEntryData));
     rel = (Relation) fakeentry;
 
     rel->rd_rel = &fakeentry->pgc;
     rel->rd_node = rnode;
-    /* We will never be working with temp rels during recovery */
+
+    /*
+     * We will never be working with temp rels during recovery or while
+     * syncing WAL-skipped files.
+     */
     rel->rd_backend = InvalidBackendId;
 
-    /* It must be a permanent table if we're in recovery. */
+    /* It must be a permanent table here */
     rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT;
 
     /* We don't know the name of the relation; use relfilenode instead */
@@ -572,9 +576,9 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
     /*
      * We set up the lockRelId in case anything tries to lock the dummy
      * relation.  Note that this is fairly bogus since relNode may be
-     * different from the relation's OID.  It shouldn't really matter though,
-     * since we are presumably running by ourselves and can't have any lock
-     * conflicts ...
+     * different from the relation's OID.  It shouldn't really matter though.
+     * In recovery, we are running by ourselves and can't have any lock
+     * conflicts.  While syncing, we already hold AccessExclusiveLock.
      */
     rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
     rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 0fdff2918f..9f58ef1378 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -439,6 +439,10 @@ heap_create(const char *relname,
                 break;
         }
     }
+    else
+    {
+        rel->rd_createSubid = InvalidSubTransactionId;
+    }
 
     return rel;
 }
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index fddfbf1d8c..8253c420ef 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,9 +29,13 @@
 #include "miscadmin.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+/* GUC variables */
+int            wal_skip_threshold = 2048;    /* in kilobytes */
+
 /*
  * We keep a list of all relations (represented as RelFileNode values)
  * that have been created or deleted in the current transaction.  When
@@ -61,7 +65,14 @@ typedef struct PendingRelDelete
     struct PendingRelDelete *next;    /* linked-list link */
 } PendingRelDelete;
 
+typedef struct pendingSync
+{
+    RelFileNode rnode;
+    BlockNumber max_truncated;
+} pendingSync;
+
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+HTAB       *pendingSyncHash = NULL;
 
 /*
  * RelationCreateStorage
@@ -117,6 +128,35 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
+    /*
+     * If the relation needs at-commit sync, we also need to track the maximum
+     * unsynced truncated block; see smgrDoPendingSyncs().
+     */
+    if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+    {
+        pendingSync *pending;
+        bool        found;
+
+        /* we sync only permanent relations */
+        Assert(backend == InvalidBackendId);
+
+        if (!pendingSyncHash)
+        {
+            HASHCTL        ctl;
+
+            ctl.keysize = sizeof(RelFileNode);
+            ctl.entrysize = sizeof(pendingSync);
+            ctl.hcxt = TopTransactionContext;
+            pendingSyncHash =
+                hash_create("max truncated block hash",
+                            16, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+        }
+
+        pending = hash_search(pendingSyncHash, &rnode, HASH_ENTER, &found);
+        Assert(!found);
+        pending->max_truncated = 0;
+    }
+
     return srel;
 }
 
@@ -216,6 +256,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
             prev = pending;
         }
     }
+
+    /* FIXME what to do about pending syncs? */
 }
 
 /*
@@ -275,6 +317,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
         }
     }
 
+    RelationPreTruncate(rel);
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -325,6 +369,34 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
         FreeSpaceMapVacuumRange(rel, nblocks, InvalidBlockNumber);
 }
 
+/*
+ * RelationPreTruncate
+ *        Perform AM-independent work before a physical truncation.
+ *
+ * If an access method's relation_nontransactional_truncate does not call
+ * RelationTruncate(), it must call this before decreasing the table size.
+ */
+void
+RelationPreTruncate(Relation rel)
+{
+    pendingSync *pending;
+
+    if (!pendingSyncHash)
+        return;
+    RelationOpenSmgr(rel);
+
+    /* Record largest maybe-unsynced block of files under tracking  */
+    pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
+                          HASH_FIND, NULL);
+    if (pending)
+    {
+        BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
+
+        if (pending->max_truncated < nblocks)
+            pending->max_truncated = nblocks;
+    }
+}
+
 /*
  * Copy a fork's data, block by block.
  *
@@ -355,7 +427,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 
     /*
      * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a permanent relation.
+     * enabled AND it's a permanent relation.  This gives the same answer as
+     * "RelationNeedsWAL(rel) || copying_initfork", because we know the
+     * current operation created a new relfilenode.
      */
     use_wal = XLogIsNeeded() &&
         (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
@@ -397,24 +471,39 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
     }
 
     /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.  (For a temp or
-     * unlogged rel we don't care since the data will be gone after a crash
-     * anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the copy. It's
-     * less obvious that we have to do it even if we did WAL-log the copied
-     * pages. The reason is that since we're copying outside shared buffers, a
-     * CHECKPOINT occurring during the copy has no way to flush the previously
-     * written data to disk (indeed it won't know the new rel even exists).  A
-     * crash later on would replay WAL from the checkpoint, therefore it
-     * wouldn't replay our earlier WAL entries. If we do not fsync those pages
-     * here, they might still not be on disk when the crash occurs.
+     * When we WAL-logged rel pages, we must nonetheless fsync them.  The
+     * reason is that since we're copying outside shared buffers, a CHECKPOINT
+     * occurring during the copy has no way to flush the previously written
+     * data to disk (indeed it won't know the new rel even exists).  A crash
+     * later on would replay WAL from the checkpoint, therefore it wouldn't
+     * replay our earlier WAL entries. If we do not fsync those pages here,
+     * they might still not be on disk when the crash occurs.
      */
-    if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
+    if (use_wal || copying_initfork)
         smgrimmedsync(dst, forkNum);
 }
 
+/*
+ * RelFileNodeSkippingWAL - check if a BM_PERMANENT relfilenode is using WAL
+ *
+ *   Changes of certain relfilenodes must not write WAL; see "Skipping WAL for
+ *   New RelFileNode" in src/backend/access/transam/README.  Though it is
+ *   known from Relation efficiently, this function is intended for the code
+ *   paths not having access to Relation.
+ */
+bool
+RelFileNodeSkippingWAL(RelFileNode rnode)
+{
+    if (XLogIsNeeded())
+        return false;            /* no permanent relfilenode skips WAL */
+
+    if (!pendingSyncHash ||
+        hash_search(pendingSyncHash, &rnode, HASH_FIND, NULL) == NULL)
+        return false;
+
+    return true;
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -492,6 +581,135 @@ smgrDoPendingDeletes(bool isCommit)
     }
 }
 
+/*
+ *    smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    PendingRelDelete *pending;
+    int            nrels = 0,
+                maxrels = 0;
+    SMgrRelation *srels = NULL;
+    HASH_SEQ_STATUS scan;
+    pendingSync *pendingsync;
+
+    if (XLogIsNeeded())
+        return;                    /* no relation can use this */
+
+    Assert(GetCurrentTransactionNestLevel() == 1);
+
+    if (!pendingSyncHash)
+        return;                    /* no relation needs sync */
+
+    /* Just throw away all pending syncs if any at rollback */
+    if (!isCommit)
+    {
+        pendingSyncHash = NULL;
+        return;
+    }
+
+    AssertPendingSyncs_RelationCache();
+
+    /* Skip syncing nodes that smgrDoPendingDeletes() will delete. */
+    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    {
+        if (!pending->atCommit)
+            continue;
+
+        (void) hash_search(pendingSyncHash, (void *) &pending->relnode,
+                           HASH_REMOVE, NULL);
+    }
+
+    hash_seq_init(&scan, pendingSyncHash);
+    while ((pendingsync = (pendingSync *) hash_seq_search(&scan)))
+    {
+        ForkNumber    fork;
+        BlockNumber nblocks[MAX_FORKNUM + 1];
+        BlockNumber total_blocks = 0;
+        SMgrRelation srel;
+
+        srel = smgropen(pendingsync->rnode, InvalidBackendId);
+
+        /*
+         * We emit newpage WAL records for smaller relations.
+         *
+         * Small WAL records have a chance to be emitted along with other
+         * backends' WAL records.  We emit WAL records instead of syncing for
+         * files that are smaller than a certain threshold, expecting faster
+         * commit.  The threshold is defined by the GUC wal_skip_threshold.
+         */
+        for (fork = 0; fork <= MAX_FORKNUM; fork++)
+        {
+            if (smgrexists(srel, fork))
+            {
+                BlockNumber n = smgrnblocks(srel, fork);
+
+                /* we shouldn't come here for unlogged relations */
+                Assert(fork != INIT_FORKNUM);
+
+                nblocks[fork] = n;
+                total_blocks += n;
+            }
+            else
+                nblocks[fork] = InvalidBlockNumber;
+        }
+
+        /*
+         * Sync file or emit WAL records for its contents.  Do file sync if
+         * the size is larger than the threshold or truncates may have removed
+         * blocks beyond the current size.
+         */
+        if (total_blocks * BLCKSZ / 1024 >= wal_skip_threshold ||
+            nblocks[MAIN_FORKNUM] < pendingsync->max_truncated)
+        {
+            /* allocate the initial array, or extend it, if needed */
+            if (maxrels == 0)
+            {
+                maxrels = 8;
+                srels = palloc(sizeof(SMgrRelation) * maxrels);
+            }
+            else if (maxrels <= nrels)
+            {
+                maxrels *= 2;
+                srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+            }
+
+            srels[nrels++] = srel;
+        }
+        else
+        {
+            /* Emit WAL records for all blocks.  The file is small enough. */
+            for (fork = 0; fork <= MAX_FORKNUM; fork++)
+            {
+                int            n = nblocks[fork];
+                Relation    rel;
+
+                if (!BlockNumberIsValid(n))
+                    continue;
+
+                /*
+                 * Emit WAL for the whole file.  Unfortunately we don't know
+                 * what kind of a page this is, so we have to log the full
+                 * page including any unused space.  ReadBufferExtended()
+                 * counts some pgstat events; unfortunately, we discard them.
+                 */
+                rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node);
+                log_newpage_range(rel, fork, 0, n, false);
+                FreeFakeRelcacheEntry(rel);
+            }
+        }
+    }
+
+    pendingSyncHash = NULL;
+
+    if (nrels > 0)
+    {
+        smgrdosyncall(srels, nrels);
+        pfree(srels);
+    }
+}
+
 /*
  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
  *
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index e9d7a7ff79..b836ccf2d6 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1014,6 +1014,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
                 relfilenode2;
     Oid            swaptemp;
     char        swptmpchr;
+    Relation    rel1;
 
     /* We need writable copies of both pg_class tuples. */
     relRelation = table_open(RelationRelationId, RowExclusiveLock);
@@ -1173,6 +1174,15 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
         CacheInvalidateRelcacheByTuple(reltup2);
     }
 
+    /*
+     * Recognize that rel1's relfilenode (swapped from rel2) is new in this
+     * subtransaction. Since the next step for rel2 is deletion, don't bother
+     * recording the newness of its relfilenode.
+     */
+    rel1 = relation_open(r1, NoLock);
+    RelationAssumeNewRelfilenode(rel1);
+    relation_close(rel1, NoLock);
+
     /*
      * Post alter hook for modified relations. The change to r2 is always
      * internal, but r1 depends on the invocation context.
@@ -1489,7 +1499,7 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 
             /* Get the associated valid index to be renamed */
             toastidx = toast_get_valid_index(newrel->rd_rel->reltoastrelid,
-                                             AccessShareLock);
+                                             NoLock);
 
             /* rename the toast table ... */
             snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u",
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index c93a788798..02e3761da8 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2713,63 +2713,15 @@ CopyFrom(CopyState cstate)
                             RelationGetRelationName(cstate->rel))));
     }
 
-    /*----------
-     * Check to see if we can avoid writing WAL
-     *
-     * If archive logging/streaming is not enabled *and* either
-     *    - table was created in same transaction as this COPY
-     *    - data is being written to relfilenode created in this transaction
-     * then we can skip writing WAL.  It's safe because if the transaction
-     * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the table_finish_bulk_insert() at
-     * the bottom of this routine first.
-     *
-     * As mentioned in comments in utils/rel.h, the in-same-transaction test
-     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
-     * can be cleared before the end of the transaction. The exact case is
-     * when a relation sets a new relfilenode twice in same transaction, yet
-     * the second one fails in an aborted subtransaction, e.g.
-     *
-     * BEGIN;
-     * TRUNCATE t;
-     * SAVEPOINT save;
-     * TRUNCATE t;
-     * ROLLBACK TO save;
-     * COPY ...
-     *
-     * Also, if the target file is new-in-transaction, we assume that checking
-     * FSM for free space is a waste of time, even if we must use WAL because
-     * of archiving.  This could possibly be wrong, but it's unlikely.
-     *
-     * The comments for table_tuple_insert and RelationGetBufferForTuple
-     * specify that skipping WAL logging is only safe if we ensure that our
-     * tuples do not go into pages containing tuples from any other
-     * transactions --- but this must be the case if we have a new table or
-     * new relfilenode, so we need no additional work to enforce that.
-     *
-     * We currently don't support this optimization if the COPY target is a
-     * partitioned table as we currently only lazily initialize partition
-     * information when routing the first tuple to the partition.  We cannot
-     * know at this stage if we can perform this optimization.  It should be
-     * possible to improve on this, but it does mean maintaining heap insert
-     * option flags per partition and setting them when we first open the
-     * partition.
-     *
-     * This optimization is not supported for relation types which do not
-     * have any physical storage, with foreign tables and views using
-     * INSTEAD OF triggers entering in this category.  Partitioned tables
-     * are not supported as per the description above.
-     *----------
+    /*
+     * If the target file is new-in-transaction, we assume that checking FSM
+     * for free space is a waste of time.  This could possibly be wrong, but
+     * it's unlikely.
      */
-    /* createSubid is creation check, newRelfilenodeSubid is truncation check */
     if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
         (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
-         cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
-    {
+         cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
         ti_options |= TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
-    }
 
     /*
      * Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 9f387b5f5f..fe9a754782 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -552,16 +552,13 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
     myState->rel = intoRelationDesc;
     myState->reladdr = intoRelationAddr;
     myState->output_cid = GetCurrentCommandId(true);
+    myState->ti_options = TABLE_INSERT_SKIP_FSM;
+    myState->bistate = GetBulkInsertState();
 
     /*
-     * We can skip WAL-logging the insertions, unless PITR or streaming
-     * replication is in use. We can skip the FSM in any case.
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
-    myState->bistate = GetBulkInsertState();
-
-    /* Not using WAL requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(intoRelationDesc) == InvalidBlockNumber);
 }
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 1ee37c1aeb..ea1d0fc850 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -457,17 +457,13 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->transientrel = transientrel;
     myState->output_cid = GetCurrentCommandId(true);
-
-    /*
-     * We can skip WAL-logging the insertions, unless PITR or streaming
-     * replication is in use. We can skip the FSM in any case.
-     */
     myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
-    if (!XLogIsNeeded())
-        myState->ti_options |= TABLE_INSERT_SKIP_WAL;
     myState->bistate = GetBulkInsertState();
 
-    /* Not using WAL requires smgr_targblock be initially invalid */
+    /*
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
+     */
     Assert(RelationGetTargetBlock(transientrel) == InvalidBlockNumber);
 }
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 421bc28727..566399dd05 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4769,19 +4769,14 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         newrel = NULL;
 
     /*
-     * Prepare a BulkInsertState and options for table_tuple_insert. Because
-     * we're building a new heap, we can skip WAL-logging and fsync it to disk
-     * at the end instead (unless WAL-logging is required for archiving or
-     * streaming replication). The FSM is empty too, so don't bother using it.
+     * Prepare a BulkInsertState and options for table_tuple_insert.  The FSM
+     * is empty, so don't bother using it.
      */
     if (newrel)
     {
         mycid = GetCurrentCommandId(true);
         bistate = GetBulkInsertState();
-
         ti_options = TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
     }
     else
     {
@@ -12462,6 +12457,8 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     table_close(pg_class, RowExclusiveLock);
 
+    RelationAssumeNewRelfilenode(rel);
+
     relation_close(rel, NoLock);
 
     /* Make sure the reltablespace change is visible */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index aba3960481..73c38757fa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -66,7 +66,7 @@
 #define BUF_WRITTEN                0x01
 #define BUF_REUSABLE            0x02
 
-#define DROP_RELS_BSEARCH_THRESHOLD        20
+#define RELS_BSEARCH_THRESHOLD        20
 
 typedef struct PrivateRefCountEntry
 {
@@ -105,6 +105,19 @@ typedef struct CkptTsStatus
     int            index;
 } CkptTsStatus;
 
+/*
+ * Type for array used to sort SMgrRelations
+ *
+ * FlushRelationsAllBuffers shares the same comparator function with
+ * DropRelFileNodesAllBuffers. Pointer to this struct and RelFileNode must be
+ * compatible.
+ */
+typedef struct SMgrSortArray
+{
+    RelFileNode rnode;            /* This must be the first member */
+    SMgrRelation srel;
+} SMgrSortArray;
+
 /* GUC variables */
 bool        zero_damaged_pages = false;
 int            bgwriter_lru_maxpages = 100;
@@ -3043,7 +3056,7 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
      * an exactly determined value, as it depends on many factors (CPU and RAM
      * speeds, amount of shared buffers etc.).
      */
-    use_bsearch = n > DROP_RELS_BSEARCH_THRESHOLD;
+    use_bsearch = n > RELS_BSEARCH_THRESHOLD;
 
     /* sort the list of rnodes if necessary */
     if (use_bsearch)
@@ -3293,6 +3306,104 @@ FlushRelationBuffers(Relation rel)
     }
 }
 
+/* ---------------------------------------------------------------------
+ *        FlushRelationsAllBuffers
+ *
+ *        This function flushes out of the buffer pool all the pages of all
+ *        forks of the specified smgr relations.  It's equivalent to calling
+ *        FlushRelationBuffers once per fork per relation.  The relations are
+ *        assumed not to use local buffers.
+ * --------------------------------------------------------------------
+ */
+void
+FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
+{
+    int            i;
+    SMgrSortArray *srels;
+    bool        use_bsearch;
+
+    if (nrels == 0)
+        return;
+
+    /* fill-in array for qsort */
+    srels = palloc(sizeof(SMgrSortArray) * nrels);
+
+    for (i = 0; i < nrels; i++)
+    {
+        Assert(!RelFileNodeBackendIsTemp(smgrs[i]->smgr_rnode));
+
+        srels[i].rnode = smgrs[i]->smgr_rnode.node;
+        srels[i].srel = smgrs[i];
+    }
+
+    /*
+     * Save the bsearch overhead for low number of relations to sync. See
+     * DropRelFileNodesAllBuffers for details.
+     */
+    use_bsearch = nrels > RELS_BSEARCH_THRESHOLD;
+
+    /* sort the list of SMgrRelations if necessary */
+    if (use_bsearch)
+        pg_qsort(srels, nrels, sizeof(SMgrSortArray), rnode_comparator);
+
+    /* Make sure we can handle the pin inside the loop */
+    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+    for (i = 0; i < NBuffers; i++)
+    {
+        SMgrSortArray *srelent = NULL;
+        BufferDesc *bufHdr = GetBufferDescriptor(i);
+        uint32        buf_state;
+
+        /*
+         * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
+         * and saves some cycles.
+         */
+
+        if (!use_bsearch)
+        {
+            int            j;
+
+            for (j = 0; j < nrels; j++)
+            {
+                if (RelFileNodeEquals(bufHdr->tag.rnode, srels[j].rnode))
+                {
+                    srelent = &srels[j];
+                    break;
+                }
+            }
+
+        }
+        else
+        {
+            srelent = bsearch((const void *) &(bufHdr->tag.rnode),
+                              srels, nrels, sizeof(SMgrSortArray),
+                              rnode_comparator);
+        }
+
+        /* buffer doesn't belong to any of the given relfilenodes; skip it */
+        if (srelent == NULL)
+            continue;
+
+        ReservePrivateRefCountEntry();
+
+        buf_state = LockBufHdr(bufHdr);
+        if (RelFileNodeEquals(bufHdr->tag.rnode, srelent->rnode) &&
+            (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+        {
+            PinBuffer_Locked(bufHdr);
+            LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+            FlushBuffer(bufHdr, srelent->srel);
+            LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+            UnpinBuffer(bufHdr, true);
+        }
+        else
+            UnlockBufHdr(bufHdr, buf_state);
+    }
+
+    pfree(srels);
+}
+
 /* ---------------------------------------------------------------------
  *        FlushDatabaseBuffers
  *
@@ -3494,13 +3605,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
             (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
         {
             /*
-             * If we're in recovery we cannot dirty a page because of a hint.
-             * We can set the hint, just not dirty the page as a result so the
-             * hint is lost when we evict the page or shutdown.
+             * If we must not write WAL, due to a relfilenode-specific
+             * condition or being in recovery, don't dirty the page.  We can
+             * set the hint, just not dirty the page as a result so the hint
+             * is lost when we evict the page or shutdown.
              *
              * See src/backend/storage/page/README for longer discussion.
              */
-            if (RecoveryInProgress())
+            if (RecoveryInProgress() ||
+                RelFileNodeSkippingWAL(bufHdr->tag.rnode))
                 return;
 
             /*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 56dba09299..8f98f665c5 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -587,6 +587,18 @@ LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode)
     return (locallock && locallock->nLocks > 0);
 }
 
+#ifdef USE_ASSERT_CHECKING
+/*
+ * GetLockMethodLocalHash -- return the hash of local locks, for modules that
+ *        evaluate assertions based on all locks held.
+ */
+HTAB *
+GetLockMethodLocalHash(void)
+{
+    return LockMethodLocalHash;
+}
+#endif
+
 /*
  * LockHasWaiters -- look up 'locktag' and check if releasing this
  *        lock would wake up other processes waiting for it.
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 85b7115400..e28c5a49a8 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -248,11 +248,10 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
  * During replay, we would delete the file and then recreate it, which is fine
  * if the contents of the file were repopulated by subsequent WAL entries.
  * But if we didn't WAL-log insertions, but instead relied on fsyncing the
- * file after populating it (as for instance CLUSTER and CREATE INDEX do),
- * the contents of the file would be lost forever.  By leaving the empty file
- * until after the next checkpoint, we prevent reassignment of the relfilenode
- * number until it's safe, because relfilenode assignment skips over any
- * existing file.
+ * file after populating it (as we do at wal_level=minimal), the contents of
+ * the file would be lost forever.  By leaving the empty file until after the
+ * next checkpoint, we prevent reassignment of the relfilenode number until
+ * it's safe, because relfilenode assignment skips over any existing file.
  *
  * We do not need to go through this dance for temp relations, though, because
  * we never make WAL entries for temp rels, and so a temp rel poses no threat
@@ -877,12 +876,18 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
  *    mdimmedsync() -- Immediately sync a relation to stable storage.
  *
  * Note that only writes already issued are synced; this routine knows
- * nothing of dirty buffers that may exist inside the buffer manager.
+ * nothing of dirty buffers that may exist inside the buffer manager.  We
+ * sync active and inactive segments; smgrDoPendingSyncs() relies on this.
+ * Consider a relation skipping WAL.  Suppose a checkpoint syncs blocks of
+ * some segment, then mdtruncate() renders that segment inactive.  If we
+ * crash before the next checkpoint syncs the newly-inactive segment, that
+ * segment may survive recovery, reintroducing unwanted data into the table.
  */
 void
 mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 {
     int            segno;
+    int            min_inactive_seg;
 
     /*
      * NOTE: mdnblocks makes sure we have opened all active segments, so that
@@ -890,7 +895,16 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
      */
     mdnblocks(reln, forknum);
 
-    segno = reln->md_num_open_segs[forknum];
+    min_inactive_seg = segno = reln->md_num_open_segs[forknum];
+
+    /*
+     * Temporarily open inactive segments, then close them after sync.  There
+     * may be some inactive segments left opened after fsync() error, but that
+     * is harmless.  We don't bother to clean them up and take a risk of
+     * further trouble.  The next mdclose() will soon close them.
+     */
+    while (_mdfd_openseg(reln, forknum, segno, 0) != NULL)
+        segno++;
 
     while (segno > 0)
     {
@@ -901,6 +915,14 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
                     (errcode_for_file_access(),
                      errmsg("could not fsync file \"%s\": %m",
                             FilePathName(v->mdfd_vfd))));
+
+        /* Close inactive segments immediately */
+        if (segno > min_inactive_seg)
+        {
+            FileClose(v->mdfd_vfd);
+            _fdvec_resize(reln, forknum, segno - 1);
+        }
+
         segno--;
     }
 }
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 360b5bf5bf..72c9696ad1 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -388,6 +388,41 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
     smgrsw[which].smgr_unlink(rnode, InvalidForkNumber, isRedo);
 }
 
+/*
+ *    smgrdosyncall() -- Immediately sync all forks of all given relations
+ *
+ *        All forks of all given relations are synced out to the store.
+ *
+ *        This is equivalent to FlushRelationBuffers() for each smgr relation,
+ *        then calling smgrimmedsync() for all forks of each relation, but it's
+ *        significantly quicker so should be preferred when possible.
+ */
+void
+smgrdosyncall(SMgrRelation *rels, int nrels)
+{
+    int            i = 0;
+    ForkNumber    forknum;
+
+    if (nrels == 0)
+        return;
+
+    FlushRelationsAllBuffers(rels, nrels);
+
+    /*
+     * Sync the physical file(s).
+     */
+    for (i = 0; i < nrels; i++)
+    {
+        int            which = rels[i]->smgr_which;
+
+        for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+        {
+            if (smgrsw[which].smgr_exists(rels[i], forknum))
+                smgrsw[which].smgr_immedsync(rels[i], forknum);
+        }
+    }
+}
+
 /*
  *    smgrdounlinkall() -- Immediately unlink all forks of all given relations
  *
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index df025a5a30..0ac72572e3 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -257,6 +257,9 @@ static void RelationReloadIndexInfo(Relation relation);
 static void RelationReloadNailed(Relation relation);
 static void RelationFlushRelation(Relation relation);
 static void RememberToFreeTupleDescAtEOX(TupleDesc td);
+#ifdef USE_ASSERT_CHECKING
+static void AssertPendingSyncConsistency(Relation relation);
+#endif
 static void AtEOXact_cleanup(Relation relation, bool isCommit);
 static void AtEOSubXact_cleanup(Relation relation, bool isCommit,
                                 SubTransactionId mySubid, SubTransactionId parentSubid);
@@ -1090,6 +1093,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
     relation->rd_isnailed = false;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     switch (relation->rd_rel->relpersistence)
     {
         case RELPERSISTENCE_UNLOGGED:
@@ -1814,6 +1818,7 @@ formrdesc(const char *relationName, Oid relationReltype,
     relation->rd_isnailed = true;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     relation->rd_backend = InvalidBackendId;
     relation->rd_islocaltemp = false;
 
@@ -2021,6 +2026,7 @@ RelationIdGetRelation(Oid relationId)
     rd = RelationBuildDesc(relationId, true);
     if (RelationIsValid(rd))
         RelationIncrementReferenceCount(rd);
+
     return rd;
 }
 
@@ -2089,7 +2095,7 @@ RelationClose(Relation relation)
 #ifdef RELCACHE_FORCE_RELEASE
     if (RelationHasReferenceCountZero(relation) &&
         relation->rd_createSubid == InvalidSubTransactionId &&
-        relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
         RelationClearRelation(relation, false);
 #endif
 }
@@ -2505,13 +2511,13 @@ RelationClearRelation(Relation relation, bool rebuild)
          * problem.
          *
          * When rebuilding an open relcache entry, we must preserve ref count,
-         * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state.  Also
-         * attempt to preserve the pg_class entry (rd_rel), tupledesc,
-         * rewrite-rule, partition key, and partition descriptor substructures
-         * in place, because various places assume that these structures won't
-         * move while they are working with an open relcache entry.  (Note:
-         * the refcount mechanism for tupledescs might someday allow us to
-         * remove this hack for the tupledesc.)
+         * rd_*Subid, and rd_toastoid state.  Also attempt to preserve the
+         * pg_class entry (rd_rel), tupledesc, rewrite-rule, partition key,
+         * and partition descriptor substructures in place, because various
+         * places assume that these structures won't move while they are
+         * working with an open relcache entry.  (Note:  the refcount
+         * mechanism for tupledescs might someday allow us to remove this hack
+         * for the tupledesc.)
          *
          * Note that this process does not touch CurrentResourceOwner; which
          * is good because whatever ref counts the entry may have do not
@@ -2591,6 +2597,7 @@ RelationClearRelation(Relation relation, bool rebuild)
         /* creation sub-XIDs must be preserved */
         SWAPFIELD(SubTransactionId, rd_createSubid);
         SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+        SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
         /* un-swap rd_rel pointers, swap contents instead */
         SWAPFIELD(Form_pg_class, rd_rel);
         /* ... but actually, we don't have to update newrel->rd_rel */
@@ -2669,12 +2676,12 @@ static void
 RelationFlushRelation(Relation relation)
 {
     if (relation->rd_createSubid != InvalidSubTransactionId ||
-        relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
     {
         /*
          * New relcache entries are always rebuilt, not flushed; else we'd
-         * forget the "new" status of the relation, which is a useful
-         * optimization to have.  Ditto for the new-relfilenode status.
+         * forget the "new" status of the relation.  Ditto for the
+         * new-relfilenode status.
          *
          * The rel could have zero refcnt here, so temporarily increment the
          * refcnt to ensure it's safe to rebuild it.  We can assume that the
@@ -2754,11 +2761,10 @@ RelationCacheInvalidateEntry(Oid relationId)
  *     relation cache and re-read relation mapping data.
  *
  *     This is currently used only to recover from SI message buffer overflow,
- *     so we do not touch new-in-transaction relations; they cannot be targets
- *     of cross-backend SI updates (and our own updates now go through a
- *     separate linked list that isn't limited by the SI message buffer size).
- *     Likewise, we need not discard new-relfilenode-in-transaction hints,
- *     since any invalidation of those would be a local event.
+ *     so we do not touch relations having new-in-transaction relfilenodes; they
+ *     cannot be targets of cross-backend SI updates (and our own updates now go
+ *     through a separate linked list that isn't limited by the SI message
+ *     buffer size).
  *
  *     We do this in two phases: the first pass deletes deletable items, and
  *     the second one rebuilds the rebuildable items.  This is essential for
@@ -2809,7 +2815,7 @@ RelationCacheInvalidate(void)
          * pending invalidations.
          */
         if (relation->rd_createSubid != InvalidSubTransactionId ||
-            relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+            relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
             continue;
 
         relcacheInvalsReceived++;
@@ -2921,6 +2927,78 @@ RememberToFreeTupleDescAtEOX(TupleDesc td)
     EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
 }
 
+#ifdef USE_ASSERT_CHECKING
+static void
+AssertPendingSyncConsistency(Relation relation)
+{
+    bool        relcache_verdict =
+    relation->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&
+    ((relation->rd_createSubid != InvalidSubTransactionId &&
+      RELKIND_HAS_STORAGE(relation->rd_rel->relkind)) ||
+     relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId);
+
+    Assert(relcache_verdict == RelFileNodeSkippingWAL(relation->rd_node));
+}
+
+/*
+ * AssertPendingSyncs_RelationCache
+ *
+ *    Assert that relcache.c and storage.c agree on whether to skip WAL.
+ *
+ * This consistently detects relcache.c skipping WAL while storage.c is not
+ * skipping WAL.  It often fails to detect the reverse error, because
+ * invalidation will have destroyed the relcache entry.  It will detect the
+ * reverse error if something opens the relation after the DDL.
+ */
+void
+AssertPendingSyncs_RelationCache(void)
+{
+    HASH_SEQ_STATUS status;
+    LOCALLOCK  *locallock;
+    Relation   *rels;
+    int            maxrels;
+    int            nrels;
+    RelIdCacheEnt *idhentry;
+    int            i;
+
+    /* open every relation that this transaction has locked */
+    PushActiveSnapshot(GetTransactionSnapshot());
+    maxrels = 1;
+    rels = palloc(maxrels * sizeof(*rels));
+    nrels = 0;
+    hash_seq_init(&status, GetLockMethodLocalHash());
+    while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+    {
+        Oid            relid;
+        Relation    r;
+
+        if (locallock->nLocks <= 0)
+            continue;
+        if ((LockTagType) locallock->tag.lock.locktag_type !=
+            LOCKTAG_RELATION)
+            continue;
+        relid = ObjectIdGetDatum(locallock->tag.lock.locktag_field2);
+        r = RelationIdGetRelation(relid);
+        if (r == NULL)
+            continue;
+        if (nrels >= maxrels)
+        {
+            maxrels *= 2;
+            rels = repalloc(rels, maxrels * sizeof(*rels));
+        }
+        rels[nrels++] = r;
+    }
+
+    hash_seq_init(&status, RelationIdCache);
+    while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+        AssertPendingSyncConsistency(idhentry->reldesc);
+
+    for (i = 0; i < nrels; i++)
+        RelationClose(rels[i]);
+    PopActiveSnapshot();
+}
+#endif
+
 /*
  * AtEOXact_RelationCache
  *
@@ -3032,10 +3110,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
      *
      * During commit, reset the flag to zero, since we are now out of the
      * creating transaction.  During abort, simply delete the relcache entry
-     * --- it isn't interesting any longer.  (NOTE: if we have forgotten the
-     * new-ness of a new relation due to a forced cache flush, the entry will
-     * get deleted anyway by shared-cache-inval processing of the aborted
-     * pg_class insertion.)
+     * --- it isn't interesting any longer.
      */
     if (relation->rd_createSubid != InvalidSubTransactionId)
     {
@@ -3063,9 +3138,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
     }
 
     /*
-     * Likewise, reset the hint about the relfilenode being new.
+     * Likewise, reset any record of the relfilenode being new.
      */
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 }
 
 /*
@@ -3157,7 +3233,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
     }
 
     /*
-     * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+     * Likewise, update or drop any new-relfilenode-in-subtransaction record.
      */
     if (relation->rd_newRelfilenodeSubid == mySubid)
     {
@@ -3166,6 +3242,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         else
             relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     }
+
+    if (relation->rd_firstRelfilenodeSubid == mySubid)
+    {
+        if (isCommit)
+            relation->rd_firstRelfilenodeSubid = parentSubid;
+        else
+            relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    }
 }
 
 
@@ -3255,6 +3339,7 @@ RelationBuildLocalRelation(const char *relname,
     /* it's being created in this transaction */
     rel->rd_createSubid = GetCurrentSubTransactionId();
     rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 
     /*
      * create a new tuple descriptor from the one passed in.  We do this
@@ -3552,14 +3637,29 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
      */
     CommandCounterIncrement();
 
-    /*
-     * Mark the rel as having been given a new relfilenode in the current
-     * (sub) transaction.  This is a hint that can be used to optimize later
-     * operations on the rel in the same transaction.
-     */
+    RelationAssumeNewRelfilenode(relation);
+}
+
+/*
+ * RelationAssumeNewRelfilenode
+ *
+ * Code that modifies pg_class.reltablespace or pg_class.relfilenode must call
+ * this.  The call shall precede any code that might insert WAL records whose
+ * replay would modify bytes in the new RelFileNode, and the call shall follow
+ * any WAL modifying bytes in the prior RelFileNode.  See struct RelationData.
+ * Ideally, call this as near as possible to the CommandCounterIncrement()
+ * that makes the pg_class change visible (before it or after it); that
+ * minimizes the chance of future development adding a forbidden WAL insertion
+ * between RelationAssumeNewRelfilenode() and CommandCounterIncrement().
+ */
+void
+RelationAssumeNewRelfilenode(Relation relation)
+{
     relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+    if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
 
-    /* Flag relation as needing eoxact cleanup (to remove the hint) */
+    /* Flag relation as needing eoxact cleanup (to clear these fields) */
     EOXactListAdd(relation);
 }
 
@@ -5625,6 +5725,7 @@ load_relcache_init_file(bool shared)
         rel->rd_fkeylist = NIL;
         rel->rd_createSubid = InvalidSubTransactionId;
         rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+        rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
         rel->rd_amcache = NULL;
         MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 62285792ec..e8a167d038 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -36,6 +36,7 @@
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
+#include "catalog/storage.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
 #include "commands/trigger.h"
@@ -2661,6 +2662,18 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"wal_skip_threshold", PGC_USERSET, RESOURCES_DISK,
+            gettext_noop("Size of new file to fsync instead of writing WAL."),
+            NULL,
+            GUC_UNIT_KB
+        },
+        &wal_skip_threshold,
+        2048,
+        0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
     {
         {"max_wal_senders", PGC_POSTMASTER, REPLICATION_SENDING,
             gettext_noop("Sets the maximum number of simultaneously running WAL sender processes."),
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 18f2b0d98e..4bfc628000 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -455,6 +455,8 @@ extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
                                 BlockNumber origrlink, GistNSN oldnsn,
                                 Buffer leftchild, bool markfollowright);
 
+extern XLogRecPtr gistXLogAssignLSN(void);
+
 /* gistget.c */
 extern bool gistgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 55fc843d3a..673afee1e1 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -26,6 +26,7 @@
  /* #define XLOG_GIST_INSERT_COMPLETE     0x40 */    /* not used anymore */
  /* #define XLOG_GIST_CREATE_INDEX         0x50 */    /* not used anymore */
 #define XLOG_GIST_PAGE_DELETE        0x60
+#define XLOG_GIST_ASSIGN_LSN        0x70    /* nop, assign new LSN */
 
 /*
  * Backup Blk 0: updated page.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 580b4caef7..d9be69c124 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
@@ -166,8 +165,6 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                                HeapTuple tup);
 
-extern void heap_sync(Relation relation);
-
 extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
                                                          ItemPointerData *items,
                                                          int nitems);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index fb2902bd69..e6d7fa1e65 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
                                        TransactionId OldestXmin, TransactionId FreezeXid,
-                                       MultiXactId MultiXactCutoff, bool use_wal);
+                                       MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
                                HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 696451f728..6547099e84 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -127,7 +127,7 @@ typedef struct TM_FailureData
 } TM_FailureData;
 
 /* "options" flag bits for table_tuple_insert */
-#define TABLE_INSERT_SKIP_WAL        0x0001
+/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
 #define TABLE_INSERT_SKIP_FSM        0x0002
 #define TABLE_INSERT_FROZEN            0x0004
 #define TABLE_INSERT_NO_LOGICAL        0x0008
@@ -409,9 +409,8 @@ typedef struct TableAmRoutine
 
     /*
      * Perform operations necessary to complete insertions made via
-     * tuple_insert and multi_insert with a BulkInsertState specified. This
-     * may for example be used to flush the relation, when the
-     * TABLE_INSERT_SKIP_WAL option was used.
+     * tuple_insert and multi_insert with a BulkInsertState specified. In-tree
+     * access methods ceased to use this.
      *
      * Typically callers of tuple_insert and multi_insert will just pass all
      * the flags that apply to them, and each AM has to decide which of them
@@ -1105,10 +1104,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * The options bitmask allows the caller to specify options that may change the
  * behaviour of the AM. The AM will ignore options that it does not support.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.
@@ -1328,9 +1323,7 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
 
 /*
  * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
+ * tuple_insert and multi_insert with a BulkInsertState specified.
  */
 static inline void
 table_finish_bulk_insert(Relation rel, int options)
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 048003c25e..bd37bf311c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,18 +19,24 @@
 #include "storage/smgr.h"
 #include "utils/relcache.h"
 
+/* GUC variables */
+extern int    wal_skip_threshold;
+
 extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
+extern void RelationPreTruncate(Relation rel);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
 extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
                                 ForkNumber forkNum, char relpersistence);
+extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
 
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingSyncs(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 73c7e9ba38..292d440eaf 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -49,6 +49,9 @@ typedef enum
 /* forward declared, to avoid having to expose buf_internals.h here */
 struct WritebackContext;
 
+/* forward declared, to avoid including smgr.h here */
+struct SMgrRelationData;
+
 /* in globals.c ... this duplicates miscadmin.h */
 extern PGDLLIMPORT int NBuffers;
 
@@ -189,6 +192,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                                    ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                                    int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index bb8e4e6e5b..8c180094f0 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -544,6 +544,9 @@ extern void LockReleaseSession(LOCKMETHODID lockmethodid);
 extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
 extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
 extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
+#ifdef USE_ASSERT_CHECKING
+extern HTAB *GetLockMethodLocalHash(void);
+#endif
 extern bool LockHasWaiters(const LOCKTAG *locktag,
                            LOCKMODE lockmode, bool sessionLock);
 extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 243822137c..79dfe0e373 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -89,6 +89,7 @@ extern void smgrcloseall(void);
 extern void smgrclosenode(RelFileNodeBackend rnode);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdounlink(SMgrRelation reln, bool isRedo);
+extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
                        BlockNumber blocknum, char *buffer, bool skipFsync);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 44ed04dd3f..ad72a8b910 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -64,22 +64,40 @@ typedef struct RelationData
                                  * rd_replidindex) */
     bool        rd_statvalid;    /* is rd_statlist valid? */
 
-    /*
+    /*----------
      * rd_createSubid is the ID of the highest subtransaction the rel has
-     * survived into; or zero if the rel was not created in the current top
-     * transaction.  This can be now be relied on, whereas previously it could
-     * be "forgotten" in earlier releases. Likewise, rd_newRelfilenodeSubid is
-     * the ID of the highest subtransaction the relfilenode change has
-     * survived into, or zero if not changed in the current transaction (or we
-     * have forgotten changing it). rd_newRelfilenodeSubid can be forgotten
-     * when a relation has multiple new relfilenodes within a single
-     * transaction, with one of them occurring in a subsequently aborted
-     * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
-     * ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+     * survived into or zero if the rel was not created in the current top
+     * transaction.  rd_firstRelfilenodeSubid is the ID of the highest
+     * subtransaction an rd_node change has survived into or zero if rd_node
+     * matches the value it had at the start of the current top transaction.
+     * (Rolling back the subtransaction that rd_firstRelfilenodeSubid denotes
+     * would restore rd_node to the value it had at the start of the current
+     * top transaction.  Rolling back any lower subtransaction would not.)
+     * Their accuracy is critical to RelationNeedsWAL().
+     *
+     * rd_newRelfilenodeSubid is the ID of the highest subtransaction the
+     * most-recent relfilenode change has survived into or zero if not changed
+     * in the current transaction (or we have forgotten changing it).  This
+     * field is accurate when non-zero, but it can be zero when a relation has
+     * multiple new relfilenodes within a single transaction, with one of them
+     * occurring in a subsequently aborted subtransaction, e.g.
+     *        BEGIN;
+     *        TRUNCATE t;
+     *        SAVEPOINT save;
+     *        TRUNCATE t;
+     *        ROLLBACK TO save;
+     *        -- rd_newRelfilenodeSubid is now forgotten
+     *
+     * These fields are read-only outside relcache.c.  Other files trigger
+     * rd_node changes by updating pg_class.reltablespace and/or
+     * pg_class.relfilenode.  They must call RelationAssumeNewRelfilenode() to
+     * update these fields.
      */
     SubTransactionId rd_createSubid;    /* rel was created in current xact */
-    SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
-                                                 * current xact */
+    SubTransactionId rd_newRelfilenodeSubid;    /* highest subxact changing
+                                                 * rd_node to current value */
+    SubTransactionId rd_firstRelfilenodeSubid;    /* highest subxact changing
+                                                 * rd_node to any value */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
@@ -526,9 +544,16 @@ typedef struct ViewOptions
 /*
  * RelationNeedsWAL
  *        True if relation needs WAL.
- */
-#define RelationNeedsWAL(relation) \
-    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.  See "Skipping WAL for New
+ * RelFileNode" in src/backend/access/transam/README.
+ */
+#define RelationNeedsWAL(relation)                                        \
+    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&    \
+     (XLogIsNeeded() ||                                                    \
+      (relation->rd_createSubid == InvalidSubTransactionId &&            \
+       relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
 
 /*
  * RelationUsesLocalBuffers
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index d77f5beec6..62239a09e8 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -106,9 +106,10 @@ extern Relation RelationBuildLocalRelation(const char *relname,
                                            char relkind);
 
 /*
- * Routine to manage assignment of new relfilenode to a relation
+ * Routines to manage assignment of new relfilenode to a relation
  */
 extern void RelationSetNewRelfilenode(Relation relation, char persistence);
+extern void RelationAssumeNewRelfilenode(Relation relation);
 
 /*
  * Routines for flushing/rebuilding relcache entries in various scenarios
@@ -121,6 +122,11 @@ extern void RelationCacheInvalidate(void);
 
 extern void RelationCloseSmgrByOid(Oid relationId);
 
+#ifdef USE_ASSERT_CHECKING
+extern void AssertPendingSyncs_RelationCache(void);
+#else
+#define AssertPendingSyncs_RelationCache() do {} while (0)
+#endif
 extern void AtEOXact_RelationCache(bool isCommit);
 extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid,
                                       SubTransactionId parentSubid);
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index b492c606ab..3ac009f127 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -1982,6 +1982,12 @@ select * from another;
 (3 rows)
 
 drop table another;
+-- Create an index that skips WAL, then perform a SET DATA TYPE that skips
+-- rewriting the index.
+begin;
+create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
+alter table skip_wal_skip_rewrite_index alter c type varchar(20);
+commit;
 -- table's row type
 create table tab1 (a int, b text);
 create table tab2 (x int, y tab1);
diff --git a/src/test/regress/sql/alter_table.sql b/src/test/regress/sql/alter_table.sql
index abe7be3223..0420fa495c 100644
--- a/src/test/regress/sql/alter_table.sql
+++ b/src/test/regress/sql/alter_table.sql
@@ -1358,6 +1358,13 @@ select * from another;
 
 drop table another;
 
+-- Create an index that skips WAL, then perform a SET DATA TYPE that skips
+-- rewriting the index.
+begin;
+create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
+alter table skip_wal_skip_rewrite_index alter c type varchar(20);
+commit;
+
 -- table's row type
 create table tab1 (a int, b text);
 create table tab2 (x int, y tab1);
-- 
2.23.0

From 27b7c8fdb222508cdf83cbd01b2c7defed3d48d0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 14 Jan 2020 19:24:04 +0900
Subject: [PATCH 2/2] Fix the defect 1

Pending sync is lost by the followig sequence. Fix it.

  begin;
  create table t (c int);
  savepoint q; drop table t; rollback to q;  -- forgets table is skipping wal
  commit;  -- assertion failure
---
 src/backend/utils/cache/relcache.c | 67 ++++++++++++++++++++++++++----
 src/include/utils/rel.h            |  1 +
 2 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 0ac72572e3..551a7d40bd 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1094,6 +1094,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_isremoved = false;
     switch (relation->rd_rel->relpersistence)
     {
         case RELPERSISTENCE_UNLOGGED:
@@ -1993,7 +1994,7 @@ RelationIdGetRelation(Oid relationId)
     {
         RelationIncrementReferenceCount(rd);
         /* revalidate cache entry if necessary */
-        if (!rd->rd_isvalid)
+        if (!rd->rd_isvalid && !rd->rd_isremoved)
         {
             /*
              * Indexes only have a limited number of possible schema changes,
@@ -2137,7 +2138,7 @@ RelationReloadIndexInfo(Relation relation)
     /* Should be called only for invalidated indexes */
     Assert((relation->rd_rel->relkind == RELKIND_INDEX ||
             relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX) &&
-           !relation->rd_isvalid);
+           !relation->rd_isvalid && !relation->rd_isremoved);
 
     /* Ensure it's closed at smgr level */
     RelationCloseSmgr(relation);
@@ -2443,7 +2444,7 @@ RelationClearRelation(Relation relation, bool rebuild)
     if ((relation->rd_rel->relkind == RELKIND_INDEX ||
          relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX) &&
         relation->rd_refcnt > 0 &&
-        relation->rd_indexcxt != NULL)
+        relation->rd_indexcxt != NULL && !relation->rd_isremoved)
     {
         relation->rd_isvalid = false;    /* needs to be revalidated */
         if (IsTransactionState())
@@ -2462,6 +2463,18 @@ RelationClearRelation(Relation relation, bool rebuild)
      */
     if (!rebuild)
     {
+        /*
+         * The relcache entry is still needed to perform at-commit sync if the
+         * subtransaction aborts later.  Mark the relcache as "removed" and
+         * leave it live invalid.
+         */
+        if (relation->rd_createSubid != InvalidSubTransactionId ||
+            relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
+        {
+            relation->rd_isremoved = true;
+            return;
+        }
+
         /* Remove it from the hash table */
         RelationCacheDelete(relation);
 
@@ -2546,6 +2559,19 @@ RelationClearRelation(Relation relation, bool rebuild)
             if (HistoricSnapshotActive())
                 return;
 
+            /*
+             * Although this relation is already dropped from catalog, the
+             * relcache entry is still needed to perform at-commit sync if the
+             * subtransaction aborts later.  Mark the relcache as "removed" and
+             * leave it live invalid.
+             */
+            if (relation->rd_createSubid != InvalidSubTransactionId ||
+                relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
+            {
+                relation->rd_isremoved = true;
+                return;
+            }
+
             /*
              * This shouldn't happen as dropping a relation is intended to be
              * impossible if still referenced (cf. CheckTableNotInUse()). But
@@ -2991,7 +3017,20 @@ AssertPendingSyncs_RelationCache(void)
 
     hash_seq_init(&status, RelationIdCache);
     while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
-        AssertPendingSyncConsistency(idhentry->reldesc);
+    {
+        Relation r = idhentry->reldesc;
+
+        /* Ignore relcache entries of deleted relations */
+        if (r->rd_isremoved)
+        {
+            Assert(!r->rd_isvalid &&
+                   (r->rd_createSubid != InvalidSubTransactionId ||
+                    r->rd_firstRelfilenodeSubid != InvalidSubTransactionId));
+            continue;
+        }
+
+        AssertPendingSyncConsistency(r);
+    }
 
     for (i = 0; i < nrels; i++)
         RelationClose(rels[i]);
@@ -3114,8 +3153,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
      */
     if (relation->rd_createSubid != InvalidSubTransactionId)
     {
-        if (isCommit)
-            relation->rd_createSubid = InvalidSubTransactionId;
+        relation->rd_createSubid = InvalidSubTransactionId;
+
+        if (isCommit && !relation->rd_isremoved)
+        {} /* Nothing to do */
         else if (RelationHasReferenceCountZero(relation))
         {
             RelationClearRelation(relation, false);
@@ -3131,7 +3172,6 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
              * eventually.  This must be just a WARNING to avoid
              * error-during-error-recovery loops.
              */
-            relation->rd_createSubid = InvalidSubTransactionId;
             elog(WARNING, "cannot remove relcache entry for \"%s\" because it has nonzero refcount",
                  RelationGetRelationName(relation));
         }
@@ -3232,6 +3272,19 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         }
     }
 
+    /*
+     * If this relation registered pending sync then removed, subxact rollback
+     * cancels pending remove. Subxact commit propagates it to the parent.
+     */
+    if (relation->rd_isremoved)
+    {
+        Assert (!relation->rd_isvalid &&
+                (relation->rd_createSubid != InvalidSubTransactionId ||
+                 relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId));
+        if (!isCommit)
+            relation->rd_isremoved = false;
+    }
+
     /*
      * Likewise, update or drop any new-relfilenode-in-subtransaction record.
      */
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index ad72a8b910..970f20b82a 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -98,6 +98,7 @@ typedef struct RelationData
                                                  * rd_node to current value */
     SubTransactionId rd_firstRelfilenodeSubid;    /* highest subxact changing
                                                  * rd_node to any value */
+    bool             rd_isremoved;                /* is to be removed?  */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
-- 
2.23.0

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

15 января 2020 г., 11:18:57

Hello. I added a fix for the defect 2.

At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch <noah@leadboat.com> wrote in 
> === Defect 2: Forgets to skip WAL due to oversimplification in heap_create()
> 
> In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, we need
> to transfer WAL-skipped state to the new index relation.  Before v24nm, the
> new index relation skipped WAL unconditionally.  Since v24nm, the new index
> relation never skips WAL.  I've added a test to alter_table.sql that reveals
> this problem under wal_level=minimal.

The fix for this defect utilizes the mechanism that preserves relcache
entry for dropped relation.  If ATExecAddIndex can obtain such a
relcache entry for the old relation, it should hold the newness flags
and we can copy them to the new relcache entry.  I added one member
named oldRelId to the struct IndexStmt to let the function access the
relcache entry for the old index relation.

I forgot to assing version 31 to the last patch, I reused the number
for this version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 591872bdd7b18566fe2529d20e4073900dec32fd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v31 1/3] Rework WAL-skipping optimization

While wal_level=minimal we omit WAL-logging for certain some
operations on relfilenodes that are created in the current
transaction. The files are fsynced at commit. The machinery
accelerates bulk-insertion operations but it fails in certain sequence
of operations and a crash just after commit may leave broken table
files.

This patch overhauls the machinery so that WAL-loggings on all
operations are omitted for such relfilenodes. This patch also
introduces a new feature that small files are emitted as a WAL record
instead of syncing. The new GUC variable wal_skip_threshold controls
the threshold.
---
 doc/src/sgml/config.sgml                    |  43 ++--
 doc/src/sgml/perform.sgml                   |  47 +---
 src/backend/access/common/toast_internals.c |   4 +-
 src/backend/access/gist/gistutil.c          |  31 ++-
 src/backend/access/gist/gistxlog.c          |  21 ++
 src/backend/access/heap/heapam.c            |  45 +---
 src/backend/access/heap/heapam_handler.c    |  22 +-
 src/backend/access/heap/rewriteheap.c       |  21 +-
 src/backend/access/nbtree/nbtsort.c         |  41 +---
 src/backend/access/rmgrdesc/gistdesc.c      |   5 +
 src/backend/access/transam/README           |  45 +++-
 src/backend/access/transam/xact.c           |  15 ++
 src/backend/access/transam/xloginsert.c     |  10 +-
 src/backend/access/transam/xlogutils.c      |  18 +-
 src/backend/catalog/heap.c                  |   4 +
 src/backend/catalog/storage.c               | 248 ++++++++++++++++++--
 src/backend/commands/cluster.c              |  12 +-
 src/backend/commands/copy.c                 |  58 +----
 src/backend/commands/createas.c             |  11 +-
 src/backend/commands/matview.c              |  12 +-
 src/backend/commands/tablecmds.c            |  11 +-
 src/backend/storage/buffer/bufmgr.c         | 125 +++++++++-
 src/backend/storage/lmgr/lock.c             |  12 +
 src/backend/storage/smgr/md.c               |  36 ++-
 src/backend/storage/smgr/smgr.c             |  35 +++
 src/backend/utils/cache/relcache.c          | 159 ++++++++++---
 src/backend/utils/misc/guc.c                |  13 +
 src/include/access/gist_private.h           |   2 +
 src/include/access/gistxlog.h               |   1 +
 src/include/access/heapam.h                 |   3 -
 src/include/access/rewriteheap.h            |   2 +-
 src/include/access/tableam.h                |  15 +-
 src/include/catalog/storage.h               |   6 +
 src/include/storage/bufmgr.h                |   4 +
 src/include/storage/lock.h                  |   3 +
 src/include/storage/smgr.h                  |   1 +
 src/include/utils/rel.h                     |  57 +++--
 src/include/utils/relcache.h                |   8 +-
 src/test/regress/expected/alter_table.out   |   6 +
 src/test/regress/sql/alter_table.sql        |   7 +
 40 files changed, 868 insertions(+), 351 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d45b6f7cb..0e7a0bc0ee 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2481,21 +2481,14 @@ include_dir 'conf.d'
         levels.  This parameter can only be set at server start.
        </para>
        <para>
-        In <literal>minimal</literal> level, WAL-logging of some bulk
-        operations can be safely skipped, which can make those
-        operations much faster (see <xref linkend="populate-pitr"/>).
-        Operations in which this optimization can be applied include:
-        <simplelist>
-         <member><command>CREATE TABLE AS</command></member>
-         <member><command>CREATE INDEX</command></member>
-         <member><command>CLUSTER</command></member>
-         <member><command>COPY</command> into tables that were created or truncated in the same
-         transaction</member>
-        </simplelist>
-        But minimal WAL does not contain enough information to reconstruct the
-        data from a base backup and the WAL logs, so <literal>replica</literal> or
-        higher must be used to enable WAL archiving
-        (<xref linkend="guc-archive-mode"/>) and streaming replication.
+        In <literal>minimal</literal> level, no information is logged for
+        tables or indexes for the remainder of a transaction that creates or
+        truncates them.  This can make bulk operations much faster (see
+        <xref linkend="populate-pitr"/>).  But minimal WAL does not contain
+        enough information to reconstruct the data from a base backup and the
+        WAL logs, so <literal>replica</literal> or higher must be used to
+        enable WAL archiving (<xref linkend="guc-archive-mode"/>) and
+        streaming replication.
        </para>
        <para>
         In <literal>logical</literal> level, the same information is logged as
@@ -2887,6 +2880,26 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-skip-threshold" xreflabel="wal_skip_threshold">
+      <term><varname>wal_skip_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_skip_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When <varname>wal_level</varname> is <literal>minimal</literal> and a
+        transaction commits after creating or rewriting a permanent table,
+        materialized view, or index, this setting determines how to persist
+        the new data.  If the data is smaller than this setting, write it to
+        the WAL log; otherwise, use an fsync of the data file.  Depending on
+        the properties of your storage, raising or lowering this value might
+        help if such commits are slowing concurrent transactions.  The default
+        is two megabytes (<literal>2MB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-commit-delay" xreflabel="commit_delay">
       <term><varname>commit_delay</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 0f61b0995d..12fda690fa 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1606,8 +1606,8 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
     needs to be written, because in case of an error, the files
     containing the newly loaded data will be removed anyway.
     However, this consideration only applies when
-    <xref linkend="guc-wal-level"/> is <literal>minimal</literal> for
-    non-partitioned tables as all commands must write WAL otherwise.
+    <xref linkend="guc-wal-level"/> is <literal>minimal</literal>
+    as all commands must write WAL otherwise.
    </para>
 
   </sect2>
@@ -1707,42 +1707,13 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
    </para>
 
    <para>
-    Aside from avoiding the time for the archiver or WAL sender to
-    process the WAL data,
-    doing this will actually make certain commands faster, because they
-    are designed not to write WAL at all if <varname>wal_level</varname>
-    is <literal>minimal</literal>.  (They can guarantee crash safety more cheaply
-    by doing an <function>fsync</function> at the end than by writing WAL.)
-    This applies to the following commands:
-    <itemizedlist>
-     <listitem>
-      <para>
-       <command>CREATE TABLE AS SELECT</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>CREATE INDEX</command> (and variants such as
-       <command>ALTER TABLE ADD PRIMARY KEY</command>)
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>ALTER TABLE SET TABLESPACE</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>CLUSTER</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>COPY FROM</command>, when the target table has been
-       created or truncated earlier in the same transaction
-      </para>
-     </listitem>
-    </itemizedlist>
+    Aside from avoiding the time for the archiver or WAL sender to process the
+    WAL data, doing this will actually make certain commands faster, because
+    they do not to write WAL at all if <varname>wal_level</varname>
+    is <literal>minimal</literal> and the current subtransaction (or top-level
+    transaction) created or truncated the table or index they change.  (They
+    can guarantee crash safety more cheaply by doing
+    an <function>fsync</function> at the end than by writing WAL.)
    </para>
   </sect2>
 
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 65801a2a84..25a81e5ec6 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -528,8 +528,8 @@ toast_get_valid_index(Oid toastoid, LOCKMODE lock)
     validIndexOid = RelationGetRelid(toastidxs[validIndex]);
 
     /* Close the toast relation and all its indexes */
-    toast_close_indexes(toastidxs, num_indexes, lock);
-    table_close(toastrel, lock);
+    toast_close_indexes(toastidxs, num_indexes, NoLock);
+    table_close(toastrel, NoLock);
 
     return validIndexOid;
 }
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index dd975b164c..765329bbcd 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1004,23 +1004,44 @@ gistproperty(Oid index_oid, int attno,
 }
 
 /*
- * Temporary and unlogged GiST indexes are not WAL-logged, but we need LSNs
- * to detect concurrent page splits anyway. This function provides a fake
- * sequence of LSNs for that purpose.
+ * Some indexes are not WAL-logged, but we need LSNs to detect concurrent page
+ * splits anyway. This function provides a fake sequence of LSNs for that
+ * purpose.
  */
 XLogRecPtr
 gistGetFakeLSN(Relation rel)
 {
-    static XLogRecPtr counter = FirstNormalUnloggedLSN;
-
     if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
     {
         /*
          * Temporary relations are only accessible in our session, so a simple
          * backend-local counter will do.
          */
+        static XLogRecPtr counter = FirstNormalUnloggedLSN;
+
         return counter++;
     }
+    else if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+    {
+        /*
+         * WAL-logging on this relation will start after commit, so its LSNs
+         * must be distinct numbers smaller than the LSN at the next commit.
+         * Emit a dummy WAL record if insert-LSN hasn't advanced after the
+         * last call.
+         */
+        static XLogRecPtr lastlsn = InvalidXLogRecPtr;
+        XLogRecPtr    currlsn = GetXLogInsertRecPtr();
+
+        /* Shouldn't be called for WAL-logging relations */
+        Assert(!RelationNeedsWAL(rel));
+
+        /* No need for an actual record if we already have a distinct LSN */
+        if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
+            currlsn = gistXLogAssignLSN();
+
+        lastlsn = currlsn;
+        return currlsn;
+    }
     else
     {
         /*
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index d3f3a7b803..b60dba052f 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -449,6 +449,9 @@ gist_redo(XLogReaderState *record)
         case XLOG_GIST_PAGE_DELETE:
             gistRedoPageDelete(record);
             break;
+        case XLOG_GIST_ASSIGN_LSN:
+            /* nop. See gistGetFakeLSN(). */
+            break;
         default:
             elog(PANIC, "gist_redo: unknown op code %u", info);
     }
@@ -592,6 +595,24 @@ gistXLogPageDelete(Buffer buffer, FullTransactionId xid,
     return recptr;
 }
 
+/*
+ * Write an empty XLOG record to assign a distinct LSN.
+ */
+XLogRecPtr
+gistXLogAssignLSN(void)
+{
+    int            dummy = 0;
+
+    /*
+     * Records other than SWITCH_WAL must have content. We use an integer 0 to
+     * follow the restriction.
+     */
+    XLogBeginInsert();
+    XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
+    XLogRegisterData((char *) &dummy, sizeof(dummy));
+    return XLogInsert(RM_GIST_ID, XLOG_GIST_ASSIGN_LSN);
+}
+
 /*
  * Write XLOG record about reuse of a deleted page.
  */
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5ddb6e85e9..92d9da23f7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -21,7 +21,6 @@
  *        heap_multi_insert - insert multiple tuples into a relation
  *        heap_delete        - delete a tuple from a relation
  *        heap_update        - replace a tuple in a relation with another tuple
- *        heap_sync        - sync heap, for when no WAL has been written
  *
  * NOTES
  *      This file contains the heap_ routines which implement
@@ -1936,7 +1935,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (RelationNeedsWAL(relation))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2119,7 +2118,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+    needwal = RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -8920,46 +8919,6 @@ heap2_redo(XLogReaderState *record)
     }
 }
 
-/*
- *    heap_sync        - sync a heap, for use when no WAL has been written
- *
- * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
- *
- * Indexes are not touched.  (Currently, index operations associated with
- * the commands that use this are WAL-logged and so do not need fsync.
- * That behavior might change someday, but in any case it's likely that
- * any fsync decisions required would be per-index and hence not appropriate
- * to be done here.)
- */
-void
-heap_sync(Relation rel)
-{
-    /* non-WAL-logged tables never need fsync */
-    if (!RelationNeedsWAL(rel))
-        return;
-
-    /* main heap */
-    FlushRelationBuffers(rel);
-    /* FlushRelationBuffers will have opened rd_smgr */
-    smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
-
-    /* FSM is not critical, don't bother syncing it */
-
-    /* toast heap, if any */
-    if (OidIsValid(rel->rd_rel->reltoastrelid))
-    {
-        Relation    toastrel;
-
-        toastrel = table_open(rel->rd_rel->reltoastrelid, AccessShareLock);
-        FlushRelationBuffers(toastrel);
-        smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM);
-        table_close(toastrel, AccessShareLock);
-    }
-}
-
 /*
  * Mask a heap page before performing consistency checks on it.
  */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 1f6f6d0ea9..14f939d6b1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -555,17 +555,6 @@ tuple_lock_retry:
     return result;
 }
 
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
-    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway / don't go through tableam)
-     */
-    if (options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(relation);
-}
-
 
 /* ------------------------------------------------------------------------
  * DDL related callbacks for heap AM.
@@ -698,7 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     IndexScanDesc indexScan;
     TableScanDesc tableScan;
     HeapScanDesc heapScan;
-    bool        use_wal;
     bool        is_system_catalog;
     Tuplesortstate *tuplesort;
     TupleDesc    oldTupDesc = RelationGetDescr(OldHeap);
@@ -713,12 +701,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     is_system_catalog = IsSystemRelation(OldHeap);
 
     /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a WAL-logged rel.
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
      */
-    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
-    /* use_wal off requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
 
     /* Preallocate values/isnull arrays */
@@ -728,7 +713,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
     /* Initialize the rewrite operation */
     rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
-                                 *multi_cutoff, use_wal);
+                                 *multi_cutoff);
 
 
     /* Set up sorting if wanted */
@@ -2524,7 +2509,6 @@ static const TableAmRoutine heapam_methods = {
     .tuple_delete = heapam_tuple_delete,
     .tuple_update = heapam_tuple_update,
     .tuple_lock = heapam_tuple_lock,
-    .finish_bulk_insert = heapam_finish_bulk_insert,
 
     .tuple_fetch_row_version = heapam_fetch_row_version,
     .tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 5869922ff8..ba4dab2ba6 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -136,7 +136,6 @@ typedef struct RewriteStateData
     Page        rs_buffer;        /* page currently being built */
     BlockNumber rs_blockno;        /* block where page will go */
     bool        rs_buffer_valid;    /* T if any tuples in buffer */
-    bool        rs_use_wal;        /* must we WAL-log inserts? */
     bool        rs_logical_rewrite; /* do we need to do logical rewriting */
     TransactionId rs_oldest_xmin;    /* oldest xmin used by caller to determine
                                      * tuple visibility */
@@ -230,15 +229,13 @@ static void logical_end_heap_rewrite(RewriteState state);
  * oldest_xmin    xid used by the caller to determine which tuples are dead
  * freeze_xid    xid before which tuples will be frozen
  * cutoff_multi    multixact before which multis will be removed
- * use_wal        should the inserts to the new heap be WAL-logged?
  *
  * Returns an opaque RewriteState, allocated in current memory context,
  * to be used in subsequent calls to the other functions.
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-                   TransactionId freeze_xid, MultiXactId cutoff_multi,
-                   bool use_wal)
+                   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
     RewriteState state;
     MemoryContext rw_cxt;
@@ -263,7 +260,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     /* new_heap needn't be empty, just locked */
     state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
     state->rs_buffer_valid = false;
-    state->rs_use_wal = use_wal;
     state->rs_oldest_xmin = oldest_xmin;
     state->rs_freeze_xid = freeze_xid;
     state->rs_cutoff_multi = cutoff_multi;
@@ -322,7 +318,7 @@ end_heap_rewrite(RewriteState state)
     /* Write the last page, if any */
     if (state->rs_buffer_valid)
     {
-        if (state->rs_use_wal)
+        if (RelationNeedsWAL(state->rs_new_rel))
             log_newpage(&state->rs_new_rel->rd_node,
                         MAIN_FORKNUM,
                         state->rs_blockno,
@@ -337,18 +333,14 @@ end_heap_rewrite(RewriteState state)
     }
 
     /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.
-     *
-     * It's obvious that we must do this when not WAL-logging. It's less
-     * obvious that we have to do it even if we did WAL-log the pages. The
+     * When we WAL-logged rel pages, we must nonetheless fsync them.  The
      * reason is the same as in storage.c's RelationCopyStorage(): we're
      * writing data that's not in shared buffers, and so a CHECKPOINT
      * occurring during the rewriteheap operation won't have fsync'd data we
      * wrote before the checkpoint.
      */
     if (RelationNeedsWAL(state->rs_new_rel))
-        heap_sync(state->rs_new_rel);
+        smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
 
     logical_end_heap_rewrite(state);
 
@@ -646,9 +638,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int            options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
@@ -687,7 +676,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
             /* Doesn't fit, so write out the existing page */
 
             /* XLOG stuff */
-            if (state->rs_use_wal)
+            if (RelationNeedsWAL(state->rs_new_rel))
                 log_newpage(&state->rs_new_rel->rd_node,
                             MAIN_FORKNUM,
                             state->rs_blockno,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f163491d60..77f03ad4fe 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -31,18 +31,6 @@
  * them.  They will need to be re-read into shared buffers on first use after
  * the build finishes.
  *
- * Since the index will never be used unless it is completely built,
- * from a crash-recovery point of view there is no need to WAL-log the
- * steps of the build.  After completing the index build, we can just sync
- * the whole file to disk using smgrimmedsync() before exiting this module.
- * This can be seen to be sufficient for crash recovery by considering that
- * it's effectively equivalent to what would happen if a CHECKPOINT occurred
- * just after the index build.  However, it is clearly not sufficient if the
- * DBA is using the WAL log for PITR or replication purposes, since another
- * machine would not be able to reconstruct the index from WAL.  Therefore,
- * we log the completed index pages to WAL if and only if WAL archiving is
- * active.
- *
  * This code isn't concerned about the FSM at all. The caller is responsible
  * for initializing that.
  *
@@ -563,12 +551,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
     wstate.heap = btspool->heap;
     wstate.index = btspool->index;
     wstate.inskey = _bt_mkscankey(wstate.index, NULL);
-
-    /*
-     * We need to log index creation in WAL iff WAL archiving/streaming is
-     * enabled UNLESS the index isn't WAL-logged anyway.
-     */
-    wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
+    wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
 
     /* reserve the metapage */
     wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
@@ -1265,21 +1248,15 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
     _bt_uppershutdown(wstate, state);
 
     /*
-     * If the index is WAL-logged, we must fsync it down to disk before it's
-     * safe to commit the transaction.  (For a non-WAL-logged index we don't
-     * care since the index will be uninteresting after a crash anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the build. It's
-     * less obvious that we have to do it even if we did WAL-log the index
-     * pages.  The reason is that since we're building outside shared buffers,
-     * a CHECKPOINT occurring during the build has no way to flush the
-     * previously written data to disk (indeed it won't know the index even
-     * exists).  A crash later on would replay WAL from the checkpoint,
-     * therefore it wouldn't replay our earlier WAL entries. If we do not
-     * fsync those pages here, they might still not be on disk when the crash
-     * occurs.
+     * When we WAL-logged index pages, we must nonetheless fsync index files.
+     * Since we're building outside shared buffers, a CHECKPOINT occurring
+     * during the build has no way to flush the previously written data to
+     * disk (indeed it won't know the index even exists).  A crash later on
+     * would replay WAL from the checkpoint, therefore it wouldn't replay our
+     * earlier WAL entries. If we do not fsync those pages here, they might
+     * still not be on disk when the crash occurs.
      */
-    if (RelationNeedsWAL(wstate->index))
+    if (wstate->btws_use_wal)
     {
         RelationOpenSmgr(wstate->index);
         smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index 3377367e12..cfcc8885ea 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -80,6 +80,9 @@ gist_desc(StringInfo buf, XLogReaderState *record)
         case XLOG_GIST_PAGE_DELETE:
             out_gistxlogPageDelete(buf, (gistxlogPageDelete *) rec);
             break;
+        case XLOG_GIST_ASSIGN_LSN:
+            /* No details to write out */
+            break;
     }
 }
 
@@ -104,6 +107,8 @@ gist_identify(uint8 info)
             break;
         case XLOG_GIST_PAGE_DELETE:
             id = "PAGE_DELETE";
+        case XLOG_GIST_ASSIGN_LSN:
+            id = "ASSIGN_LSN";
             break;
     }
 
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index b5a2cb2de8..eb9aac5fd3 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -717,6 +717,38 @@ then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
 
+Skipping WAL for New RelFileNode
+--------------------------------
+
+Under wal_level=minimal, if a change modifies a relfilenode that ROLLBACK
+would unlink, in-tree access methods write no WAL for that change.  Code that
+writes WAL without calling RelationNeedsWAL() must check for this case.  This
+skipping is mandatory.  If a WAL-writing change preceded a WAL-skipping change
+for the same block, REDO could overwrite the WAL-skipping change.  If a
+WAL-writing change followed a WAL-skipping change for the same block, a
+related problem would arise.  When a WAL record contains no full-page image,
+REDO expects the page to match its contents from just before record insertion.
+A WAL-skipping change may not reach disk at all, violating REDO's expectation
+under full_page_writes=off.  For any access method, CommitTransaction() writes
+and fsyncs affected blocks before recording the commit.
+
+Prefer to do the same in future access methods.  However, two other approaches
+can work.  First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync().  Second, an access method can opt to write WAL
+unconditionally for permanent relations.  Under these approaches, the access
+method callbacks must not call functions that react to RelationNeedsWAL().
+
+This applies only to WAL records whose replay would modify bytes stored in the
+new relfilenode.  It does not apply to other records about the relfilenode,
+such as XLOG_SMGR_CREATE.  Because it operates at the level of individual
+relfilenodes, RelationNeedsWAL() can differ for tightly-coupled relations.
+Consider "CREATE TABLE t (); BEGIN; ALTER TABLE t ADD c text; ..." in which
+ALTER TABLE adds a TOAST relation.  The TOAST relation will skip WAL, while
+the table owning it will not.  ALTER TABLE SET TABLESPACE will cause a table
+to skip WAL, but that won't affect its indexes.
+
+
 Asynchronous Commit
 -------------------
 
@@ -820,13 +852,12 @@ Changes to a temp table are not WAL-logged, hence could reach disk in
 advance of T1's commit, but we don't care since temp table contents don't
 survive crashes anyway.
 
-Database writes made via any of the paths we have introduced to avoid WAL
-overhead for bulk updates are also safe.  In these cases it's entirely
-possible for the data to reach disk before T1's commit, because T1 will
-fsync it down to disk without any sort of interlock, as soon as it finishes
-the bulk update.  However, all these paths are designed to write data that
-no other transaction can see until after T1 commits.  The situation is thus
-not different from ordinary WAL-logged updates.
+Database writes that skip WAL for new relfilenodes are also safe.  In these
+cases it's entirely possible for the data to reach disk before T1's commit,
+because T1 will fsync it down to disk without any sort of interlock.  However,
+all these paths are designed to write data that no other transaction can see
+until after T1 commits.  The situation is thus not different from ordinary
+WAL-logged updates.
 
 Transaction Emulation during Recovery
 -------------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 017f03b6d8..118f9d521c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2109,6 +2109,13 @@ CommitTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Synchronize files that are created and not WAL-logged during this
+     * transaction. This must happen before AtEOXact_RelationMap(), so that we
+     * don't see committed-but-broken files after a crash.
+     */
+    smgrDoPendingSyncs(true);
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
@@ -2342,6 +2349,13 @@ PrepareTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Synchronize files that are created and not WAL-logged during this
+     * transaction. This must happen before EndPrepare(), so that we don't see
+     * committed-but-broken files after a crash and COMMIT PREPARED.
+     */
+    smgrDoPendingSyncs(true);
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
@@ -2660,6 +2674,7 @@ AbortTransaction(void)
      */
     AfterTriggerEndXact(false); /* 'false' means it's abort */
     AtAbort_Portals();
+    smgrDoPendingSyncs(false);
     AtEOXact_LargeObject(false);
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2fa0a7f667..a618dec776 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1043,8 +1043,13 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
                   BlockNumber startblk, BlockNumber endblk,
                   bool page_std)
 {
+    int            flags;
     BlockNumber blkno;
 
+    flags = REGBUF_FORCE_IMAGE;
+    if (page_std)
+        flags |= REGBUF_STANDARD;
+
     /*
      * Iterate over all the pages in the range. They are collected into
      * batches of XLR_MAX_BLOCK_ID pages, and a single WAL-record is written
@@ -1066,7 +1071,8 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
         nbufs = 0;
         while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
         {
-            Buffer        buf = ReadBuffer(rel, blkno);
+            Buffer        buf = ReadBufferExtended(rel, forkNum, blkno,
+                                                 RBM_NORMAL, NULL);
 
             LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -1088,7 +1094,7 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
         START_CRIT_SECTION();
         for (i = 0; i < nbufs; i++)
         {
-            XLogRegisterBuffer(i, bufpack[i], REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+            XLogRegisterBuffer(i, bufpack[i], flags);
             MarkBufferDirty(bufpack[i]);
         }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b55c383370..2bbce46041 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -544,6 +544,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry;
  * fields related to physical storage, like rd_rel, are initialized, so the
  * fake entry is only usable in low-level operations like ReadBuffer().
  *
+ * This is also used for syncing WAL-skipped files.
+ *
  * Caller must free the returned entry with FreeFakeRelcacheEntry().
  */
 Relation
@@ -552,18 +554,20 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
     FakeRelCacheEntry fakeentry;
     Relation    rel;
 
-    Assert(InRecovery);
-
     /* Allocate the Relation struct and all related space in one block. */
     fakeentry = palloc0(sizeof(FakeRelCacheEntryData));
     rel = (Relation) fakeentry;
 
     rel->rd_rel = &fakeentry->pgc;
     rel->rd_node = rnode;
-    /* We will never be working with temp rels during recovery */
+
+    /*
+     * We will never be working with temp rels during recovery or while
+     * syncing WAL-skipped files.
+     */
     rel->rd_backend = InvalidBackendId;
 
-    /* It must be a permanent table if we're in recovery. */
+    /* It must be a permanent table here */
     rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT;
 
     /* We don't know the name of the relation; use relfilenode instead */
@@ -572,9 +576,9 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
     /*
      * We set up the lockRelId in case anything tries to lock the dummy
      * relation.  Note that this is fairly bogus since relNode may be
-     * different from the relation's OID.  It shouldn't really matter though,
-     * since we are presumably running by ourselves and can't have any lock
-     * conflicts ...
+     * different from the relation's OID.  It shouldn't really matter though.
+     * In recovery, we are running by ourselves and can't have any lock
+     * conflicts.  While syncing, we already hold AccessExclusiveLock.
      */
     rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
     rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 0fdff2918f..9f58ef1378 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -439,6 +439,10 @@ heap_create(const char *relname,
                 break;
         }
     }
+    else
+    {
+        rel->rd_createSubid = InvalidSubTransactionId;
+    }
 
     return rel;
 }
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index fddfbf1d8c..8253c420ef 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,9 +29,13 @@
 #include "miscadmin.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+/* GUC variables */
+int            wal_skip_threshold = 2048;    /* in kilobytes */
+
 /*
  * We keep a list of all relations (represented as RelFileNode values)
  * that have been created or deleted in the current transaction.  When
@@ -61,7 +65,14 @@ typedef struct PendingRelDelete
     struct PendingRelDelete *next;    /* linked-list link */
 } PendingRelDelete;
 
+typedef struct pendingSync
+{
+    RelFileNode rnode;
+    BlockNumber max_truncated;
+} pendingSync;
+
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+HTAB       *pendingSyncHash = NULL;
 
 /*
  * RelationCreateStorage
@@ -117,6 +128,35 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
+    /*
+     * If the relation needs at-commit sync, we also need to track the maximum
+     * unsynced truncated block; see smgrDoPendingSyncs().
+     */
+    if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+    {
+        pendingSync *pending;
+        bool        found;
+
+        /* we sync only permanent relations */
+        Assert(backend == InvalidBackendId);
+
+        if (!pendingSyncHash)
+        {
+            HASHCTL        ctl;
+
+            ctl.keysize = sizeof(RelFileNode);
+            ctl.entrysize = sizeof(pendingSync);
+            ctl.hcxt = TopTransactionContext;
+            pendingSyncHash =
+                hash_create("max truncated block hash",
+                            16, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+        }
+
+        pending = hash_search(pendingSyncHash, &rnode, HASH_ENTER, &found);
+        Assert(!found);
+        pending->max_truncated = 0;
+    }
+
     return srel;
 }
 
@@ -216,6 +256,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
             prev = pending;
         }
     }
+
+    /* FIXME what to do about pending syncs? */
 }
 
 /*
@@ -275,6 +317,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
         }
     }
 
+    RelationPreTruncate(rel);
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -325,6 +369,34 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
         FreeSpaceMapVacuumRange(rel, nblocks, InvalidBlockNumber);
 }
 
+/*
+ * RelationPreTruncate
+ *        Perform AM-independent work before a physical truncation.
+ *
+ * If an access method's relation_nontransactional_truncate does not call
+ * RelationTruncate(), it must call this before decreasing the table size.
+ */
+void
+RelationPreTruncate(Relation rel)
+{
+    pendingSync *pending;
+
+    if (!pendingSyncHash)
+        return;
+    RelationOpenSmgr(rel);
+
+    /* Record largest maybe-unsynced block of files under tracking  */
+    pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
+                          HASH_FIND, NULL);
+    if (pending)
+    {
+        BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
+
+        if (pending->max_truncated < nblocks)
+            pending->max_truncated = nblocks;
+    }
+}
+
 /*
  * Copy a fork's data, block by block.
  *
@@ -355,7 +427,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 
     /*
      * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a permanent relation.
+     * enabled AND it's a permanent relation.  This gives the same answer as
+     * "RelationNeedsWAL(rel) || copying_initfork", because we know the
+     * current operation created a new relfilenode.
      */
     use_wal = XLogIsNeeded() &&
         (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
@@ -397,24 +471,39 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
     }
 
     /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.  (For a temp or
-     * unlogged rel we don't care since the data will be gone after a crash
-     * anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the copy. It's
-     * less obvious that we have to do it even if we did WAL-log the copied
-     * pages. The reason is that since we're copying outside shared buffers, a
-     * CHECKPOINT occurring during the copy has no way to flush the previously
-     * written data to disk (indeed it won't know the new rel even exists).  A
-     * crash later on would replay WAL from the checkpoint, therefore it
-     * wouldn't replay our earlier WAL entries. If we do not fsync those pages
-     * here, they might still not be on disk when the crash occurs.
+     * When we WAL-logged rel pages, we must nonetheless fsync them.  The
+     * reason is that since we're copying outside shared buffers, a CHECKPOINT
+     * occurring during the copy has no way to flush the previously written
+     * data to disk (indeed it won't know the new rel even exists).  A crash
+     * later on would replay WAL from the checkpoint, therefore it wouldn't
+     * replay our earlier WAL entries. If we do not fsync those pages here,
+     * they might still not be on disk when the crash occurs.
      */
-    if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
+    if (use_wal || copying_initfork)
         smgrimmedsync(dst, forkNum);
 }
 
+/*
+ * RelFileNodeSkippingWAL - check if a BM_PERMANENT relfilenode is using WAL
+ *
+ *   Changes of certain relfilenodes must not write WAL; see "Skipping WAL for
+ *   New RelFileNode" in src/backend/access/transam/README.  Though it is
+ *   known from Relation efficiently, this function is intended for the code
+ *   paths not having access to Relation.
+ */
+bool
+RelFileNodeSkippingWAL(RelFileNode rnode)
+{
+    if (XLogIsNeeded())
+        return false;            /* no permanent relfilenode skips WAL */
+
+    if (!pendingSyncHash ||
+        hash_search(pendingSyncHash, &rnode, HASH_FIND, NULL) == NULL)
+        return false;
+
+    return true;
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -492,6 +581,135 @@ smgrDoPendingDeletes(bool isCommit)
     }
 }
 
+/*
+ *    smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    PendingRelDelete *pending;
+    int            nrels = 0,
+                maxrels = 0;
+    SMgrRelation *srels = NULL;
+    HASH_SEQ_STATUS scan;
+    pendingSync *pendingsync;
+
+    if (XLogIsNeeded())
+        return;                    /* no relation can use this */
+
+    Assert(GetCurrentTransactionNestLevel() == 1);
+
+    if (!pendingSyncHash)
+        return;                    /* no relation needs sync */
+
+    /* Just throw away all pending syncs if any at rollback */
+    if (!isCommit)
+    {
+        pendingSyncHash = NULL;
+        return;
+    }
+
+    AssertPendingSyncs_RelationCache();
+
+    /* Skip syncing nodes that smgrDoPendingDeletes() will delete. */
+    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    {
+        if (!pending->atCommit)
+            continue;
+
+        (void) hash_search(pendingSyncHash, (void *) &pending->relnode,
+                           HASH_REMOVE, NULL);
+    }
+
+    hash_seq_init(&scan, pendingSyncHash);
+    while ((pendingsync = (pendingSync *) hash_seq_search(&scan)))
+    {
+        ForkNumber    fork;
+        BlockNumber nblocks[MAX_FORKNUM + 1];
+        BlockNumber total_blocks = 0;
+        SMgrRelation srel;
+
+        srel = smgropen(pendingsync->rnode, InvalidBackendId);
+
+        /*
+         * We emit newpage WAL records for smaller relations.
+         *
+         * Small WAL records have a chance to be emitted along with other
+         * backends' WAL records.  We emit WAL records instead of syncing for
+         * files that are smaller than a certain threshold, expecting faster
+         * commit.  The threshold is defined by the GUC wal_skip_threshold.
+         */
+        for (fork = 0; fork <= MAX_FORKNUM; fork++)
+        {
+            if (smgrexists(srel, fork))
+            {
+                BlockNumber n = smgrnblocks(srel, fork);
+
+                /* we shouldn't come here for unlogged relations */
+                Assert(fork != INIT_FORKNUM);
+
+                nblocks[fork] = n;
+                total_blocks += n;
+            }
+            else
+                nblocks[fork] = InvalidBlockNumber;
+        }
+
+        /*
+         * Sync file or emit WAL records for its contents.  Do file sync if
+         * the size is larger than the threshold or truncates may have removed
+         * blocks beyond the current size.
+         */
+        if (total_blocks * BLCKSZ / 1024 >= wal_skip_threshold ||
+            nblocks[MAIN_FORKNUM] < pendingsync->max_truncated)
+        {
+            /* allocate the initial array, or extend it, if needed */
+            if (maxrels == 0)
+            {
+                maxrels = 8;
+                srels = palloc(sizeof(SMgrRelation) * maxrels);
+            }
+            else if (maxrels <= nrels)
+            {
+                maxrels *= 2;
+                srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+            }
+
+            srels[nrels++] = srel;
+        }
+        else
+        {
+            /* Emit WAL records for all blocks.  The file is small enough. */
+            for (fork = 0; fork <= MAX_FORKNUM; fork++)
+            {
+                int            n = nblocks[fork];
+                Relation    rel;
+
+                if (!BlockNumberIsValid(n))
+                    continue;
+
+                /*
+                 * Emit WAL for the whole file.  Unfortunately we don't know
+                 * what kind of a page this is, so we have to log the full
+                 * page including any unused space.  ReadBufferExtended()
+                 * counts some pgstat events; unfortunately, we discard them.
+                 */
+                rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node);
+                log_newpage_range(rel, fork, 0, n, false);
+                FreeFakeRelcacheEntry(rel);
+            }
+        }
+    }
+
+    pendingSyncHash = NULL;
+
+    if (nrels > 0)
+    {
+        smgrdosyncall(srels, nrels);
+        pfree(srels);
+    }
+}
+
 /*
  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
  *
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index e9d7a7ff79..b836ccf2d6 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1014,6 +1014,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
                 relfilenode2;
     Oid            swaptemp;
     char        swptmpchr;
+    Relation    rel1;
 
     /* We need writable copies of both pg_class tuples. */
     relRelation = table_open(RelationRelationId, RowExclusiveLock);
@@ -1173,6 +1174,15 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
         CacheInvalidateRelcacheByTuple(reltup2);
     }
 
+    /*
+     * Recognize that rel1's relfilenode (swapped from rel2) is new in this
+     * subtransaction. Since the next step for rel2 is deletion, don't bother
+     * recording the newness of its relfilenode.
+     */
+    rel1 = relation_open(r1, NoLock);
+    RelationAssumeNewRelfilenode(rel1);
+    relation_close(rel1, NoLock);
+
     /*
      * Post alter hook for modified relations. The change to r2 is always
      * internal, but r1 depends on the invocation context.
@@ -1489,7 +1499,7 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 
             /* Get the associated valid index to be renamed */
             toastidx = toast_get_valid_index(newrel->rd_rel->reltoastrelid,
-                                             AccessShareLock);
+                                             NoLock);
 
             /* rename the toast table ... */
             snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u",
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index c93a788798..02e3761da8 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2713,63 +2713,15 @@ CopyFrom(CopyState cstate)
                             RelationGetRelationName(cstate->rel))));
     }
 
-    /*----------
-     * Check to see if we can avoid writing WAL
-     *
-     * If archive logging/streaming is not enabled *and* either
-     *    - table was created in same transaction as this COPY
-     *    - data is being written to relfilenode created in this transaction
-     * then we can skip writing WAL.  It's safe because if the transaction
-     * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the table_finish_bulk_insert() at
-     * the bottom of this routine first.
-     *
-     * As mentioned in comments in utils/rel.h, the in-same-transaction test
-     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
-     * can be cleared before the end of the transaction. The exact case is
-     * when a relation sets a new relfilenode twice in same transaction, yet
-     * the second one fails in an aborted subtransaction, e.g.
-     *
-     * BEGIN;
-     * TRUNCATE t;
-     * SAVEPOINT save;
-     * TRUNCATE t;
-     * ROLLBACK TO save;
-     * COPY ...
-     *
-     * Also, if the target file is new-in-transaction, we assume that checking
-     * FSM for free space is a waste of time, even if we must use WAL because
-     * of archiving.  This could possibly be wrong, but it's unlikely.
-     *
-     * The comments for table_tuple_insert and RelationGetBufferForTuple
-     * specify that skipping WAL logging is only safe if we ensure that our
-     * tuples do not go into pages containing tuples from any other
-     * transactions --- but this must be the case if we have a new table or
-     * new relfilenode, so we need no additional work to enforce that.
-     *
-     * We currently don't support this optimization if the COPY target is a
-     * partitioned table as we currently only lazily initialize partition
-     * information when routing the first tuple to the partition.  We cannot
-     * know at this stage if we can perform this optimization.  It should be
-     * possible to improve on this, but it does mean maintaining heap insert
-     * option flags per partition and setting them when we first open the
-     * partition.
-     *
-     * This optimization is not supported for relation types which do not
-     * have any physical storage, with foreign tables and views using
-     * INSTEAD OF triggers entering in this category.  Partitioned tables
-     * are not supported as per the description above.
-     *----------
+    /*
+     * If the target file is new-in-transaction, we assume that checking FSM
+     * for free space is a waste of time.  This could possibly be wrong, but
+     * it's unlikely.
      */
-    /* createSubid is creation check, newRelfilenodeSubid is truncation check */
     if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
         (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
-         cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
-    {
+         cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
         ti_options |= TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
-    }
 
     /*
      * Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 9f387b5f5f..fe9a754782 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -552,16 +552,13 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
     myState->rel = intoRelationDesc;
     myState->reladdr = intoRelationAddr;
     myState->output_cid = GetCurrentCommandId(true);
+    myState->ti_options = TABLE_INSERT_SKIP_FSM;
+    myState->bistate = GetBulkInsertState();
 
     /*
-     * We can skip WAL-logging the insertions, unless PITR or streaming
-     * replication is in use. We can skip the FSM in any case.
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
-    myState->bistate = GetBulkInsertState();
-
-    /* Not using WAL requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(intoRelationDesc) == InvalidBlockNumber);
 }
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 1ee37c1aeb..ea1d0fc850 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -457,17 +457,13 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->transientrel = transientrel;
     myState->output_cid = GetCurrentCommandId(true);
-
-    /*
-     * We can skip WAL-logging the insertions, unless PITR or streaming
-     * replication is in use. We can skip the FSM in any case.
-     */
     myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
-    if (!XLogIsNeeded())
-        myState->ti_options |= TABLE_INSERT_SKIP_WAL;
     myState->bistate = GetBulkInsertState();
 
-    /* Not using WAL requires smgr_targblock be initially invalid */
+    /*
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
+     */
     Assert(RelationGetTargetBlock(transientrel) == InvalidBlockNumber);
 }
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 2ec3fc5014..0edb474118 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4781,19 +4781,14 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         newrel = NULL;
 
     /*
-     * Prepare a BulkInsertState and options for table_tuple_insert. Because
-     * we're building a new heap, we can skip WAL-logging and fsync it to disk
-     * at the end instead (unless WAL-logging is required for archiving or
-     * streaming replication). The FSM is empty too, so don't bother using it.
+     * Prepare a BulkInsertState and options for table_tuple_insert.  The FSM
+     * is empty, so don't bother using it.
      */
     if (newrel)
     {
         mycid = GetCurrentCommandId(true);
         bistate = GetBulkInsertState();
-
         ti_options = TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
     }
     else
     {
@@ -12621,6 +12616,8 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     table_close(pg_class, RowExclusiveLock);
 
+    RelationAssumeNewRelfilenode(rel);
+
     relation_close(rel, NoLock);
 
     /* Make sure the reltablespace change is visible */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index aba3960481..73c38757fa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -66,7 +66,7 @@
 #define BUF_WRITTEN                0x01
 #define BUF_REUSABLE            0x02
 
-#define DROP_RELS_BSEARCH_THRESHOLD        20
+#define RELS_BSEARCH_THRESHOLD        20
 
 typedef struct PrivateRefCountEntry
 {
@@ -105,6 +105,19 @@ typedef struct CkptTsStatus
     int            index;
 } CkptTsStatus;
 
+/*
+ * Type for array used to sort SMgrRelations
+ *
+ * FlushRelationsAllBuffers shares the same comparator function with
+ * DropRelFileNodesAllBuffers. Pointer to this struct and RelFileNode must be
+ * compatible.
+ */
+typedef struct SMgrSortArray
+{
+    RelFileNode rnode;            /* This must be the first member */
+    SMgrRelation srel;
+} SMgrSortArray;
+
 /* GUC variables */
 bool        zero_damaged_pages = false;
 int            bgwriter_lru_maxpages = 100;
@@ -3043,7 +3056,7 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
      * an exactly determined value, as it depends on many factors (CPU and RAM
      * speeds, amount of shared buffers etc.).
      */
-    use_bsearch = n > DROP_RELS_BSEARCH_THRESHOLD;
+    use_bsearch = n > RELS_BSEARCH_THRESHOLD;
 
     /* sort the list of rnodes if necessary */
     if (use_bsearch)
@@ -3293,6 +3306,104 @@ FlushRelationBuffers(Relation rel)
     }
 }
 
+/* ---------------------------------------------------------------------
+ *        FlushRelationsAllBuffers
+ *
+ *        This function flushes out of the buffer pool all the pages of all
+ *        forks of the specified smgr relations.  It's equivalent to calling
+ *        FlushRelationBuffers once per fork per relation.  The relations are
+ *        assumed not to use local buffers.
+ * --------------------------------------------------------------------
+ */
+void
+FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
+{
+    int            i;
+    SMgrSortArray *srels;
+    bool        use_bsearch;
+
+    if (nrels == 0)
+        return;
+
+    /* fill-in array for qsort */
+    srels = palloc(sizeof(SMgrSortArray) * nrels);
+
+    for (i = 0; i < nrels; i++)
+    {
+        Assert(!RelFileNodeBackendIsTemp(smgrs[i]->smgr_rnode));
+
+        srels[i].rnode = smgrs[i]->smgr_rnode.node;
+        srels[i].srel = smgrs[i];
+    }
+
+    /*
+     * Save the bsearch overhead for low number of relations to sync. See
+     * DropRelFileNodesAllBuffers for details.
+     */
+    use_bsearch = nrels > RELS_BSEARCH_THRESHOLD;
+
+    /* sort the list of SMgrRelations if necessary */
+    if (use_bsearch)
+        pg_qsort(srels, nrels, sizeof(SMgrSortArray), rnode_comparator);
+
+    /* Make sure we can handle the pin inside the loop */
+    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+    for (i = 0; i < NBuffers; i++)
+    {
+        SMgrSortArray *srelent = NULL;
+        BufferDesc *bufHdr = GetBufferDescriptor(i);
+        uint32        buf_state;
+
+        /*
+         * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
+         * and saves some cycles.
+         */
+
+        if (!use_bsearch)
+        {
+            int            j;
+
+            for (j = 0; j < nrels; j++)
+            {
+                if (RelFileNodeEquals(bufHdr->tag.rnode, srels[j].rnode))
+                {
+                    srelent = &srels[j];
+                    break;
+                }
+            }
+
+        }
+        else
+        {
+            srelent = bsearch((const void *) &(bufHdr->tag.rnode),
+                              srels, nrels, sizeof(SMgrSortArray),
+                              rnode_comparator);
+        }
+
+        /* buffer doesn't belong to any of the given relfilenodes; skip it */
+        if (srelent == NULL)
+            continue;
+
+        ReservePrivateRefCountEntry();
+
+        buf_state = LockBufHdr(bufHdr);
+        if (RelFileNodeEquals(bufHdr->tag.rnode, srelent->rnode) &&
+            (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+        {
+            PinBuffer_Locked(bufHdr);
+            LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+            FlushBuffer(bufHdr, srelent->srel);
+            LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+            UnpinBuffer(bufHdr, true);
+        }
+        else
+            UnlockBufHdr(bufHdr, buf_state);
+    }
+
+    pfree(srels);
+}
+
 /* ---------------------------------------------------------------------
  *        FlushDatabaseBuffers
  *
@@ -3494,13 +3605,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
             (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
         {
             /*
-             * If we're in recovery we cannot dirty a page because of a hint.
-             * We can set the hint, just not dirty the page as a result so the
-             * hint is lost when we evict the page or shutdown.
+             * If we must not write WAL, due to a relfilenode-specific
+             * condition or being in recovery, don't dirty the page.  We can
+             * set the hint, just not dirty the page as a result so the hint
+             * is lost when we evict the page or shutdown.
              *
              * See src/backend/storage/page/README for longer discussion.
              */
-            if (RecoveryInProgress())
+            if (RecoveryInProgress() ||
+                RelFileNodeSkippingWAL(bufHdr->tag.rnode))
                 return;
 
             /*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 56dba09299..8f98f665c5 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -587,6 +587,18 @@ LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode)
     return (locallock && locallock->nLocks > 0);
 }
 
+#ifdef USE_ASSERT_CHECKING
+/*
+ * GetLockMethodLocalHash -- return the hash of local locks, for modules that
+ *        evaluate assertions based on all locks held.
+ */
+HTAB *
+GetLockMethodLocalHash(void)
+{
+    return LockMethodLocalHash;
+}
+#endif
+
 /*
  * LockHasWaiters -- look up 'locktag' and check if releasing this
  *        lock would wake up other processes waiting for it.
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 85b7115400..e28c5a49a8 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -248,11 +248,10 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
  * During replay, we would delete the file and then recreate it, which is fine
  * if the contents of the file were repopulated by subsequent WAL entries.
  * But if we didn't WAL-log insertions, but instead relied on fsyncing the
- * file after populating it (as for instance CLUSTER and CREATE INDEX do),
- * the contents of the file would be lost forever.  By leaving the empty file
- * until after the next checkpoint, we prevent reassignment of the relfilenode
- * number until it's safe, because relfilenode assignment skips over any
- * existing file.
+ * file after populating it (as we do at wal_level=minimal), the contents of
+ * the file would be lost forever.  By leaving the empty file until after the
+ * next checkpoint, we prevent reassignment of the relfilenode number until
+ * it's safe, because relfilenode assignment skips over any existing file.
  *
  * We do not need to go through this dance for temp relations, though, because
  * we never make WAL entries for temp rels, and so a temp rel poses no threat
@@ -877,12 +876,18 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
  *    mdimmedsync() -- Immediately sync a relation to stable storage.
  *
  * Note that only writes already issued are synced; this routine knows
- * nothing of dirty buffers that may exist inside the buffer manager.
+ * nothing of dirty buffers that may exist inside the buffer manager.  We
+ * sync active and inactive segments; smgrDoPendingSyncs() relies on this.
+ * Consider a relation skipping WAL.  Suppose a checkpoint syncs blocks of
+ * some segment, then mdtruncate() renders that segment inactive.  If we
+ * crash before the next checkpoint syncs the newly-inactive segment, that
+ * segment may survive recovery, reintroducing unwanted data into the table.
  */
 void
 mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 {
     int            segno;
+    int            min_inactive_seg;
 
     /*
      * NOTE: mdnblocks makes sure we have opened all active segments, so that
@@ -890,7 +895,16 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
      */
     mdnblocks(reln, forknum);
 
-    segno = reln->md_num_open_segs[forknum];
+    min_inactive_seg = segno = reln->md_num_open_segs[forknum];
+
+    /*
+     * Temporarily open inactive segments, then close them after sync.  There
+     * may be some inactive segments left opened after fsync() error, but that
+     * is harmless.  We don't bother to clean them up and take a risk of
+     * further trouble.  The next mdclose() will soon close them.
+     */
+    while (_mdfd_openseg(reln, forknum, segno, 0) != NULL)
+        segno++;
 
     while (segno > 0)
     {
@@ -901,6 +915,14 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
                     (errcode_for_file_access(),
                      errmsg("could not fsync file \"%s\": %m",
                             FilePathName(v->mdfd_vfd))));
+
+        /* Close inactive segments immediately */
+        if (segno > min_inactive_seg)
+        {
+            FileClose(v->mdfd_vfd);
+            _fdvec_resize(reln, forknum, segno - 1);
+        }
+
         segno--;
     }
 }
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 360b5bf5bf..72c9696ad1 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -388,6 +388,41 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
     smgrsw[which].smgr_unlink(rnode, InvalidForkNumber, isRedo);
 }
 
+/*
+ *    smgrdosyncall() -- Immediately sync all forks of all given relations
+ *
+ *        All forks of all given relations are synced out to the store.
+ *
+ *        This is equivalent to FlushRelationBuffers() for each smgr relation,
+ *        then calling smgrimmedsync() for all forks of each relation, but it's
+ *        significantly quicker so should be preferred when possible.
+ */
+void
+smgrdosyncall(SMgrRelation *rels, int nrels)
+{
+    int            i = 0;
+    ForkNumber    forknum;
+
+    if (nrels == 0)
+        return;
+
+    FlushRelationsAllBuffers(rels, nrels);
+
+    /*
+     * Sync the physical file(s).
+     */
+    for (i = 0; i < nrels; i++)
+    {
+        int            which = rels[i]->smgr_which;
+
+        for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+        {
+            if (smgrsw[which].smgr_exists(rels[i], forknum))
+                smgrsw[which].smgr_immedsync(rels[i], forknum);
+        }
+    }
+}
+
 /*
  *    smgrdounlinkall() -- Immediately unlink all forks of all given relations
  *
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index df025a5a30..0ac72572e3 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -257,6 +257,9 @@ static void RelationReloadIndexInfo(Relation relation);
 static void RelationReloadNailed(Relation relation);
 static void RelationFlushRelation(Relation relation);
 static void RememberToFreeTupleDescAtEOX(TupleDesc td);
+#ifdef USE_ASSERT_CHECKING
+static void AssertPendingSyncConsistency(Relation relation);
+#endif
 static void AtEOXact_cleanup(Relation relation, bool isCommit);
 static void AtEOSubXact_cleanup(Relation relation, bool isCommit,
                                 SubTransactionId mySubid, SubTransactionId parentSubid);
@@ -1090,6 +1093,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
     relation->rd_isnailed = false;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     switch (relation->rd_rel->relpersistence)
     {
         case RELPERSISTENCE_UNLOGGED:
@@ -1814,6 +1818,7 @@ formrdesc(const char *relationName, Oid relationReltype,
     relation->rd_isnailed = true;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     relation->rd_backend = InvalidBackendId;
     relation->rd_islocaltemp = false;
 
@@ -2021,6 +2026,7 @@ RelationIdGetRelation(Oid relationId)
     rd = RelationBuildDesc(relationId, true);
     if (RelationIsValid(rd))
         RelationIncrementReferenceCount(rd);
+
     return rd;
 }
 
@@ -2089,7 +2095,7 @@ RelationClose(Relation relation)
 #ifdef RELCACHE_FORCE_RELEASE
     if (RelationHasReferenceCountZero(relation) &&
         relation->rd_createSubid == InvalidSubTransactionId &&
-        relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
         RelationClearRelation(relation, false);
 #endif
 }
@@ -2505,13 +2511,13 @@ RelationClearRelation(Relation relation, bool rebuild)
          * problem.
          *
          * When rebuilding an open relcache entry, we must preserve ref count,
-         * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state.  Also
-         * attempt to preserve the pg_class entry (rd_rel), tupledesc,
-         * rewrite-rule, partition key, and partition descriptor substructures
-         * in place, because various places assume that these structures won't
-         * move while they are working with an open relcache entry.  (Note:
-         * the refcount mechanism for tupledescs might someday allow us to
-         * remove this hack for the tupledesc.)
+         * rd_*Subid, and rd_toastoid state.  Also attempt to preserve the
+         * pg_class entry (rd_rel), tupledesc, rewrite-rule, partition key,
+         * and partition descriptor substructures in place, because various
+         * places assume that these structures won't move while they are
+         * working with an open relcache entry.  (Note:  the refcount
+         * mechanism for tupledescs might someday allow us to remove this hack
+         * for the tupledesc.)
          *
          * Note that this process does not touch CurrentResourceOwner; which
          * is good because whatever ref counts the entry may have do not
@@ -2591,6 +2597,7 @@ RelationClearRelation(Relation relation, bool rebuild)
         /* creation sub-XIDs must be preserved */
         SWAPFIELD(SubTransactionId, rd_createSubid);
         SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+        SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
         /* un-swap rd_rel pointers, swap contents instead */
         SWAPFIELD(Form_pg_class, rd_rel);
         /* ... but actually, we don't have to update newrel->rd_rel */
@@ -2669,12 +2676,12 @@ static void
 RelationFlushRelation(Relation relation)
 {
     if (relation->rd_createSubid != InvalidSubTransactionId ||
-        relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
     {
         /*
          * New relcache entries are always rebuilt, not flushed; else we'd
-         * forget the "new" status of the relation, which is a useful
-         * optimization to have.  Ditto for the new-relfilenode status.
+         * forget the "new" status of the relation.  Ditto for the
+         * new-relfilenode status.
          *
          * The rel could have zero refcnt here, so temporarily increment the
          * refcnt to ensure it's safe to rebuild it.  We can assume that the
@@ -2754,11 +2761,10 @@ RelationCacheInvalidateEntry(Oid relationId)
  *     relation cache and re-read relation mapping data.
  *
  *     This is currently used only to recover from SI message buffer overflow,
- *     so we do not touch new-in-transaction relations; they cannot be targets
- *     of cross-backend SI updates (and our own updates now go through a
- *     separate linked list that isn't limited by the SI message buffer size).
- *     Likewise, we need not discard new-relfilenode-in-transaction hints,
- *     since any invalidation of those would be a local event.
+ *     so we do not touch relations having new-in-transaction relfilenodes; they
+ *     cannot be targets of cross-backend SI updates (and our own updates now go
+ *     through a separate linked list that isn't limited by the SI message
+ *     buffer size).
  *
  *     We do this in two phases: the first pass deletes deletable items, and
  *     the second one rebuilds the rebuildable items.  This is essential for
@@ -2809,7 +2815,7 @@ RelationCacheInvalidate(void)
          * pending invalidations.
          */
         if (relation->rd_createSubid != InvalidSubTransactionId ||
-            relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+            relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
             continue;
 
         relcacheInvalsReceived++;
@@ -2921,6 +2927,78 @@ RememberToFreeTupleDescAtEOX(TupleDesc td)
     EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
 }
 
+#ifdef USE_ASSERT_CHECKING
+static void
+AssertPendingSyncConsistency(Relation relation)
+{
+    bool        relcache_verdict =
+    relation->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&
+    ((relation->rd_createSubid != InvalidSubTransactionId &&
+      RELKIND_HAS_STORAGE(relation->rd_rel->relkind)) ||
+     relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId);
+
+    Assert(relcache_verdict == RelFileNodeSkippingWAL(relation->rd_node));
+}
+
+/*
+ * AssertPendingSyncs_RelationCache
+ *
+ *    Assert that relcache.c and storage.c agree on whether to skip WAL.
+ *
+ * This consistently detects relcache.c skipping WAL while storage.c is not
+ * skipping WAL.  It often fails to detect the reverse error, because
+ * invalidation will have destroyed the relcache entry.  It will detect the
+ * reverse error if something opens the relation after the DDL.
+ */
+void
+AssertPendingSyncs_RelationCache(void)
+{
+    HASH_SEQ_STATUS status;
+    LOCALLOCK  *locallock;
+    Relation   *rels;
+    int            maxrels;
+    int            nrels;
+    RelIdCacheEnt *idhentry;
+    int            i;
+
+    /* open every relation that this transaction has locked */
+    PushActiveSnapshot(GetTransactionSnapshot());
+    maxrels = 1;
+    rels = palloc(maxrels * sizeof(*rels));
+    nrels = 0;
+    hash_seq_init(&status, GetLockMethodLocalHash());
+    while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+    {
+        Oid            relid;
+        Relation    r;
+
+        if (locallock->nLocks <= 0)
+            continue;
+        if ((LockTagType) locallock->tag.lock.locktag_type !=
+            LOCKTAG_RELATION)
+            continue;
+        relid = ObjectIdGetDatum(locallock->tag.lock.locktag_field2);
+        r = RelationIdGetRelation(relid);
+        if (r == NULL)
+            continue;
+        if (nrels >= maxrels)
+        {
+            maxrels *= 2;
+            rels = repalloc(rels, maxrels * sizeof(*rels));
+        }
+        rels[nrels++] = r;
+    }
+
+    hash_seq_init(&status, RelationIdCache);
+    while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+        AssertPendingSyncConsistency(idhentry->reldesc);
+
+    for (i = 0; i < nrels; i++)
+        RelationClose(rels[i]);
+    PopActiveSnapshot();
+}
+#endif
+
 /*
  * AtEOXact_RelationCache
  *
@@ -3032,10 +3110,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
      *
      * During commit, reset the flag to zero, since we are now out of the
      * creating transaction.  During abort, simply delete the relcache entry
-     * --- it isn't interesting any longer.  (NOTE: if we have forgotten the
-     * new-ness of a new relation due to a forced cache flush, the entry will
-     * get deleted anyway by shared-cache-inval processing of the aborted
-     * pg_class insertion.)
+     * --- it isn't interesting any longer.
      */
     if (relation->rd_createSubid != InvalidSubTransactionId)
     {
@@ -3063,9 +3138,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
     }
 
     /*
-     * Likewise, reset the hint about the relfilenode being new.
+     * Likewise, reset any record of the relfilenode being new.
      */
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 }
 
 /*
@@ -3157,7 +3233,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
     }
 
     /*
-     * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+     * Likewise, update or drop any new-relfilenode-in-subtransaction record.
      */
     if (relation->rd_newRelfilenodeSubid == mySubid)
     {
@@ -3166,6 +3242,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         else
             relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     }
+
+    if (relation->rd_firstRelfilenodeSubid == mySubid)
+    {
+        if (isCommit)
+            relation->rd_firstRelfilenodeSubid = parentSubid;
+        else
+            relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    }
 }
 
 
@@ -3255,6 +3339,7 @@ RelationBuildLocalRelation(const char *relname,
     /* it's being created in this transaction */
     rel->rd_createSubid = GetCurrentSubTransactionId();
     rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 
     /*
      * create a new tuple descriptor from the one passed in.  We do this
@@ -3552,14 +3637,29 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
      */
     CommandCounterIncrement();
 
-    /*
-     * Mark the rel as having been given a new relfilenode in the current
-     * (sub) transaction.  This is a hint that can be used to optimize later
-     * operations on the rel in the same transaction.
-     */
+    RelationAssumeNewRelfilenode(relation);
+}
+
+/*
+ * RelationAssumeNewRelfilenode
+ *
+ * Code that modifies pg_class.reltablespace or pg_class.relfilenode must call
+ * this.  The call shall precede any code that might insert WAL records whose
+ * replay would modify bytes in the new RelFileNode, and the call shall follow
+ * any WAL modifying bytes in the prior RelFileNode.  See struct RelationData.
+ * Ideally, call this as near as possible to the CommandCounterIncrement()
+ * that makes the pg_class change visible (before it or after it); that
+ * minimizes the chance of future development adding a forbidden WAL insertion
+ * between RelationAssumeNewRelfilenode() and CommandCounterIncrement().
+ */
+void
+RelationAssumeNewRelfilenode(Relation relation)
+{
     relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+    if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
 
-    /* Flag relation as needing eoxact cleanup (to remove the hint) */
+    /* Flag relation as needing eoxact cleanup (to clear these fields) */
     EOXactListAdd(relation);
 }
 
@@ -5625,6 +5725,7 @@ load_relcache_init_file(bool shared)
         rel->rd_fkeylist = NIL;
         rel->rd_createSubid = InvalidSubTransactionId;
         rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+        rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
         rel->rd_amcache = NULL;
         MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e5f8a1301f..ab1091564b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -36,6 +36,7 @@
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
+#include "catalog/storage.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
 #include "commands/trigger.h"
@@ -2670,6 +2671,18 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"wal_skip_threshold", PGC_USERSET, RESOURCES_DISK,
+            gettext_noop("Size of new file to fsync instead of writing WAL."),
+            NULL,
+            GUC_UNIT_KB
+        },
+        &wal_skip_threshold,
+        2048,
+        0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
     {
         {"max_wal_senders", PGC_POSTMASTER, REPLICATION_SENDING,
             gettext_noop("Sets the maximum number of simultaneously running WAL sender processes."),
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 18f2b0d98e..4bfc628000 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -455,6 +455,8 @@ extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
                                 BlockNumber origrlink, GistNSN oldnsn,
                                 Buffer leftchild, bool markfollowright);
 
+extern XLogRecPtr gistXLogAssignLSN(void);
+
 /* gistget.c */
 extern bool gistgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 55fc843d3a..673afee1e1 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -26,6 +26,7 @@
  /* #define XLOG_GIST_INSERT_COMPLETE     0x40 */    /* not used anymore */
  /* #define XLOG_GIST_CREATE_INDEX         0x50 */    /* not used anymore */
 #define XLOG_GIST_PAGE_DELETE        0x60
+#define XLOG_GIST_ASSIGN_LSN        0x70    /* nop, assign new LSN */
 
 /*
  * Backup Blk 0: updated page.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 580b4caef7..d9be69c124 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
@@ -166,8 +165,6 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                                HeapTuple tup);
 
-extern void heap_sync(Relation relation);
-
 extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
                                                          ItemPointerData *items,
                                                          int nitems);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index fb2902bd69..e6d7fa1e65 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
                                        TransactionId OldestXmin, TransactionId FreezeXid,
-                                       MultiXactId MultiXactCutoff, bool use_wal);
+                                       MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
                                HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 696451f728..6547099e84 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -127,7 +127,7 @@ typedef struct TM_FailureData
 } TM_FailureData;
 
 /* "options" flag bits for table_tuple_insert */
-#define TABLE_INSERT_SKIP_WAL        0x0001
+/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
 #define TABLE_INSERT_SKIP_FSM        0x0002
 #define TABLE_INSERT_FROZEN            0x0004
 #define TABLE_INSERT_NO_LOGICAL        0x0008
@@ -409,9 +409,8 @@ typedef struct TableAmRoutine
 
     /*
      * Perform operations necessary to complete insertions made via
-     * tuple_insert and multi_insert with a BulkInsertState specified. This
-     * may for example be used to flush the relation, when the
-     * TABLE_INSERT_SKIP_WAL option was used.
+     * tuple_insert and multi_insert with a BulkInsertState specified. In-tree
+     * access methods ceased to use this.
      *
      * Typically callers of tuple_insert and multi_insert will just pass all
      * the flags that apply to them, and each AM has to decide which of them
@@ -1105,10 +1104,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * The options bitmask allows the caller to specify options that may change the
  * behaviour of the AM. The AM will ignore options that it does not support.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.
@@ -1328,9 +1323,7 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
 
 /*
  * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
+ * tuple_insert and multi_insert with a BulkInsertState specified.
  */
 static inline void
 table_finish_bulk_insert(Relation rel, int options)
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 048003c25e..bd37bf311c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,18 +19,24 @@
 #include "storage/smgr.h"
 #include "utils/relcache.h"
 
+/* GUC variables */
+extern int    wal_skip_threshold;
+
 extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
+extern void RelationPreTruncate(Relation rel);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
 extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
                                 ForkNumber forkNum, char relpersistence);
+extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
 
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingSyncs(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 73c7e9ba38..292d440eaf 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -49,6 +49,9 @@ typedef enum
 /* forward declared, to avoid having to expose buf_internals.h here */
 struct WritebackContext;
 
+/* forward declared, to avoid including smgr.h here */
+struct SMgrRelationData;
+
 /* in globals.c ... this duplicates miscadmin.h */
 extern PGDLLIMPORT int NBuffers;
 
@@ -189,6 +192,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                                    ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                                    int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index bb8e4e6e5b..8c180094f0 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -544,6 +544,9 @@ extern void LockReleaseSession(LOCKMETHODID lockmethodid);
 extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
 extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
 extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
+#ifdef USE_ASSERT_CHECKING
+extern HTAB *GetLockMethodLocalHash(void);
+#endif
 extern bool LockHasWaiters(const LOCKTAG *locktag,
                            LOCKMODE lockmode, bool sessionLock);
 extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 243822137c..79dfe0e373 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -89,6 +89,7 @@ extern void smgrcloseall(void);
 extern void smgrclosenode(RelFileNodeBackend rnode);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdounlink(SMgrRelation reln, bool isRedo);
+extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
                        BlockNumber blocknum, char *buffer, bool skipFsync);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 44ed04dd3f..ad72a8b910 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -64,22 +64,40 @@ typedef struct RelationData
                                  * rd_replidindex) */
     bool        rd_statvalid;    /* is rd_statlist valid? */
 
-    /*
+    /*----------
      * rd_createSubid is the ID of the highest subtransaction the rel has
-     * survived into; or zero if the rel was not created in the current top
-     * transaction.  This can be now be relied on, whereas previously it could
-     * be "forgotten" in earlier releases. Likewise, rd_newRelfilenodeSubid is
-     * the ID of the highest subtransaction the relfilenode change has
-     * survived into, or zero if not changed in the current transaction (or we
-     * have forgotten changing it). rd_newRelfilenodeSubid can be forgotten
-     * when a relation has multiple new relfilenodes within a single
-     * transaction, with one of them occurring in a subsequently aborted
-     * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
-     * ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+     * survived into or zero if the rel was not created in the current top
+     * transaction.  rd_firstRelfilenodeSubid is the ID of the highest
+     * subtransaction an rd_node change has survived into or zero if rd_node
+     * matches the value it had at the start of the current top transaction.
+     * (Rolling back the subtransaction that rd_firstRelfilenodeSubid denotes
+     * would restore rd_node to the value it had at the start of the current
+     * top transaction.  Rolling back any lower subtransaction would not.)
+     * Their accuracy is critical to RelationNeedsWAL().
+     *
+     * rd_newRelfilenodeSubid is the ID of the highest subtransaction the
+     * most-recent relfilenode change has survived into or zero if not changed
+     * in the current transaction (or we have forgotten changing it).  This
+     * field is accurate when non-zero, but it can be zero when a relation has
+     * multiple new relfilenodes within a single transaction, with one of them
+     * occurring in a subsequently aborted subtransaction, e.g.
+     *        BEGIN;
+     *        TRUNCATE t;
+     *        SAVEPOINT save;
+     *        TRUNCATE t;
+     *        ROLLBACK TO save;
+     *        -- rd_newRelfilenodeSubid is now forgotten
+     *
+     * These fields are read-only outside relcache.c.  Other files trigger
+     * rd_node changes by updating pg_class.reltablespace and/or
+     * pg_class.relfilenode.  They must call RelationAssumeNewRelfilenode() to
+     * update these fields.
      */
     SubTransactionId rd_createSubid;    /* rel was created in current xact */
-    SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
-                                                 * current xact */
+    SubTransactionId rd_newRelfilenodeSubid;    /* highest subxact changing
+                                                 * rd_node to current value */
+    SubTransactionId rd_firstRelfilenodeSubid;    /* highest subxact changing
+                                                 * rd_node to any value */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
@@ -526,9 +544,16 @@ typedef struct ViewOptions
 /*
  * RelationNeedsWAL
  *        True if relation needs WAL.
- */
-#define RelationNeedsWAL(relation) \
-    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.  See "Skipping WAL for New
+ * RelFileNode" in src/backend/access/transam/README.
+ */
+#define RelationNeedsWAL(relation)                                        \
+    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&    \
+     (XLogIsNeeded() ||                                                    \
+      (relation->rd_createSubid == InvalidSubTransactionId &&            \
+       relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
 
 /*
  * RelationUsesLocalBuffers
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index d77f5beec6..62239a09e8 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -106,9 +106,10 @@ extern Relation RelationBuildLocalRelation(const char *relname,
                                            char relkind);
 
 /*
- * Routine to manage assignment of new relfilenode to a relation
+ * Routines to manage assignment of new relfilenode to a relation
  */
 extern void RelationSetNewRelfilenode(Relation relation, char persistence);
+extern void RelationAssumeNewRelfilenode(Relation relation);
 
 /*
  * Routines for flushing/rebuilding relcache entries in various scenarios
@@ -121,6 +122,11 @@ extern void RelationCacheInvalidate(void);
 
 extern void RelationCloseSmgrByOid(Oid relationId);
 
+#ifdef USE_ASSERT_CHECKING
+extern void AssertPendingSyncs_RelationCache(void);
+#else
+#define AssertPendingSyncs_RelationCache() do {} while (0)
+#endif
 extern void AtEOXact_RelationCache(bool isCommit);
 extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid,
                                       SubTransactionId parentSubid);
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index b492c606ab..3ac009f127 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -1982,6 +1982,12 @@ select * from another;
 (3 rows)
 
 drop table another;
+-- Create an index that skips WAL, then perform a SET DATA TYPE that skips
+-- rewriting the index.
+begin;
+create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
+alter table skip_wal_skip_rewrite_index alter c type varchar(20);
+commit;
 -- table's row type
 create table tab1 (a int, b text);
 create table tab2 (x int, y tab1);
diff --git a/src/test/regress/sql/alter_table.sql b/src/test/regress/sql/alter_table.sql
index abe7be3223..0420fa495c 100644
--- a/src/test/regress/sql/alter_table.sql
+++ b/src/test/regress/sql/alter_table.sql
@@ -1358,6 +1358,13 @@ select * from another;
 
 drop table another;
 
+-- Create an index that skips WAL, then perform a SET DATA TYPE that skips
+-- rewriting the index.
+begin;
+create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
+alter table skip_wal_skip_rewrite_index alter c type varchar(20);
+commit;
+
 -- table's row type
 create table tab1 (a int, b text);
 create table tab2 (x int, y tab1);
-- 
2.23.0

From e769d75ed56ff3971d9b5cda63c68cad52ed74a9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 14 Jan 2020 19:24:04 +0900
Subject: [PATCH v31 2/3] Fix the defect 1

Pending sync is lost by the followig sequence. Fix it.

  begin;
  create table t (c int);
  savepoint q; drop table t; rollback to q;  -- forgets table is skipping wal
  commit;  -- assertion failure
---
 src/backend/utils/cache/relcache.c | 67 ++++++++++++++++++++++++++----
 src/include/utils/rel.h            |  1 +
 2 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 0ac72572e3..551a7d40bd 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1094,6 +1094,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_isremoved = false;
     switch (relation->rd_rel->relpersistence)
     {
         case RELPERSISTENCE_UNLOGGED:
@@ -1993,7 +1994,7 @@ RelationIdGetRelation(Oid relationId)
     {
         RelationIncrementReferenceCount(rd);
         /* revalidate cache entry if necessary */
-        if (!rd->rd_isvalid)
+        if (!rd->rd_isvalid && !rd->rd_isremoved)
         {
             /*
              * Indexes only have a limited number of possible schema changes,
@@ -2137,7 +2138,7 @@ RelationReloadIndexInfo(Relation relation)
     /* Should be called only for invalidated indexes */
     Assert((relation->rd_rel->relkind == RELKIND_INDEX ||
             relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX) &&
-           !relation->rd_isvalid);
+           !relation->rd_isvalid && !relation->rd_isremoved);
 
     /* Ensure it's closed at smgr level */
     RelationCloseSmgr(relation);
@@ -2443,7 +2444,7 @@ RelationClearRelation(Relation relation, bool rebuild)
     if ((relation->rd_rel->relkind == RELKIND_INDEX ||
          relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX) &&
         relation->rd_refcnt > 0 &&
-        relation->rd_indexcxt != NULL)
+        relation->rd_indexcxt != NULL && !relation->rd_isremoved)
     {
         relation->rd_isvalid = false;    /* needs to be revalidated */
         if (IsTransactionState())
@@ -2462,6 +2463,18 @@ RelationClearRelation(Relation relation, bool rebuild)
      */
     if (!rebuild)
     {
+        /*
+         * The relcache entry is still needed to perform at-commit sync if the
+         * subtransaction aborts later.  Mark the relcache as "removed" and
+         * leave it live invalid.
+         */
+        if (relation->rd_createSubid != InvalidSubTransactionId ||
+            relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
+        {
+            relation->rd_isremoved = true;
+            return;
+        }
+
         /* Remove it from the hash table */
         RelationCacheDelete(relation);
 
@@ -2546,6 +2559,19 @@ RelationClearRelation(Relation relation, bool rebuild)
             if (HistoricSnapshotActive())
                 return;
 
+            /*
+             * Although this relation is already dropped from catalog, the
+             * relcache entry is still needed to perform at-commit sync if the
+             * subtransaction aborts later.  Mark the relcache as "removed" and
+             * leave it live invalid.
+             */
+            if (relation->rd_createSubid != InvalidSubTransactionId ||
+                relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
+            {
+                relation->rd_isremoved = true;
+                return;
+            }
+
             /*
              * This shouldn't happen as dropping a relation is intended to be
              * impossible if still referenced (cf. CheckTableNotInUse()). But
@@ -2991,7 +3017,20 @@ AssertPendingSyncs_RelationCache(void)
 
     hash_seq_init(&status, RelationIdCache);
     while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
-        AssertPendingSyncConsistency(idhentry->reldesc);
+    {
+        Relation r = idhentry->reldesc;
+
+        /* Ignore relcache entries of deleted relations */
+        if (r->rd_isremoved)
+        {
+            Assert(!r->rd_isvalid &&
+                   (r->rd_createSubid != InvalidSubTransactionId ||
+                    r->rd_firstRelfilenodeSubid != InvalidSubTransactionId));
+            continue;
+        }
+
+        AssertPendingSyncConsistency(r);
+    }
 
     for (i = 0; i < nrels; i++)
         RelationClose(rels[i]);
@@ -3114,8 +3153,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
      */
     if (relation->rd_createSubid != InvalidSubTransactionId)
     {
-        if (isCommit)
-            relation->rd_createSubid = InvalidSubTransactionId;
+        relation->rd_createSubid = InvalidSubTransactionId;
+
+        if (isCommit && !relation->rd_isremoved)
+        {} /* Nothing to do */
         else if (RelationHasReferenceCountZero(relation))
         {
             RelationClearRelation(relation, false);
@@ -3131,7 +3172,6 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
              * eventually.  This must be just a WARNING to avoid
              * error-during-error-recovery loops.
              */
-            relation->rd_createSubid = InvalidSubTransactionId;
             elog(WARNING, "cannot remove relcache entry for \"%s\" because it has nonzero refcount",
                  RelationGetRelationName(relation));
         }
@@ -3232,6 +3272,19 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         }
     }
 
+    /*
+     * If this relation registered pending sync then removed, subxact rollback
+     * cancels pending remove. Subxact commit propagates it to the parent.
+     */
+    if (relation->rd_isremoved)
+    {
+        Assert (!relation->rd_isvalid &&
+                (relation->rd_createSubid != InvalidSubTransactionId ||
+                 relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId));
+        if (!isCommit)
+            relation->rd_isremoved = false;
+    }
+
     /*
      * Likewise, update or drop any new-relfilenode-in-subtransaction record.
      */
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index ad72a8b910..970f20b82a 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -98,6 +98,7 @@ typedef struct RelationData
                                                  * rd_node to current value */
     SubTransactionId rd_firstRelfilenodeSubid;    /* highest subxact changing
                                                  * rd_node to any value */
+    bool             rd_isremoved;                /* is to be removed?  */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
-- 
2.23.0

From 3c756dc06da6ec07be13895fa136ea68fb831be9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 15 Jan 2020 17:00:39 +0900
Subject: [PATCH v31 3/3] Fix the defect 2

ALTER TABLE ALTER TYPE may reuse old indexes that are created in the
current transaction.  Pass the information to the relcache of the new
index relation so that pending sync correctly works.
---
 src/backend/commands/tablecmds.c | 29 ++++++++++++++++++++++++++---
 src/include/nodes/parsenodes.h   |  1 +
 2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 0edb474118..9c7e2d48e4 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -7420,12 +7420,32 @@ ATExecAddIndex(AlteredTableInfo *tab, Relation rel,
      * this index will have scheduled the storage for deletion at commit, so
      * cancel that pending deletion.
      */
+    Assert (OidIsValid(stmt->oldNode) == OidIsValid(stmt->oldRelId));
     if (OidIsValid(stmt->oldNode))
     {
-        Relation    irel = index_open(address.objectId, NoLock);
+        Relation    newirel = index_open(address.objectId, NoLock);
+        Relation    oldirel = RelationIdGetRelation(stmt->oldRelId);
 
-        RelationPreserveStorage(irel->rd_node, true);
-        index_close(irel, NoLock);
+        RelationPreserveStorage(newirel->rd_node, true);
+
+        /*
+         * We need to copy newness hints to new relation iff the relcache entry
+         * of the already removed relation is avaiable.
+         */
+        if (oldirel != NULL)
+        {
+            Assert(!oldirel->rd_isvalid && oldirel->rd_isremoved &&
+                   (oldirel->rd_createSubid != InvalidSubTransactionId ||
+                    oldirel->rd_firstRelfilenodeSubid !=
+                    InvalidSubTransactionId));
+
+            newirel->rd_createSubid = oldirel->rd_createSubid;
+            newirel->rd_firstRelfilenodeSubid =
+                oldirel->rd_firstRelfilenodeSubid;
+
+            RelationClose(oldirel);
+        }
+        index_close(newirel, NoLock);
     }
 
     return address;
@@ -11680,7 +11700,10 @@ TryReuseIndex(Oid oldId, IndexStmt *stmt)
 
         /* If it's a partitioned index, there is no storage to share. */
         if (irel->rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
+        {
             stmt->oldNode = irel->rd_node.relNode;
+            stmt->oldRelId = irel->rd_id;
+        }
         index_close(irel, NoLock);
     }
 }
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 28d837b8fa..c4bdf7ccc9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2784,6 +2784,7 @@ typedef struct IndexStmt
     char       *idxcomment;        /* comment to apply to index, or NULL */
     Oid            indexOid;        /* OID of an existing index, if any */
     Oid            oldNode;        /* relfilenode of existing storage, if any */
+    Oid            oldRelId;        /* relid of the old index, if any */
     bool        unique;            /* is index unique? */
     bool        primary;        /* is index a primary key? */
     bool        isconstraint;    /* is it for a pkey/unique constraint? */
-- 
2.23.0

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

16 января 2020 г., 08:20:57

All the known defects are fixed.

At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch <noah@leadboat.com> wrote in 
> === Defect 3: storage.c checks size decrease of MAIN_FORKNUM only
> 
> storage.c tracks only MAIN_FORKNUM in pendingsync->max_truncated.  Is it
> possible for MAIN_FORKNUM to have a net size increase while FSM_FORKNUM has a
> net size decrease?  I haven't tested, but this sequence seems possible:
> 
>   TRUNCATE
>     reduces MAIN_FORKNUM from 100 blocks to 0 blocks
>     reduces FSM_FORKNUM from 3 blocks to 0 blocks
>   COPY
>     raises MAIN_FORKNUM from 0 blocks to 110 blocks
>     does not change FSM_FORKNUM
>   COMMIT
>     should fsync, but wrongly chooses log_newpage_range() approach
> 
> If that's indeed a problem, beside the obvious option of tracking every fork's
> max_truncated, we could convert max_truncated to a bool and use fsync anytime
> the relation experienced an mdtruncate().  (While FSM_FORKNUM is not critical
> for database operations, the choice to subject it to checksums entails
> protecting it here.)  If that's not a problem, would you explain?

That causes page-load failure since FSM can offer a nonexistent heap
block, which failure leads to ERROR of an SQL statement. It's not
critical but surely a problem. I'd like to take the bool option
because insert-truncate sequence is rarely happen. That case is not
our main target of the optimization so it is enough for us to make
sure that the operation doesn't lead to such errors.

The attached is nm30 patch followed by the three fix patches for the
three defects. The new member "RelationData.isremoved" is renamed to
"isdropped" in this version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 3a811c76874ae3c596e138369766ad00888c572c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v32 1/4] Rework WAL-skipping optimization

While wal_level=minimal we omit WAL-logging for certain some
operations on relfilenodes that are created in the current
transaction. The files are fsynced at commit. The machinery
accelerates bulk-insertion operations but it fails in certain sequence
of operations and a crash just after commit may leave broken table
files.

This patch overhauls the machinery so that WAL-loggings on all
operations are omitted for such relfilenodes. This patch also
introduces a new feature that small files are emitted as a WAL record
instead of syncing. The new GUC variable wal_skip_threshold controls
the threshold.
---
 doc/src/sgml/config.sgml                    |  43 ++-
 doc/src/sgml/perform.sgml                   |  47 +--
 src/backend/access/common/toast_internals.c |   4 +-
 src/backend/access/gist/gistutil.c          |  31 +-
 src/backend/access/gist/gistxlog.c          |  21 ++
 src/backend/access/heap/heapam.c            |  45 +--
 src/backend/access/heap/heapam_handler.c    |  22 +-
 src/backend/access/heap/rewriteheap.c       |  21 +-
 src/backend/access/nbtree/nbtsort.c         |  41 +--
 src/backend/access/rmgrdesc/gistdesc.c      |   5 +
 src/backend/access/transam/README           |  45 ++-
 src/backend/access/transam/xact.c           |  15 +
 src/backend/access/transam/xloginsert.c     |  10 +-
 src/backend/access/transam/xlogutils.c      |  18 +-
 src/backend/catalog/heap.c                  |   4 +
 src/backend/catalog/storage.c               | 248 ++++++++++++-
 src/backend/commands/cluster.c              |  12 +-
 src/backend/commands/copy.c                 |  58 +--
 src/backend/commands/createas.c             |  11 +-
 src/backend/commands/matview.c              |  12 +-
 src/backend/commands/tablecmds.c            |  11 +-
 src/backend/storage/buffer/bufmgr.c         | 125 ++++++-
 src/backend/storage/lmgr/lock.c             |  12 +
 src/backend/storage/smgr/md.c               |  36 +-
 src/backend/storage/smgr/smgr.c             |  35 ++
 src/backend/utils/cache/relcache.c          | 159 +++++++--
 src/backend/utils/misc/guc.c                |  13 +
 src/include/access/gist_private.h           |   2 +
 src/include/access/gistxlog.h               |   1 +
 src/include/access/heapam.h                 |   3 -
 src/include/access/rewriteheap.h            |   2 +-
 src/include/access/tableam.h                |  15 +-
 src/include/catalog/storage.h               |   6 +
 src/include/storage/bufmgr.h                |   4 +
 src/include/storage/lock.h                  |   3 +
 src/include/storage/smgr.h                  |   1 +
 src/include/utils/rel.h                     |  57 ++-
 src/include/utils/relcache.h                |   8 +-
 src/test/recovery/t/018_wal_optimize.pl     | 374 ++++++++++++++++++++
 src/test/regress/expected/alter_table.out   |   6 +
 src/test/regress/sql/alter_table.sql        |   7 +
 41 files changed, 1242 insertions(+), 351 deletions(-)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d45b6f7cb..0e7a0bc0ee 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2481,21 +2481,14 @@ include_dir 'conf.d'
         levels.  This parameter can only be set at server start.
        </para>
        <para>
-        In <literal>minimal</literal> level, WAL-logging of some bulk
-        operations can be safely skipped, which can make those
-        operations much faster (see <xref linkend="populate-pitr"/>).
-        Operations in which this optimization can be applied include:
-        <simplelist>
-         <member><command>CREATE TABLE AS</command></member>
-         <member><command>CREATE INDEX</command></member>
-         <member><command>CLUSTER</command></member>
-         <member><command>COPY</command> into tables that were created or truncated in the same
-         transaction</member>
-        </simplelist>
-        But minimal WAL does not contain enough information to reconstruct the
-        data from a base backup and the WAL logs, so <literal>replica</literal> or
-        higher must be used to enable WAL archiving
-        (<xref linkend="guc-archive-mode"/>) and streaming replication.
+        In <literal>minimal</literal> level, no information is logged for
+        tables or indexes for the remainder of a transaction that creates or
+        truncates them.  This can make bulk operations much faster (see
+        <xref linkend="populate-pitr"/>).  But minimal WAL does not contain
+        enough information to reconstruct the data from a base backup and the
+        WAL logs, so <literal>replica</literal> or higher must be used to
+        enable WAL archiving (<xref linkend="guc-archive-mode"/>) and
+        streaming replication.
        </para>
        <para>
         In <literal>logical</literal> level, the same information is logged as
@@ -2887,6 +2880,26 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-skip-threshold" xreflabel="wal_skip_threshold">
+      <term><varname>wal_skip_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_skip_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When <varname>wal_level</varname> is <literal>minimal</literal> and a
+        transaction commits after creating or rewriting a permanent table,
+        materialized view, or index, this setting determines how to persist
+        the new data.  If the data is smaller than this setting, write it to
+        the WAL log; otherwise, use an fsync of the data file.  Depending on
+        the properties of your storage, raising or lowering this value might
+        help if such commits are slowing concurrent transactions.  The default
+        is two megabytes (<literal>2MB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-commit-delay" xreflabel="commit_delay">
       <term><varname>commit_delay</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 0f61b0995d..12fda690fa 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1606,8 +1606,8 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
     needs to be written, because in case of an error, the files
     containing the newly loaded data will be removed anyway.
     However, this consideration only applies when
-    <xref linkend="guc-wal-level"/> is <literal>minimal</literal> for
-    non-partitioned tables as all commands must write WAL otherwise.
+    <xref linkend="guc-wal-level"/> is <literal>minimal</literal>
+    as all commands must write WAL otherwise.
    </para>
 
   </sect2>
@@ -1707,42 +1707,13 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
    </para>
 
    <para>
-    Aside from avoiding the time for the archiver or WAL sender to
-    process the WAL data,
-    doing this will actually make certain commands faster, because they
-    are designed not to write WAL at all if <varname>wal_level</varname>
-    is <literal>minimal</literal>.  (They can guarantee crash safety more cheaply
-    by doing an <function>fsync</function> at the end than by writing WAL.)
-    This applies to the following commands:
-    <itemizedlist>
-     <listitem>
-      <para>
-       <command>CREATE TABLE AS SELECT</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>CREATE INDEX</command> (and variants such as
-       <command>ALTER TABLE ADD PRIMARY KEY</command>)
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>ALTER TABLE SET TABLESPACE</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>CLUSTER</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>COPY FROM</command>, when the target table has been
-       created or truncated earlier in the same transaction
-      </para>
-     </listitem>
-    </itemizedlist>
+    Aside from avoiding the time for the archiver or WAL sender to process the
+    WAL data, doing this will actually make certain commands faster, because
+    they do not to write WAL at all if <varname>wal_level</varname>
+    is <literal>minimal</literal> and the current subtransaction (or top-level
+    transaction) created or truncated the table or index they change.  (They
+    can guarantee crash safety more cheaply by doing
+    an <function>fsync</function> at the end than by writing WAL.)
    </para>
   </sect2>
 
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 65801a2a84..25a81e5ec6 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -528,8 +528,8 @@ toast_get_valid_index(Oid toastoid, LOCKMODE lock)
     validIndexOid = RelationGetRelid(toastidxs[validIndex]);
 
     /* Close the toast relation and all its indexes */
-    toast_close_indexes(toastidxs, num_indexes, lock);
-    table_close(toastrel, lock);
+    toast_close_indexes(toastidxs, num_indexes, NoLock);
+    table_close(toastrel, NoLock);
 
     return validIndexOid;
 }
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index dd975b164c..765329bbcd 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1004,23 +1004,44 @@ gistproperty(Oid index_oid, int attno,
 }
 
 /*
- * Temporary and unlogged GiST indexes are not WAL-logged, but we need LSNs
- * to detect concurrent page splits anyway. This function provides a fake
- * sequence of LSNs for that purpose.
+ * Some indexes are not WAL-logged, but we need LSNs to detect concurrent page
+ * splits anyway. This function provides a fake sequence of LSNs for that
+ * purpose.
  */
 XLogRecPtr
 gistGetFakeLSN(Relation rel)
 {
-    static XLogRecPtr counter = FirstNormalUnloggedLSN;
-
     if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
     {
         /*
          * Temporary relations are only accessible in our session, so a simple
          * backend-local counter will do.
          */
+        static XLogRecPtr counter = FirstNormalUnloggedLSN;
+
         return counter++;
     }
+    else if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+    {
+        /*
+         * WAL-logging on this relation will start after commit, so its LSNs
+         * must be distinct numbers smaller than the LSN at the next commit.
+         * Emit a dummy WAL record if insert-LSN hasn't advanced after the
+         * last call.
+         */
+        static XLogRecPtr lastlsn = InvalidXLogRecPtr;
+        XLogRecPtr    currlsn = GetXLogInsertRecPtr();
+
+        /* Shouldn't be called for WAL-logging relations */
+        Assert(!RelationNeedsWAL(rel));
+
+        /* No need for an actual record if we already have a distinct LSN */
+        if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
+            currlsn = gistXLogAssignLSN();
+
+        lastlsn = currlsn;
+        return currlsn;
+    }
     else
     {
         /*
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index d3f3a7b803..b60dba052f 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -449,6 +449,9 @@ gist_redo(XLogReaderState *record)
         case XLOG_GIST_PAGE_DELETE:
             gistRedoPageDelete(record);
             break;
+        case XLOG_GIST_ASSIGN_LSN:
+            /* nop. See gistGetFakeLSN(). */
+            break;
         default:
             elog(PANIC, "gist_redo: unknown op code %u", info);
     }
@@ -592,6 +595,24 @@ gistXLogPageDelete(Buffer buffer, FullTransactionId xid,
     return recptr;
 }
 
+/*
+ * Write an empty XLOG record to assign a distinct LSN.
+ */
+XLogRecPtr
+gistXLogAssignLSN(void)
+{
+    int            dummy = 0;
+
+    /*
+     * Records other than SWITCH_WAL must have content. We use an integer 0 to
+     * follow the restriction.
+     */
+    XLogBeginInsert();
+    XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
+    XLogRegisterData((char *) &dummy, sizeof(dummy));
+    return XLogInsert(RM_GIST_ID, XLOG_GIST_ASSIGN_LSN);
+}
+
 /*
  * Write XLOG record about reuse of a deleted page.
  */
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5ddb6e85e9..92d9da23f7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -21,7 +21,6 @@
  *        heap_multi_insert - insert multiple tuples into a relation
  *        heap_delete        - delete a tuple from a relation
  *        heap_update        - replace a tuple in a relation with another tuple
- *        heap_sync        - sync heap, for when no WAL has been written
  *
  * NOTES
  *      This file contains the heap_ routines which implement
@@ -1936,7 +1935,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (RelationNeedsWAL(relation))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2119,7 +2118,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+    needwal = RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -8920,46 +8919,6 @@ heap2_redo(XLogReaderState *record)
     }
 }
 
-/*
- *    heap_sync        - sync a heap, for use when no WAL has been written
- *
- * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
- *
- * Indexes are not touched.  (Currently, index operations associated with
- * the commands that use this are WAL-logged and so do not need fsync.
- * That behavior might change someday, but in any case it's likely that
- * any fsync decisions required would be per-index and hence not appropriate
- * to be done here.)
- */
-void
-heap_sync(Relation rel)
-{
-    /* non-WAL-logged tables never need fsync */
-    if (!RelationNeedsWAL(rel))
-        return;
-
-    /* main heap */
-    FlushRelationBuffers(rel);
-    /* FlushRelationBuffers will have opened rd_smgr */
-    smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
-
-    /* FSM is not critical, don't bother syncing it */
-
-    /* toast heap, if any */
-    if (OidIsValid(rel->rd_rel->reltoastrelid))
-    {
-        Relation    toastrel;
-
-        toastrel = table_open(rel->rd_rel->reltoastrelid, AccessShareLock);
-        FlushRelationBuffers(toastrel);
-        smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM);
-        table_close(toastrel, AccessShareLock);
-    }
-}
-
 /*
  * Mask a heap page before performing consistency checks on it.
  */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 1f6f6d0ea9..14f939d6b1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -555,17 +555,6 @@ tuple_lock_retry:
     return result;
 }
 
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
-    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway / don't go through tableam)
-     */
-    if (options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(relation);
-}
-
 
 /* ------------------------------------------------------------------------
  * DDL related callbacks for heap AM.
@@ -698,7 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     IndexScanDesc indexScan;
     TableScanDesc tableScan;
     HeapScanDesc heapScan;
-    bool        use_wal;
     bool        is_system_catalog;
     Tuplesortstate *tuplesort;
     TupleDesc    oldTupDesc = RelationGetDescr(OldHeap);
@@ -713,12 +701,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     is_system_catalog = IsSystemRelation(OldHeap);
 
     /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a WAL-logged rel.
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
      */
-    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
-    /* use_wal off requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
 
     /* Preallocate values/isnull arrays */
@@ -728,7 +713,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
     /* Initialize the rewrite operation */
     rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
-                                 *multi_cutoff, use_wal);
+                                 *multi_cutoff);
 
 
     /* Set up sorting if wanted */
@@ -2524,7 +2509,6 @@ static const TableAmRoutine heapam_methods = {
     .tuple_delete = heapam_tuple_delete,
     .tuple_update = heapam_tuple_update,
     .tuple_lock = heapam_tuple_lock,
-    .finish_bulk_insert = heapam_finish_bulk_insert,
 
     .tuple_fetch_row_version = heapam_fetch_row_version,
     .tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 5869922ff8..ba4dab2ba6 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -136,7 +136,6 @@ typedef struct RewriteStateData
     Page        rs_buffer;        /* page currently being built */
     BlockNumber rs_blockno;        /* block where page will go */
     bool        rs_buffer_valid;    /* T if any tuples in buffer */
-    bool        rs_use_wal;        /* must we WAL-log inserts? */
     bool        rs_logical_rewrite; /* do we need to do logical rewriting */
     TransactionId rs_oldest_xmin;    /* oldest xmin used by caller to determine
                                      * tuple visibility */
@@ -230,15 +229,13 @@ static void logical_end_heap_rewrite(RewriteState state);
  * oldest_xmin    xid used by the caller to determine which tuples are dead
  * freeze_xid    xid before which tuples will be frozen
  * cutoff_multi    multixact before which multis will be removed
- * use_wal        should the inserts to the new heap be WAL-logged?
  *
  * Returns an opaque RewriteState, allocated in current memory context,
  * to be used in subsequent calls to the other functions.
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-                   TransactionId freeze_xid, MultiXactId cutoff_multi,
-                   bool use_wal)
+                   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
     RewriteState state;
     MemoryContext rw_cxt;
@@ -263,7 +260,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     /* new_heap needn't be empty, just locked */
     state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
     state->rs_buffer_valid = false;
-    state->rs_use_wal = use_wal;
     state->rs_oldest_xmin = oldest_xmin;
     state->rs_freeze_xid = freeze_xid;
     state->rs_cutoff_multi = cutoff_multi;
@@ -322,7 +318,7 @@ end_heap_rewrite(RewriteState state)
     /* Write the last page, if any */
     if (state->rs_buffer_valid)
     {
-        if (state->rs_use_wal)
+        if (RelationNeedsWAL(state->rs_new_rel))
             log_newpage(&state->rs_new_rel->rd_node,
                         MAIN_FORKNUM,
                         state->rs_blockno,
@@ -337,18 +333,14 @@ end_heap_rewrite(RewriteState state)
     }
 
     /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.
-     *
-     * It's obvious that we must do this when not WAL-logging. It's less
-     * obvious that we have to do it even if we did WAL-log the pages. The
+     * When we WAL-logged rel pages, we must nonetheless fsync them.  The
      * reason is the same as in storage.c's RelationCopyStorage(): we're
      * writing data that's not in shared buffers, and so a CHECKPOINT
      * occurring during the rewriteheap operation won't have fsync'd data we
      * wrote before the checkpoint.
      */
     if (RelationNeedsWAL(state->rs_new_rel))
-        heap_sync(state->rs_new_rel);
+        smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
 
     logical_end_heap_rewrite(state);
 
@@ -646,9 +638,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int            options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
@@ -687,7 +676,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
             /* Doesn't fit, so write out the existing page */
 
             /* XLOG stuff */
-            if (state->rs_use_wal)
+            if (RelationNeedsWAL(state->rs_new_rel))
                 log_newpage(&state->rs_new_rel->rd_node,
                             MAIN_FORKNUM,
                             state->rs_blockno,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f163491d60..77f03ad4fe 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -31,18 +31,6 @@
  * them.  They will need to be re-read into shared buffers on first use after
  * the build finishes.
  *
- * Since the index will never be used unless it is completely built,
- * from a crash-recovery point of view there is no need to WAL-log the
- * steps of the build.  After completing the index build, we can just sync
- * the whole file to disk using smgrimmedsync() before exiting this module.
- * This can be seen to be sufficient for crash recovery by considering that
- * it's effectively equivalent to what would happen if a CHECKPOINT occurred
- * just after the index build.  However, it is clearly not sufficient if the
- * DBA is using the WAL log for PITR or replication purposes, since another
- * machine would not be able to reconstruct the index from WAL.  Therefore,
- * we log the completed index pages to WAL if and only if WAL archiving is
- * active.
- *
  * This code isn't concerned about the FSM at all. The caller is responsible
  * for initializing that.
  *
@@ -563,12 +551,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
     wstate.heap = btspool->heap;
     wstate.index = btspool->index;
     wstate.inskey = _bt_mkscankey(wstate.index, NULL);
-
-    /*
-     * We need to log index creation in WAL iff WAL archiving/streaming is
-     * enabled UNLESS the index isn't WAL-logged anyway.
-     */
-    wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
+    wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
 
     /* reserve the metapage */
     wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
@@ -1265,21 +1248,15 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
     _bt_uppershutdown(wstate, state);
 
     /*
-     * If the index is WAL-logged, we must fsync it down to disk before it's
-     * safe to commit the transaction.  (For a non-WAL-logged index we don't
-     * care since the index will be uninteresting after a crash anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the build. It's
-     * less obvious that we have to do it even if we did WAL-log the index
-     * pages.  The reason is that since we're building outside shared buffers,
-     * a CHECKPOINT occurring during the build has no way to flush the
-     * previously written data to disk (indeed it won't know the index even
-     * exists).  A crash later on would replay WAL from the checkpoint,
-     * therefore it wouldn't replay our earlier WAL entries. If we do not
-     * fsync those pages here, they might still not be on disk when the crash
-     * occurs.
+     * When we WAL-logged index pages, we must nonetheless fsync index files.
+     * Since we're building outside shared buffers, a CHECKPOINT occurring
+     * during the build has no way to flush the previously written data to
+     * disk (indeed it won't know the index even exists).  A crash later on
+     * would replay WAL from the checkpoint, therefore it wouldn't replay our
+     * earlier WAL entries. If we do not fsync those pages here, they might
+     * still not be on disk when the crash occurs.
      */
-    if (RelationNeedsWAL(wstate->index))
+    if (wstate->btws_use_wal)
     {
         RelationOpenSmgr(wstate->index);
         smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index 3377367e12..cfcc8885ea 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -80,6 +80,9 @@ gist_desc(StringInfo buf, XLogReaderState *record)
         case XLOG_GIST_PAGE_DELETE:
             out_gistxlogPageDelete(buf, (gistxlogPageDelete *) rec);
             break;
+        case XLOG_GIST_ASSIGN_LSN:
+            /* No details to write out */
+            break;
     }
 }
 
@@ -104,6 +107,8 @@ gist_identify(uint8 info)
             break;
         case XLOG_GIST_PAGE_DELETE:
             id = "PAGE_DELETE";
+        case XLOG_GIST_ASSIGN_LSN:
+            id = "ASSIGN_LSN";
             break;
     }
 
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index b5a2cb2de8..eb9aac5fd3 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -717,6 +717,38 @@ then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
 
+Skipping WAL for New RelFileNode
+--------------------------------
+
+Under wal_level=minimal, if a change modifies a relfilenode that ROLLBACK
+would unlink, in-tree access methods write no WAL for that change.  Code that
+writes WAL without calling RelationNeedsWAL() must check for this case.  This
+skipping is mandatory.  If a WAL-writing change preceded a WAL-skipping change
+for the same block, REDO could overwrite the WAL-skipping change.  If a
+WAL-writing change followed a WAL-skipping change for the same block, a
+related problem would arise.  When a WAL record contains no full-page image,
+REDO expects the page to match its contents from just before record insertion.
+A WAL-skipping change may not reach disk at all, violating REDO's expectation
+under full_page_writes=off.  For any access method, CommitTransaction() writes
+and fsyncs affected blocks before recording the commit.
+
+Prefer to do the same in future access methods.  However, two other approaches
+can work.  First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync().  Second, an access method can opt to write WAL
+unconditionally for permanent relations.  Under these approaches, the access
+method callbacks must not call functions that react to RelationNeedsWAL().
+
+This applies only to WAL records whose replay would modify bytes stored in the
+new relfilenode.  It does not apply to other records about the relfilenode,
+such as XLOG_SMGR_CREATE.  Because it operates at the level of individual
+relfilenodes, RelationNeedsWAL() can differ for tightly-coupled relations.
+Consider "CREATE TABLE t (); BEGIN; ALTER TABLE t ADD c text; ..." in which
+ALTER TABLE adds a TOAST relation.  The TOAST relation will skip WAL, while
+the table owning it will not.  ALTER TABLE SET TABLESPACE will cause a table
+to skip WAL, but that won't affect its indexes.
+
+
 Asynchronous Commit
 -------------------
 
@@ -820,13 +852,12 @@ Changes to a temp table are not WAL-logged, hence could reach disk in
 advance of T1's commit, but we don't care since temp table contents don't
 survive crashes anyway.
 
-Database writes made via any of the paths we have introduced to avoid WAL
-overhead for bulk updates are also safe.  In these cases it's entirely
-possible for the data to reach disk before T1's commit, because T1 will
-fsync it down to disk without any sort of interlock, as soon as it finishes
-the bulk update.  However, all these paths are designed to write data that
-no other transaction can see until after T1 commits.  The situation is thus
-not different from ordinary WAL-logged updates.
+Database writes that skip WAL for new relfilenodes are also safe.  In these
+cases it's entirely possible for the data to reach disk before T1's commit,
+because T1 will fsync it down to disk without any sort of interlock.  However,
+all these paths are designed to write data that no other transaction can see
+until after T1 commits.  The situation is thus not different from ordinary
+WAL-logged updates.
 
 Transaction Emulation during Recovery
 -------------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 017f03b6d8..118f9d521c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2109,6 +2109,13 @@ CommitTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Synchronize files that are created and not WAL-logged during this
+     * transaction. This must happen before AtEOXact_RelationMap(), so that we
+     * don't see committed-but-broken files after a crash.
+     */
+    smgrDoPendingSyncs(true);
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
@@ -2342,6 +2349,13 @@ PrepareTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Synchronize files that are created and not WAL-logged during this
+     * transaction. This must happen before EndPrepare(), so that we don't see
+     * committed-but-broken files after a crash and COMMIT PREPARED.
+     */
+    smgrDoPendingSyncs(true);
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
@@ -2660,6 +2674,7 @@ AbortTransaction(void)
      */
     AfterTriggerEndXact(false); /* 'false' means it's abort */
     AtAbort_Portals();
+    smgrDoPendingSyncs(false);
     AtEOXact_LargeObject(false);
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2fa0a7f667..a618dec776 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1043,8 +1043,13 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
                   BlockNumber startblk, BlockNumber endblk,
                   bool page_std)
 {
+    int            flags;
     BlockNumber blkno;
 
+    flags = REGBUF_FORCE_IMAGE;
+    if (page_std)
+        flags |= REGBUF_STANDARD;
+
     /*
      * Iterate over all the pages in the range. They are collected into
      * batches of XLR_MAX_BLOCK_ID pages, and a single WAL-record is written
@@ -1066,7 +1071,8 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
         nbufs = 0;
         while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
         {
-            Buffer        buf = ReadBuffer(rel, blkno);
+            Buffer        buf = ReadBufferExtended(rel, forkNum, blkno,
+                                                 RBM_NORMAL, NULL);
 
             LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -1088,7 +1094,7 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
         START_CRIT_SECTION();
         for (i = 0; i < nbufs; i++)
         {
-            XLogRegisterBuffer(i, bufpack[i], REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+            XLogRegisterBuffer(i, bufpack[i], flags);
             MarkBufferDirty(bufpack[i]);
         }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b55c383370..2bbce46041 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -544,6 +544,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry;
  * fields related to physical storage, like rd_rel, are initialized, so the
  * fake entry is only usable in low-level operations like ReadBuffer().
  *
+ * This is also used for syncing WAL-skipped files.
+ *
  * Caller must free the returned entry with FreeFakeRelcacheEntry().
  */
 Relation
@@ -552,18 +554,20 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
     FakeRelCacheEntry fakeentry;
     Relation    rel;
 
-    Assert(InRecovery);
-
     /* Allocate the Relation struct and all related space in one block. */
     fakeentry = palloc0(sizeof(FakeRelCacheEntryData));
     rel = (Relation) fakeentry;
 
     rel->rd_rel = &fakeentry->pgc;
     rel->rd_node = rnode;
-    /* We will never be working with temp rels during recovery */
+
+    /*
+     * We will never be working with temp rels during recovery or while
+     * syncing WAL-skipped files.
+     */
     rel->rd_backend = InvalidBackendId;
 
-    /* It must be a permanent table if we're in recovery. */
+    /* It must be a permanent table here */
     rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT;
 
     /* We don't know the name of the relation; use relfilenode instead */
@@ -572,9 +576,9 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
     /*
      * We set up the lockRelId in case anything tries to lock the dummy
      * relation.  Note that this is fairly bogus since relNode may be
-     * different from the relation's OID.  It shouldn't really matter though,
-     * since we are presumably running by ourselves and can't have any lock
-     * conflicts ...
+     * different from the relation's OID.  It shouldn't really matter though.
+     * In recovery, we are running by ourselves and can't have any lock
+     * conflicts.  While syncing, we already hold AccessExclusiveLock.
      */
     rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
     rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 0fdff2918f..9f58ef1378 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -439,6 +439,10 @@ heap_create(const char *relname,
                 break;
         }
     }
+    else
+    {
+        rel->rd_createSubid = InvalidSubTransactionId;
+    }
 
     return rel;
 }
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index fddfbf1d8c..8253c420ef 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,9 +29,13 @@
 #include "miscadmin.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+/* GUC variables */
+int            wal_skip_threshold = 2048;    /* in kilobytes */
+
 /*
  * We keep a list of all relations (represented as RelFileNode values)
  * that have been created or deleted in the current transaction.  When
@@ -61,7 +65,14 @@ typedef struct PendingRelDelete
     struct PendingRelDelete *next;    /* linked-list link */
 } PendingRelDelete;
 
+typedef struct pendingSync
+{
+    RelFileNode rnode;
+    BlockNumber max_truncated;
+} pendingSync;
+
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+HTAB       *pendingSyncHash = NULL;
 
 /*
  * RelationCreateStorage
@@ -117,6 +128,35 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
+    /*
+     * If the relation needs at-commit sync, we also need to track the maximum
+     * unsynced truncated block; see smgrDoPendingSyncs().
+     */
+    if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+    {
+        pendingSync *pending;
+        bool        found;
+
+        /* we sync only permanent relations */
+        Assert(backend == InvalidBackendId);
+
+        if (!pendingSyncHash)
+        {
+            HASHCTL        ctl;
+
+            ctl.keysize = sizeof(RelFileNode);
+            ctl.entrysize = sizeof(pendingSync);
+            ctl.hcxt = TopTransactionContext;
+            pendingSyncHash =
+                hash_create("max truncated block hash",
+                            16, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+        }
+
+        pending = hash_search(pendingSyncHash, &rnode, HASH_ENTER, &found);
+        Assert(!found);
+        pending->max_truncated = 0;
+    }
+
     return srel;
 }
 
@@ -216,6 +256,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
             prev = pending;
         }
     }
+
+    /* FIXME what to do about pending syncs? */
 }
 
 /*
@@ -275,6 +317,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
         }
     }
 
+    RelationPreTruncate(rel);
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -325,6 +369,34 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
         FreeSpaceMapVacuumRange(rel, nblocks, InvalidBlockNumber);
 }
 
+/*
+ * RelationPreTruncate
+ *        Perform AM-independent work before a physical truncation.
+ *
+ * If an access method's relation_nontransactional_truncate does not call
+ * RelationTruncate(), it must call this before decreasing the table size.
+ */
+void
+RelationPreTruncate(Relation rel)
+{
+    pendingSync *pending;
+
+    if (!pendingSyncHash)
+        return;
+    RelationOpenSmgr(rel);
+
+    /* Record largest maybe-unsynced block of files under tracking  */
+    pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
+                          HASH_FIND, NULL);
+    if (pending)
+    {
+        BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
+
+        if (pending->max_truncated < nblocks)
+            pending->max_truncated = nblocks;
+    }
+}
+
 /*
  * Copy a fork's data, block by block.
  *
@@ -355,7 +427,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 
     /*
      * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a permanent relation.
+     * enabled AND it's a permanent relation.  This gives the same answer as
+     * "RelationNeedsWAL(rel) || copying_initfork", because we know the
+     * current operation created a new relfilenode.
      */
     use_wal = XLogIsNeeded() &&
         (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
@@ -397,24 +471,39 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
     }
 
     /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.  (For a temp or
-     * unlogged rel we don't care since the data will be gone after a crash
-     * anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the copy. It's
-     * less obvious that we have to do it even if we did WAL-log the copied
-     * pages. The reason is that since we're copying outside shared buffers, a
-     * CHECKPOINT occurring during the copy has no way to flush the previously
-     * written data to disk (indeed it won't know the new rel even exists).  A
-     * crash later on would replay WAL from the checkpoint, therefore it
-     * wouldn't replay our earlier WAL entries. If we do not fsync those pages
-     * here, they might still not be on disk when the crash occurs.
+     * When we WAL-logged rel pages, we must nonetheless fsync them.  The
+     * reason is that since we're copying outside shared buffers, a CHECKPOINT
+     * occurring during the copy has no way to flush the previously written
+     * data to disk (indeed it won't know the new rel even exists).  A crash
+     * later on would replay WAL from the checkpoint, therefore it wouldn't
+     * replay our earlier WAL entries. If we do not fsync those pages here,
+     * they might still not be on disk when the crash occurs.
      */
-    if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
+    if (use_wal || copying_initfork)
         smgrimmedsync(dst, forkNum);
 }
 
+/*
+ * RelFileNodeSkippingWAL - check if a BM_PERMANENT relfilenode is using WAL
+ *
+ *   Changes of certain relfilenodes must not write WAL; see "Skipping WAL for
+ *   New RelFileNode" in src/backend/access/transam/README.  Though it is
+ *   known from Relation efficiently, this function is intended for the code
+ *   paths not having access to Relation.
+ */
+bool
+RelFileNodeSkippingWAL(RelFileNode rnode)
+{
+    if (XLogIsNeeded())
+        return false;            /* no permanent relfilenode skips WAL */
+
+    if (!pendingSyncHash ||
+        hash_search(pendingSyncHash, &rnode, HASH_FIND, NULL) == NULL)
+        return false;
+
+    return true;
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -492,6 +581,135 @@ smgrDoPendingDeletes(bool isCommit)
     }
 }
 
+/*
+ *    smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    PendingRelDelete *pending;
+    int            nrels = 0,
+                maxrels = 0;
+    SMgrRelation *srels = NULL;
+    HASH_SEQ_STATUS scan;
+    pendingSync *pendingsync;
+
+    if (XLogIsNeeded())
+        return;                    /* no relation can use this */
+
+    Assert(GetCurrentTransactionNestLevel() == 1);
+
+    if (!pendingSyncHash)
+        return;                    /* no relation needs sync */
+
+    /* Just throw away all pending syncs if any at rollback */
+    if (!isCommit)
+    {
+        pendingSyncHash = NULL;
+        return;
+    }
+
+    AssertPendingSyncs_RelationCache();
+
+    /* Skip syncing nodes that smgrDoPendingDeletes() will delete. */
+    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    {
+        if (!pending->atCommit)
+            continue;
+
+        (void) hash_search(pendingSyncHash, (void *) &pending->relnode,
+                           HASH_REMOVE, NULL);
+    }
+
+    hash_seq_init(&scan, pendingSyncHash);
+    while ((pendingsync = (pendingSync *) hash_seq_search(&scan)))
+    {
+        ForkNumber    fork;
+        BlockNumber nblocks[MAX_FORKNUM + 1];
+        BlockNumber total_blocks = 0;
+        SMgrRelation srel;
+
+        srel = smgropen(pendingsync->rnode, InvalidBackendId);
+
+        /*
+         * We emit newpage WAL records for smaller relations.
+         *
+         * Small WAL records have a chance to be emitted along with other
+         * backends' WAL records.  We emit WAL records instead of syncing for
+         * files that are smaller than a certain threshold, expecting faster
+         * commit.  The threshold is defined by the GUC wal_skip_threshold.
+         */
+        for (fork = 0; fork <= MAX_FORKNUM; fork++)
+        {
+            if (smgrexists(srel, fork))
+            {
+                BlockNumber n = smgrnblocks(srel, fork);
+
+                /* we shouldn't come here for unlogged relations */
+                Assert(fork != INIT_FORKNUM);
+
+                nblocks[fork] = n;
+                total_blocks += n;
+            }
+            else
+                nblocks[fork] = InvalidBlockNumber;
+        }
+
+        /*
+         * Sync file or emit WAL records for its contents.  Do file sync if
+         * the size is larger than the threshold or truncates may have removed
+         * blocks beyond the current size.
+         */
+        if (total_blocks * BLCKSZ / 1024 >= wal_skip_threshold ||
+            nblocks[MAIN_FORKNUM] < pendingsync->max_truncated)
+        {
+            /* allocate the initial array, or extend it, if needed */
+            if (maxrels == 0)
+            {
+                maxrels = 8;
+                srels = palloc(sizeof(SMgrRelation) * maxrels);
+            }
+            else if (maxrels <= nrels)
+            {
+                maxrels *= 2;
+                srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+            }
+
+            srels[nrels++] = srel;
+        }
+        else
+        {
+            /* Emit WAL records for all blocks.  The file is small enough. */
+            for (fork = 0; fork <= MAX_FORKNUM; fork++)
+            {
+                int            n = nblocks[fork];
+                Relation    rel;
+
+                if (!BlockNumberIsValid(n))
+                    continue;
+
+                /*
+                 * Emit WAL for the whole file.  Unfortunately we don't know
+                 * what kind of a page this is, so we have to log the full
+                 * page including any unused space.  ReadBufferExtended()
+                 * counts some pgstat events; unfortunately, we discard them.
+                 */
+                rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node);
+                log_newpage_range(rel, fork, 0, n, false);
+                FreeFakeRelcacheEntry(rel);
+            }
+        }
+    }
+
+    pendingSyncHash = NULL;
+
+    if (nrels > 0)
+    {
+        smgrdosyncall(srels, nrels);
+        pfree(srels);
+    }
+}
+
 /*
  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
  *
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index e9d7a7ff79..b836ccf2d6 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1014,6 +1014,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
                 relfilenode2;
     Oid            swaptemp;
     char        swptmpchr;
+    Relation    rel1;
 
     /* We need writable copies of both pg_class tuples. */
     relRelation = table_open(RelationRelationId, RowExclusiveLock);
@@ -1173,6 +1174,15 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
         CacheInvalidateRelcacheByTuple(reltup2);
     }
 
+    /*
+     * Recognize that rel1's relfilenode (swapped from rel2) is new in this
+     * subtransaction. Since the next step for rel2 is deletion, don't bother
+     * recording the newness of its relfilenode.
+     */
+    rel1 = relation_open(r1, NoLock);
+    RelationAssumeNewRelfilenode(rel1);
+    relation_close(rel1, NoLock);
+
     /*
      * Post alter hook for modified relations. The change to r2 is always
      * internal, but r1 depends on the invocation context.
@@ -1489,7 +1499,7 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 
             /* Get the associated valid index to be renamed */
             toastidx = toast_get_valid_index(newrel->rd_rel->reltoastrelid,
-                                             AccessShareLock);
+                                             NoLock);
 
             /* rename the toast table ... */
             snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u",
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index c93a788798..02e3761da8 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2713,63 +2713,15 @@ CopyFrom(CopyState cstate)
                             RelationGetRelationName(cstate->rel))));
     }
 
-    /*----------
-     * Check to see if we can avoid writing WAL
-     *
-     * If archive logging/streaming is not enabled *and* either
-     *    - table was created in same transaction as this COPY
-     *    - data is being written to relfilenode created in this transaction
-     * then we can skip writing WAL.  It's safe because if the transaction
-     * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the table_finish_bulk_insert() at
-     * the bottom of this routine first.
-     *
-     * As mentioned in comments in utils/rel.h, the in-same-transaction test
-     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
-     * can be cleared before the end of the transaction. The exact case is
-     * when a relation sets a new relfilenode twice in same transaction, yet
-     * the second one fails in an aborted subtransaction, e.g.
-     *
-     * BEGIN;
-     * TRUNCATE t;
-     * SAVEPOINT save;
-     * TRUNCATE t;
-     * ROLLBACK TO save;
-     * COPY ...
-     *
-     * Also, if the target file is new-in-transaction, we assume that checking
-     * FSM for free space is a waste of time, even if we must use WAL because
-     * of archiving.  This could possibly be wrong, but it's unlikely.
-     *
-     * The comments for table_tuple_insert and RelationGetBufferForTuple
-     * specify that skipping WAL logging is only safe if we ensure that our
-     * tuples do not go into pages containing tuples from any other
-     * transactions --- but this must be the case if we have a new table or
-     * new relfilenode, so we need no additional work to enforce that.
-     *
-     * We currently don't support this optimization if the COPY target is a
-     * partitioned table as we currently only lazily initialize partition
-     * information when routing the first tuple to the partition.  We cannot
-     * know at this stage if we can perform this optimization.  It should be
-     * possible to improve on this, but it does mean maintaining heap insert
-     * option flags per partition and setting them when we first open the
-     * partition.
-     *
-     * This optimization is not supported for relation types which do not
-     * have any physical storage, with foreign tables and views using
-     * INSTEAD OF triggers entering in this category.  Partitioned tables
-     * are not supported as per the description above.
-     *----------
+    /*
+     * If the target file is new-in-transaction, we assume that checking FSM
+     * for free space is a waste of time.  This could possibly be wrong, but
+     * it's unlikely.
      */
-    /* createSubid is creation check, newRelfilenodeSubid is truncation check */
     if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
         (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
-         cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
-    {
+         cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
         ti_options |= TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
-    }
 
     /*
      * Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 9f387b5f5f..fe9a754782 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -552,16 +552,13 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
     myState->rel = intoRelationDesc;
     myState->reladdr = intoRelationAddr;
     myState->output_cid = GetCurrentCommandId(true);
+    myState->ti_options = TABLE_INSERT_SKIP_FSM;
+    myState->bistate = GetBulkInsertState();
 
     /*
-     * We can skip WAL-logging the insertions, unless PITR or streaming
-     * replication is in use. We can skip the FSM in any case.
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
-    myState->bistate = GetBulkInsertState();
-
-    /* Not using WAL requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(intoRelationDesc) == InvalidBlockNumber);
 }
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 1ee37c1aeb..ea1d0fc850 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -457,17 +457,13 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->transientrel = transientrel;
     myState->output_cid = GetCurrentCommandId(true);
-
-    /*
-     * We can skip WAL-logging the insertions, unless PITR or streaming
-     * replication is in use. We can skip the FSM in any case.
-     */
     myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
-    if (!XLogIsNeeded())
-        myState->ti_options |= TABLE_INSERT_SKIP_WAL;
     myState->bistate = GetBulkInsertState();
 
-    /* Not using WAL requires smgr_targblock be initially invalid */
+    /*
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
+     */
     Assert(RelationGetTargetBlock(transientrel) == InvalidBlockNumber);
 }
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 2ec3fc5014..0edb474118 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4781,19 +4781,14 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         newrel = NULL;
 
     /*
-     * Prepare a BulkInsertState and options for table_tuple_insert. Because
-     * we're building a new heap, we can skip WAL-logging and fsync it to disk
-     * at the end instead (unless WAL-logging is required for archiving or
-     * streaming replication). The FSM is empty too, so don't bother using it.
+     * Prepare a BulkInsertState and options for table_tuple_insert.  The FSM
+     * is empty, so don't bother using it.
      */
     if (newrel)
     {
         mycid = GetCurrentCommandId(true);
         bistate = GetBulkInsertState();
-
         ti_options = TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
     }
     else
     {
@@ -12621,6 +12616,8 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     table_close(pg_class, RowExclusiveLock);
 
+    RelationAssumeNewRelfilenode(rel);
+
     relation_close(rel, NoLock);
 
     /* Make sure the reltablespace change is visible */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index aba3960481..73c38757fa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -66,7 +66,7 @@
 #define BUF_WRITTEN                0x01
 #define BUF_REUSABLE            0x02
 
-#define DROP_RELS_BSEARCH_THRESHOLD        20
+#define RELS_BSEARCH_THRESHOLD        20
 
 typedef struct PrivateRefCountEntry
 {
@@ -105,6 +105,19 @@ typedef struct CkptTsStatus
     int            index;
 } CkptTsStatus;
 
+/*
+ * Type for array used to sort SMgrRelations
+ *
+ * FlushRelationsAllBuffers shares the same comparator function with
+ * DropRelFileNodesAllBuffers. Pointer to this struct and RelFileNode must be
+ * compatible.
+ */
+typedef struct SMgrSortArray
+{
+    RelFileNode rnode;            /* This must be the first member */
+    SMgrRelation srel;
+} SMgrSortArray;
+
 /* GUC variables */
 bool        zero_damaged_pages = false;
 int            bgwriter_lru_maxpages = 100;
@@ -3043,7 +3056,7 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
      * an exactly determined value, as it depends on many factors (CPU and RAM
      * speeds, amount of shared buffers etc.).
      */
-    use_bsearch = n > DROP_RELS_BSEARCH_THRESHOLD;
+    use_bsearch = n > RELS_BSEARCH_THRESHOLD;
 
     /* sort the list of rnodes if necessary */
     if (use_bsearch)
@@ -3293,6 +3306,104 @@ FlushRelationBuffers(Relation rel)
     }
 }
 
+/* ---------------------------------------------------------------------
+ *        FlushRelationsAllBuffers
+ *
+ *        This function flushes out of the buffer pool all the pages of all
+ *        forks of the specified smgr relations.  It's equivalent to calling
+ *        FlushRelationBuffers once per fork per relation.  The relations are
+ *        assumed not to use local buffers.
+ * --------------------------------------------------------------------
+ */
+void
+FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
+{
+    int            i;
+    SMgrSortArray *srels;
+    bool        use_bsearch;
+
+    if (nrels == 0)
+        return;
+
+    /* fill-in array for qsort */
+    srels = palloc(sizeof(SMgrSortArray) * nrels);
+
+    for (i = 0; i < nrels; i++)
+    {
+        Assert(!RelFileNodeBackendIsTemp(smgrs[i]->smgr_rnode));
+
+        srels[i].rnode = smgrs[i]->smgr_rnode.node;
+        srels[i].srel = smgrs[i];
+    }
+
+    /*
+     * Save the bsearch overhead for low number of relations to sync. See
+     * DropRelFileNodesAllBuffers for details.
+     */
+    use_bsearch = nrels > RELS_BSEARCH_THRESHOLD;
+
+    /* sort the list of SMgrRelations if necessary */
+    if (use_bsearch)
+        pg_qsort(srels, nrels, sizeof(SMgrSortArray), rnode_comparator);
+
+    /* Make sure we can handle the pin inside the loop */
+    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+    for (i = 0; i < NBuffers; i++)
+    {
+        SMgrSortArray *srelent = NULL;
+        BufferDesc *bufHdr = GetBufferDescriptor(i);
+        uint32        buf_state;
+
+        /*
+         * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
+         * and saves some cycles.
+         */
+
+        if (!use_bsearch)
+        {
+            int            j;
+
+            for (j = 0; j < nrels; j++)
+            {
+                if (RelFileNodeEquals(bufHdr->tag.rnode, srels[j].rnode))
+                {
+                    srelent = &srels[j];
+                    break;
+                }
+            }
+
+        }
+        else
+        {
+            srelent = bsearch((const void *) &(bufHdr->tag.rnode),
+                              srels, nrels, sizeof(SMgrSortArray),
+                              rnode_comparator);
+        }
+
+        /* buffer doesn't belong to any of the given relfilenodes; skip it */
+        if (srelent == NULL)
+            continue;
+
+        ReservePrivateRefCountEntry();
+
+        buf_state = LockBufHdr(bufHdr);
+        if (RelFileNodeEquals(bufHdr->tag.rnode, srelent->rnode) &&
+            (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+        {
+            PinBuffer_Locked(bufHdr);
+            LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+            FlushBuffer(bufHdr, srelent->srel);
+            LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+            UnpinBuffer(bufHdr, true);
+        }
+        else
+            UnlockBufHdr(bufHdr, buf_state);
+    }
+
+    pfree(srels);
+}
+
 /* ---------------------------------------------------------------------
  *        FlushDatabaseBuffers
  *
@@ -3494,13 +3605,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
             (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
         {
             /*
-             * If we're in recovery we cannot dirty a page because of a hint.
-             * We can set the hint, just not dirty the page as a result so the
-             * hint is lost when we evict the page or shutdown.
+             * If we must not write WAL, due to a relfilenode-specific
+             * condition or being in recovery, don't dirty the page.  We can
+             * set the hint, just not dirty the page as a result so the hint
+             * is lost when we evict the page or shutdown.
              *
              * See src/backend/storage/page/README for longer discussion.
              */
-            if (RecoveryInProgress())
+            if (RecoveryInProgress() ||
+                RelFileNodeSkippingWAL(bufHdr->tag.rnode))
                 return;
 
             /*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 56dba09299..8f98f665c5 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -587,6 +587,18 @@ LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode)
     return (locallock && locallock->nLocks > 0);
 }
 
+#ifdef USE_ASSERT_CHECKING
+/*
+ * GetLockMethodLocalHash -- return the hash of local locks, for modules that
+ *        evaluate assertions based on all locks held.
+ */
+HTAB *
+GetLockMethodLocalHash(void)
+{
+    return LockMethodLocalHash;
+}
+#endif
+
 /*
  * LockHasWaiters -- look up 'locktag' and check if releasing this
  *        lock would wake up other processes waiting for it.
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 85b7115400..e28c5a49a8 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -248,11 +248,10 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
  * During replay, we would delete the file and then recreate it, which is fine
  * if the contents of the file were repopulated by subsequent WAL entries.
  * But if we didn't WAL-log insertions, but instead relied on fsyncing the
- * file after populating it (as for instance CLUSTER and CREATE INDEX do),
- * the contents of the file would be lost forever.  By leaving the empty file
- * until after the next checkpoint, we prevent reassignment of the relfilenode
- * number until it's safe, because relfilenode assignment skips over any
- * existing file.
+ * file after populating it (as we do at wal_level=minimal), the contents of
+ * the file would be lost forever.  By leaving the empty file until after the
+ * next checkpoint, we prevent reassignment of the relfilenode number until
+ * it's safe, because relfilenode assignment skips over any existing file.
  *
  * We do not need to go through this dance for temp relations, though, because
  * we never make WAL entries for temp rels, and so a temp rel poses no threat
@@ -877,12 +876,18 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
  *    mdimmedsync() -- Immediately sync a relation to stable storage.
  *
  * Note that only writes already issued are synced; this routine knows
- * nothing of dirty buffers that may exist inside the buffer manager.
+ * nothing of dirty buffers that may exist inside the buffer manager.  We
+ * sync active and inactive segments; smgrDoPendingSyncs() relies on this.
+ * Consider a relation skipping WAL.  Suppose a checkpoint syncs blocks of
+ * some segment, then mdtruncate() renders that segment inactive.  If we
+ * crash before the next checkpoint syncs the newly-inactive segment, that
+ * segment may survive recovery, reintroducing unwanted data into the table.
  */
 void
 mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 {
     int            segno;
+    int            min_inactive_seg;
 
     /*
      * NOTE: mdnblocks makes sure we have opened all active segments, so that
@@ -890,7 +895,16 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
      */
     mdnblocks(reln, forknum);
 
-    segno = reln->md_num_open_segs[forknum];
+    min_inactive_seg = segno = reln->md_num_open_segs[forknum];
+
+    /*
+     * Temporarily open inactive segments, then close them after sync.  There
+     * may be some inactive segments left opened after fsync() error, but that
+     * is harmless.  We don't bother to clean them up and take a risk of
+     * further trouble.  The next mdclose() will soon close them.
+     */
+    while (_mdfd_openseg(reln, forknum, segno, 0) != NULL)
+        segno++;
 
     while (segno > 0)
     {
@@ -901,6 +915,14 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
                     (errcode_for_file_access(),
                      errmsg("could not fsync file \"%s\": %m",
                             FilePathName(v->mdfd_vfd))));
+
+        /* Close inactive segments immediately */
+        if (segno > min_inactive_seg)
+        {
+            FileClose(v->mdfd_vfd);
+            _fdvec_resize(reln, forknum, segno - 1);
+        }
+
         segno--;
     }
 }
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 360b5bf5bf..72c9696ad1 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -388,6 +388,41 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
     smgrsw[which].smgr_unlink(rnode, InvalidForkNumber, isRedo);
 }
 
+/*
+ *    smgrdosyncall() -- Immediately sync all forks of all given relations
+ *
+ *        All forks of all given relations are synced out to the store.
+ *
+ *        This is equivalent to FlushRelationBuffers() for each smgr relation,
+ *        then calling smgrimmedsync() for all forks of each relation, but it's
+ *        significantly quicker so should be preferred when possible.
+ */
+void
+smgrdosyncall(SMgrRelation *rels, int nrels)
+{
+    int            i = 0;
+    ForkNumber    forknum;
+
+    if (nrels == 0)
+        return;
+
+    FlushRelationsAllBuffers(rels, nrels);
+
+    /*
+     * Sync the physical file(s).
+     */
+    for (i = 0; i < nrels; i++)
+    {
+        int            which = rels[i]->smgr_which;
+
+        for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+        {
+            if (smgrsw[which].smgr_exists(rels[i], forknum))
+                smgrsw[which].smgr_immedsync(rels[i], forknum);
+        }
+    }
+}
+
 /*
  *    smgrdounlinkall() -- Immediately unlink all forks of all given relations
  *
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index df025a5a30..0ac72572e3 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -257,6 +257,9 @@ static void RelationReloadIndexInfo(Relation relation);
 static void RelationReloadNailed(Relation relation);
 static void RelationFlushRelation(Relation relation);
 static void RememberToFreeTupleDescAtEOX(TupleDesc td);
+#ifdef USE_ASSERT_CHECKING
+static void AssertPendingSyncConsistency(Relation relation);
+#endif
 static void AtEOXact_cleanup(Relation relation, bool isCommit);
 static void AtEOSubXact_cleanup(Relation relation, bool isCommit,
                                 SubTransactionId mySubid, SubTransactionId parentSubid);
@@ -1090,6 +1093,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
     relation->rd_isnailed = false;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     switch (relation->rd_rel->relpersistence)
     {
         case RELPERSISTENCE_UNLOGGED:
@@ -1814,6 +1818,7 @@ formrdesc(const char *relationName, Oid relationReltype,
     relation->rd_isnailed = true;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     relation->rd_backend = InvalidBackendId;
     relation->rd_islocaltemp = false;
 
@@ -2021,6 +2026,7 @@ RelationIdGetRelation(Oid relationId)
     rd = RelationBuildDesc(relationId, true);
     if (RelationIsValid(rd))
         RelationIncrementReferenceCount(rd);
+
     return rd;
 }
 
@@ -2089,7 +2095,7 @@ RelationClose(Relation relation)
 #ifdef RELCACHE_FORCE_RELEASE
     if (RelationHasReferenceCountZero(relation) &&
         relation->rd_createSubid == InvalidSubTransactionId &&
-        relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
         RelationClearRelation(relation, false);
 #endif
 }
@@ -2505,13 +2511,13 @@ RelationClearRelation(Relation relation, bool rebuild)
          * problem.
          *
          * When rebuilding an open relcache entry, we must preserve ref count,
-         * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state.  Also
-         * attempt to preserve the pg_class entry (rd_rel), tupledesc,
-         * rewrite-rule, partition key, and partition descriptor substructures
-         * in place, because various places assume that these structures won't
-         * move while they are working with an open relcache entry.  (Note:
-         * the refcount mechanism for tupledescs might someday allow us to
-         * remove this hack for the tupledesc.)
+         * rd_*Subid, and rd_toastoid state.  Also attempt to preserve the
+         * pg_class entry (rd_rel), tupledesc, rewrite-rule, partition key,
+         * and partition descriptor substructures in place, because various
+         * places assume that these structures won't move while they are
+         * working with an open relcache entry.  (Note:  the refcount
+         * mechanism for tupledescs might someday allow us to remove this hack
+         * for the tupledesc.)
          *
          * Note that this process does not touch CurrentResourceOwner; which
          * is good because whatever ref counts the entry may have do not
@@ -2591,6 +2597,7 @@ RelationClearRelation(Relation relation, bool rebuild)
         /* creation sub-XIDs must be preserved */
         SWAPFIELD(SubTransactionId, rd_createSubid);
         SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+        SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
         /* un-swap rd_rel pointers, swap contents instead */
         SWAPFIELD(Form_pg_class, rd_rel);
         /* ... but actually, we don't have to update newrel->rd_rel */
@@ -2669,12 +2676,12 @@ static void
 RelationFlushRelation(Relation relation)
 {
     if (relation->rd_createSubid != InvalidSubTransactionId ||
-        relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
     {
         /*
          * New relcache entries are always rebuilt, not flushed; else we'd
-         * forget the "new" status of the relation, which is a useful
-         * optimization to have.  Ditto for the new-relfilenode status.
+         * forget the "new" status of the relation.  Ditto for the
+         * new-relfilenode status.
          *
          * The rel could have zero refcnt here, so temporarily increment the
          * refcnt to ensure it's safe to rebuild it.  We can assume that the
@@ -2754,11 +2761,10 @@ RelationCacheInvalidateEntry(Oid relationId)
  *     relation cache and re-read relation mapping data.
  *
  *     This is currently used only to recover from SI message buffer overflow,
- *     so we do not touch new-in-transaction relations; they cannot be targets
- *     of cross-backend SI updates (and our own updates now go through a
- *     separate linked list that isn't limited by the SI message buffer size).
- *     Likewise, we need not discard new-relfilenode-in-transaction hints,
- *     since any invalidation of those would be a local event.
+ *     so we do not touch relations having new-in-transaction relfilenodes; they
+ *     cannot be targets of cross-backend SI updates (and our own updates now go
+ *     through a separate linked list that isn't limited by the SI message
+ *     buffer size).
  *
  *     We do this in two phases: the first pass deletes deletable items, and
  *     the second one rebuilds the rebuildable items.  This is essential for
@@ -2809,7 +2815,7 @@ RelationCacheInvalidate(void)
          * pending invalidations.
          */
         if (relation->rd_createSubid != InvalidSubTransactionId ||
-            relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+            relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
             continue;
 
         relcacheInvalsReceived++;
@@ -2921,6 +2927,78 @@ RememberToFreeTupleDescAtEOX(TupleDesc td)
     EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
 }
 
+#ifdef USE_ASSERT_CHECKING
+static void
+AssertPendingSyncConsistency(Relation relation)
+{
+    bool        relcache_verdict =
+    relation->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&
+    ((relation->rd_createSubid != InvalidSubTransactionId &&
+      RELKIND_HAS_STORAGE(relation->rd_rel->relkind)) ||
+     relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId);
+
+    Assert(relcache_verdict == RelFileNodeSkippingWAL(relation->rd_node));
+}
+
+/*
+ * AssertPendingSyncs_RelationCache
+ *
+ *    Assert that relcache.c and storage.c agree on whether to skip WAL.
+ *
+ * This consistently detects relcache.c skipping WAL while storage.c is not
+ * skipping WAL.  It often fails to detect the reverse error, because
+ * invalidation will have destroyed the relcache entry.  It will detect the
+ * reverse error if something opens the relation after the DDL.
+ */
+void
+AssertPendingSyncs_RelationCache(void)
+{
+    HASH_SEQ_STATUS status;
+    LOCALLOCK  *locallock;
+    Relation   *rels;
+    int            maxrels;
+    int            nrels;
+    RelIdCacheEnt *idhentry;
+    int            i;
+
+    /* open every relation that this transaction has locked */
+    PushActiveSnapshot(GetTransactionSnapshot());
+    maxrels = 1;
+    rels = palloc(maxrels * sizeof(*rels));
+    nrels = 0;
+    hash_seq_init(&status, GetLockMethodLocalHash());
+    while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+    {
+        Oid            relid;
+        Relation    r;
+
+        if (locallock->nLocks <= 0)
+            continue;
+        if ((LockTagType) locallock->tag.lock.locktag_type !=
+            LOCKTAG_RELATION)
+            continue;
+        relid = ObjectIdGetDatum(locallock->tag.lock.locktag_field2);
+        r = RelationIdGetRelation(relid);
+        if (r == NULL)
+            continue;
+        if (nrels >= maxrels)
+        {
+            maxrels *= 2;
+            rels = repalloc(rels, maxrels * sizeof(*rels));
+        }
+        rels[nrels++] = r;
+    }
+
+    hash_seq_init(&status, RelationIdCache);
+    while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+        AssertPendingSyncConsistency(idhentry->reldesc);
+
+    for (i = 0; i < nrels; i++)
+        RelationClose(rels[i]);
+    PopActiveSnapshot();
+}
+#endif
+
 /*
  * AtEOXact_RelationCache
  *
@@ -3032,10 +3110,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
      *
      * During commit, reset the flag to zero, since we are now out of the
      * creating transaction.  During abort, simply delete the relcache entry
-     * --- it isn't interesting any longer.  (NOTE: if we have forgotten the
-     * new-ness of a new relation due to a forced cache flush, the entry will
-     * get deleted anyway by shared-cache-inval processing of the aborted
-     * pg_class insertion.)
+     * --- it isn't interesting any longer.
      */
     if (relation->rd_createSubid != InvalidSubTransactionId)
     {
@@ -3063,9 +3138,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
     }
 
     /*
-     * Likewise, reset the hint about the relfilenode being new.
+     * Likewise, reset any record of the relfilenode being new.
      */
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 }
 
 /*
@@ -3157,7 +3233,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
     }
 
     /*
-     * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+     * Likewise, update or drop any new-relfilenode-in-subtransaction record.
      */
     if (relation->rd_newRelfilenodeSubid == mySubid)
     {
@@ -3166,6 +3242,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         else
             relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     }
+
+    if (relation->rd_firstRelfilenodeSubid == mySubid)
+    {
+        if (isCommit)
+            relation->rd_firstRelfilenodeSubid = parentSubid;
+        else
+            relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    }
 }
 
 
@@ -3255,6 +3339,7 @@ RelationBuildLocalRelation(const char *relname,
     /* it's being created in this transaction */
     rel->rd_createSubid = GetCurrentSubTransactionId();
     rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 
     /*
      * create a new tuple descriptor from the one passed in.  We do this
@@ -3552,14 +3637,29 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
      */
     CommandCounterIncrement();
 
-    /*
-     * Mark the rel as having been given a new relfilenode in the current
-     * (sub) transaction.  This is a hint that can be used to optimize later
-     * operations on the rel in the same transaction.
-     */
+    RelationAssumeNewRelfilenode(relation);
+}
+
+/*
+ * RelationAssumeNewRelfilenode
+ *
+ * Code that modifies pg_class.reltablespace or pg_class.relfilenode must call
+ * this.  The call shall precede any code that might insert WAL records whose
+ * replay would modify bytes in the new RelFileNode, and the call shall follow
+ * any WAL modifying bytes in the prior RelFileNode.  See struct RelationData.
+ * Ideally, call this as near as possible to the CommandCounterIncrement()
+ * that makes the pg_class change visible (before it or after it); that
+ * minimizes the chance of future development adding a forbidden WAL insertion
+ * between RelationAssumeNewRelfilenode() and CommandCounterIncrement().
+ */
+void
+RelationAssumeNewRelfilenode(Relation relation)
+{
     relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+    if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
 
-    /* Flag relation as needing eoxact cleanup (to remove the hint) */
+    /* Flag relation as needing eoxact cleanup (to clear these fields) */
     EOXactListAdd(relation);
 }
 
@@ -5625,6 +5725,7 @@ load_relcache_init_file(bool shared)
         rel->rd_fkeylist = NIL;
         rel->rd_createSubid = InvalidSubTransactionId;
         rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+        rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
         rel->rd_amcache = NULL;
         MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e5f8a1301f..ab1091564b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -36,6 +36,7 @@
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
+#include "catalog/storage.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
 #include "commands/trigger.h"
@@ -2670,6 +2671,18 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"wal_skip_threshold", PGC_USERSET, RESOURCES_DISK,
+            gettext_noop("Size of new file to fsync instead of writing WAL."),
+            NULL,
+            GUC_UNIT_KB
+        },
+        &wal_skip_threshold,
+        2048,
+        0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
     {
         {"max_wal_senders", PGC_POSTMASTER, REPLICATION_SENDING,
             gettext_noop("Sets the maximum number of simultaneously running WAL sender processes."),
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 18f2b0d98e..4bfc628000 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -455,6 +455,8 @@ extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
                                 BlockNumber origrlink, GistNSN oldnsn,
                                 Buffer leftchild, bool markfollowright);
 
+extern XLogRecPtr gistXLogAssignLSN(void);
+
 /* gistget.c */
 extern bool gistgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 55fc843d3a..673afee1e1 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -26,6 +26,7 @@
  /* #define XLOG_GIST_INSERT_COMPLETE     0x40 */    /* not used anymore */
  /* #define XLOG_GIST_CREATE_INDEX         0x50 */    /* not used anymore */
 #define XLOG_GIST_PAGE_DELETE        0x60
+#define XLOG_GIST_ASSIGN_LSN        0x70    /* nop, assign new LSN */
 
 /*
  * Backup Blk 0: updated page.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 580b4caef7..d9be69c124 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
@@ -166,8 +165,6 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                                HeapTuple tup);
 
-extern void heap_sync(Relation relation);
-
 extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
                                                          ItemPointerData *items,
                                                          int nitems);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index fb2902bd69..e6d7fa1e65 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
                                        TransactionId OldestXmin, TransactionId FreezeXid,
-                                       MultiXactId MultiXactCutoff, bool use_wal);
+                                       MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
                                HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 696451f728..6547099e84 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -127,7 +127,7 @@ typedef struct TM_FailureData
 } TM_FailureData;
 
 /* "options" flag bits for table_tuple_insert */
-#define TABLE_INSERT_SKIP_WAL        0x0001
+/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
 #define TABLE_INSERT_SKIP_FSM        0x0002
 #define TABLE_INSERT_FROZEN            0x0004
 #define TABLE_INSERT_NO_LOGICAL        0x0008
@@ -409,9 +409,8 @@ typedef struct TableAmRoutine
 
     /*
      * Perform operations necessary to complete insertions made via
-     * tuple_insert and multi_insert with a BulkInsertState specified. This
-     * may for example be used to flush the relation, when the
-     * TABLE_INSERT_SKIP_WAL option was used.
+     * tuple_insert and multi_insert with a BulkInsertState specified. In-tree
+     * access methods ceased to use this.
      *
      * Typically callers of tuple_insert and multi_insert will just pass all
      * the flags that apply to them, and each AM has to decide which of them
@@ -1105,10 +1104,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * The options bitmask allows the caller to specify options that may change the
  * behaviour of the AM. The AM will ignore options that it does not support.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.
@@ -1328,9 +1323,7 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
 
 /*
  * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
+ * tuple_insert and multi_insert with a BulkInsertState specified.
  */
 static inline void
 table_finish_bulk_insert(Relation rel, int options)
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 048003c25e..bd37bf311c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,18 +19,24 @@
 #include "storage/smgr.h"
 #include "utils/relcache.h"
 
+/* GUC variables */
+extern int    wal_skip_threshold;
+
 extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
+extern void RelationPreTruncate(Relation rel);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
 extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
                                 ForkNumber forkNum, char relpersistence);
+extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
 
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingSyncs(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 73c7e9ba38..292d440eaf 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -49,6 +49,9 @@ typedef enum
 /* forward declared, to avoid having to expose buf_internals.h here */
 struct WritebackContext;
 
+/* forward declared, to avoid including smgr.h here */
+struct SMgrRelationData;
+
 /* in globals.c ... this duplicates miscadmin.h */
 extern PGDLLIMPORT int NBuffers;
 
@@ -189,6 +192,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                                    ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                                    int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index bb8e4e6e5b..8c180094f0 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -544,6 +544,9 @@ extern void LockReleaseSession(LOCKMETHODID lockmethodid);
 extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
 extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
 extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
+#ifdef USE_ASSERT_CHECKING
+extern HTAB *GetLockMethodLocalHash(void);
+#endif
 extern bool LockHasWaiters(const LOCKTAG *locktag,
                            LOCKMODE lockmode, bool sessionLock);
 extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 243822137c..79dfe0e373 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -89,6 +89,7 @@ extern void smgrcloseall(void);
 extern void smgrclosenode(RelFileNodeBackend rnode);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdounlink(SMgrRelation reln, bool isRedo);
+extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
                        BlockNumber blocknum, char *buffer, bool skipFsync);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 44ed04dd3f..ad72a8b910 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -64,22 +64,40 @@ typedef struct RelationData
                                  * rd_replidindex) */
     bool        rd_statvalid;    /* is rd_statlist valid? */
 
-    /*
+    /*----------
      * rd_createSubid is the ID of the highest subtransaction the rel has
-     * survived into; or zero if the rel was not created in the current top
-     * transaction.  This can be now be relied on, whereas previously it could
-     * be "forgotten" in earlier releases. Likewise, rd_newRelfilenodeSubid is
-     * the ID of the highest subtransaction the relfilenode change has
-     * survived into, or zero if not changed in the current transaction (or we
-     * have forgotten changing it). rd_newRelfilenodeSubid can be forgotten
-     * when a relation has multiple new relfilenodes within a single
-     * transaction, with one of them occurring in a subsequently aborted
-     * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
-     * ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+     * survived into or zero if the rel was not created in the current top
+     * transaction.  rd_firstRelfilenodeSubid is the ID of the highest
+     * subtransaction an rd_node change has survived into or zero if rd_node
+     * matches the value it had at the start of the current top transaction.
+     * (Rolling back the subtransaction that rd_firstRelfilenodeSubid denotes
+     * would restore rd_node to the value it had at the start of the current
+     * top transaction.  Rolling back any lower subtransaction would not.)
+     * Their accuracy is critical to RelationNeedsWAL().
+     *
+     * rd_newRelfilenodeSubid is the ID of the highest subtransaction the
+     * most-recent relfilenode change has survived into or zero if not changed
+     * in the current transaction (or we have forgotten changing it).  This
+     * field is accurate when non-zero, but it can be zero when a relation has
+     * multiple new relfilenodes within a single transaction, with one of them
+     * occurring in a subsequently aborted subtransaction, e.g.
+     *        BEGIN;
+     *        TRUNCATE t;
+     *        SAVEPOINT save;
+     *        TRUNCATE t;
+     *        ROLLBACK TO save;
+     *        -- rd_newRelfilenodeSubid is now forgotten
+     *
+     * These fields are read-only outside relcache.c.  Other files trigger
+     * rd_node changes by updating pg_class.reltablespace and/or
+     * pg_class.relfilenode.  They must call RelationAssumeNewRelfilenode() to
+     * update these fields.
      */
     SubTransactionId rd_createSubid;    /* rel was created in current xact */
-    SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
-                                                 * current xact */
+    SubTransactionId rd_newRelfilenodeSubid;    /* highest subxact changing
+                                                 * rd_node to current value */
+    SubTransactionId rd_firstRelfilenodeSubid;    /* highest subxact changing
+                                                 * rd_node to any value */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
@@ -526,9 +544,16 @@ typedef struct ViewOptions
 /*
  * RelationNeedsWAL
  *        True if relation needs WAL.
- */
-#define RelationNeedsWAL(relation) \
-    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.  See "Skipping WAL for New
+ * RelFileNode" in src/backend/access/transam/README.
+ */
+#define RelationNeedsWAL(relation)                                        \
+    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&    \
+     (XLogIsNeeded() ||                                                    \
+      (relation->rd_createSubid == InvalidSubTransactionId &&            \
+       relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
 
 /*
  * RelationUsesLocalBuffers
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index d77f5beec6..62239a09e8 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -106,9 +106,10 @@ extern Relation RelationBuildLocalRelation(const char *relname,
                                            char relkind);
 
 /*
- * Routine to manage assignment of new relfilenode to a relation
+ * Routines to manage assignment of new relfilenode to a relation
  */
 extern void RelationSetNewRelfilenode(Relation relation, char persistence);
+extern void RelationAssumeNewRelfilenode(Relation relation);
 
 /*
  * Routines for flushing/rebuilding relcache entries in various scenarios
@@ -121,6 +122,11 @@ extern void RelationCacheInvalidate(void);
 
 extern void RelationCloseSmgrByOid(Oid relationId);
 
+#ifdef USE_ASSERT_CHECKING
+extern void AssertPendingSyncs_RelationCache(void);
+#else
+#define AssertPendingSyncs_RelationCache() do {} while (0)
+#endif
 extern void AtEOXact_RelationCache(bool isCommit);
 extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid,
                                       SubTransactionId parentSubid);
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..78d81e12d0
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,374 @@
+# Test WAL replay when some operation has skipped WAL.
+#
+# These tests exercise code that once violated the mandate described in
+# src/backend/access/transam/README section "Skipping WAL for New
+# RelFileNode".  The tests work by committing some transactions, initiating an
+# immediate shutdown, and confirming that the expected data survives recovery.
+# For many years, individual commands made the decision to skip WAL, hence the
+# frequent appearance of COPY in these tests.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 34;
+
+sub check_orphan_relfilenodes
+{
+    my ($node, $test_name) = @_;
+
+    my $db_oid = $node->safe_psql('postgres',
+        "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+    my $prefix               = "base/$db_oid/";
+    my $filepaths_referenced = $node->safe_psql(
+        'postgres', "
+       SELECT pg_relation_filepath(oid) FROM pg_class
+       WHERE reltablespace = 0 AND relpersistence <> 't' AND
+       pg_relation_filepath(oid) IS NOT NULL;");
+    is_deeply(
+        [
+            sort(map { "$prefix$_" }
+                  grep(/^[0-9]+$/, slurp_dir($node->data_dir . "/$prefix")))
+        ],
+        [ sort split /\n/, $filepaths_referenced ],
+        $test_name);
+    return;
+}
+
+# We run this same test suite for both wal_level=minimal and replica.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf(
+        'postgresql.conf', qq(
+wal_level = $wal_level
+max_prepared_transactions = 1
+wal_log_hints = on
+wal_skip_threshold = 0
+#wal_debug = on
+));
+    $node->start;
+
+    # Setup
+    my $tablespace_dir = $node->basedir . '/tablespace_other';
+    mkdir($tablespace_dir);
+    $tablespace_dir = TestLib::perl2host($tablespace_dir);
+    $node->safe_psql('postgres',
+        "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+    # Test direct truncation optimization.  No tuples.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE trunc (id serial PRIMARY KEY);
+        TRUNCATE trunc;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM trunc;");
+    is($result, qq(0), "wal_level = $wal_level, TRUNCATE with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE trunc_ins (id serial PRIMARY KEY);
+        INSERT INTO trunc_ins VALUES (DEFAULT);
+        TRUNCATE trunc_ins;
+        INSERT INTO trunc_ins VALUES (DEFAULT);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM trunc_ins;");
+    is($result, qq(1), "wal_level = $wal_level, TRUNCATE INSERT");
+
+    # Same for prepared transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE twophase (id serial PRIMARY KEY);
+        INSERT INTO twophase VALUES (DEFAULT);
+        TRUNCATE twophase;
+        INSERT INTO twophase VALUES (DEFAULT);
+        PREPARE TRANSACTION 't';
+        COMMIT PREPARED 't';");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM twophase;");
+    is($result, qq(1), "wal_level = $wal_level, TRUNCATE INSERT PREPARE");
+
+    # Same with writing WAL at end of xact, instead of syncing.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        SET wal_skip_threshold = '1TB';
+        BEGIN;
+        CREATE TABLE noskip (id serial PRIMARY KEY);
+        INSERT INTO noskip VALUES (DEFAULT);
+        TRUNCATE noskip;
+        INSERT INTO noskip VALUES (DEFAULT);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM noskip;");
+    is($result, qq(1),
+        "wal_level = $wal_level, TRUNCATE with end-of-xact WAL");
+
+    # Data file for COPY query in subsequent tests
+    my $basedir   = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file(
+        $copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using both INSERT and COPY.  Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE ins_trunc (id serial PRIMARY KEY, id2 int);
+        INSERT INTO ins_trunc VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE ins_trunc;
+        INSERT INTO ins_trunc (id, id2) VALUES (DEFAULT, 10000);
+        COPY ins_trunc FROM '$copy_file' DELIMITER ',';
+        INSERT INTO ins_trunc (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM ins_trunc;");
+    is($result, qq(5), "wal_level = $wal_level, TRUNCATE COPY INSERT");
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after
+    # the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE trunc_copy (id serial PRIMARY KEY, id2 int);
+        INSERT INTO trunc_copy VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE trunc_copy;
+        COPY trunc_copy FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result =
+      $node->safe_psql('postgres', "SELECT count(*) FROM trunc_copy;");
+    is($result, qq(3), "wal_level = $wal_level, TRUNCATE COPY");
+
+    # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE spc_abort (id serial PRIMARY KEY, id2 int);
+        INSERT INTO spc_abort VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE spc_abort;
+        SAVEPOINT s;
+          ALTER TABLE spc_abort SET TABLESPACE other; ROLLBACK TO s;
+        COPY spc_abort FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_abort;");
+    is($result, qq(3),
+        "wal_level = $wal_level, SET TABLESPACE abort subtransaction");
+
+    # in different subtransaction patterns
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE spc_commit (id serial PRIMARY KEY, id2 int);
+        INSERT INTO spc_commit VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE spc_commit;
+        SAVEPOINT s; ALTER TABLE spc_commit SET TABLESPACE other; RELEASE s;
+        COPY spc_commit FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result =
+      $node->safe_psql('postgres', "SELECT count(*) FROM spc_commit;");
+    is($result, qq(3),
+        "wal_level = $wal_level, SET TABLESPACE commit subtransaction");
+
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE spc_nest (id serial PRIMARY KEY, id2 int);
+        INSERT INTO spc_nest VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE spc_nest;
+        SAVEPOINT s;
+            ALTER TABLE spc_nest SET TABLESPACE other;
+            SAVEPOINT s2;
+                ALTER TABLE spc_nest SET TABLESPACE pg_default;
+            ROLLBACK TO s2;
+            SAVEPOINT s2;
+                ALTER TABLE spc_nest SET TABLESPACE pg_default;
+            RELEASE s2;
+        ROLLBACK TO s;
+        COPY spc_nest FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_nest;");
+    is($result, qq(3),
+        "wal_level = $wal_level, SET TABLESPACE nested subtransaction");
+
+    $node->safe_psql(
+        'postgres', "
+        CREATE TABLE spc_hint (id int);
+        INSERT INTO spc_hint VALUES (1);
+        BEGIN;
+        ALTER TABLE spc_hint SET TABLESPACE other;
+        CHECKPOINT;
+        SELECT * FROM spc_hint;  -- set hint bit
+        INSERT INTO spc_hint VALUES (2);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_hint;");
+    is($result, qq(2), "wal_level = $wal_level, SET TABLESPACE, hint bit");
+
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE idx_hint (c int PRIMARY KEY);
+        SAVEPOINT q; INSERT INTO idx_hint VALUES (1); ROLLBACK TO q;
+        CHECKPOINT;
+        INSERT INTO idx_hint VALUES (1);  -- set index hint bit
+        INSERT INTO idx_hint VALUES (2);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->psql('postgres',);
+    my ($ret, $stdout, $stderr) =
+      $node->psql('postgres', "INSERT INTO idx_hint VALUES (2);");
+    is($ret, qq(3), "wal_level = $wal_level, unique index LP_DEAD");
+    like(
+        $stderr,
+        qr/violates unique/,
+        "wal_level = $wal_level, unique index LP_DEAD message");
+
+    # UPDATE touches two buffers for one row.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE upd (id serial PRIMARY KEY, id2 int);
+        INSERT INTO upd (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        COPY upd FROM '$copy_file' DELIMITER ',';
+        UPDATE upd SET id2 = id2 + 1;
+        DELETE FROM upd;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM upd;");
+    is($result, qq(0),
+        "wal_level = $wal_level, UPDATE touches two buffers for one row");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE ins_copy (id serial PRIMARY KEY, id2 int);
+        INSERT INTO ins_copy VALUES (DEFAULT, 1);
+        COPY ins_copy FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM ins_copy;");
+    is($result, qq(4), "wal_level = $wal_level, INSERT COPY");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE ins_trig (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION ins_trig_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO ins_trig
+                VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION ins_trig_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO ins_trig
+                VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER ins_trig_before_row_insert
+          BEFORE INSERT ON ins_trig
+          FOR EACH ROW EXECUTE PROCEDURE ins_trig_before_row_trig();
+        CREATE TRIGGER ins_trig_after_row_insert
+          AFTER INSERT ON ins_trig
+          FOR EACH ROW EXECUTE PROCEDURE ins_trig_after_row_trig();
+        COPY ins_trig FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM ins_trig;");
+    is($result, qq(9), "wal_level = $wal_level, COPY with INSERT triggers");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE trunc_trig (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION trunc_trig_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO trunc_trig VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION trunc_trig_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO trunc_trig VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER trunc_trig_before_stat_truncate
+          BEFORE TRUNCATE ON trunc_trig
+          FOR EACH STATEMENT EXECUTE PROCEDURE trunc_trig_before_stat_trig();
+        CREATE TRIGGER trunc_trig_after_stat_truncate
+          AFTER TRUNCATE ON trunc_trig
+          FOR EACH STATEMENT EXECUTE PROCEDURE trunc_trig_after_stat_trig();
+        INSERT INTO trunc_trig VALUES (DEFAULT, 1);
+        TRUNCATE trunc_trig;
+        COPY trunc_trig FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result =
+      $node->safe_psql('postgres', "SELECT count(*) FROM trunc_trig;");
+    is($result, qq(4),
+        "wal_level = $wal_level, TRUNCATE COPY with TRUNCATE triggers");
+
+    # Test redo of temp table creation.
+    $node->safe_psql(
+        'postgres', "
+        CREATE TEMP TABLE temp (id serial PRIMARY KEY, id2 text);");
+    $node->stop('immediate');
+    $node->start;
+    check_orphan_relfilenodes($node,
+        "wal_level = $wal_level, no orphan relfilenode remains");
+
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index b492c606ab..3ac009f127 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -1982,6 +1982,12 @@ select * from another;
 (3 rows)
 
 drop table another;
+-- Create an index that skips WAL, then perform a SET DATA TYPE that skips
+-- rewriting the index.
+begin;
+create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
+alter table skip_wal_skip_rewrite_index alter c type varchar(20);
+commit;
 -- table's row type
 create table tab1 (a int, b text);
 create table tab2 (x int, y tab1);
diff --git a/src/test/regress/sql/alter_table.sql b/src/test/regress/sql/alter_table.sql
index abe7be3223..0420fa495c 100644
--- a/src/test/regress/sql/alter_table.sql
+++ b/src/test/regress/sql/alter_table.sql
@@ -1358,6 +1358,13 @@ select * from another;
 
 drop table another;
 
+-- Create an index that skips WAL, then perform a SET DATA TYPE that skips
+-- rewriting the index.
+begin;
+create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
+alter table skip_wal_skip_rewrite_index alter c type varchar(20);
+commit;
+
 -- table's row type
 create table tab1 (a int, b text);
 create table tab2 (x int, y tab1);
-- 
2.23.0

From 871cc92c70f1ec3a38abcaf943ea7bc2fc56dcff Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 14 Jan 2020 19:24:04 +0900
Subject: [PATCH v32 2/4] Fix the defect 1

Pending sync is lost by the followig sequence.

  begin;
  create table t (c int);
  savepoint q; drop table t; rollback to q;  -- forgets table is skipping wal
  commit;  -- assertion failure

Relcache entry for a dropped relation is deleted right out. On the
other hand we need the newness information holded in the dropped entry
if the subtransaction is rolled back later. So this patch makes
relcache entry preserved after dropping of a relation that any newness
flag is active, so that it is available later in the current
transaction.
---
 src/backend/utils/cache/relcache.c | 67 ++++++++++++++++++++++++++----
 src/include/utils/rel.h            |  2 +
 2 files changed, 62 insertions(+), 7 deletions(-)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 0ac72572e3..7632165ca9 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1094,6 +1094,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_isdropped = false;
     switch (relation->rd_rel->relpersistence)
     {
         case RELPERSISTENCE_UNLOGGED:
@@ -1993,7 +1994,7 @@ RelationIdGetRelation(Oid relationId)
     {
         RelationIncrementReferenceCount(rd);
         /* revalidate cache entry if necessary */
-        if (!rd->rd_isvalid)
+        if (!rd->rd_isvalid && !rd->rd_isdropped)
         {
             /*
              * Indexes only have a limited number of possible schema changes,
@@ -2137,7 +2138,7 @@ RelationReloadIndexInfo(Relation relation)
     /* Should be called only for invalidated indexes */
     Assert((relation->rd_rel->relkind == RELKIND_INDEX ||
             relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX) &&
-           !relation->rd_isvalid);
+           !relation->rd_isvalid && !relation->rd_isdropped);
 
     /* Ensure it's closed at smgr level */
     RelationCloseSmgr(relation);
@@ -2443,7 +2444,7 @@ RelationClearRelation(Relation relation, bool rebuild)
     if ((relation->rd_rel->relkind == RELKIND_INDEX ||
          relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX) &&
         relation->rd_refcnt > 0 &&
-        relation->rd_indexcxt != NULL)
+        relation->rd_indexcxt != NULL && !relation->rd_isdropped)
     {
         relation->rd_isvalid = false;    /* needs to be revalidated */
         if (IsTransactionState())
@@ -2462,6 +2463,18 @@ RelationClearRelation(Relation relation, bool rebuild)
      */
     if (!rebuild)
     {
+        /*
+         * The relcache entry is still needed to perform at-commit sync if the
+         * subtransaction aborts later.  Mark the relcache as "dropped" and
+         * leave it live invalid.
+         */
+        if (relation->rd_createSubid != InvalidSubTransactionId ||
+            relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
+        {
+            relation->rd_isdropped = true;
+            return;
+        }
+
         /* Remove it from the hash table */
         RelationCacheDelete(relation);
 
@@ -2546,6 +2559,19 @@ RelationClearRelation(Relation relation, bool rebuild)
             if (HistoricSnapshotActive())
                 return;
 
+            /*
+             * Although this relation is already dropped from catalog, the
+             * relcache entry is still needed to perform at-commit sync if the
+             * subtransaction aborts later.  Mark the relcache as "dropped" and
+             * leave it live invalid.
+             */
+            if (relation->rd_createSubid != InvalidSubTransactionId ||
+                relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
+            {
+                relation->rd_isdropped = true;
+                return;
+            }
+
             /*
              * This shouldn't happen as dropping a relation is intended to be
              * impossible if still referenced (cf. CheckTableNotInUse()). But
@@ -2991,7 +3017,20 @@ AssertPendingSyncs_RelationCache(void)
 
     hash_seq_init(&status, RelationIdCache);
     while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
-        AssertPendingSyncConsistency(idhentry->reldesc);
+    {
+        Relation r = idhentry->reldesc;
+
+        /* Ignore relcache entries of deleted relations */
+        if (r->rd_isdropped)
+        {
+            Assert(!r->rd_isvalid &&
+                   (r->rd_createSubid != InvalidSubTransactionId ||
+                    r->rd_firstRelfilenodeSubid != InvalidSubTransactionId));
+            continue;
+        }
+
+        AssertPendingSyncConsistency(r);
+    }
 
     for (i = 0; i < nrels; i++)
         RelationClose(rels[i]);
@@ -3114,8 +3153,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
      */
     if (relation->rd_createSubid != InvalidSubTransactionId)
     {
-        if (isCommit)
-            relation->rd_createSubid = InvalidSubTransactionId;
+        relation->rd_createSubid = InvalidSubTransactionId;
+
+        if (isCommit && !relation->rd_isdropped)
+        {} /* Nothing to do */
         else if (RelationHasReferenceCountZero(relation))
         {
             RelationClearRelation(relation, false);
@@ -3131,7 +3172,6 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
              * eventually.  This must be just a WARNING to avoid
              * error-during-error-recovery loops.
              */
-            relation->rd_createSubid = InvalidSubTransactionId;
             elog(WARNING, "cannot remove relcache entry for \"%s\" because it has nonzero refcount",
                  RelationGetRelationName(relation));
         }
@@ -3232,6 +3272,19 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         }
     }
 
+    /*
+     * If this relation registered pending sync then dropped, subxact rollback
+     * cancels the uncommitted drop, and commit propagates it to the parent.
+     */
+    if (relation->rd_isdropped)
+    {
+        Assert (!relation->rd_isvalid &&
+                (relation->rd_createSubid != InvalidSubTransactionId ||
+                 relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId));
+        if (!isCommit)
+            relation->rd_isdropped = false;
+    }
+
     /*
      * Likewise, update or drop any new-relfilenode-in-subtransaction record.
      */
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index ad72a8b910..da0c197dcf 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -98,6 +98,8 @@ typedef struct RelationData
                                                  * rd_node to current value */
     SubTransactionId rd_firstRelfilenodeSubid;    /* highest subxact changing
                                                  * rd_node to any value */
+    bool             rd_isdropped;                /* has the coressponding catalog
+                                                 * entry been dropped? */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
-- 
2.23.0

From d86a4004ad3121071b8748ffdb423e8f65290756 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 15 Jan 2020 17:00:39 +0900
Subject: [PATCH v32 3/4] Fix the defect 2

Pass newness flags to new index relation inherits the old relfilenode
whie ALTER TABLE ALTER TYPE.

The command may reuse old indexes that are created in the current
transaction.  Pass the information to the relcache of the new index
relation so that pending sync correctly works. This relies on the
relcache preserving feature introduced by the previos fix.
---
 src/backend/commands/tablecmds.c | 29 ++++++++++++++++++++++++++---
 src/include/nodes/parsenodes.h   |  1 +
 2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 0edb474118..59ff5979ad 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -7420,12 +7420,32 @@ ATExecAddIndex(AlteredTableInfo *tab, Relation rel,
      * this index will have scheduled the storage for deletion at commit, so
      * cancel that pending deletion.
      */
+    Assert (OidIsValid(stmt->oldNode) == OidIsValid(stmt->oldRelId));
     if (OidIsValid(stmt->oldNode))
     {
-        Relation    irel = index_open(address.objectId, NoLock);
+        Relation    newirel = index_open(address.objectId, NoLock);
+        Relation    oldirel = RelationIdGetRelation(stmt->oldRelId);
 
-        RelationPreserveStorage(irel->rd_node, true);
-        index_close(irel, NoLock);
+        RelationPreserveStorage(newirel->rd_node, true);
+
+        /*
+         * We need to copy the newness hints iff the relation cache entry is
+         * available for the already dropped relation.
+         */
+        if (oldirel != NULL)
+        {
+            Assert(!oldirel->rd_isvalid && oldirel->rd_isdropped &&
+                   (oldirel->rd_createSubid != InvalidSubTransactionId ||
+                    oldirel->rd_firstRelfilenodeSubid !=
+                    InvalidSubTransactionId));
+
+            newirel->rd_createSubid = oldirel->rd_createSubid;
+            newirel->rd_firstRelfilenodeSubid =
+                oldirel->rd_firstRelfilenodeSubid;
+
+            RelationClose(oldirel);
+        }
+        index_close(newirel, NoLock);
     }
 
     return address;
@@ -11680,7 +11700,10 @@ TryReuseIndex(Oid oldId, IndexStmt *stmt)
 
         /* If it's a partitioned index, there is no storage to share. */
         if (irel->rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
+        {
             stmt->oldNode = irel->rd_node.relNode;
+            stmt->oldRelId = irel->rd_id;
+        }
         index_close(irel, NoLock);
     }
 }
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 28d837b8fa..c4bdf7ccc9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2784,6 +2784,7 @@ typedef struct IndexStmt
     char       *idxcomment;        /* comment to apply to index, or NULL */
     Oid            indexOid;        /* OID of an existing index, if any */
     Oid            oldNode;        /* relfilenode of existing storage, if any */
+    Oid            oldRelId;        /* relid of the old index, if any */
     bool        unique;            /* is index unique? */
     bool        primary;        /* is index a primary key? */
     bool        isconstraint;    /* is it for a pkey/unique constraint? */
-- 
2.23.0

From cd9a63fbcdc12c05ac0207ef94347c49462a020c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 16 Jan 2020 13:24:27 +0900
Subject: [PATCH v32 4/4] Fix the defect 3

Force file sync if the file has been truncated.

The previous verision of the patch allowed to choose WAL when main
fork is larger than ever.  But there's a case where FSM fork gets
shorter while main fork is larger than ever.  Force file sync always
when the file has experienced a truncation.
---
 src/backend/catalog/storage.c | 56 +++++++++++++++++++----------------
 1 file changed, 31 insertions(+), 25 deletions(-)

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 8253c420ef..447fb606e5 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -68,7 +68,7 @@ typedef struct PendingRelDelete
 typedef struct pendingSync
 {
     RelFileNode rnode;
-    BlockNumber max_truncated;
+    bool        is_truncated;    /* Has the file experienced truncation? */
 } pendingSync;
 
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
@@ -154,7 +154,7 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
 
         pending = hash_search(pendingSyncHash, &rnode, HASH_ENTER, &found);
         Assert(!found);
-        pending->max_truncated = 0;
+        pending->is_truncated = false;
     }
 
     return srel;
@@ -388,13 +388,7 @@ RelationPreTruncate(Relation rel)
     /* Record largest maybe-unsynced block of files under tracking  */
     pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
                           HASH_FIND, NULL);
-    if (pending)
-    {
-        BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
-
-        if (pending->max_truncated < nblocks)
-            pending->max_truncated = nblocks;
-    }
+    pending->is_truncated = true;
 }
 
 /*
@@ -637,31 +631,43 @@ smgrDoPendingSyncs(bool isCommit)
          * Small WAL records have a chance to be emitted along with other
          * backends' WAL records.  We emit WAL records instead of syncing for
          * files that are smaller than a certain threshold, expecting faster
-         * commit.  The threshold is defined by the GUC wal_skip_threshold.
+         * commit.  The threshold is defined by the GUC wal_skip_threshold.  We
+         * don't bother counting the pages when the file has experienced a
+         * truncation.
          */
-        for (fork = 0; fork <= MAX_FORKNUM; fork++)
+        if (!pendingsync->is_truncated)
         {
-            if (smgrexists(srel, fork))
+            for (fork = 0; fork <= MAX_FORKNUM; fork++)
             {
-                BlockNumber n = smgrnblocks(srel, fork);
-
-                /* we shouldn't come here for unlogged relations */
-                Assert(fork != INIT_FORKNUM);
+                if (smgrexists(srel, fork))
+                {
+                    BlockNumber n = smgrnblocks(srel, fork);
 
-                nblocks[fork] = n;
-                total_blocks += n;
+                    /* we shouldn't come here for unlogged relations */
+                    Assert(fork != INIT_FORKNUM);
+                    nblocks[fork] = n;
+                    total_blocks += n;
+                }
+                else
+                    nblocks[fork] = InvalidBlockNumber;
             }
-            else
-                nblocks[fork] = InvalidBlockNumber;
         }
 
         /*
-         * Sync file or emit WAL records for its contents.  Do file sync if
-         * the size is larger than the threshold or truncates may have removed
-         * blocks beyond the current size.
+         * Sync file or emit WAL records for its contents.
+         *
+         * Alghough we emit WAL record if the file is small enough, do file
+         * sync regardless of the size if the file has experienced a
+         * truncation. It is because the file would be followed by trailing
+         * garbage blocks after a crash recovery if, while a past longer file
+         * had been flushed out, we omitted syncing-out of the file and emit
+         * WAL instead.  You might think that we could choose WAL if the
+         * current main fork is longer than ever, but there's a case where main
+         * fork is longer than ever but FSM fork gets shorter. We don't bother
+         * checking that for every fork.
          */
-        if (total_blocks * BLCKSZ / 1024 >= wal_skip_threshold ||
-            nblocks[MAIN_FORKNUM] < pendingsync->max_truncated)
+        if (pendingsync->is_truncated ||
+            total_blocks * BLCKSZ / 1024 >= wal_skip_threshold)
         {
             /* allocate the initial array, or extend it, if needed */
             if (maxrels == 0)
-- 
2.23.0

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

19 января 2020 г., 06:51:39

On Tue, Jan 14, 2020 at 07:35:22PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 26 Dec 2019 12:46:39 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch <noah@leadboat.com> wrote in 
> > > === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO
> > > 
> > > A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
> > > when running "make check" under wal_level=minimal.  I test this way:
> > > 
> > > printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' >$PWD/minimal.conf
> > > make check TEMP_CONFIG=$PWD/minimal.conf
> > > 
> > > Self-contained demonstration:
> > >   begin;
> > >   create table t (c int);
> > >   savepoint q; drop table t; rollback to q;  -- forgets table is skipping wal
> > >   commit;  -- assertion failure
> 
> This is complex than expected. The DROP TABLE unconditionally removed
> relcache entry. To fix that, I tried to use rd_isinvalid but it failed
> because there's a state that a relcache invalid but the corresponding
> catalog entry is alive.
> 
> In the attached patch 0002, I added a boolean in relcache that
> indicates that the relation is already removed in catalog but not
> committed.

This design could work, but some if its properties aren't ideal.  For example,
RelationIdGetRelation() can return a !rd_isvalid relation when the relation
has been dropped.  What others designs did you consider, if any?

On Thu, Jan 16, 2020 at 02:20:57PM +0900, Kyotaro Horiguchi wrote:
> --- a/src/backend/utils/cache/relcache.c
> +++ b/src/backend/utils/cache/relcache.c
> @@ -3114,8 +3153,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
>       */
>      if (relation->rd_createSubid != InvalidSubTransactionId)
>      {
> -        if (isCommit)
> -            relation->rd_createSubid = InvalidSubTransactionId;
> +        relation->rd_createSubid = InvalidSubTransactionId;
> +
> +        if (isCommit && !relation->rd_isdropped)
> +        {} /* Nothing to do */

What is the purpose of this particular change?  This executes at the end of a
top-level transaction.  We've already done any necessary syncing, and we're
clearing any flags that caused WAL skipping.  I think it's no longer
productive to treat dropped relations differently.

> @@ -3232,6 +3272,19 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
>          }
>      }
>  
> +    /*
> +     * If this relation registered pending sync then dropped, subxact rollback
> +     * cancels the uncommitted drop, and commit propagates it to the parent.
> +     */
> +    if (relation->rd_isdropped)
> +    {
> +        Assert (!relation->rd_isvalid &&
> +                (relation->rd_createSubid != InvalidSubTransactionId ||
> +                 relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId));
> +        if (!isCommit)
> +            relation->rd_isdropped = false;

This does the wrong thing when there exists some subtransaction rollback that
does not rollback the DROP:

\pset null 'NULL'
begin;
create extension pg_visibility;
create table droppedtest (c int);
select 'droppedtest'::regclass::oid as oid \gset
savepoint q; drop table droppedtest; release q; -- rd_dropped==true
select * from pg_visibility_map(:oid); -- processes !rd_isvalid rel (not ideal)
savepoint q; select 1; rollback to q; -- rd_dropped==false (wrong)
savepoint q; select 1; rollback to q;
select pg_relation_size(:oid), pg_relation_filepath(:oid),
  has_table_privilege(:oid, 'SELECT'); -- all nulls, okay
select * from pg_visibility_map(:oid); -- assertion failure
rollback;

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

21 января 2020 г., 12:45:57

Thank you for the comment.

At Sat, 18 Jan 2020 19:51:39 -0800, Noah Misch <noah@leadboat.com> wrote in 
> On Tue, Jan 14, 2020 at 07:35:22PM +0900, Kyotaro Horiguchi wrote:
> > At Thu, 26 Dec 2019 12:46:39 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > > At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch <noah@leadboat.com> wrote in 
> > > > === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO
...
> > This is complex than expected. The DROP TABLE unconditionally removed
> > relcache entry. To fix that, I tried to use rd_isinvalid but it failed
> > because there's a state that a relcache invalid but the corresponding
> > catalog entry is alive.
> > 
> > In the attached patch 0002, I added a boolean in relcache that
> > indicates that the relation is already removed in catalog but not
> > committed.
> 
> This design could work, but some if its properties aren't ideal.  For example,
> RelationIdGetRelation() can return a !rd_isvalid relation when the relation
> has been dropped.  What others designs did you consider, if any?

I thought that the entries with rd_isdropped is true cannot be fetched
by other transactions because the relid is not seen there. Still the
same session could do that by repeatedly reindexing or invalidation on
the same relation and I think it is safe because the content of the
entry coudln't be changed and the cache content is reusable. That
being said, it is actually makes things unclear.

I came up with two alternatives. One is a variant of
RelationIdGetRelation for the purpose. The new function
RelationIdGetRelationCache is currently used (only) in ATExecAddIndex
so we could restrict it to return only dropped relations.

Another is another "stashed" relcache, but it seems to make things too
complex.

> On Thu, Jan 16, 2020 at 02:20:57PM +0900, Kyotaro Horiguchi wrote:
> > --- a/src/backend/utils/cache/relcache.c
> > +++ b/src/backend/utils/cache/relcache.c
> > @@ -3114,8 +3153,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
> >       */
> >      if (relation->rd_createSubid != InvalidSubTransactionId)
> >      {
> > -        if (isCommit)
> > -            relation->rd_createSubid = InvalidSubTransactionId;
> > +        relation->rd_createSubid = InvalidSubTransactionId;
> > +
> > +        if (isCommit && !relation->rd_isdropped)
> > +        {} /* Nothing to do */
> 
> What is the purpose of this particular change?  This executes at the end of a
> top-level transaction.  We've already done any necessary syncing, and we're
> clearing any flags that caused WAL skipping.  I think it's no longer
> productive to treat dropped relations differently.

It executes the pending *relcache* drop we should have done in
ATPostAlterTypeCleanup (or in RelationClearRelation) if the newness
flags were false. The entry misses the chance of being removed (then
bloats the relcache) if we don't do that there.  I added a comment
there to exlaining that.

> > @@ -3232,6 +3272,19 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
> >          }
> >      }
> >  
> > +    /*
> > +     * If this relation registered pending sync then dropped, subxact rollback
> > +     * cancels the uncommitted drop, and commit propagates it to the parent.
> > +     */
> > +    if (relation->rd_isdropped)
> > +    {
> > +        Assert (!relation->rd_isvalid &&
> > +                (relation->rd_createSubid != InvalidSubTransactionId ||
> > +                 relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId));
> > +        if (!isCommit)
> > +            relation->rd_isdropped = false;
> 
> This does the wrong thing when there exists some subtransaction rollback that
> does not rollback the DROP:

Sorry for my stupid. I actually thought something like that on the
way. After all, I concluded that the dropped flag ought to behave same
way with rd_createSubid.

> \pset null 'NULL'
> begin;
> create extension pg_visibility;
> create table droppedtest (c int);
> select 'droppedtest'::regclass::oid as oid \gset
> savepoint q; drop table droppedtest; release q; -- rd_dropped==true
> select * from pg_visibility_map(:oid); -- processes !rd_isvalid rel (not ideal)
> savepoint q; select 1; rollback to q; -- rd_dropped==false (wrong)
> savepoint q; select 1; rollback to q;
> select pg_relation_size(:oid), pg_relation_filepath(:oid),
>   has_table_privilege(:oid, 'SELECT'); -- all nulls, okay
> select * from pg_visibility_map(:oid); -- assertion failure
> rollback;

And I teached RelationIdGetRelation not to return dropped
relations. So the (not ideal) cases just fail as before.

Three other fixes not mentined above are made. One is the useless
rd_firstRelfilenodeSubid in the condition to dicide whether to
preserve or not a relcache entry, and the forgotten copying of other
newness flags. Another is the forgotten SWAPFIELD on
rd_dropedSubid. The last is the forgotten change in
out/equal/copyfuncs.

Please find the attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 892ca76faec27a19f5bc17cd21af2a4dd827e56b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v33 1/4] Rework WAL-skipping optimization

While wal_level=minimal we omit WAL-logging for certain some
operations on relfilenodes that are created in the current
transaction. The files are fsynced at commit. The machinery
accelerates bulk-insertion operations but it fails in certain sequence
of operations and a crash just after commit may leave broken table
files.

This patch overhauls the machinery so that WAL-loggings on all
operations are omitted for such relfilenodes. This patch also
introduces a new feature that small files are emitted as a WAL record
instead of syncing. The new GUC variable wal_skip_threshold controls
the threshold.
---
 doc/src/sgml/config.sgml                    |  43 ++-
 doc/src/sgml/perform.sgml                   |  47 +--
 src/backend/access/common/toast_internals.c |   4 +-
 src/backend/access/gist/gistutil.c          |  31 +-
 src/backend/access/gist/gistxlog.c          |  21 ++
 src/backend/access/heap/heapam.c            |  45 +--
 src/backend/access/heap/heapam_handler.c    |  22 +-
 src/backend/access/heap/rewriteheap.c       |  21 +-
 src/backend/access/nbtree/nbtsort.c         |  41 +--
 src/backend/access/rmgrdesc/gistdesc.c      |   5 +
 src/backend/access/transam/README           |  45 ++-
 src/backend/access/transam/xact.c           |  15 +
 src/backend/access/transam/xloginsert.c     |  10 +-
 src/backend/access/transam/xlogutils.c      |  18 +-
 src/backend/catalog/heap.c                  |   4 +
 src/backend/catalog/storage.c               | 248 ++++++++++++-
 src/backend/commands/cluster.c              |  12 +-
 src/backend/commands/copy.c                 |  58 +--
 src/backend/commands/createas.c             |  11 +-
 src/backend/commands/matview.c              |  12 +-
 src/backend/commands/tablecmds.c            |  11 +-
 src/backend/storage/buffer/bufmgr.c         | 125 ++++++-
 src/backend/storage/lmgr/lock.c             |  12 +
 src/backend/storage/smgr/md.c               |  36 +-
 src/backend/storage/smgr/smgr.c             |  35 ++
 src/backend/utils/cache/relcache.c          | 159 +++++++--
 src/backend/utils/misc/guc.c                |  13 +
 src/include/access/gist_private.h           |   2 +
 src/include/access/gistxlog.h               |   1 +
 src/include/access/heapam.h                 |   3 -
 src/include/access/rewriteheap.h            |   2 +-
 src/include/access/tableam.h                |  15 +-
 src/include/catalog/storage.h               |   6 +
 src/include/storage/bufmgr.h                |   4 +
 src/include/storage/lock.h                  |   3 +
 src/include/storage/smgr.h                  |   1 +
 src/include/utils/rel.h                     |  57 ++-
 src/include/utils/relcache.h                |   8 +-
 src/test/recovery/t/018_wal_optimize.pl     | 374 ++++++++++++++++++++
 src/test/regress/expected/alter_table.out   |   6 +
 src/test/regress/sql/alter_table.sql        |   7 +
 41 files changed, 1242 insertions(+), 351 deletions(-)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3ccacd528b..cd5c065de3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2481,21 +2481,14 @@ include_dir 'conf.d'
         levels.  This parameter can only be set at server start.
        </para>
        <para>
-        In <literal>minimal</literal> level, WAL-logging of some bulk
-        operations can be safely skipped, which can make those
-        operations much faster (see <xref linkend="populate-pitr"/>).
-        Operations in which this optimization can be applied include:
-        <simplelist>
-         <member><command>CREATE TABLE AS</command></member>
-         <member><command>CREATE INDEX</command></member>
-         <member><command>CLUSTER</command></member>
-         <member><command>COPY</command> into tables that were created or truncated in the same
-         transaction</member>
-        </simplelist>
-        But minimal WAL does not contain enough information to reconstruct the
-        data from a base backup and the WAL logs, so <literal>replica</literal> or
-        higher must be used to enable WAL archiving
-        (<xref linkend="guc-archive-mode"/>) and streaming replication.
+        In <literal>minimal</literal> level, no information is logged for
+        tables or indexes for the remainder of a transaction that creates or
+        truncates them.  This can make bulk operations much faster (see
+        <xref linkend="populate-pitr"/>).  But minimal WAL does not contain
+        enough information to reconstruct the data from a base backup and the
+        WAL logs, so <literal>replica</literal> or higher must be used to
+        enable WAL archiving (<xref linkend="guc-archive-mode"/>) and
+        streaming replication.
        </para>
        <para>
         In <literal>logical</literal> level, the same information is logged as
@@ -2887,6 +2880,26 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-skip-threshold" xreflabel="wal_skip_threshold">
+      <term><varname>wal_skip_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_skip_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When <varname>wal_level</varname> is <literal>minimal</literal> and a
+        transaction commits after creating or rewriting a permanent table,
+        materialized view, or index, this setting determines how to persist
+        the new data.  If the data is smaller than this setting, write it to
+        the WAL log; otherwise, use an fsync of the data file.  Depending on
+        the properties of your storage, raising or lowering this value might
+        help if such commits are slowing concurrent transactions.  The default
+        is two megabytes (<literal>2MB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-commit-delay" xreflabel="commit_delay">
       <term><varname>commit_delay</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 0f61b0995d..12fda690fa 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1606,8 +1606,8 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
     needs to be written, because in case of an error, the files
     containing the newly loaded data will be removed anyway.
     However, this consideration only applies when
-    <xref linkend="guc-wal-level"/> is <literal>minimal</literal> for
-    non-partitioned tables as all commands must write WAL otherwise.
+    <xref linkend="guc-wal-level"/> is <literal>minimal</literal>
+    as all commands must write WAL otherwise.
    </para>
 
   </sect2>
@@ -1707,42 +1707,13 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
    </para>
 
    <para>
-    Aside from avoiding the time for the archiver or WAL sender to
-    process the WAL data,
-    doing this will actually make certain commands faster, because they
-    are designed not to write WAL at all if <varname>wal_level</varname>
-    is <literal>minimal</literal>.  (They can guarantee crash safety more cheaply
-    by doing an <function>fsync</function> at the end than by writing WAL.)
-    This applies to the following commands:
-    <itemizedlist>
-     <listitem>
-      <para>
-       <command>CREATE TABLE AS SELECT</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>CREATE INDEX</command> (and variants such as
-       <command>ALTER TABLE ADD PRIMARY KEY</command>)
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>ALTER TABLE SET TABLESPACE</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>CLUSTER</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>COPY FROM</command>, when the target table has been
-       created or truncated earlier in the same transaction
-      </para>
-     </listitem>
-    </itemizedlist>
+    Aside from avoiding the time for the archiver or WAL sender to process the
+    WAL data, doing this will actually make certain commands faster, because
+    they do not to write WAL at all if <varname>wal_level</varname>
+    is <literal>minimal</literal> and the current subtransaction (or top-level
+    transaction) created or truncated the table or index they change.  (They
+    can guarantee crash safety more cheaply by doing
+    an <function>fsync</function> at the end than by writing WAL.)
    </para>
   </sect2>
 
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 65801a2a84..25a81e5ec6 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -528,8 +528,8 @@ toast_get_valid_index(Oid toastoid, LOCKMODE lock)
     validIndexOid = RelationGetRelid(toastidxs[validIndex]);
 
     /* Close the toast relation and all its indexes */
-    toast_close_indexes(toastidxs, num_indexes, lock);
-    table_close(toastrel, lock);
+    toast_close_indexes(toastidxs, num_indexes, NoLock);
+    table_close(toastrel, NoLock);
 
     return validIndexOid;
 }
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index dd975b164c..765329bbcd 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1004,23 +1004,44 @@ gistproperty(Oid index_oid, int attno,
 }
 
 /*
- * Temporary and unlogged GiST indexes are not WAL-logged, but we need LSNs
- * to detect concurrent page splits anyway. This function provides a fake
- * sequence of LSNs for that purpose.
+ * Some indexes are not WAL-logged, but we need LSNs to detect concurrent page
+ * splits anyway. This function provides a fake sequence of LSNs for that
+ * purpose.
  */
 XLogRecPtr
 gistGetFakeLSN(Relation rel)
 {
-    static XLogRecPtr counter = FirstNormalUnloggedLSN;
-
     if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
     {
         /*
          * Temporary relations are only accessible in our session, so a simple
          * backend-local counter will do.
          */
+        static XLogRecPtr counter = FirstNormalUnloggedLSN;
+
         return counter++;
     }
+    else if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+    {
+        /*
+         * WAL-logging on this relation will start after commit, so its LSNs
+         * must be distinct numbers smaller than the LSN at the next commit.
+         * Emit a dummy WAL record if insert-LSN hasn't advanced after the
+         * last call.
+         */
+        static XLogRecPtr lastlsn = InvalidXLogRecPtr;
+        XLogRecPtr    currlsn = GetXLogInsertRecPtr();
+
+        /* Shouldn't be called for WAL-logging relations */
+        Assert(!RelationNeedsWAL(rel));
+
+        /* No need for an actual record if we already have a distinct LSN */
+        if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
+            currlsn = gistXLogAssignLSN();
+
+        lastlsn = currlsn;
+        return currlsn;
+    }
     else
     {
         /*
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index d3f3a7b803..b60dba052f 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -449,6 +449,9 @@ gist_redo(XLogReaderState *record)
         case XLOG_GIST_PAGE_DELETE:
             gistRedoPageDelete(record);
             break;
+        case XLOG_GIST_ASSIGN_LSN:
+            /* nop. See gistGetFakeLSN(). */
+            break;
         default:
             elog(PANIC, "gist_redo: unknown op code %u", info);
     }
@@ -592,6 +595,24 @@ gistXLogPageDelete(Buffer buffer, FullTransactionId xid,
     return recptr;
 }
 
+/*
+ * Write an empty XLOG record to assign a distinct LSN.
+ */
+XLogRecPtr
+gistXLogAssignLSN(void)
+{
+    int            dummy = 0;
+
+    /*
+     * Records other than SWITCH_WAL must have content. We use an integer 0 to
+     * follow the restriction.
+     */
+    XLogBeginInsert();
+    XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
+    XLogRegisterData((char *) &dummy, sizeof(dummy));
+    return XLogInsert(RM_GIST_ID, XLOG_GIST_ASSIGN_LSN);
+}
+
 /*
  * Write XLOG record about reuse of a deleted page.
  */
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5ddb6e85e9..92d9da23f7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -21,7 +21,6 @@
  *        heap_multi_insert - insert multiple tuples into a relation
  *        heap_delete        - delete a tuple from a relation
  *        heap_update        - replace a tuple in a relation with another tuple
- *        heap_sync        - sync heap, for when no WAL has been written
  *
  * NOTES
  *      This file contains the heap_ routines which implement
@@ -1936,7 +1935,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (RelationNeedsWAL(relation))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2119,7 +2118,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+    needwal = RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -8920,46 +8919,6 @@ heap2_redo(XLogReaderState *record)
     }
 }
 
-/*
- *    heap_sync        - sync a heap, for use when no WAL has been written
- *
- * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
- *
- * Indexes are not touched.  (Currently, index operations associated with
- * the commands that use this are WAL-logged and so do not need fsync.
- * That behavior might change someday, but in any case it's likely that
- * any fsync decisions required would be per-index and hence not appropriate
- * to be done here.)
- */
-void
-heap_sync(Relation rel)
-{
-    /* non-WAL-logged tables never need fsync */
-    if (!RelationNeedsWAL(rel))
-        return;
-
-    /* main heap */
-    FlushRelationBuffers(rel);
-    /* FlushRelationBuffers will have opened rd_smgr */
-    smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
-
-    /* FSM is not critical, don't bother syncing it */
-
-    /* toast heap, if any */
-    if (OidIsValid(rel->rd_rel->reltoastrelid))
-    {
-        Relation    toastrel;
-
-        toastrel = table_open(rel->rd_rel->reltoastrelid, AccessShareLock);
-        FlushRelationBuffers(toastrel);
-        smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM);
-        table_close(toastrel, AccessShareLock);
-    }
-}
-
 /*
  * Mask a heap page before performing consistency checks on it.
  */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 1f6f6d0ea9..14f939d6b1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -555,17 +555,6 @@ tuple_lock_retry:
     return result;
 }
 
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
-    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway / don't go through tableam)
-     */
-    if (options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(relation);
-}
-
 
 /* ------------------------------------------------------------------------
  * DDL related callbacks for heap AM.
@@ -698,7 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     IndexScanDesc indexScan;
     TableScanDesc tableScan;
     HeapScanDesc heapScan;
-    bool        use_wal;
     bool        is_system_catalog;
     Tuplesortstate *tuplesort;
     TupleDesc    oldTupDesc = RelationGetDescr(OldHeap);
@@ -713,12 +701,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     is_system_catalog = IsSystemRelation(OldHeap);
 
     /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a WAL-logged rel.
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
      */
-    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
-    /* use_wal off requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
 
     /* Preallocate values/isnull arrays */
@@ -728,7 +713,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
     /* Initialize the rewrite operation */
     rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
-                                 *multi_cutoff, use_wal);
+                                 *multi_cutoff);
 
 
     /* Set up sorting if wanted */
@@ -2524,7 +2509,6 @@ static const TableAmRoutine heapam_methods = {
     .tuple_delete = heapam_tuple_delete,
     .tuple_update = heapam_tuple_update,
     .tuple_lock = heapam_tuple_lock,
-    .finish_bulk_insert = heapam_finish_bulk_insert,
 
     .tuple_fetch_row_version = heapam_fetch_row_version,
     .tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 5869922ff8..ba4dab2ba6 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -136,7 +136,6 @@ typedef struct RewriteStateData
     Page        rs_buffer;        /* page currently being built */
     BlockNumber rs_blockno;        /* block where page will go */
     bool        rs_buffer_valid;    /* T if any tuples in buffer */
-    bool        rs_use_wal;        /* must we WAL-log inserts? */
     bool        rs_logical_rewrite; /* do we need to do logical rewriting */
     TransactionId rs_oldest_xmin;    /* oldest xmin used by caller to determine
                                      * tuple visibility */
@@ -230,15 +229,13 @@ static void logical_end_heap_rewrite(RewriteState state);
  * oldest_xmin    xid used by the caller to determine which tuples are dead
  * freeze_xid    xid before which tuples will be frozen
  * cutoff_multi    multixact before which multis will be removed
- * use_wal        should the inserts to the new heap be WAL-logged?
  *
  * Returns an opaque RewriteState, allocated in current memory context,
  * to be used in subsequent calls to the other functions.
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-                   TransactionId freeze_xid, MultiXactId cutoff_multi,
-                   bool use_wal)
+                   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
     RewriteState state;
     MemoryContext rw_cxt;
@@ -263,7 +260,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     /* new_heap needn't be empty, just locked */
     state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
     state->rs_buffer_valid = false;
-    state->rs_use_wal = use_wal;
     state->rs_oldest_xmin = oldest_xmin;
     state->rs_freeze_xid = freeze_xid;
     state->rs_cutoff_multi = cutoff_multi;
@@ -322,7 +318,7 @@ end_heap_rewrite(RewriteState state)
     /* Write the last page, if any */
     if (state->rs_buffer_valid)
     {
-        if (state->rs_use_wal)
+        if (RelationNeedsWAL(state->rs_new_rel))
             log_newpage(&state->rs_new_rel->rd_node,
                         MAIN_FORKNUM,
                         state->rs_blockno,
@@ -337,18 +333,14 @@ end_heap_rewrite(RewriteState state)
     }
 
     /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.
-     *
-     * It's obvious that we must do this when not WAL-logging. It's less
-     * obvious that we have to do it even if we did WAL-log the pages. The
+     * When we WAL-logged rel pages, we must nonetheless fsync them.  The
      * reason is the same as in storage.c's RelationCopyStorage(): we're
      * writing data that's not in shared buffers, and so a CHECKPOINT
      * occurring during the rewriteheap operation won't have fsync'd data we
      * wrote before the checkpoint.
      */
     if (RelationNeedsWAL(state->rs_new_rel))
-        heap_sync(state->rs_new_rel);
+        smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
 
     logical_end_heap_rewrite(state);
 
@@ -646,9 +638,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int            options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
@@ -687,7 +676,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
             /* Doesn't fit, so write out the existing page */
 
             /* XLOG stuff */
-            if (state->rs_use_wal)
+            if (RelationNeedsWAL(state->rs_new_rel))
                 log_newpage(&state->rs_new_rel->rd_node,
                             MAIN_FORKNUM,
                             state->rs_blockno,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f163491d60..77f03ad4fe 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -31,18 +31,6 @@
  * them.  They will need to be re-read into shared buffers on first use after
  * the build finishes.
  *
- * Since the index will never be used unless it is completely built,
- * from a crash-recovery point of view there is no need to WAL-log the
- * steps of the build.  After completing the index build, we can just sync
- * the whole file to disk using smgrimmedsync() before exiting this module.
- * This can be seen to be sufficient for crash recovery by considering that
- * it's effectively equivalent to what would happen if a CHECKPOINT occurred
- * just after the index build.  However, it is clearly not sufficient if the
- * DBA is using the WAL log for PITR or replication purposes, since another
- * machine would not be able to reconstruct the index from WAL.  Therefore,
- * we log the completed index pages to WAL if and only if WAL archiving is
- * active.
- *
  * This code isn't concerned about the FSM at all. The caller is responsible
  * for initializing that.
  *
@@ -563,12 +551,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
     wstate.heap = btspool->heap;
     wstate.index = btspool->index;
     wstate.inskey = _bt_mkscankey(wstate.index, NULL);
-
-    /*
-     * We need to log index creation in WAL iff WAL archiving/streaming is
-     * enabled UNLESS the index isn't WAL-logged anyway.
-     */
-    wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
+    wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
 
     /* reserve the metapage */
     wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
@@ -1265,21 +1248,15 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
     _bt_uppershutdown(wstate, state);
 
     /*
-     * If the index is WAL-logged, we must fsync it down to disk before it's
-     * safe to commit the transaction.  (For a non-WAL-logged index we don't
-     * care since the index will be uninteresting after a crash anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the build. It's
-     * less obvious that we have to do it even if we did WAL-log the index
-     * pages.  The reason is that since we're building outside shared buffers,
-     * a CHECKPOINT occurring during the build has no way to flush the
-     * previously written data to disk (indeed it won't know the index even
-     * exists).  A crash later on would replay WAL from the checkpoint,
-     * therefore it wouldn't replay our earlier WAL entries. If we do not
-     * fsync those pages here, they might still not be on disk when the crash
-     * occurs.
+     * When we WAL-logged index pages, we must nonetheless fsync index files.
+     * Since we're building outside shared buffers, a CHECKPOINT occurring
+     * during the build has no way to flush the previously written data to
+     * disk (indeed it won't know the index even exists).  A crash later on
+     * would replay WAL from the checkpoint, therefore it wouldn't replay our
+     * earlier WAL entries. If we do not fsync those pages here, they might
+     * still not be on disk when the crash occurs.
      */
-    if (RelationNeedsWAL(wstate->index))
+    if (wstate->btws_use_wal)
     {
         RelationOpenSmgr(wstate->index);
         smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index 3377367e12..cfcc8885ea 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -80,6 +80,9 @@ gist_desc(StringInfo buf, XLogReaderState *record)
         case XLOG_GIST_PAGE_DELETE:
             out_gistxlogPageDelete(buf, (gistxlogPageDelete *) rec);
             break;
+        case XLOG_GIST_ASSIGN_LSN:
+            /* No details to write out */
+            break;
     }
 }
 
@@ -104,6 +107,8 @@ gist_identify(uint8 info)
             break;
         case XLOG_GIST_PAGE_DELETE:
             id = "PAGE_DELETE";
+        case XLOG_GIST_ASSIGN_LSN:
+            id = "ASSIGN_LSN";
             break;
     }
 
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index b5a2cb2de8..eb9aac5fd3 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -717,6 +717,38 @@ then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
 
+Skipping WAL for New RelFileNode
+--------------------------------
+
+Under wal_level=minimal, if a change modifies a relfilenode that ROLLBACK
+would unlink, in-tree access methods write no WAL for that change.  Code that
+writes WAL without calling RelationNeedsWAL() must check for this case.  This
+skipping is mandatory.  If a WAL-writing change preceded a WAL-skipping change
+for the same block, REDO could overwrite the WAL-skipping change.  If a
+WAL-writing change followed a WAL-skipping change for the same block, a
+related problem would arise.  When a WAL record contains no full-page image,
+REDO expects the page to match its contents from just before record insertion.
+A WAL-skipping change may not reach disk at all, violating REDO's expectation
+under full_page_writes=off.  For any access method, CommitTransaction() writes
+and fsyncs affected blocks before recording the commit.
+
+Prefer to do the same in future access methods.  However, two other approaches
+can work.  First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync().  Second, an access method can opt to write WAL
+unconditionally for permanent relations.  Under these approaches, the access
+method callbacks must not call functions that react to RelationNeedsWAL().
+
+This applies only to WAL records whose replay would modify bytes stored in the
+new relfilenode.  It does not apply to other records about the relfilenode,
+such as XLOG_SMGR_CREATE.  Because it operates at the level of individual
+relfilenodes, RelationNeedsWAL() can differ for tightly-coupled relations.
+Consider "CREATE TABLE t (); BEGIN; ALTER TABLE t ADD c text; ..." in which
+ALTER TABLE adds a TOAST relation.  The TOAST relation will skip WAL, while
+the table owning it will not.  ALTER TABLE SET TABLESPACE will cause a table
+to skip WAL, but that won't affect its indexes.
+
+
 Asynchronous Commit
 -------------------
 
@@ -820,13 +852,12 @@ Changes to a temp table are not WAL-logged, hence could reach disk in
 advance of T1's commit, but we don't care since temp table contents don't
 survive crashes anyway.
 
-Database writes made via any of the paths we have introduced to avoid WAL
-overhead for bulk updates are also safe.  In these cases it's entirely
-possible for the data to reach disk before T1's commit, because T1 will
-fsync it down to disk without any sort of interlock, as soon as it finishes
-the bulk update.  However, all these paths are designed to write data that
-no other transaction can see until after T1 commits.  The situation is thus
-not different from ordinary WAL-logged updates.
+Database writes that skip WAL for new relfilenodes are also safe.  In these
+cases it's entirely possible for the data to reach disk before T1's commit,
+because T1 will fsync it down to disk without any sort of interlock.  However,
+all these paths are designed to write data that no other transaction can see
+until after T1 commits.  The situation is thus not different from ordinary
+WAL-logged updates.
 
 Transaction Emulation during Recovery
 -------------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 017f03b6d8..118f9d521c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2109,6 +2109,13 @@ CommitTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Synchronize files that are created and not WAL-logged during this
+     * transaction. This must happen before AtEOXact_RelationMap(), so that we
+     * don't see committed-but-broken files after a crash.
+     */
+    smgrDoPendingSyncs(true);
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
@@ -2342,6 +2349,13 @@ PrepareTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Synchronize files that are created and not WAL-logged during this
+     * transaction. This must happen before EndPrepare(), so that we don't see
+     * committed-but-broken files after a crash and COMMIT PREPARED.
+     */
+    smgrDoPendingSyncs(true);
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
@@ -2660,6 +2674,7 @@ AbortTransaction(void)
      */
     AfterTriggerEndXact(false); /* 'false' means it's abort */
     AtAbort_Portals();
+    smgrDoPendingSyncs(false);
     AtEOXact_LargeObject(false);
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2fa0a7f667..a618dec776 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1043,8 +1043,13 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
                   BlockNumber startblk, BlockNumber endblk,
                   bool page_std)
 {
+    int            flags;
     BlockNumber blkno;
 
+    flags = REGBUF_FORCE_IMAGE;
+    if (page_std)
+        flags |= REGBUF_STANDARD;
+
     /*
      * Iterate over all the pages in the range. They are collected into
      * batches of XLR_MAX_BLOCK_ID pages, and a single WAL-record is written
@@ -1066,7 +1071,8 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
         nbufs = 0;
         while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
         {
-            Buffer        buf = ReadBuffer(rel, blkno);
+            Buffer        buf = ReadBufferExtended(rel, forkNum, blkno,
+                                                 RBM_NORMAL, NULL);
 
             LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -1088,7 +1094,7 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
         START_CRIT_SECTION();
         for (i = 0; i < nbufs; i++)
         {
-            XLogRegisterBuffer(i, bufpack[i], REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+            XLogRegisterBuffer(i, bufpack[i], flags);
             MarkBufferDirty(bufpack[i]);
         }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b55c383370..2bbce46041 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -544,6 +544,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry;
  * fields related to physical storage, like rd_rel, are initialized, so the
  * fake entry is only usable in low-level operations like ReadBuffer().
  *
+ * This is also used for syncing WAL-skipped files.
+ *
  * Caller must free the returned entry with FreeFakeRelcacheEntry().
  */
 Relation
@@ -552,18 +554,20 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
     FakeRelCacheEntry fakeentry;
     Relation    rel;
 
-    Assert(InRecovery);
-
     /* Allocate the Relation struct and all related space in one block. */
     fakeentry = palloc0(sizeof(FakeRelCacheEntryData));
     rel = (Relation) fakeentry;
 
     rel->rd_rel = &fakeentry->pgc;
     rel->rd_node = rnode;
-    /* We will never be working with temp rels during recovery */
+
+    /*
+     * We will never be working with temp rels during recovery or while
+     * syncing WAL-skipped files.
+     */
     rel->rd_backend = InvalidBackendId;
 
-    /* It must be a permanent table if we're in recovery. */
+    /* It must be a permanent table here */
     rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT;
 
     /* We don't know the name of the relation; use relfilenode instead */
@@ -572,9 +576,9 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
     /*
      * We set up the lockRelId in case anything tries to lock the dummy
      * relation.  Note that this is fairly bogus since relNode may be
-     * different from the relation's OID.  It shouldn't really matter though,
-     * since we are presumably running by ourselves and can't have any lock
-     * conflicts ...
+     * different from the relation's OID.  It shouldn't really matter though.
+     * In recovery, we are running by ourselves and can't have any lock
+     * conflicts.  While syncing, we already hold AccessExclusiveLock.
      */
     rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
     rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 0fdff2918f..9f58ef1378 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -439,6 +439,10 @@ heap_create(const char *relname,
                 break;
         }
     }
+    else
+    {
+        rel->rd_createSubid = InvalidSubTransactionId;
+    }
 
     return rel;
 }
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index fddfbf1d8c..8253c420ef 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,9 +29,13 @@
 #include "miscadmin.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+/* GUC variables */
+int            wal_skip_threshold = 2048;    /* in kilobytes */
+
 /*
  * We keep a list of all relations (represented as RelFileNode values)
  * that have been created or deleted in the current transaction.  When
@@ -61,7 +65,14 @@ typedef struct PendingRelDelete
     struct PendingRelDelete *next;    /* linked-list link */
 } PendingRelDelete;
 
+typedef struct pendingSync
+{
+    RelFileNode rnode;
+    BlockNumber max_truncated;
+} pendingSync;
+
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+HTAB       *pendingSyncHash = NULL;
 
 /*
  * RelationCreateStorage
@@ -117,6 +128,35 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
+    /*
+     * If the relation needs at-commit sync, we also need to track the maximum
+     * unsynced truncated block; see smgrDoPendingSyncs().
+     */
+    if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+    {
+        pendingSync *pending;
+        bool        found;
+
+        /* we sync only permanent relations */
+        Assert(backend == InvalidBackendId);
+
+        if (!pendingSyncHash)
+        {
+            HASHCTL        ctl;
+
+            ctl.keysize = sizeof(RelFileNode);
+            ctl.entrysize = sizeof(pendingSync);
+            ctl.hcxt = TopTransactionContext;
+            pendingSyncHash =
+                hash_create("max truncated block hash",
+                            16, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+        }
+
+        pending = hash_search(pendingSyncHash, &rnode, HASH_ENTER, &found);
+        Assert(!found);
+        pending->max_truncated = 0;
+    }
+
     return srel;
 }
 
@@ -216,6 +256,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
             prev = pending;
         }
     }
+
+    /* FIXME what to do about pending syncs? */
 }
 
 /*
@@ -275,6 +317,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
         }
     }
 
+    RelationPreTruncate(rel);
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -325,6 +369,34 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
         FreeSpaceMapVacuumRange(rel, nblocks, InvalidBlockNumber);
 }
 
+/*
+ * RelationPreTruncate
+ *        Perform AM-independent work before a physical truncation.
+ *
+ * If an access method's relation_nontransactional_truncate does not call
+ * RelationTruncate(), it must call this before decreasing the table size.
+ */
+void
+RelationPreTruncate(Relation rel)
+{
+    pendingSync *pending;
+
+    if (!pendingSyncHash)
+        return;
+    RelationOpenSmgr(rel);
+
+    /* Record largest maybe-unsynced block of files under tracking  */
+    pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
+                          HASH_FIND, NULL);
+    if (pending)
+    {
+        BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
+
+        if (pending->max_truncated < nblocks)
+            pending->max_truncated = nblocks;
+    }
+}
+
 /*
  * Copy a fork's data, block by block.
  *
@@ -355,7 +427,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 
     /*
      * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a permanent relation.
+     * enabled AND it's a permanent relation.  This gives the same answer as
+     * "RelationNeedsWAL(rel) || copying_initfork", because we know the
+     * current operation created a new relfilenode.
      */
     use_wal = XLogIsNeeded() &&
         (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
@@ -397,24 +471,39 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
     }
 
     /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.  (For a temp or
-     * unlogged rel we don't care since the data will be gone after a crash
-     * anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the copy. It's
-     * less obvious that we have to do it even if we did WAL-log the copied
-     * pages. The reason is that since we're copying outside shared buffers, a
-     * CHECKPOINT occurring during the copy has no way to flush the previously
-     * written data to disk (indeed it won't know the new rel even exists).  A
-     * crash later on would replay WAL from the checkpoint, therefore it
-     * wouldn't replay our earlier WAL entries. If we do not fsync those pages
-     * here, they might still not be on disk when the crash occurs.
+     * When we WAL-logged rel pages, we must nonetheless fsync them.  The
+     * reason is that since we're copying outside shared buffers, a CHECKPOINT
+     * occurring during the copy has no way to flush the previously written
+     * data to disk (indeed it won't know the new rel even exists).  A crash
+     * later on would replay WAL from the checkpoint, therefore it wouldn't
+     * replay our earlier WAL entries. If we do not fsync those pages here,
+     * they might still not be on disk when the crash occurs.
      */
-    if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
+    if (use_wal || copying_initfork)
         smgrimmedsync(dst, forkNum);
 }
 
+/*
+ * RelFileNodeSkippingWAL - check if a BM_PERMANENT relfilenode is using WAL
+ *
+ *   Changes of certain relfilenodes must not write WAL; see "Skipping WAL for
+ *   New RelFileNode" in src/backend/access/transam/README.  Though it is
+ *   known from Relation efficiently, this function is intended for the code
+ *   paths not having access to Relation.
+ */
+bool
+RelFileNodeSkippingWAL(RelFileNode rnode)
+{
+    if (XLogIsNeeded())
+        return false;            /* no permanent relfilenode skips WAL */
+
+    if (!pendingSyncHash ||
+        hash_search(pendingSyncHash, &rnode, HASH_FIND, NULL) == NULL)
+        return false;
+
+    return true;
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -492,6 +581,135 @@ smgrDoPendingDeletes(bool isCommit)
     }
 }
 
+/*
+ *    smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    PendingRelDelete *pending;
+    int            nrels = 0,
+                maxrels = 0;
+    SMgrRelation *srels = NULL;
+    HASH_SEQ_STATUS scan;
+    pendingSync *pendingsync;
+
+    if (XLogIsNeeded())
+        return;                    /* no relation can use this */
+
+    Assert(GetCurrentTransactionNestLevel() == 1);
+
+    if (!pendingSyncHash)
+        return;                    /* no relation needs sync */
+
+    /* Just throw away all pending syncs if any at rollback */
+    if (!isCommit)
+    {
+        pendingSyncHash = NULL;
+        return;
+    }
+
+    AssertPendingSyncs_RelationCache();
+
+    /* Skip syncing nodes that smgrDoPendingDeletes() will delete. */
+    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    {
+        if (!pending->atCommit)
+            continue;
+
+        (void) hash_search(pendingSyncHash, (void *) &pending->relnode,
+                           HASH_REMOVE, NULL);
+    }
+
+    hash_seq_init(&scan, pendingSyncHash);
+    while ((pendingsync = (pendingSync *) hash_seq_search(&scan)))
+    {
+        ForkNumber    fork;
+        BlockNumber nblocks[MAX_FORKNUM + 1];
+        BlockNumber total_blocks = 0;
+        SMgrRelation srel;
+
+        srel = smgropen(pendingsync->rnode, InvalidBackendId);
+
+        /*
+         * We emit newpage WAL records for smaller relations.
+         *
+         * Small WAL records have a chance to be emitted along with other
+         * backends' WAL records.  We emit WAL records instead of syncing for
+         * files that are smaller than a certain threshold, expecting faster
+         * commit.  The threshold is defined by the GUC wal_skip_threshold.
+         */
+        for (fork = 0; fork <= MAX_FORKNUM; fork++)
+        {
+            if (smgrexists(srel, fork))
+            {
+                BlockNumber n = smgrnblocks(srel, fork);
+
+                /* we shouldn't come here for unlogged relations */
+                Assert(fork != INIT_FORKNUM);
+
+                nblocks[fork] = n;
+                total_blocks += n;
+            }
+            else
+                nblocks[fork] = InvalidBlockNumber;
+        }
+
+        /*
+         * Sync file or emit WAL records for its contents.  Do file sync if
+         * the size is larger than the threshold or truncates may have removed
+         * blocks beyond the current size.
+         */
+        if (total_blocks * BLCKSZ / 1024 >= wal_skip_threshold ||
+            nblocks[MAIN_FORKNUM] < pendingsync->max_truncated)
+        {
+            /* allocate the initial array, or extend it, if needed */
+            if (maxrels == 0)
+            {
+                maxrels = 8;
+                srels = palloc(sizeof(SMgrRelation) * maxrels);
+            }
+            else if (maxrels <= nrels)
+            {
+                maxrels *= 2;
+                srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+            }
+
+            srels[nrels++] = srel;
+        }
+        else
+        {
+            /* Emit WAL records for all blocks.  The file is small enough. */
+            for (fork = 0; fork <= MAX_FORKNUM; fork++)
+            {
+                int            n = nblocks[fork];
+                Relation    rel;
+
+                if (!BlockNumberIsValid(n))
+                    continue;
+
+                /*
+                 * Emit WAL for the whole file.  Unfortunately we don't know
+                 * what kind of a page this is, so we have to log the full
+                 * page including any unused space.  ReadBufferExtended()
+                 * counts some pgstat events; unfortunately, we discard them.
+                 */
+                rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node);
+                log_newpage_range(rel, fork, 0, n, false);
+                FreeFakeRelcacheEntry(rel);
+            }
+        }
+    }
+
+    pendingSyncHash = NULL;
+
+    if (nrels > 0)
+    {
+        smgrdosyncall(srels, nrels);
+        pfree(srels);
+    }
+}
+
 /*
  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
  *
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index e9d7a7ff79..b836ccf2d6 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1014,6 +1014,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
                 relfilenode2;
     Oid            swaptemp;
     char        swptmpchr;
+    Relation    rel1;
 
     /* We need writable copies of both pg_class tuples. */
     relRelation = table_open(RelationRelationId, RowExclusiveLock);
@@ -1173,6 +1174,15 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
         CacheInvalidateRelcacheByTuple(reltup2);
     }
 
+    /*
+     * Recognize that rel1's relfilenode (swapped from rel2) is new in this
+     * subtransaction. Since the next step for rel2 is deletion, don't bother
+     * recording the newness of its relfilenode.
+     */
+    rel1 = relation_open(r1, NoLock);
+    RelationAssumeNewRelfilenode(rel1);
+    relation_close(rel1, NoLock);
+
     /*
      * Post alter hook for modified relations. The change to r2 is always
      * internal, but r1 depends on the invocation context.
@@ -1489,7 +1499,7 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 
             /* Get the associated valid index to be renamed */
             toastidx = toast_get_valid_index(newrel->rd_rel->reltoastrelid,
-                                             AccessShareLock);
+                                             NoLock);
 
             /* rename the toast table ... */
             snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u",
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 40a8ec1abd..f9acde56a5 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2712,63 +2712,15 @@ CopyFrom(CopyState cstate)
                             RelationGetRelationName(cstate->rel))));
     }
 
-    /*----------
-     * Check to see if we can avoid writing WAL
-     *
-     * If archive logging/streaming is not enabled *and* either
-     *    - table was created in same transaction as this COPY
-     *    - data is being written to relfilenode created in this transaction
-     * then we can skip writing WAL.  It's safe because if the transaction
-     * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the table_finish_bulk_insert() at
-     * the bottom of this routine first.
-     *
-     * As mentioned in comments in utils/rel.h, the in-same-transaction test
-     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
-     * can be cleared before the end of the transaction. The exact case is
-     * when a relation sets a new relfilenode twice in same transaction, yet
-     * the second one fails in an aborted subtransaction, e.g.
-     *
-     * BEGIN;
-     * TRUNCATE t;
-     * SAVEPOINT save;
-     * TRUNCATE t;
-     * ROLLBACK TO save;
-     * COPY ...
-     *
-     * Also, if the target file is new-in-transaction, we assume that checking
-     * FSM for free space is a waste of time, even if we must use WAL because
-     * of archiving.  This could possibly be wrong, but it's unlikely.
-     *
-     * The comments for table_tuple_insert and RelationGetBufferForTuple
-     * specify that skipping WAL logging is only safe if we ensure that our
-     * tuples do not go into pages containing tuples from any other
-     * transactions --- but this must be the case if we have a new table or
-     * new relfilenode, so we need no additional work to enforce that.
-     *
-     * We currently don't support this optimization if the COPY target is a
-     * partitioned table as we currently only lazily initialize partition
-     * information when routing the first tuple to the partition.  We cannot
-     * know at this stage if we can perform this optimization.  It should be
-     * possible to improve on this, but it does mean maintaining heap insert
-     * option flags per partition and setting them when we first open the
-     * partition.
-     *
-     * This optimization is not supported for relation types which do not
-     * have any physical storage, with foreign tables and views using
-     * INSTEAD OF triggers entering in this category.  Partitioned tables
-     * are not supported as per the description above.
-     *----------
+    /*
+     * If the target file is new-in-transaction, we assume that checking FSM
+     * for free space is a waste of time.  This could possibly be wrong, but
+     * it's unlikely.
      */
-    /* createSubid is creation check, newRelfilenodeSubid is truncation check */
     if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
         (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
-         cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
-    {
+         cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
         ti_options |= TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
-    }
 
     /*
      * Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 9f387b5f5f..fe9a754782 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -552,16 +552,13 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
     myState->rel = intoRelationDesc;
     myState->reladdr = intoRelationAddr;
     myState->output_cid = GetCurrentCommandId(true);
+    myState->ti_options = TABLE_INSERT_SKIP_FSM;
+    myState->bistate = GetBulkInsertState();
 
     /*
-     * We can skip WAL-logging the insertions, unless PITR or streaming
-     * replication is in use. We can skip the FSM in any case.
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
-    myState->bistate = GetBulkInsertState();
-
-    /* Not using WAL requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(intoRelationDesc) == InvalidBlockNumber);
 }
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 1ee37c1aeb..ea1d0fc850 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -457,17 +457,13 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->transientrel = transientrel;
     myState->output_cid = GetCurrentCommandId(true);
-
-    /*
-     * We can skip WAL-logging the insertions, unless PITR or streaming
-     * replication is in use. We can skip the FSM in any case.
-     */
     myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
-    if (!XLogIsNeeded())
-        myState->ti_options |= TABLE_INSERT_SKIP_WAL;
     myState->bistate = GetBulkInsertState();
 
-    /* Not using WAL requires smgr_targblock be initially invalid */
+    /*
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
+     */
     Assert(RelationGetTargetBlock(transientrel) == InvalidBlockNumber);
 }
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 30b72b6297..1b8e38d3aa 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5007,19 +5007,14 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         newrel = NULL;
 
     /*
-     * Prepare a BulkInsertState and options for table_tuple_insert. Because
-     * we're building a new heap, we can skip WAL-logging and fsync it to disk
-     * at the end instead (unless WAL-logging is required for archiving or
-     * streaming replication). The FSM is empty too, so don't bother using it.
+     * Prepare a BulkInsertState and options for table_tuple_insert.  The FSM
+     * is empty, so don't bother using it.
      */
     if (newrel)
     {
         mycid = GetCurrentCommandId(true);
         bistate = GetBulkInsertState();
-
         ti_options = TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
     }
     else
     {
@@ -12901,6 +12896,8 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     table_close(pg_class, RowExclusiveLock);
 
+    RelationAssumeNewRelfilenode(rel);
+
     relation_close(rel, NoLock);
 
     /* Make sure the reltablespace change is visible */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index aba3960481..73c38757fa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -66,7 +66,7 @@
 #define BUF_WRITTEN                0x01
 #define BUF_REUSABLE            0x02
 
-#define DROP_RELS_BSEARCH_THRESHOLD        20
+#define RELS_BSEARCH_THRESHOLD        20
 
 typedef struct PrivateRefCountEntry
 {
@@ -105,6 +105,19 @@ typedef struct CkptTsStatus
     int            index;
 } CkptTsStatus;
 
+/*
+ * Type for array used to sort SMgrRelations
+ *
+ * FlushRelationsAllBuffers shares the same comparator function with
+ * DropRelFileNodesAllBuffers. Pointer to this struct and RelFileNode must be
+ * compatible.
+ */
+typedef struct SMgrSortArray
+{
+    RelFileNode rnode;            /* This must be the first member */
+    SMgrRelation srel;
+} SMgrSortArray;
+
 /* GUC variables */
 bool        zero_damaged_pages = false;
 int            bgwriter_lru_maxpages = 100;
@@ -3043,7 +3056,7 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
      * an exactly determined value, as it depends on many factors (CPU and RAM
      * speeds, amount of shared buffers etc.).
      */
-    use_bsearch = n > DROP_RELS_BSEARCH_THRESHOLD;
+    use_bsearch = n > RELS_BSEARCH_THRESHOLD;
 
     /* sort the list of rnodes if necessary */
     if (use_bsearch)
@@ -3293,6 +3306,104 @@ FlushRelationBuffers(Relation rel)
     }
 }
 
+/* ---------------------------------------------------------------------
+ *        FlushRelationsAllBuffers
+ *
+ *        This function flushes out of the buffer pool all the pages of all
+ *        forks of the specified smgr relations.  It's equivalent to calling
+ *        FlushRelationBuffers once per fork per relation.  The relations are
+ *        assumed not to use local buffers.
+ * --------------------------------------------------------------------
+ */
+void
+FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
+{
+    int            i;
+    SMgrSortArray *srels;
+    bool        use_bsearch;
+
+    if (nrels == 0)
+        return;
+
+    /* fill-in array for qsort */
+    srels = palloc(sizeof(SMgrSortArray) * nrels);
+
+    for (i = 0; i < nrels; i++)
+    {
+        Assert(!RelFileNodeBackendIsTemp(smgrs[i]->smgr_rnode));
+
+        srels[i].rnode = smgrs[i]->smgr_rnode.node;
+        srels[i].srel = smgrs[i];
+    }
+
+    /*
+     * Save the bsearch overhead for low number of relations to sync. See
+     * DropRelFileNodesAllBuffers for details.
+     */
+    use_bsearch = nrels > RELS_BSEARCH_THRESHOLD;
+
+    /* sort the list of SMgrRelations if necessary */
+    if (use_bsearch)
+        pg_qsort(srels, nrels, sizeof(SMgrSortArray), rnode_comparator);
+
+    /* Make sure we can handle the pin inside the loop */
+    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+    for (i = 0; i < NBuffers; i++)
+    {
+        SMgrSortArray *srelent = NULL;
+        BufferDesc *bufHdr = GetBufferDescriptor(i);
+        uint32        buf_state;
+
+        /*
+         * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
+         * and saves some cycles.
+         */
+
+        if (!use_bsearch)
+        {
+            int            j;
+
+            for (j = 0; j < nrels; j++)
+            {
+                if (RelFileNodeEquals(bufHdr->tag.rnode, srels[j].rnode))
+                {
+                    srelent = &srels[j];
+                    break;
+                }
+            }
+
+        }
+        else
+        {
+            srelent = bsearch((const void *) &(bufHdr->tag.rnode),
+                              srels, nrels, sizeof(SMgrSortArray),
+                              rnode_comparator);
+        }
+
+        /* buffer doesn't belong to any of the given relfilenodes; skip it */
+        if (srelent == NULL)
+            continue;
+
+        ReservePrivateRefCountEntry();
+
+        buf_state = LockBufHdr(bufHdr);
+        if (RelFileNodeEquals(bufHdr->tag.rnode, srelent->rnode) &&
+            (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+        {
+            PinBuffer_Locked(bufHdr);
+            LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+            FlushBuffer(bufHdr, srelent->srel);
+            LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+            UnpinBuffer(bufHdr, true);
+        }
+        else
+            UnlockBufHdr(bufHdr, buf_state);
+    }
+
+    pfree(srels);
+}
+
 /* ---------------------------------------------------------------------
  *        FlushDatabaseBuffers
  *
@@ -3494,13 +3605,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
             (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
         {
             /*
-             * If we're in recovery we cannot dirty a page because of a hint.
-             * We can set the hint, just not dirty the page as a result so the
-             * hint is lost when we evict the page or shutdown.
+             * If we must not write WAL, due to a relfilenode-specific
+             * condition or being in recovery, don't dirty the page.  We can
+             * set the hint, just not dirty the page as a result so the hint
+             * is lost when we evict the page or shutdown.
              *
              * See src/backend/storage/page/README for longer discussion.
              */
-            if (RecoveryInProgress())
+            if (RecoveryInProgress() ||
+                RelFileNodeSkippingWAL(bufHdr->tag.rnode))
                 return;
 
             /*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 56dba09299..8f98f665c5 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -587,6 +587,18 @@ LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode)
     return (locallock && locallock->nLocks > 0);
 }
 
+#ifdef USE_ASSERT_CHECKING
+/*
+ * GetLockMethodLocalHash -- return the hash of local locks, for modules that
+ *        evaluate assertions based on all locks held.
+ */
+HTAB *
+GetLockMethodLocalHash(void)
+{
+    return LockMethodLocalHash;
+}
+#endif
+
 /*
  * LockHasWaiters -- look up 'locktag' and check if releasing this
  *        lock would wake up other processes waiting for it.
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 85b7115400..e28c5a49a8 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -248,11 +248,10 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
  * During replay, we would delete the file and then recreate it, which is fine
  * if the contents of the file were repopulated by subsequent WAL entries.
  * But if we didn't WAL-log insertions, but instead relied on fsyncing the
- * file after populating it (as for instance CLUSTER and CREATE INDEX do),
- * the contents of the file would be lost forever.  By leaving the empty file
- * until after the next checkpoint, we prevent reassignment of the relfilenode
- * number until it's safe, because relfilenode assignment skips over any
- * existing file.
+ * file after populating it (as we do at wal_level=minimal), the contents of
+ * the file would be lost forever.  By leaving the empty file until after the
+ * next checkpoint, we prevent reassignment of the relfilenode number until
+ * it's safe, because relfilenode assignment skips over any existing file.
  *
  * We do not need to go through this dance for temp relations, though, because
  * we never make WAL entries for temp rels, and so a temp rel poses no threat
@@ -877,12 +876,18 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
  *    mdimmedsync() -- Immediately sync a relation to stable storage.
  *
  * Note that only writes already issued are synced; this routine knows
- * nothing of dirty buffers that may exist inside the buffer manager.
+ * nothing of dirty buffers that may exist inside the buffer manager.  We
+ * sync active and inactive segments; smgrDoPendingSyncs() relies on this.
+ * Consider a relation skipping WAL.  Suppose a checkpoint syncs blocks of
+ * some segment, then mdtruncate() renders that segment inactive.  If we
+ * crash before the next checkpoint syncs the newly-inactive segment, that
+ * segment may survive recovery, reintroducing unwanted data into the table.
  */
 void
 mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 {
     int            segno;
+    int            min_inactive_seg;
 
     /*
      * NOTE: mdnblocks makes sure we have opened all active segments, so that
@@ -890,7 +895,16 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
      */
     mdnblocks(reln, forknum);
 
-    segno = reln->md_num_open_segs[forknum];
+    min_inactive_seg = segno = reln->md_num_open_segs[forknum];
+
+    /*
+     * Temporarily open inactive segments, then close them after sync.  There
+     * may be some inactive segments left opened after fsync() error, but that
+     * is harmless.  We don't bother to clean them up and take a risk of
+     * further trouble.  The next mdclose() will soon close them.
+     */
+    while (_mdfd_openseg(reln, forknum, segno, 0) != NULL)
+        segno++;
 
     while (segno > 0)
     {
@@ -901,6 +915,14 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
                     (errcode_for_file_access(),
                      errmsg("could not fsync file \"%s\": %m",
                             FilePathName(v->mdfd_vfd))));
+
+        /* Close inactive segments immediately */
+        if (segno > min_inactive_seg)
+        {
+            FileClose(v->mdfd_vfd);
+            _fdvec_resize(reln, forknum, segno - 1);
+        }
+
         segno--;
     }
 }
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 360b5bf5bf..72c9696ad1 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -388,6 +388,41 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
     smgrsw[which].smgr_unlink(rnode, InvalidForkNumber, isRedo);
 }
 
+/*
+ *    smgrdosyncall() -- Immediately sync all forks of all given relations
+ *
+ *        All forks of all given relations are synced out to the store.
+ *
+ *        This is equivalent to FlushRelationBuffers() for each smgr relation,
+ *        then calling smgrimmedsync() for all forks of each relation, but it's
+ *        significantly quicker so should be preferred when possible.
+ */
+void
+smgrdosyncall(SMgrRelation *rels, int nrels)
+{
+    int            i = 0;
+    ForkNumber    forknum;
+
+    if (nrels == 0)
+        return;
+
+    FlushRelationsAllBuffers(rels, nrels);
+
+    /*
+     * Sync the physical file(s).
+     */
+    for (i = 0; i < nrels; i++)
+    {
+        int            which = rels[i]->smgr_which;
+
+        for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+        {
+            if (smgrsw[which].smgr_exists(rels[i], forknum))
+                smgrsw[which].smgr_immedsync(rels[i], forknum);
+        }
+    }
+}
+
 /*
  *    smgrdounlinkall() -- Immediately unlink all forks of all given relations
  *
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index df025a5a30..0ac72572e3 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -257,6 +257,9 @@ static void RelationReloadIndexInfo(Relation relation);
 static void RelationReloadNailed(Relation relation);
 static void RelationFlushRelation(Relation relation);
 static void RememberToFreeTupleDescAtEOX(TupleDesc td);
+#ifdef USE_ASSERT_CHECKING
+static void AssertPendingSyncConsistency(Relation relation);
+#endif
 static void AtEOXact_cleanup(Relation relation, bool isCommit);
 static void AtEOSubXact_cleanup(Relation relation, bool isCommit,
                                 SubTransactionId mySubid, SubTransactionId parentSubid);
@@ -1090,6 +1093,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
     relation->rd_isnailed = false;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     switch (relation->rd_rel->relpersistence)
     {
         case RELPERSISTENCE_UNLOGGED:
@@ -1814,6 +1818,7 @@ formrdesc(const char *relationName, Oid relationReltype,
     relation->rd_isnailed = true;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     relation->rd_backend = InvalidBackendId;
     relation->rd_islocaltemp = false;
 
@@ -2021,6 +2026,7 @@ RelationIdGetRelation(Oid relationId)
     rd = RelationBuildDesc(relationId, true);
     if (RelationIsValid(rd))
         RelationIncrementReferenceCount(rd);
+
     return rd;
 }
 
@@ -2089,7 +2095,7 @@ RelationClose(Relation relation)
 #ifdef RELCACHE_FORCE_RELEASE
     if (RelationHasReferenceCountZero(relation) &&
         relation->rd_createSubid == InvalidSubTransactionId &&
-        relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
         RelationClearRelation(relation, false);
 #endif
 }
@@ -2505,13 +2511,13 @@ RelationClearRelation(Relation relation, bool rebuild)
          * problem.
          *
          * When rebuilding an open relcache entry, we must preserve ref count,
-         * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state.  Also
-         * attempt to preserve the pg_class entry (rd_rel), tupledesc,
-         * rewrite-rule, partition key, and partition descriptor substructures
-         * in place, because various places assume that these structures won't
-         * move while they are working with an open relcache entry.  (Note:
-         * the refcount mechanism for tupledescs might someday allow us to
-         * remove this hack for the tupledesc.)
+         * rd_*Subid, and rd_toastoid state.  Also attempt to preserve the
+         * pg_class entry (rd_rel), tupledesc, rewrite-rule, partition key,
+         * and partition descriptor substructures in place, because various
+         * places assume that these structures won't move while they are
+         * working with an open relcache entry.  (Note:  the refcount
+         * mechanism for tupledescs might someday allow us to remove this hack
+         * for the tupledesc.)
          *
          * Note that this process does not touch CurrentResourceOwner; which
          * is good because whatever ref counts the entry may have do not
@@ -2591,6 +2597,7 @@ RelationClearRelation(Relation relation, bool rebuild)
         /* creation sub-XIDs must be preserved */
         SWAPFIELD(SubTransactionId, rd_createSubid);
         SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+        SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
         /* un-swap rd_rel pointers, swap contents instead */
         SWAPFIELD(Form_pg_class, rd_rel);
         /* ... but actually, we don't have to update newrel->rd_rel */
@@ -2669,12 +2676,12 @@ static void
 RelationFlushRelation(Relation relation)
 {
     if (relation->rd_createSubid != InvalidSubTransactionId ||
-        relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
     {
         /*
          * New relcache entries are always rebuilt, not flushed; else we'd
-         * forget the "new" status of the relation, which is a useful
-         * optimization to have.  Ditto for the new-relfilenode status.
+         * forget the "new" status of the relation.  Ditto for the
+         * new-relfilenode status.
          *
          * The rel could have zero refcnt here, so temporarily increment the
          * refcnt to ensure it's safe to rebuild it.  We can assume that the
@@ -2754,11 +2761,10 @@ RelationCacheInvalidateEntry(Oid relationId)
  *     relation cache and re-read relation mapping data.
  *
  *     This is currently used only to recover from SI message buffer overflow,
- *     so we do not touch new-in-transaction relations; they cannot be targets
- *     of cross-backend SI updates (and our own updates now go through a
- *     separate linked list that isn't limited by the SI message buffer size).
- *     Likewise, we need not discard new-relfilenode-in-transaction hints,
- *     since any invalidation of those would be a local event.
+ *     so we do not touch relations having new-in-transaction relfilenodes; they
+ *     cannot be targets of cross-backend SI updates (and our own updates now go
+ *     through a separate linked list that isn't limited by the SI message
+ *     buffer size).
  *
  *     We do this in two phases: the first pass deletes deletable items, and
  *     the second one rebuilds the rebuildable items.  This is essential for
@@ -2809,7 +2815,7 @@ RelationCacheInvalidate(void)
          * pending invalidations.
          */
         if (relation->rd_createSubid != InvalidSubTransactionId ||
-            relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+            relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
             continue;
 
         relcacheInvalsReceived++;
@@ -2921,6 +2927,78 @@ RememberToFreeTupleDescAtEOX(TupleDesc td)
     EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
 }
 
+#ifdef USE_ASSERT_CHECKING
+static void
+AssertPendingSyncConsistency(Relation relation)
+{
+    bool        relcache_verdict =
+    relation->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&
+    ((relation->rd_createSubid != InvalidSubTransactionId &&
+      RELKIND_HAS_STORAGE(relation->rd_rel->relkind)) ||
+     relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId);
+
+    Assert(relcache_verdict == RelFileNodeSkippingWAL(relation->rd_node));
+}
+
+/*
+ * AssertPendingSyncs_RelationCache
+ *
+ *    Assert that relcache.c and storage.c agree on whether to skip WAL.
+ *
+ * This consistently detects relcache.c skipping WAL while storage.c is not
+ * skipping WAL.  It often fails to detect the reverse error, because
+ * invalidation will have destroyed the relcache entry.  It will detect the
+ * reverse error if something opens the relation after the DDL.
+ */
+void
+AssertPendingSyncs_RelationCache(void)
+{
+    HASH_SEQ_STATUS status;
+    LOCALLOCK  *locallock;
+    Relation   *rels;
+    int            maxrels;
+    int            nrels;
+    RelIdCacheEnt *idhentry;
+    int            i;
+
+    /* open every relation that this transaction has locked */
+    PushActiveSnapshot(GetTransactionSnapshot());
+    maxrels = 1;
+    rels = palloc(maxrels * sizeof(*rels));
+    nrels = 0;
+    hash_seq_init(&status, GetLockMethodLocalHash());
+    while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+    {
+        Oid            relid;
+        Relation    r;
+
+        if (locallock->nLocks <= 0)
+            continue;
+        if ((LockTagType) locallock->tag.lock.locktag_type !=
+            LOCKTAG_RELATION)
+            continue;
+        relid = ObjectIdGetDatum(locallock->tag.lock.locktag_field2);
+        r = RelationIdGetRelation(relid);
+        if (r == NULL)
+            continue;
+        if (nrels >= maxrels)
+        {
+            maxrels *= 2;
+            rels = repalloc(rels, maxrels * sizeof(*rels));
+        }
+        rels[nrels++] = r;
+    }
+
+    hash_seq_init(&status, RelationIdCache);
+    while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+        AssertPendingSyncConsistency(idhentry->reldesc);
+
+    for (i = 0; i < nrels; i++)
+        RelationClose(rels[i]);
+    PopActiveSnapshot();
+}
+#endif
+
 /*
  * AtEOXact_RelationCache
  *
@@ -3032,10 +3110,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
      *
      * During commit, reset the flag to zero, since we are now out of the
      * creating transaction.  During abort, simply delete the relcache entry
-     * --- it isn't interesting any longer.  (NOTE: if we have forgotten the
-     * new-ness of a new relation due to a forced cache flush, the entry will
-     * get deleted anyway by shared-cache-inval processing of the aborted
-     * pg_class insertion.)
+     * --- it isn't interesting any longer.
      */
     if (relation->rd_createSubid != InvalidSubTransactionId)
     {
@@ -3063,9 +3138,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
     }
 
     /*
-     * Likewise, reset the hint about the relfilenode being new.
+     * Likewise, reset any record of the relfilenode being new.
      */
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 }
 
 /*
@@ -3157,7 +3233,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
     }
 
     /*
-     * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+     * Likewise, update or drop any new-relfilenode-in-subtransaction record.
      */
     if (relation->rd_newRelfilenodeSubid == mySubid)
     {
@@ -3166,6 +3242,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         else
             relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     }
+
+    if (relation->rd_firstRelfilenodeSubid == mySubid)
+    {
+        if (isCommit)
+            relation->rd_firstRelfilenodeSubid = parentSubid;
+        else
+            relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    }
 }
 
 
@@ -3255,6 +3339,7 @@ RelationBuildLocalRelation(const char *relname,
     /* it's being created in this transaction */
     rel->rd_createSubid = GetCurrentSubTransactionId();
     rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 
     /*
      * create a new tuple descriptor from the one passed in.  We do this
@@ -3552,14 +3637,29 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
      */
     CommandCounterIncrement();
 
-    /*
-     * Mark the rel as having been given a new relfilenode in the current
-     * (sub) transaction.  This is a hint that can be used to optimize later
-     * operations on the rel in the same transaction.
-     */
+    RelationAssumeNewRelfilenode(relation);
+}
+
+/*
+ * RelationAssumeNewRelfilenode
+ *
+ * Code that modifies pg_class.reltablespace or pg_class.relfilenode must call
+ * this.  The call shall precede any code that might insert WAL records whose
+ * replay would modify bytes in the new RelFileNode, and the call shall follow
+ * any WAL modifying bytes in the prior RelFileNode.  See struct RelationData.
+ * Ideally, call this as near as possible to the CommandCounterIncrement()
+ * that makes the pg_class change visible (before it or after it); that
+ * minimizes the chance of future development adding a forbidden WAL insertion
+ * between RelationAssumeNewRelfilenode() and CommandCounterIncrement().
+ */
+void
+RelationAssumeNewRelfilenode(Relation relation)
+{
     relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+    if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
 
-    /* Flag relation as needing eoxact cleanup (to remove the hint) */
+    /* Flag relation as needing eoxact cleanup (to clear these fields) */
     EOXactListAdd(relation);
 }
 
@@ -5625,6 +5725,7 @@ load_relcache_init_file(bool shared)
         rel->rd_fkeylist = NIL;
         rel->rd_createSubid = InvalidSubTransactionId;
         rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+        rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
         rel->rd_amcache = NULL;
         MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e44f71e991..e69b62e22d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -36,6 +36,7 @@
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
+#include "catalog/storage.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
 #include "commands/trigger.h"
@@ -2674,6 +2675,18 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"wal_skip_threshold", PGC_USERSET, RESOURCES_DISK,
+            gettext_noop("Size of new file to fsync instead of writing WAL."),
+            NULL,
+            GUC_UNIT_KB
+        },
+        &wal_skip_threshold,
+        2048,
+        0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
     {
         {"max_wal_senders", PGC_POSTMASTER, REPLICATION_SENDING,
             gettext_noop("Sets the maximum number of simultaneously running WAL sender processes."),
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 18f2b0d98e..4bfc628000 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -455,6 +455,8 @@ extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
                                 BlockNumber origrlink, GistNSN oldnsn,
                                 Buffer leftchild, bool markfollowright);
 
+extern XLogRecPtr gistXLogAssignLSN(void);
+
 /* gistget.c */
 extern bool gistgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 55fc843d3a..673afee1e1 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -26,6 +26,7 @@
  /* #define XLOG_GIST_INSERT_COMPLETE     0x40 */    /* not used anymore */
  /* #define XLOG_GIST_CREATE_INDEX         0x50 */    /* not used anymore */
 #define XLOG_GIST_PAGE_DELETE        0x60
+#define XLOG_GIST_ASSIGN_LSN        0x70    /* nop, assign new LSN */
 
 /*
  * Backup Blk 0: updated page.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 00a17f5f71..14f096d037 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -31,7 +31,6 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
@@ -168,8 +167,6 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                                HeapTuple tup);
 
-extern void heap_sync(Relation relation);
-
 extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
                                                          ItemPointerData *items,
                                                          int nitems);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index fb2902bd69..e6d7fa1e65 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
                                        TransactionId OldestXmin, TransactionId FreezeXid,
-                                       MultiXactId MultiXactCutoff, bool use_wal);
+                                       MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
                                HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 696451f728..6547099e84 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -127,7 +127,7 @@ typedef struct TM_FailureData
 } TM_FailureData;
 
 /* "options" flag bits for table_tuple_insert */
-#define TABLE_INSERT_SKIP_WAL        0x0001
+/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
 #define TABLE_INSERT_SKIP_FSM        0x0002
 #define TABLE_INSERT_FROZEN            0x0004
 #define TABLE_INSERT_NO_LOGICAL        0x0008
@@ -409,9 +409,8 @@ typedef struct TableAmRoutine
 
     /*
      * Perform operations necessary to complete insertions made via
-     * tuple_insert and multi_insert with a BulkInsertState specified. This
-     * may for example be used to flush the relation, when the
-     * TABLE_INSERT_SKIP_WAL option was used.
+     * tuple_insert and multi_insert with a BulkInsertState specified. In-tree
+     * access methods ceased to use this.
      *
      * Typically callers of tuple_insert and multi_insert will just pass all
      * the flags that apply to them, and each AM has to decide which of them
@@ -1105,10 +1104,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * The options bitmask allows the caller to specify options that may change the
  * behaviour of the AM. The AM will ignore options that it does not support.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.
@@ -1328,9 +1323,7 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
 
 /*
  * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
+ * tuple_insert and multi_insert with a BulkInsertState specified.
  */
 static inline void
 table_finish_bulk_insert(Relation rel, int options)
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 048003c25e..bd37bf311c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,18 +19,24 @@
 #include "storage/smgr.h"
 #include "utils/relcache.h"
 
+/* GUC variables */
+extern int    wal_skip_threshold;
+
 extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
+extern void RelationPreTruncate(Relation rel);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
 extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
                                 ForkNumber forkNum, char relpersistence);
+extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
 
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingSyncs(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 73c7e9ba38..292d440eaf 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -49,6 +49,9 @@ typedef enum
 /* forward declared, to avoid having to expose buf_internals.h here */
 struct WritebackContext;
 
+/* forward declared, to avoid including smgr.h here */
+struct SMgrRelationData;
+
 /* in globals.c ... this duplicates miscadmin.h */
 extern PGDLLIMPORT int NBuffers;
 
@@ -189,6 +192,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                                    ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                                    int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index bb8e4e6e5b..8c180094f0 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -544,6 +544,9 @@ extern void LockReleaseSession(LOCKMETHODID lockmethodid);
 extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
 extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
 extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
+#ifdef USE_ASSERT_CHECKING
+extern HTAB *GetLockMethodLocalHash(void);
+#endif
 extern bool LockHasWaiters(const LOCKTAG *locktag,
                            LOCKMODE lockmode, bool sessionLock);
 extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 243822137c..79dfe0e373 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -89,6 +89,7 @@ extern void smgrcloseall(void);
 extern void smgrclosenode(RelFileNodeBackend rnode);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdounlink(SMgrRelation reln, bool isRedo);
+extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
                        BlockNumber blocknum, char *buffer, bool skipFsync);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 44ed04dd3f..ad72a8b910 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -64,22 +64,40 @@ typedef struct RelationData
                                  * rd_replidindex) */
     bool        rd_statvalid;    /* is rd_statlist valid? */
 
-    /*
+    /*----------
      * rd_createSubid is the ID of the highest subtransaction the rel has
-     * survived into; or zero if the rel was not created in the current top
-     * transaction.  This can be now be relied on, whereas previously it could
-     * be "forgotten" in earlier releases. Likewise, rd_newRelfilenodeSubid is
-     * the ID of the highest subtransaction the relfilenode change has
-     * survived into, or zero if not changed in the current transaction (or we
-     * have forgotten changing it). rd_newRelfilenodeSubid can be forgotten
-     * when a relation has multiple new relfilenodes within a single
-     * transaction, with one of them occurring in a subsequently aborted
-     * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
-     * ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+     * survived into or zero if the rel was not created in the current top
+     * transaction.  rd_firstRelfilenodeSubid is the ID of the highest
+     * subtransaction an rd_node change has survived into or zero if rd_node
+     * matches the value it had at the start of the current top transaction.
+     * (Rolling back the subtransaction that rd_firstRelfilenodeSubid denotes
+     * would restore rd_node to the value it had at the start of the current
+     * top transaction.  Rolling back any lower subtransaction would not.)
+     * Their accuracy is critical to RelationNeedsWAL().
+     *
+     * rd_newRelfilenodeSubid is the ID of the highest subtransaction the
+     * most-recent relfilenode change has survived into or zero if not changed
+     * in the current transaction (or we have forgotten changing it).  This
+     * field is accurate when non-zero, but it can be zero when a relation has
+     * multiple new relfilenodes within a single transaction, with one of them
+     * occurring in a subsequently aborted subtransaction, e.g.
+     *        BEGIN;
+     *        TRUNCATE t;
+     *        SAVEPOINT save;
+     *        TRUNCATE t;
+     *        ROLLBACK TO save;
+     *        -- rd_newRelfilenodeSubid is now forgotten
+     *
+     * These fields are read-only outside relcache.c.  Other files trigger
+     * rd_node changes by updating pg_class.reltablespace and/or
+     * pg_class.relfilenode.  They must call RelationAssumeNewRelfilenode() to
+     * update these fields.
      */
     SubTransactionId rd_createSubid;    /* rel was created in current xact */
-    SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
-                                                 * current xact */
+    SubTransactionId rd_newRelfilenodeSubid;    /* highest subxact changing
+                                                 * rd_node to current value */
+    SubTransactionId rd_firstRelfilenodeSubid;    /* highest subxact changing
+                                                 * rd_node to any value */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
@@ -526,9 +544,16 @@ typedef struct ViewOptions
 /*
  * RelationNeedsWAL
  *        True if relation needs WAL.
- */
-#define RelationNeedsWAL(relation) \
-    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.  See "Skipping WAL for New
+ * RelFileNode" in src/backend/access/transam/README.
+ */
+#define RelationNeedsWAL(relation)                                        \
+    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&    \
+     (XLogIsNeeded() ||                                                    \
+      (relation->rd_createSubid == InvalidSubTransactionId &&            \
+       relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
 
 /*
  * RelationUsesLocalBuffers
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index d77f5beec6..62239a09e8 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -106,9 +106,10 @@ extern Relation RelationBuildLocalRelation(const char *relname,
                                            char relkind);
 
 /*
- * Routine to manage assignment of new relfilenode to a relation
+ * Routines to manage assignment of new relfilenode to a relation
  */
 extern void RelationSetNewRelfilenode(Relation relation, char persistence);
+extern void RelationAssumeNewRelfilenode(Relation relation);
 
 /*
  * Routines for flushing/rebuilding relcache entries in various scenarios
@@ -121,6 +122,11 @@ extern void RelationCacheInvalidate(void);
 
 extern void RelationCloseSmgrByOid(Oid relationId);
 
+#ifdef USE_ASSERT_CHECKING
+extern void AssertPendingSyncs_RelationCache(void);
+#else
+#define AssertPendingSyncs_RelationCache() do {} while (0)
+#endif
 extern void AtEOXact_RelationCache(bool isCommit);
 extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid,
                                       SubTransactionId parentSubid);
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..78d81e12d0
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,374 @@
+# Test WAL replay when some operation has skipped WAL.
+#
+# These tests exercise code that once violated the mandate described in
+# src/backend/access/transam/README section "Skipping WAL for New
+# RelFileNode".  The tests work by committing some transactions, initiating an
+# immediate shutdown, and confirming that the expected data survives recovery.
+# For many years, individual commands made the decision to skip WAL, hence the
+# frequent appearance of COPY in these tests.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 34;
+
+sub check_orphan_relfilenodes
+{
+    my ($node, $test_name) = @_;
+
+    my $db_oid = $node->safe_psql('postgres',
+        "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+    my $prefix               = "base/$db_oid/";
+    my $filepaths_referenced = $node->safe_psql(
+        'postgres', "
+       SELECT pg_relation_filepath(oid) FROM pg_class
+       WHERE reltablespace = 0 AND relpersistence <> 't' AND
+       pg_relation_filepath(oid) IS NOT NULL;");
+    is_deeply(
+        [
+            sort(map { "$prefix$_" }
+                  grep(/^[0-9]+$/, slurp_dir($node->data_dir . "/$prefix")))
+        ],
+        [ sort split /\n/, $filepaths_referenced ],
+        $test_name);
+    return;
+}
+
+# We run this same test suite for both wal_level=minimal and replica.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf(
+        'postgresql.conf', qq(
+wal_level = $wal_level
+max_prepared_transactions = 1
+wal_log_hints = on
+wal_skip_threshold = 0
+#wal_debug = on
+));
+    $node->start;
+
+    # Setup
+    my $tablespace_dir = $node->basedir . '/tablespace_other';
+    mkdir($tablespace_dir);
+    $tablespace_dir = TestLib::perl2host($tablespace_dir);
+    $node->safe_psql('postgres',
+        "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+    # Test direct truncation optimization.  No tuples.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE trunc (id serial PRIMARY KEY);
+        TRUNCATE trunc;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM trunc;");
+    is($result, qq(0), "wal_level = $wal_level, TRUNCATE with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE trunc_ins (id serial PRIMARY KEY);
+        INSERT INTO trunc_ins VALUES (DEFAULT);
+        TRUNCATE trunc_ins;
+        INSERT INTO trunc_ins VALUES (DEFAULT);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM trunc_ins;");
+    is($result, qq(1), "wal_level = $wal_level, TRUNCATE INSERT");
+
+    # Same for prepared transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE twophase (id serial PRIMARY KEY);
+        INSERT INTO twophase VALUES (DEFAULT);
+        TRUNCATE twophase;
+        INSERT INTO twophase VALUES (DEFAULT);
+        PREPARE TRANSACTION 't';
+        COMMIT PREPARED 't';");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM twophase;");
+    is($result, qq(1), "wal_level = $wal_level, TRUNCATE INSERT PREPARE");
+
+    # Same with writing WAL at end of xact, instead of syncing.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        SET wal_skip_threshold = '1TB';
+        BEGIN;
+        CREATE TABLE noskip (id serial PRIMARY KEY);
+        INSERT INTO noskip VALUES (DEFAULT);
+        TRUNCATE noskip;
+        INSERT INTO noskip VALUES (DEFAULT);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM noskip;");
+    is($result, qq(1),
+        "wal_level = $wal_level, TRUNCATE with end-of-xact WAL");
+
+    # Data file for COPY query in subsequent tests
+    my $basedir   = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file(
+        $copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using both INSERT and COPY.  Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE ins_trunc (id serial PRIMARY KEY, id2 int);
+        INSERT INTO ins_trunc VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE ins_trunc;
+        INSERT INTO ins_trunc (id, id2) VALUES (DEFAULT, 10000);
+        COPY ins_trunc FROM '$copy_file' DELIMITER ',';
+        INSERT INTO ins_trunc (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM ins_trunc;");
+    is($result, qq(5), "wal_level = $wal_level, TRUNCATE COPY INSERT");
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after
+    # the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE trunc_copy (id serial PRIMARY KEY, id2 int);
+        INSERT INTO trunc_copy VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE trunc_copy;
+        COPY trunc_copy FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result =
+      $node->safe_psql('postgres', "SELECT count(*) FROM trunc_copy;");
+    is($result, qq(3), "wal_level = $wal_level, TRUNCATE COPY");
+
+    # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE spc_abort (id serial PRIMARY KEY, id2 int);
+        INSERT INTO spc_abort VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE spc_abort;
+        SAVEPOINT s;
+          ALTER TABLE spc_abort SET TABLESPACE other; ROLLBACK TO s;
+        COPY spc_abort FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_abort;");
+    is($result, qq(3),
+        "wal_level = $wal_level, SET TABLESPACE abort subtransaction");
+
+    # in different subtransaction patterns
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE spc_commit (id serial PRIMARY KEY, id2 int);
+        INSERT INTO spc_commit VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE spc_commit;
+        SAVEPOINT s; ALTER TABLE spc_commit SET TABLESPACE other; RELEASE s;
+        COPY spc_commit FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result =
+      $node->safe_psql('postgres', "SELECT count(*) FROM spc_commit;");
+    is($result, qq(3),
+        "wal_level = $wal_level, SET TABLESPACE commit subtransaction");
+
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE spc_nest (id serial PRIMARY KEY, id2 int);
+        INSERT INTO spc_nest VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE spc_nest;
+        SAVEPOINT s;
+            ALTER TABLE spc_nest SET TABLESPACE other;
+            SAVEPOINT s2;
+                ALTER TABLE spc_nest SET TABLESPACE pg_default;
+            ROLLBACK TO s2;
+            SAVEPOINT s2;
+                ALTER TABLE spc_nest SET TABLESPACE pg_default;
+            RELEASE s2;
+        ROLLBACK TO s;
+        COPY spc_nest FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_nest;");
+    is($result, qq(3),
+        "wal_level = $wal_level, SET TABLESPACE nested subtransaction");
+
+    $node->safe_psql(
+        'postgres', "
+        CREATE TABLE spc_hint (id int);
+        INSERT INTO spc_hint VALUES (1);
+        BEGIN;
+        ALTER TABLE spc_hint SET TABLESPACE other;
+        CHECKPOINT;
+        SELECT * FROM spc_hint;  -- set hint bit
+        INSERT INTO spc_hint VALUES (2);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_hint;");
+    is($result, qq(2), "wal_level = $wal_level, SET TABLESPACE, hint bit");
+
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE idx_hint (c int PRIMARY KEY);
+        SAVEPOINT q; INSERT INTO idx_hint VALUES (1); ROLLBACK TO q;
+        CHECKPOINT;
+        INSERT INTO idx_hint VALUES (1);  -- set index hint bit
+        INSERT INTO idx_hint VALUES (2);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->psql('postgres',);
+    my ($ret, $stdout, $stderr) =
+      $node->psql('postgres', "INSERT INTO idx_hint VALUES (2);");
+    is($ret, qq(3), "wal_level = $wal_level, unique index LP_DEAD");
+    like(
+        $stderr,
+        qr/violates unique/,
+        "wal_level = $wal_level, unique index LP_DEAD message");
+
+    # UPDATE touches two buffers for one row.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE upd (id serial PRIMARY KEY, id2 int);
+        INSERT INTO upd (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        COPY upd FROM '$copy_file' DELIMITER ',';
+        UPDATE upd SET id2 = id2 + 1;
+        DELETE FROM upd;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM upd;");
+    is($result, qq(0),
+        "wal_level = $wal_level, UPDATE touches two buffers for one row");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE ins_copy (id serial PRIMARY KEY, id2 int);
+        INSERT INTO ins_copy VALUES (DEFAULT, 1);
+        COPY ins_copy FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM ins_copy;");
+    is($result, qq(4), "wal_level = $wal_level, INSERT COPY");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE ins_trig (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION ins_trig_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO ins_trig
+                VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION ins_trig_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO ins_trig
+                VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER ins_trig_before_row_insert
+          BEFORE INSERT ON ins_trig
+          FOR EACH ROW EXECUTE PROCEDURE ins_trig_before_row_trig();
+        CREATE TRIGGER ins_trig_after_row_insert
+          AFTER INSERT ON ins_trig
+          FOR EACH ROW EXECUTE PROCEDURE ins_trig_after_row_trig();
+        COPY ins_trig FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM ins_trig;");
+    is($result, qq(9), "wal_level = $wal_level, COPY with INSERT triggers");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE trunc_trig (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION trunc_trig_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO trunc_trig VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION trunc_trig_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO trunc_trig VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER trunc_trig_before_stat_truncate
+          BEFORE TRUNCATE ON trunc_trig
+          FOR EACH STATEMENT EXECUTE PROCEDURE trunc_trig_before_stat_trig();
+        CREATE TRIGGER trunc_trig_after_stat_truncate
+          AFTER TRUNCATE ON trunc_trig
+          FOR EACH STATEMENT EXECUTE PROCEDURE trunc_trig_after_stat_trig();
+        INSERT INTO trunc_trig VALUES (DEFAULT, 1);
+        TRUNCATE trunc_trig;
+        COPY trunc_trig FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result =
+      $node->safe_psql('postgres', "SELECT count(*) FROM trunc_trig;");
+    is($result, qq(4),
+        "wal_level = $wal_level, TRUNCATE COPY with TRUNCATE triggers");
+
+    # Test redo of temp table creation.
+    $node->safe_psql(
+        'postgres', "
+        CREATE TEMP TABLE temp (id serial PRIMARY KEY, id2 text);");
+    $node->stop('immediate');
+    $node->start;
+    check_orphan_relfilenodes($node,
+        "wal_level = $wal_level, no orphan relfilenode remains");
+
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 4dd3507c99..57767bab6d 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -1984,6 +1984,12 @@ select * from another;
 (3 rows)
 
 drop table another;
+-- Create an index that skips WAL, then perform a SET DATA TYPE that skips
+-- rewriting the index.
+begin;
+create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
+alter table skip_wal_skip_rewrite_index alter c type varchar(20);
+commit;
 -- table's row type
 create table tab1 (a int, b text);
 create table tab2 (x int, y tab1);
diff --git a/src/test/regress/sql/alter_table.sql b/src/test/regress/sql/alter_table.sql
index a16e4c9a29..e11399f2cd 100644
--- a/src/test/regress/sql/alter_table.sql
+++ b/src/test/regress/sql/alter_table.sql
@@ -1360,6 +1360,13 @@ select * from another;
 
 drop table another;
 
+-- Create an index that skips WAL, then perform a SET DATA TYPE that skips
+-- rewriting the index.
+begin;
+create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
+alter table skip_wal_skip_rewrite_index alter c type varchar(20);
+commit;
+
 -- table's row type
 create table tab1 (a int, b text);
 create table tab2 (x int, y tab1);
-- 
2.23.0

From 11e6a21ddbc847128e204dac6ba0f62734dd141c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 14 Jan 2020 19:24:04 +0900
Subject: [PATCH v33 2/4] Fix the defect 1

Pending sync is lost by the followig sequence.

  begin;
  create table t (c int);
  savepoint q; drop table t; rollback to q;  -- forgets table is skipping wal
  commit;  -- assertion failure

Relcache entry for a dropped relation is deleted right out. On the
other hand we need the newness information holded in the dropped entry
if the subtransaction is rolled back later. So this patch makes
relcache entry preserved after dropping of a relation that any newness
flag is active, so that it is available later in the current
transaction.
---
 src/backend/utils/cache/relcache.c | 138 ++++++++++++++++++++++++++---
 src/include/utils/rel.h            |   2 +
 src/include/utils/relcache.h       |   1 +
 3 files changed, 131 insertions(+), 10 deletions(-)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 0ac72572e3..a223cc0f94 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -18,6 +18,7 @@
  *        RelationCacheInitializePhase2    - initialize shared-catalog entries
  *        RelationCacheInitializePhase3    - finish initializing relcache
  *        RelationIdGetRelation            - get a reldesc by relation id
+ *        RelationIdGetRelationCache        - get a relcache entry by relation id
  *        RelationClose                    - close an open relation
  *
  * NOTES
@@ -1094,6 +1095,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_droppedSubid = InvalidSubTransactionId;
     switch (relation->rd_rel->relpersistence)
     {
         case RELPERSISTENCE_UNLOGGED:
@@ -1991,6 +1993,13 @@ RelationIdGetRelation(Oid relationId)
 
     if (RelationIsValid(rd))
     {
+        /* return NULL for dropped relatoins */
+        if (rd->rd_droppedSubid != InvalidSubTransactionId)
+        {
+            Assert (!rd->rd_isvalid);
+            return NULL;
+        }
+
         RelationIncrementReferenceCount(rd);
         /* revalidate cache entry if necessary */
         if (!rd->rd_isvalid)
@@ -2030,6 +2039,31 @@ RelationIdGetRelation(Oid relationId)
     return rd;
 }
 
+/*
+ * RelationIdGetRelationCache: returns an entry exists in relcache.
+ *
+ * This function returns NULL not building new one if no existing entries
+ * found, and may return an invalid or dropped-but-not-commited entry if any.
+ *
+ * This function is intended to be used to lookup the relcache entriy for a
+ * relation that have been created then dropped in the current transaction.
+ */
+Relation
+RelationIdGetRelationCache(Oid relationId)
+{
+    Relation    rd;
+
+    /* Make sure we're in an xact, even if this ends up being a cache hit */
+    Assert(IsTransactionState());
+
+    RelationIdCacheLookup(relationId, rd);
+
+    if (RelationIsValid(rd))
+        RelationIncrementReferenceCount(rd);
+
+    return rd;
+}
+
 /* ----------------------------------------------------------------
  *                cache invalidation support routines
  * ----------------------------------------------------------------
@@ -2134,10 +2168,11 @@ RelationReloadIndexInfo(Relation relation)
     HeapTuple    pg_class_tuple;
     Form_pg_class relp;
 
-    /* Should be called only for invalidated indexes */
+    /* Should be called only for invalidated living indexes */
     Assert((relation->rd_rel->relkind == RELKIND_INDEX ||
             relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX) &&
-           !relation->rd_isvalid);
+           !relation->rd_isvalid &&
+           relation->rd_droppedSubid == InvalidSubTransactionId);
 
     /* Ensure it's closed at smgr level */
     RelationCloseSmgr(relation);
@@ -2443,7 +2478,8 @@ RelationClearRelation(Relation relation, bool rebuild)
     if ((relation->rd_rel->relkind == RELKIND_INDEX ||
          relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX) &&
         relation->rd_refcnt > 0 &&
-        relation->rd_indexcxt != NULL)
+        relation->rd_indexcxt != NULL &&
+        relation->rd_droppedSubid == InvalidSubTransactionId)
     {
         relation->rd_isvalid = false;    /* needs to be revalidated */
         if (IsTransactionState())
@@ -2462,6 +2498,25 @@ RelationClearRelation(Relation relation, bool rebuild)
      */
     if (!rebuild)
     {
+        /*
+         * The relcache entry is still needed to perform at-commit sync if the
+         * subtransaction aborts later.  Mark the relcache as "dropped" and
+         * leave it live invalid.
+         */
+        if (relation->rd_createSubid != InvalidSubTransactionId)
+        {
+            if (relation->rd_droppedSubid == InvalidSubTransactionId)
+                relation->rd_droppedSubid = GetCurrentSubTransactionId();
+            else
+            {
+                /* shouldn't try to update it */
+                Assert (relation->rd_droppedSubid ==
+                        GetCurrentSubTransactionId());
+            }
+
+            return;
+        }
+
         /* Remove it from the hash table */
         RelationCacheDelete(relation);
 
@@ -2546,6 +2601,26 @@ RelationClearRelation(Relation relation, bool rebuild)
             if (HistoricSnapshotActive())
                 return;
 
+            /*
+             * Although this relation is already dropped from catalog, the
+             * relcache entry is still needed to perform at-commit sync if the
+             * subtransaction aborts later.  Mark the relcache as "dropped" and
+             * leave it live invalid.
+             */
+            if (relation->rd_createSubid != InvalidSubTransactionId)
+            {
+                if (relation->rd_droppedSubid == InvalidSubTransactionId)
+                    relation->rd_droppedSubid = GetCurrentSubTransactionId();
+                else
+                {
+                    /* shouldn't try to update it */
+                    Assert (relation->rd_droppedSubid ==
+                            GetCurrentSubTransactionId());
+                }
+
+                return;
+            }
+
             /*
              * This shouldn't happen as dropping a relation is intended to be
              * impossible if still referenced (cf. CheckTableNotInUse()). But
@@ -2598,6 +2673,7 @@ RelationClearRelation(Relation relation, bool rebuild)
         SWAPFIELD(SubTransactionId, rd_createSubid);
         SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
         SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
+        SWAPFIELD(SubTransactionId, rd_droppedSubid);
         /* un-swap rd_rel pointers, swap contents instead */
         SWAPFIELD(Form_pg_class, rd_rel);
         /* ... but actually, we don't have to update newrel->rd_rel */
@@ -2978,8 +3054,8 @@ AssertPendingSyncs_RelationCache(void)
             LOCKTAG_RELATION)
             continue;
         relid = ObjectIdGetDatum(locallock->tag.lock.locktag_field2);
-        r = RelationIdGetRelation(relid);
-        if (r == NULL)
+        r = RelationIdGetRelationCache(relid);
+        if (!RelationIsValid(r))
             continue;
         if (nrels >= maxrels)
         {
@@ -2991,7 +3067,20 @@ AssertPendingSyncs_RelationCache(void)
 
     hash_seq_init(&status, RelationIdCache);
     while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
-        AssertPendingSyncConsistency(idhentry->reldesc);
+    {
+        Relation r = idhentry->reldesc;
+
+        /* Ignore relcache entries of deleted relations */
+        if (r->rd_droppedSubid != InvalidSubTransactionId)
+        {
+            Assert(!r->rd_isvalid &&
+                   (r->rd_createSubid != InvalidSubTransactionId ||
+                    r->rd_firstRelfilenodeSubid != InvalidSubTransactionId));
+            continue;
+        }
+
+        AssertPendingSyncConsistency(r);
+    }
 
     for (i = 0; i < nrels; i++)
         RelationClose(rels[i]);
@@ -3114,8 +3203,23 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
      */
     if (relation->rd_createSubid != InvalidSubTransactionId)
     {
-        if (isCommit)
-            relation->rd_createSubid = InvalidSubTransactionId;
+        /* dropped subid must be the same with carete subid if any */
+        Assert (relation->rd_droppedSubid == InvalidSubTransactionId ||
+                relation->rd_createSubid == relation->rd_droppedSubid);
+
+        /*
+         * cancel rd_createSubid to let RelationClearRelation drop the relcache
+         * entry.
+         */
+        relation->rd_createSubid = InvalidSubTransactionId;
+
+        /*
+         * Besides at rollback, if we have dropped an in-transaction-created
+         * relation, the corresponding relcache entry is preserved for pending
+         * sync. We must remove such entries at commit.
+         */
+        if (isCommit && relation->rd_droppedSubid == InvalidSubTransactionId)
+        {} /* Nothing to do*/
         else if (RelationHasReferenceCountZero(relation))
         {
             RelationClearRelation(relation, false);
@@ -3131,7 +3235,6 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
              * eventually.  This must be just a WARNING to avoid
              * error-during-error-recovery loops.
              */
-            relation->rd_createSubid = InvalidSubTransactionId;
             elog(WARNING, "cannot remove relcache entry for \"%s\" because it has nonzero refcount",
                  RelationGetRelationName(relation));
         }
@@ -3210,7 +3313,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
      */
     if (relation->rd_createSubid == mySubid)
     {
-        if (isCommit)
+        if (isCommit && relation->rd_droppedSubid != mySubid)
             relation->rd_createSubid = parentSubid;
         else if (RelationHasReferenceCountZero(relation))
         {
@@ -3232,6 +3335,21 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         }
     }
 
+    /*
+     * Is it a relation create then dropped in the current subtransaction?
+     *
+     * If this relation registered pending sync then dropped, subxact rollback
+     * cancels the uncommitted drop, and commit propagates it to the parent.
+     */
+    if (relation->rd_droppedSubid == mySubid)
+    {
+        if (isCommit)
+            relation->rd_droppedSubid = parentSubid;
+        else
+            relation->rd_droppedSubid = InvalidSubTransactionId;
+    }
+
+
     /*
      * Likewise, update or drop any new-relfilenode-in-subtransaction record.
      */
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index ad72a8b910..0b87bc3222 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -98,6 +98,8 @@ typedef struct RelationData
                                                  * rd_node to current value */
     SubTransactionId rd_firstRelfilenodeSubid;    /* highest subxact changing
                                                  * rd_node to any value */
+    SubTransactionId rd_droppedSubid;    /* in-transaction created rel has been
+                                         * dropped */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 62239a09e8..0362b6f6ff 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -37,6 +37,7 @@ typedef Relation *RelationPtr;
  * Routines to open (lookup) and close a relcache entry
  */
 extern Relation RelationIdGetRelation(Oid relationId);
+extern Relation RelationIdGetRelationCache(Oid relationId);
 extern void RelationClose(Relation relation);
 
 /*
-- 
2.23.0

From f631de03776f79b0d47ce49c34b221310616b448 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 21 Jan 2020 18:12:52 +0900
Subject: [PATCH v33 3/4] Fix the defect 2

Pass newness flags to new index relation inherits the old relfilenode
whie ALTER TABLE ALTER TYPE.

The command may reuse existing index and the reused index may be
created in the current transaction.  Pass the information to the
relcache of the new index relation so that pending sync correctly
works. This relies on the relcache preserving feature introduced by
the previos fix.
---
 src/backend/commands/tablecmds.c | 32 +++++++++++++++++++++++++++++---
 src/backend/nodes/copyfuncs.c    |  1 +
 src/backend/nodes/equalfuncs.c   |  1 +
 src/backend/nodes/outfuncs.c     |  1 +
 src/include/nodes/parsenodes.h   |  1 +
 5 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 1b8e38d3aa..c84ba293d4 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -7676,12 +7676,35 @@ ATExecAddIndex(AlteredTableInfo *tab, Relation rel,
      * this index will have scheduled the storage for deletion at commit, so
      * cancel that pending deletion.
      */
+    Assert (OidIsValid(stmt->oldNode) == OidIsValid(stmt->oldRelId));
     if (OidIsValid(stmt->oldNode))
     {
-        Relation    irel = index_open(address.objectId, NoLock);
+        Relation    newirel = index_open(address.objectId, NoLock);
+        Relation    oldirel = RelationIdGetRelationCache(stmt->oldRelId);
 
-        RelationPreserveStorage(irel->rd_node, true);
-        index_close(irel, NoLock);
+        RelationPreserveStorage(newirel->rd_node, true);
+
+        /*
+         * oidirel is valid iff the old relation was created then dropped in
+         * the current transaction.  We need to copy the newness hints other
+         * than rd_droppedSubid corresponding to the reused relfilenode in that
+         * case.
+         */
+        if (oldirel != NULL)
+        {
+            Assert(!oldirel->rd_isvalid &&
+                   oldirel->rd_createSubid != InvalidSubTransactionId &&
+                   oldirel->rd_droppedSubid != InvalidSubTransactionId);
+
+            newirel->rd_createSubid = oldirel->rd_createSubid;
+            newirel->rd_firstRelfilenodeSubid =
+                oldirel->rd_firstRelfilenodeSubid;
+            newirel->rd_newRelfilenodeSubid =
+                oldirel->rd_newRelfilenodeSubid;
+
+            RelationClose(oldirel);
+        }
+        index_close(newirel, NoLock);
     }
 
     return address;
@@ -11960,7 +11983,10 @@ TryReuseIndex(Oid oldId, IndexStmt *stmt)
 
         /* If it's a partitioned index, there is no storage to share. */
         if (irel->rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
+        {
             stmt->oldNode = irel->rd_node.relNode;
+            stmt->oldRelId = irel->rd_id;
+        }
         index_close(irel, NoLock);
     }
 }
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 54ad62bb7f..0e621d74d4 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -3477,6 +3477,7 @@ _copyIndexStmt(const IndexStmt *from)
     COPY_STRING_FIELD(idxcomment);
     COPY_SCALAR_FIELD(indexOid);
     COPY_SCALAR_FIELD(oldNode);
+    COPY_SCALAR_FIELD(oldRelId);
     COPY_SCALAR_FIELD(unique);
     COPY_SCALAR_FIELD(primary);
     COPY_SCALAR_FIELD(isconstraint);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 5b1ba143b1..5740b6890b 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1345,6 +1345,7 @@ _equalIndexStmt(const IndexStmt *a, const IndexStmt *b)
     COMPARE_STRING_FIELD(idxcomment);
     COMPARE_SCALAR_FIELD(indexOid);
     COMPARE_SCALAR_FIELD(oldNode);
+    COMPARE_SCALAR_FIELD(oldRelId);
     COMPARE_SCALAR_FIELD(unique);
     COMPARE_SCALAR_FIELD(primary);
     COMPARE_SCALAR_FIELD(isconstraint);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index d76fae44b8..bcbdb29ccb 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2650,6 +2650,7 @@ _outIndexStmt(StringInfo str, const IndexStmt *node)
     WRITE_STRING_FIELD(idxcomment);
     WRITE_OID_FIELD(indexOid);
     WRITE_OID_FIELD(oldNode);
+    WRITE_OID_FIELD(oldRelId);
     WRITE_BOOL_FIELD(unique);
     WRITE_BOOL_FIELD(primary);
     WRITE_BOOL_FIELD(isconstraint);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index da0706add5..b114d7a772 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2782,6 +2782,7 @@ typedef struct IndexStmt
     char       *idxcomment;        /* comment to apply to index, or NULL */
     Oid            indexOid;        /* OID of an existing index, if any */
     Oid            oldNode;        /* relfilenode of existing storage, if any */
+    Oid            oldRelId;        /* relid of the old index, if any */
     bool        unique;            /* is index unique? */
     bool        primary;        /* is index a primary key? */
     bool        isconstraint;    /* is it for a pkey/unique constraint? */
-- 
2.23.0

From fb3ef683bcce318b65b91ddce4e1740224c08d59 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 16 Jan 2020 13:24:27 +0900
Subject: [PATCH v33 4/4] Fix the defect 3

Force file sync if the file has been truncated.

The previous verision of the patch allowed to choose WAL when main
fork is larger than ever.  But there's a case where FSM fork gets
shorter while main fork is larger than ever.  Force file sync always
when the file has experienced a truncation.
---
 src/backend/catalog/storage.c | 56 +++++++++++++++++++----------------
 1 file changed, 31 insertions(+), 25 deletions(-)

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 8253c420ef..447fb606e5 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -68,7 +68,7 @@ typedef struct PendingRelDelete
 typedef struct pendingSync
 {
     RelFileNode rnode;
-    BlockNumber max_truncated;
+    bool        is_truncated;    /* Has the file experienced truncation? */
 } pendingSync;
 
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
@@ -154,7 +154,7 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
 
         pending = hash_search(pendingSyncHash, &rnode, HASH_ENTER, &found);
         Assert(!found);
-        pending->max_truncated = 0;
+        pending->is_truncated = false;
     }
 
     return srel;
@@ -388,13 +388,7 @@ RelationPreTruncate(Relation rel)
     /* Record largest maybe-unsynced block of files under tracking  */
     pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
                           HASH_FIND, NULL);
-    if (pending)
-    {
-        BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
-
-        if (pending->max_truncated < nblocks)
-            pending->max_truncated = nblocks;
-    }
+    pending->is_truncated = true;
 }
 
 /*
@@ -637,31 +631,43 @@ smgrDoPendingSyncs(bool isCommit)
          * Small WAL records have a chance to be emitted along with other
          * backends' WAL records.  We emit WAL records instead of syncing for
          * files that are smaller than a certain threshold, expecting faster
-         * commit.  The threshold is defined by the GUC wal_skip_threshold.
+         * commit.  The threshold is defined by the GUC wal_skip_threshold.  We
+         * don't bother counting the pages when the file has experienced a
+         * truncation.
          */
-        for (fork = 0; fork <= MAX_FORKNUM; fork++)
+        if (!pendingsync->is_truncated)
         {
-            if (smgrexists(srel, fork))
+            for (fork = 0; fork <= MAX_FORKNUM; fork++)
             {
-                BlockNumber n = smgrnblocks(srel, fork);
-
-                /* we shouldn't come here for unlogged relations */
-                Assert(fork != INIT_FORKNUM);
+                if (smgrexists(srel, fork))
+                {
+                    BlockNumber n = smgrnblocks(srel, fork);
 
-                nblocks[fork] = n;
-                total_blocks += n;
+                    /* we shouldn't come here for unlogged relations */
+                    Assert(fork != INIT_FORKNUM);
+                    nblocks[fork] = n;
+                    total_blocks += n;
+                }
+                else
+                    nblocks[fork] = InvalidBlockNumber;
             }
-            else
-                nblocks[fork] = InvalidBlockNumber;
         }
 
         /*
-         * Sync file or emit WAL records for its contents.  Do file sync if
-         * the size is larger than the threshold or truncates may have removed
-         * blocks beyond the current size.
+         * Sync file or emit WAL records for its contents.
+         *
+         * Alghough we emit WAL record if the file is small enough, do file
+         * sync regardless of the size if the file has experienced a
+         * truncation. It is because the file would be followed by trailing
+         * garbage blocks after a crash recovery if, while a past longer file
+         * had been flushed out, we omitted syncing-out of the file and emit
+         * WAL instead.  You might think that we could choose WAL if the
+         * current main fork is longer than ever, but there's a case where main
+         * fork is longer than ever but FSM fork gets shorter. We don't bother
+         * checking that for every fork.
          */
-        if (total_blocks * BLCKSZ / 1024 >= wal_skip_threshold ||
-            nblocks[MAIN_FORKNUM] < pendingsync->max_truncated)
+        if (pendingsync->is_truncated ||
+            total_blocks * BLCKSZ / 1024 >= wal_skip_threshold)
         {
             /* allocate the initial array, or extend it, if needed */
             if (maxrels == 0)
-- 
2.23.0

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

27 января 2020 г., 07:22:01

Diffing the two latest versions of one patch:
> --- v32-0002-Fix-the-defect-1.patch    2020-01-18 14:32:47.499129940 -0800
> +++ v33-0002-Fix-the-defect-1.patch    2020-01-26 16:23:52.846391035 -0800
> +@@ -2978,8 +3054,8 @@ AssertPendingSyncs_RelationCache(void)
> +             LOCKTAG_RELATION)
> +             continue;
> +         relid = ObjectIdGetDatum(locallock->tag.lock.locktag_field2);
> +-        r = RelationIdGetRelation(relid);
> +-        if (r == NULL)
> ++        r = RelationIdGetRelationCache(relid);

The purpose of this loop is to create relcache entries for rels locked in the
current transaction.  (The "r == NULL" case happens for rels no longer visible
in catalogs.  It is harmless.)  Since RelationIdGetRelationCache() never
creates a relcache entry, calling it defeats that purpose.
RelationIdGetRelation() is the right function to call.

On Tue, Jan 21, 2020 at 06:45:57PM +0900, Kyotaro Horiguchi wrote:
> Three other fixes not mentined above are made. One is the useless
> rd_firstRelfilenodeSubid in the condition to dicide whether to
> preserve or not a relcache entry

It was not useless.  Test case:

  create table t (c int);
  begin;
  alter table t alter c type bigint;  -- sets rd_firstRelfilenodeSubid
  savepoint q; drop table t; rollback to q;  -- forgets rd_firstRelfilenodeSubid
  commit;  -- assertion failure, after s/RelationIdGetRelationCache/RelationIdGetRelation/ discussed above

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

27 января 2020 г., 07:44:13

Thanks!

At Sun, 26 Jan 2020 20:22:01 -0800, Noah Misch <noah@leadboat.com> wrote in 
> Diffing the two latest versions of one patch:
> > --- v32-0002-Fix-the-defect-1.patch    2020-01-18 14:32:47.499129940 -0800
> > +++ v33-0002-Fix-the-defect-1.patch    2020-01-26 16:23:52.846391035 -0800
> > +@@ -2978,8 +3054,8 @@ AssertPendingSyncs_RelationCache(void)
> > +             LOCKTAG_RELATION)
> > +             continue;
> > +         relid = ObjectIdGetDatum(locallock->tag.lock.locktag_field2);
> > +-        r = RelationIdGetRelation(relid);
> > +-        if (r == NULL)
> > ++        r = RelationIdGetRelationCache(relid);
> 
> The purpose of this loop is to create relcache entries for rels locked in the
> current transaction.  (The "r == NULL" case happens for rels no longer visible
> in catalogs.  It is harmless.)  Since RelationIdGetRelationCache() never
> creates a relcache entry, calling it defeats that purpose.
> RelationIdGetRelation() is the right function to call.

I thought that the all required entry exist in the cache but actually
it's safer that recreate dropped caches. Does the following works?

    r = RelationIdGetRelation(relid);
+       /* if not found, fetch a "dropped" entry if any  */
+    if (r == NULL)
+        r = RelationIdGetRelationCache(relid);
    if (r == NULL)
        continue;

> On Tue, Jan 21, 2020 at 06:45:57PM +0900, Kyotaro Horiguchi wrote:
> > Three other fixes not mentined above are made. One is the useless
> > rd_firstRelfilenodeSubid in the condition to dicide whether to
> > preserve or not a relcache entry
> 
> It was not useless.  Test case:
> 
>   create table t (c int);
>   begin;
>   alter table t alter c type bigint;  -- sets rd_firstRelfilenodeSubid
>   savepoint q; drop table t; rollback to q;  -- forgets rd_firstRelfilenodeSubid
>   commit;  -- assertion failure, after s/RelationIdGetRelationCache/RelationIdGetRelation/ discussed above

Mmm? I thought somehow that that relcache entry never be dropped and I
believe I considered that case, of course.  But yes, you're right.

I'll post upated version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

27 января 2020 г., 07:57:00

On Mon, Jan 27, 2020 at 01:44:13PM +0900, Kyotaro Horiguchi wrote:
> At Sun, 26 Jan 2020 20:22:01 -0800, Noah Misch <noah@leadboat.com> wrote in 
> > Diffing the two latest versions of one patch:
> > > --- v32-0002-Fix-the-defect-1.patch    2020-01-18 14:32:47.499129940 -0800
> > > +++ v33-0002-Fix-the-defect-1.patch    2020-01-26 16:23:52.846391035 -0800
> > > +@@ -2978,8 +3054,8 @@ AssertPendingSyncs_RelationCache(void)
> > > +             LOCKTAG_RELATION)
> > > +             continue;
> > > +         relid = ObjectIdGetDatum(locallock->tag.lock.locktag_field2);
> > > +-        r = RelationIdGetRelation(relid);
> > > +-        if (r == NULL)
> > > ++        r = RelationIdGetRelationCache(relid);
> > 
> > The purpose of this loop is to create relcache entries for rels locked in the
> > current transaction.  (The "r == NULL" case happens for rels no longer visible
> > in catalogs.  It is harmless.)  Since RelationIdGetRelationCache() never
> > creates a relcache entry, calling it defeats that purpose.
> > RelationIdGetRelation() is the right function to call.
> 
> I thought that the all required entry exist in the cache but actually
> it's safer that recreate dropped caches. Does the following works?
> 
>     r = RelationIdGetRelation(relid);
> +       /* if not found, fetch a "dropped" entry if any  */
> +    if (r == NULL)
> +        r = RelationIdGetRelationCache(relid);
>     if (r == NULL)
>         continue;

That does not materially change the function's behavior.  Notice that the
function does one thing with "r", which is to call RelationClose(r).  The
function calls RelationIdGetRelation() for its side effects, not for its
return value.

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

27 января 2020 г., 09:08:18

By the way, the previous version looks somewhat different from what I
thought I posted..

At Sun, 26 Jan 2020 20:57:00 -0800, Noah Misch <noah@leadboat.com> wrote in 
> On Mon, Jan 27, 2020 at 01:44:13PM +0900, Kyotaro Horiguchi wrote:
> > > The purpose of this loop is to create relcache entries for rels locked in the
> > > current transaction.  (The "r == NULL" case happens for rels no longer visible
> > > in catalogs.  It is harmless.)  Since RelationIdGetRelationCache() never
> > > creates a relcache entry, calling it defeats that purpose.
> > > RelationIdGetRelation() is the right function to call.
> > 
> > I thought that the all required entry exist in the cache but actually
> > it's safer that recreate dropped caches. Does the following works?
> > 
> >     r = RelationIdGetRelation(relid);
> > +       /* if not found, fetch a "dropped" entry if any  */
> > +    if (r == NULL)
> > +        r = RelationIdGetRelationCache(relid);
> >     if (r == NULL)
> >         continue;
> 
> That does not materially change the function's behavior.  Notice that the
> function does one thing with "r", which is to call RelationClose(r).  The
> function calls RelationIdGetRelation() for its side effects, not for its
> return value.

..Right.  The following loop accesses relcache hash directly and no
need for storing returned r to the array rels..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

27 января 2020 г., 13:28:31

Hello, this is rebased then addressed version.

- Now valid rd_firstRelfilenodeSubid causes drop-pending of relcache
  as well as rd_createSubid.  The oblivion in the last example no
  longer happens.

- Revert the (really) useless change of AssertPendingSyncs_RelationCache.

- Fix several comments. Some of the fixes are just rewording and some
  are related to the first change above.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 91ae812abd0fcfe0172b7bb0ad563d3a7e5dd009 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v34 1/4] Rework WAL-skipping optimization

While wal_level=minimal we omit WAL-logging for certain some
operations on relfilenodes that are created in the current
transaction. The files are fsynced at commit. The machinery
accelerates bulk-insertion operations but it fails in certain sequence
of operations and a crash just after commit may leave broken table
files.

This patch overhauls the machinery so that WAL-loggings on all
operations are omitted for such relfilenodes. This patch also
introduces a new feature that small files are emitted as a WAL record
instead of syncing. The new GUC variable wal_skip_threshold controls
the threshold.
---
 doc/src/sgml/config.sgml                    |  43 ++-
 doc/src/sgml/perform.sgml                   |  47 +--
 src/backend/access/common/toast_internals.c |   4 +-
 src/backend/access/gist/gistutil.c          |  31 +-
 src/backend/access/gist/gistxlog.c          |  21 ++
 src/backend/access/heap/heapam.c            |  45 +--
 src/backend/access/heap/heapam_handler.c    |  22 +-
 src/backend/access/heap/rewriteheap.c       |  21 +-
 src/backend/access/nbtree/nbtsort.c         |  41 +--
 src/backend/access/rmgrdesc/gistdesc.c      |   5 +
 src/backend/access/transam/README           |  45 ++-
 src/backend/access/transam/xact.c           |  15 +
 src/backend/access/transam/xloginsert.c     |  10 +-
 src/backend/access/transam/xlogutils.c      |  18 +-
 src/backend/catalog/heap.c                  |   4 +
 src/backend/catalog/storage.c               | 248 ++++++++++++-
 src/backend/commands/cluster.c              |  12 +-
 src/backend/commands/copy.c                 |  58 +--
 src/backend/commands/createas.c             |  13 +-
 src/backend/commands/matview.c              |  12 +-
 src/backend/commands/tablecmds.c            |  11 +-
 src/backend/storage/buffer/bufmgr.c         | 125 ++++++-
 src/backend/storage/lmgr/lock.c             |  12 +
 src/backend/storage/smgr/md.c               |  36 +-
 src/backend/storage/smgr/smgr.c             |  35 ++
 src/backend/utils/cache/relcache.c          | 161 +++++++--
 src/backend/utils/misc/guc.c                |  13 +
 src/include/access/gist_private.h           |   2 +
 src/include/access/gistxlog.h               |   1 +
 src/include/access/heapam.h                 |   3 -
 src/include/access/rewriteheap.h            |   2 +-
 src/include/access/tableam.h                |  15 +-
 src/include/catalog/storage.h               |   6 +
 src/include/storage/bufmgr.h                |   4 +
 src/include/storage/lock.h                  |   3 +
 src/include/storage/smgr.h                  |   1 +
 src/include/utils/rel.h                     |  55 ++-
 src/include/utils/relcache.h                |   8 +-
 src/test/recovery/t/018_wal_optimize.pl     | 374 ++++++++++++++++++++
 src/test/regress/expected/alter_table.out   |   6 +
 src/test/regress/sql/alter_table.sql        |   7 +
 41 files changed, 1243 insertions(+), 352 deletions(-)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e07dc01e80..ea1c866a15 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2481,21 +2481,14 @@ include_dir 'conf.d'
         levels.  This parameter can only be set at server start.
        </para>
        <para>
-        In <literal>minimal</literal> level, WAL-logging of some bulk
-        operations can be safely skipped, which can make those
-        operations much faster (see <xref linkend="populate-pitr"/>).
-        Operations in which this optimization can be applied include:
-        <simplelist>
-         <member><command>CREATE TABLE AS</command></member>
-         <member><command>CREATE INDEX</command></member>
-         <member><command>CLUSTER</command></member>
-         <member><command>COPY</command> into tables that were created or truncated in the same
-         transaction</member>
-        </simplelist>
-        But minimal WAL does not contain enough information to reconstruct the
-        data from a base backup and the WAL logs, so <literal>replica</literal> or
-        higher must be used to enable WAL archiving
-        (<xref linkend="guc-archive-mode"/>) and streaming replication.
+        In <literal>minimal</literal> level, no information is logged for
+        tables or indexes for the remainder of a transaction that creates or
+        truncates them.  This can make bulk operations much faster (see
+        <xref linkend="populate-pitr"/>).  But minimal WAL does not contain
+        enough information to reconstruct the data from a base backup and the
+        WAL logs, so <literal>replica</literal> or higher must be used to
+        enable WAL archiving (<xref linkend="guc-archive-mode"/>) and
+        streaming replication.
        </para>
        <para>
         In <literal>logical</literal> level, the same information is logged as
@@ -2887,6 +2880,26 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-skip-threshold" xreflabel="wal_skip_threshold">
+      <term><varname>wal_skip_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_skip_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When <varname>wal_level</varname> is <literal>minimal</literal> and a
+        transaction commits after creating or rewriting a permanent table,
+        materialized view, or index, this setting determines how to persist
+        the new data.  If the data is smaller than this setting, write it to
+        the WAL log; otherwise, use an fsync of the data file.  Depending on
+        the properties of your storage, raising or lowering this value might
+        help if such commits are slowing concurrent transactions.  The default
+        is two megabytes (<literal>2MB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-commit-delay" xreflabel="commit_delay">
       <term><varname>commit_delay</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 0f61b0995d..12fda690fa 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1606,8 +1606,8 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
     needs to be written, because in case of an error, the files
     containing the newly loaded data will be removed anyway.
     However, this consideration only applies when
-    <xref linkend="guc-wal-level"/> is <literal>minimal</literal> for
-    non-partitioned tables as all commands must write WAL otherwise.
+    <xref linkend="guc-wal-level"/> is <literal>minimal</literal>
+    as all commands must write WAL otherwise.
    </para>
 
   </sect2>
@@ -1707,42 +1707,13 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
    </para>
 
    <para>
-    Aside from avoiding the time for the archiver or WAL sender to
-    process the WAL data,
-    doing this will actually make certain commands faster, because they
-    are designed not to write WAL at all if <varname>wal_level</varname>
-    is <literal>minimal</literal>.  (They can guarantee crash safety more cheaply
-    by doing an <function>fsync</function> at the end than by writing WAL.)
-    This applies to the following commands:
-    <itemizedlist>
-     <listitem>
-      <para>
-       <command>CREATE TABLE AS SELECT</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>CREATE INDEX</command> (and variants such as
-       <command>ALTER TABLE ADD PRIMARY KEY</command>)
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>ALTER TABLE SET TABLESPACE</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>CLUSTER</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>COPY FROM</command>, when the target table has been
-       created or truncated earlier in the same transaction
-      </para>
-     </listitem>
-    </itemizedlist>
+    Aside from avoiding the time for the archiver or WAL sender to process the
+    WAL data, doing this will actually make certain commands faster, because
+    they do not to write WAL at all if <varname>wal_level</varname>
+    is <literal>minimal</literal> and the current subtransaction (or top-level
+    transaction) created or truncated the table or index they change.  (They
+    can guarantee crash safety more cheaply by doing
+    an <function>fsync</function> at the end than by writing WAL.)
    </para>
   </sect2>
 
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 65801a2a84..25a81e5ec6 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -528,8 +528,8 @@ toast_get_valid_index(Oid toastoid, LOCKMODE lock)
     validIndexOid = RelationGetRelid(toastidxs[validIndex]);
 
     /* Close the toast relation and all its indexes */
-    toast_close_indexes(toastidxs, num_indexes, lock);
-    table_close(toastrel, lock);
+    toast_close_indexes(toastidxs, num_indexes, NoLock);
+    table_close(toastrel, NoLock);
 
     return validIndexOid;
 }
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index dd975b164c..765329bbcd 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1004,23 +1004,44 @@ gistproperty(Oid index_oid, int attno,
 }
 
 /*
- * Temporary and unlogged GiST indexes are not WAL-logged, but we need LSNs
- * to detect concurrent page splits anyway. This function provides a fake
- * sequence of LSNs for that purpose.
+ * Some indexes are not WAL-logged, but we need LSNs to detect concurrent page
+ * splits anyway. This function provides a fake sequence of LSNs for that
+ * purpose.
  */
 XLogRecPtr
 gistGetFakeLSN(Relation rel)
 {
-    static XLogRecPtr counter = FirstNormalUnloggedLSN;
-
     if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
     {
         /*
          * Temporary relations are only accessible in our session, so a simple
          * backend-local counter will do.
          */
+        static XLogRecPtr counter = FirstNormalUnloggedLSN;
+
         return counter++;
     }
+    else if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+    {
+        /*
+         * WAL-logging on this relation will start after commit, so its LSNs
+         * must be distinct numbers smaller than the LSN at the next commit.
+         * Emit a dummy WAL record if insert-LSN hasn't advanced after the
+         * last call.
+         */
+        static XLogRecPtr lastlsn = InvalidXLogRecPtr;
+        XLogRecPtr    currlsn = GetXLogInsertRecPtr();
+
+        /* Shouldn't be called for WAL-logging relations */
+        Assert(!RelationNeedsWAL(rel));
+
+        /* No need for an actual record if we already have a distinct LSN */
+        if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
+            currlsn = gistXLogAssignLSN();
+
+        lastlsn = currlsn;
+        return currlsn;
+    }
     else
     {
         /*
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index d3f3a7b803..b60dba052f 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -449,6 +449,9 @@ gist_redo(XLogReaderState *record)
         case XLOG_GIST_PAGE_DELETE:
             gistRedoPageDelete(record);
             break;
+        case XLOG_GIST_ASSIGN_LSN:
+            /* nop. See gistGetFakeLSN(). */
+            break;
         default:
             elog(PANIC, "gist_redo: unknown op code %u", info);
     }
@@ -592,6 +595,24 @@ gistXLogPageDelete(Buffer buffer, FullTransactionId xid,
     return recptr;
 }
 
+/*
+ * Write an empty XLOG record to assign a distinct LSN.
+ */
+XLogRecPtr
+gistXLogAssignLSN(void)
+{
+    int            dummy = 0;
+
+    /*
+     * Records other than SWITCH_WAL must have content. We use an integer 0 to
+     * follow the restriction.
+     */
+    XLogBeginInsert();
+    XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
+    XLogRegisterData((char *) &dummy, sizeof(dummy));
+    return XLogInsert(RM_GIST_ID, XLOG_GIST_ASSIGN_LSN);
+}
+
 /*
  * Write XLOG record about reuse of a deleted page.
  */
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5ddb6e85e9..92d9da23f7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -21,7 +21,6 @@
  *        heap_multi_insert - insert multiple tuples into a relation
  *        heap_delete        - delete a tuple from a relation
  *        heap_update        - replace a tuple in a relation with another tuple
- *        heap_sync        - sync heap, for when no WAL has been written
  *
  * NOTES
  *      This file contains the heap_ routines which implement
@@ -1936,7 +1935,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (RelationNeedsWAL(relation))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2119,7 +2118,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+    needwal = RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -8920,46 +8919,6 @@ heap2_redo(XLogReaderState *record)
     }
 }
 
-/*
- *    heap_sync        - sync a heap, for use when no WAL has been written
- *
- * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
- *
- * Indexes are not touched.  (Currently, index operations associated with
- * the commands that use this are WAL-logged and so do not need fsync.
- * That behavior might change someday, but in any case it's likely that
- * any fsync decisions required would be per-index and hence not appropriate
- * to be done here.)
- */
-void
-heap_sync(Relation rel)
-{
-    /* non-WAL-logged tables never need fsync */
-    if (!RelationNeedsWAL(rel))
-        return;
-
-    /* main heap */
-    FlushRelationBuffers(rel);
-    /* FlushRelationBuffers will have opened rd_smgr */
-    smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
-
-    /* FSM is not critical, don't bother syncing it */
-
-    /* toast heap, if any */
-    if (OidIsValid(rel->rd_rel->reltoastrelid))
-    {
-        Relation    toastrel;
-
-        toastrel = table_open(rel->rd_rel->reltoastrelid, AccessShareLock);
-        FlushRelationBuffers(toastrel);
-        smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM);
-        table_close(toastrel, AccessShareLock);
-    }
-}
-
 /*
  * Mask a heap page before performing consistency checks on it.
  */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 1f6f6d0ea9..14f939d6b1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -555,17 +555,6 @@ tuple_lock_retry:
     return result;
 }
 
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
-    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway / don't go through tableam)
-     */
-    if (options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(relation);
-}
-
 
 /* ------------------------------------------------------------------------
  * DDL related callbacks for heap AM.
@@ -698,7 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     IndexScanDesc indexScan;
     TableScanDesc tableScan;
     HeapScanDesc heapScan;
-    bool        use_wal;
     bool        is_system_catalog;
     Tuplesortstate *tuplesort;
     TupleDesc    oldTupDesc = RelationGetDescr(OldHeap);
@@ -713,12 +701,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     is_system_catalog = IsSystemRelation(OldHeap);
 
     /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a WAL-logged rel.
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
      */
-    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
-    /* use_wal off requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
 
     /* Preallocate values/isnull arrays */
@@ -728,7 +713,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
     /* Initialize the rewrite operation */
     rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
-                                 *multi_cutoff, use_wal);
+                                 *multi_cutoff);
 
 
     /* Set up sorting if wanted */
@@ -2524,7 +2509,6 @@ static const TableAmRoutine heapam_methods = {
     .tuple_delete = heapam_tuple_delete,
     .tuple_update = heapam_tuple_update,
     .tuple_lock = heapam_tuple_lock,
-    .finish_bulk_insert = heapam_finish_bulk_insert,
 
     .tuple_fetch_row_version = heapam_fetch_row_version,
     .tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 5869922ff8..ba4dab2ba6 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -136,7 +136,6 @@ typedef struct RewriteStateData
     Page        rs_buffer;        /* page currently being built */
     BlockNumber rs_blockno;        /* block where page will go */
     bool        rs_buffer_valid;    /* T if any tuples in buffer */
-    bool        rs_use_wal;        /* must we WAL-log inserts? */
     bool        rs_logical_rewrite; /* do we need to do logical rewriting */
     TransactionId rs_oldest_xmin;    /* oldest xmin used by caller to determine
                                      * tuple visibility */
@@ -230,15 +229,13 @@ static void logical_end_heap_rewrite(RewriteState state);
  * oldest_xmin    xid used by the caller to determine which tuples are dead
  * freeze_xid    xid before which tuples will be frozen
  * cutoff_multi    multixact before which multis will be removed
- * use_wal        should the inserts to the new heap be WAL-logged?
  *
  * Returns an opaque RewriteState, allocated in current memory context,
  * to be used in subsequent calls to the other functions.
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-                   TransactionId freeze_xid, MultiXactId cutoff_multi,
-                   bool use_wal)
+                   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
     RewriteState state;
     MemoryContext rw_cxt;
@@ -263,7 +260,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     /* new_heap needn't be empty, just locked */
     state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
     state->rs_buffer_valid = false;
-    state->rs_use_wal = use_wal;
     state->rs_oldest_xmin = oldest_xmin;
     state->rs_freeze_xid = freeze_xid;
     state->rs_cutoff_multi = cutoff_multi;
@@ -322,7 +318,7 @@ end_heap_rewrite(RewriteState state)
     /* Write the last page, if any */
     if (state->rs_buffer_valid)
     {
-        if (state->rs_use_wal)
+        if (RelationNeedsWAL(state->rs_new_rel))
             log_newpage(&state->rs_new_rel->rd_node,
                         MAIN_FORKNUM,
                         state->rs_blockno,
@@ -337,18 +333,14 @@ end_heap_rewrite(RewriteState state)
     }
 
     /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.
-     *
-     * It's obvious that we must do this when not WAL-logging. It's less
-     * obvious that we have to do it even if we did WAL-log the pages. The
+     * When we WAL-logged rel pages, we must nonetheless fsync them.  The
      * reason is the same as in storage.c's RelationCopyStorage(): we're
      * writing data that's not in shared buffers, and so a CHECKPOINT
      * occurring during the rewriteheap operation won't have fsync'd data we
      * wrote before the checkpoint.
      */
     if (RelationNeedsWAL(state->rs_new_rel))
-        heap_sync(state->rs_new_rel);
+        smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
 
     logical_end_heap_rewrite(state);
 
@@ -646,9 +638,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int            options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
@@ -687,7 +676,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
             /* Doesn't fit, so write out the existing page */
 
             /* XLOG stuff */
-            if (state->rs_use_wal)
+            if (RelationNeedsWAL(state->rs_new_rel))
                 log_newpage(&state->rs_new_rel->rd_node,
                             MAIN_FORKNUM,
                             state->rs_blockno,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f163491d60..77f03ad4fe 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -31,18 +31,6 @@
  * them.  They will need to be re-read into shared buffers on first use after
  * the build finishes.
  *
- * Since the index will never be used unless it is completely built,
- * from a crash-recovery point of view there is no need to WAL-log the
- * steps of the build.  After completing the index build, we can just sync
- * the whole file to disk using smgrimmedsync() before exiting this module.
- * This can be seen to be sufficient for crash recovery by considering that
- * it's effectively equivalent to what would happen if a CHECKPOINT occurred
- * just after the index build.  However, it is clearly not sufficient if the
- * DBA is using the WAL log for PITR or replication purposes, since another
- * machine would not be able to reconstruct the index from WAL.  Therefore,
- * we log the completed index pages to WAL if and only if WAL archiving is
- * active.
- *
  * This code isn't concerned about the FSM at all. The caller is responsible
  * for initializing that.
  *
@@ -563,12 +551,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
     wstate.heap = btspool->heap;
     wstate.index = btspool->index;
     wstate.inskey = _bt_mkscankey(wstate.index, NULL);
-
-    /*
-     * We need to log index creation in WAL iff WAL archiving/streaming is
-     * enabled UNLESS the index isn't WAL-logged anyway.
-     */
-    wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
+    wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
 
     /* reserve the metapage */
     wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
@@ -1265,21 +1248,15 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
     _bt_uppershutdown(wstate, state);
 
     /*
-     * If the index is WAL-logged, we must fsync it down to disk before it's
-     * safe to commit the transaction.  (For a non-WAL-logged index we don't
-     * care since the index will be uninteresting after a crash anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the build. It's
-     * less obvious that we have to do it even if we did WAL-log the index
-     * pages.  The reason is that since we're building outside shared buffers,
-     * a CHECKPOINT occurring during the build has no way to flush the
-     * previously written data to disk (indeed it won't know the index even
-     * exists).  A crash later on would replay WAL from the checkpoint,
-     * therefore it wouldn't replay our earlier WAL entries. If we do not
-     * fsync those pages here, they might still not be on disk when the crash
-     * occurs.
+     * When we WAL-logged index pages, we must nonetheless fsync index files.
+     * Since we're building outside shared buffers, a CHECKPOINT occurring
+     * during the build has no way to flush the previously written data to
+     * disk (indeed it won't know the index even exists).  A crash later on
+     * would replay WAL from the checkpoint, therefore it wouldn't replay our
+     * earlier WAL entries. If we do not fsync those pages here, they might
+     * still not be on disk when the crash occurs.
      */
-    if (RelationNeedsWAL(wstate->index))
+    if (wstate->btws_use_wal)
     {
         RelationOpenSmgr(wstate->index);
         smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index 3377367e12..cfcc8885ea 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -80,6 +80,9 @@ gist_desc(StringInfo buf, XLogReaderState *record)
         case XLOG_GIST_PAGE_DELETE:
             out_gistxlogPageDelete(buf, (gistxlogPageDelete *) rec);
             break;
+        case XLOG_GIST_ASSIGN_LSN:
+            /* No details to write out */
+            break;
     }
 }
 
@@ -104,6 +107,8 @@ gist_identify(uint8 info)
             break;
         case XLOG_GIST_PAGE_DELETE:
             id = "PAGE_DELETE";
+        case XLOG_GIST_ASSIGN_LSN:
+            id = "ASSIGN_LSN";
             break;
     }
 
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index b5a2cb2de8..eb9aac5fd3 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -717,6 +717,38 @@ then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
 
+Skipping WAL for New RelFileNode
+--------------------------------
+
+Under wal_level=minimal, if a change modifies a relfilenode that ROLLBACK
+would unlink, in-tree access methods write no WAL for that change.  Code that
+writes WAL without calling RelationNeedsWAL() must check for this case.  This
+skipping is mandatory.  If a WAL-writing change preceded a WAL-skipping change
+for the same block, REDO could overwrite the WAL-skipping change.  If a
+WAL-writing change followed a WAL-skipping change for the same block, a
+related problem would arise.  When a WAL record contains no full-page image,
+REDO expects the page to match its contents from just before record insertion.
+A WAL-skipping change may not reach disk at all, violating REDO's expectation
+under full_page_writes=off.  For any access method, CommitTransaction() writes
+and fsyncs affected blocks before recording the commit.
+
+Prefer to do the same in future access methods.  However, two other approaches
+can work.  First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync().  Second, an access method can opt to write WAL
+unconditionally for permanent relations.  Under these approaches, the access
+method callbacks must not call functions that react to RelationNeedsWAL().
+
+This applies only to WAL records whose replay would modify bytes stored in the
+new relfilenode.  It does not apply to other records about the relfilenode,
+such as XLOG_SMGR_CREATE.  Because it operates at the level of individual
+relfilenodes, RelationNeedsWAL() can differ for tightly-coupled relations.
+Consider "CREATE TABLE t (); BEGIN; ALTER TABLE t ADD c text; ..." in which
+ALTER TABLE adds a TOAST relation.  The TOAST relation will skip WAL, while
+the table owning it will not.  ALTER TABLE SET TABLESPACE will cause a table
+to skip WAL, but that won't affect its indexes.
+
+
 Asynchronous Commit
 -------------------
 
@@ -820,13 +852,12 @@ Changes to a temp table are not WAL-logged, hence could reach disk in
 advance of T1's commit, but we don't care since temp table contents don't
 survive crashes anyway.
 
-Database writes made via any of the paths we have introduced to avoid WAL
-overhead for bulk updates are also safe.  In these cases it's entirely
-possible for the data to reach disk before T1's commit, because T1 will
-fsync it down to disk without any sort of interlock, as soon as it finishes
-the bulk update.  However, all these paths are designed to write data that
-no other transaction can see until after T1 commits.  The situation is thus
-not different from ordinary WAL-logged updates.
+Database writes that skip WAL for new relfilenodes are also safe.  In these
+cases it's entirely possible for the data to reach disk before T1's commit,
+because T1 will fsync it down to disk without any sort of interlock.  However,
+all these paths are designed to write data that no other transaction can see
+until after T1 commits.  The situation is thus not different from ordinary
+WAL-logged updates.
 
 Transaction Emulation during Recovery
 -------------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 017f03b6d8..118f9d521c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2109,6 +2109,13 @@ CommitTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Synchronize files that are created and not WAL-logged during this
+     * transaction. This must happen before AtEOXact_RelationMap(), so that we
+     * don't see committed-but-broken files after a crash.
+     */
+    smgrDoPendingSyncs(true);
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
@@ -2342,6 +2349,13 @@ PrepareTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Synchronize files that are created and not WAL-logged during this
+     * transaction. This must happen before EndPrepare(), so that we don't see
+     * committed-but-broken files after a crash and COMMIT PREPARED.
+     */
+    smgrDoPendingSyncs(true);
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
@@ -2660,6 +2674,7 @@ AbortTransaction(void)
      */
     AfterTriggerEndXact(false); /* 'false' means it's abort */
     AtAbort_Portals();
+    smgrDoPendingSyncs(false);
     AtEOXact_LargeObject(false);
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2fa0a7f667..a618dec776 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1043,8 +1043,13 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
                   BlockNumber startblk, BlockNumber endblk,
                   bool page_std)
 {
+    int            flags;
     BlockNumber blkno;
 
+    flags = REGBUF_FORCE_IMAGE;
+    if (page_std)
+        flags |= REGBUF_STANDARD;
+
     /*
      * Iterate over all the pages in the range. They are collected into
      * batches of XLR_MAX_BLOCK_ID pages, and a single WAL-record is written
@@ -1066,7 +1071,8 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
         nbufs = 0;
         while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk)
         {
-            Buffer        buf = ReadBuffer(rel, blkno);
+            Buffer        buf = ReadBufferExtended(rel, forkNum, blkno,
+                                                 RBM_NORMAL, NULL);
 
             LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -1088,7 +1094,7 @@ log_newpage_range(Relation rel, ForkNumber forkNum,
         START_CRIT_SECTION();
         for (i = 0; i < nbufs; i++)
         {
-            XLogRegisterBuffer(i, bufpack[i], REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+            XLogRegisterBuffer(i, bufpack[i], flags);
             MarkBufferDirty(bufpack[i]);
         }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b217ffa52f..6cb143e161 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -549,6 +549,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry;
  * fields related to physical storage, like rd_rel, are initialized, so the
  * fake entry is only usable in low-level operations like ReadBuffer().
  *
+ * This is also used for syncing WAL-skipped files.
+ *
  * Caller must free the returned entry with FreeFakeRelcacheEntry().
  */
 Relation
@@ -557,18 +559,20 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
     FakeRelCacheEntry fakeentry;
     Relation    rel;
 
-    Assert(InRecovery);
-
     /* Allocate the Relation struct and all related space in one block. */
     fakeentry = palloc0(sizeof(FakeRelCacheEntryData));
     rel = (Relation) fakeentry;
 
     rel->rd_rel = &fakeentry->pgc;
     rel->rd_node = rnode;
-    /* We will never be working with temp rels during recovery */
+
+    /*
+     * We will never be working with temp rels during recovery or while
+     * syncing WAL-skipped files.
+     */
     rel->rd_backend = InvalidBackendId;
 
-    /* It must be a permanent table if we're in recovery. */
+    /* It must be a permanent table here */
     rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT;
 
     /* We don't know the name of the relation; use relfilenode instead */
@@ -577,9 +581,9 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
     /*
      * We set up the lockRelId in case anything tries to lock the dummy
      * relation.  Note that this is fairly bogus since relNode may be
-     * different from the relation's OID.  It shouldn't really matter though,
-     * since we are presumably running by ourselves and can't have any lock
-     * conflicts ...
+     * different from the relation's OID.  It shouldn't really matter though.
+     * In recovery, we are running by ourselves and can't have any lock
+     * conflicts.  While syncing, we already hold AccessExclusiveLock.
      */
     rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
     rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 0fdff2918f..9f58ef1378 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -439,6 +439,10 @@ heap_create(const char *relname,
                 break;
         }
     }
+    else
+    {
+        rel->rd_createSubid = InvalidSubTransactionId;
+    }
 
     return rel;
 }
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index fddfbf1d8c..8253c420ef 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,9 +29,13 @@
 #include "miscadmin.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+/* GUC variables */
+int            wal_skip_threshold = 2048;    /* in kilobytes */
+
 /*
  * We keep a list of all relations (represented as RelFileNode values)
  * that have been created or deleted in the current transaction.  When
@@ -61,7 +65,14 @@ typedef struct PendingRelDelete
     struct PendingRelDelete *next;    /* linked-list link */
 } PendingRelDelete;
 
+typedef struct pendingSync
+{
+    RelFileNode rnode;
+    BlockNumber max_truncated;
+} pendingSync;
+
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+HTAB       *pendingSyncHash = NULL;
 
 /*
  * RelationCreateStorage
@@ -117,6 +128,35 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
+    /*
+     * If the relation needs at-commit sync, we also need to track the maximum
+     * unsynced truncated block; see smgrDoPendingSyncs().
+     */
+    if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+    {
+        pendingSync *pending;
+        bool        found;
+
+        /* we sync only permanent relations */
+        Assert(backend == InvalidBackendId);
+
+        if (!pendingSyncHash)
+        {
+            HASHCTL        ctl;
+
+            ctl.keysize = sizeof(RelFileNode);
+            ctl.entrysize = sizeof(pendingSync);
+            ctl.hcxt = TopTransactionContext;
+            pendingSyncHash =
+                hash_create("max truncated block hash",
+                            16, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+        }
+
+        pending = hash_search(pendingSyncHash, &rnode, HASH_ENTER, &found);
+        Assert(!found);
+        pending->max_truncated = 0;
+    }
+
     return srel;
 }
 
@@ -216,6 +256,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
             prev = pending;
         }
     }
+
+    /* FIXME what to do about pending syncs? */
 }
 
 /*
@@ -275,6 +317,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
         }
     }
 
+    RelationPreTruncate(rel);
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -325,6 +369,34 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
         FreeSpaceMapVacuumRange(rel, nblocks, InvalidBlockNumber);
 }
 
+/*
+ * RelationPreTruncate
+ *        Perform AM-independent work before a physical truncation.
+ *
+ * If an access method's relation_nontransactional_truncate does not call
+ * RelationTruncate(), it must call this before decreasing the table size.
+ */
+void
+RelationPreTruncate(Relation rel)
+{
+    pendingSync *pending;
+
+    if (!pendingSyncHash)
+        return;
+    RelationOpenSmgr(rel);
+
+    /* Record largest maybe-unsynced block of files under tracking  */
+    pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
+                          HASH_FIND, NULL);
+    if (pending)
+    {
+        BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
+
+        if (pending->max_truncated < nblocks)
+            pending->max_truncated = nblocks;
+    }
+}
+
 /*
  * Copy a fork's data, block by block.
  *
@@ -355,7 +427,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 
     /*
      * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a permanent relation.
+     * enabled AND it's a permanent relation.  This gives the same answer as
+     * "RelationNeedsWAL(rel) || copying_initfork", because we know the
+     * current operation created a new relfilenode.
      */
     use_wal = XLogIsNeeded() &&
         (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
@@ -397,24 +471,39 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
     }
 
     /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.  (For a temp or
-     * unlogged rel we don't care since the data will be gone after a crash
-     * anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the copy. It's
-     * less obvious that we have to do it even if we did WAL-log the copied
-     * pages. The reason is that since we're copying outside shared buffers, a
-     * CHECKPOINT occurring during the copy has no way to flush the previously
-     * written data to disk (indeed it won't know the new rel even exists).  A
-     * crash later on would replay WAL from the checkpoint, therefore it
-     * wouldn't replay our earlier WAL entries. If we do not fsync those pages
-     * here, they might still not be on disk when the crash occurs.
+     * When we WAL-logged rel pages, we must nonetheless fsync them.  The
+     * reason is that since we're copying outside shared buffers, a CHECKPOINT
+     * occurring during the copy has no way to flush the previously written
+     * data to disk (indeed it won't know the new rel even exists).  A crash
+     * later on would replay WAL from the checkpoint, therefore it wouldn't
+     * replay our earlier WAL entries. If we do not fsync those pages here,
+     * they might still not be on disk when the crash occurs.
      */
-    if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
+    if (use_wal || copying_initfork)
         smgrimmedsync(dst, forkNum);
 }
 
+/*
+ * RelFileNodeSkippingWAL - check if a BM_PERMANENT relfilenode is using WAL
+ *
+ *   Changes of certain relfilenodes must not write WAL; see "Skipping WAL for
+ *   New RelFileNode" in src/backend/access/transam/README.  Though it is
+ *   known from Relation efficiently, this function is intended for the code
+ *   paths not having access to Relation.
+ */
+bool
+RelFileNodeSkippingWAL(RelFileNode rnode)
+{
+    if (XLogIsNeeded())
+        return false;            /* no permanent relfilenode skips WAL */
+
+    if (!pendingSyncHash ||
+        hash_search(pendingSyncHash, &rnode, HASH_FIND, NULL) == NULL)
+        return false;
+
+    return true;
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -492,6 +581,135 @@ smgrDoPendingDeletes(bool isCommit)
     }
 }
 
+/*
+ *    smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    PendingRelDelete *pending;
+    int            nrels = 0,
+                maxrels = 0;
+    SMgrRelation *srels = NULL;
+    HASH_SEQ_STATUS scan;
+    pendingSync *pendingsync;
+
+    if (XLogIsNeeded())
+        return;                    /* no relation can use this */
+
+    Assert(GetCurrentTransactionNestLevel() == 1);
+
+    if (!pendingSyncHash)
+        return;                    /* no relation needs sync */
+
+    /* Just throw away all pending syncs if any at rollback */
+    if (!isCommit)
+    {
+        pendingSyncHash = NULL;
+        return;
+    }
+
+    AssertPendingSyncs_RelationCache();
+
+    /* Skip syncing nodes that smgrDoPendingDeletes() will delete. */
+    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    {
+        if (!pending->atCommit)
+            continue;
+
+        (void) hash_search(pendingSyncHash, (void *) &pending->relnode,
+                           HASH_REMOVE, NULL);
+    }
+
+    hash_seq_init(&scan, pendingSyncHash);
+    while ((pendingsync = (pendingSync *) hash_seq_search(&scan)))
+    {
+        ForkNumber    fork;
+        BlockNumber nblocks[MAX_FORKNUM + 1];
+        BlockNumber total_blocks = 0;
+        SMgrRelation srel;
+
+        srel = smgropen(pendingsync->rnode, InvalidBackendId);
+
+        /*
+         * We emit newpage WAL records for smaller relations.
+         *
+         * Small WAL records have a chance to be emitted along with other
+         * backends' WAL records.  We emit WAL records instead of syncing for
+         * files that are smaller than a certain threshold, expecting faster
+         * commit.  The threshold is defined by the GUC wal_skip_threshold.
+         */
+        for (fork = 0; fork <= MAX_FORKNUM; fork++)
+        {
+            if (smgrexists(srel, fork))
+            {
+                BlockNumber n = smgrnblocks(srel, fork);
+
+                /* we shouldn't come here for unlogged relations */
+                Assert(fork != INIT_FORKNUM);
+
+                nblocks[fork] = n;
+                total_blocks += n;
+            }
+            else
+                nblocks[fork] = InvalidBlockNumber;
+        }
+
+        /*
+         * Sync file or emit WAL records for its contents.  Do file sync if
+         * the size is larger than the threshold or truncates may have removed
+         * blocks beyond the current size.
+         */
+        if (total_blocks * BLCKSZ / 1024 >= wal_skip_threshold ||
+            nblocks[MAIN_FORKNUM] < pendingsync->max_truncated)
+        {
+            /* allocate the initial array, or extend it, if needed */
+            if (maxrels == 0)
+            {
+                maxrels = 8;
+                srels = palloc(sizeof(SMgrRelation) * maxrels);
+            }
+            else if (maxrels <= nrels)
+            {
+                maxrels *= 2;
+                srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+            }
+
+            srels[nrels++] = srel;
+        }
+        else
+        {
+            /* Emit WAL records for all blocks.  The file is small enough. */
+            for (fork = 0; fork <= MAX_FORKNUM; fork++)
+            {
+                int            n = nblocks[fork];
+                Relation    rel;
+
+                if (!BlockNumberIsValid(n))
+                    continue;
+
+                /*
+                 * Emit WAL for the whole file.  Unfortunately we don't know
+                 * what kind of a page this is, so we have to log the full
+                 * page including any unused space.  ReadBufferExtended()
+                 * counts some pgstat events; unfortunately, we discard them.
+                 */
+                rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node);
+                log_newpage_range(rel, fork, 0, n, false);
+                FreeFakeRelcacheEntry(rel);
+            }
+        }
+    }
+
+    pendingSyncHash = NULL;
+
+    if (nrels > 0)
+    {
+        smgrdosyncall(srels, nrels);
+        pfree(srels);
+    }
+}
+
 /*
  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
  *
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index e9d7a7ff79..b836ccf2d6 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1014,6 +1014,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
                 relfilenode2;
     Oid            swaptemp;
     char        swptmpchr;
+    Relation    rel1;
 
     /* We need writable copies of both pg_class tuples. */
     relRelation = table_open(RelationRelationId, RowExclusiveLock);
@@ -1173,6 +1174,15 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
         CacheInvalidateRelcacheByTuple(reltup2);
     }
 
+    /*
+     * Recognize that rel1's relfilenode (swapped from rel2) is new in this
+     * subtransaction. Since the next step for rel2 is deletion, don't bother
+     * recording the newness of its relfilenode.
+     */
+    rel1 = relation_open(r1, NoLock);
+    RelationAssumeNewRelfilenode(rel1);
+    relation_close(rel1, NoLock);
+
     /*
      * Post alter hook for modified relations. The change to r2 is always
      * internal, but r1 depends on the invocation context.
@@ -1489,7 +1499,7 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 
             /* Get the associated valid index to be renamed */
             toastidx = toast_get_valid_index(newrel->rd_rel->reltoastrelid,
-                                             AccessShareLock);
+                                             NoLock);
 
             /* rename the toast table ... */
             snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u",
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 40a8ec1abd..f9acde56a5 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2712,63 +2712,15 @@ CopyFrom(CopyState cstate)
                             RelationGetRelationName(cstate->rel))));
     }
 
-    /*----------
-     * Check to see if we can avoid writing WAL
-     *
-     * If archive logging/streaming is not enabled *and* either
-     *    - table was created in same transaction as this COPY
-     *    - data is being written to relfilenode created in this transaction
-     * then we can skip writing WAL.  It's safe because if the transaction
-     * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the table_finish_bulk_insert() at
-     * the bottom of this routine first.
-     *
-     * As mentioned in comments in utils/rel.h, the in-same-transaction test
-     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
-     * can be cleared before the end of the transaction. The exact case is
-     * when a relation sets a new relfilenode twice in same transaction, yet
-     * the second one fails in an aborted subtransaction, e.g.
-     *
-     * BEGIN;
-     * TRUNCATE t;
-     * SAVEPOINT save;
-     * TRUNCATE t;
-     * ROLLBACK TO save;
-     * COPY ...
-     *
-     * Also, if the target file is new-in-transaction, we assume that checking
-     * FSM for free space is a waste of time, even if we must use WAL because
-     * of archiving.  This could possibly be wrong, but it's unlikely.
-     *
-     * The comments for table_tuple_insert and RelationGetBufferForTuple
-     * specify that skipping WAL logging is only safe if we ensure that our
-     * tuples do not go into pages containing tuples from any other
-     * transactions --- but this must be the case if we have a new table or
-     * new relfilenode, so we need no additional work to enforce that.
-     *
-     * We currently don't support this optimization if the COPY target is a
-     * partitioned table as we currently only lazily initialize partition
-     * information when routing the first tuple to the partition.  We cannot
-     * know at this stage if we can perform this optimization.  It should be
-     * possible to improve on this, but it does mean maintaining heap insert
-     * option flags per partition and setting them when we first open the
-     * partition.
-     *
-     * This optimization is not supported for relation types which do not
-     * have any physical storage, with foreign tables and views using
-     * INSTEAD OF triggers entering in this category.  Partitioned tables
-     * are not supported as per the description above.
-     *----------
+    /*
+     * If the target file is new-in-transaction, we assume that checking FSM
+     * for free space is a waste of time.  This could possibly be wrong, but
+     * it's unlikely.
      */
-    /* createSubid is creation check, newRelfilenodeSubid is truncation check */
     if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
         (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
-         cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
-    {
+         cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
         ti_options |= TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
-    }
 
     /*
      * Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 9f387b5f5f..fe9a754782 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -552,16 +552,13 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
     myState->rel = intoRelationDesc;
     myState->reladdr = intoRelationAddr;
     myState->output_cid = GetCurrentCommandId(true);
-
-    /*
-     * We can skip WAL-logging the insertions, unless PITR or streaming
-     * replication is in use. We can skip the FSM in any case.
-     */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+    myState->ti_options = TABLE_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
-    /* Not using WAL requires smgr_targblock be initially invalid */
+    /*
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
+     */
     Assert(RelationGetTargetBlock(intoRelationDesc) == InvalidBlockNumber);
 }
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 1ee37c1aeb..ea1d0fc850 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -457,17 +457,13 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->transientrel = transientrel;
     myState->output_cid = GetCurrentCommandId(true);
-
-    /*
-     * We can skip WAL-logging the insertions, unless PITR or streaming
-     * replication is in use. We can skip the FSM in any case.
-     */
     myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
-    if (!XLogIsNeeded())
-        myState->ti_options |= TABLE_INSERT_SKIP_WAL;
     myState->bistate = GetBulkInsertState();
 
-    /* Not using WAL requires smgr_targblock be initially invalid */
+    /*
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
+     */
     Assert(RelationGetTargetBlock(transientrel) == InvalidBlockNumber);
 }
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 7c23968f2d..f706d6856f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5023,19 +5023,14 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         newrel = NULL;
 
     /*
-     * Prepare a BulkInsertState and options for table_tuple_insert. Because
-     * we're building a new heap, we can skip WAL-logging and fsync it to disk
-     * at the end instead (unless WAL-logging is required for archiving or
-     * streaming replication). The FSM is empty too, so don't bother using it.
+     * Prepare a BulkInsertState and options for table_tuple_insert.  The FSM
+     * is empty, so don't bother using it.
      */
     if (newrel)
     {
         mycid = GetCurrentCommandId(true);
         bistate = GetBulkInsertState();
-
         ti_options = TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
     }
     else
     {
@@ -12921,6 +12916,8 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     table_close(pg_class, RowExclusiveLock);
 
+    RelationAssumeNewRelfilenode(rel);
+
     relation_close(rel, NoLock);
 
     /* Make sure the reltablespace change is visible */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index aba3960481..73c38757fa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -66,7 +66,7 @@
 #define BUF_WRITTEN                0x01
 #define BUF_REUSABLE            0x02
 
-#define DROP_RELS_BSEARCH_THRESHOLD        20
+#define RELS_BSEARCH_THRESHOLD        20
 
 typedef struct PrivateRefCountEntry
 {
@@ -105,6 +105,19 @@ typedef struct CkptTsStatus
     int            index;
 } CkptTsStatus;
 
+/*
+ * Type for array used to sort SMgrRelations
+ *
+ * FlushRelationsAllBuffers shares the same comparator function with
+ * DropRelFileNodesAllBuffers. Pointer to this struct and RelFileNode must be
+ * compatible.
+ */
+typedef struct SMgrSortArray
+{
+    RelFileNode rnode;            /* This must be the first member */
+    SMgrRelation srel;
+} SMgrSortArray;
+
 /* GUC variables */
 bool        zero_damaged_pages = false;
 int            bgwriter_lru_maxpages = 100;
@@ -3043,7 +3056,7 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
      * an exactly determined value, as it depends on many factors (CPU and RAM
      * speeds, amount of shared buffers etc.).
      */
-    use_bsearch = n > DROP_RELS_BSEARCH_THRESHOLD;
+    use_bsearch = n > RELS_BSEARCH_THRESHOLD;
 
     /* sort the list of rnodes if necessary */
     if (use_bsearch)
@@ -3293,6 +3306,104 @@ FlushRelationBuffers(Relation rel)
     }
 }
 
+/* ---------------------------------------------------------------------
+ *        FlushRelationsAllBuffers
+ *
+ *        This function flushes out of the buffer pool all the pages of all
+ *        forks of the specified smgr relations.  It's equivalent to calling
+ *        FlushRelationBuffers once per fork per relation.  The relations are
+ *        assumed not to use local buffers.
+ * --------------------------------------------------------------------
+ */
+void
+FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
+{
+    int            i;
+    SMgrSortArray *srels;
+    bool        use_bsearch;
+
+    if (nrels == 0)
+        return;
+
+    /* fill-in array for qsort */
+    srels = palloc(sizeof(SMgrSortArray) * nrels);
+
+    for (i = 0; i < nrels; i++)
+    {
+        Assert(!RelFileNodeBackendIsTemp(smgrs[i]->smgr_rnode));
+
+        srels[i].rnode = smgrs[i]->smgr_rnode.node;
+        srels[i].srel = smgrs[i];
+    }
+
+    /*
+     * Save the bsearch overhead for low number of relations to sync. See
+     * DropRelFileNodesAllBuffers for details.
+     */
+    use_bsearch = nrels > RELS_BSEARCH_THRESHOLD;
+
+    /* sort the list of SMgrRelations if necessary */
+    if (use_bsearch)
+        pg_qsort(srels, nrels, sizeof(SMgrSortArray), rnode_comparator);
+
+    /* Make sure we can handle the pin inside the loop */
+    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+    for (i = 0; i < NBuffers; i++)
+    {
+        SMgrSortArray *srelent = NULL;
+        BufferDesc *bufHdr = GetBufferDescriptor(i);
+        uint32        buf_state;
+
+        /*
+         * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
+         * and saves some cycles.
+         */
+
+        if (!use_bsearch)
+        {
+            int            j;
+
+            for (j = 0; j < nrels; j++)
+            {
+                if (RelFileNodeEquals(bufHdr->tag.rnode, srels[j].rnode))
+                {
+                    srelent = &srels[j];
+                    break;
+                }
+            }
+
+        }
+        else
+        {
+            srelent = bsearch((const void *) &(bufHdr->tag.rnode),
+                              srels, nrels, sizeof(SMgrSortArray),
+                              rnode_comparator);
+        }
+
+        /* buffer doesn't belong to any of the given relfilenodes; skip it */
+        if (srelent == NULL)
+            continue;
+
+        ReservePrivateRefCountEntry();
+
+        buf_state = LockBufHdr(bufHdr);
+        if (RelFileNodeEquals(bufHdr->tag.rnode, srelent->rnode) &&
+            (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+        {
+            PinBuffer_Locked(bufHdr);
+            LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+            FlushBuffer(bufHdr, srelent->srel);
+            LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+            UnpinBuffer(bufHdr, true);
+        }
+        else
+            UnlockBufHdr(bufHdr, buf_state);
+    }
+
+    pfree(srels);
+}
+
 /* ---------------------------------------------------------------------
  *        FlushDatabaseBuffers
  *
@@ -3494,13 +3605,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
             (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
         {
             /*
-             * If we're in recovery we cannot dirty a page because of a hint.
-             * We can set the hint, just not dirty the page as a result so the
-             * hint is lost when we evict the page or shutdown.
+             * If we must not write WAL, due to a relfilenode-specific
+             * condition or being in recovery, don't dirty the page.  We can
+             * set the hint, just not dirty the page as a result so the hint
+             * is lost when we evict the page or shutdown.
              *
              * See src/backend/storage/page/README for longer discussion.
              */
-            if (RecoveryInProgress())
+            if (RecoveryInProgress() ||
+                RelFileNodeSkippingWAL(bufHdr->tag.rnode))
                 return;
 
             /*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 56dba09299..8f98f665c5 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -587,6 +587,18 @@ LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode)
     return (locallock && locallock->nLocks > 0);
 }
 
+#ifdef USE_ASSERT_CHECKING
+/*
+ * GetLockMethodLocalHash -- return the hash of local locks, for modules that
+ *        evaluate assertions based on all locks held.
+ */
+HTAB *
+GetLockMethodLocalHash(void)
+{
+    return LockMethodLocalHash;
+}
+#endif
+
 /*
  * LockHasWaiters -- look up 'locktag' and check if releasing this
  *        lock would wake up other processes waiting for it.
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index c5b771c531..ee9822c6e1 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -248,11 +248,10 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
  * During replay, we would delete the file and then recreate it, which is fine
  * if the contents of the file were repopulated by subsequent WAL entries.
  * But if we didn't WAL-log insertions, but instead relied on fsyncing the
- * file after populating it (as for instance CLUSTER and CREATE INDEX do),
- * the contents of the file would be lost forever.  By leaving the empty file
- * until after the next checkpoint, we prevent reassignment of the relfilenode
- * number until it's safe, because relfilenode assignment skips over any
- * existing file.
+ * file after populating it (as we do at wal_level=minimal), the contents of
+ * the file would be lost forever.  By leaving the empty file until after the
+ * next checkpoint, we prevent reassignment of the relfilenode number until
+ * it's safe, because relfilenode assignment skips over any existing file.
  *
  * We do not need to go through this dance for temp relations, though, because
  * we never make WAL entries for temp rels, and so a temp rel poses no threat
@@ -877,12 +876,18 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
  *    mdimmedsync() -- Immediately sync a relation to stable storage.
  *
  * Note that only writes already issued are synced; this routine knows
- * nothing of dirty buffers that may exist inside the buffer manager.
+ * nothing of dirty buffers that may exist inside the buffer manager.  We
+ * sync active and inactive segments; smgrDoPendingSyncs() relies on this.
+ * Consider a relation skipping WAL.  Suppose a checkpoint syncs blocks of
+ * some segment, then mdtruncate() renders that segment inactive.  If we
+ * crash before the next checkpoint syncs the newly-inactive segment, that
+ * segment may survive recovery, reintroducing unwanted data into the table.
  */
 void
 mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 {
     int            segno;
+    int            min_inactive_seg;
 
     /*
      * NOTE: mdnblocks makes sure we have opened all active segments, so that
@@ -890,7 +895,16 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
      */
     mdnblocks(reln, forknum);
 
-    segno = reln->md_num_open_segs[forknum];
+    min_inactive_seg = segno = reln->md_num_open_segs[forknum];
+
+    /*
+     * Temporarily open inactive segments, then close them after sync.  There
+     * may be some inactive segments left opened after fsync() error, but that
+     * is harmless.  We don't bother to clean them up and take a risk of
+     * further trouble.  The next mdclose() will soon close them.
+     */
+    while (_mdfd_openseg(reln, forknum, segno, 0) != NULL)
+        segno++;
 
     while (segno > 0)
     {
@@ -901,6 +915,14 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
                     (errcode_for_file_access(),
                      errmsg("could not fsync file \"%s\": %m",
                             FilePathName(v->mdfd_vfd))));
+
+        /* Close inactive segments immediately */
+        if (segno > min_inactive_seg)
+        {
+            FileClose(v->mdfd_vfd);
+            _fdvec_resize(reln, forknum, segno - 1);
+        }
+
         segno--;
     }
 }
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 360b5bf5bf..72c9696ad1 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -388,6 +388,41 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
     smgrsw[which].smgr_unlink(rnode, InvalidForkNumber, isRedo);
 }
 
+/*
+ *    smgrdosyncall() -- Immediately sync all forks of all given relations
+ *
+ *        All forks of all given relations are synced out to the store.
+ *
+ *        This is equivalent to FlushRelationBuffers() for each smgr relation,
+ *        then calling smgrimmedsync() for all forks of each relation, but it's
+ *        significantly quicker so should be preferred when possible.
+ */
+void
+smgrdosyncall(SMgrRelation *rels, int nrels)
+{
+    int            i = 0;
+    ForkNumber    forknum;
+
+    if (nrels == 0)
+        return;
+
+    FlushRelationsAllBuffers(rels, nrels);
+
+    /*
+     * Sync the physical file(s).
+     */
+    for (i = 0; i < nrels; i++)
+    {
+        int            which = rels[i]->smgr_which;
+
+        for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+        {
+            if (smgrsw[which].smgr_exists(rels[i], forknum))
+                smgrsw[which].smgr_immedsync(rels[i], forknum);
+        }
+    }
+}
+
 /*
  *    smgrdounlinkall() -- Immediately unlink all forks of all given relations
  *
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index df025a5a30..0ac72572e3 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -257,6 +257,9 @@ static void RelationReloadIndexInfo(Relation relation);
 static void RelationReloadNailed(Relation relation);
 static void RelationFlushRelation(Relation relation);
 static void RememberToFreeTupleDescAtEOX(TupleDesc td);
+#ifdef USE_ASSERT_CHECKING
+static void AssertPendingSyncConsistency(Relation relation);
+#endif
 static void AtEOXact_cleanup(Relation relation, bool isCommit);
 static void AtEOSubXact_cleanup(Relation relation, bool isCommit,
                                 SubTransactionId mySubid, SubTransactionId parentSubid);
@@ -1090,6 +1093,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
     relation->rd_isnailed = false;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     switch (relation->rd_rel->relpersistence)
     {
         case RELPERSISTENCE_UNLOGGED:
@@ -1814,6 +1818,7 @@ formrdesc(const char *relationName, Oid relationReltype,
     relation->rd_isnailed = true;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
     relation->rd_backend = InvalidBackendId;
     relation->rd_islocaltemp = false;
 
@@ -2021,6 +2026,7 @@ RelationIdGetRelation(Oid relationId)
     rd = RelationBuildDesc(relationId, true);
     if (RelationIsValid(rd))
         RelationIncrementReferenceCount(rd);
+
     return rd;
 }
 
@@ -2089,7 +2095,7 @@ RelationClose(Relation relation)
 #ifdef RELCACHE_FORCE_RELEASE
     if (RelationHasReferenceCountZero(relation) &&
         relation->rd_createSubid == InvalidSubTransactionId &&
-        relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
         RelationClearRelation(relation, false);
 #endif
 }
@@ -2505,13 +2511,13 @@ RelationClearRelation(Relation relation, bool rebuild)
          * problem.
          *
          * When rebuilding an open relcache entry, we must preserve ref count,
-         * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state.  Also
-         * attempt to preserve the pg_class entry (rd_rel), tupledesc,
-         * rewrite-rule, partition key, and partition descriptor substructures
-         * in place, because various places assume that these structures won't
-         * move while they are working with an open relcache entry.  (Note:
-         * the refcount mechanism for tupledescs might someday allow us to
-         * remove this hack for the tupledesc.)
+         * rd_*Subid, and rd_toastoid state.  Also attempt to preserve the
+         * pg_class entry (rd_rel), tupledesc, rewrite-rule, partition key,
+         * and partition descriptor substructures in place, because various
+         * places assume that these structures won't move while they are
+         * working with an open relcache entry.  (Note:  the refcount
+         * mechanism for tupledescs might someday allow us to remove this hack
+         * for the tupledesc.)
          *
          * Note that this process does not touch CurrentResourceOwner; which
          * is good because whatever ref counts the entry may have do not
@@ -2591,6 +2597,7 @@ RelationClearRelation(Relation relation, bool rebuild)
         /* creation sub-XIDs must be preserved */
         SWAPFIELD(SubTransactionId, rd_createSubid);
         SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+        SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
         /* un-swap rd_rel pointers, swap contents instead */
         SWAPFIELD(Form_pg_class, rd_rel);
         /* ... but actually, we don't have to update newrel->rd_rel */
@@ -2669,12 +2676,12 @@ static void
 RelationFlushRelation(Relation relation)
 {
     if (relation->rd_createSubid != InvalidSubTransactionId ||
-        relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
     {
         /*
          * New relcache entries are always rebuilt, not flushed; else we'd
-         * forget the "new" status of the relation, which is a useful
-         * optimization to have.  Ditto for the new-relfilenode status.
+         * forget the "new" status of the relation.  Ditto for the
+         * new-relfilenode status.
          *
          * The rel could have zero refcnt here, so temporarily increment the
          * refcnt to ensure it's safe to rebuild it.  We can assume that the
@@ -2754,11 +2761,10 @@ RelationCacheInvalidateEntry(Oid relationId)
  *     relation cache and re-read relation mapping data.
  *
  *     This is currently used only to recover from SI message buffer overflow,
- *     so we do not touch new-in-transaction relations; they cannot be targets
- *     of cross-backend SI updates (and our own updates now go through a
- *     separate linked list that isn't limited by the SI message buffer size).
- *     Likewise, we need not discard new-relfilenode-in-transaction hints,
- *     since any invalidation of those would be a local event.
+ *     so we do not touch relations having new-in-transaction relfilenodes; they
+ *     cannot be targets of cross-backend SI updates (and our own updates now go
+ *     through a separate linked list that isn't limited by the SI message
+ *     buffer size).
  *
  *     We do this in two phases: the first pass deletes deletable items, and
  *     the second one rebuilds the rebuildable items.  This is essential for
@@ -2809,7 +2815,7 @@ RelationCacheInvalidate(void)
          * pending invalidations.
          */
         if (relation->rd_createSubid != InvalidSubTransactionId ||
-            relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+            relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
             continue;
 
         relcacheInvalsReceived++;
@@ -2921,6 +2927,78 @@ RememberToFreeTupleDescAtEOX(TupleDesc td)
     EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
 }
 
+#ifdef USE_ASSERT_CHECKING
+static void
+AssertPendingSyncConsistency(Relation relation)
+{
+    bool        relcache_verdict =
+    relation->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&
+    ((relation->rd_createSubid != InvalidSubTransactionId &&
+      RELKIND_HAS_STORAGE(relation->rd_rel->relkind)) ||
+     relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId);
+
+    Assert(relcache_verdict == RelFileNodeSkippingWAL(relation->rd_node));
+}
+
+/*
+ * AssertPendingSyncs_RelationCache
+ *
+ *    Assert that relcache.c and storage.c agree on whether to skip WAL.
+ *
+ * This consistently detects relcache.c skipping WAL while storage.c is not
+ * skipping WAL.  It often fails to detect the reverse error, because
+ * invalidation will have destroyed the relcache entry.  It will detect the
+ * reverse error if something opens the relation after the DDL.
+ */
+void
+AssertPendingSyncs_RelationCache(void)
+{
+    HASH_SEQ_STATUS status;
+    LOCALLOCK  *locallock;
+    Relation   *rels;
+    int            maxrels;
+    int            nrels;
+    RelIdCacheEnt *idhentry;
+    int            i;
+
+    /* open every relation that this transaction has locked */
+    PushActiveSnapshot(GetTransactionSnapshot());
+    maxrels = 1;
+    rels = palloc(maxrels * sizeof(*rels));
+    nrels = 0;
+    hash_seq_init(&status, GetLockMethodLocalHash());
+    while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+    {
+        Oid            relid;
+        Relation    r;
+
+        if (locallock->nLocks <= 0)
+            continue;
+        if ((LockTagType) locallock->tag.lock.locktag_type !=
+            LOCKTAG_RELATION)
+            continue;
+        relid = ObjectIdGetDatum(locallock->tag.lock.locktag_field2);
+        r = RelationIdGetRelation(relid);
+        if (r == NULL)
+            continue;
+        if (nrels >= maxrels)
+        {
+            maxrels *= 2;
+            rels = repalloc(rels, maxrels * sizeof(*rels));
+        }
+        rels[nrels++] = r;
+    }
+
+    hash_seq_init(&status, RelationIdCache);
+    while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+        AssertPendingSyncConsistency(idhentry->reldesc);
+
+    for (i = 0; i < nrels; i++)
+        RelationClose(rels[i]);
+    PopActiveSnapshot();
+}
+#endif
+
 /*
  * AtEOXact_RelationCache
  *
@@ -3032,10 +3110,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
      *
      * During commit, reset the flag to zero, since we are now out of the
      * creating transaction.  During abort, simply delete the relcache entry
-     * --- it isn't interesting any longer.  (NOTE: if we have forgotten the
-     * new-ness of a new relation due to a forced cache flush, the entry will
-     * get deleted anyway by shared-cache-inval processing of the aborted
-     * pg_class insertion.)
+     * --- it isn't interesting any longer.
      */
     if (relation->rd_createSubid != InvalidSubTransactionId)
     {
@@ -3063,9 +3138,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
     }
 
     /*
-     * Likewise, reset the hint about the relfilenode being new.
+     * Likewise, reset any record of the relfilenode being new.
      */
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 }
 
 /*
@@ -3157,7 +3233,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
     }
 
     /*
-     * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+     * Likewise, update or drop any new-relfilenode-in-subtransaction record.
      */
     if (relation->rd_newRelfilenodeSubid == mySubid)
     {
@@ -3166,6 +3242,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         else
             relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     }
+
+    if (relation->rd_firstRelfilenodeSubid == mySubid)
+    {
+        if (isCommit)
+            relation->rd_firstRelfilenodeSubid = parentSubid;
+        else
+            relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    }
 }
 
 
@@ -3255,6 +3339,7 @@ RelationBuildLocalRelation(const char *relname,
     /* it's being created in this transaction */
     rel->rd_createSubid = GetCurrentSubTransactionId();
     rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 
     /*
      * create a new tuple descriptor from the one passed in.  We do this
@@ -3552,14 +3637,29 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
      */
     CommandCounterIncrement();
 
-    /*
-     * Mark the rel as having been given a new relfilenode in the current
-     * (sub) transaction.  This is a hint that can be used to optimize later
-     * operations on the rel in the same transaction.
-     */
-    relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+    RelationAssumeNewRelfilenode(relation);
+}
 
-    /* Flag relation as needing eoxact cleanup (to remove the hint) */
+/*
+ * RelationAssumeNewRelfilenode
+ *
+ * Code that modifies pg_class.reltablespace or pg_class.relfilenode must call
+ * this.  The call shall precede any code that might insert WAL records whose
+ * replay would modify bytes in the new RelFileNode, and the call shall follow
+ * any WAL modifying bytes in the prior RelFileNode.  See struct RelationData.
+ * Ideally, call this as near as possible to the CommandCounterIncrement()
+ * that makes the pg_class change visible (before it or after it); that
+ * minimizes the chance of future development adding a forbidden WAL insertion
+ * between RelationAssumeNewRelfilenode() and CommandCounterIncrement().
+ */
+void
+RelationAssumeNewRelfilenode(Relation relation)
+{
+    relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+    if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
+
+    /* Flag relation as needing eoxact cleanup (to clear these fields) */
     EOXactListAdd(relation);
 }
 
@@ -5625,6 +5725,7 @@ load_relcache_init_file(bool shared)
         rel->rd_fkeylist = NIL;
         rel->rd_createSubid = InvalidSubTransactionId;
         rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+        rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
         rel->rd_amcache = NULL;
         MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index cacbe904db..edf175a1b3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -36,6 +36,7 @@
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
+#include "catalog/storage.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
 #include "commands/trigger.h"
@@ -2694,6 +2695,18 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"wal_skip_threshold", PGC_USERSET, RESOURCES_DISK,
+            gettext_noop("Size of new file to fsync instead of writing WAL."),
+            NULL,
+            GUC_UNIT_KB
+        },
+        &wal_skip_threshold,
+        2048,
+        0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
     {
         {"max_wal_senders", PGC_POSTMASTER, REPLICATION_SENDING,
             gettext_noop("Sets the maximum number of simultaneously running WAL sender processes."),
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 18f2b0d98e..4bfc628000 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -455,6 +455,8 @@ extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
                                 BlockNumber origrlink, GistNSN oldnsn,
                                 Buffer leftchild, bool markfollowright);
 
+extern XLogRecPtr gistXLogAssignLSN(void);
+
 /* gistget.c */
 extern bool gistgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 55fc843d3a..673afee1e1 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -26,6 +26,7 @@
  /* #define XLOG_GIST_INSERT_COMPLETE     0x40 */    /* not used anymore */
  /* #define XLOG_GIST_CREATE_INDEX         0x50 */    /* not used anymore */
 #define XLOG_GIST_PAGE_DELETE        0x60
+#define XLOG_GIST_ASSIGN_LSN        0x70    /* nop, assign new LSN */
 
 /*
  * Backup Blk 0: updated page.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 00a17f5f71..14f096d037 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -31,7 +31,6 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
@@ -168,8 +167,6 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                                HeapTuple tup);
 
-extern void heap_sync(Relation relation);
-
 extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
                                                          ItemPointerData *items,
                                                          int nitems);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index fb2902bd69..e6d7fa1e65 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
                                        TransactionId OldestXmin, TransactionId FreezeXid,
-                                       MultiXactId MultiXactCutoff, bool use_wal);
+                                       MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
                                HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 696451f728..6547099e84 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -127,7 +127,7 @@ typedef struct TM_FailureData
 } TM_FailureData;
 
 /* "options" flag bits for table_tuple_insert */
-#define TABLE_INSERT_SKIP_WAL        0x0001
+/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
 #define TABLE_INSERT_SKIP_FSM        0x0002
 #define TABLE_INSERT_FROZEN            0x0004
 #define TABLE_INSERT_NO_LOGICAL        0x0008
@@ -409,9 +409,8 @@ typedef struct TableAmRoutine
 
     /*
      * Perform operations necessary to complete insertions made via
-     * tuple_insert and multi_insert with a BulkInsertState specified. This
-     * may for example be used to flush the relation, when the
-     * TABLE_INSERT_SKIP_WAL option was used.
+     * tuple_insert and multi_insert with a BulkInsertState specified. In-tree
+     * access methods ceased to use this.
      *
      * Typically callers of tuple_insert and multi_insert will just pass all
      * the flags that apply to them, and each AM has to decide which of them
@@ -1105,10 +1104,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * The options bitmask allows the caller to specify options that may change the
  * behaviour of the AM. The AM will ignore options that it does not support.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.
@@ -1328,9 +1323,7 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
 
 /*
  * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
+ * tuple_insert and multi_insert with a BulkInsertState specified.
  */
 static inline void
 table_finish_bulk_insert(Relation rel, int options)
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 048003c25e..bd37bf311c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,18 +19,24 @@
 #include "storage/smgr.h"
 #include "utils/relcache.h"
 
+/* GUC variables */
+extern int    wal_skip_threshold;
+
 extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
+extern void RelationPreTruncate(Relation rel);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
 extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
                                 ForkNumber forkNum, char relpersistence);
+extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
 
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingSyncs(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 73c7e9ba38..292d440eaf 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -49,6 +49,9 @@ typedef enum
 /* forward declared, to avoid having to expose buf_internals.h here */
 struct WritebackContext;
 
+/* forward declared, to avoid including smgr.h here */
+struct SMgrRelationData;
+
 /* in globals.c ... this duplicates miscadmin.h */
 extern PGDLLIMPORT int NBuffers;
 
@@ -189,6 +192,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                                    ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                                    int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index bb8e4e6e5b..8c180094f0 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -544,6 +544,9 @@ extern void LockReleaseSession(LOCKMETHODID lockmethodid);
 extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
 extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
 extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
+#ifdef USE_ASSERT_CHECKING
+extern HTAB *GetLockMethodLocalHash(void);
+#endif
 extern bool LockHasWaiters(const LOCKTAG *locktag,
                            LOCKMODE lockmode, bool sessionLock);
 extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 243822137c..79dfe0e373 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -89,6 +89,7 @@ extern void smgrcloseall(void);
 extern void smgrclosenode(RelFileNodeBackend rnode);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdounlink(SMgrRelation reln, bool isRedo);
+extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
                        BlockNumber blocknum, char *buffer, bool skipFsync);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 44ed04dd3f..ad72a8b910 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -64,22 +64,40 @@ typedef struct RelationData
                                  * rd_replidindex) */
     bool        rd_statvalid;    /* is rd_statlist valid? */
 
-    /*
+    /*----------
      * rd_createSubid is the ID of the highest subtransaction the rel has
-     * survived into; or zero if the rel was not created in the current top
-     * transaction.  This can be now be relied on, whereas previously it could
-     * be "forgotten" in earlier releases. Likewise, rd_newRelfilenodeSubid is
-     * the ID of the highest subtransaction the relfilenode change has
-     * survived into, or zero if not changed in the current transaction (or we
-     * have forgotten changing it). rd_newRelfilenodeSubid can be forgotten
-     * when a relation has multiple new relfilenodes within a single
-     * transaction, with one of them occurring in a subsequently aborted
-     * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
-     * ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten
+     * survived into or zero if the rel was not created in the current top
+     * transaction.  rd_firstRelfilenodeSubid is the ID of the highest
+     * subtransaction an rd_node change has survived into or zero if rd_node
+     * matches the value it had at the start of the current top transaction.
+     * (Rolling back the subtransaction that rd_firstRelfilenodeSubid denotes
+     * would restore rd_node to the value it had at the start of the current
+     * top transaction.  Rolling back any lower subtransaction would not.)
+     * Their accuracy is critical to RelationNeedsWAL().
+     *
+     * rd_newRelfilenodeSubid is the ID of the highest subtransaction the
+     * most-recent relfilenode change has survived into or zero if not changed
+     * in the current transaction (or we have forgotten changing it).  This
+     * field is accurate when non-zero, but it can be zero when a relation has
+     * multiple new relfilenodes within a single transaction, with one of them
+     * occurring in a subsequently aborted subtransaction, e.g.
+     *        BEGIN;
+     *        TRUNCATE t;
+     *        SAVEPOINT save;
+     *        TRUNCATE t;
+     *        ROLLBACK TO save;
+     *        -- rd_newRelfilenodeSubid is now forgotten
+     *
+     * These fields are read-only outside relcache.c.  Other files trigger
+     * rd_node changes by updating pg_class.reltablespace and/or
+     * pg_class.relfilenode.  They must call RelationAssumeNewRelfilenode() to
+     * update these fields.
      */
     SubTransactionId rd_createSubid;    /* rel was created in current xact */
-    SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
-                                                 * current xact */
+    SubTransactionId rd_newRelfilenodeSubid;    /* highest subxact changing
+                                                 * rd_node to current value */
+    SubTransactionId rd_firstRelfilenodeSubid;    /* highest subxact changing
+                                                 * rd_node to any value */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
@@ -526,9 +544,16 @@ typedef struct ViewOptions
 /*
  * RelationNeedsWAL
  *        True if relation needs WAL.
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.  See "Skipping WAL for New
+ * RelFileNode" in src/backend/access/transam/README.
  */
-#define RelationNeedsWAL(relation) \
-    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+#define RelationNeedsWAL(relation)                                        \
+    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&    \
+     (XLogIsNeeded() ||                                                    \
+      (relation->rd_createSubid == InvalidSubTransactionId &&            \
+       relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
 
 /*
  * RelationUsesLocalBuffers
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index d77f5beec6..62239a09e8 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -106,9 +106,10 @@ extern Relation RelationBuildLocalRelation(const char *relname,
                                            char relkind);
 
 /*
- * Routine to manage assignment of new relfilenode to a relation
+ * Routines to manage assignment of new relfilenode to a relation
  */
 extern void RelationSetNewRelfilenode(Relation relation, char persistence);
+extern void RelationAssumeNewRelfilenode(Relation relation);
 
 /*
  * Routines for flushing/rebuilding relcache entries in various scenarios
@@ -121,6 +122,11 @@ extern void RelationCacheInvalidate(void);
 
 extern void RelationCloseSmgrByOid(Oid relationId);
 
+#ifdef USE_ASSERT_CHECKING
+extern void AssertPendingSyncs_RelationCache(void);
+#else
+#define AssertPendingSyncs_RelationCache() do {} while (0)
+#endif
 extern void AtEOXact_RelationCache(bool isCommit);
 extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid,
                                       SubTransactionId parentSubid);
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..78d81e12d0
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,374 @@
+# Test WAL replay when some operation has skipped WAL.
+#
+# These tests exercise code that once violated the mandate described in
+# src/backend/access/transam/README section "Skipping WAL for New
+# RelFileNode".  The tests work by committing some transactions, initiating an
+# immediate shutdown, and confirming that the expected data survives recovery.
+# For many years, individual commands made the decision to skip WAL, hence the
+# frequent appearance of COPY in these tests.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 34;
+
+sub check_orphan_relfilenodes
+{
+    my ($node, $test_name) = @_;
+
+    my $db_oid = $node->safe_psql('postgres',
+        "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+    my $prefix               = "base/$db_oid/";
+    my $filepaths_referenced = $node->safe_psql(
+        'postgres', "
+       SELECT pg_relation_filepath(oid) FROM pg_class
+       WHERE reltablespace = 0 AND relpersistence <> 't' AND
+       pg_relation_filepath(oid) IS NOT NULL;");
+    is_deeply(
+        [
+            sort(map { "$prefix$_" }
+                  grep(/^[0-9]+$/, slurp_dir($node->data_dir . "/$prefix")))
+        ],
+        [ sort split /\n/, $filepaths_referenced ],
+        $test_name);
+    return;
+}
+
+# We run this same test suite for both wal_level=minimal and replica.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf(
+        'postgresql.conf', qq(
+wal_level = $wal_level
+max_prepared_transactions = 1
+wal_log_hints = on
+wal_skip_threshold = 0
+#wal_debug = on
+));
+    $node->start;
+
+    # Setup
+    my $tablespace_dir = $node->basedir . '/tablespace_other';
+    mkdir($tablespace_dir);
+    $tablespace_dir = TestLib::perl2host($tablespace_dir);
+    $node->safe_psql('postgres',
+        "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+    # Test direct truncation optimization.  No tuples.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE trunc (id serial PRIMARY KEY);
+        TRUNCATE trunc;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM trunc;");
+    is($result, qq(0), "wal_level = $wal_level, TRUNCATE with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE trunc_ins (id serial PRIMARY KEY);
+        INSERT INTO trunc_ins VALUES (DEFAULT);
+        TRUNCATE trunc_ins;
+        INSERT INTO trunc_ins VALUES (DEFAULT);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM trunc_ins;");
+    is($result, qq(1), "wal_level = $wal_level, TRUNCATE INSERT");
+
+    # Same for prepared transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE twophase (id serial PRIMARY KEY);
+        INSERT INTO twophase VALUES (DEFAULT);
+        TRUNCATE twophase;
+        INSERT INTO twophase VALUES (DEFAULT);
+        PREPARE TRANSACTION 't';
+        COMMIT PREPARED 't';");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM twophase;");
+    is($result, qq(1), "wal_level = $wal_level, TRUNCATE INSERT PREPARE");
+
+    # Same with writing WAL at end of xact, instead of syncing.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        SET wal_skip_threshold = '1TB';
+        BEGIN;
+        CREATE TABLE noskip (id serial PRIMARY KEY);
+        INSERT INTO noskip VALUES (DEFAULT);
+        TRUNCATE noskip;
+        INSERT INTO noskip VALUES (DEFAULT);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM noskip;");
+    is($result, qq(1),
+        "wal_level = $wal_level, TRUNCATE with end-of-xact WAL");
+
+    # Data file for COPY query in subsequent tests
+    my $basedir   = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file(
+        $copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using both INSERT and COPY.  Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE ins_trunc (id serial PRIMARY KEY, id2 int);
+        INSERT INTO ins_trunc VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE ins_trunc;
+        INSERT INTO ins_trunc (id, id2) VALUES (DEFAULT, 10000);
+        COPY ins_trunc FROM '$copy_file' DELIMITER ',';
+        INSERT INTO ins_trunc (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM ins_trunc;");
+    is($result, qq(5), "wal_level = $wal_level, TRUNCATE COPY INSERT");
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after
+    # the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE trunc_copy (id serial PRIMARY KEY, id2 int);
+        INSERT INTO trunc_copy VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE trunc_copy;
+        COPY trunc_copy FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result =
+      $node->safe_psql('postgres', "SELECT count(*) FROM trunc_copy;");
+    is($result, qq(3), "wal_level = $wal_level, TRUNCATE COPY");
+
+    # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE spc_abort (id serial PRIMARY KEY, id2 int);
+        INSERT INTO spc_abort VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE spc_abort;
+        SAVEPOINT s;
+          ALTER TABLE spc_abort SET TABLESPACE other; ROLLBACK TO s;
+        COPY spc_abort FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_abort;");
+    is($result, qq(3),
+        "wal_level = $wal_level, SET TABLESPACE abort subtransaction");
+
+    # in different subtransaction patterns
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE spc_commit (id serial PRIMARY KEY, id2 int);
+        INSERT INTO spc_commit VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE spc_commit;
+        SAVEPOINT s; ALTER TABLE spc_commit SET TABLESPACE other; RELEASE s;
+        COPY spc_commit FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result =
+      $node->safe_psql('postgres', "SELECT count(*) FROM spc_commit;");
+    is($result, qq(3),
+        "wal_level = $wal_level, SET TABLESPACE commit subtransaction");
+
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE spc_nest (id serial PRIMARY KEY, id2 int);
+        INSERT INTO spc_nest VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE spc_nest;
+        SAVEPOINT s;
+            ALTER TABLE spc_nest SET TABLESPACE other;
+            SAVEPOINT s2;
+                ALTER TABLE spc_nest SET TABLESPACE pg_default;
+            ROLLBACK TO s2;
+            SAVEPOINT s2;
+                ALTER TABLE spc_nest SET TABLESPACE pg_default;
+            RELEASE s2;
+        ROLLBACK TO s;
+        COPY spc_nest FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_nest;");
+    is($result, qq(3),
+        "wal_level = $wal_level, SET TABLESPACE nested subtransaction");
+
+    $node->safe_psql(
+        'postgres', "
+        CREATE TABLE spc_hint (id int);
+        INSERT INTO spc_hint VALUES (1);
+        BEGIN;
+        ALTER TABLE spc_hint SET TABLESPACE other;
+        CHECKPOINT;
+        SELECT * FROM spc_hint;  -- set hint bit
+        INSERT INTO spc_hint VALUES (2);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_hint;");
+    is($result, qq(2), "wal_level = $wal_level, SET TABLESPACE, hint bit");
+
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE idx_hint (c int PRIMARY KEY);
+        SAVEPOINT q; INSERT INTO idx_hint VALUES (1); ROLLBACK TO q;
+        CHECKPOINT;
+        INSERT INTO idx_hint VALUES (1);  -- set index hint bit
+        INSERT INTO idx_hint VALUES (2);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->psql('postgres',);
+    my ($ret, $stdout, $stderr) =
+      $node->psql('postgres', "INSERT INTO idx_hint VALUES (2);");
+    is($ret, qq(3), "wal_level = $wal_level, unique index LP_DEAD");
+    like(
+        $stderr,
+        qr/violates unique/,
+        "wal_level = $wal_level, unique index LP_DEAD message");
+
+    # UPDATE touches two buffers for one row.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE upd (id serial PRIMARY KEY, id2 int);
+        INSERT INTO upd (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        COPY upd FROM '$copy_file' DELIMITER ',';
+        UPDATE upd SET id2 = id2 + 1;
+        DELETE FROM upd;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM upd;");
+    is($result, qq(0),
+        "wal_level = $wal_level, UPDATE touches two buffers for one row");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE ins_copy (id serial PRIMARY KEY, id2 int);
+        INSERT INTO ins_copy VALUES (DEFAULT, 1);
+        COPY ins_copy FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM ins_copy;");
+    is($result, qq(4), "wal_level = $wal_level, INSERT COPY");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE ins_trig (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION ins_trig_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO ins_trig
+                VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION ins_trig_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO ins_trig
+                VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER ins_trig_before_row_insert
+          BEFORE INSERT ON ins_trig
+          FOR EACH ROW EXECUTE PROCEDURE ins_trig_before_row_trig();
+        CREATE TRIGGER ins_trig_after_row_insert
+          AFTER INSERT ON ins_trig
+          FOR EACH ROW EXECUTE PROCEDURE ins_trig_after_row_trig();
+        COPY ins_trig FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM ins_trig;");
+    is($result, qq(9), "wal_level = $wal_level, COPY with INSERT triggers");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE trunc_trig (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION trunc_trig_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO trunc_trig VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION trunc_trig_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO trunc_trig VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER trunc_trig_before_stat_truncate
+          BEFORE TRUNCATE ON trunc_trig
+          FOR EACH STATEMENT EXECUTE PROCEDURE trunc_trig_before_stat_trig();
+        CREATE TRIGGER trunc_trig_after_stat_truncate
+          AFTER TRUNCATE ON trunc_trig
+          FOR EACH STATEMENT EXECUTE PROCEDURE trunc_trig_after_stat_trig();
+        INSERT INTO trunc_trig VALUES (DEFAULT, 1);
+        TRUNCATE trunc_trig;
+        COPY trunc_trig FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result =
+      $node->safe_psql('postgres', "SELECT count(*) FROM trunc_trig;");
+    is($result, qq(4),
+        "wal_level = $wal_level, TRUNCATE COPY with TRUNCATE triggers");
+
+    # Test redo of temp table creation.
+    $node->safe_psql(
+        'postgres', "
+        CREATE TEMP TABLE temp (id serial PRIMARY KEY, id2 text);");
+    $node->stop('immediate');
+    $node->start;
+    check_orphan_relfilenodes($node,
+        "wal_level = $wal_level, no orphan relfilenode remains");
+
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 4dd3507c99..57767bab6d 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -1984,6 +1984,12 @@ select * from another;
 (3 rows)
 
 drop table another;
+-- Create an index that skips WAL, then perform a SET DATA TYPE that skips
+-- rewriting the index.
+begin;
+create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
+alter table skip_wal_skip_rewrite_index alter c type varchar(20);
+commit;
 -- table's row type
 create table tab1 (a int, b text);
 create table tab2 (x int, y tab1);
diff --git a/src/test/regress/sql/alter_table.sql b/src/test/regress/sql/alter_table.sql
index a16e4c9a29..e11399f2cd 100644
--- a/src/test/regress/sql/alter_table.sql
+++ b/src/test/regress/sql/alter_table.sql
@@ -1360,6 +1360,13 @@ select * from another;
 
 drop table another;
 
+-- Create an index that skips WAL, then perform a SET DATA TYPE that skips
+-- rewriting the index.
+begin;
+create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
+alter table skip_wal_skip_rewrite_index alter c type varchar(20);
+commit;
+
 -- table's row type
 create table tab1 (a int, b text);
 create table tab2 (x int, y tab1);
-- 
2.18.2

From 1ccd187820b20378059e4c0346ee5c8623c007d5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 14 Jan 2020 19:24:04 +0900
Subject: [PATCH v34 2/4] Fix the defect 1

Pending sync is lost by the followig sequence.

  begin;
  create table t (c int);
  savepoint q; drop table t; rollback to q;  -- forgets table is skipping wal
  commit;  -- assertion failure

Relcache entry for a dropped relation is deleted right out. On the
other hand we need the newness information holded in the dropped entry
if the subtransaction is rolled back later. So this patch makes
relcache entry preserved after dropping of a relation that any newness
flag is active, so that it is available later in the current
transaction.
---
 src/backend/utils/cache/relcache.c | 170 +++++++++++++++++++++++++----
 src/include/utils/rel.h            |   2 +
 src/include/utils/relcache.h       |   1 +
 3 files changed, 150 insertions(+), 23 deletions(-)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 0ac72572e3..ec1e501b4d 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -18,6 +18,7 @@
  *        RelationCacheInitializePhase2    - initialize shared-catalog entries
  *        RelationCacheInitializePhase3    - finish initializing relcache
  *        RelationIdGetRelation            - get a reldesc by relation id
+ *        RelationIdGetRelationCache        - get a relcache entry by relation id
  *        RelationClose                    - close an open relation
  *
  * NOTES
@@ -1094,6 +1095,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_droppedSubid = InvalidSubTransactionId;
     switch (relation->rd_rel->relpersistence)
     {
         case RELPERSISTENCE_UNLOGGED:
@@ -1991,6 +1993,13 @@ RelationIdGetRelation(Oid relationId)
 
     if (RelationIsValid(rd))
     {
+        /* return NULL for dropped relatoins */
+        if (rd->rd_droppedSubid != InvalidSubTransactionId)
+        {
+            Assert (!rd->rd_isvalid);
+            return NULL;
+        }
+
         RelationIncrementReferenceCount(rd);
         /* revalidate cache entry if necessary */
         if (!rd->rd_isvalid)
@@ -2030,6 +2039,31 @@ RelationIdGetRelation(Oid relationId)
     return rd;
 }
 
+/*
+ * RelationIdGetRelationCache: returns an entry exists in relcache.
+ *
+ * This function returns NULL not building new one if no existing entries
+ * found, and may return an invalid or dropped-but-not-commited entry if any.
+ *
+ * This function is intended to be used to lookup the relcache entriy for a
+ * dropped relation that entry is preserved for pending sync.
+ */
+Relation
+RelationIdGetRelationCache(Oid relationId)
+{
+    Relation    rd;
+
+    /* Make sure we're in an xact, even if this ends up being a cache hit */
+    Assert(IsTransactionState());
+
+    RelationIdCacheLookup(relationId, rd);
+
+    if (RelationIsValid(rd))
+        RelationIncrementReferenceCount(rd);
+
+    return rd;
+}
+
 /* ----------------------------------------------------------------
  *                cache invalidation support routines
  * ----------------------------------------------------------------
@@ -2134,10 +2168,11 @@ RelationReloadIndexInfo(Relation relation)
     HeapTuple    pg_class_tuple;
     Form_pg_class relp;
 
-    /* Should be called only for invalidated indexes */
+    /* Should be called only for invalidated living indexes */
     Assert((relation->rd_rel->relkind == RELKIND_INDEX ||
             relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX) &&
-           !relation->rd_isvalid);
+           !relation->rd_isvalid &&
+           relation->rd_droppedSubid == InvalidSubTransactionId);
 
     /* Ensure it's closed at smgr level */
     RelationCloseSmgr(relation);
@@ -2438,12 +2473,14 @@ RelationClearRelation(Relation relation, bool rebuild)
      * have valid index support information.  This avoids problems with active
      * use of the index support information.  As with nailed indexes, we
      * re-read the pg_class row to handle possible physical relocation of the
-     * index, and we check for pg_index updates too.
+     * index, and we check for pg_index updates too.  Relations with valid
+     * rd_droppedSubid doesn't have the corresponding catalog entry,
      */
     if ((relation->rd_rel->relkind == RELKIND_INDEX ||
          relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX) &&
         relation->rd_refcnt > 0 &&
-        relation->rd_indexcxt != NULL)
+        relation->rd_indexcxt != NULL &&
+        relation->rd_droppedSubid == InvalidSubTransactionId)
     {
         relation->rd_isvalid = false;    /* needs to be revalidated */
         if (IsTransactionState())
@@ -2462,6 +2499,25 @@ RelationClearRelation(Relation relation, bool rebuild)
      */
     if (!rebuild)
     {
+        /*
+         * If pending sync is active, the entry is still needed.  Mark the
+         * relcache as "dropped" and leave it live invalid.
+         */
+        if (relation->rd_createSubid != InvalidSubTransactionId ||
+            relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
+        {
+            if (relation->rd_droppedSubid == InvalidSubTransactionId)
+                relation->rd_droppedSubid = GetCurrentSubTransactionId();
+            else
+            {
+                /* shouldn't try to change it */
+                Assert (relation->rd_droppedSubid ==
+                        GetCurrentSubTransactionId());
+            }
+
+            return;
+        }
+
         /* Remove it from the hash table */
         RelationCacheDelete(relation);
 
@@ -2546,6 +2602,26 @@ RelationClearRelation(Relation relation, bool rebuild)
             if (HistoricSnapshotActive())
                 return;
 
+            /*
+             * Although this relation is already dropped from the catalog, the
+             * relcache entry is still needed if pending sync is active.  Mark
+             * the relcache as "dropped" and leave it live invalid.
+             */
+            if (relation->rd_createSubid != InvalidSubTransactionId ||
+                relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
+            {
+                if (relation->rd_droppedSubid == InvalidSubTransactionId)
+                    relation->rd_droppedSubid = GetCurrentSubTransactionId();
+                else
+                {
+                    /* shouldn't try to change it */
+                    Assert (relation->rd_droppedSubid ==
+                            GetCurrentSubTransactionId());
+                }
+
+                return;
+            }
+
             /*
              * This shouldn't happen as dropping a relation is intended to be
              * impossible if still referenced (cf. CheckTableNotInUse()). But
@@ -2598,6 +2674,7 @@ RelationClearRelation(Relation relation, bool rebuild)
         SWAPFIELD(SubTransactionId, rd_createSubid);
         SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
         SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
+        SWAPFIELD(SubTransactionId, rd_droppedSubid);
         /* un-swap rd_rel pointers, swap contents instead */
         SWAPFIELD(Form_pg_class, rd_rel);
         /* ... but actually, we don't have to update newrel->rd_rel */
@@ -2991,7 +3068,20 @@ AssertPendingSyncs_RelationCache(void)
 
     hash_seq_init(&status, RelationIdCache);
     while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
-        AssertPendingSyncConsistency(idhentry->reldesc);
+    {
+        Relation r = idhentry->reldesc;
+
+        /* Ignore relcache entries of deleted relations */
+        if (r->rd_droppedSubid != InvalidSubTransactionId)
+        {
+            Assert(!r->rd_isvalid &&
+                   (r->rd_createSubid != InvalidSubTransactionId ||
+                    r->rd_firstRelfilenodeSubid != InvalidSubTransactionId));
+            continue;
+        }
+
+        AssertPendingSyncConsistency(r);
+    }
 
     for (i = 0; i < nrels; i++)
         RelationClose(rels[i]);
@@ -3081,6 +3171,8 @@ AtEOXact_RelationCache(bool isCommit)
 static void
 AtEOXact_cleanup(Relation relation, bool isCommit)
 {
+    bool clear_relcache = false;
+
     /*
      * The relcache entry's ref count should be back to its normal
      * not-in-a-transaction state: 0 unless it's nailed in cache.
@@ -3106,17 +3198,31 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
 #endif
 
     /*
-     * Is it a relation created in the current transaction?
+     * Is the relation lives after this transction ends?
      *
-     * During commit, reset the flag to zero, since we are now out of the
-     * creating transaction.  During abort, simply delete the relcache entry
-     * --- it isn't interesting any longer.
+     * During commit, clear the relcache entry if it is preserved after
+     * relation drop, in order not to make the entry left orphaned. During
+     * rollback, clear the relacache entry if the relation is created in the
+     * current transaction since it isn't interesting any longer once we are
+     * out of the transaction.
      */
-    if (relation->rd_createSubid != InvalidSubTransactionId)
+    clear_relcache =
+        (isCommit ?
+         relation->rd_droppedSubid != InvalidSubTransactionId :
+         relation->rd_createSubid != InvalidSubTransactionId);
+            
+    /*
+     * Since we are now out of the transaction, reset the flags to zero.
+     * That also lets RelationClearRelation drop the relcache entry.
+     */
+    relation->rd_createSubid = InvalidSubTransactionId;
+    relation->rd_droppedSubid = InvalidSubTransactionId;
+    relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+
+    if (clear_relcache)
     {
-        if (isCommit)
-            relation->rd_createSubid = InvalidSubTransactionId;
-        else if (RelationHasReferenceCountZero(relation))
+        if (RelationHasReferenceCountZero(relation))
         {
             RelationClearRelation(relation, false);
             return;
@@ -3131,17 +3237,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
              * eventually.  This must be just a WARNING to avoid
              * error-during-error-recovery loops.
              */
-            relation->rd_createSubid = InvalidSubTransactionId;
             elog(WARNING, "cannot remove relcache entry for \"%s\" because it has nonzero refcount",
                  RelationGetRelationName(relation));
         }
     }
-
-    /*
-     * Likewise, reset any record of the relfilenode being new.
-     */
-    relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
-    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 }
 
 /*
@@ -3205,15 +3304,24 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
     /*
      * Is it a relation created in the current subtransaction?
      *
-     * During subcommit, mark it as belonging to the parent, instead. During
-     * subabort, simply delete the relcache entry.
+     * During subcommit, mark it as belonging to the parent, instead, as long
+     * as it has not been dropped. Otherwise simply delete the relcache entry.
+     * --- it isn't interesting any longer.
      */
     if (relation->rd_createSubid == mySubid)
     {
-        if (isCommit)
+        /*
+         * Valid rd_droppedSubid means the corresponding relation is dropped
+         * but the relcache entry is preserved for at-commit pending sync.
+         * We need to drop it explicitly here not to make the entry orphan.
+         */
+        if (isCommit && relation->rd_droppedSubid != mySubid)
             relation->rd_createSubid = parentSubid;
         else if (RelationHasReferenceCountZero(relation))
         {
+            /* allow the entry to be removed */
+            relation->rd_createSubid = InvalidSubTransactionId;
+            relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
             RelationClearRelation(relation, false);
             return;
         }
@@ -3232,6 +3340,22 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         }
     }
 
+    /*
+     * Have the relation that got a new relfilenode in the current transaction
+     * been dropped?
+     *
+     * If this relation registered pending sync then dropped, subxact rollback
+     * cancels the uncommitted drop, and commit propagates it to the parent.
+     */
+    if (relation->rd_droppedSubid == mySubid)
+    {
+        if (isCommit)
+            relation->rd_droppedSubid = parentSubid;
+        else
+            relation->rd_droppedSubid = InvalidSubTransactionId;
+    }
+
+
     /*
      * Likewise, update or drop any new-relfilenode-in-subtransaction record.
      */
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index ad72a8b910..0b87bc3222 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -98,6 +98,8 @@ typedef struct RelationData
                                                  * rd_node to current value */
     SubTransactionId rd_firstRelfilenodeSubid;    /* highest subxact changing
                                                  * rd_node to any value */
+    SubTransactionId rd_droppedSubid;    /* in-transaction created rel has been
+                                         * dropped */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 62239a09e8..0362b6f6ff 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -37,6 +37,7 @@ typedef Relation *RelationPtr;
  * Routines to open (lookup) and close a relcache entry
  */
 extern Relation RelationIdGetRelation(Oid relationId);
+extern Relation RelationIdGetRelationCache(Oid relationId);
 extern void RelationClose(Relation relation);
 
 /*
-- 
2.18.2

From be8522ec748eb8210c7e8c1b0b2f5e0358790b23 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 21 Jan 2020 18:12:52 +0900
Subject: [PATCH v34 3/4] Fix the defect 2

Pass newness flags to new index relation inherits the old relfilenode
whie ALTER TABLE ALTER TYPE.

The command may reuse existing index and the reused index may be
created in the current transaction.  Pass the information to the
relcache of the new index relation so that pending sync correctly
works. This relies on the relcache preserving feature introduced by
the previos fix.
---
 src/backend/commands/tablecmds.c | 32 +++++++++++++++++++++++++++++---
 src/backend/nodes/copyfuncs.c    |  1 +
 src/backend/nodes/equalfuncs.c   |  1 +
 src/backend/nodes/outfuncs.c     |  1 +
 src/include/nodes/parsenodes.h   |  1 +
 5 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index f706d6856f..529d47af67 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -7696,12 +7696,35 @@ ATExecAddIndex(AlteredTableInfo *tab, Relation rel,
      * this index will have scheduled the storage for deletion at commit, so
      * cancel that pending deletion.
      */
+    Assert (OidIsValid(stmt->oldNode) == OidIsValid(stmt->oldRelId));
     if (OidIsValid(stmt->oldNode))
     {
-        Relation    irel = index_open(address.objectId, NoLock);
+        Relation    newirel = index_open(address.objectId, NoLock);
+        Relation    oldirel = RelationIdGetRelationCache(stmt->oldRelId);
 
-        RelationPreserveStorage(irel->rd_node, true);
-        index_close(irel, NoLock);
+        RelationPreserveStorage(newirel->rd_node, true);
+
+        /*
+         * oidirel is valid iff the old relation was created then dropped in
+         * the current transaction.  We need to copy the newness hints other
+         * than rd_droppedSubid corresponding to the reused relfilenode in that
+         * case.
+         */
+        if (oldirel != NULL)
+        {
+            Assert(!oldirel->rd_isvalid &&
+                   oldirel->rd_createSubid != InvalidSubTransactionId &&
+                   oldirel->rd_droppedSubid != InvalidSubTransactionId);
+
+            newirel->rd_createSubid = oldirel->rd_createSubid;
+            newirel->rd_firstRelfilenodeSubid =
+                oldirel->rd_firstRelfilenodeSubid;
+            newirel->rd_newRelfilenodeSubid =
+                oldirel->rd_newRelfilenodeSubid;
+
+            RelationClose(oldirel);
+        }
+        index_close(newirel, NoLock);
     }
 
     return address;
@@ -11980,7 +12003,10 @@ TryReuseIndex(Oid oldId, IndexStmt *stmt)
 
         /* If it's a partitioned index, there is no storage to share. */
         if (irel->rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
+        {
             stmt->oldNode = irel->rd_node.relNode;
+            stmt->oldRelId = irel->rd_id;
+        }
         index_close(irel, NoLock);
     }
 }
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 54ad62bb7f..0e621d74d4 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -3477,6 +3477,7 @@ _copyIndexStmt(const IndexStmt *from)
     COPY_STRING_FIELD(idxcomment);
     COPY_SCALAR_FIELD(indexOid);
     COPY_SCALAR_FIELD(oldNode);
+    COPY_SCALAR_FIELD(oldRelId);
     COPY_SCALAR_FIELD(unique);
     COPY_SCALAR_FIELD(primary);
     COPY_SCALAR_FIELD(isconstraint);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 5b1ba143b1..5740b6890b 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1345,6 +1345,7 @@ _equalIndexStmt(const IndexStmt *a, const IndexStmt *b)
     COMPARE_STRING_FIELD(idxcomment);
     COMPARE_SCALAR_FIELD(indexOid);
     COMPARE_SCALAR_FIELD(oldNode);
+    COMPARE_SCALAR_FIELD(oldRelId);
     COMPARE_SCALAR_FIELD(unique);
     COMPARE_SCALAR_FIELD(primary);
     COMPARE_SCALAR_FIELD(isconstraint);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index d76fae44b8..bcbdb29ccb 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2650,6 +2650,7 @@ _outIndexStmt(StringInfo str, const IndexStmt *node)
     WRITE_STRING_FIELD(idxcomment);
     WRITE_OID_FIELD(indexOid);
     WRITE_OID_FIELD(oldNode);
+    WRITE_OID_FIELD(oldRelId);
     WRITE_BOOL_FIELD(unique);
     WRITE_BOOL_FIELD(primary);
     WRITE_BOOL_FIELD(isconstraint);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index da0706add5..b114d7a772 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2782,6 +2782,7 @@ typedef struct IndexStmt
     char       *idxcomment;        /* comment to apply to index, or NULL */
     Oid            indexOid;        /* OID of an existing index, if any */
     Oid            oldNode;        /* relfilenode of existing storage, if any */
+    Oid            oldRelId;        /* relid of the old index, if any */
     bool        unique;            /* is index unique? */
     bool        primary;        /* is index a primary key? */
     bool        isconstraint;    /* is it for a pkey/unique constraint? */
-- 
2.18.2

From 82beb3f17ca7c9fc5b9260b1e19e138a059e2b77 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 16 Jan 2020 13:24:27 +0900
Subject: [PATCH v34 4/4] Fix the defect 3

Force file sync if the file has been truncated.

The previous verision of the patch allowed to choose WAL when main
fork is larger than ever.  But there's a case where FSM fork gets
shorter while main fork is larger than ever.  Force file sync always
when the file has experienced a truncation.
---
 src/backend/catalog/storage.c | 56 +++++++++++++++++++----------------
 1 file changed, 31 insertions(+), 25 deletions(-)

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 8253c420ef..447fb606e5 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -68,7 +68,7 @@ typedef struct PendingRelDelete
 typedef struct pendingSync
 {
     RelFileNode rnode;
-    BlockNumber max_truncated;
+    bool        is_truncated;    /* Has the file experienced truncation? */
 } pendingSync;
 
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
@@ -154,7 +154,7 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
 
         pending = hash_search(pendingSyncHash, &rnode, HASH_ENTER, &found);
         Assert(!found);
-        pending->max_truncated = 0;
+        pending->is_truncated = false;
     }
 
     return srel;
@@ -388,13 +388,7 @@ RelationPreTruncate(Relation rel)
     /* Record largest maybe-unsynced block of files under tracking  */
     pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
                           HASH_FIND, NULL);
-    if (pending)
-    {
-        BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
-
-        if (pending->max_truncated < nblocks)
-            pending->max_truncated = nblocks;
-    }
+    pending->is_truncated = true;
 }
 
 /*
@@ -637,31 +631,43 @@ smgrDoPendingSyncs(bool isCommit)
          * Small WAL records have a chance to be emitted along with other
          * backends' WAL records.  We emit WAL records instead of syncing for
          * files that are smaller than a certain threshold, expecting faster
-         * commit.  The threshold is defined by the GUC wal_skip_threshold.
+         * commit.  The threshold is defined by the GUC wal_skip_threshold.  We
+         * don't bother counting the pages when the file has experienced a
+         * truncation.
          */
-        for (fork = 0; fork <= MAX_FORKNUM; fork++)
+        if (!pendingsync->is_truncated)
         {
-            if (smgrexists(srel, fork))
+            for (fork = 0; fork <= MAX_FORKNUM; fork++)
             {
-                BlockNumber n = smgrnblocks(srel, fork);
+                if (smgrexists(srel, fork))
+                {
+                    BlockNumber n = smgrnblocks(srel, fork);
 
-                /* we shouldn't come here for unlogged relations */
-                Assert(fork != INIT_FORKNUM);
-
-                nblocks[fork] = n;
-                total_blocks += n;
+                    /* we shouldn't come here for unlogged relations */
+                    Assert(fork != INIT_FORKNUM);
+                    nblocks[fork] = n;
+                    total_blocks += n;
+                }
+                else
+                    nblocks[fork] = InvalidBlockNumber;
             }
-            else
-                nblocks[fork] = InvalidBlockNumber;
         }
 
         /*
-         * Sync file or emit WAL records for its contents.  Do file sync if
-         * the size is larger than the threshold or truncates may have removed
-         * blocks beyond the current size.
+         * Sync file or emit WAL records for its contents.
+         *
+         * Alghough we emit WAL record if the file is small enough, do file
+         * sync regardless of the size if the file has experienced a
+         * truncation. It is because the file would be followed by trailing
+         * garbage blocks after a crash recovery if, while a past longer file
+         * had been flushed out, we omitted syncing-out of the file and emit
+         * WAL instead.  You might think that we could choose WAL if the
+         * current main fork is longer than ever, but there's a case where main
+         * fork is longer than ever but FSM fork gets shorter. We don't bother
+         * checking that for every fork.
          */
-        if (total_blocks * BLCKSZ / 1024 >= wal_skip_threshold ||
-            nblocks[MAIN_FORKNUM] < pendingsync->max_truncated)
+        if (pendingsync->is_truncated ||
+            total_blocks * BLCKSZ / 1024 >= wal_skip_threshold)
         {
             /* allocate the initial array, or extend it, if needed */
             if (maxrels == 0)
-- 
2.18.2

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Thomas Munro

Дата:

18 февраля 2020 г., 05:56:15

On Mon, Jan 27, 2020 at 11:30 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> Hello, this is rebased then addressed version.

Hi, I haven't followed this thread but I just noticed this strange
looking failure:

CREATE TYPE priv_testtype1 AS (a int, b text);
+ERROR: relation 24844 deleted while still in use
REVOKE USAGE ON TYPE priv_testtype1 FROM PUBLIC;

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.79923

It didn't fail on the same OS a couple of days earlier:

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/builds/30829686

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

18 февраля 2020 г., 11:53:37

On Tue, Feb 18, 2020 at 03:56:15PM +1300, Thomas Munro wrote:
> CREATE TYPE priv_testtype1 AS (a int, b text);
> +ERROR: relation 24844 deleted while still in use
> REVOKE USAGE ON TYPE priv_testtype1 FROM PUBLIC;
> 
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.79923
> 
> It didn't fail on the same OS a couple of days earlier:
> 
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/builds/30829686

Thanks for the report.  This reproduces consistently under
CLOBBER_CACHE_ALWAYS (which, coincidentally, I started today).  Removing the
heap_create() change fixes it.  Since we now restore a saved rd_createSubid,
the heap_create() change is obsolete.  My next version will include that fix.

The system uses rd_createSubid to mean two things.  First, rd_node is new.
Second, the rel might not yet be in catalogs, so we can't rebuild its relcache
entry.  The first can be false while the second is true, hence this failure.
However, the second is true in a relatively-narrow period in which we don't
run arbitrary user code.  Hence, that simple fix suffices.

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

18 февраля 2020 г., 13:50:16

Oops. I played on a wrong branch and got stuck in slow build on
Windows...

At Tue, 18 Feb 2020 00:53:37 -0800, Noah Misch <noah@leadboat.com> wrote in 
> On Tue, Feb 18, 2020 at 03:56:15PM +1300, Thomas Munro wrote:
> > CREATE TYPE priv_testtype1 AS (a int, b text);
> > +ERROR: relation 24844 deleted while still in use
> > REVOKE USAGE ON TYPE priv_testtype1 FROM PUBLIC;
> > 
> > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.79923
> > 
> > It didn't fail on the same OS a couple of days earlier:
> > 
> > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/builds/30829686
> 
> Thanks for the report.  This reproduces consistently under
> CLOBBER_CACHE_ALWAYS (which, coincidentally, I started today).  Removing the
> heap_create() change fixes it.  Since we now restore a saved rd_createSubid,
> the heap_create() change is obsolete.  My next version will include that fix.

Yes, ATExecAddIndex correctly set createSubid without that.

> The system uses rd_createSubid to mean two things.  First, rd_node is new.
> Second, the rel might not yet be in catalogs, so we can't rebuild its relcache
> entry.  The first can be false while the second is true, hence this failure.
> However, the second is true in a relatively-narrow period in which we don't
> run arbitrary user code.  Hence, that simple fix suffices.

I didn't care the second meaning. I thought it is caused by
invalidation but I couldn't get a core dump on Windows 10.. The
comment for RelationCacheInvalidate seems faintly explains about the
second meaning.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

19 февраля 2020 г., 10:44:52

I think attached v35nm is ready for commit to master.  Would anyone like to
talk me out of back-patching this?  I would not enjoy back-patching it, but
it's hard to justify lack of back-patch for a data-loss bug.

Notable changes since v34:

- Separate a few freestanding fixes into their own patches.

On Mon, Jan 27, 2020 at 07:28:31PM +0900, Kyotaro Horiguchi wrote:
> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c
> @@ -388,13 +388,7 @@ RelationPreTruncate(Relation rel)
>      /* Record largest maybe-unsynced block of files under tracking  */
>      pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
>                            HASH_FIND, NULL);
> -    if (pending)
> -    {
> -        BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
> -
> -        if (pending->max_truncated < nblocks)
> -            pending->max_truncated = nblocks;
> -    }
> +    pending->is_truncated = true;

- Fix this crashing when "pending" is NULL, as it is in this test case:

  begin;
  create temp table t ();
  create table t2 ();  -- cause pendingSyncHash to exist
  truncate t;
  rollback;

- Fix the "deleted while still in use" problem that Thomas Munro reported, by
  removing the heap_create() change.  Restoring the saved rd_createSubid had
  made obsolete the heap_create() change.  check-world now passes with
  wal_level=minimal and CLOBBER_CACHE_ALWAYS.

- Set rd_droppedSubid in RelationForgetRelation(), not
  RelationClearRelation().  RelationForgetRelation() knows it is processing a
  drop, but RelationClearRelation() could only infer that from circumstantial
  evidence.  This seems more future-proof to me.

- When reusing an index build, instead of storing the dropped relid in the
  IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), store
  the subid fields in the IndexStmt.  This is less code, and I felt
  RelationIdGetRelationCache() invited misuse.

On Thu, Feb 27, 2020 at 04:00:24PM +0900, Kyotaro Horiguchi wrote:
> At Tue, 25 Feb 2020 21:36:12 -0800, Noah Misch <noah@leadboat.com> wrote in 
> > On Tue, Feb 25, 2020 at 10:01:51AM +0900, Kyotaro Horiguchi wrote:
> > > At Sat, 22 Feb 2020 21:12:20 -0800, Noah Misch <noah@leadboat.com> wrote in 
> > > > On Fri, Feb 21, 2020 at 04:49:59PM +0900, Kyotaro Horiguchi wrote:
> > > If we decide to keep the consistency there, I would like to describe
> > > the code is there for consistency, not for the benefit of a specific
> > > assertion.
> > > 
> > > (cluster.c:1116)
> > > -    * new. The next step for rel2 is deletion, but copy rd_*Subid for the
> > > -    * benefit of AssertPendingSyncs_RelationCache().
> > > +    * new. The next step for rel2 is deletion, but copy rd_*Subid for the
> > > +    * consistency of the fieles. It is checked later by
> > > +    * AssertPendingSyncs_RelationCache().
> > 
> > I think the word "consistency" is too vague for "consistency of the fields" to
> > convey information.  May I just remove the last sentence of the comment
> > (everything after "* new.")?
> 
> I'm fine with that:)
> 
> > > I agree that relation works as the generic name of table-like
> > > objects. Addition to that, doesn't using the word "storage file" make
> > > it more clearly?  I'm not confident on the wording itself, but it will
> > > look like the following.
> > 
> > The docs rarely use "storage file" or "on-disk file" as terms.  I hesitate to
> > put more emphasis on files, because they are part of the implementation, not
> > part of the user interface.  The term "rewrites"/"rewriting" has the same
> > problem, though.  Yet another alternative would be to talk about operations
> > that change the pg_relation_filenode() return value:
> > 
> >   In <literal>minimal</literal> level, no information is logged for permanent
> >   relations for the remainder of a transaction that creates them or changes
> >   what <function>pg_relation_filenode</function> returns for them.
> > 
> > What do you think?
> 
> It sounds somewhat obscure.

I see.  I won't use that.

> Coulnd't we enumetate examples? And if we
> could use pg_relation_filenode, I think we can use just
> "filenode". (Thuogh the word is used in the documentation, it is not
> defined anywhere..)

func.sgml does define the term.  Nonetheless, I'm not using it.

> ====
> In <literal>minimal</literal> level, no information is logged for
> permanent relations for the remainder of a transaction that creates
> them or changes their <code>filenode</code>. For example, CREATE
> TABLE, CLUSTER or REFRESH MATERIALIZED VIEW are the command of that
> category.
> ====
> 
> # sorry for bothering you..

Including examples is fine.  Attached v36nm has just comment and doc changes.
Would you translate this into back-patch versions for v9.5 through v12?

Вложения

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

02 марта 2020 г., 10:53:53

At Sun, 1 Mar 2020 11:56:32 -0800, Noah Misch <noah@leadboat.com> wrote in 
> On Thu, Feb 27, 2020 at 04:00:24PM +0900, Kyotaro Horiguchi wrote:
> > It sounds somewhat obscure.
> 
> I see.  I won't use that.

Thanks.

> > Coulnd't we enumetate examples? And if we
> > could use pg_relation_filenode, I think we can use just
> > "filenode". (Thuogh the word is used in the documentation, it is not
> > defined anywhere..)
> 
> func.sgml does define the term.  Nonetheless, I'm not using it.

Ah, "The filenode is the base component oif the file name(s) used for
the relation".. So it's very similar to "on-disk file" in a sense.

> > ====
> > In <literal>minimal</literal> level, no information is logged for
> > permanent relations for the remainder of a transaction that creates
> > them or changes their <code>filenode</code>. For example, CREATE
> > TABLE, CLUSTER or REFRESH MATERIALIZED VIEW are the command of that
> > category.
> > ====
> > 
> > # sorry for bothering you..
> 
> Including examples is fine.  Attached v36nm has just comment and doc changes.
> Would you translate this into back-patch versions for v9.5 through v12?

The explicit list of commands that initiate the WAL-skipping mode
works for me. I'm going to work on the tranlation right now.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

04 марта 2020 г., 10:29:19

Hello.

The attached is back-patches from 9.5 through master.

At Mon, 02 Mar 2020 16:53:53 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > Would you translate this into back-patch versions for v9.5 through v12?
> 
> The explicit list of commands that initiate the WAL-skipping mode
> works for me. I'm going to work on the tranlation right now.

At first I fixed several ssues in 018_wal_optimize.pl:

- TRUNCATE INSERT, TRUNCATE INSERT PREPARE

 It wrongly passes if finally we see the value only from the first
 INSERT. I changed it so that it checks the value, not the number of
 values.

- TRUNCATE with end-of-xact WAL => lengty end-of-xact WAL

 TRUNCATE inhibits end-of-xact WAL so I removed the TRUNCATE.  It uses
 only 1 page so it fails to excercise multi-page behavior of
 log_newpage_range.  At least 33 pages is needed to check if it is
 working correctly.  10000 rows is sufficient but I choosed 20000 rows
 including margin.

- COPY with INSERT tirggers
 It wrongly referes to OLD in AFTER-INSERT trigger. It yeilds NULL
 for 11 and later, or ends in ERROR otherwise.  Addition to that
 AFTER-INSERT ROW-level trigger is fired after *stagtement* (but
 before AFTER-INSERT statement level triggers).  That being said, it
 doesn't affect the result of the test so I leave it with modifying it
 not to refer to OLD.

log_newpage_range has been introduced at PG12.  Fortunately the
required infrastructure is introduced at PG9.5 so what I need to do
for PG95-PG11 is back-patching the function and its counter part in
xlog_redo. It doen't WAL format itself but XLOG_FPI gets to have 2 or
more backup pages so the compatibility is forward only. That is, newer
minor versions read WAL from older minor versions, but not vise
versea.  I'm not sure it is back-patchable so in the attached the
end-of-xact WAL feature is separated for PG9.5-PG11.
(000x-Add-end-of-xact-WAL-feature-of-WAL-skipping.patch)

====

In the patchset for 12, I let the functions heap_sync,
heapam_methods.finish_bulk_insert and table_finish_bulk_insert left
as-is.  As the result heapam_finish_bulk_insert becomes no-op.
begin_heap_rewrite is a public function but the last parameter is
useless and rather harmful as it looks as if it works. So I removed
the parameter.
  
For 11 and 10, heap_sync and begin_heap_rewrite is treated the same
way to 12.

For 9.6, mdexists() creates the specified file while bootstrap mode
and that leads to assertion failure of smgrDoPendingSyncs. So I made
CreateStorage not register pending sync while bootstrap mode.
gistbuild generates the LSN for root page of a newly created index
using gistGetFakeLSN(heap), which fires assertion failure in
gistGetFakeLSN.  I think we should use index instead of heap there,
but it doesn't matter if we don't have the new pending sync mechanism,
so I didn't split it as a separate patch.  pg_visibility doesn't have
regression test but I added the files conatining only the test for
this feature.

For 9.5, pg_visibility does not exist so I dropped the test for the
module.  It lacks a part of TAP infrastructure nowadays we have, but I
want to have the test (and it actually found a bug I made during this
work). So I added a patch to back-patch TestLib.pm, PostgresNode.pm
and RecursiveCopy.pm along with 018_wal_optimize.pl.
(0004-Add-TAP-test-for-WAL-skipping-feature.patch)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Вложения

wal_skip_optimize_patchset_20200304.tar.gz

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

04 марта 2020 г., 10:44:25

Some fixes..

At Wed, 04 Mar 2020 16:29:19 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At first I fixed several ssues in 018_wal_optimize.pl:
> 
> - TRUNCATE INSERT, TRUNCATE INSERT PREPARE
> 
>  It wrongly passes if finally we see the value only from the first
>  INSERT. I changed it so that it checks the value, not the number of
>  values.

Finally it checks both number of values and the largest value.

...
> log_newpage_range has been introduced at PG12.  Fortunately the
> required infrastructure is introduced at PG9.5 so what I need to do
> for PG95-PG11 is back-patching the function and its counter part in
- xlog_redo. It doen't WAL format itself but XLOG_FPI gets to have 2 or
+ xlog_redo. It doen't change WAL format itself but XLOG_FPI gets to have 2 or
> more backup pages so the compatibility is forward only. That is, newer
> minor versions read WAL from older minor versions, but not vise
> versea.  I'm not sure it is back-patchable so in the attached the
> end-of-xact WAL feature is separated for PG9.5-PG11.
> (000x-Add-end-of-xact-WAL-feature-of-WAL-skipping.patch)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

16 марта 2020 г., 06:46:47

On Wed, Mar 04, 2020 at 04:29:19PM +0900, Kyotaro Horiguchi wrote:
> The attached is back-patches from 9.5 through master.

Thanks.  I've made some edits.  I'll plan to push the attached patches on
Friday or Saturday.

> log_newpage_range has been introduced at PG12.  Fortunately the
> required infrastructure is introduced at PG9.5 so what I need to do
> for PG95-PG11 is back-patching the function and its counter part in
> xlog_redo. It doen't WAL format itself but XLOG_FPI gets to have 2 or
> more backup pages so the compatibility is forward only. That is, newer
> minor versions read WAL from older minor versions, but not vise
> versea.  I'm not sure it is back-patchable so in the attached the
> end-of-xact WAL feature is separated for PG9.5-PG11.
> (000x-Add-end-of-xact-WAL-feature-of-WAL-skipping.patch)

The main patch's introduction of XLOG_GIST_ASSIGN_LSN already creates a WAL
upgrade hazard.  Changing XLOG_FPI is riskier, because an old server will
apply the first FPI and ignore the rest.  For v11 and earlier, I decided to
introduce XLOG_FPI_MULTI.  It behaves exactly like XLOG_FPI, but this PANICs
if one reads post-update WAL with a pre-update server.  The main alternative
would be to issue one XLOG_FPI per page, but I was concerned that would cause
a notable performance loss.

> In the patchset for 12, I let the functions heap_sync,
> heapam_methods.finish_bulk_insert and table_finish_bulk_insert left
> as-is.  As the result heapam_finish_bulk_insert becomes no-op.

heapam_finish_bulk_insert() is a static function, so I deleted it.

> begin_heap_rewrite is a public function but the last parameter is
> useless and rather harmful as it looks as if it works. So I removed
> the parameter.

Agreed.  Also, pgxn contains no references to begin_heap_rewrite().

> For 9.6, mdexists() creates the specified file while bootstrap mode
> and that leads to assertion failure of smgrDoPendingSyncs. So I made
> CreateStorage not register pending sync while bootstrap mode.
> gistbuild generates the LSN for root page of a newly created index
> using gistGetFakeLSN(heap), which fires assertion failure in
> gistGetFakeLSN.  I think we should use index instead of heap there,
> but it doesn't matter if we don't have the new pending sync mechanism,
> so I didn't split it as a separate patch.

v11 and v10, too, had the gistGetFakeLSN(heap) problem.  I saw that and other
problems by running the following on each branch:

  make check-world
  printf '%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 'max_wal_senders = 0' >/tmp/minimal.conf
  make check-world TEMP_CONFIG=/tmp/minimal.conf
  make -C doc  # catch breakage when XML changes don't work in SGML

> For 9.5, pg_visibility does not exist so I dropped the test for the
> module.

The test would have required further changes to work in v11 or earlier, so I
deleted the test.  It was a low-importance test.

> It lacks a part of TAP infrastructure nowadays we have, but I
> want to have the test (and it actually found a bug I made during this
> work). So I added a patch to back-patch TestLib.pm, PostgresNode.pm
> and RecursiveCopy.pm along with 018_wal_optimize.pl.
> (0004-Add-TAP-test-for-WAL-skipping-feature.patch)

That is a good idea.  Rather than make it specific to this test, I would like
to back-patch all applicable test files from 9.6 src/test/recovery.  I'll plan
to push that one part on Thursday.


Other notable changes:

- Like you suggested earlier, I moved restoration of old*Subid from
  index_create() back to ATExecAddIndex().  I decided to do this when I
  observed that pg_idx_advisor calls index_create().  That's not a strong
  reason, but it was enough to change a decision that had been arbitrary.

- Updated the wal_skip_threshold GUC category to WAL_SETTINGS, for consistency
  with the documentation move.  Added the GUC to postgresql.conf.sample.

- In released branches, I moved the new public struct fields to the end.  This
  reduces the number of extensions requiring a recompile.  From a grep of
  pgxn, one extension ("citus") relies on sizeof(RelationData), and nothing
  relies on sizeof(IndexStmt).

- In 9.6, I rewrote the mdimmedsync() changes so the function never ignores
  FileSync() failure.

Other observations:

- The new test file takes ~62s longer on 9.6 and 9.5, mostly due to commit
  c61559ec3 first appearing in v10.  I am fine with this.

- This is the most-demanding back-branch fix I've ever attempted.  Hopefully
  I've been smarter than usual while reviewing it, but that is unlikely.

Thanks,
nm

Вложения

skip-wal-v38nm.tar.gz

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

21 марта 2020 г., 22:01:27

On Sun, Mar 15, 2020 at 08:46:47PM -0700, Noah Misch wrote:
> On Wed, Mar 04, 2020 at 04:29:19PM +0900, Kyotaro Horiguchi wrote:
> > The attached is back-patches from 9.5 through master.
> 
> Thanks.  I've made some edits.  I'll plan to push the attached patches on
> Friday or Saturday.

Pushed, after adding a missing "break" to gist_identify() and tweaking two
more comments.  However, a diverse minority of buildfarm members are failing
like this, in most branches:

Mar 21 13:16:37 #   Failed test 'wal_level = minimal, SET TABLESPACE, hint bit'
Mar 21 13:16:37 #   at t/018_wal_optimize.pl line 231.
Mar 21 13:16:37 #          got: '1'
Mar 21 13:16:37 #     expected: '2'
Mar 21 13:16:46 # Looks like you failed 1 test of 34.
Mar 21 13:16:46 [13:16:46] t/018_wal_optimize.pl ................ 
  -- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2020-03-21%2016%3A52%3A05

Since I run two of the failing animals, I expect to reproduce this soon.

fairywren failed differently on 9.5; I have not yet studied it:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10

> > It lacks a part of TAP infrastructure nowadays we have, but I
> > want to have the test (and it actually found a bug I made during this
> > work). So I added a patch to back-patch TestLib.pm, PostgresNode.pm
> > and RecursiveCopy.pm along with 018_wal_optimize.pl.
> > (0004-Add-TAP-test-for-WAL-skipping-feature.patch)
> 
> That is a good idea.  Rather than make it specific to this test, I would like
> to back-patch all applicable test files from 9.6 src/test/recovery.  I'll plan
> to push that one part on Thursday.

That push did not cause failures.

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Bruce Momjian

Дата:

21 марта 2020 г., 22:04:50

Wow, this thread started in 2015.  :-O

    Date: Fri, 3 Jul 2015 00:05:24 +0200

---------------------------------------------------------------------------

On Sat, Mar 21, 2020 at 12:01:27PM -0700, Noah Misch wrote:
> On Sun, Mar 15, 2020 at 08:46:47PM -0700, Noah Misch wrote:
> > On Wed, Mar 04, 2020 at 04:29:19PM +0900, Kyotaro Horiguchi wrote:
> > > The attached is back-patches from 9.5 through master.
> > 
> > Thanks.  I've made some edits.  I'll plan to push the attached patches on
> > Friday or Saturday.
> 
> Pushed, after adding a missing "break" to gist_identify() and tweaking two
> more comments.  However, a diverse minority of buildfarm members are failing
> like this, in most branches:
> 
> Mar 21 13:16:37 #   Failed test 'wal_level = minimal, SET TABLESPACE, hint bit'
> Mar 21 13:16:37 #   at t/018_wal_optimize.pl line 231.
> Mar 21 13:16:37 #          got: '1'
> Mar 21 13:16:37 #     expected: '2'
> Mar 21 13:16:46 # Looks like you failed 1 test of 34.
> Mar 21 13:16:46 [13:16:46] t/018_wal_optimize.pl ................ 
>   -- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2020-03-21%2016%3A52%3A05
> 
> Since I run two of the failing animals, I expect to reproduce this soon.
> 
> fairywren failed differently on 9.5; I have not yet studied it:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10
> 
> > > It lacks a part of TAP infrastructure nowadays we have, but I
> > > want to have the test (and it actually found a bug I made during this
> > > work). So I added a patch to back-patch TestLib.pm, PostgresNode.pm
> > > and RecursiveCopy.pm along with 018_wal_optimize.pl.
> > > (0004-Add-TAP-test-for-WAL-skipping-feature.patch)
> > 
> > That is a good idea.  Rather than make it specific to this test, I would like
> > to back-patch all applicable test files from 9.6 src/test/recovery.  I'll plan
> > to push that one part on Thursday.
> 
> That push did not cause failures.
> 
> 

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

22 марта 2020 г., 01:49:20

On Sat, Mar 21, 2020 at 12:01:27PM -0700, Noah Misch wrote:
> Pushed, after adding a missing "break" to gist_identify() and tweaking two
> more comments.  However, a diverse minority of buildfarm members are failing
> like this, in most branches:
> 
> Mar 21 13:16:37 #   Failed test 'wal_level = minimal, SET TABLESPACE, hint bit'
> Mar 21 13:16:37 #   at t/018_wal_optimize.pl line 231.
> Mar 21 13:16:37 #          got: '1'
> Mar 21 13:16:37 #     expected: '2'
> Mar 21 13:16:46 # Looks like you failed 1 test of 34.
> Mar 21 13:16:46 [13:16:46] t/018_wal_optimize.pl ................ 
>   -- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2020-03-21%2016%3A52%3A05
> 
> Since I run two of the failing animals, I expect to reproduce this soon.

force_parallel_regress was the setting needed to reproduce this:

  printf '%s\n%s\n%s\n' 'log_statement = all' 'force_parallel_mode = regress' >/tmp/force_parallel.conf
  make -C src/test/recovery check PROVE_TESTS=t/018_wal_optimize.pl TEMP_CONFIG=/tmp/force_parallel.conf

The proximate cause is the RelFileNodeSkippingWAL() call that we added to
MarkBufferDirtyHint().  MarkBufferDirtyHint() runs in parallel workers, but
parallel workers have zeroes for pendingSyncHash and rd_*Subid.  I hacked up
the attached patch to understand the scope of the problem (not to commit).  It
logs a message whenever a parallel worker uses pendingSyncHash or
RelationNeedsWAL().  Some of the cases happen often enough to make logs huge,
so the patch suppresses logging for them.  You can see the lower-volume calls
like this:

  printf '%s\n%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 'max_wal_senders = 0' 'force_parallel_mode =
regress'>/tmp/minimal_parallel.conf

  make check-world TEMP_CONFIG=/tmp/minimal_parallel.conf
  find . -name log | xargs grep -rl 'nm0 invalid'

Not all are actual bugs.  For example, get_relation_info() behaves fine:

    /* Temporary and unlogged relations are inaccessible during recovery. */
    if (!RelationNeedsWAL(relation) && RecoveryInProgress())

Kyotaro, can you look through the affected code and propose a strategy for
good coexistence of parallel query with the WAL skipping mechanism?

Since I don't expect one strategy to win clearly and quickly, I plan to revert
the main patch around 2020-03-22 17:30 UTC.  That will give the patch about
twenty-four hours in the buildfarm, so more animals can report in.  I will
leave the three smaller patches in place.

> fairywren failed differently on 9.5; I have not yet studied it:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10

This did not remain specific to 9.5.  On platforms where SIZEOF_SIZE_T==4 or
SIZEOF_LONG==4, wal_skip_threshold cannot exceed 2GB.  A simple s/1TB/1GB/ in
the test should fix this.

Вложения

debug-parallel-skip-wal-v0.patch

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

23 марта 2020 г., 11:20:27

Thanks for the labour on this.

At Sat, 21 Mar 2020 15:49:20 -0700, Noah Misch <noah@leadboat.com> wrote in 
> On Sat, Mar 21, 2020 at 12:01:27PM -0700, Noah Misch wrote:
> > Pushed, after adding a missing "break" to gist_identify() and tweaking two
..
> The proximate cause is the RelFileNodeSkippingWAL() call that we added to
> MarkBufferDirtyHint().  MarkBufferDirtyHint() runs in parallel workers, but
> parallel workers have zeroes for pendingSyncHash and rd_*Subid.  I hacked up
> the attached patch to understand the scope of the problem (not to commit).  It
> logs a message whenever a parallel worker uses pendingSyncHash or
> RelationNeedsWAL().  Some of the cases happen often enough to make logs huge,
> so the patch suppresses logging for them.  You can see the lower-volume calls
> like this:
> 
>   printf '%s\n%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 'max_wal_senders = 0' 'force_parallel_mode =
regress'>/tmp/minimal_parallel.conf
 
>   make check-world TEMP_CONFIG=/tmp/minimal_parallel.conf
>   find . -name log | xargs grep -rl 'nm0 invalid'
> 
> Not all are actual bugs.  For example, get_relation_info() behaves fine:
> 
>     /* Temporary and unlogged relations are inaccessible during recovery. */
>     if (!RelationNeedsWAL(relation) && RecoveryInProgress())

But the relcache entry shows wrong information about new-ness of its
storage and it is the root cause of the all other problems.

> Kyotaro, can you look through the affected code and propose a strategy for
> good coexistence of parallel query with the WAL skipping mechanism?

Bi-directional communication between leader and workers is too-much.
It wouldn't be acceptable to inhibit the problematic operations on
workers such like heap-prune or btree pin removal.  If we do pending
syncs just before worker start, it won't fix the issue.

The attached patch passes a list of pending-sync relfilenodes at
worker start.  Workers create (immature) pending sync hash from the
list and create relcache entries using the hash. Given that parallel
workers don't perform transactional operations and DDL operations,
workers needs only the list of relfilenodes. The list might be long,
but I don't think it realistic that such many tables are truncated or
created then scanned in parallel within a transaction while wal_level
= minimal.

> Since I don't expect one strategy to win clearly and quickly, I plan to revert
> the main patch around 2020-03-22 17:30 UTC.  That will give the patch about
> twenty-four hours in the buildfarm, so more animals can report in.  I will
> leave the three smaller patches in place.

Thank you for your trouble and the check code. Sorry for not
responding in time.

> > fairywren failed differently on 9.5; I have not yet studied it:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10
> 
> This did not remain specific to 9.5.  On platforms where SIZEOF_SIZE_T==4 or
> SIZEOF_LONG==4, wal_skip_threshold cannot exceed 2GB.  A simple s/1TB/1GB/ in
> the test should fix this.

Oops. I felt that the 2TB looks too large but didn't get it
seriously. 1GB is 1048576 is less than the said limit 2097151 so the
attached second patch does that.

The attached is a proposal of fix of the issue on top of the reverted
commit.

- v36-0001-Skip-WAL-for-new-relfilenodes-under-wal_level-mi.patch
 The reverted patch.

- v36-0002-Fix-GUC-value-in-TAP-test.patch
 Change wal_skip_threashold to 2TB to 2GB in TAP test.

v36-0003-Fix-the-name-of-struct-pendingSyncs.patch
 I found that the struct of pending sync hash entry is named
 differently way from pending delete hash entry. Change it so that the
 two are in similarly naming convention.

v36-0004-Propagage-pending-sync-information-to-parallel-w.patch

 The proposed fix for the parallel-worker problem.

The make check-world above didn't fail with this patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 7aa0ba8fe7bccda5e24a8d25af925372aae8402f Mon Sep 17 00:00:00 2001
From: Noah Misch <noah@leadboat.com>
Date: Sat, 21 Mar 2020 09:38:26 -0700
Subject: [PATCH v36 1/4] Skip WAL for new relfilenodes, under
 wal_level=minimal.

Until now, only selected bulk operations (e.g. COPY) did this.  If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY.  See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules.  Maintainers of table access
methods should examine that section.

To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL.  A new GUC,
wal_skip_threshold, guides that choice.  If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold.  Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.

Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode.  Introduce rd_firstRelfilenodeSubid.  Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node.  Make relcache.c retain entries for certain
dropped relations until end of transaction.

Back-patch to 9.5 (all supported versions).  This introduces a new WAL
record type, XLOG_GIST_ASSIGN_LSN, without bumping XLOG_PAGE_MAGIC.  As
always, update standby systems before master systems.  This changes
sizeof(RelationData) and sizeof(IndexStmt), breaking binary
compatibility for affected extensions.  (The most recent commit to
affect the same class of extensions was
089e4d405d0f3b94c74a2c6a54357a84a681754b.)

Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas.  Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem.  Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs.  Reported by Martijn van Oosterhout.

Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
---
 .../pg_visibility/expected/pg_visibility.out  |  35 ++
 contrib/pg_visibility/sql/pg_visibility.sql   |  19 +
 doc/src/sgml/config.sgml                      |  39 +-
 doc/src/sgml/perform.sgml                     |  47 +--
 src/backend/access/gist/gistutil.c            |  31 +-
 src/backend/access/gist/gistxlog.c            |  21 +
 src/backend/access/heap/heapam.c              |  45 +--
 src/backend/access/heap/heapam_handler.c      |  22 +-
 src/backend/access/heap/rewriteheap.c         |  21 +-
 src/backend/access/nbtree/nbtsort.c           |  41 +-
 src/backend/access/rmgrdesc/gistdesc.c        |   6 +
 src/backend/access/transam/README             |  45 ++-
 src/backend/access/transam/xact.c             |  15 +
 src/backend/access/transam/xlogutils.c        |  18 +-
 src/backend/bootstrap/bootparse.y             |   4 +
 src/backend/catalog/storage.c                 | 246 +++++++++++-
 src/backend/commands/cluster.c                |  19 +
 src/backend/commands/copy.c                   |  58 +--
 src/backend/commands/createas.c               |  11 +-
 src/backend/commands/indexcmds.c              |   2 +
 src/backend/commands/matview.c                |  12 +-
 src/backend/commands/tablecmds.c              |  26 +-
 src/backend/nodes/copyfuncs.c                 |   2 +
 src/backend/nodes/equalfuncs.c                |   2 +
 src/backend/nodes/outfuncs.c                  |   2 +
 src/backend/parser/gram.y                     |   4 +
 src/backend/parser/parse_utilcmd.c            |   4 +
 src/backend/storage/buffer/bufmgr.c           | 125 +++++-
 src/backend/storage/lmgr/lock.c               |  12 +
 src/backend/storage/smgr/md.c                 |  36 +-
 src/backend/storage/smgr/smgr.c               |  35 ++
 src/backend/utils/cache/relcache.c            | 268 ++++++++++---
 src/backend/utils/misc/guc.c                  |  12 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/access/gist_private.h             |   2 +
 src/include/access/gistxlog.h                 |   1 +
 src/include/access/heapam.h                   |   3 -
 src/include/access/rewriteheap.h              |   2 +-
 src/include/access/tableam.h                  |  15 +-
 src/include/catalog/storage.h                 |   6 +
 src/include/nodes/parsenodes.h                |   3 +
 src/include/storage/bufmgr.h                  |   4 +
 src/include/storage/lock.h                    |   3 +
 src/include/storage/smgr.h                    |   1 +
 src/include/utils/rel.h                       |  55 ++-
 src/include/utils/relcache.h                  |   8 +-
 src/test/recovery/t/018_wal_optimize.pl       | 372 ++++++++++++++++++
 src/test/regress/expected/alter_table.out     |   6 +
 src/test/regress/expected/create_table.out    |  13 +
 src/test/regress/sql/alter_table.sql          |   7 +
 src/test/regress/sql/create_table.sql         |  15 +
 51 files changed, 1439 insertions(+), 363 deletions(-)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/contrib/pg_visibility/expected/pg_visibility.out b/contrib/pg_visibility/expected/pg_visibility.out
index f0dcb897c4..2abc1b5107 100644
--- a/contrib/pg_visibility/expected/pg_visibility.out
+++ b/contrib/pg_visibility/expected/pg_visibility.out
@@ -1,5 +1,40 @@
 CREATE EXTENSION pg_visibility;
 --
+-- recently-dropped table
+--
+\set VERBOSITY sqlstate
+BEGIN;
+CREATE TABLE droppedtest (c int);
+SELECT 'droppedtest'::regclass::oid AS oid \gset
+SAVEPOINT q; DROP TABLE droppedtest; RELEASE q;
+SAVEPOINT q; SELECT * FROM pg_visibility_map(:oid); ROLLBACK TO q;
+ERROR:  XX000
+-- ERROR:  could not open relation with OID 16xxx
+SAVEPOINT q; SELECT 1; ROLLBACK TO q;
+ ?column? 
+----------
+        1
+(1 row)
+
+SAVEPOINT q; SELECT 1; ROLLBACK TO q;
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT pg_relation_size(:oid), pg_relation_filepath(:oid),
+  has_table_privilege(:oid, 'SELECT');
+ pg_relation_size | pg_relation_filepath | has_table_privilege 
+------------------+----------------------+---------------------
+                  |                      | 
+(1 row)
+
+SELECT * FROM pg_visibility_map(:oid);
+ERROR:  XX000
+-- ERROR:  could not open relation with OID 16xxx
+ROLLBACK;
+\set VERBOSITY default
+--
 -- check that using the module's functions with unsupported relations will fail
 --
 -- partitioned tables (the parent ones) don't have visibility maps
diff --git a/contrib/pg_visibility/sql/pg_visibility.sql b/contrib/pg_visibility/sql/pg_visibility.sql
index c2a7f1d9e4..c78b90521b 100644
--- a/contrib/pg_visibility/sql/pg_visibility.sql
+++ b/contrib/pg_visibility/sql/pg_visibility.sql
@@ -1,5 +1,24 @@
 CREATE EXTENSION pg_visibility;
 
+--
+-- recently-dropped table
+--
+\set VERBOSITY sqlstate
+BEGIN;
+CREATE TABLE droppedtest (c int);
+SELECT 'droppedtest'::regclass::oid AS oid \gset
+SAVEPOINT q; DROP TABLE droppedtest; RELEASE q;
+SAVEPOINT q; SELECT * FROM pg_visibility_map(:oid); ROLLBACK TO q;
+-- ERROR:  could not open relation with OID 16xxx
+SAVEPOINT q; SELECT 1; ROLLBACK TO q;
+SAVEPOINT q; SELECT 1; ROLLBACK TO q;
+SELECT pg_relation_size(:oid), pg_relation_filepath(:oid),
+  has_table_privilege(:oid, 'SELECT');
+SELECT * FROM pg_visibility_map(:oid);
+-- ERROR:  could not open relation with OID 16xxx
+ROLLBACK;
+\set VERBOSITY default
+
 --
 -- check that using the module's functions with unsupported relations will fail
 --
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 70854ae298..9cc5281f01 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2501,16 +2501,19 @@ include_dir 'conf.d'
         levels.  This parameter can only be set at server start.
        </para>
        <para>
-        In <literal>minimal</literal> level, WAL-logging of some bulk
-        operations can be safely skipped, which can make those
-        operations much faster (see <xref linkend="populate-pitr"/>).
-        Operations in which this optimization can be applied include:
+        In <literal>minimal</literal> level, no information is logged for
+        permanent relations for the remainder of a transaction that creates or
+        rewrites them.  This can make operations much faster (see
+        <xref linkend="populate-pitr"/>).  Operations that initiate this
+        optimization include:
         <simplelist>
-         <member><command>CREATE TABLE AS</command></member>
-         <member><command>CREATE INDEX</command></member>
+         <member><command>ALTER ... SET TABLESPACE</command></member>
          <member><command>CLUSTER</command></member>
-         <member><command>COPY</command> into tables that were created or truncated in the same
-         transaction</member>
+         <member><command>CREATE TABLE</command></member>
+         <member><command>REFRESH MATERIALIZED VIEW</command>
+         (without <option>CONCURRENTLY</option>)</member>
+         <member><command>REINDEX</command></member>
+         <member><command>TRUNCATE</command></member>
         </simplelist>
         But minimal WAL does not contain enough information to reconstruct the
         data from a base backup and the WAL logs, so <literal>replica</literal> or
@@ -2907,6 +2910,26 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-skip-threshold" xreflabel="wal_skip_threshold">
+      <term><varname>wal_skip_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>wal_skip_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When <varname>wal_level</varname> is <literal>minimal</literal> and a
+        transaction commits after creating or rewriting a permanent relation,
+        this setting determines how to persist the new data.  If the data is
+        smaller than this setting, write it to the WAL log; otherwise, use an
+        fsync of affected files.  Depending on the properties of your storage,
+        raising or lowering this value might help if such commits are slowing
+        concurrent transactions.  The default is two megabytes
+        (<literal>2MB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-commit-delay" xreflabel="commit_delay">
       <term><varname>commit_delay</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index ab090441cf..58477ac83a 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -1607,8 +1607,8 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
     needs to be written, because in case of an error, the files
     containing the newly loaded data will be removed anyway.
     However, this consideration only applies when
-    <xref linkend="guc-wal-level"/> is <literal>minimal</literal> for
-    non-partitioned tables as all commands must write WAL otherwise.
+    <xref linkend="guc-wal-level"/> is <literal>minimal</literal>
+    as all commands must write WAL otherwise.
    </para>
 
   </sect2>
@@ -1708,42 +1708,13 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;
    </para>
 
    <para>
-    Aside from avoiding the time for the archiver or WAL sender to
-    process the WAL data,
-    doing this will actually make certain commands faster, because they
-    are designed not to write WAL at all if <varname>wal_level</varname>
-    is <literal>minimal</literal>.  (They can guarantee crash safety more cheaply
-    by doing an <function>fsync</function> at the end than by writing WAL.)
-    This applies to the following commands:
-    <itemizedlist>
-     <listitem>
-      <para>
-       <command>CREATE TABLE AS SELECT</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>CREATE INDEX</command> (and variants such as
-       <command>ALTER TABLE ADD PRIMARY KEY</command>)
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>ALTER TABLE SET TABLESPACE</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>CLUSTER</command>
-      </para>
-     </listitem>
-     <listitem>
-      <para>
-       <command>COPY FROM</command>, when the target table has been
-       created or truncated earlier in the same transaction
-      </para>
-     </listitem>
-    </itemizedlist>
+    Aside from avoiding the time for the archiver or WAL sender to process the
+    WAL data, doing this will actually make certain commands faster, because
+    they do not to write WAL at all if <varname>wal_level</varname>
+    is <literal>minimal</literal> and the current subtransaction (or top-level
+    transaction) created or truncated the table or index they change.  (They
+    can guarantee crash safety more cheaply by doing
+    an <function>fsync</function> at the end than by writing WAL.)
    </para>
   </sect2>
 
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index dd975b164c..765329bbcd 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1004,23 +1004,44 @@ gistproperty(Oid index_oid, int attno,
 }
 
 /*
- * Temporary and unlogged GiST indexes are not WAL-logged, but we need LSNs
- * to detect concurrent page splits anyway. This function provides a fake
- * sequence of LSNs for that purpose.
+ * Some indexes are not WAL-logged, but we need LSNs to detect concurrent page
+ * splits anyway. This function provides a fake sequence of LSNs for that
+ * purpose.
  */
 XLogRecPtr
 gistGetFakeLSN(Relation rel)
 {
-    static XLogRecPtr counter = FirstNormalUnloggedLSN;
-
     if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
     {
         /*
          * Temporary relations are only accessible in our session, so a simple
          * backend-local counter will do.
          */
+        static XLogRecPtr counter = FirstNormalUnloggedLSN;
+
         return counter++;
     }
+    else if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+    {
+        /*
+         * WAL-logging on this relation will start after commit, so its LSNs
+         * must be distinct numbers smaller than the LSN at the next commit.
+         * Emit a dummy WAL record if insert-LSN hasn't advanced after the
+         * last call.
+         */
+        static XLogRecPtr lastlsn = InvalidXLogRecPtr;
+        XLogRecPtr    currlsn = GetXLogInsertRecPtr();
+
+        /* Shouldn't be called for WAL-logging relations */
+        Assert(!RelationNeedsWAL(rel));
+
+        /* No need for an actual record if we already have a distinct LSN */
+        if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
+            currlsn = gistXLogAssignLSN();
+
+        lastlsn = currlsn;
+        return currlsn;
+    }
     else
     {
         /*
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index d3f3a7b803..b60dba052f 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -449,6 +449,9 @@ gist_redo(XLogReaderState *record)
         case XLOG_GIST_PAGE_DELETE:
             gistRedoPageDelete(record);
             break;
+        case XLOG_GIST_ASSIGN_LSN:
+            /* nop. See gistGetFakeLSN(). */
+            break;
         default:
             elog(PANIC, "gist_redo: unknown op code %u", info);
     }
@@ -592,6 +595,24 @@ gistXLogPageDelete(Buffer buffer, FullTransactionId xid,
     return recptr;
 }
 
+/*
+ * Write an empty XLOG record to assign a distinct LSN.
+ */
+XLogRecPtr
+gistXLogAssignLSN(void)
+{
+    int            dummy = 0;
+
+    /*
+     * Records other than SWITCH_WAL must have content. We use an integer 0 to
+     * follow the restriction.
+     */
+    XLogBeginInsert();
+    XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT);
+    XLogRegisterData((char *) &dummy, sizeof(dummy));
+    return XLogInsert(RM_GIST_ID, XLOG_GIST_ASSIGN_LSN);
+}
+
 /*
  * Write XLOG record about reuse of a deleted page.
  */
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 29694b8aa4..a25d539ec4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -21,7 +21,6 @@
  *        heap_multi_insert - insert multiple tuples into a relation
  *        heap_delete        - delete a tuple from a relation
  *        heap_update        - replace a tuple in a relation with another tuple
- *        heap_sync        - sync heap, for when no WAL has been written
  *
  * NOTES
  *      This file contains the heap_ routines which implement
@@ -1939,7 +1938,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (RelationNeedsWAL(relation))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2122,7 +2121,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
     /* currently not needed (thus unsupported) for heap_multi_insert() */
     AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+    needwal = RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -8920,46 +8919,6 @@ heap2_redo(XLogReaderState *record)
     }
 }
 
-/*
- *    heap_sync        - sync a heap, for use when no WAL has been written
- *
- * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
- *
- * Indexes are not touched.  (Currently, index operations associated with
- * the commands that use this are WAL-logged and so do not need fsync.
- * That behavior might change someday, but in any case it's likely that
- * any fsync decisions required would be per-index and hence not appropriate
- * to be done here.)
- */
-void
-heap_sync(Relation rel)
-{
-    /* non-WAL-logged tables never need fsync */
-    if (!RelationNeedsWAL(rel))
-        return;
-
-    /* main heap */
-    FlushRelationBuffers(rel);
-    /* FlushRelationBuffers will have opened rd_smgr */
-    smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
-
-    /* FSM is not critical, don't bother syncing it */
-
-    /* toast heap, if any */
-    if (OidIsValid(rel->rd_rel->reltoastrelid))
-    {
-        Relation    toastrel;
-
-        toastrel = table_open(rel->rd_rel->reltoastrelid, AccessShareLock);
-        FlushRelationBuffers(toastrel);
-        smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM);
-        table_close(toastrel, AccessShareLock);
-    }
-}
-
 /*
  * Mask a heap page before performing consistency checks on it.
  */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ca52846b97..56b35622f1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -555,17 +555,6 @@ tuple_lock_retry:
     return result;
 }
 
-static void
-heapam_finish_bulk_insert(Relation relation, int options)
-{
-    /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway / don't go through tableam)
-     */
-    if (options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(relation);
-}
-
 
 /* ------------------------------------------------------------------------
  * DDL related callbacks for heap AM.
@@ -698,7 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     IndexScanDesc indexScan;
     TableScanDesc tableScan;
     HeapScanDesc heapScan;
-    bool        use_wal;
     bool        is_system_catalog;
     Tuplesortstate *tuplesort;
     TupleDesc    oldTupDesc = RelationGetDescr(OldHeap);
@@ -713,12 +701,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
     is_system_catalog = IsSystemRelation(OldHeap);
 
     /*
-     * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a WAL-logged rel.
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
      */
-    use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
-    /* use_wal off requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
 
     /* Preallocate values/isnull arrays */
@@ -728,7 +713,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
     /* Initialize the rewrite operation */
     rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
-                                 *multi_cutoff, use_wal);
+                                 *multi_cutoff);
 
 
     /* Set up sorting if wanted */
@@ -2525,7 +2510,6 @@ static const TableAmRoutine heapam_methods = {
     .tuple_delete = heapam_tuple_delete,
     .tuple_update = heapam_tuple_update,
     .tuple_lock = heapam_tuple_lock,
-    .finish_bulk_insert = heapam_finish_bulk_insert,
 
     .tuple_fetch_row_version = heapam_fetch_row_version,
     .tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 9c29bc0e0f..39e33763df 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -136,7 +136,6 @@ typedef struct RewriteStateData
     Page        rs_buffer;        /* page currently being built */
     BlockNumber rs_blockno;        /* block where page will go */
     bool        rs_buffer_valid;    /* T if any tuples in buffer */
-    bool        rs_use_wal;        /* must we WAL-log inserts? */
     bool        rs_logical_rewrite; /* do we need to do logical rewriting */
     TransactionId rs_oldest_xmin;    /* oldest xmin used by caller to determine
                                      * tuple visibility */
@@ -230,15 +229,13 @@ static void logical_end_heap_rewrite(RewriteState state);
  * oldest_xmin    xid used by the caller to determine which tuples are dead
  * freeze_xid    xid before which tuples will be frozen
  * cutoff_multi    multixact before which multis will be removed
- * use_wal        should the inserts to the new heap be WAL-logged?
  *
  * Returns an opaque RewriteState, allocated in current memory context,
  * to be used in subsequent calls to the other functions.
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-                   TransactionId freeze_xid, MultiXactId cutoff_multi,
-                   bool use_wal)
+                   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
     RewriteState state;
     MemoryContext rw_cxt;
@@ -263,7 +260,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     /* new_heap needn't be empty, just locked */
     state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
     state->rs_buffer_valid = false;
-    state->rs_use_wal = use_wal;
     state->rs_oldest_xmin = oldest_xmin;
     state->rs_freeze_xid = freeze_xid;
     state->rs_cutoff_multi = cutoff_multi;
@@ -322,7 +318,7 @@ end_heap_rewrite(RewriteState state)
     /* Write the last page, if any */
     if (state->rs_buffer_valid)
     {
-        if (state->rs_use_wal)
+        if (RelationNeedsWAL(state->rs_new_rel))
             log_newpage(&state->rs_new_rel->rd_node,
                         MAIN_FORKNUM,
                         state->rs_blockno,
@@ -337,18 +333,14 @@ end_heap_rewrite(RewriteState state)
     }
 
     /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.
-     *
-     * It's obvious that we must do this when not WAL-logging. It's less
-     * obvious that we have to do it even if we did WAL-log the pages. The
+     * When we WAL-logged rel pages, we must nonetheless fsync them.  The
      * reason is the same as in storage.c's RelationCopyStorage(): we're
      * writing data that's not in shared buffers, and so a CHECKPOINT
      * occurring during the rewriteheap operation won't have fsync'd data we
      * wrote before the checkpoint.
      */
     if (RelationNeedsWAL(state->rs_new_rel))
-        heap_sync(state->rs_new_rel);
+        smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
 
     logical_end_heap_rewrite(state);
 
@@ -646,9 +638,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     {
         int            options = HEAP_INSERT_SKIP_FSM;
 
-        if (!state->rs_use_wal)
-            options |= HEAP_INSERT_SKIP_WAL;
-
         /*
          * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
          * for the TOAST table are not logically decoded.  The main heap is
@@ -687,7 +676,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
             /* Doesn't fit, so write out the existing page */
 
             /* XLOG stuff */
-            if (state->rs_use_wal)
+            if (RelationNeedsWAL(state->rs_new_rel))
                 log_newpage(&state->rs_new_rel->rd_node,
                             MAIN_FORKNUM,
                             state->rs_blockno,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index e66cd36dfa..9111e2789c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -31,18 +31,6 @@
  * them.  They will need to be re-read into shared buffers on first use after
  * the build finishes.
  *
- * Since the index will never be used unless it is completely built,
- * from a crash-recovery point of view there is no need to WAL-log the
- * steps of the build.  After completing the index build, we can just sync
- * the whole file to disk using smgrimmedsync() before exiting this module.
- * This can be seen to be sufficient for crash recovery by considering that
- * it's effectively equivalent to what would happen if a CHECKPOINT occurred
- * just after the index build.  However, it is clearly not sufficient if the
- * DBA is using the WAL log for PITR or replication purposes, since another
- * machine would not be able to reconstruct the index from WAL.  Therefore,
- * we log the completed index pages to WAL if and only if WAL archiving is
- * active.
- *
  * This code isn't concerned about the FSM at all. The caller is responsible
  * for initializing that.
  *
@@ -569,12 +557,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
     wstate.inskey = _bt_mkscankey(wstate.index, NULL);
     /* _bt_mkscankey() won't set allequalimage without metapage */
     wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
-
-    /*
-     * We need to log index creation in WAL iff WAL archiving/streaming is
-     * enabled UNLESS the index isn't WAL-logged anyway.
-     */
-    wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
+    wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
 
     /* reserve the metapage */
     wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
@@ -1424,21 +1407,15 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
     _bt_uppershutdown(wstate, state);
 
     /*
-     * If the index is WAL-logged, we must fsync it down to disk before it's
-     * safe to commit the transaction.  (For a non-WAL-logged index we don't
-     * care since the index will be uninteresting after a crash anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the build. It's
-     * less obvious that we have to do it even if we did WAL-log the index
-     * pages.  The reason is that since we're building outside shared buffers,
-     * a CHECKPOINT occurring during the build has no way to flush the
-     * previously written data to disk (indeed it won't know the index even
-     * exists).  A crash later on would replay WAL from the checkpoint,
-     * therefore it wouldn't replay our earlier WAL entries. If we do not
-     * fsync those pages here, they might still not be on disk when the crash
-     * occurs.
+     * When we WAL-logged index pages, we must nonetheless fsync index files.
+     * Since we're building outside shared buffers, a CHECKPOINT occurring
+     * during the build has no way to flush the previously written data to
+     * disk (indeed it won't know the index even exists).  A crash later on
+     * would replay WAL from the checkpoint, therefore it wouldn't replay our
+     * earlier WAL entries. If we do not fsync those pages here, they might
+     * still not be on disk when the crash occurs.
      */
-    if (RelationNeedsWAL(wstate->index))
+    if (wstate->btws_use_wal)
     {
         RelationOpenSmgr(wstate->index);
         smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index 3377367e12..de309fb122 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -80,6 +80,9 @@ gist_desc(StringInfo buf, XLogReaderState *record)
         case XLOG_GIST_PAGE_DELETE:
             out_gistxlogPageDelete(buf, (gistxlogPageDelete *) rec);
             break;
+        case XLOG_GIST_ASSIGN_LSN:
+            /* No details to write out */
+            break;
     }
 }
 
@@ -105,6 +108,9 @@ gist_identify(uint8 info)
         case XLOG_GIST_PAGE_DELETE:
             id = "PAGE_DELETE";
             break;
+        case XLOG_GIST_ASSIGN_LSN:
+            id = "ASSIGN_LSN";
+            break;
     }
 
     return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index b5a2cb2de8..eb9aac5fd3 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -717,6 +717,38 @@ then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
 
+Skipping WAL for New RelFileNode
+--------------------------------
+
+Under wal_level=minimal, if a change modifies a relfilenode that ROLLBACK
+would unlink, in-tree access methods write no WAL for that change.  Code that
+writes WAL without calling RelationNeedsWAL() must check for this case.  This
+skipping is mandatory.  If a WAL-writing change preceded a WAL-skipping change
+for the same block, REDO could overwrite the WAL-skipping change.  If a
+WAL-writing change followed a WAL-skipping change for the same block, a
+related problem would arise.  When a WAL record contains no full-page image,
+REDO expects the page to match its contents from just before record insertion.
+A WAL-skipping change may not reach disk at all, violating REDO's expectation
+under full_page_writes=off.  For any access method, CommitTransaction() writes
+and fsyncs affected blocks before recording the commit.
+
+Prefer to do the same in future access methods.  However, two other approaches
+can work.  First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync().  Second, an access method can opt to write WAL
+unconditionally for permanent relations.  Under these approaches, the access
+method callbacks must not call functions that react to RelationNeedsWAL().
+
+This applies only to WAL records whose replay would modify bytes stored in the
+new relfilenode.  It does not apply to other records about the relfilenode,
+such as XLOG_SMGR_CREATE.  Because it operates at the level of individual
+relfilenodes, RelationNeedsWAL() can differ for tightly-coupled relations.
+Consider "CREATE TABLE t (); BEGIN; ALTER TABLE t ADD c text; ..." in which
+ALTER TABLE adds a TOAST relation.  The TOAST relation will skip WAL, while
+the table owning it will not.  ALTER TABLE SET TABLESPACE will cause a table
+to skip WAL, but that won't affect its indexes.
+
+
 Asynchronous Commit
 -------------------
 
@@ -820,13 +852,12 @@ Changes to a temp table are not WAL-logged, hence could reach disk in
 advance of T1's commit, but we don't care since temp table contents don't
 survive crashes anyway.
 
-Database writes made via any of the paths we have introduced to avoid WAL
-overhead for bulk updates are also safe.  In these cases it's entirely
-possible for the data to reach disk before T1's commit, because T1 will
-fsync it down to disk without any sort of interlock, as soon as it finishes
-the bulk update.  However, all these paths are designed to write data that
-no other transaction can see until after T1 commits.  The situation is thus
-not different from ordinary WAL-logged updates.
+Database writes that skip WAL for new relfilenodes are also safe.  In these
+cases it's entirely possible for the data to reach disk before T1's commit,
+because T1 will fsync it down to disk without any sort of interlock.  However,
+all these paths are designed to write data that no other transaction can see
+until after T1 commits.  The situation is thus not different from ordinary
+WAL-logged updates.
 
 Transaction Emulation during Recovery
 -------------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e3c60f23cd..b6885b01bc 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2109,6 +2109,13 @@ CommitTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Synchronize files that are created and not WAL-logged during this
+     * transaction. This must happen before AtEOXact_RelationMap(), so that we
+     * don't see committed-but-broken files after a crash.
+     */
+    smgrDoPendingSyncs(true);
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
@@ -2342,6 +2349,13 @@ PrepareTransaction(void)
      */
     PreCommit_on_commit_actions();
 
+    /*
+     * Synchronize files that are created and not WAL-logged during this
+     * transaction. This must happen before EndPrepare(), so that we don't see
+     * committed-but-broken files after a crash and COMMIT PREPARED.
+     */
+    smgrDoPendingSyncs(true);
+
     /* close large objects before lower-level cleanup */
     AtEOXact_LargeObject(true);
 
@@ -2660,6 +2674,7 @@ AbortTransaction(void)
      */
     AfterTriggerEndXact(false); /* 'false' means it's abort */
     AtAbort_Portals();
+    smgrDoPendingSyncs(false);
     AtEOXact_LargeObject(false);
     AtAbort_Notify();
     AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index b217ffa52f..6cb143e161 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -549,6 +549,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry;
  * fields related to physical storage, like rd_rel, are initialized, so the
  * fake entry is only usable in low-level operations like ReadBuffer().
  *
+ * This is also used for syncing WAL-skipped files.
+ *
  * Caller must free the returned entry with FreeFakeRelcacheEntry().
  */
 Relation
@@ -557,18 +559,20 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
     FakeRelCacheEntry fakeentry;
     Relation    rel;
 
-    Assert(InRecovery);
-
     /* Allocate the Relation struct and all related space in one block. */
     fakeentry = palloc0(sizeof(FakeRelCacheEntryData));
     rel = (Relation) fakeentry;
 
     rel->rd_rel = &fakeentry->pgc;
     rel->rd_node = rnode;
-    /* We will never be working with temp rels during recovery */
+
+    /*
+     * We will never be working with temp rels during recovery or while
+     * syncing WAL-skipped files.
+     */
     rel->rd_backend = InvalidBackendId;
 
-    /* It must be a permanent table if we're in recovery. */
+    /* It must be a permanent table here */
     rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT;
 
     /* We don't know the name of the relation; use relfilenode instead */
@@ -577,9 +581,9 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
     /*
      * We set up the lockRelId in case anything tries to lock the dummy
      * relation.  Note that this is fairly bogus since relNode may be
-     * different from the relation's OID.  It shouldn't really matter though,
-     * since we are presumably running by ourselves and can't have any lock
-     * conflicts ...
+     * different from the relation's OID.  It shouldn't really matter though.
+     * In recovery, we are running by ourselves and can't have any lock
+     * conflicts.  While syncing, we already hold AccessExclusiveLock.
      */
     rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
     rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
diff --git a/src/backend/bootstrap/bootparse.y b/src/backend/bootstrap/bootparse.y
index 61e758696f..5eaca279ee 100644
--- a/src/backend/bootstrap/bootparse.y
+++ b/src/backend/bootstrap/bootparse.y
@@ -306,6 +306,8 @@ Boot_DeclareIndexStmt:
                     stmt->idxcomment = NULL;
                     stmt->indexOid = InvalidOid;
                     stmt->oldNode = InvalidOid;
+                    stmt->oldCreateSubid = InvalidSubTransactionId;
+                    stmt->oldFirstRelfilenodeSubid = InvalidSubTransactionId;
                     stmt->unique = false;
                     stmt->primary = false;
                     stmt->isconstraint = false;
@@ -356,6 +358,8 @@ Boot_DeclareUniqueIndexStmt:
                     stmt->idxcomment = NULL;
                     stmt->indexOid = InvalidOid;
                     stmt->oldNode = InvalidOid;
+                    stmt->oldCreateSubid = InvalidSubTransactionId;
+                    stmt->oldFirstRelfilenodeSubid = InvalidSubTransactionId;
                     stmt->unique = true;
                     stmt->primary = false;
                     stmt->isconstraint = false;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index fddfbf1d8c..0ed7c64a05 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -29,9 +29,13 @@
 #include "miscadmin.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+/* GUC variables */
+int            wal_skip_threshold = 2048;    /* in kilobytes */
+
 /*
  * We keep a list of all relations (represented as RelFileNode values)
  * that have been created or deleted in the current transaction.  When
@@ -61,7 +65,14 @@ typedef struct PendingRelDelete
     struct PendingRelDelete *next;    /* linked-list link */
 } PendingRelDelete;
 
+typedef struct pendingSync
+{
+    RelFileNode rnode;
+    bool        is_truncated;    /* Has the file experienced truncation? */
+} pendingSync;
+
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+HTAB       *pendingSyncHash = NULL;
 
 /*
  * RelationCreateStorage
@@ -117,6 +128,32 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     pending->next = pendingDeletes;
     pendingDeletes = pending;
 
+    /* Queue an at-commit sync. */
+    if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+    {
+        pendingSync *pending;
+        bool        found;
+
+        /* we sync only permanent relations */
+        Assert(backend == InvalidBackendId);
+
+        if (!pendingSyncHash)
+        {
+            HASHCTL        ctl;
+
+            ctl.keysize = sizeof(RelFileNode);
+            ctl.entrysize = sizeof(pendingSync);
+            ctl.hcxt = TopTransactionContext;
+            pendingSyncHash =
+                hash_create("pending sync hash",
+                            16, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+        }
+
+        pending = hash_search(pendingSyncHash, &rnode, HASH_ENTER, &found);
+        Assert(!found);
+        pending->is_truncated = false;
+    }
+
     return srel;
 }
 
@@ -275,6 +312,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
         }
     }
 
+    RelationPreTruncate(rel);
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -325,6 +364,28 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
         FreeSpaceMapVacuumRange(rel, nblocks, InvalidBlockNumber);
 }
 
+/*
+ * RelationPreTruncate
+ *        Perform AM-independent work before a physical truncation.
+ *
+ * If an access method's relation_nontransactional_truncate does not call
+ * RelationTruncate(), it must call this before decreasing the table size.
+ */
+void
+RelationPreTruncate(Relation rel)
+{
+    pendingSync *pending;
+
+    if (!pendingSyncHash)
+        return;
+    RelationOpenSmgr(rel);
+
+    pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
+                          HASH_FIND, NULL);
+    if (pending)
+        pending->is_truncated = true;
+}
+
 /*
  * Copy a fork's data, block by block.
  *
@@ -355,7 +416,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 
     /*
      * We need to log the copied data in WAL iff WAL archiving/streaming is
-     * enabled AND it's a permanent relation.
+     * enabled AND it's a permanent relation.  This gives the same answer as
+     * "RelationNeedsWAL(rel) || copying_initfork", because we know the
+     * current operation created a new relfilenode.
      */
     use_wal = XLogIsNeeded() &&
         (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
@@ -397,24 +460,39 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
     }
 
     /*
-     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
-     * to ensure that the toast table gets fsync'd too.  (For a temp or
-     * unlogged rel we don't care since the data will be gone after a crash
-     * anyway.)
-     *
-     * It's obvious that we must do this when not WAL-logging the copy. It's
-     * less obvious that we have to do it even if we did WAL-log the copied
-     * pages. The reason is that since we're copying outside shared buffers, a
-     * CHECKPOINT occurring during the copy has no way to flush the previously
-     * written data to disk (indeed it won't know the new rel even exists).  A
-     * crash later on would replay WAL from the checkpoint, therefore it
-     * wouldn't replay our earlier WAL entries. If we do not fsync those pages
-     * here, they might still not be on disk when the crash occurs.
+     * When we WAL-logged rel pages, we must nonetheless fsync them.  The
+     * reason is that since we're copying outside shared buffers, a CHECKPOINT
+     * occurring during the copy has no way to flush the previously written
+     * data to disk (indeed it won't know the new rel even exists).  A crash
+     * later on would replay WAL from the checkpoint, therefore it wouldn't
+     * replay our earlier WAL entries. If we do not fsync those pages here,
+     * they might still not be on disk when the crash occurs.
      */
-    if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
+    if (use_wal || copying_initfork)
         smgrimmedsync(dst, forkNum);
 }
 
+/*
+ * RelFileNodeSkippingWAL - check if a BM_PERMANENT relfilenode is using WAL
+ *
+ *   Changes of certain relfilenodes must not write WAL; see "Skipping WAL for
+ *   New RelFileNode" in src/backend/access/transam/README.  Though it is
+ *   known from Relation efficiently, this function is intended for the code
+ *   paths not having access to Relation.
+ */
+bool
+RelFileNodeSkippingWAL(RelFileNode rnode)
+{
+    if (XLogIsNeeded())
+        return false;            /* no permanent relfilenode skips WAL */
+
+    if (!pendingSyncHash ||
+        hash_search(pendingSyncHash, &rnode, HASH_FIND, NULL) == NULL)
+        return false;
+
+    return true;
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
@@ -492,6 +570,144 @@ smgrDoPendingDeletes(bool isCommit)
     }
 }
 
+/*
+ *    smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+    PendingRelDelete *pending;
+    int            nrels = 0,
+                maxrels = 0;
+    SMgrRelation *srels = NULL;
+    HASH_SEQ_STATUS scan;
+    pendingSync *pendingsync;
+
+    if (XLogIsNeeded())
+        return;                    /* no relation can use this */
+
+    Assert(GetCurrentTransactionNestLevel() == 1);
+
+    if (!pendingSyncHash)
+        return;                    /* no relation needs sync */
+
+    /* Just throw away all pending syncs if any at rollback */
+    if (!isCommit)
+    {
+        pendingSyncHash = NULL;
+        return;
+    }
+
+    AssertPendingSyncs_RelationCache();
+
+    /* Skip syncing nodes that smgrDoPendingDeletes() will delete. */
+    for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+    {
+        if (!pending->atCommit)
+            continue;
+
+        (void) hash_search(pendingSyncHash, (void *) &pending->relnode,
+                           HASH_REMOVE, NULL);
+    }
+
+    hash_seq_init(&scan, pendingSyncHash);
+    while ((pendingsync = (pendingSync *) hash_seq_search(&scan)))
+    {
+        ForkNumber    fork;
+        BlockNumber nblocks[MAX_FORKNUM + 1];
+        BlockNumber total_blocks = 0;
+        SMgrRelation srel;
+
+        srel = smgropen(pendingsync->rnode, InvalidBackendId);
+
+        /*
+         * We emit newpage WAL records for smaller relations.
+         *
+         * Small WAL records have a chance to be emitted along with other
+         * backends' WAL records.  We emit WAL records instead of syncing for
+         * files that are smaller than a certain threshold, expecting faster
+         * commit.  The threshold is defined by the GUC wal_skip_threshold.
+         */
+        if (!pendingsync->is_truncated)
+        {
+            for (fork = 0; fork <= MAX_FORKNUM; fork++)
+            {
+                if (smgrexists(srel, fork))
+                {
+                    BlockNumber n = smgrnblocks(srel, fork);
+
+                    /* we shouldn't come here for unlogged relations */
+                    Assert(fork != INIT_FORKNUM);
+                    nblocks[fork] = n;
+                    total_blocks += n;
+                }
+                else
+                    nblocks[fork] = InvalidBlockNumber;
+            }
+        }
+
+        /*
+         * Sync file or emit WAL records for its contents.
+         *
+         * Although we emit WAL record if the file is small enough, do file
+         * sync regardless of the size if the file has experienced a
+         * truncation. It is because the file would be followed by trailing
+         * garbage blocks after a crash recovery if, while a past longer file
+         * had been flushed out, we omitted syncing-out of the file and
+         * emitted WAL instead.  You might think that we could choose WAL if
+         * the current main fork is longer than ever, but there's a case where
+         * main fork is longer than ever but FSM fork gets shorter.
+         */
+        if (pendingsync->is_truncated ||
+            total_blocks * BLCKSZ / 1024 >= wal_skip_threshold)
+        {
+            /* allocate the initial array, or extend it, if needed */
+            if (maxrels == 0)
+            {
+                maxrels = 8;
+                srels = palloc(sizeof(SMgrRelation) * maxrels);
+            }
+            else if (maxrels <= nrels)
+            {
+                maxrels *= 2;
+                srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+            }
+
+            srels[nrels++] = srel;
+        }
+        else
+        {
+            /* Emit WAL records for all blocks.  The file is small enough. */
+            for (fork = 0; fork <= MAX_FORKNUM; fork++)
+            {
+                int            n = nblocks[fork];
+                Relation    rel;
+
+                if (!BlockNumberIsValid(n))
+                    continue;
+
+                /*
+                 * Emit WAL for the whole file.  Unfortunately we don't know
+                 * what kind of a page this is, so we have to log the full
+                 * page including any unused space.  ReadBufferExtended()
+                 * counts some pgstat events; unfortunately, we discard them.
+                 */
+                rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node);
+                log_newpage_range(rel, fork, 0, n, false);
+                FreeFakeRelcacheEntry(rel);
+            }
+        }
+    }
+
+    pendingSyncHash = NULL;
+
+    if (nrels > 0)
+    {
+        smgrdosyncall(srels, nrels);
+        pfree(srels);
+    }
+}
+
 /*
  * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
  *
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index fc1cea0236..ccd0c9b286 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1110,6 +1110,25 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
         *mapped_tables++ = r2;
     }
 
+    /*
+     * Recognize that rel1's relfilenode (swapped from rel2) is new in this
+     * subtransaction. The rel2 storage (swapped from rel1) may or may not be
+     * new.
+     */
+    {
+        Relation    rel1,
+                    rel2;
+
+        rel1 = relation_open(r1, NoLock);
+        rel2 = relation_open(r2, NoLock);
+        rel2->rd_createSubid = rel1->rd_createSubid;
+        rel2->rd_newRelfilenodeSubid = rel1->rd_newRelfilenodeSubid;
+        rel2->rd_firstRelfilenodeSubid = rel1->rd_firstRelfilenodeSubid;
+        RelationAssumeNewRelfilenode(rel1);
+        relation_close(rel1, NoLock);
+        relation_close(rel2, NoLock);
+    }
+
     /*
      * In the case of a shared catalog, these next few steps will only affect
      * our own database's pg_class row; but that's okay, because they are all
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index fbde9f88e7..05f1fae6b0 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2713,63 +2713,15 @@ CopyFrom(CopyState cstate)
                             RelationGetRelationName(cstate->rel))));
     }
 
-    /*----------
-     * Check to see if we can avoid writing WAL
-     *
-     * If archive logging/streaming is not enabled *and* either
-     *    - table was created in same transaction as this COPY
-     *    - data is being written to relfilenode created in this transaction
-     * then we can skip writing WAL.  It's safe because if the transaction
-     * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the table_finish_bulk_insert() at
-     * the bottom of this routine first.
-     *
-     * As mentioned in comments in utils/rel.h, the in-same-transaction test
-     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
-     * can be cleared before the end of the transaction. The exact case is
-     * when a relation sets a new relfilenode twice in same transaction, yet
-     * the second one fails in an aborted subtransaction, e.g.
-     *
-     * BEGIN;
-     * TRUNCATE t;
-     * SAVEPOINT save;
-     * TRUNCATE t;
-     * ROLLBACK TO save;
-     * COPY ...
-     *
-     * Also, if the target file is new-in-transaction, we assume that checking
-     * FSM for free space is a waste of time, even if we must use WAL because
-     * of archiving.  This could possibly be wrong, but it's unlikely.
-     *
-     * The comments for table_tuple_insert and RelationGetBufferForTuple
-     * specify that skipping WAL logging is only safe if we ensure that our
-     * tuples do not go into pages containing tuples from any other
-     * transactions --- but this must be the case if we have a new table or
-     * new relfilenode, so we need no additional work to enforce that.
-     *
-     * We currently don't support this optimization if the COPY target is a
-     * partitioned table as we currently only lazily initialize partition
-     * information when routing the first tuple to the partition.  We cannot
-     * know at this stage if we can perform this optimization.  It should be
-     * possible to improve on this, but it does mean maintaining heap insert
-     * option flags per partition and setting them when we first open the
-     * partition.
-     *
-     * This optimization is not supported for relation types which do not
-     * have any physical storage, with foreign tables and views using
-     * INSTEAD OF triggers entering in this category.  Partitioned tables
-     * are not supported as per the description above.
-     *----------
+    /*
+     * If the target file is new-in-transaction, we assume that checking FSM
+     * for free space is a waste of time.  This could possibly be wrong, but
+     * it's unlikely.
      */
-    /* createSubid is creation check, newRelfilenodeSubid is truncation check */
     if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
         (cstate->rel->rd_createSubid != InvalidSubTransactionId ||
-         cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
-    {
+         cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
         ti_options |= TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
-    }
 
     /*
      * Optimize if new relfilenode was created in this subxact or one of its
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 3a5676fb39..8e5e4fb95e 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -550,16 +550,13 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
     myState->rel = intoRelationDesc;
     myState->reladdr = intoRelationAddr;
     myState->output_cid = GetCurrentCommandId(true);
+    myState->ti_options = TABLE_INSERT_SKIP_FSM;
+    myState->bistate = GetBulkInsertState();
 
     /*
-     * We can skip WAL-logging the insertions, unless PITR or streaming
-     * replication is in use. We can skip the FSM in any case.
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
      */
-    myState->ti_options = TABLE_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
-    myState->bistate = GetBulkInsertState();
-
-    /* Not using WAL requires smgr_targblock be initially invalid */
     Assert(RelationGetTargetBlock(intoRelationDesc) == InvalidBlockNumber);
 }
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4e8263af4b..c94939df40 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1192,6 +1192,8 @@ DefineIndex(Oid relationId,
                     childStmt->relation = NULL;
                     childStmt->indexOid = InvalidOid;
                     childStmt->oldNode = InvalidOid;
+                    childStmt->oldCreateSubid = InvalidSubTransactionId;
+                    childStmt->oldFirstRelfilenodeSubid = InvalidSubTransactionId;
 
                     /*
                      * Adjust any Vars (both in expressions and in the index's
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index c3954f3e24..492b2a3ee6 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -457,17 +457,13 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->transientrel = transientrel;
     myState->output_cid = GetCurrentCommandId(true);
-
-    /*
-     * We can skip WAL-logging the insertions, unless PITR or streaming
-     * replication is in use. We can skip the FSM in any case.
-     */
     myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
-    if (!XLogIsNeeded())
-        myState->ti_options |= TABLE_INSERT_SKIP_WAL;
     myState->bistate = GetBulkInsertState();
 
-    /* Not using WAL requires smgr_targblock be initially invalid */
+    /*
+     * Valid smgr_targblock implies something already wrote to the relation.
+     * This may be harmless, but this function hasn't planned for it.
+     */
     Assert(RelationGetTargetBlock(transientrel) == InvalidBlockNumber);
 }
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 729025470d..31d718e8ea 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5041,19 +5041,14 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         newrel = NULL;
 
     /*
-     * Prepare a BulkInsertState and options for table_tuple_insert. Because
-     * we're building a new heap, we can skip WAL-logging and fsync it to disk
-     * at the end instead (unless WAL-logging is required for archiving or
-     * streaming replication). The FSM is empty too, so don't bother using it.
+     * Prepare a BulkInsertState and options for table_tuple_insert.  The FSM
+     * is empty, so don't bother using it.
      */
     if (newrel)
     {
         mycid = GetCurrentCommandId(true);
         bistate = GetBulkInsertState();
-
         ti_options = TABLE_INSERT_SKIP_FSM;
-        if (!XLogIsNeeded())
-            ti_options |= TABLE_INSERT_SKIP_WAL;
     }
     else
     {
@@ -7719,14 +7714,19 @@ ATExecAddIndex(AlteredTableInfo *tab, Relation rel,
 
     /*
      * If TryReuseIndex() stashed a relfilenode for us, we used it for the new
-     * index instead of building from scratch.  The DROP of the old edition of
-     * this index will have scheduled the storage for deletion at commit, so
-     * cancel that pending deletion.
+     * index instead of building from scratch.  Restore associated fields.
+     * This may store InvalidSubTransactionId in both fields, in which case
+     * relcache.c will assume it can rebuild the relcache entry.  Hence, do
+     * this after the CCI that made catalog rows visible to any rebuild.  The
+     * DROP of the old edition of this index will have scheduled the storage
+     * for deletion at commit, so cancel that pending deletion.
      */
     if (OidIsValid(stmt->oldNode))
     {
         Relation    irel = index_open(address.objectId, NoLock);
 
+        irel->rd_createSubid = stmt->oldCreateSubid;
+        irel->rd_firstRelfilenodeSubid = stmt->oldFirstRelfilenodeSubid;
         RelationPreserveStorage(irel->rd_node, true);
         index_close(irel, NoLock);
     }
@@ -12052,7 +12052,11 @@ TryReuseIndex(Oid oldId, IndexStmt *stmt)
 
         /* If it's a partitioned index, there is no storage to share. */
         if (irel->rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
+        {
             stmt->oldNode = irel->rd_node.relNode;
+            stmt->oldCreateSubid = irel->rd_createSubid;
+            stmt->oldFirstRelfilenodeSubid = irel->rd_firstRelfilenodeSubid;
+        }
         index_close(irel, NoLock);
     }
 }
@@ -12988,6 +12992,8 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 
     table_close(pg_class, RowExclusiveLock);
 
+    RelationAssumeNewRelfilenode(rel);
+
     relation_close(rel, NoLock);
 
     /* Make sure the reltablespace change is visible */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..1a70625dc8 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -3478,6 +3478,8 @@ _copyIndexStmt(const IndexStmt *from)
     COPY_STRING_FIELD(idxcomment);
     COPY_SCALAR_FIELD(indexOid);
     COPY_SCALAR_FIELD(oldNode);
+    COPY_SCALAR_FIELD(oldCreateSubid);
+    COPY_SCALAR_FIELD(oldFirstRelfilenodeSubid);
     COPY_SCALAR_FIELD(unique);
     COPY_SCALAR_FIELD(primary);
     COPY_SCALAR_FIELD(isconstraint);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 88b912977e..2256859dc3 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1345,6 +1345,8 @@ _equalIndexStmt(const IndexStmt *a, const IndexStmt *b)
     COMPARE_STRING_FIELD(idxcomment);
     COMPARE_SCALAR_FIELD(indexOid);
     COMPARE_SCALAR_FIELD(oldNode);
+    COMPARE_SCALAR_FIELD(oldCreateSubid);
+    COMPARE_SCALAR_FIELD(oldFirstRelfilenodeSubid);
     COMPARE_SCALAR_FIELD(unique);
     COMPARE_SCALAR_FIELD(primary);
     COMPARE_SCALAR_FIELD(isconstraint);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..89d00444ed 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2653,6 +2653,8 @@ _outIndexStmt(StringInfo str, const IndexStmt *node)
     WRITE_STRING_FIELD(idxcomment);
     WRITE_OID_FIELD(indexOid);
     WRITE_OID_FIELD(oldNode);
+    WRITE_UINT_FIELD(oldCreateSubid);
+    WRITE_UINT_FIELD(oldFirstRelfilenodeSubid);
     WRITE_BOOL_FIELD(unique);
     WRITE_BOOL_FIELD(primary);
     WRITE_BOOL_FIELD(isconstraint);
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 7e384f956c..13f3755345 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -7415,6 +7415,8 @@ IndexStmt:    CREATE opt_unique INDEX opt_concurrently opt_index_name
                     n->idxcomment = NULL;
                     n->indexOid = InvalidOid;
                     n->oldNode = InvalidOid;
+                    n->oldCreateSubid = InvalidSubTransactionId;
+                    n->oldFirstRelfilenodeSubid = InvalidSubTransactionId;
                     n->primary = false;
                     n->isconstraint = false;
                     n->deferrable = false;
@@ -7443,6 +7445,8 @@ IndexStmt:    CREATE opt_unique INDEX opt_concurrently opt_index_name
                     n->idxcomment = NULL;
                     n->indexOid = InvalidOid;
                     n->oldNode = InvalidOid;
+                    n->oldCreateSubid = InvalidSubTransactionId;
+                    n->oldFirstRelfilenodeSubid = InvalidSubTransactionId;
                     n->primary = false;
                     n->isconstraint = false;
                     n->deferrable = false;
diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index c1911411d0..6a27c35e3b 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -1415,6 +1415,8 @@ generateClonedIndexStmt(RangeVar *heapRel, Relation source_idx,
     index->idxcomment = NULL;
     index->indexOid = InvalidOid;
     index->oldNode = InvalidOid;
+    index->oldCreateSubid = InvalidSubTransactionId;
+    index->oldFirstRelfilenodeSubid = InvalidSubTransactionId;
     index->unique = idxrec->indisunique;
     index->primary = idxrec->indisprimary;
     index->transformed = true;    /* don't need transformIndexStmt */
@@ -2013,6 +2015,8 @@ transformIndexConstraint(Constraint *constraint, CreateStmtContext *cxt)
     index->idxcomment = NULL;
     index->indexOid = InvalidOid;
     index->oldNode = InvalidOid;
+    index->oldCreateSubid = InvalidSubTransactionId;
+    index->oldFirstRelfilenodeSubid = InvalidSubTransactionId;
     index->transformed = false;
     index->concurrent = false;
     index->if_not_exists = false;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e05e2b3456..4f60979ce5 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -66,7 +66,7 @@
 #define BUF_WRITTEN                0x01
 #define BUF_REUSABLE            0x02
 
-#define DROP_RELS_BSEARCH_THRESHOLD        20
+#define RELS_BSEARCH_THRESHOLD        20
 
 typedef struct PrivateRefCountEntry
 {
@@ -105,6 +105,19 @@ typedef struct CkptTsStatus
     int            index;
 } CkptTsStatus;
 
+/*
+ * Type for array used to sort SMgrRelations
+ *
+ * FlushRelationsAllBuffers shares the same comparator function with
+ * DropRelFileNodesAllBuffers. Pointer to this struct and RelFileNode must be
+ * compatible.
+ */
+typedef struct SMgrSortArray
+{
+    RelFileNode rnode;            /* This must be the first member */
+    SMgrRelation srel;
+} SMgrSortArray;
+
 /* GUC variables */
 bool        zero_damaged_pages = false;
 int            bgwriter_lru_maxpages = 100;
@@ -2990,7 +3003,7 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
      * an exactly determined value, as it depends on many factors (CPU and RAM
      * speeds, amount of shared buffers etc.).
      */
-    use_bsearch = n > DROP_RELS_BSEARCH_THRESHOLD;
+    use_bsearch = n > RELS_BSEARCH_THRESHOLD;
 
     /* sort the list of rnodes if necessary */
     if (use_bsearch)
@@ -3240,6 +3253,104 @@ FlushRelationBuffers(Relation rel)
     }
 }
 
+/* ---------------------------------------------------------------------
+ *        FlushRelationsAllBuffers
+ *
+ *        This function flushes out of the buffer pool all the pages of all
+ *        forks of the specified smgr relations.  It's equivalent to calling
+ *        FlushRelationBuffers once per fork per relation.  The relations are
+ *        assumed not to use local buffers.
+ * --------------------------------------------------------------------
+ */
+void
+FlushRelationsAllBuffers(SMgrRelation *smgrs, int nrels)
+{
+    int            i;
+    SMgrSortArray *srels;
+    bool        use_bsearch;
+
+    if (nrels == 0)
+        return;
+
+    /* fill-in array for qsort */
+    srels = palloc(sizeof(SMgrSortArray) * nrels);
+
+    for (i = 0; i < nrels; i++)
+    {
+        Assert(!RelFileNodeBackendIsTemp(smgrs[i]->smgr_rnode));
+
+        srels[i].rnode = smgrs[i]->smgr_rnode.node;
+        srels[i].srel = smgrs[i];
+    }
+
+    /*
+     * Save the bsearch overhead for low number of relations to sync. See
+     * DropRelFileNodesAllBuffers for details.
+     */
+    use_bsearch = nrels > RELS_BSEARCH_THRESHOLD;
+
+    /* sort the list of SMgrRelations if necessary */
+    if (use_bsearch)
+        pg_qsort(srels, nrels, sizeof(SMgrSortArray), rnode_comparator);
+
+    /* Make sure we can handle the pin inside the loop */
+    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+    for (i = 0; i < NBuffers; i++)
+    {
+        SMgrSortArray *srelent = NULL;
+        BufferDesc *bufHdr = GetBufferDescriptor(i);
+        uint32        buf_state;
+
+        /*
+         * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
+         * and saves some cycles.
+         */
+
+        if (!use_bsearch)
+        {
+            int            j;
+
+            for (j = 0; j < nrels; j++)
+            {
+                if (RelFileNodeEquals(bufHdr->tag.rnode, srels[j].rnode))
+                {
+                    srelent = &srels[j];
+                    break;
+                }
+            }
+
+        }
+        else
+        {
+            srelent = bsearch((const void *) &(bufHdr->tag.rnode),
+                              srels, nrels, sizeof(SMgrSortArray),
+                              rnode_comparator);
+        }
+
+        /* buffer doesn't belong to any of the given relfilenodes; skip it */
+        if (srelent == NULL)
+            continue;
+
+        ReservePrivateRefCountEntry();
+
+        buf_state = LockBufHdr(bufHdr);
+        if (RelFileNodeEquals(bufHdr->tag.rnode, srelent->rnode) &&
+            (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+        {
+            PinBuffer_Locked(bufHdr);
+            LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+            FlushBuffer(bufHdr, srelent->srel);
+            LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+            UnpinBuffer(bufHdr, true);
+        }
+        else
+            UnlockBufHdr(bufHdr, buf_state);
+    }
+
+    pfree(srels);
+}
+
 /* ---------------------------------------------------------------------
  *        FlushDatabaseBuffers
  *
@@ -3441,13 +3552,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
             (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT))
         {
             /*
-             * If we're in recovery we cannot dirty a page because of a hint.
-             * We can set the hint, just not dirty the page as a result so the
-             * hint is lost when we evict the page or shutdown.
+             * If we must not write WAL, due to a relfilenode-specific
+             * condition or being in recovery, don't dirty the page.  We can
+             * set the hint, just not dirty the page as a result so the hint
+             * is lost when we evict the page or shutdown.
              *
              * See src/backend/storage/page/README for longer discussion.
              */
-            if (RecoveryInProgress())
+            if (RecoveryInProgress() ||
+                RelFileNodeSkippingWAL(bufHdr->tag.rnode))
                 return;
 
             /*
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 3013ef63d0..efb44a25c4 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -614,6 +614,18 @@ LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode)
     return (locallock && locallock->nLocks > 0);
 }
 
+#ifdef USE_ASSERT_CHECKING
+/*
+ * GetLockMethodLocalHash -- return the hash of local locks, for modules that
+ *        evaluate assertions based on all locks held.
+ */
+HTAB *
+GetLockMethodLocalHash(void)
+{
+    return LockMethodLocalHash;
+}
+#endif
+
 /*
  * LockHasWaiters -- look up 'locktag' and check if releasing this
  *        lock would wake up other processes waiting for it.
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index c5b771c531..ee9822c6e1 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -248,11 +248,10 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
  * During replay, we would delete the file and then recreate it, which is fine
  * if the contents of the file were repopulated by subsequent WAL entries.
  * But if we didn't WAL-log insertions, but instead relied on fsyncing the
- * file after populating it (as for instance CLUSTER and CREATE INDEX do),
- * the contents of the file would be lost forever.  By leaving the empty file
- * until after the next checkpoint, we prevent reassignment of the relfilenode
- * number until it's safe, because relfilenode assignment skips over any
- * existing file.
+ * file after populating it (as we do at wal_level=minimal), the contents of
+ * the file would be lost forever.  By leaving the empty file until after the
+ * next checkpoint, we prevent reassignment of the relfilenode number until
+ * it's safe, because relfilenode assignment skips over any existing file.
  *
  * We do not need to go through this dance for temp relations, though, because
  * we never make WAL entries for temp rels, and so a temp rel poses no threat
@@ -877,12 +876,18 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
  *    mdimmedsync() -- Immediately sync a relation to stable storage.
  *
  * Note that only writes already issued are synced; this routine knows
- * nothing of dirty buffers that may exist inside the buffer manager.
+ * nothing of dirty buffers that may exist inside the buffer manager.  We
+ * sync active and inactive segments; smgrDoPendingSyncs() relies on this.
+ * Consider a relation skipping WAL.  Suppose a checkpoint syncs blocks of
+ * some segment, then mdtruncate() renders that segment inactive.  If we
+ * crash before the next checkpoint syncs the newly-inactive segment, that
+ * segment may survive recovery, reintroducing unwanted data into the table.
  */
 void
 mdimmedsync(SMgrRelation reln, ForkNumber forknum)
 {
     int            segno;
+    int            min_inactive_seg;
 
     /*
      * NOTE: mdnblocks makes sure we have opened all active segments, so that
@@ -890,7 +895,16 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
      */
     mdnblocks(reln, forknum);
 
-    segno = reln->md_num_open_segs[forknum];
+    min_inactive_seg = segno = reln->md_num_open_segs[forknum];
+
+    /*
+     * Temporarily open inactive segments, then close them after sync.  There
+     * may be some inactive segments left opened after fsync() error, but that
+     * is harmless.  We don't bother to clean them up and take a risk of
+     * further trouble.  The next mdclose() will soon close them.
+     */
+    while (_mdfd_openseg(reln, forknum, segno, 0) != NULL)
+        segno++;
 
     while (segno > 0)
     {
@@ -901,6 +915,14 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
                     (errcode_for_file_access(),
                      errmsg("could not fsync file \"%s\": %m",
                             FilePathName(v->mdfd_vfd))));
+
+        /* Close inactive segments immediately */
+        if (segno > min_inactive_seg)
+        {
+            FileClose(v->mdfd_vfd);
+            _fdvec_resize(reln, forknum, segno - 1);
+        }
+
         segno--;
     }
 }
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 360b5bf5bf..72c9696ad1 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -388,6 +388,41 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
     smgrsw[which].smgr_unlink(rnode, InvalidForkNumber, isRedo);
 }
 
+/*
+ *    smgrdosyncall() -- Immediately sync all forks of all given relations
+ *
+ *        All forks of all given relations are synced out to the store.
+ *
+ *        This is equivalent to FlushRelationBuffers() for each smgr relation,
+ *        then calling smgrimmedsync() for all forks of each relation, but it's
+ *        significantly quicker so should be preferred when possible.
+ */
+void
+smgrdosyncall(SMgrRelation *rels, int nrels)
+{
+    int            i = 0;
+    ForkNumber    forknum;
+
+    if (nrels == 0)
+        return;
+
+    FlushRelationsAllBuffers(rels, nrels);
+
+    /*
+     * Sync the physical file(s).
+     */
+    for (i = 0; i < nrels; i++)
+    {
+        int            which = rels[i]->smgr_which;
+
+        for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+        {
+            if (smgrsw[which].smgr_exists(rels[i], forknum))
+                smgrsw[which].smgr_immedsync(rels[i], forknum);
+        }
+    }
+}
+
 /*
  *    smgrdounlinkall() -- Immediately unlink all forks of all given relations
  *
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 76f41dbe36..9ee9dc8cc0 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -257,6 +257,9 @@ static void RelationReloadIndexInfo(Relation relation);
 static void RelationReloadNailed(Relation relation);
 static void RelationFlushRelation(Relation relation);
 static void RememberToFreeTupleDescAtEOX(TupleDesc td);
+#ifdef USE_ASSERT_CHECKING
+static void AssertPendingSyncConsistency(Relation relation);
+#endif
 static void AtEOXact_cleanup(Relation relation, bool isCommit);
 static void AtEOSubXact_cleanup(Relation relation, bool isCommit,
                                 SubTransactionId mySubid, SubTransactionId parentSubid);
@@ -1093,6 +1096,8 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
     relation->rd_isnailed = false;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_droppedSubid = InvalidSubTransactionId;
     switch (relation->rd_rel->relpersistence)
     {
         case RELPERSISTENCE_UNLOGGED:
@@ -1817,6 +1822,8 @@ formrdesc(const char *relationName, Oid relationReltype,
     relation->rd_isnailed = true;
     relation->rd_createSubid = InvalidSubTransactionId;
     relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_droppedSubid = InvalidSubTransactionId;
     relation->rd_backend = InvalidBackendId;
     relation->rd_islocaltemp = false;
 
@@ -1989,6 +1996,13 @@ RelationIdGetRelation(Oid relationId)
 
     if (RelationIsValid(rd))
     {
+        /* return NULL for dropped relations */
+        if (rd->rd_droppedSubid != InvalidSubTransactionId)
+        {
+            Assert(!rd->rd_isvalid);
+            return NULL;
+        }
+
         RelationIncrementReferenceCount(rd);
         /* revalidate cache entry if necessary */
         if (!rd->rd_isvalid)
@@ -2092,7 +2106,7 @@ RelationClose(Relation relation)
 #ifdef RELCACHE_FORCE_RELEASE
     if (RelationHasReferenceCountZero(relation) &&
         relation->rd_createSubid == InvalidSubTransactionId &&
-        relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
         RelationClearRelation(relation, false);
 #endif
 }
@@ -2131,10 +2145,11 @@ RelationReloadIndexInfo(Relation relation)
     HeapTuple    pg_class_tuple;
     Form_pg_class relp;
 
-    /* Should be called only for invalidated indexes */
+    /* Should be called only for invalidated, live indexes */
     Assert((relation->rd_rel->relkind == RELKIND_INDEX ||
             relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX) &&
-           !relation->rd_isvalid);
+           !relation->rd_isvalid &&
+           relation->rd_droppedSubid == InvalidSubTransactionId);
 
     /* Ensure it's closed at smgr level */
     RelationCloseSmgr(relation);
@@ -2430,6 +2445,13 @@ RelationClearRelation(Relation relation, bool rebuild)
         return;
     }
 
+    /* Mark it invalid until we've finished rebuild */
+    relation->rd_isvalid = false;
+
+    /* See RelationForgetRelation(). */
+    if (relation->rd_droppedSubid != InvalidSubTransactionId)
+        return;
+
     /*
      * Even non-system indexes should not be blown away if they are open and
      * have valid index support information.  This avoids problems with active
@@ -2442,15 +2464,11 @@ RelationClearRelation(Relation relation, bool rebuild)
         relation->rd_refcnt > 0 &&
         relation->rd_indexcxt != NULL)
     {
-        relation->rd_isvalid = false;    /* needs to be revalidated */
         if (IsTransactionState())
             RelationReloadIndexInfo(relation);
         return;
     }
 
-    /* Mark it invalid until we've finished rebuild */
-    relation->rd_isvalid = false;
-
     /*
      * If we're really done with the relcache entry, blow it away. But if
      * someone is still using it, reconstruct the whole deal without moving
@@ -2508,13 +2526,13 @@ RelationClearRelation(Relation relation, bool rebuild)
          * problem.
          *
          * When rebuilding an open relcache entry, we must preserve ref count,
-         * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state.  Also
-         * attempt to preserve the pg_class entry (rd_rel), tupledesc,
-         * rewrite-rule, partition key, and partition descriptor substructures
-         * in place, because various places assume that these structures won't
-         * move while they are working with an open relcache entry.  (Note:
-         * the refcount mechanism for tupledescs might someday allow us to
-         * remove this hack for the tupledesc.)
+         * rd_*Subid, and rd_toastoid state.  Also attempt to preserve the
+         * pg_class entry (rd_rel), tupledesc, rewrite-rule, partition key,
+         * and partition descriptor substructures in place, because various
+         * places assume that these structures won't move while they are
+         * working with an open relcache entry.  (Note:  the refcount
+         * mechanism for tupledescs might someday allow us to remove this hack
+         * for the tupledesc.)
          *
          * Note that this process does not touch CurrentResourceOwner; which
          * is good because whatever ref counts the entry may have do not
@@ -2594,6 +2612,8 @@ RelationClearRelation(Relation relation, bool rebuild)
         /* creation sub-XIDs must be preserved */
         SWAPFIELD(SubTransactionId, rd_createSubid);
         SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+        SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
+        SWAPFIELD(SubTransactionId, rd_droppedSubid);
         /* un-swap rd_rel pointers, swap contents instead */
         SWAPFIELD(Form_pg_class, rd_rel);
         /* ... but actually, we don't have to update newrel->rd_rel */
@@ -2672,12 +2692,12 @@ static void
 RelationFlushRelation(Relation relation)
 {
     if (relation->rd_createSubid != InvalidSubTransactionId ||
-        relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
     {
         /*
          * New relcache entries are always rebuilt, not flushed; else we'd
-         * forget the "new" status of the relation, which is a useful
-         * optimization to have.  Ditto for the new-relfilenode status.
+         * forget the "new" status of the relation.  Ditto for the
+         * new-relfilenode status.
          *
          * The rel could have zero refcnt here, so temporarily increment the
          * refcnt to ensure it's safe to rebuild it.  We can assume that the
@@ -2699,10 +2719,7 @@ RelationFlushRelation(Relation relation)
 }
 
 /*
- * RelationForgetRelation - unconditionally remove a relcache entry
- *
- *           External interface for destroying a relcache entry when we
- *           drop the relation.
+ * RelationForgetRelation - caller reports that it dropped the relation
  */
 void
 RelationForgetRelation(Oid rid)
@@ -2717,7 +2734,19 @@ RelationForgetRelation(Oid rid)
     if (!RelationHasReferenceCountZero(relation))
         elog(ERROR, "relation %u is still open", rid);
 
-    /* Unconditionally destroy the relcache entry */
+    Assert(relation->rd_droppedSubid == InvalidSubTransactionId);
+    if (relation->rd_createSubid != InvalidSubTransactionId ||
+        relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
+    {
+        /*
+         * In the event of subtransaction rollback, we must not forget
+         * rd_*Subid.  Mark the entry "dropped" so RelationClearRelation()
+         * invalidates it in lieu of destroying it.  (If we're in a top
+         * transaction, we could opt to destroy the entry.)
+         */
+        relation->rd_droppedSubid = GetCurrentSubTransactionId();
+    }
+
     RelationClearRelation(relation, false);
 }
 
@@ -2757,11 +2786,10 @@ RelationCacheInvalidateEntry(Oid relationId)
  *     relation cache and re-read relation mapping data.
  *
  *     This is currently used only to recover from SI message buffer overflow,
- *     so we do not touch new-in-transaction relations; they cannot be targets
- *     of cross-backend SI updates (and our own updates now go through a
- *     separate linked list that isn't limited by the SI message buffer size).
- *     Likewise, we need not discard new-relfilenode-in-transaction hints,
- *     since any invalidation of those would be a local event.
+ *     so we do not touch relations having new-in-transaction relfilenodes; they
+ *     cannot be targets of cross-backend SI updates (and our own updates now go
+ *     through a separate linked list that isn't limited by the SI message
+ *     buffer size).
  *
  *     We do this in two phases: the first pass deletes deletable items, and
  *     the second one rebuilds the rebuildable items.  This is essential for
@@ -2812,7 +2840,7 @@ RelationCacheInvalidate(void)
          * pending invalidations.
          */
         if (relation->rd_createSubid != InvalidSubTransactionId ||
-            relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+            relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
             continue;
 
         relcacheInvalsReceived++;
@@ -2924,6 +2952,84 @@ RememberToFreeTupleDescAtEOX(TupleDesc td)
     EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
 }
 
+#ifdef USE_ASSERT_CHECKING
+static void
+AssertPendingSyncConsistency(Relation relation)
+{
+    bool        relcache_verdict =
+    relation->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&
+    ((relation->rd_createSubid != InvalidSubTransactionId &&
+      RELKIND_HAS_STORAGE(relation->rd_rel->relkind)) ||
+     relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId);
+
+    Assert(relcache_verdict == RelFileNodeSkippingWAL(relation->rd_node));
+
+    if (relation->rd_droppedSubid != InvalidSubTransactionId)
+        Assert(!relation->rd_isvalid &&
+               (relation->rd_createSubid != InvalidSubTransactionId ||
+                relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId));
+}
+
+/*
+ * AssertPendingSyncs_RelationCache
+ *
+ *    Assert that relcache.c and storage.c agree on whether to skip WAL.
+ */
+void
+AssertPendingSyncs_RelationCache(void)
+{
+    HASH_SEQ_STATUS status;
+    LOCALLOCK  *locallock;
+    Relation   *rels;
+    int            maxrels;
+    int            nrels;
+    RelIdCacheEnt *idhentry;
+    int            i;
+
+    /*
+     * Open every relation that this transaction has locked.  If, for some
+     * relation, storage.c is skipping WAL and relcache.c is not skipping WAL,
+     * a CommandCounterIncrement() typically yields a local invalidation
+     * message that destroys the relcache entry.  By recreating such entries
+     * here, we detect the problem.
+     */
+    PushActiveSnapshot(GetTransactionSnapshot());
+    maxrels = 1;
+    rels = palloc(maxrels * sizeof(*rels));
+    nrels = 0;
+    hash_seq_init(&status, GetLockMethodLocalHash());
+    while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+    {
+        Oid            relid;
+        Relation    r;
+
+        if (locallock->nLocks <= 0)
+            continue;
+        if ((LockTagType) locallock->tag.lock.locktag_type !=
+            LOCKTAG_RELATION)
+            continue;
+        relid = ObjectIdGetDatum(locallock->tag.lock.locktag_field2);
+        r = RelationIdGetRelation(relid);
+        if (!RelationIsValid(r))
+            continue;
+        if (nrels >= maxrels)
+        {
+            maxrels *= 2;
+            rels = repalloc(rels, maxrels * sizeof(*rels));
+        }
+        rels[nrels++] = r;
+    }
+
+    hash_seq_init(&status, RelationIdCache);
+    while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+        AssertPendingSyncConsistency(idhentry->reldesc);
+
+    for (i = 0; i < nrels; i++)
+        RelationClose(rels[i]);
+    PopActiveSnapshot();
+}
+#endif
+
 /*
  * AtEOXact_RelationCache
  *
@@ -3006,6 +3112,8 @@ AtEOXact_RelationCache(bool isCommit)
 static void
 AtEOXact_cleanup(Relation relation, bool isCommit)
 {
+    bool        clear_relcache = false;
+
     /*
      * The relcache entry's ref count should be back to its normal
      * not-in-a-transaction state: 0 unless it's nailed in cache.
@@ -3031,17 +3139,31 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
 #endif
 
     /*
-     * Is it a relation created in the current transaction?
+     * Is the relation live after this transaction ends?
      *
-     * During commit, reset the flag to zero, since we are now out of the
-     * creating transaction.  During abort, simply delete the relcache entry
-     * --- it isn't interesting any longer.
+     * During commit, clear the relcache entry if it is preserved after
+     * relation drop, in order not to orphan the entry.  During rollback,
+     * clear the relcache entry if the relation is created in the current
+     * transaction since it isn't interesting any longer once we are out of
+     * the transaction.
+     */
+    clear_relcache =
+        (isCommit ?
+         relation->rd_droppedSubid != InvalidSubTransactionId :
+         relation->rd_createSubid != InvalidSubTransactionId);
+
+    /*
+     * Since we are now out of the transaction, reset the subids to zero.
+     * That also lets RelationClearRelation() drop the relcache entry.
      */
-    if (relation->rd_createSubid != InvalidSubTransactionId)
+    relation->rd_createSubid = InvalidSubTransactionId;
+    relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    relation->rd_droppedSubid = InvalidSubTransactionId;
+
+    if (clear_relcache)
     {
-        if (isCommit)
-            relation->rd_createSubid = InvalidSubTransactionId;
-        else if (RelationHasReferenceCountZero(relation))
+        if (RelationHasReferenceCountZero(relation))
         {
             RelationClearRelation(relation, false);
             return;
@@ -3056,16 +3178,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
              * eventually.  This must be just a WARNING to avoid
              * error-during-error-recovery loops.
              */
-            relation->rd_createSubid = InvalidSubTransactionId;
             elog(WARNING, "cannot remove relcache entry for \"%s\" because it has nonzero refcount",
                  RelationGetRelationName(relation));
         }
     }
-
-    /*
-     * Likewise, reset the hint about the relfilenode being new.
-     */
-    relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
 }
 
 /*
@@ -3129,15 +3245,28 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
     /*
      * Is it a relation created in the current subtransaction?
      *
-     * During subcommit, mark it as belonging to the parent, instead. During
-     * subabort, simply delete the relcache entry.
+     * During subcommit, mark it as belonging to the parent, instead, as long
+     * as it has not been dropped. Otherwise simply delete the relcache entry.
+     * --- it isn't interesting any longer.
      */
     if (relation->rd_createSubid == mySubid)
     {
-        if (isCommit)
+        /*
+         * Valid rd_droppedSubid means the corresponding relation is dropped
+         * but the relcache entry is preserved for at-commit pending sync. We
+         * need to drop it explicitly here not to make the entry orphan.
+         */
+        Assert(relation->rd_droppedSubid == mySubid ||
+               relation->rd_droppedSubid == InvalidSubTransactionId);
+        if (isCommit && relation->rd_droppedSubid == InvalidSubTransactionId)
             relation->rd_createSubid = parentSubid;
         else if (RelationHasReferenceCountZero(relation))
         {
+            /* allow the entry to be removed */
+            relation->rd_createSubid = InvalidSubTransactionId;
+            relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+            relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+            relation->rd_droppedSubid = InvalidSubTransactionId;
             RelationClearRelation(relation, false);
             return;
         }
@@ -3157,7 +3286,8 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
     }
 
     /*
-     * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+     * Likewise, update or drop any new-relfilenode-in-subtransaction record
+     * or drop record.
      */
     if (relation->rd_newRelfilenodeSubid == mySubid)
     {
@@ -3166,6 +3296,22 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
         else
             relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
     }
+
+    if (relation->rd_firstRelfilenodeSubid == mySubid)
+    {
+        if (isCommit)
+            relation->rd_firstRelfilenodeSubid = parentSubid;
+        else
+            relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+    }
+
+    if (relation->rd_droppedSubid == mySubid)
+    {
+        if (isCommit)
+            relation->rd_droppedSubid = parentSubid;
+        else
+            relation->rd_droppedSubid = InvalidSubTransactionId;
+    }
 }
 
 
@@ -3255,6 +3401,7 @@ RelationBuildLocalRelation(const char *relname,
     /* it's being created in this transaction */
     rel->rd_createSubid = GetCurrentSubTransactionId();
     rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+    rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 
     /*
      * create a new tuple descriptor from the one passed in.  We do this
@@ -3552,14 +3699,29 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
      */
     CommandCounterIncrement();
 
-    /*
-     * Mark the rel as having been given a new relfilenode in the current
-     * (sub) transaction.  This is a hint that can be used to optimize later
-     * operations on the rel in the same transaction.
-     */
+    RelationAssumeNewRelfilenode(relation);
+}
+
+/*
+ * RelationAssumeNewRelfilenode
+ *
+ * Code that modifies pg_class.reltablespace or pg_class.relfilenode must call
+ * this.  The call shall precede any code that might insert WAL records whose
+ * replay would modify bytes in the new RelFileNode, and the call shall follow
+ * any WAL modifying bytes in the prior RelFileNode.  See struct RelationData.
+ * Ideally, call this as near as possible to the CommandCounterIncrement()
+ * that makes the pg_class change visible (before it or after it); that
+ * minimizes the chance of future development adding a forbidden WAL insertion
+ * between RelationAssumeNewRelfilenode() and CommandCounterIncrement().
+ */
+void
+RelationAssumeNewRelfilenode(Relation relation)
+{
     relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+    if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+        relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
 
-    /* Flag relation as needing eoxact cleanup (to remove the hint) */
+    /* Flag relation as needing eoxact cleanup (to clear these fields) */
     EOXactListAdd(relation);
 }
 
@@ -5625,6 +5787,8 @@ load_relcache_init_file(bool shared)
         rel->rd_fkeylist = NIL;
         rel->rd_createSubid = InvalidSubTransactionId;
         rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+        rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+        rel->rd_droppedSubid = InvalidSubTransactionId;
         rel->rd_amcache = NULL;
         MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index af876d1f01..7d1f1069f1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -36,6 +36,7 @@
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
+#include "catalog/storage.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
 #include "commands/trigger.h"
@@ -2750,6 +2751,17 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"wal_skip_threshold", PGC_USERSET, WAL_SETTINGS,
+            gettext_noop("Size of new file to fsync instead of writing WAL."),
+            NULL,
+            GUC_UNIT_KB
+        },
+        &wal_skip_threshold,
+        2048, 0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
     {
         {"max_wal_senders", PGC_POSTMASTER, REPLICATION_SENDING,
             gettext_noop("Sets the maximum number of simultaneously running WAL sender processes."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index aa44f0c9bf..c7e46592fb 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -215,6 +215,7 @@
                     # (change requires restart)
 #wal_writer_delay = 200ms        # 1-10000 milliseconds
 #wal_writer_flush_after = 1MB        # measured in pages, 0 disables
+#wal_skip_threshold = 2MB
 
 #commit_delay = 0            # range 0-100000, in microseconds
 #commit_siblings = 5            # range 1-1000
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 18f2b0d98e..4bfc628000 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -455,6 +455,8 @@ extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
                                 BlockNumber origrlink, GistNSN oldnsn,
                                 Buffer leftchild, bool markfollowright);
 
+extern XLogRecPtr gistXLogAssignLSN(void);
+
 /* gistget.c */
 extern bool gistgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 55fc843d3a..673afee1e1 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -26,6 +26,7 @@
  /* #define XLOG_GIST_INSERT_COMPLETE     0x40 */    /* not used anymore */
  /* #define XLOG_GIST_CREATE_INDEX         0x50 */    /* not used anymore */
 #define XLOG_GIST_PAGE_DELETE        0x60
+#define XLOG_GIST_ASSIGN_LSN        0x70    /* nop, assign new LSN */
 
 /*
  * Backup Blk 0: updated page.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 47fda28daa..f279edc473 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -31,7 +31,6 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM    TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN        TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL    TABLE_INSERT_NO_LOGICAL
@@ -168,8 +167,6 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                                HeapTuple tup);
 
-extern void heap_sync(Relation relation);
-
 extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel,
                                                          ItemPointerData *items,
                                                          int nitems);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index fb2902bd69..e6d7fa1e65 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
                                        TransactionId OldestXmin, TransactionId FreezeXid,
-                                       MultiXactId MultiXactCutoff, bool use_wal);
+                                       MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
                                HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 91f84b1107..94903dd8de 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -128,7 +128,7 @@ typedef struct TM_FailureData
 } TM_FailureData;
 
 /* "options" flag bits for table_tuple_insert */
-#define TABLE_INSERT_SKIP_WAL        0x0001
+/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
 #define TABLE_INSERT_SKIP_FSM        0x0002
 #define TABLE_INSERT_FROZEN            0x0004
 #define TABLE_INSERT_NO_LOGICAL        0x0008
@@ -410,9 +410,8 @@ typedef struct TableAmRoutine
 
     /*
      * Perform operations necessary to complete insertions made via
-     * tuple_insert and multi_insert with a BulkInsertState specified. This
-     * may for example be used to flush the relation, when the
-     * TABLE_INSERT_SKIP_WAL option was used.
+     * tuple_insert and multi_insert with a BulkInsertState specified. In-tree
+     * access methods ceased to use this.
      *
      * Typically callers of tuple_insert and multi_insert will just pass all
      * the flags that apply to them, and each AM has to decide which of them
@@ -1119,10 +1118,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * The options bitmask allows the caller to specify options that may change the
  * behaviour of the AM. The AM will ignore options that it does not support.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.
@@ -1342,9 +1337,7 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
 
 /*
  * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
+ * tuple_insert and multi_insert with a BulkInsertState specified.
  */
 static inline void
 table_finish_bulk_insert(Relation rel, int options)
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 048003c25e..bd37bf311c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -19,18 +19,24 @@
 #include "storage/smgr.h"
 #include "utils/relcache.h"
 
+/* GUC variables */
+extern int    wal_skip_threshold;
+
 extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
+extern void RelationPreTruncate(Relation rel);
 extern void RelationTruncate(Relation rel, BlockNumber nblocks);
 extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
                                 ForkNumber forkNum, char relpersistence);
+extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
 
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
  * naming
  */
 extern void smgrDoPendingDeletes(bool isCommit);
+extern void smgrDoPendingSyncs(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 2039b42449..9c41bd5915 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2782,6 +2782,9 @@ typedef struct IndexStmt
     char       *idxcomment;        /* comment to apply to index, or NULL */
     Oid            indexOid;        /* OID of an existing index, if any */
     Oid            oldNode;        /* relfilenode of existing storage, if any */
+    SubTransactionId oldCreateSubid;    /* rd_createSubid of oldNode */
+    SubTransactionId oldFirstRelfilenodeSubid;    /* rd_firstRelfilenodeSubid of
+                                                 * oldNode */
     bool        unique;            /* is index unique? */
     bool        primary;        /* is index a primary key? */
     bool        isconstraint;    /* is it for a pkey/unique constraint? */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d2a5b52f6e..bf3b12a2de 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -49,6 +49,9 @@ typedef enum
 /* forward declared, to avoid having to expose buf_internals.h here */
 struct WritebackContext;
 
+/* forward declared, to avoid including smgr.h here */
+struct SMgrRelationData;
+
 /* in globals.c ... this duplicates miscadmin.h */
 extern PGDLLIMPORT int NBuffers;
 
@@ -186,6 +189,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                                    ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                                    int nforks, BlockNumber *firstDelBlock);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index a89e54dbb0..fdabf42721 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -546,6 +546,9 @@ extern void LockReleaseSession(LOCKMETHODID lockmethodid);
 extern void LockReleaseCurrentOwner(LOCALLOCK **locallocks, int nlocks);
 extern void LockReassignCurrentOwner(LOCALLOCK **locallocks, int nlocks);
 extern bool LockHeldByMe(const LOCKTAG *locktag, LOCKMODE lockmode);
+#ifdef USE_ASSERT_CHECKING
+extern HTAB *GetLockMethodLocalHash(void);
+#endif
 extern bool LockHasWaiters(const LOCKTAG *locktag,
                            LOCKMODE lockmode, bool sessionLock);
 extern VirtualTransactionId *GetLockConflicts(const LOCKTAG *locktag,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 243822137c..79dfe0e373 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -89,6 +89,7 @@ extern void smgrcloseall(void);
 extern void smgrclosenode(RelFileNodeBackend rnode);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdounlink(SMgrRelation reln, bool isRedo);
+extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
                        BlockNumber blocknum, char *buffer, bool skipFsync);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 39cdcddc2b..461f64e611 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -66,25 +66,45 @@ typedef struct RelationData
 
     /*----------
      * rd_createSubid is the ID of the highest subtransaction the rel has
-     * survived into; or zero if the rel was not created in the current top
-     * transaction.  This can be now be relied on, whereas previously it could
-     * be "forgotten" in earlier releases. Likewise, rd_newRelfilenodeSubid is
-     * the ID of the highest subtransaction the relfilenode change has
-     * survived into, or zero if not changed in the current transaction (or we
-     * have forgotten changing it). rd_newRelfilenodeSubid can be forgotten
-     * when a relation has multiple new relfilenodes within a single
-     * transaction, with one of them occurring in a subsequently aborted
-     * subtransaction, e.g.
+     * survived into or zero if the rel or its rd_node was created before the
+     * current top transaction.  (IndexStmt.oldNode leads to the case of a new
+     * rel with an old rd_node.)  rd_firstRelfilenodeSubid is the ID of the
+     * highest subtransaction an rd_node change has survived into or zero if
+     * rd_node matches the value it had at the start of the current top
+     * transaction.  (Rolling back the subtransaction that
+     * rd_firstRelfilenodeSubid denotes would restore rd_node to the value it
+     * had at the start of the current top transaction.  Rolling back any
+     * lower subtransaction would not.)  Their accuracy is critical to
+     * RelationNeedsWAL().
+     *
+     * rd_newRelfilenodeSubid is the ID of the highest subtransaction the
+     * most-recent relfilenode change has survived into or zero if not changed
+     * in the current transaction (or we have forgotten changing it).  This
+     * field is accurate when non-zero, but it can be zero when a relation has
+     * multiple new relfilenodes within a single transaction, with one of them
+     * occurring in a subsequently aborted subtransaction, e.g.
      *        BEGIN;
      *        TRUNCATE t;
      *        SAVEPOINT save;
      *        TRUNCATE t;
      *        ROLLBACK TO save;
      *        -- rd_newRelfilenodeSubid is now forgotten
+     *
+     * If every rd_*Subid field is zero, they are read-only outside
+     * relcache.c.  Files that trigger rd_node changes by updating
+     * pg_class.reltablespace and/or pg_class.relfilenode call
+     * RelationAssumeNewRelfilenode() to update rd_*Subid.
+     *
+     * rd_droppedSubid is the ID of the highest subtransaction that a drop of
+     * the rel has survived into.  In entries visible outside relcache.c, this
+     * is always zero.
      */
     SubTransactionId rd_createSubid;    /* rel was created in current xact */
-    SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
-                                                 * current xact */
+    SubTransactionId rd_newRelfilenodeSubid;    /* highest subxact changing
+                                                 * rd_node to current value */
+    SubTransactionId rd_firstRelfilenodeSubid;    /* highest subxact changing
+                                                 * rd_node to any value */
+    SubTransactionId rd_droppedSubid;    /* dropped with another Subid set */
 
     Form_pg_class rd_rel;        /* RELATION tuple */
     TupleDesc    rd_att;            /* tuple descriptor */
@@ -531,9 +551,16 @@ typedef struct ViewOptions
 /*
  * RelationNeedsWAL
  *        True if relation needs WAL.
- */
-#define RelationNeedsWAL(relation) \
-    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+ *
+ * Returns false if wal_level = minimal and this relation is created or
+ * truncated in the current transaction.  See "Skipping WAL for New
+ * RelFileNode" in src/backend/access/transam/README.
+ */
+#define RelationNeedsWAL(relation)                                        \
+    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&    \
+     (XLogIsNeeded() ||                                                    \
+      (relation->rd_createSubid == InvalidSubTransactionId &&            \
+       relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
 
 /*
  * RelationUsesLocalBuffers
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index d77f5beec6..62239a09e8 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -106,9 +106,10 @@ extern Relation RelationBuildLocalRelation(const char *relname,
                                            char relkind);
 
 /*
- * Routine to manage assignment of new relfilenode to a relation
+ * Routines to manage assignment of new relfilenode to a relation
  */
 extern void RelationSetNewRelfilenode(Relation relation, char persistence);
+extern void RelationAssumeNewRelfilenode(Relation relation);
 
 /*
  * Routines for flushing/rebuilding relcache entries in various scenarios
@@ -121,6 +122,11 @@ extern void RelationCacheInvalidate(void);
 
 extern void RelationCloseSmgrByOid(Oid relationId);
 
+#ifdef USE_ASSERT_CHECKING
+extern void AssertPendingSyncs_RelationCache(void);
+#else
+#define AssertPendingSyncs_RelationCache() do {} while (0)
+#endif
 extern void AtEOXact_RelationCache(bool isCommit);
 extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid,
                                       SubTransactionId parentSubid);
diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..50bb2fef61
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,372 @@
+# Test WAL replay when some operation has skipped WAL.
+#
+# These tests exercise code that once violated the mandate described in
+# src/backend/access/transam/README section "Skipping WAL for New
+# RelFileNode".  The tests work by committing some transactions, initiating an
+# immediate shutdown, and confirming that the expected data survives recovery.
+# For many years, individual commands made the decision to skip WAL, hence the
+# frequent appearance of COPY in these tests.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 34;
+
+sub check_orphan_relfilenodes
+{
+    my ($node, $test_name) = @_;
+
+    my $db_oid = $node->safe_psql('postgres',
+        "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+    my $prefix               = "base/$db_oid/";
+    my $filepaths_referenced = $node->safe_psql(
+        'postgres', "
+       SELECT pg_relation_filepath(oid) FROM pg_class
+       WHERE reltablespace = 0 AND relpersistence <> 't' AND
+       pg_relation_filepath(oid) IS NOT NULL;");
+    is_deeply(
+        [
+            sort(map { "$prefix$_" }
+                  grep(/^[0-9]+$/, slurp_dir($node->data_dir . "/$prefix")))
+        ],
+        [ sort split /\n/, $filepaths_referenced ],
+        $test_name);
+    return;
+}
+
+# We run this same test suite for both wal_level=minimal and replica.
+sub run_wal_optimize
+{
+    my $wal_level = shift;
+
+    my $node = get_new_node("node_$wal_level");
+    $node->init;
+    $node->append_conf(
+        'postgresql.conf', qq(
+wal_level = $wal_level
+max_prepared_transactions = 1
+wal_log_hints = on
+wal_skip_threshold = 0
+#wal_debug = on
+));
+    $node->start;
+
+    # Setup
+    my $tablespace_dir = $node->basedir . '/tablespace_other';
+    mkdir($tablespace_dir);
+    $tablespace_dir = TestLib::perl2host($tablespace_dir);
+    $node->safe_psql('postgres',
+        "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+    # Test direct truncation optimization.  No tuples.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE trunc (id serial PRIMARY KEY);
+        TRUNCATE trunc;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    my $result = $node->safe_psql('postgres', "SELECT count(*) FROM trunc;");
+    is($result, qq(0), "wal_level = $wal_level, TRUNCATE with empty table");
+
+    # Test truncation with inserted tuples within the same transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE trunc_ins (id serial PRIMARY KEY);
+        INSERT INTO trunc_ins VALUES (DEFAULT);
+        TRUNCATE trunc_ins;
+        INSERT INTO trunc_ins VALUES (DEFAULT);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres',
+        "SELECT count(*), min(id) FROM trunc_ins;");
+    is($result, qq(1|2), "wal_level = $wal_level, TRUNCATE INSERT");
+
+    # Same for prepared transaction.
+    # Tuples inserted after the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE twophase (id serial PRIMARY KEY);
+        INSERT INTO twophase VALUES (DEFAULT);
+        TRUNCATE twophase;
+        INSERT INTO twophase VALUES (DEFAULT);
+        PREPARE TRANSACTION 't';
+        COMMIT PREPARED 't';");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres',
+        "SELECT count(*), min(id) FROM trunc_ins;");
+    is($result, qq(1|2), "wal_level = $wal_level, TRUNCATE INSERT PREPARE");
+
+    # Writing WAL at end of xact, instead of syncing.
+    $node->safe_psql(
+        'postgres', "
+        SET wal_skip_threshold = '1TB';
+        BEGIN;
+        CREATE TABLE noskip (id serial PRIMARY KEY);
+        INSERT INTO noskip (SELECT FROM generate_series(1, 20000) a) ;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM noskip;");
+    is($result, qq(20000), "wal_level = $wal_level, end-of-xact WAL");
+
+    # Data file for COPY query in subsequent tests
+    my $basedir   = $node->basedir;
+    my $copy_file = "$basedir/copy_data.txt";
+    TestLib::append_to_file(
+        $copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+    # Test truncation with inserted tuples using both INSERT and COPY.  Tuples
+    # inserted after the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE ins_trunc (id serial PRIMARY KEY, id2 int);
+        INSERT INTO ins_trunc VALUES (DEFAULT, generate_series(1,10000));
+        TRUNCATE ins_trunc;
+        INSERT INTO ins_trunc (id, id2) VALUES (DEFAULT, 10000);
+        COPY ins_trunc FROM '$copy_file' DELIMITER ',';
+        INSERT INTO ins_trunc (id, id2) VALUES (DEFAULT, 10000);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM ins_trunc;");
+    is($result, qq(5), "wal_level = $wal_level, TRUNCATE COPY INSERT");
+
+    # Test truncation with inserted tuples using COPY.  Tuples copied after
+    # the truncation should be seen.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE trunc_copy (id serial PRIMARY KEY, id2 int);
+        INSERT INTO trunc_copy VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE trunc_copy;
+        COPY trunc_copy FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result =
+      $node->safe_psql('postgres', "SELECT count(*) FROM trunc_copy;");
+    is($result, qq(3), "wal_level = $wal_level, TRUNCATE COPY");
+
+    # Like previous test, but rollback SET TABLESPACE in a subtransaction.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE spc_abort (id serial PRIMARY KEY, id2 int);
+        INSERT INTO spc_abort VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE spc_abort;
+        SAVEPOINT s;
+          ALTER TABLE spc_abort SET TABLESPACE other; ROLLBACK TO s;
+        COPY spc_abort FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_abort;");
+    is($result, qq(3),
+        "wal_level = $wal_level, SET TABLESPACE abort subtransaction");
+
+    # in different subtransaction patterns
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE spc_commit (id serial PRIMARY KEY, id2 int);
+        INSERT INTO spc_commit VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE spc_commit;
+        SAVEPOINT s; ALTER TABLE spc_commit SET TABLESPACE other; RELEASE s;
+        COPY spc_commit FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result =
+      $node->safe_psql('postgres', "SELECT count(*) FROM spc_commit;");
+    is($result, qq(3),
+        "wal_level = $wal_level, SET TABLESPACE commit subtransaction");
+
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE spc_nest (id serial PRIMARY KEY, id2 int);
+        INSERT INTO spc_nest VALUES (DEFAULT, generate_series(1,3000));
+        TRUNCATE spc_nest;
+        SAVEPOINT s;
+            ALTER TABLE spc_nest SET TABLESPACE other;
+            SAVEPOINT s2;
+                ALTER TABLE spc_nest SET TABLESPACE pg_default;
+            ROLLBACK TO s2;
+            SAVEPOINT s2;
+                ALTER TABLE spc_nest SET TABLESPACE pg_default;
+            RELEASE s2;
+        ROLLBACK TO s;
+        COPY spc_nest FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_nest;");
+    is($result, qq(3),
+        "wal_level = $wal_level, SET TABLESPACE nested subtransaction");
+
+    $node->safe_psql(
+        'postgres', "
+        CREATE TABLE spc_hint (id int);
+        INSERT INTO spc_hint VALUES (1);
+        BEGIN;
+        ALTER TABLE spc_hint SET TABLESPACE other;
+        CHECKPOINT;
+        SELECT * FROM spc_hint;  -- set hint bit
+        INSERT INTO spc_hint VALUES (2);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM spc_hint;");
+    is($result, qq(2), "wal_level = $wal_level, SET TABLESPACE, hint bit");
+
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE idx_hint (c int PRIMARY KEY);
+        SAVEPOINT q; INSERT INTO idx_hint VALUES (1); ROLLBACK TO q;
+        CHECKPOINT;
+        INSERT INTO idx_hint VALUES (1);  -- set index hint bit
+        INSERT INTO idx_hint VALUES (2);
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->psql('postgres',);
+    my ($ret, $stdout, $stderr) =
+      $node->psql('postgres', "INSERT INTO idx_hint VALUES (2);");
+    is($ret, qq(3), "wal_level = $wal_level, unique index LP_DEAD");
+    like(
+        $stderr,
+        qr/violates unique/,
+        "wal_level = $wal_level, unique index LP_DEAD message");
+
+    # UPDATE touches two buffers for one row.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE upd (id serial PRIMARY KEY, id2 int);
+        INSERT INTO upd (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+        COPY upd FROM '$copy_file' DELIMITER ',';
+        UPDATE upd SET id2 = id2 + 1;
+        DELETE FROM upd;
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM upd;");
+    is($result, qq(0),
+        "wal_level = $wal_level, UPDATE touches two buffers for one row");
+
+    # Test consistency of COPY with INSERT for table created in the same
+    # transaction.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE ins_copy (id serial PRIMARY KEY, id2 int);
+        INSERT INTO ins_copy VALUES (DEFAULT, 1);
+        COPY ins_copy FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM ins_copy;");
+    is($result, qq(4), "wal_level = $wal_level, INSERT COPY");
+
+    # Test consistency of COPY that inserts more to the same table using
+    # triggers.  If the INSERTS from the trigger go to the same block data
+    # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+    # it tries to replay the WAL record but the "before" image doesn't match,
+    # because not all changes were WAL-logged.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE ins_trig (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION ins_trig_before_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO ins_trig
+                VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE FUNCTION ins_trig_after_row_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            IF new.id2 NOT LIKE 'triggered%' THEN
+              INSERT INTO ins_trig
+                VALUES (DEFAULT, 'triggered row after' || NEW.id2);
+            END IF;
+            RETURN NEW;
+          END; \$\$;
+        CREATE TRIGGER ins_trig_before_row_insert
+          BEFORE INSERT ON ins_trig
+          FOR EACH ROW EXECUTE PROCEDURE ins_trig_before_row_trig();
+        CREATE TRIGGER ins_trig_after_row_insert
+          AFTER INSERT ON ins_trig
+          FOR EACH ROW EXECUTE PROCEDURE ins_trig_after_row_trig();
+        COPY ins_trig FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result = $node->safe_psql('postgres', "SELECT count(*) FROM ins_trig;");
+    is($result, qq(9), "wal_level = $wal_level, COPY with INSERT triggers");
+
+    # Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+    # with TRUNCATE triggers.
+    $node->safe_psql(
+        'postgres', "
+        BEGIN;
+        CREATE TABLE trunc_trig (id serial PRIMARY KEY, id2 text);
+        CREATE FUNCTION trunc_trig_before_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO trunc_trig VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE FUNCTION trunc_trig_after_stat_trig() RETURNS trigger
+          LANGUAGE plpgsql as \$\$
+          BEGIN
+            INSERT INTO trunc_trig VALUES (DEFAULT, 'triggered stat before');
+            RETURN NULL;
+          END; \$\$;
+        CREATE TRIGGER trunc_trig_before_stat_truncate
+          BEFORE TRUNCATE ON trunc_trig
+          FOR EACH STATEMENT EXECUTE PROCEDURE trunc_trig_before_stat_trig();
+        CREATE TRIGGER trunc_trig_after_stat_truncate
+          AFTER TRUNCATE ON trunc_trig
+          FOR EACH STATEMENT EXECUTE PROCEDURE trunc_trig_after_stat_trig();
+        INSERT INTO trunc_trig VALUES (DEFAULT, 1);
+        TRUNCATE trunc_trig;
+        COPY trunc_trig FROM '$copy_file' DELIMITER ',';
+        COMMIT;");
+    $node->stop('immediate');
+    $node->start;
+    $result =
+      $node->safe_psql('postgres', "SELECT count(*) FROM trunc_trig;");
+    is($result, qq(4),
+        "wal_level = $wal_level, TRUNCATE COPY with TRUNCATE triggers");
+
+    # Test redo of temp table creation.
+    $node->safe_psql(
+        'postgres', "
+        CREATE TEMP TABLE temp (id serial PRIMARY KEY, id2 text);");
+    $node->stop('immediate');
+    $node->start;
+    check_orphan_relfilenodes($node,
+        "wal_level = $wal_level, no orphan relfilenode remains");
+
+    return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index fb6d86a269..7c2181ac2f 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -1984,6 +1984,12 @@ select * from another;
 (3 rows)
 
 drop table another;
+-- Create an index that skips WAL, then perform a SET DATA TYPE that skips
+-- rewriting the index.
+begin;
+create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
+alter table skip_wal_skip_rewrite_index alter c type varchar(20);
+commit;
 -- table's row type
 create table tab1 (a int, b text);
 create table tab2 (x int, y tab1);
diff --git a/src/test/regress/expected/create_table.out b/src/test/regress/expected/create_table.out
index c5e95edbed..6acf31725f 100644
--- a/src/test/regress/expected/create_table.out
+++ b/src/test/regress/expected/create_table.out
@@ -331,6 +331,19 @@ CREATE TABLE default_expr_agg (a int DEFAULT (generate_series(1,3)));
 ERROR:  set-returning functions are not allowed in DEFAULT expressions
 LINE 1: CREATE TABLE default_expr_agg (a int DEFAULT (generate_serie...
                                                       ^
+-- Verify that subtransaction rollback restores rd_createSubid.
+BEGIN;
+CREATE TABLE remember_create_subid (c int);
+SAVEPOINT q; DROP TABLE remember_create_subid; ROLLBACK TO q;
+COMMIT;
+DROP TABLE remember_create_subid;
+-- Verify that subtransaction rollback restores rd_firstRelfilenodeSubid.
+CREATE TABLE remember_node_subid (c int);
+BEGIN;
+ALTER TABLE remember_node_subid ALTER c TYPE bigint;
+SAVEPOINT q; DROP TABLE remember_node_subid; ROLLBACK TO q;
+COMMIT;
+DROP TABLE remember_node_subid;
 --
 -- Partitioned tables
 --
diff --git a/src/test/regress/sql/alter_table.sql b/src/test/regress/sql/alter_table.sql
index 3801f19c58..1b1315f316 100644
--- a/src/test/regress/sql/alter_table.sql
+++ b/src/test/regress/sql/alter_table.sql
@@ -1360,6 +1360,13 @@ select * from another;
 
 drop table another;
 
+-- Create an index that skips WAL, then perform a SET DATA TYPE that skips
+-- rewriting the index.
+begin;
+create table skip_wal_skip_rewrite_index (c varchar(10) primary key);
+alter table skip_wal_skip_rewrite_index alter c type varchar(20);
+commit;
+
 -- table's row type
 create table tab1 (a int, b text);
 create table tab2 (x int, y tab1);
diff --git a/src/test/regress/sql/create_table.sql b/src/test/regress/sql/create_table.sql
index 00ef81a685..a670438c48 100644
--- a/src/test/regress/sql/create_table.sql
+++ b/src/test/regress/sql/create_table.sql
@@ -318,6 +318,21 @@ CREATE TABLE default_expr_agg (a int DEFAULT (select 1));
 -- invalid use of set-returning function
 CREATE TABLE default_expr_agg (a int DEFAULT (generate_series(1,3)));
 
+-- Verify that subtransaction rollback restores rd_createSubid.
+BEGIN;
+CREATE TABLE remember_create_subid (c int);
+SAVEPOINT q; DROP TABLE remember_create_subid; ROLLBACK TO q;
+COMMIT;
+DROP TABLE remember_create_subid;
+
+-- Verify that subtransaction rollback restores rd_firstRelfilenodeSubid.
+CREATE TABLE remember_node_subid (c int);
+BEGIN;
+ALTER TABLE remember_node_subid ALTER c TYPE bigint;
+SAVEPOINT q; DROP TABLE remember_node_subid; ROLLBACK TO q;
+COMMIT;
+DROP TABLE remember_node_subid;
+
 --
 -- Partitioned tables
 --
-- 
2.18.2

From c3307bd73c7e8f9894fb5276f32c93a1d956aac9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 23 Mar 2020 16:03:26 +0900
Subject: [PATCH v36 2/4] Fix GUC value in TAP test

2TB was too large for wal_skip_threshold on 32 bit platforms. Reduce
it to 1GB.
---
 src/test/recovery/t/018_wal_optimize.pl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
index 50bb2fef61..c39998bb2a 100644
--- a/src/test/recovery/t/018_wal_optimize.pl
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -107,7 +107,7 @@ wal_skip_threshold = 0
     # Writing WAL at end of xact, instead of syncing.
     $node->safe_psql(
         'postgres', "
-        SET wal_skip_threshold = '1TB';
+        SET wal_skip_threshold = '1GB';
         BEGIN;
         CREATE TABLE noskip (id serial PRIMARY KEY);
         INSERT INTO noskip (SELECT FROM generate_series(1, 20000) a) ;
-- 
2.18.2

From 4f7821d401ef4495324f7882604b47d39b9da134 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 23 Mar 2020 15:32:47 +0900
Subject: [PATCH v36 3/4] Fix the name of struct pendingSyncs

The naming convention is different from existing
PendingRelDelete. Unify with that.
---
 src/backend/catalog/storage.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0ed7c64a05..56be8a2d52 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -65,11 +65,11 @@ typedef struct PendingRelDelete
     struct PendingRelDelete *next;    /* linked-list link */
 } PendingRelDelete;
 
-typedef struct pendingSync
+typedef struct PendingRelSync
 {
     RelFileNode rnode;
     bool        is_truncated;    /* Has the file experienced truncation? */
-} pendingSync;
+} PendingRelSync;
 
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 HTAB       *pendingSyncHash = NULL;
@@ -131,7 +131,7 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     /* Queue an at-commit sync. */
     if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
     {
-        pendingSync *pending;
+        PendingRelSync *pending;
         bool        found;
 
         /* we sync only permanent relations */
@@ -142,7 +142,7 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
             HASHCTL        ctl;
 
             ctl.keysize = sizeof(RelFileNode);
-            ctl.entrysize = sizeof(pendingSync);
+            ctl.entrysize = sizeof(PendingRelSync);
             ctl.hcxt = TopTransactionContext;
             pendingSyncHash =
                 hash_create("pending sync hash",
@@ -374,7 +374,7 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 void
 RelationPreTruncate(Relation rel)
 {
-    pendingSync *pending;
+    PendingRelSync *pending;
 
     if (!pendingSyncHash)
         return;
@@ -581,7 +581,7 @@ smgrDoPendingSyncs(bool isCommit)
                 maxrels = 0;
     SMgrRelation *srels = NULL;
     HASH_SEQ_STATUS scan;
-    pendingSync *pendingsync;
+    PendingRelSync *pendingsync;
 
     if (XLogIsNeeded())
         return;                    /* no relation can use this */
@@ -611,7 +611,7 @@ smgrDoPendingSyncs(bool isCommit)
     }
 
     hash_seq_init(&scan, pendingSyncHash);
-    while ((pendingsync = (pendingSync *) hash_seq_search(&scan)))
+    while ((pendingsync = (PendingRelSync *) hash_seq_search(&scan)))
     {
         ForkNumber    fork;
         BlockNumber nblocks[MAX_FORKNUM + 1];
-- 
2.18.2

From 20a9891da987b979247f0539dff9406961c3e8d7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 23 Mar 2020 15:50:33 +0900
Subject: [PATCH v36 4/4] Propagage pending sync information to parallel
 workers

Parallel workers needs to know about pending syncs since it performs
some WAL-related operations. This patch reconstruct pending sync hash
on workers then set the correct newness flag on relcache creation.
---
 src/backend/catalog/storage.c       | 141 ++++++++++++++++++++++++++--
 src/backend/executor/execParallel.c |  28 ++++++
 src/backend/utils/cache/relcache.c  |  17 ++++
 src/include/catalog/storage.h       |   2 +
 4 files changed, 178 insertions(+), 10 deletions(-)

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 56be8a2d52..5bbd0f5c3c 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
 
 #include "postgres.h"
 
+#include "access/parallel.h"
 #include "access/visibilitymap.h"
 #include "access/xact.h"
 #include "access/xlog.h"
@@ -74,6 +75,26 @@ typedef struct PendingRelSync
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 HTAB       *pendingSyncHash = NULL;
 
+
+/*
+ *  create_pendingsync_hash - helper function to create pending sync hash
+ */
+static void
+create_pendingsync_hash(void)
+{
+    HASHCTL        ctl;
+
+    Assert(pendingSyncHash == NULL);
+
+    ctl.keysize = sizeof(RelFileNode);
+    ctl.entrysize = sizeof(PendingRelSync);
+    ctl.hcxt = TopTransactionContext;
+    pendingSyncHash =
+        hash_create("pending sync hash",
+                    16, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+}
+
+
 /*
  * RelationCreateStorage
  *        Create physical storage for a relation.
@@ -137,17 +158,12 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
         /* we sync only permanent relations */
         Assert(backend == InvalidBackendId);
 
+        /* we don't assume create storage on parallel workers*/
+        Assert(!IsParallelWorker());
+
+        /* create the hash if not yet */
         if (!pendingSyncHash)
-        {
-            HASHCTL        ctl;
-
-            ctl.keysize = sizeof(RelFileNode);
-            ctl.entrysize = sizeof(PendingRelSync);
-            ctl.hcxt = TopTransactionContext;
-            pendingSyncHash =
-                hash_create("pending sync hash",
-                            16, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-        }
+            create_pendingsync_hash();
 
         pending = hash_search(pendingSyncHash, &rnode, HASH_ENTER, &found);
         Assert(!found);
@@ -493,6 +509,111 @@ RelFileNodeSkippingWAL(RelFileNode rnode)
     return true;
 }
 
+/*
+ * SerializeSkippingWALRelFileNods
+ *
+ * RelationNeedsWAL and RelFileNodeSkippingWAL must offer the correct answer to
+ * parallel workers. This function is used to serialize RelFileNodes of pending
+ * syncs so that workers know about pending syncs. Since workers are not
+ * assumed to create or truncate of relfilenodes or perform transactional
+ * operations, only rnodes are required.
+ */
+int
+SerializePendingSyncs(RelFileNode **parray)
+{
+    HTAB           *tmphash;
+    HASHCTL            ctl;
+    HASH_SEQ_STATUS    scan;
+    PendingRelSync       *sync;
+    PendingRelDelete *delete;
+    RelFileNode       *src;
+    RelFileNode       *dest;
+    int                nrnodes = 0;
+
+    /* Don't call from parallel workers */
+    Assert(!IsParallelWorker());
+
+    if (XLogIsNeeded() || !pendingSyncHash)
+        return 0;             /* No pending syncs */
+
+    /* Create temporary hash to collect active relfilenodes */
+    memset(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(RelFileNode);
+    ctl.entrysize = sizeof(RelFileNode);
+    ctl.hcxt = CurrentMemoryContext;
+    tmphash = hash_create("tmp relfilenodes", 16, &ctl,
+                          HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+    /* collect all rnodes from pending syncs */
+    hash_seq_init(&scan, pendingSyncHash);
+    while ((sync = (PendingRelSync *) hash_seq_search(&scan)))
+    {
+        (void) hash_search(tmphash, &sync->rnode, HASH_ENTER, NULL);
+        nrnodes++;
+    }
+
+    /* remove deleted rnodes */
+    for (delete = pendingDeletes; delete != NULL; delete = delete->next)
+    {
+        if (delete->atCommit)
+        {
+            bool found;
+
+            (void) hash_search(tmphash, (void *) &delete->relnode,
+                               HASH_REMOVE, &found);
+            if (found)
+                nrnodes--;
+        }
+    }
+
+    Assert(nrnodes >= 0);
+
+    if (nrnodes == 0)
+        return 0;
+
+    /* Create and fill the array. It contains terminating (0,0,0) node */
+    *parray = palloc(sizeof(RelFileNode) * (nrnodes + 1));
+    dest = &(*parray)[0];
+
+    hash_seq_init(&scan, tmphash);
+    while ((src = (RelFileNode *) hash_seq_search(&scan)))
+        *dest++ = *src;
+
+    hash_destroy(tmphash);
+
+    /* set terminator */
+    MemSet(dest, 0, sizeof(RelFileNode));
+
+    return (nrnodes + 1) * sizeof(RelFileNode);
+}
+
+/*
+ * SerializeSkippingWALRelFileNods
+ */
+void
+DeserializePendingSyncs(RelFileNode *array)
+{
+    RelFileNode *rnode;
+
+    Assert(pendingSyncHash == NULL);
+    Assert(IsParallelWorker());
+
+    create_pendingsync_hash();
+
+    for (rnode = array ; rnode->relNode != 0 ; rnode++)
+    {
+        PendingRelSync *pending;
+
+        Assert(rnode->spcNode != 0 && rnode->dbNode != 0);
+
+        pending = hash_search(pendingSyncHash, rnode, HASH_ENTER, NULL);
+
+        pending->is_truncated = false; /* not cared */
+    }
+
+    return;
+}
+
 /*
  *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
  *
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a753d6efa0..5964c02b8a 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -23,6 +23,7 @@
 
 #include "postgres.h"
 
+#include "catalog/storage.h"
 #include "executor/execParallel.h"
 #include "executor/executor.h"
 #include "executor/nodeAppend.h"
@@ -62,6 +63,7 @@
 #define PARALLEL_KEY_DSA                UINT64CONST(0xE000000000000007)
 #define PARALLEL_KEY_QUERY_TEXT        UINT64CONST(0xE000000000000008)
 #define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
+#define PARALLEL_KEY_PENDING_SYNCS        UINT64CONST(0xE00000000000000A)
 
 #define PARALLEL_TUPLE_QUEUE_SIZE        65536
 
@@ -583,6 +585,9 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
     Size        dsa_minsize = dsa_minimum_size();
     char       *query_string;
     int            query_len;
+    int            pendingsync_list_size = 0;
+    RelFileNode *pendingsync_buf;
+    char       *pendingsync_space;
 
     /*
      * Force any initplan outputs that we're going to pass to workers to be
@@ -684,6 +689,14 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
         }
     }
 
+    /* Estimate space pending-sync relfilenodes */
+    pendingsync_list_size = SerializePendingSyncs(&pendingsync_buf);
+    if (pendingsync_list_size > 0)
+    {
+        shm_toc_estimate_chunk(&pcxt->estimator, pendingsync_list_size);
+        shm_toc_estimate_keys(&pcxt->estimator, 1);
+    }
+
     /* Estimate space for DSA area. */
     shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
     shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -769,6 +782,15 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
         }
     }
 
+    /* Copy pending sync list if any */
+    if (pendingsync_list_size > 0)
+    {
+        pendingsync_space = shm_toc_allocate(pcxt->toc, pendingsync_list_size);
+        memcpy(pendingsync_space, pendingsync_buf, pendingsync_list_size);
+        shm_toc_insert(pcxt->toc, PARALLEL_KEY_PENDING_SYNCS,
+                       pendingsync_space);
+    }
+
     /*
      * Create a DSA area that can be used by the leader and all workers.
      * (However, if we failed to create a DSM and are using private memory
@@ -1332,6 +1354,7 @@ void
 ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 {
     FixedParallelExecutorState *fpes;
+    RelFileNode    *pendingsyncs;
     BufferUsage *buffer_usage;
     DestReceiver *receiver;
     QueryDesc  *queryDesc;
@@ -1345,6 +1368,11 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
     /* Get fixed-size state. */
     fpes = shm_toc_lookup(toc, PARALLEL_KEY_EXECUTOR_FIXED, false);
 
+    /* Get pending-sync information */
+    pendingsyncs = shm_toc_lookup(toc, PARALLEL_KEY_PENDING_SYNCS, true);
+    if (pendingsyncs)
+        DeserializePendingSyncs(pendingsyncs);
+
     /* Set up DestReceiver, SharedExecutorInstrumentation, and QueryDesc. */
     receiver = ExecParallelGetReceiver(seg, toc);
     instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, true);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9ee9dc8cc0..fd96b0593a 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -33,6 +33,7 @@
 #include "access/htup_details.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
+#include "access/parallel.h"
 #include "access/reloptions.h"
 #include "access/sysattr.h"
 #include "access/table.h"
@@ -1241,6 +1242,22 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
     if (insertIt)
         RelationCacheInsert(relation, true);
 
+    /*
+     * Restore new-ness flags on parallel workers.
+     *
+     * Parallel workers needs WAL-skipping information. Fill in the flag using
+     * pending sync information. It is restored at worker start.
+     *
+     * We assume that parallel worker doesn't perform transactional operations,
+     * so just set rd_firstRelfilenodeSubid to 1 for relations with new
+     * relfilenodes.
+     */
+    if (IsParallelWorker())
+    {
+        if (RelFileNodeSkippingWAL(relation->rd_node))
+            relation->rd_firstRelfilenodeSubid = 1;
+    }
+
     /* It's fully valid */
     relation->rd_isvalid = true;
 
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index bd37bf311c..37d96729c2 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -30,6 +30,8 @@ extern void RelationTruncate(Relation rel, BlockNumber nblocks);
 extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
                                 ForkNumber forkNum, char relpersistence);
 extern bool RelFileNodeSkippingWAL(RelFileNode rnode);
+extern int  SerializePendingSyncs(RelFileNode **parray);
+extern void DeserializePendingSyncs(RelFileNode *parray);
 
 /*
  * These functions used to be in storage/smgr/smgr.c, which explains the
-- 
2.18.2

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

30 марта 2020 г., 07:41:01

I think attached v41nm is ready for commit.  Would anyone like to vote against
back-patching this?  It's hard to justify lack of back-patch for a data-loss
bug, but this is atypically invasive.  (I'm repeating the question, since some
folks missed my 2020-02-18 question.)  Otherwise, I'll push this on Saturday.

On Mon, Mar 23, 2020 at 05:20:27PM +0900, Kyotaro Horiguchi wrote:
> At Sat, 21 Mar 2020 15:49:20 -0700, Noah Misch <noah@leadboat.com> wrote in 
> > The proximate cause is the RelFileNodeSkippingWAL() call that we added to
> > MarkBufferDirtyHint().  MarkBufferDirtyHint() runs in parallel workers, but
> > parallel workers have zeroes for pendingSyncHash and rd_*Subid.

> > Kyotaro, can you look through the affected code and propose a strategy for
> > good coexistence of parallel query with the WAL skipping mechanism?
> 
> Bi-directional communication between leader and workers is too-much.
> It wouldn't be acceptable to inhibit the problematic operations on
> workers such like heap-prune or btree pin removal.  If we do pending
> syncs just before worker start, it won't fix the issue.
> 
> The attached patch passes a list of pending-sync relfilenodes at
> worker start.

If you were to issue pending syncs and also cease skipping WAL for affected
relations, that would fix the issue.  Your design is better, though.  I made
two notable changes:

- The patch was issuing syncs or FPIs every time a parallel worker exited.  I
  changed it to skip most of smgrDoPendingSyncs() in parallel workers, like
  AtEOXact_RelationMap() does.

- PARALLEL_KEY_PENDING_SYNCS is most similar to PARALLEL_KEY_REINDEX_STATE and
  PARALLEL_KEY_COMBO_CID.  parallel.c, not execParallel.c, owns those.  I
  moved PARALLEL_KEY_PENDING_SYNCS to parallel.c, which also called for style
  changes in the associated storage.c functions.

Since pendingSyncHash is always NULL under XLogIsNeeded(), I also removed some
XLogIsNeeded() tests that immediately preceded !pendingSyncHash tests.

Вложения

skip-wal-v41nm.tar.gz

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

30 марта 2020 г., 08:56:11

At Sun, 29 Mar 2020 21:41:01 -0700, Noah Misch <noah@leadboat.com> wrote in 
> I think attached v41nm is ready for commit.  Would anyone like to vote against
> back-patching this?  It's hard to justify lack of back-patch for a data-loss
> bug, but this is atypically invasive.  (I'm repeating the question, since some
> folks missed my 2020-02-18 question.)  Otherwise, I'll push this on Saturday.
> 
> On Mon, Mar 23, 2020 at 05:20:27PM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 21 Mar 2020 15:49:20 -0700, Noah Misch <noah@leadboat.com> wrote in 
> > > The proximate cause is the RelFileNodeSkippingWAL() call that we added to
> > > MarkBufferDirtyHint().  MarkBufferDirtyHint() runs in parallel workers, but
> > > parallel workers have zeroes for pendingSyncHash and rd_*Subid.
> 
> > > Kyotaro, can you look through the affected code and propose a strategy for
> > > good coexistence of parallel query with the WAL skipping mechanism?
> > 
> > Bi-directional communication between leader and workers is too-much.
> > It wouldn't be acceptable to inhibit the problematic operations on
> > workers such like heap-prune or btree pin removal.  If we do pending
> > syncs just before worker start, it won't fix the issue.
> > 
> > The attached patch passes a list of pending-sync relfilenodes at
> > worker start.
> 
> If you were to issue pending syncs and also cease skipping WAL for affected
> relations, that would fix the issue.  Your design is better, though.  I made
> two notable changes:
>
> - The patch was issuing syncs or FPIs every time a parallel worker exited.  I
>   changed it to skip most of smgrDoPendingSyncs() in parallel workers, like
>   AtEOXact_RelationMap() does.

Exactly. Thank you for fixing it.

> - PARALLEL_KEY_PENDING_SYNCS is most similar to PARALLEL_KEY_REINDEX_STATE and
>   PARALLEL_KEY_COMBO_CID.  parallel.c, not execParallel.c, owns those.  I
>   moved PARALLEL_KEY_PENDING_SYNCS to parallel.c, which also called for style
>   changes in the associated storage.c functions.

That sounds better.

Moving the responsibility of creating pending syncs array reduces
copy. RestorePendingSyncs (And AddPendingSync()) looks better.

> Since pendingSyncHash is always NULL under XLogIsNeeded(), I also removed some
> XLogIsNeeded() tests that immediately preceded !pendingSyncHash tests.

Sounds reasonable. In AddPendingSync, don't we put
Assert(!XLogIsNeeded()) instead of "Assert(pendingSyncHash == NULL)"?
The former guarantees the relationship between XLogIsNeeded() and
pendingSyncHash, and the existing latter assertion looks redundant as
it is placed just after "if (pendingSyncHash)".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

30 марта 2020 г., 09:08:27

On Mon, Mar 30, 2020 at 02:56:11PM +0900, Kyotaro Horiguchi wrote:
> At Sun, 29 Mar 2020 21:41:01 -0700, Noah Misch <noah@leadboat.com> wrote in 
> > Since pendingSyncHash is always NULL under XLogIsNeeded(), I also removed some
> > XLogIsNeeded() tests that immediately preceded !pendingSyncHash tests.
> 
> Sounds reasonable. In AddPendingSync, don't we put
> Assert(!XLogIsNeeded()) instead of "Assert(pendingSyncHash == NULL)"?
> The former guarantees the relationship between XLogIsNeeded() and
> pendingSyncHash, and the existing latter assertion looks redundant as
> it is placed just after "if (pendingSyncHash)".

The "Assert(pendingSyncHash == NULL)" is indeed useless; I will remove it.  I
am not inclined to replace it with Assert(!XLogIsNeeded()).  This static
function is not likely to get more callers, so the chance of accidentally
calling it under XLogIsNeeded() is too low.

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

30 марта 2020 г., 09:22:32

At Sun, 29 Mar 2020 23:08:27 -0700, Noah Misch <noah@leadboat.com> wrote in 
> On Mon, Mar 30, 2020 at 02:56:11PM +0900, Kyotaro Horiguchi wrote:
> > At Sun, 29 Mar 2020 21:41:01 -0700, Noah Misch <noah@leadboat.com> wrote in 
> > > Since pendingSyncHash is always NULL under XLogIsNeeded(), I also removed some
> > > XLogIsNeeded() tests that immediately preceded !pendingSyncHash tests.
> > 
> > Sounds reasonable. In AddPendingSync, don't we put
> > Assert(!XLogIsNeeded()) instead of "Assert(pendingSyncHash == NULL)"?
> > The former guarantees the relationship between XLogIsNeeded() and
> > pendingSyncHash, and the existing latter assertion looks redundant as
> > it is placed just after "if (pendingSyncHash)".
> 
> The "Assert(pendingSyncHash == NULL)" is indeed useless; I will remove it.  I
> am not inclined to replace it with Assert(!XLogIsNeeded()).  This static
> function is not likely to get more callers, so the chance of accidentally
> calling it under XLogIsNeeded() is too low.

Agreed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Michael Paquier

Дата:

30 марта 2020 г., 10:43:00

On Sun, Mar 29, 2020 at 09:41:01PM -0700, Noah Misch wrote:
> I think attached v41nm is ready for commit.  Would anyone like to vote against
> back-patching this?  It's hard to justify lack of back-patch for a data-loss
> bug, but this is atypically invasive.  (I'm repeating the question, since some
> folks missed my 2020-02-18 question.)  Otherwise, I'll push this on Saturday.

The invasiveness of the patch is a concern.  Have you considered a
different strategy?  For example, we are soon going to be in beta for
13, so you could consider committing the patch only on HEAD first.
If there are issues to take care of, you can then leverage the beta
testing to address any issues found.  Finally, once some dust has
settled on the concept and we have gained enough confidence, we could
consider a back-patch.  In short, my point is just that even if this
stuff is discussed for years, I see no urgency in back-patching per
the lack of complains we have in -bugs or such.
--
Michael

Вложения

signature.asc

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

31 марта 2020 г., 09:28:54

On Mon, Mar 30, 2020 at 04:43:00PM +0900, Michael Paquier wrote:
> On Sun, Mar 29, 2020 at 09:41:01PM -0700, Noah Misch wrote:
> > I think attached v41nm is ready for commit.  Would anyone like to vote against
> > back-patching this?  It's hard to justify lack of back-patch for a data-loss
> > bug, but this is atypically invasive.  (I'm repeating the question, since some
> > folks missed my 2020-02-18 question.)  Otherwise, I'll push this on Saturday.
> 
> The invasiveness of the patch is a concern.  Have you considered a
> different strategy?  For example, we are soon going to be in beta for
> 13, so you could consider committing the patch only on HEAD first.
> If there are issues to take care of, you can then leverage the beta
> testing to address any issues found.  Finally, once some dust has
> settled on the concept and we have gained enough confidence, we could
> consider a back-patch.

No.  Does anyone favor this proposal more than back-patching normally?

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Andres Freund

Дата:

31 марта 2020 г., 09:37:57

Hi,

On 2020-03-30 23:28:54 -0700, Noah Misch wrote:
> On Mon, Mar 30, 2020 at 04:43:00PM +0900, Michael Paquier wrote:
> > On Sun, Mar 29, 2020 at 09:41:01PM -0700, Noah Misch wrote:
> > > I think attached v41nm is ready for commit.  Would anyone like to vote against
> > > back-patching this?  It's hard to justify lack of back-patch for a data-loss
> > > bug, but this is atypically invasive.  (I'm repeating the question, since some
> > > folks missed my 2020-02-18 question.)  Otherwise, I'll push this on Saturday.
> > 
> > The invasiveness of the patch is a concern.  Have you considered a
> > different strategy?  For example, we are soon going to be in beta for
> > 13, so you could consider committing the patch only on HEAD first.
> > If there are issues to take care of, you can then leverage the beta
> > testing to address any issues found.  Finally, once some dust has
> > settled on the concept and we have gained enough confidence, we could
> > consider a back-patch.
> 
> No.  Does anyone favor this proposal more than back-patching normally?

I have not reviewed the patch, so I don't have a good feeling for its
riskiness. But it does sound fairly invasive. Given that we've lived
with this issue for many years by now, and that the rate of incidents
seems to have been fairly low, I think living with the issue for a bit
longer to gain confidence might be a good choice.  But I'd not push back
if you, being much more informed, think the risk/reward balance favors
immediate backpatching.

Greetings,

Andres Freund

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

02 апреля 2020 г., 06:51:29

On Mon, Mar 30, 2020 at 11:37:57PM -0700, Andres Freund wrote:
> On 2020-03-30 23:28:54 -0700, Noah Misch wrote:
> > On Mon, Mar 30, 2020 at 04:43:00PM +0900, Michael Paquier wrote:
> > > On Sun, Mar 29, 2020 at 09:41:01PM -0700, Noah Misch wrote:
> > > > I think attached v41nm is ready for commit.  Would anyone like to vote against
> > > > back-patching this?  It's hard to justify lack of back-patch for a data-loss
> > > > bug, but this is atypically invasive.  (I'm repeating the question, since some
> > > > folks missed my 2020-02-18 question.)  Otherwise, I'll push this on Saturday.
> > > 
> > > The invasiveness of the patch is a concern.  Have you considered a
> > > different strategy?  For example, we are soon going to be in beta for
> > > 13, so you could consider committing the patch only on HEAD first.
> > > If there are issues to take care of, you can then leverage the beta
> > > testing to address any issues found.  Finally, once some dust has
> > > settled on the concept and we have gained enough confidence, we could
> > > consider a back-patch.
> > 
> > No.  Does anyone favor this proposal more than back-patching normally?
> 
> I have not reviewed the patch, so I don't have a good feeling for its
> riskiness. But it does sound fairly invasive. Given that we've lived
> with this issue for many years by now, and that the rate of incidents
> seems to have been fairly low, I think living with the issue for a bit
> longer to gain confidence might be a good choice.  But I'd not push back
> if you, being much more informed, think the risk/reward balance favors
> immediate backpatching.

I've translated the non-vote comments into estimated votes of -0.3, -0.6,
-0.4, +0.5, and -0.3.  Hence, I revoke the plan to back-patch.

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Robert Haas

Дата:

02 апреля 2020 г., 18:24:36

On Wed, Apr 1, 2020 at 11:51 PM Noah Misch <noah@leadboat.com> wrote:
> I've translated the non-vote comments into estimated votes of -0.3, -0.6,
> -0.4, +0.5, and -0.3.  Hence, I revoke the plan to back-patch.

FWIW, I also think that it would be better not to back-patch. The risk
of back-patching is that this will break things, whereas the risk of
not back-patching is that we will harm people who are affected by this
bug for a longer period of time than would otherwise be the case.
Because this patch is complex, the risk of breaking things seems
higher than normal. On the other hand, the number of users adversely
affected by the bug appears to be relatively low. Taken together,
these factors persuade me that we should not back-patch at this time.

It is possible that in the future things may look different. In the
happy event that this patch causes no more problems following commit,
while at the same time we have more complaints about the underlying
problem, we can make a decision to back-patch at a later time. This
brings me to another point: because this patch changes the WAL format,
a straight revert will be impossible once a release has occurred.
Therefore, if we hold off on back-patching for now and later decide
that we erred, we can proceed at that time and it will probably not be
much harder than it would be to do it now. On the other hand, if we
decide to back-patch now and later decide that we have erred, we will
have additional engineering work to do to cater to people who have
already installed the version containing the back-patched fix and now
need to upgrade again. Perhaps the WAL format changes are simple
enough that this isn't likely to be a huge issue even if it happens,
but it does seem better to avoid the chance that it might. A further
factor is that releases which break WAL compatibility are undesirable,
and should only be undertaken when necessary.

Last but not least, I would like to join with others in expressing my
thanks to you for your hard work on this problem. While the process of
developing a fix has not been without bumps, few people would have had
the time, patience, diligence, and skill to take this effort as far as
you have. Kyotaro Horiguchi and others likewise deserve credit for all
of the many hours that they have put into this work. The entire
PostgreSQL community owes all of you a debt of gratitude, and you have
my thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Tom Lane

Дата:

05 апреля 2020 г., 01:24:34

Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Apr 1, 2020 at 11:51 PM Noah Misch <noah@leadboat.com> wrote:
>> I've translated the non-vote comments into estimated votes of -0.3, -0.6,
>> -0.4, +0.5, and -0.3.  Hence, I revoke the plan to back-patch.

> FWIW, I also think that it would be better not to back-patch.

FWIW, I also concur with not back-patching; the risk/reward ratio
does not look favorable.  Maybe later.

> Last but not least, I would like to join with others in expressing my
> thanks to you for your hard work on this problem.

+1 on that, too.

Shouldn't the CF entry get closed?

            regards, tom lane

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

05 апреля 2020 г., 01:32:12

On Sat, Apr 04, 2020 at 06:24:34PM -0400, Tom Lane wrote:
> Shouldn't the CF entry get closed?

Once the buildfarm is clean for a day, sure.  The buildfarm has already
revealed a missing perl2host call.

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Kyotaro Horiguchi

Дата:

06 апреля 2020 г., 03:46:31

At Sat, 4 Apr 2020 15:32:12 -0700, Noah Misch <noah@leadboat.com> wrote in 
> On Sat, Apr 04, 2020 at 06:24:34PM -0400, Tom Lane wrote:
> > Shouldn't the CF entry get closed?
> 
> Once the buildfarm is clean for a day, sure.  The buildfarm has already
> revealed a missing perl2host call.

Thank you for (re-) committing this and the following fix. I hope this
doesn't bring in another failure.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] WAL logging problem in 9.4.3?

От

Noah Misch

Дата:

06 апреля 2020 г., 10:04:25

On Mon, Apr 06, 2020 at 09:46:31AM +0900, Kyotaro Horiguchi wrote:
> At Sat, 4 Apr 2020 15:32:12 -0700, Noah Misch <noah@leadboat.com> wrote in 
> > On Sat, Apr 04, 2020 at 06:24:34PM -0400, Tom Lane wrote:
> > > Shouldn't the CF entry get closed?
> > 
> > Once the buildfarm is clean for a day, sure.  The buildfarm has already
> > revealed a missing perl2host call.
> 
> Thank you for (re-) committing this and the following fix. I hope this
> doesn't bring in another failure.

I have closed the CF entry.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-04-05%2000%3A00%3A27
happened, but I doubt it is unrelated.  A wait_for_catchup that usually takes
<1s instead timed out after 397s.  I can't reproduce it.  In the past, another
animal on the same machine had the same failure:

  sysname  │      snapshot       │ branch │                                             bfurl

───────────┼─────────────────────┼────────┼───────────────────────────────────────────────────────────────────────────────────────────────
 bowerbird │ 2019-11-17 15:22:42 │ HEAD   │
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2019-11-17%2015%3A22%3A42
 bowerbird │ 2020-01-10 17:30:49 │ HEAD   │
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2020-01-10%2017%3A30%3A49

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: WAL logging problem in 9.4.3?

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения