Thread: [HACKERS] Broken hint bits (freeze)


[HACKERS] Broken hint bits (freeze)

From:
Dmitriy Sarafannikov
Date:
Hi hackers,

We have some problems in production with hint bits and frozen tuples.
More and more of the following errors began to appear on the master after a switchover:
ERROR:  58P01: could not access status of transaction 1952523525
DETAIL:  Could not open file "pg_clog/0746": No such file or directory.
LOCATION:  SlruReportIOError, slru.c:896

We investigated the problem with pageinspect and found the tuples that are the cause:

xdb311g(master)=# select * from mytable where ctid = '(4,21)';
ERROR:  58P01: could not access status of transaction 1951521353
DETAIL:  Could not open file "pg_clog/0745": No such file or directory.
LOCATION:  SlruReportIOError, slru.c:896

But the same query executes successfully on the replica.

We found a difference in the hint bits between master and replica:

xdb311g(master)=# SELECT t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32)
FROM heap_page_items(get_raw_page('mytable', 4)) WHERE lp = 21;
-[ RECORD 1 ]------------------------------
t_xmin   | 1951521353
?column? | 00000000000000000000000000000000

old master, now replica:
xdb311e(replica)=# SELECT t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32)
FROM heap_page_items(get_raw_page('mytable', 4)) WHERE lp = 21;
-[ RECORD 1 ]------------------------------
t_xmin   | 1951521353
?column? | 00000000000000000000001100000000

X'0300' = HEAP_XMIN_FROZEN according to:

#define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed */
#define HEAP_XMIN_INVALID       0x0200  /* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN        (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
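As a quick sanity check of the mask arithmetic (an illustrative Python sketch, not part of the original report; the constants are copied from the defines above):

```python
# Hint-bit constants, copied from the htup_details.h defines quoted above.
HEAP_XMIN_COMMITTED = 0x0100  # t_xmin committed
HEAP_XMIN_INVALID   = 0x0200  # t_xmin invalid/aborted
HEAP_XMIN_FROZEN    = HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID

def xmin_frozen(t_infomask: int) -> bool:
    """A tuple is frozen (9.4+ on-disk format) when BOTH bits are set."""
    return (t_infomask & HEAP_XMIN_FROZEN) == HEAP_XMIN_FROZEN

assert HEAP_XMIN_FROZEN == 0x0300      # matches the X'0300' mask in the queries
assert xmin_frozen(0b1100000000)       # replica's tuple: both bits set
assert not xmin_frozen(0b0000000000)   # master's tuple: mask came back zero
```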

xdb311g(master)=# select relfrozenxid from pg_class where relname = 'mytable';
relfrozenxid
--------------
2266835605
(1 row)

This tuple must be frozen, but neither HEAP_XMIN_COMMITTED nor HEAP_XMIN_INVALID is set on the master.

Another interesting thing is that the LSN of this page differs between master and replica:
xdb311g(master)=# select lsn from page_header(get_raw_page('mytable',4));
      lsn
---------------
8092/6A26DD08
(1 row)

xdb311e(replica)=# select lsn from page_header(get_raw_page('mytable',4));
      lsn
---------------
838D/C4A0D280
(1 row)

And the LSN on the replica is greater than the LSN on the master (838D/C4A0D280 > 8092/6A26DD08).
How can this be possible?
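The ordering can be checked mechanically: an LSN such as 838D/C4A0D280 is a 64-bit WAL position written as two hex halves. A small helper (illustrative, not from the thread) makes the comparison explicit:

```python
def lsn_to_int(lsn: str) -> int:
    """Convert 'hhhhhhhh/llllllll' (pg_lsn text form) to a 64-bit position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

master_lsn  = lsn_to_int("8092/6A26DD08")   # page LSN on the master
replica_lsn = lsn_to_int("838D/C4A0D280")   # page LSN on the replica

# The replica's page header is ahead of the master's, which should be
# impossible if the replica only replays WAL generated by this master.
assert replica_lsn > master_lsn
```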

We wrote a query which returns the ctid of tuples that must be frozen but are not actually frozen.

xdb311e(replica)=# select t_ctid
from generate_series(0, pg_relation_size('mytable')/8192 - 1) s(i)
left join lateral heap_page_items(get_raw_page('mytable', s.i::int)) on true
where t_xmin::text::bigint < (select relfrozenxid::text::bigint from pg_class where relname = 'mytable')
and t_infomask & X'0300'::int < 1;
t_ctid
-----------
(400,16)
(2837,71)
(2837,72)
(2837,73)
(2837,75)
(2837,76)
(3042,40)
(4750,80)
(4750,81)
(5214,60)
(5214,65)
(6812,31)
(6912,63)
(7329,8)
(7374,26)
(7374,27)
(16 rows)
The same query on the master returns 317 rows.
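Distilled, the query's predicate flags a tuple whose xmin precedes relfrozenxid while neither hint bit is set (an illustrative Python restatement; note the real query casts xids to bigint and so ignores XID wraparound ordering):

```python
HEAP_XMIN_FROZEN = 0x0300  # HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID

def missing_freeze(t_xmin: int, t_infomask: int, relfrozenxid: int) -> bool:
    # Mirrors "t_xmin < relfrozenxid AND t_infomask & X'0300' < 1":
    # the tuple predates the relation's freeze horizon, yet carries
    # neither HEAP_XMIN_COMMITTED nor HEAP_XMIN_INVALID.
    return t_xmin < relfrozenxid and (t_infomask & HEAP_XMIN_FROZEN) == 0

# Master's tuple at (4,21): xmin 1951521353 < relfrozenxid 2266835605, mask 0.
assert missing_freeze(1951521353, 0x0000, 2266835605)
# The replica's copy of the same tuple has the frozen bits set, so it passes.
assert not missing_freeze(1951521353, 0x0300, 2266835605)
```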

Our thoughts:
1) We think that it is related to switchover.
2) Any WAL-logged modification of this page on the master will replace this page on the replica due to full page writes. And all replicas will have broken hint bits too. It's dangerous.

Where to dig further?

RHEL6, PostgreSQL 9.6.3, wal_log_hints=off, full_page_writes=on, fsync=on, checksums disabled.
We don't think these are hardware-related problems, because these databases started on 9.4
and survived two upgrades with pg_upgrade, and no hardware problems were detected.
The problem appears not only in this shard.
The size of each shard is around 5TB, so we can't provide the data.

Regards
Dmitriy Sarafannikov


Re: [HACKERS] Broken hint bits (freeze)

From:
Dmitriy Sarafannikov
Date:
We found that this problem also appears on shards with checksums enabled.
This shard is on timeline 1, which means there was no switchover after the upgrade to 9.6.

xdb11f(master)=# select pg_current_xlog_location(), pg_xlogfile_name(pg_current_xlog_location());
 pg_current_xlog_location |     pg_xlogfile_name
--------------------------+--------------------------
 30BA/5966AD38            | 00000001000030BA00000059
(1 row)

xdb11f(master)=# select * from page_header(get_raw_page('mytable', 1787));
      lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1F43/8C432C60 |    -3337 |     5 |   256 |   304 |    8192 |     8192 |       4 |         0
(1 row)

xdb11h(replica)=# select * from page_header(get_raw_page('mytable', 1787));
      lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1B28/45819C28 |   -17617 |     5 |   256 |   304 |    8192 |     8192 |       4 |         0
(1 row)

xdb11e(replica)=# select * from page_header(get_raw_page('mytable', 1787));
      lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1B28/45819C28 |   -17617 |     5 |   256 |   304 |    8192 |     8192 |       4 |         0
(1 row)

The master has the newer page version with the freeze bits set.

xdb11f(master)=# select t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32)
from heap_page_items(get_raw_page('mytable',1787)) where lp = 42;
  t_xmin   |             ?column?
-----------+----------------------------------
 516651778 | 00000000000000000000001100000000
(1 row)

xdb11h(replica)=# select t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32)
from heap_page_items(get_raw_page('mytable',1787)) where lp = 42;
  t_xmin   |             ?column?
-----------+----------------------------------
 516651778 | 00000000000000000000000000000000
(1 row)

xdb11e(replica)=# select t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32)
from heap_page_items(get_raw_page('mytable',1787)) where lp = 42;
  t_xmin   |             ?column?
-----------+----------------------------------
 516651778 | 00000000000000000000000000000000
(1 row)

It seems like the replica did not replay the corresponding WAL records.
Any thoughts?

Regards,
Dmitriy Sarafannikov


Re: [HACKERS] Broken hint bits (freeze)

From:
Robert Haas
Date:
On Wed, May 24, 2017 at 7:27 AM, Dmitriy Sarafannikov
<dsarafannikov@yandex.ru> wrote:
> It seems like the replica did not replay the corresponding WAL records.
> Any thoughts?

heap_xlog_freeze_page() is a pretty simple function.  It's not
impossible that it could have a bug that causes it to incorrectly skip
records, but it's not clear why that wouldn't affect many other replay
routines equally, since the pattern of using the return value of
XLogReadBufferForRedo() to decide what to do is widespread.

Can you prove that other WAL records generated around the same time as
the freeze record *were* replayed on the standby?  If so, that proves
that this isn't just a case of the WAL never reaching the standby.
Can you look at the segment that contains the relevant freeze record
with pg_xlogdump?  Maybe that record is messed up somehow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Broken hint bits (freeze)

From:
Vladimir Borodin
Date:

On 24 May 2017, at 15:44, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, May 24, 2017 at 7:27 AM, Dmitriy Sarafannikov
<dsarafannikov@yandex.ru> wrote:
It seems like the replica did not replay the corresponding WAL records.
Any thoughts?

heap_xlog_freeze_page() is a pretty simple function.  It's not
impossible that it could have a bug that causes it to incorrectly skip
records, but it's not clear why that wouldn't affect many other replay
routines equally, since the pattern of using the return value of
XLogReadBufferForRedo() to decide what to do is widespread.

Can you prove that other WAL records generated around the same time as
the freeze record *were* replayed on the standby?  If so, that proves
that this isn't just a case of the WAL never reaching the standby.
Can you look at the segment that contains the relevant freeze record
with pg_xlogdump?  Maybe that record is messed up somehow.

Not yet. Most such cases are long before our recovery window, so the corresponding WALs have been deleted. We have already tuned the retention policy and are now looking for a fresh case.


--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
May the force be with you…

Re: [HACKERS] Broken hint bits (freeze)

From:
Amit Kapila
Date:
On Tue, May 23, 2017 at 10:50 PM, Dmitriy Sarafannikov
<dsarafannikov@yandex.ru> wrote:
> Hi hackers,
>
> We have some problems on our production with hint bits and frozen tuples.
> More and more following errors began to appear on master after switchover:
> ERROR:  58P01: could not access status of transaction 1952523525
> DETAIL:  Could not open file "pg_clog/0746": No such file or directory.
> LOCATION:  SlruReportIOError, slru.c:896
>
> We investigated the problem with pageinspect and found the tuples that are the cause:
>
> xdb311g(master)=# select * from mytable where ctid = '(4,21)';
> ERROR:  58P01: could not access status of transaction 1951521353
> DETAIL:  Could not open file "pg_clog/0745": No such file or directory.
> LOCATION:  SlruReportIOError, slru.c:896
>
> But the same query successfully executed on replica.
>
> We found some difference in hint bits between master and replica:
>
> xdb311g(master)=# SELECT t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32) FROM
> heap_page_items(get_raw_page('mytable',4)) where lp=21;
> -[ RECORD 1 ]------------------------------
> t_xmin   | 1951521353
> ?column? | 00000000000000000000000000000000
>
> old master, now replica:
> xdb311e(replica)=# SELECT t_xmin, t_infomask::bit(32) & X'0300'::int::bit(32) FROM
> heap_page_items(get_raw_page('mytable',4)) where lp=21;
> -[ RECORD 1 ]------------------------------
> t_xmin   | 1951521353
> ?column? | 00000000000000000000001100000000
>
> X'0300' = HEAP_XMIN_FROZEN according to
>
> #define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed */
> #define HEAP_XMIN_INVALID       0x0200  /* t_xmin invalid/aborted */
> #define HEAP_XMIN_FROZEN        (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
>
> xdb311g(master)=# select relfrozenxid from pg_class where relname = 'mytable';
> relfrozenxid
> --------------
> 2266835605
> (1 row)
>
> This tuple must be frozen but there are no set bits HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID on master
>
> Another interesting thing that LSN of this page on master and replica are not the same:
> xdb311g(master)=# select lsn from page_header(get_raw_page(‘mytable',4));
>    lsn
> ---------------
> 8092/6A26DD08
> (1 row)
>
> xdb311e(replica)=# select lsn from page_header(get_raw_page(‘mytable',4));
>    lsn
> ---------------
> 838D/C4A0D280
> (1 row)
>
> And the LSN on the replica is greater than the LSN on the master (838D/C4A0D280 > 8092/6A26DD08)
> How can this be possible?
>

Yeah, I think this is quite suspicious.  This seems to indicate that
not all WAL records are replicated before the switchover.  What is the
value of "synchronous_commit" you are using?  I think you somehow need
to ensure before switchover that all the WAL is replicated to ensure
this is not a setup problem.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Broken hint bits (freeze)

From:
Michael Paquier
Date:
On Fri, May 26, 2017 at 2:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah, I think this is quite suspicious.  This seems to indicate that
> not all WAL records are replicated before the switchover.  What is the
> value of "synchronous_commit" you are using?  I think you somehow need
> to ensure before switchover that all the WAL is replicated to ensure
> this is not a setup problem.

It is so easy to corrupt a server through an incorrect base backup
flow or an incorrect switchover that it would be good to first
understand how you are doing your switchover. Any corruption appearing
after a promotion, a failover or a switchover may be the tip of the
iceberg of what's on the data pages, and you may just be seeing one
symptom among other problems. In particular, did you kill the master
in any violent way after promoting the standby? Has the former master
been able to perform a clean shutdown checkpoint at least once, and
has it been rewound?
-- 
Michael



Re: [HACKERS] Broken hint bits (freeze)

From:
Vladimir Borodin
Date:

On 26 May 2017, at 23:04, Michael Paquier <michael.paquier@gmail.com> wrote:

On Fri, May 26, 2017 at 2:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Yeah, I think this is quite suspicious.  This seems to indicate that
not all WAL records are replicated before the switchover.  What is the
value of "synchronous_commit" you are using?  I think you somehow need
to ensure before switchover that all the WAL is replicated to ensure
this is not a setup problem.

It is so easy to corrupt a server because of an incorrect base backup
flow or an incorrect switchover that it would be good first to
understand how you are doing your switchover. Any corruption happening
after a promotion, a failover or a switchover may be the tip of the
iceberg of what's on the data pages, and you may just see one symptom
among other problems. Particularly, did you kill the master in any
violent way after promoting the standby? Has the former master been
able to perform at least once a clean shutdown checkpoint and has it
been rewound?

First we cleanly stop the former master, and then we additionally check that all data has been received by the standby to be promoted. The old master is brought back by simply generating recovery.conf and starting it; all other standbys are handled the same way. The WAL history on all hosts and in the archive remains linear, so no pg_rewind is needed. This procedure has been well tested on 9.3 (when the ability to do such a switchover without restoring standbys from backup appeared), automated, and has not changed for all these years.

Actually, we have already found that the LSNs of all corrupted tuples are somewhere near the time of the upgrade from 9.5 to 9.6. There is no hard evidence yet, but it seems mostly related to the upgrade procedure. We are now extracting 9.5 and 9.6 backups of a database where we currently have corrupted pages to check this theory. But it is still not obvious whether it could be a pg_upgrade bug or a mistake in the way we did the upgrade.

--
Michael


--
May the force be with you…

Re: [HACKERS] Broken hint bits (freeze)

From:
Vladimir Borodin
Date:

On 26 May 2017, at 21:39, Amit Kapila <amit.kapila16@gmail.com> wrote:

And the LSN on the replica is greater than the LSN on the master (838D/C4A0D280 > 8092/6A26DD08)
How can this be possible?


Yeah, I think this is quite suspicious.  This seems to indicate that
not all WAL records are replicated before the switchover.  What is the
value of "synchronous_commit" you are using?

synchronous_commit = on.

 I think you somehow need
to ensure before switchover that all the WAL is replicated to ensure
this is not a setup problem.

Well, actually a clean shutdown of the master with exit code 0 from `pg_ctl stop -m fast` guarantees that all WAL has been replicated to the standby. But just in case we also check that "Latest checkpoint's REDO location" from the control file on the old master after shutdown is less than pg_last_xlog_replay_location() on the standby to be promoted.

And if something went wrong in the above logic, postgres would not let you attach the old master as a standby of the new master. So it is most probably not a setup problem.

--
May the force be with you…

Re: [HACKERS] Broken hint bits (freeze)

From:
Andres Freund
Date:
On 2017-05-27 19:48:24 +0300, Vladimir Borodin wrote:
> Well, actually clean shutdown of master with exit code 0 from `pg_ctl
> stop -m fast` guarantees that all WAL has been replicated to standby.

It does not.  It makes it likely, but the connection to the standby
could be not up just then, you could run into walsender timeout, and a
bunch of other scenarios.


> But just in case we also check that "Latest checkpoint's REDO
> location" from control file on old master after shutdown is less than
> pg_last_xlog_replay_location() on standby to be promoted.

The *redo* location? Or the checkpoint location itself?  Because the
latter is what needs to be *equal* to the replay location, not less
than it.  Normally there won't be other records in between, but that's
not guaranteed.


> And if something would go wrong in above logic, postgres will not let you attach old master as a standby of new master. So it is highly probable not a setup problem.

There's no such guarantee.  There's a bunch of checks that'll somewhat
likely trigger, but nothing more than that.

- Andres



Re: [HACKERS] Broken hint bits (freeze)

From:
Vladimir Borodin
Date:

On 27 May 2017, at 19:56, Andres Freund <andres@anarazel.de> wrote:

On 2017-05-27 19:48:24 +0300, Vladimir Borodin wrote:
Well, actually clean shutdown of master with exit code 0 from `pg_ctl
stop -m fast` guarantees that all WAL has been replicated to standby.

It does not.  It makes it likely, but the connection to the standby
could be not up just then, you could run into walsender timeout, and a
bunch of other scenarios.

AFAIK in this case the exit code would not be zero. Even if the archiver has not been able to archive all WALs before the shutdown timeout, the exit code will not be zero.



But just in case we also check that "Latest checkpoint's REDO
location" from control file on old master after shutdown is less than
pg_last_xlog_replay_location() on standby to be promoted.

The *redo* location? Or the checkpoint location itself?  Because the
latter is what needs to be *equal* to the replay location, not less
than it.  Normally there won't be other records in between, but that's
not guaranteed.

I've asked about this some time ago [1]. In that case the checkpoint location and the redo location were equal after shutdown, and the last replay location on the standby was higher by 104 bytes (the size of the shutdown checkpoint record).

But we check exactly the redo location. Should we change it to check the checkpoint location instead?
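Andres' point can be restated numerically: after a clean shutdown the REDO pointer equals the checkpoint record's start, so the standby's replay location must reach the *end* of the shutdown checkpoint record, not merely exceed REDO. A hedged sketch (the helper and LSN values are hypothetical; 104 bytes is the shutdown checkpoint record size mentioned above):

```python
def lsn_to_int(lsn: str) -> int:
    """Convert 'hi/lo' pg_lsn text form to a 64-bit WAL position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

SHUTDOWN_CKPT_LEN = 104  # size of a shutdown checkpoint record, per the thread

def standby_caught_up(checkpoint_lsn: int, standby_replay_lsn: int) -> bool:
    # The stricter check: the standby has replayed the shutdown checkpoint
    # record itself, i.e. its replay position reaches the record's end.
    return standby_replay_lsn >= checkpoint_lsn + SHUTDOWN_CKPT_LEN

ckpt = lsn_to_int("1F43/8C432C60")          # hypothetical checkpoint location
assert standby_caught_up(ckpt, ckpt + 104)  # replay got past the whole record
assert not standby_caught_up(ckpt, ckpt)    # replay stopped before the record
```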




And if something would go wrong in above logic, postgres will not let you attach old master as a standby of new master. So it is highly probable not a setup problem.

There's no such guarantee.  There's a bunch of checks that'll somewhat
likely trigger, but nothing more than that.

- Andres


--
May the force be with you…

Re: [HACKERS] Broken hint bits (freeze)

From:
Michael Paquier
Date:
On Sat, May 27, 2017 at 12:56 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-05-27 19:48:24 +0300, Vladimir Borodin wrote:
>> Well, actually clean shutdown of master with exit code 0 from `pg_ctl
>> stop -m fast` guarantees that all WAL has been replicated to standby.
>
> It does not.  It makes it likely, but the connection to the standby
> could be not up just then, you could run into walsender timeout, and a
> bunch of other scenarios.

Amen.

>> And if something would go wrong in above logic, postgres will not let you attach old master as a standby of new master. So it is highly probable not a setup problem.
>
> There's no such guarantee.  There's a bunch of checks that'll somewhat
> likely trigger, but nothing more than that.

Yes. Take for example the case where the host with a primary is
plugged off, and another host with a standby is promoted. If at the
next restart you directly add a recovery.conf to the old primary and
attempt to use it as a standby of the new primary, it may be able to
connect and begin replication. That will result in a corrupted
standby.
-- 
Michael



Re: [HACKERS] Broken hint bits (freeze)

From:
Amit Kapila
Date:
On Sat, May 27, 2017 at 10:18 PM, Vladimir Borodin <root@simply.name> wrote:
>
> On 26 May 2017, at 21:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>  I think you somehow need
> to ensure before switchover that all the WAL is replicated to ensure
> this is not a setup problem.
>
>
> Well, actually clean shutdown of master with exit code 0 from `pg_ctl stop
> -m fast` guarantees that all WAL has been replicated to standby.
>

I don't see any such guarantee in the code or docs.  Can you explain
what makes you think that for 'fast' mode exit code 0 guarantees that
all the WAL has been replicated?

> But just in
> case we also check that "Latest checkpoint's REDO location" from control
> file on old master after shutdown is less than
> pg_last_xlog_replay_location() on standby to be promoted.
>
> And if something would go wrong in above logic, postgres will not let you
> attach old master as a standby of new master.
>

I think it will be possible to attach the old master as a standby of
the new master, as new operations on the new master can increase its
LSN position to a value greater than what the old master has.  Your
statement would make sense if you ensured that no new operations were
allowed on the new master until the old master has attached to it as a
standby.

> So it is highly probable not a
> setup problem.
>

Yeah, it is quite possible that your setup is perfectly fine and there
is actually some code bug causing the problem you are facing; however,
it is better to rule out all the possibilities related to a wrong
setup.



--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Broken hint bits (freeze)

From:
Dmitriy Sarafannikov
Date:
Thanks, all.

We found the source of the problem. It was a mistake in our upgrade to 9.6.

We upgraded the replicas with rsync as described in the documentation:
rsync --verbose --relative --archive --hard-links --size-only old_pgdata new_pgdata remote_dir

We must provide 100% read-only availability of our shard (master + 2 replicas),
so we can't stop the master and both replicas, upgrade them one by one, and start them.
We did it as follows:
close master from load, stop master, upgrade it, stop 1st replica, upgrade it, start 1st replica,
stop 2nd replica, upgrade it, start 2nd replica, start master, open master.
But upgraded replicas died under load without statistics, so we decided to perform
an analyze on the master before upgrading the replicas; that way the statistics would be
copied to the replicas by rsync. The upgrade algorithm became as follows:
close master, stop master, close master from replicas (iptables), upgrade master,
start master, perform analyze, stop master, stop 1st replica, upgrade 1st replica,
start 1st replica, stop 2nd replica, upgrade 2nd replica, start 2nd replica,
start master, open master.

If autovacuum starts vacuuming relations while we are performing the analyze, the WAL
records generated by it will not be replayed on the replicas, because the next step is
stopping the master with a checkpoint: a new redo location LSN (newer than these WAL
records) will appear in the pg_control file, which is then copied by rsync to the replicas.

If it was a plain vacuum, we would most likely not see the consequences, because it marks
tuples as deleted and some of the next new tuples will be placed there; due to FPW
the replicas will receive a correct page, identical to the master's.
But if it was a vacuum to prevent wraparound, we will see a situation like ours: tuples on
the master will be frozen, but on the replicas not. And it will not change unless somebody
updates a tuple on this page.
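The failure mode hinges on rsync --size-only: a heap page rewritten by a freeze has the same 8 KB size but different bytes, so rsync skips it. A toy illustration (not rsync's actual code) of why the stale replica page survives the copy:

```python
def size_only_would_copy(src: bytes, dst: bytes) -> bool:
    # rsync --size-only transfers a file only when its length differs;
    # identical-length files are skipped no matter how the bytes differ.
    return len(src) != len(dst)

PAGE_SIZE = 8192
master_page  = bytes([0x03]) * PAGE_SIZE  # frozen on master: hint bits set, new LSN
replica_page = bytes([0x00]) * PAGE_SIZE  # stale pre-freeze image on the replica

assert not size_only_would_copy(master_page, replica_page)  # silently skipped
assert master_page != replica_page                          # yet the contents differ
```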

It's dangerous because, if we perform a switchover to a replica, the "corrupted" page
will be delivered to all replicas after the next update of any tuple on this page.

We reproduced this case in our test environment and this assumption was confirmed.

Starting and stopping the master after running pg_upgrade, but before the rsync, in order
to collect statistics was a bad idea.

We know how to find such "corrupted" tuples, and we want to fix this by manually
freezing them via specially written C functions. The functions are copy-pasted
and simplified code from the vacuum functions, with a SQL interface (see attachment).
Can you look at them? Do you think it is safe to use them for fixing corrupted pages,
or is there a better way not to lose data?

Regards,
Dmitriy Sarafannikov



Attachments

Re: [HACKERS] Broken hint bits (freeze)

From:
Sergey Burladyan
Date:
Dmitriy Sarafannikov <dsarafannikov@yandex.ru> writes:

> Starting and stopping master after running pg_upgrade but before rsync to collect statistics
> was a bad idea.

But starting and stopping the master after running pg_upgrade is *required*
by the documentation:
https://www.postgresql.org/docs/9.6/static/pgupgrade.html
> f. Start and stop the new master cluster
> In the new master cluster, change wal_level to replica in the postgresql.conf file and then start and stop the cluster.

and there is no suggestion to disable autovacuum for it.

-- 
Sergey Burladyan



Re: [HACKERS] Broken hint bits (freeze)

From:
Vladimir Borodin
Date:

On 6 June 2017, at 23:30, Sergey Burladyan <eshkinkot@gmail.com> wrote:

Dmitriy Sarafannikov <dsarafannikov@yandex.ru> writes:

Starting and stopping master after running pg_upgrade but before rsync to collect statistics
was a bad idea.

But, starting and stopping master after running pg_upgrade is *required*
by documentation:
https://www.postgresql.org/docs/9.6/static/pgupgrade.html
f. Start and stop the new master cluster
In the new master cluster, change wal_level to replica in the postgresql.conf file and then start and stop the cluster.

and there is no suggestion to disable autovacuum for it.


Yep. This should probably be fixed in the documentation?


--
May the force be with you…

Re: [HACKERS] Broken hint bits (freeze)

From:
Sergey Burladyan
Date:
Vladimir Borodin <root@simply.name> writes:

> > On 6 June 2017, at 23:30, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> >
> > Dmitriy Sarafannikov <dsarafannikov@yandex.ru> writes:
> >
> >> Starting and stopping master after running pg_upgrade but before rsync to collect statistics
> >> was a bad idea.
> >
> > But, starting and stopping master after running pg_upgrade is *required*
> > by documentation:
> > https://www.postgresql.org/docs/9.6/static/pgupgrade.html
> >> f. Start and stop the new master cluster
> >> In the new master cluster, change wal_level to replica in the postgresql.conf file and then start and stop the cluster.
> >
> > and there is no suggestion to disable autovacuum for it.

> Yep. This should probably be fixed in the documentation?

I think so. There are some gaps in the pg_upgrade documentation; it says nothing about:
1. preventing heap changes by vacuum, analyze or anything else when the master is restarted after pg_upgrade but before the rsync;
2. a log-shipping-only standby cannot shut down at the same checkpoint as the master.

I tried to start a discussion about this: https://www.postgresql.org/message-id/87y3ta49zp.fsf%40seb.koffice.internal
but without luck :-)

PS: I CC'd Bruce here.

--
Sergey Burladyan



Re: [HACKERS] Broken hint bits (freeze)

From:
Amit Kapila
Date:
On Fri, Jun 2, 2017 at 4:20 PM, Dmitriy Sarafannikov
<dsarafannikov@yandex.ru> wrote:
> Thanks for all.
>
> We found the source of the problem. It was mistake in upgrade to 9.6.
>
> We upgrade replica with rsync as it is in the documentation:
> rsync --verbose --relative --archive --hard-links --size-only old_pgdata new_pgdata remote_dir
>
> We must provide 100% read-only availability of our shard (master + 2 replicas).
> So we can’t stop master and both replicas, upgrade them one by one and start them.
> We do it as follows:
> Close master from load, stop master, upgrade it, stop 1st replica, upgrade it, start 1st replica,
> stop 2nd replica, upgrade it, start 2nd replica, start master, open master.
> But upgraded replicas died under load without statistics and we decided to perform
> analyze on master before upgrading replicas. In this case statistics would be copied to replicas by rsync.
> The upgrade algorithm became as follows:
> Close master, stop master, close master from replicas (iptables), upgrade master,
> start master, perform analyze, stop master, stop 1st replica, upgrade 1st replica,
> start 1st replica, stop 2nd replica, upgrade 2nd replica, start 2nd replica,
> start master, open master.
>
> If autovacuum starts vacuuming relations while we are performing analyze, wal records
> generated by it will not be replayed on replicas, because next step is stopping
> master with checkpoint and new redo location LSN (newer than these wal records)
> will appear in pg_control file, which then will be copied by rsync to replicas.
>
> If it was simple vacuum, we most likely will not see the consequences. Because it marks
> tuples as deleted, and some of the next new tuples will be placed here, and due to FPW
> replicas will receive correct page, identical to master.
> But if it was vacuum to prevent wraparound, we will see situation like ours. Tuples on
> master will be frozen, but on replicas not. And it will not change unless somebody
> updates a tuple on this page.
>

Why didn't rsync make the copies on master and replica the same?

> It’s dangerous, because, if we perform switchover to replica, «corrupted» page
> will be delivered to all replicas after next update of any tuple from this page.
>
> We reproduced this case in our test environment and this assumption was confirmed.
>
> Starting and stopping master after running pg_upgrade but before rsync to collect statistics
> was a bad idea.
>
> We know how to find such «corrupted» tuples. And we want to fix this by manually
> freezing tuples via calling specially written C functions. Functions are «copy-pasted»
> and simplified code from vacuum functions with SQL interface (see attachment).
> Can you look at them? Do you think it is safe to use them for fixing corrupted pages
> or is there a better way not to lose data?
>

I haven't looked in detail, but it sounds like a slightly risky
proposition to manipulate tuples by writing C functions of the form you
have in your code.  I would have preferred some way to avoid this
problem by ensuring that the replicas are properly synced (complete
data of master via WAL) or by disabling autovacuum.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Broken hint bits (freeze)

From:
Dmitriy Sarafannikov
Date:
> Why didn't rsync make the copies on master and replica the same?

Because rsync was running with the --size-only flag.

> I haven't looked in detail, but it sounds slightly risky proposition
> to manipulate the tuples by writing C functions of the form you have
> in your code.  I would have preferred some way to avoid this problem
> by ensuring that replicas are properly synced (complete data of master
> via WAL) or by disabling autovacuum.

Avoiding this problem is a good way. But what should we do with the already corrupted data?
Can you explain in more detail what you mean?


Re: [HACKERS] Broken hint bits (freeze)

From:
Amit Kapila
Date:
On Thu, Jun 8, 2017 at 6:49 PM, Dmitriy Sarafannikov
<dsarafannikov@yandex.ru> wrote:
>
>> Why didn't rsync make the copies on master and replica the same?
>
> Because rsync was running with the --size-only flag.
>

IIUC the situation: the new WAL and the updated pg_control file have
been copied, but not the updated data files, due to which the WAL has
not been replayed on the replicas?  If so, why was the pg_control file
copied? Its size shouldn't have changed.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Broken hint bits (freeze)

From:
Vladimir Borodin
Date:

On 8 June 2017, at 17:03, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jun 8, 2017 at 6:49 PM, Dmitriy Sarafannikov
<dsarafannikov@yandex.ru> wrote:

Why didn't rsync make the copies on master and replica the same?

Because rsync was running with the --size-only flag.


IIUC the situation: the new WAL and the updated pg_control file have
been copied, but not the updated data files, due to which the WAL has
not been replayed on the replicas?  If so, why was the pg_control file
copied? Its size shouldn't have changed.

Because on master pg_upgrade moves $prefix/9.5/data/global/pg_control to $prefix/9.5/data/global/pg_control.old and creates new $prefix/9.6/data/global/pg_control without making hardlink. When running rsync from master to replica rsync sees $prefix/9.6/data/global/pg_control on master and checks if it is a hardlink. Since it is not a hardlink and $prefix/9.6/data/global/pg_control does not exist on replica rsync copies it. For data files the logic is different since they are hardlinks, corresponding files exist on replica and they are the same size.
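To make the --size-only failure mode concrete, here is a minimal sketch of the decision rule that flag implies (a hypothetical helper, not rsync's actual code): a destination file of the same size is skipped even when its contents differ, while a file that is absent on the destination is always transferred.

```python
def size_only_needs_copy(src_size, dst_size=None):
    """Would rsync --size-only transfer this file?  (Simplified model.)"""
    if dst_size is None:          # file absent on destination
        return True               # e.g. the new 9.6 pg_control on the replica
    return src_size != dst_size   # same size => skipped, contents never read

# A hard-linked data file: same size on master and replica, but the master's
# copy has freshly set hint bits -> NOT transferred, replica keeps stale page.
assert size_only_needs_copy(8192, 8192) is False

# The new cluster's pg_control does not exist on the replica yet -> copied.
assert size_only_needs_copy(8192) is True
```

This is exactly the asymmetry described above: pg_control is copied because it is new on the replica, while the relinked, same-size data files are silently skipped.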




--
May the force be with you…

Re: [HACKERS] Broken hint bits (freeze)

From:
Amit Kapila
Date:
On Sun, Jun 11, 2017 at 11:59 PM, Vladimir Borodin <root@simply.name> wrote:
>
> On 8 June 2017, at 17:03, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jun 8, 2017 at 6:49 PM, Dmitriy Sarafannikov
> <dsarafannikov@yandex.ru> wrote:
>
>
> Why didn't rsync made the copies on master and replica same?
>
>
> Because rsync was running with --size-only flag.
>
>
> IIUC the situation, the new WAL and updated pg_control file has been
> copied, but not updated data files due to which the WAL has not been
> replayed on replicas?  If so, why the pg_control file is copied, it's
> size shouldn't have changed?
>
>
> Because on master pg_upgrade moves $prefix/9.5/data/global/pg_control to
> $prefix/9.5/data/global/pg_control.old and creates new
> $prefix/9.6/data/global/pg_control without making hardlink. When running
> rsync from master to replica rsync sees $prefix/9.6/data/global/pg_control
> on master and checks if it is a hardlink. Since it is not a hardlink and
> $prefix/9.6/data/global/pg_control does not exist on replica rsync copies
> it. For data files the logic is different since they are hardlinks,
> corresponding files exist on replica and they are the same size.
>

Okay, in that case, I guess it is better to run ANALYZE on the master
after the upgrade is complete (including the upgrade of replicas).  If
you are worried about the performance of read-only replicas until the
ANALYZE on the master is completed, you might want to use
--analyze-in-stages of vacuumdb and/or use (-j njobs) along with it to
parallelize the operation.


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Broken hint bits (freeze)

From:
Vladimir Borodin
Date:

On 12 June 2017, at 13:19, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Jun 11, 2017 at 11:59 PM, Vladimir Borodin <root@simply.name> wrote:

On 8 June 2017, at 17:03, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jun 8, 2017 at 6:49 PM, Dmitriy Sarafannikov
<dsarafannikov@yandex.ru> wrote:


Why didn't rsync made the copies on master and replica same?


Because rsync was running with --size-only flag.


IIUC the situation, the new WAL and updated pg_control file has been
copied, but not updated data files due to which the WAL has not been
replayed on replicas?  If so, why the pg_control file is copied, it's
size shouldn't have changed?


Because on master pg_upgrade moves $prefix/9.5/data/global/pg_control to
$prefix/9.5/data/global/pg_control.old and creates new
$prefix/9.6/data/global/pg_control without making hardlink. When running
rsync from master to replica rsync sees $prefix/9.6/data/global/pg_control
on master and checks if it is a hardlink. Since it is not a hardlink and
$prefix/9.6/data/global/pg_control does not exist on replica rsync copies
it. For data files the logic is different since they are hardlinks,
corresponding files exist on replica and they are the same size.


Okay, in that case, I guess it is better to run Analyze on master
after the upgrade is complete (including an upgrade for replicas).  If
you are worried about the performance of read-only replicas till the
time Analyze on the master in completed, you might want to use
--analyze-in-stages of vaccumdb and or use (-j njobs) along with it to
parallelize the operation.

What about the following sequence?

1. Run pg_upgrade on master,
2. Start it in single-user mode and stop (to get right wal_level in pg_control),
3. Copy pg_control somewhere,
4. Start master, run analyze and stop.
5. Put the control file from step 3 on the replicas and rsync them according to the documentation.

And I think that step 10.f in the documentation [1] should be fixed to mention starting in single-user mode or with disabled autovacuum.

[1] https://www.postgresql.org/docs/devel/static/pgupgrade.html





--
May the force be with you…

Re: [HACKERS] Broken hint bits (freeze)

From:
Amit Kapila
Date:
On Mon, Jun 12, 2017 at 9:01 PM, Vladimir Borodin <root@simply.name> wrote:
>
> On 12 June 2017, at 13:19, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jun 11, 2017 at 11:59 PM, Vladimir Borodin <root@simply.name> wrote:
>
>
> On 8 June 2017, at 17:03, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jun 8, 2017 at 6:49 PM, Dmitriy Sarafannikov
> <dsarafannikov@yandex.ru> wrote:
>
>
> Why didn't rsync made the copies on master and replica same?
>
>
> Because rsync was running with --size-only flag.
>
>
> IIUC the situation, the new WAL and updated pg_control file has been
> copied, but not updated data files due to which the WAL has not been
> replayed on replicas?  If so, why the pg_control file is copied, it's
> size shouldn't have changed?
>
>
> Because on master pg_upgrade moves $prefix/9.5/data/global/pg_control to
> $prefix/9.5/data/global/pg_control.old and creates new
> $prefix/9.6/data/global/pg_control without making hardlink. When running
> rsync from master to replica rsync sees $prefix/9.6/data/global/pg_control
> on master and checks if it is a hardlink. Since it is not a hardlink and
> $prefix/9.6/data/global/pg_control does not exist on replica rsync copies
> it. For data files the logic is different since they are hardlinks,
> corresponding files exist on replica and they are the same size.
>
>
> Okay, in that case, I guess it is better to run Analyze on master
> after the upgrade is complete (including an upgrade for replicas).  If
> you are worried about the performance of read-only replicas till the
> time Analyze on the master in completed, you might want to use
> --analyze-in-stages of vaccumdb and or use (-j njobs) along with it to
> parallelize the operation.
>
>
> What about the following sequence?
>
> 1. Run pg_upgrade on master,
> 2. Start it in single-user mode and stop (to get right wal_level in
> pg_control),
> 3. Copy pg_control somewhere,
>

So the above step 3 is to allow extra WAL to be replayed on the replicas
after the upgrade?

> 4. Start master, run analyze and stop.
> 5. Put the control file from step 3 to replicas and rsync them according to
> the documentation.
>

I think the above way should work for your use case unless someone
makes a mistake while copying pg_control.  I am sure you are ensuring
you have a backup during the above procedure.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Broken hint bits (freeze)

From:
Bruce Momjian
Date:
On Mon, Jun 12, 2017 at 06:31:11PM +0300, Vladimir Borodin wrote:
> What about the following sequence?
> 
> 1. Run pg_upgrade on master,
> 2. Start it in single-user mode and stop (to get right wal_level in
> pg_control),
> 3. Copy pg_control somewhere,
> 4. Start master, run analyze and stop.
> 5. Put the control file from step 3 to replicas and rsync them according to the
> documentation.
> 
> And I think that step 10.f in the documentation [1] should be fixed to mention
> starting in single-user mode or with disabled autovacuum.
> 
> [1] https://www.postgresql.org/docs/devel/static/pgupgrade.html

First, I want to apologize for not getting involved in this thread
earlier, and I want to thank everyone for the huge amount of detective
work in finding the cause of this bug.

Let me see if I can replay how the standby server upgrade instructions
evolved over time.

Initially we knew that we had to set wal_level to replica+ so that when
you reconnect to the standby servers, the WAL would have the right
contents.  (We are basically simulating pg_start/stop backup with
rsync.)  

There was a desire to have those instructions inside a documentation
block dedicated to standby server upgrades, so the wal_level adjustment
and new server start/stop was added to that block.  I assumed a
start/stop could not modify the WAL, or at least nothing important would
happen, but obviously I was wrong.  (pg_upgrade takes steps to ensure
that nothing happens.)  Adding ANALYZE in there just made it worse, but
the problem always existed.  I sure hope others haven't had a problem
with this.

Now, it seems we later added a doc section early on that talks about
"Verify standby servers" so I have moved the wal_level section into that
block, which should be safe.  There is now no need to start/stop the new
server since pg_upgrade will do that safely already.

I plan to patch this back to 9.5 where these instructions were added.  I
will mention that this should be in the minor release notes.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachments

Re: [HACKERS] Broken hint bits (freeze)

From:
Amit Kapila
Date:
On Wed, Jun 14, 2017 at 1:01 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Mon, Jun 12, 2017 at 06:31:11PM +0300, Vladimir Borodin wrote:
>> What about the following sequence?
>>
>> 1. Run pg_upgrade on master,
>> 2. Start it in single-user mode and stop (to get right wal_level in
>> pg_control),
>> 3. Copy pg_control somewhere,
>> 4. Start master, run analyze and stop.
>> 5. Put the control file from step 3 to replicas and rsync them according to the
>> documentation.
>>
>> And I think that step 10.f in the documentation [1] should be fixed to mention
>> starting in single-user mode or with disabled autovacuum.
>>
>> [1] https://www.postgresql.org/docs/devel/static/pgupgrade.html
>
> First, I want to apologize for not getting involved in this thread
> earlier, and I want to thank everyone for the huge amount of detective
> work in finding the cause of this bug.
>
> Let me see if I can replay how the standby server upgrade instructions
> evolved over time.
>
> Initially we knew that we had to set wal_level to replica+ so that when
> you reconnect to the standby servers, the WAL would have the right
> contents.  (We are basically simulating pg_start/stop backup with
> rsync.)
>
> There was a desire to have those instructions inside a documentation
> block dedicated to standby server upgrades, so the wal_level adjustment
> and new server start/stop was added to that block.  I assumed a
> start/stop could not modify the WAL, or at least nothing important would
> happen, but obviously I was wrong.  (pg_upgrade takes steps to ensure
> that nothing happens.)  Adding ANALYZE in there just made it worse, but
> the problem always existed.  I sure hope others haven't had a problem
> with this.
>
> Now, it seems we later added a doc section early on that talks about
> "Verify standby servers" so I have moved the wal_level section into that
> block, which should be safe.  There is now no need to start/stop the new
> server since pg_upgrade will do that safely already.
>

!     <para>
!      Also, if upgrading standby servers, change <varname>wal_level</>
!      to <literal>replica</> in the <filename>postgresql.conf</> file on
!      the new cluster.

I think it is better to indicate that this is required for the master
cluster (though it is probably clear to users):
s/"on the new cluster."/"on the new master cluster."/.  Do we need
something different for v10, where the default wal_level is 'replica'?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Broken hint bits (freeze)

From:
Bruce Momjian
Date:
On Wed, Jun 14, 2017 at 07:45:17PM +0530, Amit Kapila wrote:
> > Now, it seems we later added a doc section early on that talks about
> > "Verify standby servers" so I have moved the wal_level section into that
> > block, which should be safe.  There is now no need to start/stop the new
> > server since pg_upgrade will do that safely already.
> >
> 
> !     <para>
> !      Also, if upgrading standby servers, change <varname>wal_level</>
> !      to <literal>replica</> in the <filename>postgresql.conf</> file on
> !      the new cluster.
> 
> I think it is better to indicate that this is required for the master
> cluster (probably it is clear for users) /"on the new cluster."/"on
> the new master cluster.". Do we need something different for v10 where
> default wal_level is 'replica'

You know, I thought about that and was afraid saying "new master
cluster" would be confusing because it isn't a master _yet_, but if you
feel it will help, and I considered it, let's add it.  The problem is
that in the old instructions, at the point we were mentioning this, it
was the new master, which is why I evaluated removing it in the first
place. (Yeah, I am amazed I considered all these cases.)

Updated patch attached.  Thanks.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Attachments

Re: [HACKERS] Broken hint bits (freeze)

From:
Amit Kapila
Date:
On Wed, Jun 14, 2017 at 8:44 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Jun 14, 2017 at 07:45:17PM +0530, Amit Kapila wrote:
>> > Now, it seems we later added a doc section early on that talks about
>> > "Verify standby servers" so I have moved the wal_level section into that
>> > block, which should be safe.  There is now no need to start/stop the new
>> > server since pg_upgrade will do that safely already.
>> >
>>
>> !     <para>
>> !      Also, if upgrading standby servers, change <varname>wal_level</>
>> !      to <literal>replica</> in the <filename>postgresql.conf</> file on
>> !      the new cluster.
>>
>> I think it is better to indicate that this is required for the master
>> cluster (probably it is clear for users) /"on the new cluster."/"on
>> the new master cluster.". Do we need something different for v10 where
>> default wal_level is 'replica'
>
> You know, I thought about that and was afraid saying "new master
> cluster" would be confusing because it isn't a master _yet_, but if you
> feel it will help, and I considered it, let's add it.  The problem is
> that in the old instructions, at the point we were mentioning this, it
> was the new master, which is why I evaluated removing it in the first
> place. (Yeah, I am amazed I considered all these cases.)
>
> Updated patch attached.  Thanks.
>

Looks good to me.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Broken hint bits (freeze)

From:
Bruce Momjian
Date:
On Thu, Jun 15, 2017 at 03:00:18PM +0530, Amit Kapila wrote:
> On Wed, Jun 14, 2017 at 8:44 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Wed, Jun 14, 2017 at 07:45:17PM +0530, Amit Kapila wrote:
> >> > Now, it seems we later added a doc section early on that talks about
> >> > "Verify standby servers" so I have moved the wal_level section into that
> >> > block, which should be safe.  There is now no need to start/stop the new
> >> > server since pg_upgrade will do that safely already.
> >> >
> >>
> >> !     <para>
> >> !      Also, if upgrading standby servers, change <varname>wal_level</>
> >> !      to <literal>replica</> in the <filename>postgresql.conf</> file on
> >> !      the new cluster.
> >>
> >> I think it is better to indicate that this is required for the master
> >> cluster (probably it is clear for users) /"on the new cluster."/"on
> >> the new master cluster.". Do we need something different for v10 where
> >> default wal_level is 'replica'
> >
> > You know, I thought about that and was afraid saying "new master
> > cluster" would be confusing because it isn't a master _yet_, but if you
> > feel it will help, and I considered it, let's add it.  The problem is
> > that in the old instructions, at the point we were mentioning this, it
> > was the new master, which is why I evaluated removing it in the first
> > place. (Yeah, I am amazed I considered all these cases.)
> >
> > Updated patch attached.  Thanks.
> >
> 
> Looks good to me.

Patch applied back to 9.5, where these instructions first appeared.  A
mention of this will appear in the minor release notes.  Thanks for
everyone's work on this.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Broken hint bits (freeze)

From:
Sergey Burladyan
Date:
Bruce Momjian <bruce@momjian.us> writes:

> !      against the old primary and standby clusters.  Verify that the
> !      <quote>Latest checkpoint location</> values match in all clusters.

For a "Log-Shipping only" standby server this cannot be satisfied, because
the last WAL segment from the master (with the shutdown checkpoint) is never
archived.  For example (git master):
==== postgresql.conf ===
port = 5430
shared_buffers = 32MB
wal_level = hot_standby
archive_mode = on
archive_command = 'test ! -f "$ARH/%f" && ( echo "arch %p"; cp %p "$ARH/%f"; )'
max_wal_senders = 5
hot_standby = on
log_line_prefix = '%t '
log_checkpoints = on
lc_messages = C
========================

==== pg_control ====
pg_control version number:            1002
Catalog version number:               201705301
Database system identifier:           6432034080221219745
Database cluster state:               shut down
pg_control last modified:             Fri Jun 16 03:57:22 2017
Latest checkpoint location:           0/D000028
Prior checkpoint location:            0/1604878
Latest checkpoint's REDO location:    0/D000028
Latest checkpoint's REDO WAL file:    00000001000000000000000D
====================

==== WALs archive ====
-rw------- 1 sergey users 16777216 Jun 16 03:57 000000010000000000000003
-rw------- 1 sergey users 16777216 Jun 16 03:57 000000010000000000000004
-rw------- 1 sergey users 16777216 Jun 16 03:57 000000010000000000000005
-rw------- 1 sergey users 16777216 Jun 16 03:57 000000010000000000000006
-rw------- 1 sergey users 16777216 Jun 16 03:57 000000010000000000000007
-rw------- 1 sergey users 16777216 Jun 16 03:57 000000010000000000000008
-rw------- 1 sergey users 16777216 Jun 16 03:57 000000010000000000000009
-rw------- 1 sergey users 16777216 Jun 16 03:57 00000001000000000000000A
-rw------- 1 sergey users 16777216 Jun 16 03:57 00000001000000000000000B
-rw------- 1 sergey users 16777216 Jun 16 03:57 00000001000000000000000C
======================

==== logfile ====
arch pg_wal/00000001000000000000000A
arch pg_wal/00000001000000000000000B
2017-06-16 00:57:21 GMT LOG:  received fast shutdown request
2017-06-16 00:57:21 GMT LOG:  aborting any active transactions
2017-06-16 00:57:21 GMT LOG:  shutting down
arch pg_wal/00000001000000000000000C
2017-06-16 00:57:21 GMT LOG:  checkpoint starting: shutdown immediate
2017-06-16 00:57:22 GMT LOG:  checkpoint complete: wrote 4058 buffers (99.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.033 s, sync=0.949 s, total=1.144 s; sync files=32, longest=0.598 s, average=0.029 s; distance=190445 kB, estimate=190445 kB
2017-06-16 00:57:22 GMT LOG:  database system is shut down
=================

There is no 00000001000000000000000D in the archive, and after shutdown the
standby can only be at its previous restartpoint (0/1604878) because it
never receives the latest checkpoint (0/D000028) from the master.

So, after shutting down the master and a "Log-Shipping only" standby, the
standby is always "one checkpoint early" relative to the master, and
"Latest checkpoint location" never matches for it.

I think this must be mentioned somehow in the documentation.
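Sergey's file names can be reproduced with a little arithmetic. The sketch below (assuming the default 16 MB segment size and timeline 1; the helper name is ours, not PostgreSQL's) maps an LSN from pg_controldata to the WAL segment file that contains it, showing why 00000001000000000000000D is exactly the file missing from the archive:

```python
def lsn_to_walfile(lsn, timeline=1, seg_size=16 * 1024 * 1024):
    """Map an LSN like '0/D000028' to its WAL segment file name."""
    hi, lo = lsn.split("/")
    pos = (int(hi, 16) << 32) | int(lo, 16)   # byte position in the WAL stream
    segno = pos // seg_size                    # global segment number
    segs_per_id = 0x100000000 // seg_size      # 256 segments per "xlog id" at 16 MB
    return "%08X%08X%08X" % (timeline, segno // segs_per_id, segno % segs_per_id)

# Latest checkpoint 0/D000028 lives in the segment that was never archived:
assert lsn_to_walfile("0/D000028") == "00000001000000000000000D"
# The prior checkpoint 0/1604878 falls in segment ...0001:
assert lsn_to_walfile("0/1604878") == "000000010000000000000001"
```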


> !     <para>
> !      Also, if upgrading standby servers, change <varname>wal_level</>
> !      to <literal>replica</> in the <filename>postgresql.conf</> file on
> !      the new cluster.
>       </para>
>      </step>

I am not sure how this helps.

wal_level is reset by pg_resetxlog during pg_upgrade, so it does not
depend on postgresql.conf. After pg_upgrade, wal_level is always
'minimal', which is why you must start and stop the new master before rsync:

==== output ====
$ "$bin"/pg_controldata "$ver" | grep wal_level
wal_level setting:                    replica

$ "$bin"/pg_resetwal "$ver"
Write-ahead log reset

$ "$bin"/pg_controldata "$ver" | grep wal_level
wal_level setting:                    minimal
================

If you rsync the standby now (without starting/stopping the new master after
pg_upgrade), you will send pg_control with wal_level=minimal to it, and after
that the standby aborts on startup:
==== standby logfile ====
2017-06-16 01:22:14 GMT LOG:  entering standby mode
2017-06-16 01:22:14 GMT WARNING:  WAL was generated with wal_level=minimal, data may be missing
2017-06-16 01:22:14 GMT HINT:  This happens if you temporarily set wal_level=minimal without taking a new base backup.
2017-06-16 01:22:14 GMT FATAL:  hot standby is not possible because wal_level was not set to "replica" or higher on the master server
2017-06-16 01:22:14 GMT HINT:  Either set wal_level to "replica" on the master, or turn off hot_standby here.
2017-06-16 01:22:14 GMT LOG:  startup process (PID 27916) exited with exit code 1
=================

PS: Thank you for answer, Bruce!

-- 
Sergey Burladyan



Re: [HACKERS] Broken hint bits (freeze)

From:
Amit Kapila
Date:
On Fri, Jun 16, 2017 at 7:03 AM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>
>> !      against the old primary and standby clusters.  Verify that the
>> !      <quote>Latest checkpoint location</> values match in all clusters.
>
> For "Log-Shipping only" standby server this cannot be satisfied, because
> last WAL from master (with shutdown checkpoint) never archived.
>

Yeah, we have ensured that all the transactions before the shutdown
checkpoint get archived.  It is done in commit
2e6107cb621d003dcab0df53ac8673ea67c4e467.  However, it is not clear to
me, nor is it mentioned in the comments, why we have done it that way.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Broken hint bits (freeze)

From:
Bruce Momjian
Date:
On Fri, Jun 16, 2017 at 04:33:16AM +0300, Sergey Burladyan wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > !     <para>
> > !      Also, if upgrading standby servers, change <varname>wal_level</>
> > !      to <literal>replica</> in the <filename>postgresql.conf</> file on
> > !      the new cluster.
> >       </para>
> >      </step>
> 
> I am not sure how this help.
> 
> wal_level is reset by pg_resetxlog during pg_upgrade, so it does not
> depend on postgresql.conf. After pg_upgrade wal_level always is
> 'minimal', that is why you must start and stop new master before rsync:
> 
> ==== output ====
> $ "$bin"/pg_controldata "$ver" | grep wal_level
> wal_level setting:                    replica
> 
> $ "$bin"/pg_resetwal "$ver"
> Write-ahead log reset
> 
> $ "$bin"/pg_controldata "$ver" | grep wal_level
> wal_level setting:                    minimal
> ================

Yes, I see that, but pg_resetxlog is run _before_ the _new_ cluster is
started for the last time, so in my testing the wal_level at the end of
pg_upgrade matches the value in postgresql.conf, e.g. "replica".  For
example:

    Upgrade Complete
    ----------------
    Optimizer statistics are not transferred by pg_upgrade so,
    once you start the new server, consider running:
        ./analyze_new_cluster.sh

    Running this script will delete the old cluster's data files:
        ./delete_old_cluster.sh

    $ pg_controldata /u/pg/data/ | grep wal_level
    wal_level setting:                    replica

The way pg_upgrade uses rsync, the standby never needs to replay the WAL
when it starts up because we already copied the changed system tables
and hard linked the user data files.
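The hard-link point is easy to demonstrate outside PostgreSQL: pg_upgrade's link mode creates a second directory entry for the same inode, so the old and new paths are necessarily the same size, which in turn is what lets the rsync --hard-links --size-only invocation skip them. A small illustration (plain Python; the file names are invented, and a filesystem that supports hard links is assumed):

```python
import os
import tempfile

# Simulate what pg_upgrade --link does for a user data file: the 9.5 and
# 9.6 paths become two names for one inode.
with tempfile.TemporaryDirectory() as d:
    old = os.path.join(d, "relfile_9.5")
    new = os.path.join(d, "relfile_9.6")
    with open(old, "wb") as f:
        f.write(b"\x00" * 8192)            # one 8 kB heap page
    os.link(old, new)                      # hard link, no data copied

    s_old, s_new = os.stat(old), os.stat(new)
    same_inode = s_old.st_ino == s_new.st_ino
    link_count = s_old.st_nlink
    same_size = s_old.st_size == s_new.st_size

assert same_inode and same_size and link_count == 2
```

Because the two names share one inode, a write through either path is visible through both; there is nothing for rsync to transfer as long as it recreates the link on the receiver.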

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Broken hint bits (freeze)

From:
Bruce Momjian
Date:
On Fri, Jun 16, 2017 at 08:10:13PM +0530, Amit Kapila wrote:
> On Fri, Jun 16, 2017 at 7:03 AM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> >
> >> !      against the old primary and standby clusters.  Verify that the
> >> !      <quote>Latest checkpoint location</> values match in all clusters.
> >
> > For "Log-Shipping only" standby server this cannot be satisfied, because
> > last WAL from master (with shutdown checkpoint) never archived.
> >
> 
> Yeah, we have ensured that all the transactions before shutdown
> checkpoint got archived.  It is done in commit
> 2e6107cb621d003dcab0df53ac8673ea67c4e467.  However, it is not clear to
> me neither it is mentioned in comments why we have done it that way.

Yes, I am confused why Sergey doesn't see that behavior.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Broken hint bits (freeze)

From:
Sergey Burladyan
Date:
Bruce Momjian <bruce@momjian.us> writes:

> On Fri, Jun 16, 2017 at 08:10:13PM +0530, Amit Kapila wrote:
> > On Fri, Jun 16, 2017 at 7:03 AM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> > > Bruce Momjian <bruce@momjian.us> writes:
> > >
> > >> !      against the old primary and standby clusters.  Verify that the
> > >> !      <quote>Latest checkpoint location</> values match in all clusters.
> > >
> > > For "Log-Shipping only" standby server this cannot be satisfied, because
> > > last WAL from master (with shutdown checkpoint) never archived.
> > >
> > 
> > Yeah, we have ensured that all the transactions before shutdown
> > checkpoint got archived.  It is done in commit
> > 2e6107cb621d003dcab0df53ac8673ea67c4e467.  However, it is not clear to
> > me neither it is mentioned in comments why we have done it that way.
>
> Yes, I am confused why Sergey doesn't see that behavior.

I think this last, newly switched WAL segment with the shutdown checkpoint
record is incomplete, so it is not marked as *.ready in
pg_xlog/archive_status/ and is not archived.

-- 
Sergey Burladyan



Re: [HACKERS] Broken hint bits (freeze)

From:
Sergey Burladyan
Date:
Bruce Momjian <bruce@momjian.us> writes:

> On Fri, Jun 16, 2017 at 04:33:16AM +0300, Sergey Burladyan wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > !     <para>
> > > !      Also, if upgrading standby servers, change <varname>wal_level</>
> > > !      to <literal>replica</> in the <filename>postgresql.conf</> file on
> > > !      the new cluster.
> > >       </para>
> > >      </step>
> > 
> > I am not sure how this help.
> > 
> > wal_level is reset by pg_resetxlog during pg_upgrade, so it does not
> > depend on postgresql.conf. After pg_upgrade wal_level always is
> > 'minimal', that is why you must start and stop new master before rsync:
> > 
> > ==== output ====
> > $ "$bin"/pg_controldata "$ver" | grep wal_level
> > wal_level setting:                    replica
> > 
> > $ "$bin"/pg_resetwal "$ver"
> > Write-ahead log reset
> > 
> > $ "$bin"/pg_controldata "$ver" | grep wal_level
> > wal_level setting:                    minimal
> > ================
>
> Yes, I see that, but pg_resetxlog is run _before_ the _new_ cluster is
> started for the last time, so in my testing the wal_level at the end of
> pg_upgrade matches the value in postgresql.conf, e.g. "replica".  For
> example:
>
>     Upgrade Complete
>     ----------------
>     Optimizer statistics are not transferred by pg_upgrade so,
>     once you start the new server, consider running:
>         ./analyze_new_cluster.sh
>     
>     Running this script will delete the old cluster's data files:
>         ./delete_old_cluster.sh
>
>     $ pg_controldata /u/pg/data/ | grep wal_level
>     wal_level setting:                    replica
>
> The way pg_upgrade uses rsync, the standby never needs to replay the WAL
> when it starts up because we already copied the changed system tables
> and hard linked the user data files.

Oh, it is my mistake: I did not run the test script completely for current
git master. In git master it works as expected, but not in previous versions.
I used this test script and got these results:
9.2 -> master: wal_level setting:                    replica
9.2 -> 9.6: wal_level setting:                    minimal
9.2 -> 9.5: wal_level setting:                    minimal
9.2 -> 9.4: Current wal_level setting:            minimal

I also saved strace output for pg_upgrade:
=== 9.6 ===
pg_resetxlog", ["/home/sergey/inst/pg9.6/bin/pg_resetxlog", "-l", "000000010000000000000002", "9.6"],
pg_ctl", ["/home/sergey/inst/pg9.6/bin/pg_ctl", "-w", "-l", "pg_upgrade_server.log", "-D", "9.6",
pg_ctl", ["/home/sergey/inst/pg9.6/bin/pg_ctl", "-w", "-D", "9.6", "-o", "", "-m", "smart", "stop"],
pg_resetxlog", ["/home/sergey/inst/pg9.6/bin/pg_resetxlog", "-o", "16393", "9.6"], [/* 68 vars */]) = 0
===========

The last pg_resetxlog execution is for setting the next OID;
it is from src/bin/pg_upgrade/pg_upgrade.c:149

=== master ===
pg_resetwal", ["/home/sergey/inst/pg-master/bin/pg_resetwal", "-l", "000000010000000000000002", "master"],
pg_ctl", ["/home/sergey/inst/pg-master/bin/pg_ctl", "-w", "-l", "pg_upgrade_server.log", "-D", "master",
pg_ctl", ["/home/sergey/inst/pg-master/bin/pg_ctl", "-w", "-D", "master", "-o", "", "-m", "smart", "stop"],
pg_resetwal", ["/home/sergey/inst/pg-master/bin/pg_resetwal", "-o", "16393", "master"], [/* 70 vars */]) = 0
pg_ctl", ["/home/sergey/inst/pg-master/bin/pg_ctl", "-w", "-l", "pg_upgrade_server.log", "-D", "master",
pg_ctl", ["/home/sergey/inst/pg-master/bin/pg_ctl", "-w", "-D", "master", "-o", "", "-m", "smart", "stop"],
==============

In git master, pg_upgrade restarts the new master again after
pg_resetwal -o, as you said.

It is from src/bin/pg_upgrade/check.c:176
void
issue_warnings(void)
{
    /* Create dummy large object permissions for old < PG 9.0? */
    if (GET_MAJOR_VERSION(old_cluster.major_version) <= 804)
    {
        start_postmaster(&new_cluster, true);
        new_9_0_populate_pg_largeobject_metadata(&new_cluster, false);
        stop_postmaster(false);
    }

    /* Reindex hash indexes for old < 10.0 */
    if (GET_MAJOR_VERSION(old_cluster.major_version) <= 906)
    {
        start_postmaster(&new_cluster, true);
        old_9_6_invalidate_hash_indexes(&new_cluster, false);
        stop_postmaster(false);
    }
}

-- 
Sergey Burladyan



Attachments

Re: [HACKERS] Broken hint bits (freeze)

From:
Bruce Momjian
Date:
On Fri, Jun 16, 2017 at 04:44:46PM -0400, Bruce Momjian wrote:
> Yes, that is _exactly_ the right place to look.  Only in PG 10 do we
> restart the new cluster to invalidate hash indexes.  In previous
> releases we didn't do the restart.
> 
> That didn't matter with the old rsync instructions, but now that we have
> removed the start/stop before rsync step, the final WAL status of
> pg_upgrade matters.
> 
> I suggest applying the attached patch

Sorry, I meant to say: I suggest applying the attached patch to all
Postgres versions, suitably modified, of course.  While the rsync instructions
only appear in PG 9.5+, they work for any supported version of
Postgres, so we should allow them to continue working, even when the updated
instructions are used.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Broken hint bits (freeze)

From:
Amit Kapila
Date:
On Fri, Jun 16, 2017 at 11:03 PM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>
>> On Fri, Jun 16, 2017 at 08:10:13PM +0530, Amit Kapila wrote:
>> > On Fri, Jun 16, 2017 at 7:03 AM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
>> > > Bruce Momjian <bruce@momjian.us> writes:
>> > >
>> > >> !      against the old primary and standby clusters.  Verify that the
>> > >> !      <quote>Latest checkpoint location</> values match in all clusters.
>> > >
>> > > For "Log-Shipping only" standby server this cannot be satisfied, because
>> > > last WAL from master (with shutdown checkpoint) never archived.
>> > >
>> >
>> > Yeah, we have ensured that all the transactions before shutdown
>> > checkpoint got archived.  It is done in commit
>> > 2e6107cb621d003dcab0df53ac8673ea67c4e467.  However, it is not clear to
>> > me neither it is mentioned in comments why we have done it that way.
>>
>> Yes, I am confused why Sergey doesn't see that behavior.
>

The behavior reported by Sergey is what is expected, i.e., the last file
in which the shutdown checkpoint record is written won't be archived, and
there is a reason behind that.  We always perform the shutdown checkpoint
(which writes the shutdown checkpoint record) after requesting an xlog
switch.  Any record written after the xlog switch won't be archived unless
it is so big that it consumes a complete xlog segment.
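This segment-boundary behavior can be sketched with a toy model (the 16 MB
segment size matches the pre-PG 10 default; the helper names here are
illustrative, not PostgreSQL functions):

```python
WAL_SEG_SIZE = 16 * 1024 * 1024  # 16 MB, the default WAL segment size


def switch_xlog(pos: int) -> int:
    """An xlog switch advances the insert position to the next segment boundary."""
    return ((pos // WAL_SEG_SIZE) + 1) * WAL_SEG_SIZE


def completed_segments(pos: int) -> list:
    """Segments that end at or before `pos` are complete and get .ready markers."""
    return list(range(pos // WAL_SEG_SIZE))


pos = 5 * WAL_SEG_SIZE + 1234   # insert position somewhere inside segment 5
pos = switch_xlog(pos)          # xlog switch: segment 5 completed, hence archived
pos += 104                      # shutdown checkpoint record lands in segment 6
print(completed_segments(pos))  # [0, 1, 2, 3, 4, 5] -- segment 6 is never marked .ready
```

The shutdown checkpoint record alone almost never fills segment 6, so that
segment stays incomplete and is not handed to the archiver.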

> I think this last new switched WAL with shutdown checkpoint record is
> incomplete and it does not marked as *.ready in pg_xlog/archive_status/
> and not archived.
>

Yes, that's true and is expected behavior.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Broken hint bits (freeze)

From:
Bruce Momjian
Date:
On Sat, Jun 17, 2017 at 08:34:47AM +0530, Amit Kapila wrote:
> On Fri, Jun 16, 2017 at 11:03 PM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> >> > Yeah, we have ensured that all the transactions before shutdown
> >> > checkpoint got archived.  It is done in commit
> >> > 2e6107cb621d003dcab0df53ac8673ea67c4e467.  However, it is not clear to
> >> > me neither it is mentioned in comments why we have done it that way.
> >>
> >> Yes, I am confused why Sergey doesn't see that behavior.
> >
> 
> The behavior reported by Sergey is what is expected i.e the last file
> in which shutdown checkpoint record is written won't be archived and
> there is a reason behind that.  We always perform shutdown checkpoint
> (which will write shutdown checkpoint record) after requesting a xlog
> switch.  Any record written after xlog switch won't be archived unless
> it is so big that it consumes complete xlog segment.
> 
> > I think this last new switched WAL with shutdown checkpoint record is
> > incomplete and it does not marked as *.ready in pg_xlog/archive_status/
> > and not archived.
> >
> 
> Yes, that's true and is expected behavior.

OK, so our pg_upgrade documentation is currently incorrect:
https://www.postgresql.org/docs/10/static/pgupgrade.html

    8. Verify standby servers

    If you are upgrading Streaming Replication and Log-Shipping standby
    servers, verify that the old standby servers are caught up by running
    pg_controldata against the old primary and standby clusters. Verify
    that the "Latest checkpoint location" values match in all clusters.
    (There will be a mismatch if old standby servers were shut down
    before the old primary.)

We are saying that Log-Shipping should match "Latest checkpoint
location", but the WAL for that will not be sent to the standby, so it
will not match, but that is OK since the only thing in the non-shipped
WAL file is the checkpoint record.  How should we modify the wording on
this?
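The check the documentation describes amounts to comparing one
pg_controldata field across clusters; a minimal sketch of that comparison
(the parsing helper and the abbreviated sample outputs are illustrative):

```python
import re


def latest_checkpoint_location(controldata_output: str) -> str:
    """Extract the 'Latest checkpoint location' LSN from pg_controldata output."""
    m = re.search(r"Latest checkpoint location:\s*([0-9A-F]+/[0-9A-F]+)",
                  controldata_output)
    if m is None:
        raise ValueError("'Latest checkpoint location' not found")
    return m.group(1)


# Abbreviated sample output, as if captured from `pg_controldata $PGDATA`:
primary = "Latest checkpoint location:           0/F008D28\n"
standby = "Latest checkpoint location:           0/F008D28\n"

# Matching values mean the standby replayed everything up to the primary's
# final checkpoint; with log shipping, the standby will typically lag by
# exactly the unarchived final segment.
print(latest_checkpoint_location(primary) == latest_checkpoint_location(standby))  # True
```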

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Broken hint bits (freeze)

From:
Sergey Burladyan
Date:
On June 20, 2017 at 1:21, "Bruce Momjian" <bruce@momjian.us> wrote:

We are saying that Log-Shipping should match "Latest checkpoint
location", but the WAL for that will not be sent to the standby, so it
will not match, but that is OK since the only thing in the non-shipped
WAL file is the checkpoint record.  How should we modify the wording on
this?

I am afraid that without this checkpoint record the standby cannot create
a restartpoint, and without a restartpoint it does not sync shared buffers
to disk at shutdown.

Re: [HACKERS] Broken hint bits (freeze)

From:
Bruce Momjian
Date:
On Tue, Jun 20, 2017 at 03:50:29AM +0300, Sergey Burladyan wrote:
> On June 20, 2017 at 1:21, "Bruce Momjian" <bruce@momjian.us> wrote:
> 
> 
>     We are saying that Log-Shipping should match "Latest checkpoint
>     location", but the WAL for that will not be sent to the standby, so it
>     will not match, but that is OK since the only thing in the non-shipped
>     WAL file is the checkpoint record.  How should we modify the wording on
>     this?
> 
> 
> I am afraid that without this checkpoint record standby cannot make
> restartpoint
> and without restartpoint it does not sync shared buffers into disk at
> shutdown. 

Uh, as I understand it the rsync is going to copy the missing WAL file
from the new master to the standby, right, and I think pg_controldata
too, so it should be fine.  Have you tested to see if it fails?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Broken hint bits (freeze)

From:
Bruce Momjian
Date:
On Mon, Jun 19, 2017 at 10:59:19PM -0400, Bruce Momjian wrote:
> On Tue, Jun 20, 2017 at 03:50:29AM +0300, Sergey Burladyan wrote:
> > On June 20, 2017 at 1:21, "Bruce Momjian" <bruce@momjian.us> wrote:
> > 
> > 
> >     We are saying that Log-Shipping should match "Latest checkpoint
> >     location", but the WAL for that will not be sent to the standby, so it
> >     will not match, but that is OK since the only thing in the non-shipped
> >     WAL file is the checkpoint record.  How should we modify the wording on
> >     this?
> > 
> > 
> > I am afraid that without this checkpoint record standby cannot make
> > restartpoint
> > and without restartpoint it does not sync shared buffers into disk at
> > shutdown. 
> 
> Uh, as I understand it the rsync is going to copy the missing WAL file
> from the new master to the standby, right, and I think pg_controldata
> too, so it should be fine.  Have you tested to see if it fails?

The point is that we are checking the "Latest checkpoint location" to
make sure all the WAL was replayed.   We are never going to start the
old standby server.  Rsync is going to copy the missing/changed files.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Broken hint bits (freeze)

From:
Sergey Burladyan
Date:
Bruce Momjian <bruce@momjian.us> writes:

> On Mon, Jun 19, 2017 at 10:59:19PM -0400, Bruce Momjian wrote:
> > On Tue, Jun 20, 2017 at 03:50:29AM +0300, Sergey Burladyan wrote:
> > > On June 20, 2017 at 1:21, "Bruce Momjian" <bruce@momjian.us> wrote:
> > >
> > >
> > >     We are saying that Log-Shipping should match "Latest checkpoint
> > >     location", but the WAL for that will not be sent to the standby, so it
> > >     will not match, but that is OK since the only thing in the non-shipped
> > >     WAL file is the checkpoint record.  How should we modify the wording on
> > >     this?
> > >
> > >
> > > I am afraid that without this checkpoint record standby cannot make
> > > restartpoint
> > > and without restartpoint it does not sync shared buffers into disk at
> > > shutdown. 
> >
> > Uh, as I understand it the rsync is going to copy the missing WAL file
> > from the new master to the standby, right, and I think pg_controldata
> > too, so it should be fine.  Have you tested to see if it fails?

It needs the old WAL files from the old version to correctly restore the
heap files. New WAL files from the new version do not have this information.

> The point is that we are checking the "Latest checkpoint location" to
> make sure all the WAL was replayed.   We are never going to start the
> old standby server.  Rsync is going to copy the missing/changed files.

Only if the missing/changed files differ in size: because rsync is run
with --size-only, it does not copy changed files that have the same size.

I have a test script, and without copy_last_wal it makes the standby
broken within the first few loops, like:
=== run 1, cnt: 700000
=== run 2, cnt: 729450

PS: I thought that with a large shared_buffers I could break it more
quickly, but with a large shared_buffers I cannot break it at all, hm...

--
Sergey Burladyan



Attachments

Re: [HACKERS] Broken hint bits (freeze)

From:
Amit Kapila
Date:
On Tue, Jun 20, 2017 at 3:40 PM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>
>> On Mon, Jun 19, 2017 at 10:59:19PM -0400, Bruce Momjian wrote:
>> > On Tue, Jun 20, 2017 at 03:50:29AM +0300, Sergey Burladyan wrote:
>> > > On June 20, 2017 at 1:21, "Bruce Momjian" <bruce@momjian.us> wrote:
>> > >
>> > >
>> > >     We are saying that Log-Shipping should match "Latest checkpoint
>> > >     location", but the WAL for that will not be sent to the standby, so it
>> > >     will not match, but that is OK since the only thing in the non-shipped
>> > >     WAL file is the checkpoint record.  How should we modify the wording on
>> > >     this?
>> > >
>> > >
>> > > I am afraid that without this checkpoint record standby cannot make
>> > > restartpoint
>> > > and without restartpoint it does not sync shared buffers into disk at
>> > > shutdown.
>> >

It seems to me that at shutdown time on standby servers we specifically
create restartpoints.  See the code below in ShutdownXLOG():

..
if (RecoveryInProgress())
CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
..

Do you have something else in mind?

>> > Uh, as I understand it the rsync is going to copy the missing WAL file
>> > from the new master to the standby, right, and I think pg_controldata
>> > too, so it should be fine.  Have you tested to see if it fails?
>
> It need old WAL files from old version for correct restore heap
> files. New WAL files from new version does not have this information.
>

So in such a case can we run rsync once before pg_upgrade?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Broken hint bits (freeze)

From:
Sergey Burladyan
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:

> On Tue, Jun 20, 2017 at 3:40 PM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> >
> >> On Mon, Jun 19, 2017 at 10:59:19PM -0400, Bruce Momjian wrote:
> >> > On Tue, Jun 20, 2017 at 03:50:29AM +0300, Sergey Burladyan wrote:
> >> > > On June 20, 2017 at 1:21, "Bruce Momjian" <bruce@momjian.us> wrote:
> >> > >
> >> > >
> >> > >     We are saying that Log-Shipping should match "Latest checkpoint
> >> > >     location", but the WAL for that will not be sent to the standby, so it
> >> > >     will not match, but that is OK since the only thing in the non-shipped
> >> > >     WAL file is the checkpoint record.  How should we modify the wording on
> >> > >     this?
> >> > >
> >> > >
> >> > > I am afraid that without this checkpoint record standby cannot make
> >> > > restartpoint
> >> > > and without restartpoint it does not sync shared buffers into disk at
> >> > > shutdown.
> >> >
>
> It seems to me at shutdown time on standby servers we specifically
> make restart points.  See below code in ShutdownXLOG()
>
> ..
> if (RecoveryInProgress())
> CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
> ..
>
> Do you have something else in mind?

Which buffers will this restartpoint write to disk? I think it can write
only buffers with an LSN lower than or equal to the "Latest checkpoint
location". Buffers with an LSN between the "Minimum recovery ending
location" and the "Latest checkpoint location" will not be written at all.

I set log_min_messages=debug2 and it is clearer what happened here:
2017-06-20 13:18:32 GMT LOG:  restartpoint starting: xlog
...
2017-06-20 13:18:33 GMT DEBUG:  postmaster received signal 15
2017-06-20 13:18:33 GMT LOG:  received smart shutdown request
2017-06-20 13:18:33 GMT DEBUG:  updated min recovery point to 0/12000000
2017-06-20 13:18:33 GMT CONTEXT:  writing block 2967 of relation base/16384/16385
2017-06-20 13:18:33 GMT DEBUG:  checkpoint sync: number=1 file=global/12587 time=0.001 msec
2017-06-20 13:18:33 GMT DEBUG:  checkpoint sync: number=2 file=base/16384/12357 time=0.000 msec
2017-06-20 13:18:33 GMT DEBUG:  checkpoint sync: number=3 file=base/16384/16385 time=0.000 msec
2017-06-20 13:18:33 GMT DEBUG:  attempting to remove WAL segments older than log file 00000001000000000000000B
2017-06-20 13:18:33 GMT DEBUG:  recycled transaction log file "00000001000000000000000B"
2017-06-20 13:18:33 GMT DEBUG:  recycled transaction log file "00000001000000000000000A"
2017-06-20 13:18:33 GMT DEBUG:  recycled transaction log file "000000010000000000000009"
2017-06-20 13:18:33 GMT DEBUG:  SlruScanDirectory invoking callback on pg_subtrans/0000
2017-06-20 13:18:33 GMT LOG:  restartpoint complete: wrote 1824 buffers (44.5%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=1.389 s, sync=0.000 s, total=1.389 s; sync files=3, longest=0.000 s, average=0.000 s
2017-06-20 13:18:33 GMT LOG:  recovery restart point at 0/F008D28
2017-06-20 13:18:33 GMT DETAIL:  last completed transaction was at log time 2017-06-20 13:18:29.282645+00
2017-06-20 13:18:33 GMT LOG:  shutting down
2017-06-20 13:18:33 GMT DEBUG:  skipping restartpoint, already performed at 0/F008D28
2017-06-20 13:18:33 GMT LOG:  database system is shut down
========

I use PG 9.2, and "skipping restartpoint, already performed at" is from
src/backend/access/transam/xlog.c:8643; after this statement it returns
from CreateRestartPoint() and does not run line 8687:
    CheckPointGuts(lastCheckPoint.redo, flags);

> >> > Uh, as I understand it the rsync is going to copy the missing WAL file
> >> > from the new master to the standby, right, and I think pg_controldata
> >> > too, so it should be fine.  Have you tested to see if it fails?
> >
> > It need old WAL files from old version for correct restore heap
> > files. New WAL files from new version does not have this information.
> >
>
> So in such a case can we run rsync once before pg_upgrade?

I just copy the last WAL file from the stopped old master to the running
old standby before it shuts down, and wait until it is replayed. After
that the standby can create a restartpoint at the same location as the
stopped master.

I am not sure about rsync; on my production server I have, for example,
111 GB in pg_xlog, and if I ran rsync for pg_xlog it would have to send
~40 GB of new WAL, I think.
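Finding which segment file to copy for a given checkpoint location is
mechanical with the 9.2-era defaults (16 MB segments; the file name is the
TimeLineID, the high half of the LSN, and the low half divided by the
segment size, each as eight hex digits). A sketch under those assumptions:

```python
WAL_SEG_SIZE = 16 * 1024 * 1024  # 16 MB default segment size


def wal_segment_name(timeline: int, lsn: str) -> str:
    """Map an 'X/Y'-style LSN to the pre-PG 10 WAL segment file name holding it."""
    high, low = (int(part, 16) for part in lsn.split("/"))
    return "%08X%08X%08X" % (timeline, high, low // WAL_SEG_SIZE)


# The restartpoint location from the debug log above:
print(wal_segment_name(1, "0/F008D28"))  # 00000001000000000000000F
```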

--
Sergey Burladyan



Re: [HACKERS] Broken hint bits (freeze)

From:
Amit Kapila
Date:
On Tue, Jun 20, 2017 at 7:05 PM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
>> On Tue, Jun 20, 2017 at 3:40 PM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> I use pg 9.2 and "skipping restartpoint, already performed at" is from
> src/backend/access/transam/xlog.c:8643
> after this statement it return from CreateRestartPoint() and do not run
>    8687     CheckPointGuts(lastCheckPoint.redo, flags);
>

You are right, so it will skip restartpoint in such a case.

>> >> > Uh, as I understand it the rsync is going to copy the missing WAL file
>> >> > from the new master to the standby, right, and I think pg_controldata
>> >> > too, so it should be fine.  Have you tested to see if it fails?
>> >
>> > It need old WAL files from old version for correct restore heap
>> > files. New WAL files from new version does not have this information.
>> >
>>
>> So in such a case can we run rsync once before pg_upgrade?
>
> I just copy last WAL from stopped old master into running old standby
> before it shutdown and wait till it replayed. After that standby can
> issue restartpoint at the same location as in stopped master.
>

Hmm.  I think we need something that works with lesser effort because
not all users will be as knowledgeable as you are, so if they make any
mistakes in copying the file manually, it can lead to problems.  How
about issuing a notification (XLogArchiveNotifySeg) in shutdown
checkpoint if archiving is enabled?

> I am not sure about rsync, in my production server I have for example
> 111 GB in pg_xlog and if I run rsync for pg_xlog it must send ~ 40GB
> of new WALs I think.
>

Isn't the difference between the old and new just the last WAL segment
file?  What is the source of this difference?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Broken hint bits (freeze)

From:
Bruce Momjian
Date:
On Tue, Jun 20, 2017 at 01:10:26PM +0300, Sergey Burladyan wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > > Uh, as I understand it the rsync is going to copy the missing WAL file
> > > from the new master to the standby, right, and I think pg_controldata
> > > too, so it should be fine.  Have you tested to see if it fails?
> 
> It need old WAL files from old version for correct restore heap
> files. New WAL files from new version does not have this information.
> 
> > The point is that we are checking the "Latest checkpoint location" to
> > make sure all the WAL was replayed.   We are never going to start the
> > old standby server.  Rsync is going to copy the missing/changed files.
> 
> Only if missing/changed files changed in size, because rsync run with
> --size-only it does not copy changed files with same size.

I am sorry but I am not understanding.  Step 10.b says:

    10.b Make sure the new standby data directories do not exist

    Make sure the new standby data directories do not exist or are empty.
    If initdb was run, delete the standby server data directories.

so the _entire_ new data directory is empty before rsync is run, meaning
that it is an exact copy of the new master.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Broken hint bits (freeze)

From:
Sergey Burladyan
Date:
Bruce Momjian <bruce@momjian.us> writes:

> On Tue, Jun 20, 2017 at 01:10:26PM +0300, Sergey Burladyan wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > > Uh, as I understand it the rsync is going to copy the missing WAL file
> > > > from the new master to the standby, right, and I think pg_controldata
> > > > too, so it should be fine.  Have you tested to see if it fails?
> > 
> > It need old WAL files from old version for correct restore heap
> > files. New WAL files from new version does not have this information.
> > 
> > > The point is that we are checking the "Latest checkpoint location" to
> > > make sure all the WAL was replayed.   We are never going to start the
> > > old standby server.  Rsync is going to copy the missing/changed files.
> > 
> > Only if missing/changed files changed in size, because rsync run with
> > --size-only it does not copy changed files with same size.
>
> I am sorry but I am not understanding.  Step 10.b says:
>
>     10.b Make sure the new standby data directories do not exist
>     
>     Make sure the new standby data directories do not exist or are empty. If
>     initdb was run, delete the standby server data directories.
>
> so the _entire_ new data directory is empty before rsync is run, meaning
> that it is an exact copy of the new master.

Yes, the new data directory at the standby is empty, but you missed the
old data directory at the standby, which is hardlinked by rsync into the
new one, just as on the master.

rsync is run with _three_ arguments and with the --hard-links option:
rsync --archive --delete --hard-links --size-only old_pgdata new_pgdata remote_dir
(remote_dir is the parent directory of the old and new data directories
at the standby)

In this mode rsync compares new_pgdata not only with the new, empty data
directory at the standby, but also with the old data directory at the
standby, and with --size-only it does this comparison only by file
existence and file size.

If a file in the old data directory at the standby differs from the same
file on the master but has the same size, it will be hardlinked into the
new data directory at the standby and not copied from the master.
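The effect of --size-only can be illustrated with a small simulation (this
mimics rsync's decision rule, it is not rsync itself): two files with
identical sizes but different contents, e.g. heap pages differing only in
hint bits, are treated as already up to date.

```python
import os
import tempfile


def size_only_skips(src: str, dst: str) -> bool:
    """Mimic rsync --size-only: dst is treated as up to date iff it
    exists and has the same size as src, regardless of contents."""
    return os.path.exists(dst) and os.path.getsize(src) == os.path.getsize(dst)


d = tempfile.mkdtemp()
src, dst = os.path.join(d, "page_master"), os.path.join(d, "page_standby")
with open(src, "wb") as f:
    f.write(b"A" * 8192)  # 8 kB page with hint bits set on the master
with open(dst, "wb") as f:
    f.write(b"B" * 8192)  # same-size page without the hint bits on the standby

print(size_only_skips(src, dst))  # True -- the stale standby page is not replaced
```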

-- 
Sergey Burladyan



Re: [HACKERS] Broken hint bits (freeze)

From:
Bruce Momjian
Date:
On Tue, Jun 20, 2017 at 06:42:58PM +0300, Sergey Burladyan wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> 
> > On Tue, Jun 20, 2017 at 01:10:26PM +0300, Sergey Burladyan wrote:
> > > Only if missing/changed files changed in size, because rsync run with
> > > --size-only it does not copy changed files with same size.
> >
> > I am sorry but I am not understanding.  Step 10.b says:
> >
> >     10.b Make sure the new standby data directories do not exist
> >     
> >     Make sure the new standby data directories do not exist or are empty. If
> >     initdb was run, delete the standby server data directories.
> >
> > so the _entire_ new data directory is empty before rsync is run, meaning
> > that it is an exact copy of the new master.
> 
> Yes, new data directory at standby is empty, but you missed old data
> directory at standby which is hardlink'ed by rsync into new as at master.

OK, I think I am getting closer to understanding.  Only some files are
hard-linked from the old master to the new master, specifically the user
data files (tables and indexes).

> rsync run with _three_ arguments and with --hard-links option:
> rsync --archive --delete --hard-links --size-only old_pgdata new_pgdata remote_dir
> (remote_dir is parent directory for old and new data at standby)
> 
> In this mode rsync compare not only new_pgdata with new empty data
> directory at standby, but also compare it with old data directory from
> standby and with --size-only it doing this compare only by the file
> existence or file size.

but it is only going to create hard links for files that are already
hard-linked between the old and new masters.  If I am wrong, we are in
big trouble because rsync would not work.

> If file at standby in old data directory is different from same file at
> master, but it have same size, it will be hardlinked into new data
> directory at standby and does not copied from master.

Only if pg_upgrade created the hardlinks, right?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Broken hint bits (freeze)

From:
Sergey Burladyan
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:

> > I am not sure about rsync, in my production server I have for example
> > 111 GB in pg_xlog and if I run rsync for pg_xlog it must send ~ 40GB
> > of new WALs I think.
> >
>
> Isn't the difference between old and new is just the last WAL segment
> file?  What is the source of this difference?

Postgres generates WAL files ahead of the current position (by recycling
old segments), and on the standby too :-(
For example:
=== master ===
$ psql -c 'select pg_current_xlog_insert_location()'
 pg_current_xlog_insert_location
---------------------------------
 4ED09/34A74590
(1 row)

$ ls 9.2/main/pg_xlog/ | awk '/4ED0900000034/,/xxx/ { print }' | wc -l
2262
==============

=== standby ===
$ psql -c 'select pg_last_xlog_replay_location()'
 pg_last_xlog_replay_location
------------------------------
 4ED0A/AECFD7B8
(1 row)

postgres@avi-sql29:~$ ls 9.2/main/pg_xlog/ | awk '/4ED0A000000AE/,/xxx/ { print }' | wc -l
2456
===============

See https://www.postgresql.org/docs/9.2/static/wal-configuration.html
> they are recycled (renamed to become the next segments in the numbered sequence)

-- 
Sergey Burladyan



Re: [HACKERS] Broken hint bits (freeze)

From:
Sergey Burladyan
Date:
Bruce Momjian <bruce@momjian.us> writes:

> On Tue, Jun 20, 2017 at 06:42:58PM +0300, Sergey Burladyan wrote:
> > If file at standby in old data directory is different from same file at
> > master, but it have same size, it will be hardlinked into new data
> > directory at standby and does not copied from master.
>
> Only if pg_upgrade created the hardlinks, right?

Yes, I have not tested rsync itself, but I think that you are right.

-- 
Sergey Burladyan



Re: [HACKERS] Broken hint bits (freeze)

From:
Bruce Momjian
Date:
Sorry, this email from June 16 didn't make it to the lists for some odd
reason so I am reposting it now.  I will apply a patch based on this
email shortly.

What is really odd is that I replied to this email already but the
original wasn't posted.  I think it was something about my email reader.

---------------------------------------------------------------------------

On Fri, Jun 16, 2017 at 10:57:33PM +0300, Sergey Burladyan wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > On Fri, Jun 16, 2017 at 04:33:16AM +0300, Sergey Burladyan wrote:
> > > Bruce Momjian <bruce@momjian.us> writes:
> > The way pg_upgrade uses rsync, the standby never needs to replay the WAL
> > when it starts up because we already copied the changed system tables
> > and hard linked the user data files.
> 
> Oh, it is my fail, I was not run test script completely for current git
> master. In git master it work as expected. But not in previous versions.
> I used this test script and got this result:
> 9.2 -> master: wal_level setting:                    replica
> 9.2 -> 9.6: wal_level setting:                    minimal
> 9.2 -> 9.5: wal_level setting:                    minimal
> 9.2 -> 9.4: Current wal_level setting:            minimal

Wow, thank you again for your excellent research.

> >From git master pg_upgrade is restart new master again after
> pg_resetwal -o, as you said.
> 
> It is from src/bin/pg_upgrade/check.c:176
> void
> issue_warnings(void)
> {
>     /* Create dummy large object permissions for old < PG 9.0? */
>     if (GET_MAJOR_VERSION(old_cluster.major_version) <= 804)
>     {
>         start_postmaster(&new_cluster, true);
>         new_9_0_populate_pg_largeobject_metadata(&new_cluster, false);
>         stop_postmaster(false);
>     }
> 
>     /* Reindex hash indexes for old < 10.0 */
>     if (GET_MAJOR_VERSION(old_cluster.major_version) <= 906)
>     {
>         start_postmaster(&new_cluster, true);
>         old_9_6_invalidate_hash_indexes(&new_cluster, false);
>         stop_postmaster(false);
>     }
> }

Yes, that is _exactly_ the right place to look.  Only in PG 10 do we
restart the new cluster to invalidate hash indexes.  In previous
releases we didn't do the restart.

That didn't matter with the old rsync instructions, but now that we have
removed the start/stop before rsync step, the final WAL status of
pg_upgrade matters.

I suggest applying the attached patch 

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

Attachment: wal.diff

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
new file mode 100644
index 8b9e81e..b79e54a
*** a/src/bin/pg_upgrade/check.c
--- b/src/bin/pg_upgrade/check.c
*************** report_clusters_compatible(void)
*** 174,196 ****
  
  void
! issue_warnings(void)
  {
      /* Create dummy large object permissions for old < PG 9.0? */
      if (GET_MAJOR_VERSION(old_cluster.major_version) <= 804)
-     {
-         start_postmaster(&new_cluster, true);
          new_9_0_populate_pg_largeobject_metadata(&new_cluster, false);
-         stop_postmaster(false);
-     }
  
      /* Reindex hash indexes for old < 10.0 */
      if (GET_MAJOR_VERSION(old_cluster.major_version) <= 906)
-     {
-         start_postmaster(&new_cluster, true);
          old_9_6_invalidate_hash_indexes(&new_cluster, false);
!         stop_postmaster(false);
!     }
  }
--- 174,198 ----
  
  void
! issue_warnings_and_set_wal_level(void)
  {
+     /*
+      * We unconditionally start/stop the new server because pg_resetwal -o
+      * set wal_level to 'minimum'.  If the user is upgrading standby
+      * servers using the rsync instructions, they will need pg_upgrade
+      * to write its final WAL record showing wal_level as 'replica'.
+      */
+     start_postmaster(&new_cluster, true);
+ 
      /* Create dummy large object permissions for old < PG 9.0? */
      if (GET_MAJOR_VERSION(old_cluster.major_version) <= 804)
          new_9_0_populate_pg_largeobject_metadata(&new_cluster, false);
  
      /* Reindex hash indexes for old < 10.0 */
      if (GET_MAJOR_VERSION(old_cluster.major_version) <= 906)
          old_9_6_invalidate_hash_indexes(&new_cluster, false);
  
!     stop_postmaster(false);
  }
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
new file mode 100644
index ca1aa5c..2a9c397
*** a/src/bin/pg_upgrade/pg_upgrade.c
--- b/src/bin/pg_upgrade/pg_upgrade.c
*************** main(int argc, char **argv)
*** 162,168 ****
      create_script_for_cluster_analyze(&analyze_script_file_name);
      create_script_for_old_cluster_deletion(&deletion_script_file_name);
  
!     issue_warnings();
  
      pg_log(PG_REPORT, "\nUpgrade Complete\n");
      pg_log(PG_REPORT, "----------------\n");
--- 162,168 ----
      create_script_for_cluster_analyze(&analyze_script_file_name);
      create_script_for_old_cluster_deletion(&deletion_script_file_name);
  
!     issue_warnings_and_set_wal_level();
  
      pg_log(PG_REPORT, "\nUpgrade Complete\n");
      pg_log(PG_REPORT, "----------------\n");
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
new file mode 100644
index 8fbf8ac..e3a577a
*** a/src/bin/pg_upgrade/pg_upgrade.h
--- b/src/bin/pg_upgrade/pg_upgrade.h
*************** void        output_check_banner(bool live_chec
*** 332,338 ****
 void        check_and_dump_old_cluster(bool live_check);
 void        check_new_cluster(void);
 void        report_clusters_compatible(void);
! void        issue_warnings(void);
 void        output_completion_banner(char *analyze_script_file_name,
                    char *deletion_script_file_name);
 void        check_cluster_versions(void);
--- 332,338 ----
 void        check_and_dump_old_cluster(bool live_check);
 void        check_new_cluster(void);
 void        report_clusters_compatible(void);
! void        issue_warnings_and_set_wal_level(void);
 void        output_completion_banner(char *analyze_script_file_name,
                    char *deletion_script_file_name);
 void        check_cluster_versions(void);


-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Broken hint bits (freeze)

From:
Amit Kapila
Date:
On Tue, Jun 20, 2017 at 7:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Jun 20, 2017 at 7:05 PM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
>> Amit Kapila <amit.kapila16@gmail.com> writes:
>>
>>> On Tue, Jun 20, 2017 at 3:40 PM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
>> I use pg 9.2, and "skipping restartpoint, already performed at" is from
>> src/backend/access/transam/xlog.c:8643;
>> after this statement it returns from CreateRestartPoint() and does not run
>>    8687     CheckPointGuts(lastCheckPoint.redo, flags);
>>
>
> You are right, so it will skip restartpoint in such a case.
>
>>> >> > Uh, as I understand it the rsync is going to copy the missing WAL file
>>> >> > from the new master to the standby, right, and I think pg_controldata
>>> >> > too, so it should be fine.  Have you tested to see if it fails?
>>> >
>>> > It need old WAL files from old version for correct restore heap
>>> > files. New WAL files from new version does not have this information.
>>> >
>>>
>>> So in such a case can we run rsync once before pg_upgrade?
>>
>> I just copy last WAL from stopped old master into running old standby
>> before it shutdown and wait till it replayed. After that standby can
>> issue restartpoint at the same location as in stopped master.
>>
>
> Hmm.  I think we need something that works with lesser effort because
> not all users will be as knowledgeable as you are, so if they make any
> mistakes in copying the file manually, it can lead to problems.  How
> about issuing a notification (XLogArchiveNotifySeg) in shutdown
> checkpoint if archiving is enabled?
>

I have thought more about the above solution and it seems risky to
notify archiver for incomplete WAL segments (which will be possible in
this case as there is no guarantee that Checkpoint record will fill
the segment).  So, it seems to me we should update the document unless
you or someone has some solution to this problem.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Broken hint bits (freeze)

From:
Bruce Momjian
Date:
On Wed, Jun 21, 2017 at 07:49:21PM +0530, Amit Kapila wrote:
> On Tue, Jun 20, 2017 at 7:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Hmm.  I think we need something that works with lesser effort because
> > not all users will be as knowledgeable as you are, so if they make any
> > mistakes in copying the file manually, it can lead to problems.  How
> > about issuing a notification (XLogArchiveNotifySeg) in shutdown
> > checkpoint if archiving is enabled?
> >
> 
> I have thought more about the above solution and it seems risky to
> notify archiver for incomplete WAL segments (which will be possible in
> this case as there is no guarantee that Checkpoint record will fill
> the segment).  So, it seems to me we should update the document unless
> you or someone has some solution to this problem.

The over-arching question is how do we tell users to verify that the WAL
has been replayed on the standby?  I am thinking we would say that for
streaming replication, the "Latest checkpoint location" should match on
the primary and standby, while for log shipping, the standbys should be
exactly one WAL file less than the primary.
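The streaming-replication check described above can be sketched in shell.  This is a minimal illustration, not an official tool: the embedded sample strings mimic the "Latest checkpoint location" line of pg_controldata output, which in practice you would capture by running pg_controldata against each cluster's data directory.

```shell
#!/usr/bin/env bash
# Sketch: compare "Latest checkpoint location" between primary and standby.
# The sample text stands in for real `pg_controldata "$PGDATA"` output.
primary_ctl='Latest checkpoint location:           0/3000060'
standby_ctl='Latest checkpoint location:           0/3000060'

# Extract the LSN after the colon (and any padding spaces).
get_ckpt() {
    printf '%s\n' "$1" | awk -F': *' '/^Latest checkpoint location/ {print $2}'
}

p=$(get_ckpt "$primary_ctl")
s=$(get_ckpt "$standby_ctl")

if [ "$p" = "$s" ]; then
    echo "checkpoint locations match: $p"
else
    echo "MISMATCH: primary=$p standby=$s" >&2
fi
```

Both clusters must be cleanly shut down before running pg_controldata for this comparison to be meaningful.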

As far as I know this is the only remaining open issue.  Sergey, please
verify.  I appreciate the work everyone has done to improve this, and
all the existing fixes have been pushed to all supported branches.  :-)

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Broken hint bits (freeze)

From:
Amit Kapila
Date:
On Wed, Jun 21, 2017 at 10:03 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Jun 21, 2017 at 07:49:21PM +0530, Amit Kapila wrote:
>> On Tue, Jun 20, 2017 at 7:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> > Hmm.  I think we need something that works with lesser effort because
>> > not all users will be as knowledgeable as you are, so if they make any
>> > mistakes in copying the file manually, it can lead to problems.  How
>> > about issuing a notification (XLogArchiveNotifySeg) in shutdown
>> > checkpoint if archiving is enabled?
>> >
>>
>> I have thought more about the above solution and it seems risky to
>> notify archiver for incomplete WAL segments (which will be possible in
>> this case as there is no guarantee that Checkpoint record will fill
>> the segment).  So, it seems to me we should update the document unless
>> you or someone has some solution to this problem.
>
> The over-arching question is how do we tell users to verify that the WAL
> has been replayed on the standby?  I am thinking we would say that for
> streaming replication, the "Latest checkpoint location" should match on
> the primary and standby, while for log shipping, the standbys should be
> exactly one WAL file less than the primary.
>

I am not sure if we can say "standbys should be exactly one WAL file
less than the primary" because a checkpoint can create a few more WAL
segments for future use.  I think to make this work the user needs to
carefully copy just the next WAL segment (the one after the last file on
the standby), which will contain the checkpoint record.  Ideally there
should be some way, either in the form of a tool or of functionality in
the database server, with which this last file can be copied, but in
the absence of that we can at least document this fact.
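The "next WAL segment" arithmetic involved can be sketched as follows.  This is a hedged illustration assuming 16MB segments and the 9.3+ naming scheme, where the low byte of the name runs 00 through FF with no gap; releases before 9.3 (such as the 9.2 clusters in this thread) skip the FF segment, so the wrap case differs there.

```shell
#!/usr/bin/env bash
# Sketch: given the last WAL file present on the standby, compute the
# name of the next segment -- the one expected to contain the shutdown
# checkpoint record.  WAL file names are 24 hex digits:
#   8 timeline + 8 "log" (high word) + 8 segment (low word).
next_wal_segment() {
    local name=$1
    local tli=${name:0:8}
    local log=${name:8:8}
    local seg=${name:16:8}
    # Treat (log, seg) as one 64-bit segment number and add one.
    local segno=$(( 16#$log * 256 + 16#$seg + 1 ))
    printf '%s%08X%08X\n' "$tli" $(( segno / 256 )) $(( segno % 256 ))
}

next_wal_segment 000000010000000000000045   # -> 000000010000000000000046
next_wal_segment 0000000100000000000000FF   # -> 000000010000000100000000
```

The wrap in the second call (FF rolling over into the next high word) is the detail that pre-9.3 naming handles differently.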



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Broken hint bits (freeze)

From:
Bruce Momjian
Date:
On Fri, Jun 23, 2017 at 08:10:17AM +0530, Amit Kapila wrote:
> On Wed, Jun 21, 2017 at 10:03 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Wed, Jun 21, 2017 at 07:49:21PM +0530, Amit Kapila wrote:
> >> On Tue, Jun 20, 2017 at 7:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> > Hmm.  I think we need something that works with lesser effort because
> >> > not all users will be as knowledgeable as you are, so if they make any
> >> > mistakes in copying the file manually, it can lead to problems.  How
> >> > about issuing a notification (XLogArchiveNotifySeg) in shutdown
> >> > checkpoint if archiving is enabled?
> >> >
> >>
> >> I have thought more about the above solution and it seems risky to
> >> notify archiver for incomplete WAL segments (which will be possible in
> >> this case as there is no guarantee that Checkpoint record will fill
> >> the segment).  So, it seems to me we should update the document unless
> >> you or someone has some solution to this problem.
> >
> > The over-arching question is how do we tell users to verify that the WAL
> > has been replayed on the standby?  I am thinking we would say that for
> > streaming replication, the "Latest checkpoint location" should match on
> > the primary and standby, while for log shipping, the standbys should be
> > exactly one WAL file less than the primary.
> >
> 
> I am not sure if we can say "standbys should be exactly one WAL file
> less than the primary" because checkpoint can create few more WAL
> segments for future use.  I think to make this work user needs to
> carefully just copy the next WAL segment (next to the last file in
> standby) which will contain checkpoint record.  Ideally, there should
> be some way either in form of a tool or a functionality in the
> database server with which this last file can be copied but I think in
> the absence of that we can at least document this fact.

I was not clear.  I was not saying there can be only one extra WAL file.
I am saying the "Latest checkpoint location" should be one WAL file
farther on the master.  I think the big problem is that we need a full
replay of that WAL file, not just having it one less than the master.  I
have no idea how to explain that.  It is easy for streaming replication
since the "Latest checkpoint location" should match, which is simple.

Also, we need something that can be backpatched.  

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Broken hint bits (freeze)

From:
Sergey Burladyan
Date:
Bruce Momjian <bruce@momjian.us> writes:

> On Wed, Jun 21, 2017 at 07:49:21PM +0530, Amit Kapila wrote:
> > On Tue, Jun 20, 2017 at 7:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > Hmm.  I think we need something that works with lesser effort because
> > > not all users will be as knowledgeable as you are, so if they make any
> > > mistakes in copying the file manually, it can lead to problems.  How
> > > about issuing a notification (XLogArchiveNotifySeg) in shutdown
> > > checkpoint if archiving is enabled?
> > >
> > 
> > I have thought more about the above solution and it seems risky to
> > notify archiver for incomplete WAL segments (which will be possible in
> > this case as there is no guarantee that Checkpoint record will fill
> > the segment).  So, it seems to me we should update the document unless
> > you or someone has some solution to this problem.

> As far as I know this is the only remaining open issue.  Sergey, please
> verify.  I appreciate the work everyone has done to improve this, and
> all the existing fixes have been pushed to all supported branches.  :-)

Yes, thank you all for your help!

Yes, this is the last issue with checkpoints that I know of: how to ensure
that the standby syncs all shared buffers to disk on shutdown.

I am thinking about enforcing a restartpoint on shutdown, like:
src/backend/access/transam/xlog.c
-   8639     if (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
-   8640         XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo))
-   8641     {
+   8639     if (!(flags & CHECKPOINT_IS_SHUTDOWN) && (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
+   8640         XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo)))
+   8641     {

But I have not read the source closely yet and am not sure about this solution.


PS:
I successfully upgraded last night from 9.2 to 9.4 and found another issue :-)

It is about hash indexes and promote:
1. create master
2. create standby from it
3. create unlogged table and hash index like:
   create unlogged table test (id int primary key, v text);
   create index on test using hash (id);
4. stop master
5. promote standby

now, if you try to upgrade this newly promoted master, pg_upgrade will stop
on this hash index:
error while creating link for relation "public.test_id_idx" ("s/9.2/base/16384/16393" to "m/9.4/base/16422/16393"): No such file or directory
Failure, exiting

I touched this file (s/9.2/base/16384/16393), reran pg_upgrade from
scratch, and it completed successfully.

-- 
Sergey Burladyan



Re: [HACKERS] Broken hint bits (freeze)

From:
Sergey Burladyan
Date:
Sergey Burladyan <eshkinkot@gmail.com> writes:

> 1. create master
> 2. create standby from it
> 3. create unlogged table and hash index like:
>  create unlogged table test (id int primary key, v text);
>  create index on test using hash (id);
> 3. stop master
> 4. promote standby
>
> now, if you try to upgrade this new promoted master pg_upgrade will stop
> on this hash index:
> error while creating link for relation "public.test_id_idx" ("s/9.2/base/16384/16393" to "m/9.4/base/16422/16393"): No such file or directory
> Failure, exiting
>
> I touch this file (s/9.2/base/16384/16393) and rerun pg_upgrade from
> scratch and it complete successfully.

I missed attaching the test script for it.

-- 
Sergey Burladyan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachments

Re: [HACKERS] Broken hint bits (freeze)

From:
Amit Kapila
Date:
On Fri, Jun 23, 2017 at 8:47 PM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>
>> On Wed, Jun 21, 2017 at 07:49:21PM +0530, Amit Kapila wrote:
>> > On Tue, Jun 20, 2017 at 7:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> > > Hmm.  I think we need something that works with lesser effort because
>> > > not all users will be as knowledgeable as you are, so if they make any
>> > > mistakes in copying the file manually, it can lead to problems.  How
>> > > about issuing a notification (XLogArchiveNotifySeg) in shutdown
>> > > checkpoint if archiving is enabled?
>> > >
>> >
>> > I have thought more about the above solution and it seems risky to
>> > notify archiver for incomplete WAL segments (which will be possible in
>> > this case as there is no guarantee that Checkpoint record will fill
>> > the segment).  So, it seems to me we should update the document unless
>> > you or someone has some solution to this problem.
>
>> As far as I know this is the only remaining open issue.  Sergey, please
>> verify.  I appreciate the work everyone has done to improve this, and
>> all the existing fixes have been pushed to all supported branches.  :-)
>
> Yes, thank you all for your help!
>
> Yes, this is last issue with checkpoint that I know, how to ensure that
> standby sync all shared buffers into disk on it shutdown.
>

I think if we had a command like ALTER SYSTEM FLUSH SHARED BUFFERS, it
would have been helpful for what you need here.  You could have run it
before shutdown.

> I thinking about enforce restartpoint on shutdown, like:
> src/backend/access/transam/xlog.c
> -   8639     if (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
> -   8640         XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo))
> -   8641     {
> +   8639     if (!(flags & CHECKPOINT_IS_SHUTDOWN) && (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
> +   8640         XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo)))
> +   8641     {
>
> But I still not read source and not sure about this solution.
>

It might serve your purpose, but I think it will not be safe to always
perform a restartpoint at shutdown.  It will delete WAL files which
should be deleted only after the actual checkpoint record is received
from the master.

>
> PS:
> I successfully upgraded last night from 9.2 to 9.4 and find other issue :-)
>
> It is about hash index and promote:
> 1. create master
> 2. create standby from it
> 3. create unlogged table and hash index like:
>  create unlogged table test (id int primary key, v text);
>  create index on test using hash (id);
> 3. stop master
> 4. promote standby
>
> now, if you try to upgrade this new promoted master pg_upgrade will stop
> on this hash index:
> error while creating link for relation "public.test_id_idx" ("s/9.2/base/16384/16393" to "m/9.4/base/16422/16393"): No such file or directory
> Failure, exiting
>

I am not sure if this is a problem, because in the version you are
using hash indexes are not WAL-logged, so their creation will not be
replicated on the standby; the error seems to be expected.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Broken hint bits (freeze)

From:
Amit Kapila
Date:
On Fri, Jun 23, 2017 at 8:18 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Jun 23, 2017 at 08:10:17AM +0530, Amit Kapila wrote:
>> On Wed, Jun 21, 2017 at 10:03 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> > On Wed, Jun 21, 2017 at 07:49:21PM +0530, Amit Kapila wrote:
>> >> On Tue, Jun 20, 2017 at 7:24 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >> > Hmm.  I think we need something that works with lesser effort because
>> >> > not all users will be as knowledgeable as you are, so if they make any
>> >> > mistakes in copying the file manually, it can lead to problems.  How
>> >> > about issuing a notification (XLogArchiveNotifySeg) in shutdown
>> >> > checkpoint if archiving is enabled?
>> >> >
>> >>
>> >> I have thought more about the above solution and it seems risky to
>> >> notify archiver for incomplete WAL segments (which will be possible in
>> >> this case as there is no guarantee that Checkpoint record will fill
>> >> the segment).  So, it seems to me we should update the document unless
>> >> you or someone has some solution to this problem.
>> >
>> > The over-arching question is how do we tell users to verify that the WAL
>> > has been replayed on the standby?  I am thinking we would say that for
>> > streaming replication, the "Latest checkpoint location" should match on
>> > the primary and standby, while for log shipping, the standbys should be
>> > exactly one WAL file less than the primary.
>> >
>>
>> I am not sure if we can say "standbys should be exactly one WAL file
>> less than the primary" because checkpoint can create few more WAL
>> segments for future use.  I think to make this work user needs to
>> carefully just copy the next WAL segment (next to the last file in
>> standby) which will contain checkpoint record.  Ideally, there should
>> be some way either in form of a tool or a functionality in the
>> database server with which this last file can be copied but I think in
>> the absence of that we can at least document this fact.
>
> I was not clear.  I was not saying there can be only one extra WAL file.
> I am saying the "Latest checkpoint location" should be one WAL file
> farther on the master.  I think the big problem is that we need a full
> replay of that WAL file, not just having it one less than the master.
>

If the user has properly shutdown, then that last file should only
have checkpoint record, is it safe to proceed with upgrade without
actually copying that file?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Broken hint bits (freeze)

From:
Bruce Momjian
Date:
On Sat, Jun 24, 2017 at 09:19:10AM +0530, Amit Kapila wrote:
> > I successfully upgraded last night from 9.2 to 9.4 and find other issue :-)
> >
> > It is about hash index and promote:
> > 1. create master
> > 2. create standby from it
> > 3. create unlogged table and hash index like:
> >  create unlogged table test (id int primary key, v text);
> >  create index on test using hash (id);
> > 3. stop master
> > 4. promote standby
> >
> > now, if you try to upgrade this new promoted master pg_upgrade will stop
> > on this hash index:
> > error while creating link for relation "public.test_id_idx" ("s/9.2/base/16384/16393" to "m/9.4/base/16422/16393"): No such file or directory
> > Failure, exiting
> >
> 
> I am not sure if this is a problem because in the version you are
> trying hash indexes are not WAL-logged and the creation of same will
> not be replicated on standby, so the error seems to be expected.

Well, it certainly should not error out like this.  I have not seen such
a failure report before.

I think the fundamental problem is that an unlogged object
(pg_class.relpersistence='u') creates a file on the master, but doesn't
create anything on the standby since it is never transmitted over the
WAL (assuming the object is created after the base backup).

I assume the standby creates them as empty when it is promoted to
primary and someone tries to access the object.  I wonder if I need to
add a boolean to each object to record if it is unlogged, and allow
copy/link to silently fail in such cases.  Does that make sense?
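The copy/link tolerance floated here could look roughly like the following shell sketch.  The function name, paths, and the is_unlogged flag are hypothetical, purely to illustrate the "silently skip a missing file for an unlogged relation" idea; pg_upgrade's real transfer code is in C.

```shell
#!/usr/bin/env bash
# Sketch: link one relation file from the old cluster to the new one,
# but tolerate a missing source file when the relation is unlogged
# (its main fork may legitimately not exist on a promoted standby).
transfer_relfile() {
    local src=$1 dst=$2 is_unlogged=$3
    if [ ! -e "$src" ]; then
        if [ "$is_unlogged" = "yes" ]; then
            echo "skipping missing file for unlogged relation: $src"
            return 0        # not an error: file was never replicated
        fi
        echo "error while creating link ($src to $dst): No such file or directory" >&2
        return 1            # missing file for a logged relation is fatal
    fi
    ln "$src" "$dst"
}
```

The open question this sketch dodges is how pg_upgrade would learn the relpersistence flag for each file; that information would have to be carried in the dump it already takes from pg_class.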

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Broken hint bits (freeze)

From:
Bruce Momjian
Date:
On Sat, Jun 24, 2017 at 09:24:21AM +0530, Amit Kapila wrote:
> > I was not clear.  I was not saying there can be only one extra WAL file.
> > I am saying the "Latest checkpoint location" should be one WAL file
> > farther on the master.  I think the big problem is that we need a full
> > replay of that WAL file, not just having it one less than the master.
> >
> 
> If the user has properly shutdown, then that last file should only
> have checkpoint record, is it safe to proceed with upgrade without
> actually copying that file?

Yes, but how do we know they processed all the records in the
second-to-last WAL file (in WAL-shipping mode)?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Broken hint bits (freeze)

From:
Bruce Momjian
Date:
On Fri, Jun 23, 2017 at 06:17:47PM +0300, Sergey Burladyan wrote:
> PS:
> I successfully upgraded last night from 9.2 to 9.4 and find other issue :-)
> 
> It is about hash index and promote:
> 1. create master
> 2. create standby from it
> 3. create unlogged table and hash index like:
>  create unlogged table test (id int primary key, v text);
>  create index on test using hash (id);
> 3. stop master
> 4. promote standby
> 
> now, if you try to upgrade this new promoted master pg_upgrade will stop
> on this hash index:
> error while creating link for relation "public.test_id_idx" ("s/9.2/base/16384/16393" to "m/9.4/base/16422/16393"): No such file or directory
> Failure, exiting
> 
> I touch this file (s/9.2/base/16384/16393) and rerun pg_upgrade from
> scratch and it complete successfully.

Sergey, can you please test if the table "test" is not unlogged, does
pg_upgrade still fail on the hash index file?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Broken hint bits (freeze)

From:
Bruce Momjian
Date:
On Wed, Jun 28, 2017 at 10:11:35PM -0400, Bruce Momjian wrote:
> On Fri, Jun 23, 2017 at 06:17:47PM +0300, Sergey Burladyan wrote:
> > PS:
> > I successfully upgraded last night from 9.2 to 9.4 and find other issue :-)
> > 
> > It is about hash index and promote:
> > 1. create master
> > 2. create standby from it
> > 3. create unlogged table and hash index like:
> >  create unlogged table test (id int primary key, v text);
> >  create index on test using hash (id);
> > 3. stop master
> > 4. promote standby
> > 
> > now, if you try to upgrade this new promoted master pg_upgrade will stop
> > on this hash index:
> > error while creating link for relation "public.test_id_idx" ("s/9.2/base/16384/16393" to "m/9.4/base/16422/16393"): No such file or directory
> > Failure, exiting
> > 
> > I touch this file (s/9.2/base/16384/16393) and rerun pg_upgrade from
> > scratch and it complete successfully.
> 
> Sergey, can you please test if the table "test" is not unlogged, does
> pg_upgrade still fail on the hash index file?

I was able to reproduce this failure on my server.  :-)

What I found is that the problem is larger than I thought.  Sergey is
correct that pg_upgrade fails because there is no hash file associated
with the unlogged table, but in fact a simple access of the unlogged
table with a hash index generates an error:
	test=> SELECT * FROM t_u_hash;
	ERROR:  could not open file "base/16384/16392": No such file or directory

What is interesting is that this is the only combination that generates
an error.  An unlogged table with a btree index or a logged table with a
hash index are fine, e.g.:

                   List of relations
         Schema |   Name    | Type  |  Owner
        --------+-----------+-------+----------
         public | t_btree   | table | postgres
         public | t_hash    | table | postgres
         public | t_u_btree | table | postgres
fail-->  public | t_u_hash  | table | postgres

This doesn't fail on PG 10 since we WAL-log hash indexes.

I think we have two questions:

1.  do we fix this in the server
2.  if not, do we fix pg_upgrade

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Broken hint bits (freeze)

From:
Amit Kapila
Date:
On Fri, Jun 30, 2017 at 6:26 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Jun 28, 2017 at 10:11:35PM -0400, Bruce Momjian wrote:
>> On Fri, Jun 23, 2017 at 06:17:47PM +0300, Sergey Burladyan wrote:
>> > PS:
>> > I successfully upgraded last night from 9.2 to 9.4 and find other issue :-)
>> >
>> > It is about hash index and promote:
>> > 1. create master
>> > 2. create standby from it
>> > 3. create unlogged table and hash index like:
>> >  create unlogged table test (id int primary key, v text);
>> >  create index on test using hash (id);
>> > 3. stop master
>> > 4. promote standby
>> >
>> > now, if you try to upgrade this new promoted master pg_upgrade will stop
>> > on this hash index:
>> > error while creating link for relation "public.test_id_idx" ("s/9.2/base/16384/16393" to "m/9.4/base/16422/16393"): No such file or directory
>> > Failure, exiting
>> >
>> > I touch this file (s/9.2/base/16384/16393) and rerun pg_upgrade from
>> > scratch and it complete successfully.
>>
>> Sergey, can you please test if the table "test" is not unlogged, does
>> pg_upgrade still fail on the hash index file?
>
> I was able to reproduce this failure on my server.  :-)
>
> What I found is that the problem is larger than I thought.  Sergey is
> correct that pg_upgrade fails because there is no hash file associated
> with the unlogged table, but in fact a simple access of the unlogged
> table with a hash index generates an error:
>
>         test=> SELECT * FROM t_u_hash;
>         ERROR:  could not open file "base/16384/16392": No such file or directory
>
> What is interesting is that this is the only combination that generates
> an error.
>

Yes, and that is because normally we log the creation of the init fork
for unlogged relations (both heap and index; refer to btbuildempty for
indexes and heap_create_init_fork for heaps), but for hash indexes
prior to 10 we don't log the init fork.

>  An unlogged table with a btree index or a logged table with a
> hash index are fine, e.g.:
>
>                    List of relations
>          Schema |   Name    | Type  |  Owner
>         --------+-----------+-------+----------
>          public | t_btree   | table | postgres
>          public | t_hash    | table | postgres
>          public | t_u_btree | table | postgres
> fail-->  public | t_u_hash  | table | postgres
>
> This doesn't fail on PG 10 since we WAL-log hash indexes.
>
> I think we have two questions:
>
> 1.  do we fix this in the server

If we want to fix this in the server then we need to log (write WAL)
the init fork for hash indexes.

> 2.  if not, do we fix pg_upgrade
>

I think even if we provide a fix in pg_upgrade, it might not suffice,
because this problem can also occur if the user simply promotes a
standby server (<= 9.6) to master when there is an unlogged table with
a hash index on it.

I think we should fix the server.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Broken hint bits (freeze)

From:
Amit Kapila
Date:
On Thu, Jun 29, 2017 at 6:57 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Sat, Jun 24, 2017 at 09:24:21AM +0530, Amit Kapila wrote:
>> > I was not clear.  I was not saying there can be only one extra WAL file.
>> > I am saying the "Latest checkpoint location" should be one WAL file
>> > farther on the master.  I think the big problem is that we need a full
>> > replay of that WAL file, not just having it one less than the master.
>> >
>>
>> If the user has properly shutdown, then that last file should only
>> have checkpoint record, is it safe to proceed with upgrade without
>> actually copying that file?
>
> Yes, but how do we know they processed all the records in the
> second-to-last WAL file (in WAL shipping mode).
>

I don't see any straightforward way to know that, except for the user
getting the latest WAL location (replay or flush) and then verifying it
against the last WAL file (maybe by using something like pg_waldump).  I
think the other problem, as pointed out by Sergey upthread, is how to
ensure that all the buffers that contain changes are flushed to disk.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com