ERROR could not access transaction/Could not open file pg_commit_ts

From Jeremy Finzel
Subject ERROR could not access transaction/Could not open file pg_commit_ts
Msg-id CAMa1XUghTLmd7sbEfiJ2HOVZLfLh7LPZZBR+K6PjHwLOfqUrHQ@mail.gmail.com
List pgsql-general
Hello -

Here is our cluster setup:

cluster_a 9.5.11 Ubuntu 16.04.4 LTS
 --> cluster_b (streamer) 9.5.11 Ubuntu 16.04.4 LTS
 --> cluster_c (streamer) 9.5.11 Ubuntu 16.04.4 LTS

Very recently, we started seeing these errors when running a query on a specific table on the streamer:

2018-03-09 08:28:16.280 CST,"uname","foo",18692,"0.0.0.0:0",5aa29292.4904,4,"SELECT",2018-03-09 07:56:34 CST,18/15992,0,ERROR,58P01,"could not access status of transaction 1035047007","Could not open file ""pg_commit_ts/9A45"": No such file or directory."
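
As a sanity check, the file name in the error can be tied back to the transaction ID. The following is only a sketch under assumptions that are not stated anywhere in the log: the default 8 kB block size, a 10-byte commit timestamp entry (8-byte timestamp plus 2-byte replication origin ID), and 32 pages per SLRU segment file. The function name is hypothetical.

```python
# Assumed layout of pg_commit_ts (not taken from the log above):
BLOCK_SIZE = 8192          # default PostgreSQL block size, in bytes
ENTRY_SIZE = 10            # 8-byte timestamp + 2-byte replication origin ID
PAGES_PER_SEGMENT = 32     # pages stored in each segment file


def commit_ts_segment(xid, block_size=BLOCK_SIZE):
    """Return the pg_commit_ts segment file name expected to hold xid."""
    xacts_per_page = block_size // ENTRY_SIZE   # 819 with the defaults
    page = xid // xacts_per_page
    segment = page // PAGES_PER_SEGMENT
    return format(segment, "04X")               # four upper-case hex digits


print(commit_ts_segment(1035047007))  # 9A45
```

Under those assumptions, transaction 1035047007 maps to segment 9A45, which matches the file named in the error, so the lookup itself appears to be pointing at the right place and the segment is simply absent.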

A little history on the cluster:
  • The most recent change we made was a point-release upgrade from 9.5.5 to 9.5.11 on the master, and from 9.5.9 to 9.5.11 on the two streamers.
  • It is a very high WAL traffic reporting system.
  • We have synchronous_commit set to off.  It's possible this has bitten us and we are only now seeing the effects, but there have been no crashes since the table in question was created.
  • We have run pg_repack on many tables on this cluster, but not in over a month.
  • We had a similar error about a missing pg_commit_ts file over a year ago, after an actual crash.  We had serious issues getting the cluster to start, and had to resort to recreating the missing pg_commit_ts file with null bytes (IIRC, we had a snapshot of the system which still showed the file), which worked but left us questioning what really caused the issue.
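
To make the last point concrete, here is a rough sketch of what recreating a segment with null bytes could look like. It is not the exact procedure we used. It assumes a full-size segment of 32 pages of 8 kB each (the default block size), the function name is hypothetical, and the file name in the demo is just a placeholder. The demo writes into a scratch directory rather than a real data directory.

```python
import os
import tempfile

# Assumed segment size: 32 pages of 8 kB each (default block size).
SEGMENT_BYTES = 32 * 8192


def create_zeroed_segment(directory, name):
    """Create a zero-filled segment file, refusing to overwrite an existing one."""
    path = os.path.join(directory, name)
    # Mode "xb" raises FileExistsError if the file is already there.
    with open(path, "xb") as f:
        f.write(b"\x00" * SEGMENT_BYTES)
    return path


# Demo against a scratch directory, not a live cluster.
with tempfile.TemporaryDirectory() as scratch:
    demo_path = create_zeroed_segment(scratch, "9A45")  # placeholder name
    demo_size = os.path.getsize(demo_path)

print(demo_size)  # 262144
```

Anything like this against a real cluster would only make sense with the server stopped and a copy of the data directory taken first. A zero-filled segment hides the error but does not restore the lost commit timestamps.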

The table that is causing the error was created on 2018-02-15 and has been used in production without problems since then. It is fed by pglogical replication (v. 2.1.1 on the subscriber) from a system running 9.6.1 and pglogical v. 1.2.1.  The point-release upgrade from the earlier 9.5 versions took place after the table was created.

We only started seeing errors in the past 12 hours, though.  The table was autovacuumed on the master at 2018-03-08 18:18:15.532137-06, about 3 hours before the first user query errored.  However, 2 hours after the autovacuum, another user query ran successfully against the table, so I am not sure whether the two are related.

Any insight/ideas would be much appreciated!

Thanks,
Jeremy
