Обсуждение: 'SGT DETAIL: Could not open file "pg_clog/05DC": No such file or directory' - what to do now?

Поиск
Список
Период
Сортировка

'SGT DETAIL: Could not open file "pg_clog/05DC": No such file or directory' - what to do now?

От
Tomasz Chmielewski
Дата:
On 1st May, I saw this message in my postgres log:

May  2 06:52:02 db10 postgres[3590]: [29829-1] 2011-05-02 06:52:02 SGT
ERROR:  could not access status of transaction 1573786613
May  2 06:52:02 db10 postgres[3590]: [29829-2] 2011-05-02 06:52:02 SGT
DETAIL:  Could not open file "pg_clog/05DC": No such file or directory.
May  2 06:52:02 db10 postgres[3590]: [29829-3] 2011-05-02 06:52:02 SGT
STATEMENT:  SELECT 1 FROM core_bill_id_seq FOR UPDATE


Now, I'm not sure what I should do about it. Database behaves "funny",
some inserts do not work.


Searching the internet suggests that:

1) such errors could happen with PostgreSQL 8.1.x under heavy load -
this server is under constant heavy load, but runs 8.3.14 on Debian Lenny

2) I should simply create a 256k pg_clog/05DC empty file with dd - I
wouldn't like to do it, without first knowing what happened, and if it's
really "good fix"

3) some tables can be corrupted - how can I check that? pg_dump works
fine and doesn't report any errors

4) I may have hardware problems - but this server is running for almost
1 year now, is super stable - servers with hardware issues are likely to
show some issues as well

5) database corrupted due to a server crash - this server never crashed


How should I continue from that (assuming I can't reliably verify if
something wrong is going with the hardware or not - points 4 and 5)?


--
Tomasz Chmielewski
http://wpkg.org

Re: 'SGT DETAIL: Could not open file "pg_clog/05DC": No such file or directory' - what to do now?

От
"Kevin Grittner"
Дата:
Tomasz Chmielewski <mangoo@wpkg.org> wrote:

> On 1st May, I saw this message in my postgres log:
>
> May  2 06:52:02 db10 postgres[3590]: [29829-1] 2011-05-02 06:52:02
> SGT ERROR:  could not access status of transaction 1573786613
> May  2 06:52:02 db10 postgres[3590]: [29829-2] 2011-05-02 06:52:02
> SGT DETAIL:  Could not open file "pg_clog/05DC": No such file or
> directory.
> May  2 06:52:02 db10 postgres[3590]: [29829-3] 2011-05-02 06:52:02
> SGT STATEMENT:  SELECT 1 FROM core_bill_id_seq FOR UPDATE

You saw errors on the 1st dated for the 2nd?

> Now, I'm not sure what I should do about it. Database behaves
> "funny", some inserts do not work.

Define "funny".  What happens when you attempt the inserts which
don't work.  (Copy and paste any error messages.)  Is it all tables?
All inserts to one table?  Any other discernible pattern?

> 4) I may have hardware problems - but this server is running for
> almost 1 year now, is super stable - servers with hardware issues
> are likely to show some issues as well

Does the server have ECC memory?  Do you have SMART monitoring of
the storage system, or something similar?  Any errors showing in any
system logs?

-Kevin

Re: 'SGT DETAIL: Could not open file "pg_clog/05DC": No such file or directory' - what to do now?

От
Tomasz Chmielewski
Дата:
On 04.05.2011 20:14, Kevin Grittner wrote:
> Tomasz Chmielewski<mangoo@wpkg.org>  wrote:
>
>> On 1st May, I saw this message in my postgres log:
>>
>> May  2 06:52:02 db10 postgres[3590]: [29829-1] 2011-05-02 06:52:02
>> SGT ERROR:  could not access status of transaction 1573786613
>> May  2 06:52:02 db10 postgres[3590]: [29829-2] 2011-05-02 06:52:02
>> SGT DETAIL:  Could not open file "pg_clog/05DC": No such file or
>> directory.
>> May  2 06:52:02 db10 postgres[3590]: [29829-3] 2011-05-02 06:52:02
>> SGT STATEMENT:  SELECT 1 FROM core_bill_id_seq FOR UPDATE
>
> You saw errors on the 1st dated for the 2nd?

My bad; it was 2nd, not 1st.


>> Now, I'm not sure what I should do about it. Database behaves
>> "funny", some inserts do not work.
>
> Define "funny".  What happens when you attempt the inserts which
> don't work.  (Copy and paste any error messages.)  Is it all tables?
> All inserts to one table?  Any other discernible pattern?

This repeated many times:

/var/log/postgresql/postgresql_log.1:May  3 18:24:49 db10 postgres[21363]: [26999-1] 2011-05-03 18:24:49 SGT ERROR:
couldnot access status of transaction 1573786613 
/var/log/postgresql/postgresql_log.1-May  3 18:24:49 db10 postgres[21363]: [26999-2] 2011-05-03 18:24:49 SGT DETAIL:
Couldnot open file "pg_clog/05DC": No such file or directory. 
/var/log/postgresql/postgresql_log.1-May  3 18:24:49 db10 postgres[21363]: [26999-3] 2011-05-03 18:24:49 SGT STATEMENT:
SELECT 1 FROM core_wot_seq FOR UPDATE 


Today I have this:

/var/log/postgresql/postgresql_log:May  4 22:43:44 db10 postgres[15773]: [555-1] 2011-05-04 22:43:44 SGT ERROR:  could
notaccess status of transaction 1612337841 
/var/log/postgresql/postgresql_log-May  4 22:43:44 db10 postgres[15773]: [555-2] 2011-05-04 22:43:44 SGT DETAIL:  Could
notopen file "pg_clog/0601": No such file or directory. 
/var/log/postgresql/postgresql_log-May  4 22:43:44 db10 postgres[15773]: [555-3] 2011-05-04 22:43:44 SGT STATEMENT:
SELECT1 FROM core_wbl_seq FOR UPDATE 

Only such two (different) occurrences; repeated 10-20 times; two different tables.

The system is used heavily, so it would show lots of other errors in other places if it was some major fault.
Which does not include some "minor" fault.


>> 4) I may have hardware problems - but this server is running for
>> almost 1 year now, is super stable - servers with hardware issues
>> are likely to show some issues as well
>
> Does the server have ECC memory?  Do you have SMART monitoring of
> the storage system, or something similar?  Any errors showing in any
> system logs?

No errors at all anywhere (dmesg, syslog etc.).

It's ProLiant DL180 G6, and I think it should have ECC. At least I see it being mentioned in dmidecode.

Assuming we can't determine what caused the corruption (bitflip, kernel bug, bad RAM, postgres bug, silent HDD error
etc.)- how should I best recover from this? 


--
Tomasz Chmielewski
http://wpkg.org

Re: 'SGT DETAIL: Could not open file "pg_clog/05DC": No such file or directory' - what to do now?

От
"Kevin Grittner"
Дата:
Tomasz Chmielewski <mangoo@wpkg.org> wrote:

> This repeated many times:
>
> /var/log/postgresql/postgresql_log.1:May  3 18:24:49 db10
> postgres[21363]: [26999-1] 2011-05-03 18:24:49 SGT ERROR:  could
> not access status of transaction 1573786613
> /var/log/postgresql/postgresql_log.1-May  3 18:24:49 db10
< postgres[21363]: [26999-2] 2011-05-03 18:24:49 SGT DETAIL:  Could
> not open file "pg_clog/05DC": No such file or directory.
> /var/log/postgresql/postgresql_log.1-May  3 18:24:49 db10
> postgres[21363]: [26999-3] 2011-05-03 18:24:49 SGT STATEMENT:
> SELECT 1 FROM core_wot_seq FOR UPDATE
>
>
> Today I have this:
>
> /var/log/postgresql/postgresql_log:May  4 22:43:44 db10
> postgres[15773]: [555-1] 2011-05-04 22:43:44 SGT ERROR:  could not
> access status of transaction 1612337841
> /var/log/postgresql/postgresql_log-May  4 22:43:44 db10
> postgres[15773]: [555-2] 2011-05-04 22:43:44 SGT DETAIL:  Could
> not open file "pg_clog/0601": No such file or directory.
> /var/log/postgresql/postgresql_log-May  4 22:43:44 db10
> postgres[15773]: [555-3] 2011-05-04 22:43:44 SGT STATEMENT:
> SELECT 1 FROM core_wbl_seq FOR UPDATE
>
> Only such two (different) occurrences; repeated 10-20 times; two
> different tables.

> how should I best recover from this?

If you hadn't already said you were running on 8.3.14 I would have
wondered whether you had used pg_migrator/pg_upgrade.  As it is, I'm
going to admit I'm out of my depth and hope someone else jumps in
here.

-Kevin

On Wed, May 4, 2011 at 12:37 PM, Tomasz Chmielewski <mangoo@wpkg.org> wrote:
> /var/log/postgresql/postgresql_log.1:May  3 18:24:49 db10 postgres[21363]: [26999-1] 2011-05-03 18:24:49 SGT ERROR:
 couldnot access status of transaction 1573786613 
> /var/log/postgresql/postgresql_log.1-May  3 18:24:49 db10 postgres[21363]: [26999-2] 2011-05-03 18:24:49 SGT DETAIL:
 Couldnot open file "pg_clog/05DC": No such file or directory. 
> /var/log/postgresql/postgresql_log.1-May  3 18:24:49 db10 postgres[21363]: [26999-3] 2011-05-03 18:24:49 SGT
STATEMENT: SELECT 1 FROM core_wot_seq FOR UPDATE 

So a pg_clog file disappeared.

> Today I have this:
>
> /var/log/postgresql/postgresql_log:May  4 22:43:44 db10 postgres[15773]: [555-1] 2011-05-04 22:43:44 SGT ERROR:
 couldnot access status of transaction 1612337841 
> /var/log/postgresql/postgresql_log-May  4 22:43:44 db10 postgres[15773]: [555-2] 2011-05-04 22:43:44 SGT DETAIL:
 Couldnot open file "pg_clog/0601": No such file or directory. 
> /var/log/postgresql/postgresql_log-May  4 22:43:44 db10 postgres[15773]: [555-3] 2011-05-04 22:43:44 SGT STATEMENT:
 SELECT1 FROM core_wbl_seq FOR UPDATE 

Then another pg_clog file disappeared.

Is it possible there's some rogue process deleting files in pg_clog
somehow?  Have you run an fsck on this drive to make sure it's not got
any file system errors?

Re: 'SGT DETAIL: Could not open file "pg_clog/05DC": No such file or directory' - what to do now?

От
Tomasz Chmielewski
Дата:
On 04.05.2011 21:50, Scott Marlowe wrote:

> Then another pg_clog file disappeared.
>
> Is it possible there's some rogue process deleting files in pg_clog
> somehow?

I don't think.


> Have you run an fsck on this drive to make sure it's not got
> any file system errors?

Also, don't think there is any corruption here. AFAIR, this system never crashed.
Could be there is some silent corruption though - but if there really was one, we would likely see the kernel
complaining,stale files elsewhere, and so on. 

Without such clues on filesystem corruption, I can't afford downtime.


I didn't mention, but the application first talks to pgpool, which talks to two database servers (i.e. inserts to
both).

The real fun begins here - this is from two different servers:

db10:/var/log/postgresql# zgrep "No such" *
postgresql_log:May  4 18:24:28 db10 postgres[15751]: [23-2] 2011-05-04 18:24:28 SGT DETAIL:  Could not open file
"pg_clog/0601":No such file or directory. 
postgresql_log:May  4 22:43:44 db10 postgres[15773]: [555-2] 2011-05-04 22:43:44 SGT DETAIL:  Could not open file
"pg_clog/0601":No such file or directory. 
postgresql_log:May  4 22:44:30 db10 postgres[15791]: [1841-2] 2011-05-04 22:44:30 SGT DETAIL:  Could not open file
"pg_clog/0601":No such file or directory. 
postgresql_log:May  4 22:55:53 db10 postgres[15741]: [4114-2] 2011-05-04 22:55:53 SGT DETAIL:  Could not open file
"pg_clog/0601":No such file or directory. 


db20:/var/log/postgresql# zgrep "No such" *
postgresql_log:May  4 18:24:28 db20 postgres[27114]: [2-2] 2011-05-04 18:24:28 SGT DETAIL:  Could not open file
"pg_clog/0601":No such file or directory. 
postgresql_log:May  4 22:43:44 db20 postgres[27116]: [2-2] 2011-05-04 22:43:44 SGT DETAIL:  Could not open file
"pg_clog/0601":No such file or directory. 
postgresql_log:May  4 22:44:30 db20 postgres[27138]: [2-2] 2011-05-04 22:44:30 SGT DETAIL:  Could not open file
"pg_clog/0601":No such file or directory. 
postgresql_log:May  4 22:55:53 db20 postgres[27104]: [2-2] 2011-05-04 22:55:53 SGT DETAIL:  Could not open file
"pg_clog/0601":No such file or directory. 


I can't exclude some corruption happened much earlier on db10; the whole database (as binary files) was copied to db20
almost2 months ago. 

Why would it start showing pg_clog files missing just 2 days ago, and not earlier? Hmm.


--
Tomasz Chmielewski
http://wpkg.org

Re: 'SGT DETAIL: Could not open file "pg_clog/05DC": No such file or directory' - what to do now?

От
Tomasz Chmielewski
Дата:
On 04.05.2011 22:13, Tomasz Chmielewski wrote:
> On 04.05.2011 21:50, Scott Marlowe wrote:
>
>> Then another pg_clog file disappeared.

OK, I have:

bookstor=# SELECT * FROM core_wot_seq;
  sequence_name   | last_value | increment_by |      max_value      | min_value | cache_value | log_cnt | is_cycled |
is_called 

------------------+------------+--------------+---------------------+-----------+-------------+---------+-----------+-----------
 core_wot_seq     |       2593 |            1 | 9223372036854775807 |         1 |           1 |       8 | f         | t
(1 row)

bookstor=# SELECT 1 FROM core_wot_seq;
 ?column?
----------
        1
(1 row)

bookstor=# SELECT 1 FROM core_wot_seq FOR UPDATE;
ERROR:  could not access status of transaction 1573786613
DETAIL:  Could not open file "pg_clog/05DC": No such file or directory.


How do I best recover from this? Stop postgres, create an empty, 256k pg_clog/05DC file, start postgres?

Export table, drop table, import table? Anything else?


--
Tomasz Chmielewski
http://wpkg.org

Re: 'SGT DETAIL: Could not open file "pg_clog/05DC": No such file or directory' - what to do now?

От
Tomasz Chmielewski
Дата:
On 04.05.2011 22:27, Tomasz Chmielewski wrote:

>>> Then another pg_clog file disappeared.
>
> OK, I have:
>
> bookstor=# SELECT * FROM core_wot_seq;
>    sequence_name   | last_value | increment_by |      max_value      | min_value | cache_value | log_cnt | is_cycled
|is_called 
>
------------------+------------+--------------+---------------------+-----------+-------------+---------+-----------+-----------
>   core_wot_seq     |       2593 |            1 | 9223372036854775807 |         1 |           1 |       8 | f
|t 
> (1 row)
>
> bookstor=# SELECT 1 FROM core_wot_seq;
>   ?column?
> ----------
>          1
> (1 row)
>
> bookstor=# SELECT 1 FROM core_wot_seq FOR UPDATE;
> ERROR:  could not access status of transaction 1573786613
> DETAIL:  Could not open file "pg_clog/05DC": No such file or directory.
>
> How do I best recover from this? Stop postgres, create an empty, 256k pg_clog/05DC file, start postgres?
>
> Export table, drop table, import table? Anything else?

Nobody has a clue? :|


--
Tomasz Chmielewski
http://wpkg.org


Re: 'SGT DETAIL: Could not open file "pg_clog/05DC": No such file or directory' - what to do now?

От
Tomasz Chmielewski
Дата:
On 06.05.2011 10:42, Tomasz Chmielewski wrote:
> On 04.05.2011 22:27, Tomasz Chmielewski wrote:
>
>>>> Then another pg_clog file disappeared.
>>
>> OK, I have:
>>
>> bookstor=# SELECT * FROM core_wot_seq;
>> sequence_name | last_value | increment_by | max_value | min_value |
>> cache_value | log_cnt | is_cycled | is_called
>>
------------------+------------+--------------+---------------------+-----------+-------------+---------+-----------+-----------
>>
>> core_wot_seq | 2593 | 1 | 9223372036854775807 | 1 | 1 | 8 | f | t
>> (1 row)
>>
>> bookstor=# SELECT 1 FROM core_wot_seq;
>> ?column?
>> ----------
>> 1
>> (1 row)
>>
>> bookstor=# SELECT 1 FROM core_wot_seq FOR UPDATE;
>> ERROR: could not access status of transaction 1573786613
>> DETAIL: Could not open file "pg_clog/05DC": No such file or directory.
>>
>> How do I best recover from this? Stop postgres, create an empty, 256k
>> pg_clog/05DC file, start postgres?
>>
>> Export table, drop table, import table? Anything else?
>
> Nobody has a clue? :|

Just as a follow up, it turned out several sequences, and only sequences
were affected this way.

I used pg_dump to export these sequences, dropped the sequences, and
imported them again.

As there were some tables which depended on these sequences, I had to
use ALTER TABLE as well several times - grepping for the affected
sequence in the whole database dump gave me hints on what I had to do.


--
Tomasz Chmielewski
http://wpkg.org

Re: 'SGT DETAIL: Could not open file "pg_clog/05DC": No such file or directory' - what to do now?

От
Tomasz Chmielewski
Дата:
On 07.05.2011 23:19, Tomasz Chmielewski wrote:
> On 06.05.2011 10:42, Tomasz Chmielewski wrote:

>>> bookstor=# SELECT 1 FROM core_wot_seq FOR UPDATE;
>>> ERROR: could not access status of transaction 1573786613
>>> DETAIL: Could not open file "pg_clog/05DC": No such file or directory.
>>>
>>> How do I best recover from this? Stop postgres, create an empty, 256k
>>> pg_clog/05DC file, start postgres?
>>>
>>> Export table, drop table, import table? Anything else?
>>
>> Nobody has a clue? :|
>
> Just as a follow up, it turned out several sequences, and only sequences
> were affected this way.
>
> I used pg_dump to export these sequences, dropped the sequences, and
> imported them again.

Unfortunately, the issue is back, and again, only affects sequences.

I'd be really grateful for any more ideas here (why it happens, how to
best recover from it)!


--
Tomasz Chmielewski
http://wpkg.org

Tomasz Chmielewski <mangoo@wpkg.org> writes:
> bookstor=# SELECT 1 FROM core_wot_seq FOR UPDATE;

Um ... why are you doing that on a sequence?

> ERROR: could not access status of transaction 1573786613
> DETAIL: Could not open file "pg_clog/05DC": No such file or directory.

This doesn't surprise me too much, because sequences are not expected
to contain any live XIDs, so the XID freezing mechanism ignores them.
So if you did that in the past, this would eventually happen.

I think the most appropriate solution may be to disallow SELECT FOR
UPDATE/SHARE on sequences ... so if you have a good reason why we
shouldn't do so, please explain it.

            regards, tom lane

Re: 'SGT DETAIL: Could not open file "pg_clog/05DC": No such file or directory' - what to do now?

От
Tomasz Chmielewski
Дата:
On 31.05.2011 05:16, Tom Lane wrote:
> Tomasz Chmielewski<mangoo@wpkg.org>  writes:
>> bookstor=# SELECT 1 FROM core_wot_seq FOR UPDATE;
>
> Um ... why are you doing that on a sequence?
>
>> ERROR: could not access status of transaction 1573786613
>> DETAIL: Could not open file "pg_clog/05DC": No such file or directory.
>
> This doesn't surprise me too much, because sequences are not expected
> to contain any live XIDs, so the XID freezing mechanism ignores them.
> So if you did that in the past, this would eventually happen.
>
> I think the most appropriate solution may be to disallow SELECT FOR
> UPDATE/SHARE on sequences ... so if you have a good reason why we
> shouldn't do so, please explain it.

That's a good question.

I grepped the sources of the application using postgres, and it certainly doesn't do it.


We use pgpool though, and I see:

pool_process_query.c:        snprintf(qbuf, sizeof(qbuf), "SELECT 1 FROM %s FOR UPDATE", seq_rel_name);


So it looks to be coming from pgpool 3.x (it didn't do it in 2.x version).

This is a message explaining why it was introduced to pgpool:

http://comments.gmane.org/gmane.comp.db.postgresql.pgpool.devel/348


This brings two questions:

1) whatever command I send to postgres, should I expect "Could not open file "pg_clog/05DC": No such file or
directory"?
If so, it should be documented, and a way to recover from such a situation should be explained.

2) is pgpool behaviour correct?


--
Tomasz Chmielewski
http://wpkg.org