Re: Issue with WAL logs temporary not replaying

From: Adrian Klaver
Subject: Re: Issue with WAL logs temporary not replaying
Date:
Msg-id: e35c3336-d0c4-c11a-079d-84a22100dbcd@aklaver.com
In response to: Issue with WAL logs temporary not replaying  (Eugene Pirogov <iamexile@gmail.com>)
List: pgsql-general
On 01/31/2018 09:56 AM, Eugene Pirogov wrote:
> Hi,
> 
> We've experienced a DB incident related to WAL. The issue is now 
> somewhat resolved, and I'm trying to understand what could have caused 
> the problem to occur in the first place. Below is a full chronicle of the 
> incident we ran into.
> 
> At some point we realized almost no new data was coming into the DB during 
> the day. We quickly looked and realized the following:
> 
>   * our application was still writing data, unaware that there were any
>     issues on the DB end, as if nothing had happened at all,
>   * the database was accepting connections (we were able to connect and
>     query),
>   * the database was writing regular text logs,
>   * the database was not writing new data to the tables,
>   * the logical DB replica actually had the expected records replicated
>     and present in the replica tables,
>   * so we assumed the original WAL data existed and was correct (given
>     that the logical replica relies on WAL); a quick way to check this is
>     sketched just below.
> 
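One way to sanity-check that assumption is to compare the logical replication 
slot's progress with the server's current WAL position. A minimal sketch, 
assuming a 9.4+ primary and psql access (the database name is a placeholder):

    # List logical replication slots and how far along in WAL they are.
    psql -d postgres -c "SELECT slot_name, plugin, active, restart_lsn FROM pg_replication_slots;"

    # Current WAL write location on the primary (pre-10 function name,
    # consistent with the pg_resetxlog usage later in this thread).
    psql -d postgres -c "SELECT pg_current_xlog_location();"

If the slot keeps advancing while the tables stand still, WAL is being 
generated and decoded even though something downstream of it looks stuck.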
> We tried to restart the DB, but it would not come up. Instead, a few 
> core dumps in the postgres data folder were generated during our attempts.
> 
> This is what we saw in the postgres logs:
> 
>     LOG:  invalid resource manager ID 112 at 3/89005B10
>     LOG:  invalid primary checkpoint record
>     LOG:  invalid resource manager ID 138 at 3/89005208
>     LOG:  invalid secondary checkpoint record
>     PANIC:  could not locate a valid checkpoint record
> 
> 
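The "invalid primary/secondary checkpoint record" messages mean the checkpoint 
locations stored in pg_control point at WAL that no longer holds valid 
checkpoint records. A minimal sketch of how to inspect that, assuming shell 
access on the server (the data directory path is a placeholder, and the tool 
names are the pre-10 ones matching pg_resetxlog below):

    # Show pg_control contents, including the latest checkpoint location
    # and the WAL segment it lives in.
    pg_controldata /var/lib/postgresql/data

    # Dump the WAL records around the locations reported in the log above
    # to see what is actually there (the tool is pg_waldump on 10+).
    pg_xlogdump -p /var/lib/postgresql/data/pg_xlog -s 3/89005208 -e 3/89005B10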
> To aid this we used pg_resetxlog. After running the pg_resetxlog 
> command, the DB successfully started and slowly began recovering the 
> data from WAL, automatically. During the recovery process, we ran VACUUM 
> ANALYSE a couple of times, which seemed to speed up the recovery 
> process (though that could have been confirmation bias).
> 
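Since pg_resetxlog throws away whatever WAL has not been replayed, it is worth 
previewing what it would change before running it for real. A minimal sketch, 
assuming the pre-10 tool name used here (the data directory path is a 
placeholder):

    # Dry run: print the values pg_resetxlog would write, change nothing.
    pg_resetxlog -n /var/lib/postgresql/data

    # Only then reset; -f forces it when pg_control looks corrupt.
    pg_resetxlog -f /var/lib/postgresql/data

Discarding un-replayed WAL is also one plausible explanation for records that 
later turn out to be missing from the tables yet present on the logical replica.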
> The DB was fully recovered in 30 minutes.
> 
> Upon closer inspection, however, we realized we had lost two records from 
> a particular table. The records were replicated to the logical replica, 
> though, and further inspection showed their insertion timestamps were from 
> sometime during the day, long after the database had supposedly stopped 
> applying new data from WAL to tables.
> 
> We don't know why these two records were gone (no DELETE 
> statements ever happen on the table in question). Luckily, the records in 
> question still existed in the replica table.
> 
> The logs for the entire day when the incident happened are available here 
> <https://gist.githubusercontent.com/gmile/6df9ce393b11fb369856b84c4e9f755a/raw/5ab86c10332f9aef01abf16cb4989334bc7b02a2/postgresql.log> 
> (1.6 Mb). Each log line was prepended with a timestamp. Also, we have the 
> core dumps that were generated during attempts to restart the database.

What is the role of the user 'db'?

Any idea what is causing this?:

WARNING:  concurrent insert in progress within table "audit_log"
...repeated many times

WARNING:  concurrent insert in progress within table 
"declarations_status_hstr"
...repeated many times

Is either of these tables the one that had the two missing records?
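One possible cause, offered as an assumption rather than something confirmed by 
these logs: after a pg_resetxlog the transaction ID counter can end up behind 
the xmin of rows that were already committed, which makes VACUUM/ANALYZE treat 
those rows as inserts still in progress. A rough check, comparing the current 
XID with the newest xmin in the table named in the warning (the database name 
is a placeholder):

    # Current transaction ID (64-bit, epoch-qualified; compare its low 32 bits
    # with xmin values).
    psql -d yourdb -c "SELECT txid_current();"

    # Largest inserting XID recorded on rows of the affected table.
    psql -d yourdb -c "SELECT max(xmin::text::bigint) FROM audit_log;"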

> 
> My questions are:
> 
>   * what would be some typical reasons for primary/secondary WAL
>     checkpoints to fail? Disk issues? File system issues? OS issues? I'm
>     running out of ideas here.
> 
>     Is there any way I could separate the primary and secondary checkpoints
>     in space, so as to increase postgres's chances of restoring one of them?
> 
>   * how come postgresql would stop applying WAL logs to tables, and yet
>     keep silent? There was no indication that WAL was not being applied;
>     the postgres log was silent about it (even though it kept writing other
>     routine messages, such as "incomplete startup packet" and
>     malformed query attempts).
> 
>     Due to this, we noticed the issue only hours after postgres
>     supposedly stopped applying the WAL logs to tables.
> 
>   * Is there any other log (like some lower-level system log?) where I
>     could see the status of writing / reading / applying WAL segments?
>     (A small sketch of this follows after the list of questions.)
> 
>   * What measures could be taken to prevent this in the future, or at least
>     make the situation easier to recover from (e.g. without data loss)?
> 
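On the question of where WAL and checkpoint activity can be seen: the regular 
server log can record every checkpoint, and a couple of standard views expose 
counters. A minimal sketch, assuming superuser psql access (ALTER SYSTEM needs 
9.4+; on older versions set log_checkpoints in postgresql.conf instead):

    # Log every checkpoint start/end in the normal server log.
    psql -c "ALTER SYSTEM SET log_checkpoints = on;"
    psql -c "SELECT pg_reload_conf();"

    # Checkpoint counters: timed vs. requested, plus write/sync times.
    psql -c "SELECT checkpoints_timed, checkpoints_req, checkpoint_write_time, checkpoint_sync_time FROM pg_stat_bgwriter;"

    # Current WAL write location (pre-10 function name).
    psql -c "SELECT pg_current_xlog_location();"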
> Thank you in advance.
> 
> -- 
> Eugene
> 
> http://www.gmile.me


-- 
Adrian Klaver
adrian.klaver@aklaver.com

