Re: Issue with WAL logs temporary not replaying
От | Adrian Klaver |
---|---|
Тема | Re: Issue with WAL logs temporary not replaying |
Дата | |
Msg-id | e35c3336-d0c4-c11a-079d-84a22100dbcd@aklaver.com обсуждение исходный текст |
Ответ на | Issue with WAL logs temporary not replaying (Eugene Pirogov <iamexile@gmail.com>) |
Список | pgsql-general |
On 01/31/2018 09:56 AM, Eugene Pirogov wrote: > Hi, > > We've experienced a DB incident related to WAL. The issue is now > somewhat resolved, and I'm trying to understand what could have caused > the problem to occur in the first place. Below is a full chronicle of an > incident we've ran into. > > At some point we realized almost no new data was coming in the DB during > the day. We quickly looked and realized the following: > > * our application was still writing data, unaware that there any > issues on the DB end, like if nothing happened at all, > * the database was accepting connections (we were able to connect and > query), > * the database was writing regular text logs, > * the database was not writing new data to the tables, > * the logical DB replica actually had the expected records replicated > and present in the replica tables, > * so we made assumption original source WAL data existed and was > correct (given the fact logical DB is relying on WAL). > > We tried to restart the DB, but it would not come up. Instead, a few > core dumps in the postgres data folder were generated during our attempts. > > This is what we saw in the postgres logs: > > LOG: invalid resource manager ID 112 at 3/89005B10 > LOG: invalid primary checkpoint record > LOG: invalid resource manager ID 138 at 3/89005208 > LOG: invalid secondary checkpoint record > PANIC: could not locate a valid checkpoint record > > > To aid this we used pg_resetxlog. After running the pg_resetxlog > command, the DB successfully started and slowly began recovering the > data from WAL, automatically. During the recovery process, we ran VACUUM > ANALYSE couple of time, which seems to have somehow aided the speed of > recovery process (though I could have been a confirmation bias). > > The DB was fully recovered in 30 minutes. > > Upon closer inspection, however, we realized we've lost two records from > a particular table. The records were replicated to logical replica > though, and further inspection showed their insertion timestamp sometime > during the day, long after the database supposedly stopped applying new > data from WAL to tables. > > We don't know the reason why these two records were gone (no DELETE > statements ever happen in the table in question). The records in > question existed in replica table, luckly. > > The logs for the entire day when the incident happened are available > here > <https://gist.githubusercontent.com/gmile/6df9ce393b11fb369856b84c4e9f755a/raw/5ab86c10332f9aef01abf16cb4989334bc7b02a2/postgresql.log> (1.6 > Mb). Each log line was prepended with a timestamp. Also, we have the > core dumps that were generated during attempts to restart the database. What the role of the user 'db'? Any idea what is causing this?: WARNING: concurrent insert in progress within table "audit_log" ...repeated... many times WARNING: concurrent insert in progress within table "declarations_status_hstr" ...repeated many times Are either of these tables the one that had the two missing records? > > My questions are: > > * what would be some typical reasons for primary/secondary WAL > checkpoints to fail? Disk issues? File system issues? OS issues? I'm > running out of ideas here. > > Is there any way I could separate primary and secondary checkpoints > in space, so to increase postgres's chances to restore one of them? > > * how come postgresql would stop applying WAL logs to tables, and yet > keep silence? There was no indication WAL is not being applied, > postgres log would be silent (even though it kept writing other > commodity messages, such as "incomplete startup packet" and > malformed query attempts). > > Due to this, we noticed the issue only hours after postgres > supposedly stopped applying the WAL logs to tables. > > * Is there any other log (like some lower level system log?) where I > could see the status of writing / reading / applying WAL segments? > > * What measures could be done to prevent this in future, or at least > make the situation easier to recover from (e.g. without data loss)? > > Thank you in advance. > > -- > Eugene > > http://www.gmile.me -- Adrian Klaver adrian.klaver@aklaver.com
В списке pgsql-general по дате отправления:
Предыдущее
От: "David G. Johnston"Дата:
Сообщение: Re: Information on savepoint requirement within transctions