Discussion: Jira database won't start after disk filled up

Jira database won't start after disk filled up

From
Paul Costello
Date:
I have a database that wouldn't start due to the disk filling up back on 1/10, unbeknownst to us until 2/27.  This is jira, so it's critical data.  It appears jira was running in memory that entire time.

I needed to run pg_resetxlog -f in order to start the database.  It started, but upon logging in I found the system catalog and some data to be corrupt. 

I was able to run a pg_dumpall on the database and restore it to a re-initialized cluster.  However, there were 3 primary key errors during the restore, because duplicate data got into the tables.

My hypothesis is that because of the system catalog corruption the primary key uniqueness was not being enforced.  I'm not sure when this occurred, though: 1) right after the disk filled up, 2) when I ran pg_resetxlog -f, or 3) after I ran pg_resetxlog and before I did the backup.  jira was still running after I got it started and I waited a few hours to do the backup.  My guess is the duplicate data got in there right after the disk filled up on 1/10 though.

We had a snapshot from 1/5 which has been restored to production, such as it is.  But, they created another test vm for me to attempt to bring data back to 2/27.

Is there anything I can do short of pg_resetxlog -f to bring this database back up more safely, and possibly avoid the duplicate data/primary key errors?  It wouldn't start without the force option.  Should I simply shut down jira, try pg_resetxlog -f again and do the pg_dumpall immediately?

These are the errors I am currently seeing while trying to start the database. 

2018-03-02 11:01:06 CST LOG:  database system was interrupted; last known up at 2018-01-10 12:19:01 CST
2018-03-02 11:01:06 CST LOG:  database system was not properly shut down; automatic recovery in progress
2018-03-02 11:01:06 CST LOG:  redo starts at 36/B8556D58
2018-03-02 11:01:06 CST LOG:  incomplete startup packet
2018-03-02 11:01:07 CST FATAL:  the database system is starting up
...
2018-03-02 11:01:12 CST LOG:  incomplete startup packet
2018-03-02 11:01:29 CST FATAL:  the database system is starting up
...
2018-03-02 11:01:30 CST LOG:  record with zero length at 36/F754CBD8
2018-03-02 11:01:30 CST LOG:  redo done at 36/F754CBA8
2018-03-02 11:01:30 CST LOG:  last completed transaction was at log time 2018-02-26 17:55:43.238541-06
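
For anyone reading the log excerpt above, the two "redo" positions can be turned into a rough size for the WAL range that recovery walked through. This is only a minimal Python sketch: it assumes the `hi/lo` hexadecimal notation maps to a 64-bit position the way it does in PostgreSQL 9.3 and later (the thread does not state the server version), and the helper names `lsn_to_int` and `redo_span_bytes` are made up for illustration.

```python
import re

# Log lines copied from the excerpt above.
LOG = """\
2018-03-02 11:01:06 CST LOG:  redo starts at 36/B8556D58
2018-03-02 11:01:30 CST LOG:  record with zero length at 36/F754CBD8
2018-03-02 11:01:30 CST LOG:  redo done at 36/F754CBA8
"""

LSN = r"([0-9A-Fa-f]+/[0-9A-Fa-f]+)"


def lsn_to_int(lsn: str) -> int:
    """Convert 'hi/lo' hex notation to an integer byte position.

    Assumption: 64-bit LSN layout (high 32 bits / low 32 bits).
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def redo_span_bytes(log_text: str) -> int:
    """Return the byte distance between 'redo starts at' and 'redo done at'."""
    start = re.search(r"redo starts at " + LSN, log_text)
    done = re.search(r"redo done at " + LSN, log_text)
    if start is None or done is None:
        raise ValueError("log does not contain both redo markers")
    return lsn_to_int(done.group(1)) - lsn_to_int(start.group(1))


if __name__ == "__main__":
    span = redo_span_bytes(LOG)
    print(f"{span} bytes (~{span / 2**20:.0f} MiB) between redo start and redo done")
```

The figure only says how much WAL lies between the two positions; it says nothing about whether the replayed records are complete or consistent.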

Any ideas or thoughts are appreciated.

Paul

Re: Jira database won't start after disk filled up

From
Vick Khera
Date:
On Fri, Mar 2, 2018 at 4:32 PM, Paul Costello <paulc1217@gmail.com> wrote:
> I have a database that wouldn't start due to the disk filling up back on 1/10, unbeknownst to us until 2/27.  This is jira, so it's critical data.  It appears jira was running in memory that entire time.


Those first two sentences seem contradictory...
 

> I needed to run pg_resetxlog -f in order to start the database.  It started, but upon logging in I found the system catalog and some data to be corrupt.


Once you did this, fixing the data is really on you. Postgres has no way to know what any of the data mean, nor how to decide what to keep and what to toss on those conflicting rows with duplicate keys.
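
As a starting point for deciding what to keep, it helps to list which key values actually occur more than once. The Python sketch below only builds the SQL text; it does not connect to anything. The table name `jiraissue` comes from this thread, while the key column `id` and the helper names are assumptions for illustration. The `SET` lines turn off index-based plans for the session, on the reasoning that an index on a damaged table may itself be untrustworthy; `enable_indexonlyscan` exists only in PostgreSQL 9.2 and later.

```python
from typing import List, Sequence

# Planner settings that steer the session away from index-based plans,
# so the duplicate check reads the table itself.
PREAMBLE: List[str] = [
    "SET enable_indexscan = off;",
    "SET enable_indexonlyscan = off;",  # PostgreSQL 9.2+ only
    "SET enable_bitmapscan = off;",
]


def quote_ident(name: str) -> str:
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def duplicate_key_query(table: str, key_columns: Sequence[str]) -> str:
    """Build a query listing key values that appear more than once."""
    cols = ", ".join(quote_ident(c) for c in key_columns)
    return (
        f"SELECT {cols}, count(*) AS copies "
        f"FROM {quote_ident(table)} "
        f"GROUP BY {cols} "
        f"HAVING count(*) > 1;"
    )


def duplicate_check_script(table: str, key_columns: Sequence[str]) -> str:
    """Return the full script: planner settings followed by the query."""
    return "\n".join(PREAMBLE + [duplicate_key_query(table, key_columns)])


if __name__ == "__main__":
    # 'id' as the primary key column of jiraissue is an assumption.
    print(duplicate_check_script("jiraissue", ["id"]))
```

The output can be pasted into psql against the recovered copy, once per table that reported a primary key error during the restore.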

What I'd personally do is take your 1/5 backup, then merge in rows for tickets and affiliated data from whatever you can recover in the current database copy you have. Once that's done, run jira's built-in integrity checker then do a full export to XML backup format. Finally re-import that into a fresh jira so you know what's in there is consistent.  You'll probably also have to cross-reference the attachments directory for missing tickets and clean up those files (or synthesize tickets for them).
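
The merge step could be planned along these lines before touching either database. This is a hedged sketch in plain Python, not a Jira or Postgres tool: it assumes rows have already been exported from both copies as dictionaries keyed by primary key, and the function name `plan_merge` and the sample rows are invented for illustration.

```python
from typing import Dict, Hashable, Iterable, List, Mapping, Tuple

Row = Mapping[str, object]


def plan_merge(
    baseline: Mapping[Hashable, Row],
    recovered_rows: Iterable[Tuple[Hashable, Row]],
) -> Dict[str, List[Hashable]]:
    """Classify recovered rows against the baseline (the 1/5 backup).

    insert    : key exists only in the recovered copy
    conflict  : key exists in both, but the row contents differ
    duplicate : key appears more than once in the recovered copy;
                these need a human decision before anything is merged
    """
    seen: Dict[Hashable, Row] = {}
    duplicates: List[Hashable] = []
    for key, row in recovered_rows:
        if key in seen:
            if key not in duplicates:
                duplicates.append(key)
            continue
        seen[key] = row

    inserts = [k for k in seen if k not in baseline and k not in duplicates]
    conflicts = [
        k
        for k in seen
        if k in baseline
        and k not in duplicates
        and dict(seen[k]) != dict(baseline[k])
    ]
    return {"insert": inserts, "conflict": conflicts, "duplicate": duplicates}


if __name__ == "__main__":
    # Hypothetical sample data, not taken from any real Jira instance.
    baseline = {
        10001: {"summary": "Printer offline"},
        10002: {"summary": "VPN drops"},
    }
    recovered = [
        (10001, {"summary": "Printer offline"}),
        (10002, {"summary": "VPN drops nightly"}),
        (10003, {"summary": "New laptop request"}),
        (10004, {"summary": "Version A"}),
        (10004, {"summary": "Version B"}),
    ]
    print(plan_merge(baseline, recovered))
```

Keeping the duplicates in their own bucket reflects the point above: the database cannot decide which copy of a conflicting row is the right one, so those keys are set aside rather than merged automatically.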

If your jira is configured to send email somewhere on ticket updates, gathering those (even if it is in multiple people's mailboxes) and recreating ticket info from them would also move you along.

You will lose some of your data because not all of it was written to disk.

Re: Jira database won't start after disk filled up

From
Paul Costello
Date:
Yes, contradictory.  We recently disabled email forwarding, so on 1/10 when the disk filled up, we never received any alerts.  New position, so I was completely unaware of this database, except on a conceptual level, until Tuesday.

I think the best I can do is get this database back to 1/10.  In my first restore attempt I noticed that the most recent jiraissue.resolutiondate was 1/10, so jira was running in memory the entire time and nothing was flushed to the db - probably because it was corrupted, and possibly other reasons related to neglect.

My hope is that I can get the db back to 1/10 and maybe we can, with Atlassian's help, somehow sync the lucene files back to the db.  I don't think I will have any postgres data to work with beyond 1/10.

Does this still sound do-able with that kind of data gap?

Re: Jira database won't start after disk filled up

From
Vick Khera
Date:
On Fri, Mar 2, 2018 at 5:13 PM, Paul Costello <paulc1217@gmail.com> wrote:
> My hope is that I can get the db back to 1/10 and maybe we can, with Atlassian's help, somehow sync the lucene files back to the db.  I don't think I will have any postgres data to work with beyond 1/10.
>
> Does this still sound do-able with that kind of data gap?


I'm not sure how the incremental updates to the lucene indexes work with Jira. If they are parallel to writing to the DB maybe you can recover some info there; if they are trickled out asynchronously after writing to the DB by an index process that reads back the DB, then I'd expect there to be no additional info there.

Perhaps the best you can do is get Jira to run its integrity checker on the current data and fix whatever it tells you to fix. I think Atlassian will know best.