Thread: Decreasing the data loss after failover

Decreasing the data loss after failover

From
sinasaharkhiz
Date:
Hi,
I use Barman to back up the database of a production server at the company I
work for. Last month a hard disk failed and I had to recover from the backup on
my Barman server. But we lost nearly 15 minutes of transactions, and since
this server is responsible for bank payments, 15 minutes of lost
transactions is too much.
I want to know what the problem was and how I can fix it to prevent
this data loss next time. I think the WAL file that PostgreSQL was writing
to at the time took more than 15 minutes to fill.




--
View this message in context: http://postgresql.nabble.com/Decreasing-the-data-loss-after-failover-tp5852659.html
Sent from the PostgreSQL - admin mailing list archive at Nabble.com.


Re: Decreasing the data loss after failover

From
Jorge Torralba
Date:
I don't use Barman and am not familiar with it. But at a very high level, assuming you are using a PITR strategy, the following may help.

If your backups involve a base backup and you are archiving your WAL files with archive_command, sometimes the simplest thing can get overlooked.

For example, prior to restoring your base backup and applying WAL files, did you save the original pg_xlog content? You need to copy that content back into the pg_xlog directory after restoring the base backup. Then run recovery with the command that reads your archived WAL files; once the archived files have been replayed, recovery will apply what is in the pg_xlog directory and bring you to your latest state.
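
The "save the original pg_xlog content" step could be sketched roughly as below. This is only an illustration in Python: the function name `save_unarchived_wal` and both directory paths are hypothetical, and it assumes the old data directory is still readable, which is not the case after a total disk failure.

```python
import shutil
from pathlib import Path


def save_unarchived_wal(pg_xlog_dir, safe_dir):
    """Copy every regular file from pg_xlog_dir into safe_dir.

    Subdirectories (for example archive_status) are skipped.
    Returns the names of the copied files in sorted order.
    """
    src = Path(pg_xlog_dir)
    dst = Path(safe_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(src.iterdir()):
        if entry.is_file():
            # copy2 also preserves modification times
            shutil.copy2(entry, dst / entry.name)
            copied.append(entry.name)
    return copied


# Hypothetical paths, to be adapted to the actual installation:
# save_unarchived_wal("/var/lib/pgsql/data/pg_xlog", "/root/pg_xlog_saved")
```

The saved files would then be copied back into pg_xlog after the base backup has been restored and before recovery is started.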

JT

On Fri, Jun 5, 2015 at 9:47 AM, sinasaharkhiz <sinas1991@gmail.com> wrote:
> Hi,
> I use barman to backup the database of a production server of the company I
> work in. Last month a hard disk failed and I had to recover the backup from
> my barman server. But we've lost near 15 minutes of transactions, and since
> this server is responsible for the bank payments 15 minutes of lost
> transactions is too much.
> I want to know what was the problem here and how can I fix it to prevent
> this data loss next time? I think the wal files that postgresql were writing
> on took more than 15 minutes at the time to finish.


--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin



--
Thanks,

Jorge Torralba

Re: Decreasing the data loss after failover

From
sinasaharkhiz
Date:
Thanks for your answer.
This will not work in my situation, because the hard disk of my server has
failed and the only data I have are the WAL files transferred to my backup
server. I can only perform my recovery with the base backup and WAL
files stored by Barman.
PS: I'm trying a specific archive_timeout (less than 15 minutes) to
see how that affects the backup size.

Sina



--
View this message in context:
http://postgresql.nabble.com/Decreasing-the-data-loss-after-failover-tp5852659p5852767.html
Sent from the PostgreSQL - admin mailing list archive at Nabble.com.


Re: Decreasing the data loss after failover

From
Keith Fiske
Date:

On Fri, Jun 5, 2015 at 8:47 PM, sinasaharkhiz <sinas1991@gmail.com> wrote:
> Thanks for your answer.
> This will not work for my situation. Because the hard disk of my server is
> failed and the only data I got are the wal files transferred to my backup
> server and I can only perform my recovery process with base backup and wal
> files stored by barman.
> PS: I'm trying to use a specific archive_timeout (less than 15 minutes) to
> see how that will affect the backup size.
>
> Sina

I'm not sure how you set Barman up, but it should have been backing up your WAL files as well as performing base backups of your data. Part of the Barman setup is configuring your archive_command. Wherever this points to is where your missing WAL files should have gone. If you don't have those WAL files, or don't know where Barman was saving them, then you may be out of luck on that missing data. If the built-in restore commands did not bring things back up to the state you expected, I'd take a closer look and make sure Barman was actually doing proper backups.

What version of PostgreSQL are you running? If you're on 9.0+, you should be using streaming replication instead of WAL shipping. Otherwise your slave will only ever be as caught up as the last WAL file you managed to ship over. This was a severe limitation in older versions of Postgres that was overcome nearly 5 years ago.

Also, if you're not running 9.0+, you're out of support and no longer receiving security and bug fix updates, which should be of great concern if you're handling financial data.

I would also encourage you to read up on how to manually perform backup and recovery with the tools that PostgreSQL comes with. Be sure you understand everything in this section of the PostgreSQL documentation (note this is for 9.4): http://www.postgresql.org/docs/9.4/interactive/backup.html. Pay particular attention to base backups and WAL archiving. Barman is a nice tool, but it hides a lot of important steps that you need to know to understand how to perform recovery in a disaster situation. If you don't understand what it's doing under the hood, you'll very likely be in this predicament again, because if its built-in recovery script doesn't work, you're left not knowing what to do.

--
Keith Fiske
Database Administrator
OmniTI Computer Consulting, Inc.
http://www.keithf4.com

Re: Decreasing the data loss after failover

From
Keith Fiske
Date:


On Sat, Jun 6, 2015 at 12:09 AM, Keith Fiske <keith@omniti.com> wrote:
> What version of PostgreSQL are you running? If you're on 9.0+, you should be using streaming replication instead of WAL shipping. Otherwise your slave will only ever be as caught up as fast as you can ship WAL files over.

Sorry, never mind the streaming replication stuff. For some reason, when I was writing the response, I had it in my head that you did a failover to a slave, not a backup recovery.

Another thing you may want to look at is the archive_timeout value. If your database didn't do enough writes to fill a WAL file (16MB), PostgreSQL would not have switched to a new WAL file, and the partly filled one would not have been archived, unless you have archive_timeout set. That setting forces a switch to a new WAL file at least that often.
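
As a rough illustration of that point, the sketch below estimates how long WAL can sit on the primary before its segment becomes eligible for archiving. The function name and the write rate are made up for the example, and the model is deliberately simplified: it assumes a constant WAL write rate and ignores the time the archive_command itself needs.

```python
WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # 16MB segment size mentioned in the thread


def worst_case_unarchived_seconds(wal_bytes_per_second, archive_timeout_seconds=0):
    """Upper bound on how long WAL stays in an unfinished segment.

    archive_timeout_seconds=0 models a disabled timeout.
    Simplification: constant write rate, archive transfer time ignored.
    """
    if wal_bytes_per_second <= 0:
        fill_seconds = float("inf")  # an idle server never fills the segment
    else:
        fill_seconds = WAL_SEGMENT_BYTES / wal_bytes_per_second
    if archive_timeout_seconds > 0:
        return min(fill_seconds, archive_timeout_seconds)
    return fill_seconds


# Hypothetical rate: one segment filled every 15 minutes
rate = WAL_SEGMENT_BYTES / (15 * 60)
print(worst_case_unarchived_seconds(rate))       # no timeout: about 900 seconds
print(worst_case_unarchived_seconds(rate, 300))  # 5 minute timeout caps it at 300
```

Under these assumptions, a server that takes 15 minutes to fill a segment can lose up to 15 minutes of transactions when only archived WAL survives, which matches the loss described at the start of the thread.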

Re: Decreasing the data loss after failover

From
Scott Ribe
Date:
On Jun 5, 2015, at 10:09 PM, Keith Fiske <keith@omniti.com> wrote:
>
> If you don't understand what it's doing under the hood, you'll very likely be in this predicament again because if its built in recovery script doesn't work, then you're left not knowing what to do.

Also, for the future, make sure that a single disk failure never puts you into the situation of needing to recover in the first place.

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
https://www.linkedin.com/in/scottribe/
(303) 722-0567 voice







Re: Decreasing the data loss after failover

From
sinasaharkhiz
Date:
Yeah, I just set archive_timeout to 5 minutes. But I thought that would
make PostgreSQL generate more WAL files than before, because the WAL
files would be switched before getting completely full. That would produce
more 16MB WAL files holding less than 16MB of data, and so increase the
size of my backup.
But after I changed archive_timeout and restarted the Barman backup, the size has
not increased. So I wonder whether my reasoning is wrong or I have done
something wrong. Do you have any idea about this?
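
The growth expected here can be bounded with simple arithmetic. The sketch below is only an estimate with invented function names: it assumes every timeout interval forces a switch and that each archived segment is stored uncompressed at the full 16MB.

```python
SECONDS_PER_DAY = 24 * 60 * 60
WAL_SEGMENT_MIB = 16  # segment size mentioned in the thread


def max_forced_segments_per_day(archive_timeout_seconds):
    """Largest number of timeout-forced WAL switches that fit in one day."""
    return SECONDS_PER_DAY // archive_timeout_seconds


def max_forced_archive_mib_per_day(archive_timeout_seconds):
    """Upper bound on uncompressed archive growth from forced switches alone."""
    return max_forced_segments_per_day(archive_timeout_seconds) * WAL_SEGMENT_MIB


print(max_forced_segments_per_day(300))     # 288
print(max_forced_archive_mib_per_day(300))  # 4608
```

So a 5 minute archive_timeout adds at most 288 segments, roughly 4.5 GiB, per day before any compression. If no growth at all is visible, it is worth checking whether new files are actually arriving in the archive.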



--
View this message in context:
http://postgresql.nabble.com/Decreasing-the-data-loss-after-failover-tp5852659p5852817.html
Sent from the PostgreSQL - admin mailing list archive at Nabble.com.


Re: Decreasing the data loss after failover

From
Keith
Date:


On Sat, Jun 6, 2015 at 11:43 PM, sinasaharkhiz <sinas1991@gmail.com> wrote:
> Yeah I just set the archive_timeout to 5 minutes. But I thought that would
> make postgresql to generate more wal files than before. Because the wal
> files would be switched before getting completely full. So that would make
> more 16MB wal files with less than 16MB data, and that will increase the
> size of my backup.
> But after I changed archive_timeout and restarted the barman backup, it has
> not increased. So I wonder if I'm thinking wrong about it or I have done
> something wrong. Do you have any idea about this?

Check the archive_command destination folder and ensure a new file is being created there every 5 minutes. If not, check your Postgres logs (or Barman logs; I'm not sure if they log that as well) to ensure your archive command is running properly. Looking at the docs, I don't see that Barman itself supports having the archive_command compress the WAL files, but it has instructions for copying them to a secondary location that is compressed.
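
One way to run that check is sketched below. The helper names and the directory path are hypothetical, and the sketch assumes file modification times in the archive folder reflect when each WAL file arrived.

```python
import os


def find_archive_gaps(mtimes, max_gap_seconds):
    """Return (earlier, later) pairs of consecutive timestamps
    that are further apart than max_gap_seconds."""
    ordered = sorted(mtimes)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap_seconds]


def archive_mtimes(archive_dir):
    """Modification times (seconds since the epoch) of regular files in archive_dir."""
    with os.scandir(archive_dir) as entries:
        return [e.stat().st_mtime for e in entries if e.is_file()]


# Hypothetical usage: 5 minute archive_timeout plus one minute of slack
# gaps = find_archive_gaps(archive_mtimes("/path/to/wal/archive"), 300 + 60)
```

An empty result means no interval between archived files exceeded the threshold; any reported pair points at a time window to look up in the PostgreSQL log.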

I would highly recommend getting a streaming slave system set up. This would give you a system that is only seconds (or less) behind the master at all times in case of issues like this. That sounds like your end goal, and it is something a backup will not be able to provide.

Re: Decreasing the data loss after failover

From
Craig James
Date:
On Sat, Jun 6, 2015 at 8:43 PM, sinasaharkhiz <sinas1991@gmail.com> wrote:
> Yeah I just set the archive_timeout to 5 minutes. But I thought that would
> make postgresql to generate more wal files than before. Because the wal
> files would be switched before getting completely full. So that would make
> more 16MB wal files with less than 16MB data, and that will increase the
> size of my backup.
> But after I changed archive_timeout and restarted the barman backup, it has
> not increased. So I wonder if I'm thinking wrong about it or I have done
> something wrong. Do you have any idea about this?

If the files are compressed, you will have more files, but the total size will be almost the same. Gzip compression reduces large sections of identical bytes to almost nothing.

For example, I just compressed a 750 MB file of zeros to 750 KB -- a 1000:1 compression ratio.
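
The effect is easy to reproduce on a smaller scale. The sketch below uses Python's standard gzip module on synthetic data rather than real WAL files, and it assumes, as in the example above, that the unused part of a segment is a run of identical bytes; whether that holds for a given PostgreSQL version is not verified here.

```python
import gzip
import os

SEGMENT_BYTES = 16 * 1024 * 1024  # 16MB, the segment size mentioned in the thread
ONE_MIB = 1024 * 1024


def gzip_size(data):
    """Size in bytes of data after gzip compression at the default level."""
    return len(gzip.compress(data))


# A segment-sized buffer of zeros, standing in for an unused segment
empty_segment = bytes(SEGMENT_BYTES)

# 1 MiB of incompressible data followed by zeros, standing in for a
# segment that was switched by archive_timeout while mostly empty
partly_used = os.urandom(ONE_MIB) + bytes(SEGMENT_BYTES - ONE_MIB)

print("compression ratio, all zeros:", SEGMENT_BYTES / gzip_size(empty_segment))
print("compressed MiB, 1 MiB used:", gzip_size(partly_used) / ONE_MIB)
```

In this model the compressed size tracks the amount of real data in the segment, not the 16MB file size, which would explain a backup that barely grows after lowering archive_timeout.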

Craig







--
---------------------------------
Craig A. James
Chief Technology Officer
eMolecules, Inc.
---------------------------------