Discussion: WAL partition filling up after high WAL activity
Hi Everyone,
I recently saw a crash on one of our databases, and I was wondering whether this might indicate that WAL is unexpectedly creating more files than it needs.
Nov 5 16:18:27 localhost postgres[25092]: [111-1] 2011-11-05 16:18:27.524 PDT [user=slony,db=uk dbhub.com(35180) PID:25092 XID:2142895751]PANIC: could not write to file "pg_xlog/xlogtemp.25092": No space left on device
Nov 5 16:18:27 localhost postgres[25092]: [111-2] 2011-11-05 16:18:27.524 PDT [user=slony,db=uk dbhub.com(35180) PID:25092 XID:2142895751]STATEMENT: select "_sac_uk".forwardConfirm(2, 4, '5003717188', '2011-11-05 16:18:26.977112');
Nov 5 16:18:27 localhost postgres[32121]: [7-1] 2011-11-05 16:18:27.531 PDT [user=,db= PID:32121 XID:0]LOG: server process (PID 25092) was terminated by signal 6: Aborted
Nov 5 16:18:27 localhost postgres[32121]: [8-1] 2011-11-05 16:18:27.531 PDT [user=,db= PID:32121 XID:0]LOG: terminating any other active server processes
If you look at this graph (http://cl.ly/2y0W27330t3o2J281H3K), the partition actually fills up, and the logs show that postgres crashed. After postgres crashed, it automatically restarted, cleared out its WAL files, and began processing things again at 4:30PM.
From the graph, it looks like a vacuum on table m_dg_read finished at 4:08PM, which might explain why the downward slope levels off for a few minutes:
Nov 5 16:08:03 localhost postgres[18741]: [20-1] 2011-11-05 16:08:03.400 PDT [user=,db= PID:18741 XID:0]LOG: automatic vacuum of table "uk.public.m_dg_read": index scans: 1
Nov 5 16:08:03 localhost postgres[18741]: [20-2] pages: 0 removed, 65356 remain
Nov 5 16:08:03 localhost postgres[18741]: [20-3] tuples: 31770 removed, 1394263 remain
Nov 5 16:08:03 localhost postgres[18741]: [20-4] system usage: CPU 2.08s/5.35u sec elapsed 619.39 sec
Looks like right afterwards, it got started on table m_object, which finished at 4:18PM:
Nov 5 16:18:19 localhost postgres[18686]: [9-1] 2011-11-05 16:18:19.448 PDT [user=,db= PID:18686 XID:0]LOG: automatic vacuum of table "uk.public.m_object": index scans: 1
Nov 5 16:18:19 localhost postgres[18686]: [9-2] pages: 0 removed, 152862 remain
Nov 5 16:18:19 localhost postgres[18686]: [9-3] tuples: 17084 removed, 12455761 remain
Nov 5 16:18:19 localhost postgres[18686]: [9-4] system usage: CPU 4.55s/15.09u sec elapsed 1319.98 sec
It could very well be that when m_object's vacuum finished, another vacuum began, and the server eventually crashed because there was no room left for that vacuum to finish.
We encountered a situation like this last summer, on 7/4/2010, on a different database cluster: a big vacuum-for-wraparound on a 15GB table filled the pg_xlog partition. That's how we started monitoring the pg_xlog file size and the wraparound countdown. It seems some sort of vacuum-for-wraparound process was happening at the time of this crash, as we also track the XID to see when we should expect a VACUUM FREEZE (http://cl.ly/3s1S373I0l0v3E171Z0V).
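For reference, the wraparound countdown we track comes from something like the following (a sketch; the ~2-billion figure is the hard XID wraparound limit, and the forced autovacuum actually kicks in much earlier, at autovacuum_freeze_max_age):

```sql
-- How close each database is to a forced wraparound vacuum
SELECT datname,
       age(datfrozenxid) AS xid_age,
       2000000000 - age(datfrozenxid) AS tx_until_wraparound_limit
FROM pg_database
ORDER BY xid_age DESC;
```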
Some configs:
checkpoint_segments=16
wal_buffers=8MB
#archive_mode=off
checkpoint_completion_target=0.9
Postgres version is 8.4.5
Note also that the pg_xlog partition is 9.7GB. No other apps run on the machine besides pgbouncer, so it's highly unlikely that files are written to this partition by another process. Also, our five largest tables are the following:
gm3_load_times | 2231 MB
m_object_paper | 1692 MB
m_object | 1192 MB
m_report_stats | 911 MB
gm3_mark | 891 MB
My biggest question is: we know from the docs that there should be no more than (2 + checkpoint_completion_target) * checkpoint_segments + 1 files. For us, that would mean no more than 48 files, which at 16MB per segment equates to 768MB--far lower than the 9.7GB partition size. **Why would WAL use up so much disk space?**
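As a sanity check, here is that arithmetic worked out (assuming the default 16MB WAL segment size):

```sql
-- Expected upper bound on pg_xlog size under normal operation,
-- per the formula in the docs, with checkpoint_segments = 16
-- and checkpoint_completion_target = 0.9:
SELECT ceil((2 + 0.9) * 16 + 1)      AS max_segments,   -- 48 files
       ceil((2 + 0.9) * 16 + 1) * 16 AS max_size_mb;    -- 768 MB
```

Either way, the expected bound is well under 1GB, nowhere near the 9.7GB partition size.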
Thanks for reading, and thanks in advance for any help you may provide.
--Richard
On 11/07/2011 05:18 PM, Richard Yen wrote:
My biggest question is: we know from the docs that there should be no more than (2 + checkpoint_completion_target) * checkpoint_segments + 1 files. For us, that would mean no more than 48 files, which at 16MB per segment equates to 768MB--far lower than the 9.7GB partition size. **Why would WAL use up so much disk space?**
That's only true if things are operating normally. There are at least two ways this can fail to be a proper upper limit on space used:
1) You are archiving to a second system, and the archiving isn't keeping up. Things that haven't been archived can't be re-used, so more disk space is used.
2) Disk I/O is slow, and the checkpoint writes take a significant period of time. The internal scheduling assumes each individual write will happen without too much delay, an assumption that can easily be untrue on a busy system. The worst I've seen so far are checkpoints that take 6 hours to sync, when the sync is supposed to take a few seconds. Disk space used in that case was a giant multiple of checkpoint_segments. (The source of that problem is much improved in PostgreSQL 9.1.)
The info needed to figure out which category you're in will appear after turning log_checkpoints on in postgresql.conf; you only need to reload the server config after that, not restart. I would guess you have really long sync times there.
As for what to do about it, checkpoint_segments=16 is a low setting. You might as well set it to a large number, say 128, and let checkpoints get driven by time instead. The existing limit isn't working effectively anyway, and having more segments lets the checkpoint spreading code work more evenly.
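For example, the change might look like this (a sketch of the suggested settings, not tuned for this particular system):

```
# postgresql.conf -- let checkpoints be driven by time instead of segments
checkpoint_segments = 128          # up from 16; gives the spreading code room
checkpoint_timeout = 5min          # the default; now the dominant trigger
checkpoint_completion_target = 0.9
log_checkpoints = on               # reload suffices; watch the "sync" times
```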
-- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
On 11/09/2011 05:06 PM, Greg Smith wrote:

Hello

We had a similar case in June, but we did not find the cause of our problem. More details and information: http://archives.postgresql.org/pgsql-docs/2011-06/msg00007.php

Your explanation in 2) sounds like a good candidate for the problem we had.

As I said in June, I think we need to improve the documentation in this area. A note in the documentation about what you have explained in 2), with maybe some hints about how to find out whether this is happening, would be a great improvement.

We did not understand why we experienced this problem in June when creating a GIN index on a tsvector column. But we found out that a lot of the tsvector data had been generated from "garbage" data (base64 encodings of huge attachments). When we generated the right tsvector data, the creation of the GIN index ran smoothly and the problem with the extra WAL files disappeared.

PS.- In our case, the disk space used by all the extra WAL files was almost equivalent to the 17GB of our GIN index.

regards,
-- Rafael Martinez Guerrero Center for Information Technology University of Oslo, Norway PGP Public Key: http://folk.uio.no/rafael/
On 11/11/2011 04:54 AM, Rafael Martinez wrote:
> Your explanation in 2) sounds like a good candidate for the problem we had. As I said in June, I think we need to improve the documentation in this area. A note in the documentation about what you have explained in 2), with maybe some hints about how to find out if this is happening, would be a great improvement.

A new counter was added to pg_stat_bgwriter in PostgreSQL 9.1 that tracks when the problem I described happens. It's hard to identify it specifically without a source code change of some sort; initially I added new logging to the server code to identify the issue, before the new counter was there. The only thing you can easily look at that tends to correlate well with the worst problems here is the output from turning log_checkpoints on. Specifically, the "sync" times going way up is a sign there's a problem with write speed.

As for the documentation, not much has really changed since you brought this up on the docs list. The number of WAL files that can be created by a short-term peak is unlimited, which is why there's no better limit listed than that. Some of the underlying things that make the problem worse are operating-system-level issues, not ones in the database itself, and the PostgreSQL documentation doesn't try to wander too far into that level. There are also a large number of things you can do at the application level that will generate a lot of WAL activity; it would be impractical to list all of them in the checkpoint documentation.

On reviewing this section of the docs again, one thing we could do is make the "WAL Configuration" section talk more about log_checkpoints and interpreting its output. Right now there's no mention of that parameter in the section that talks about the parameters to configure; there really should be.

-- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
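For anyone on 9.1 who wants to check for this, the counter in question is buffers_backend_fsync in pg_stat_bgwriter (a sketch; column names as of 9.1):

```sql
-- On 9.1+: a steadily growing buffers_backend_fsync suggests the
-- checkpointer's fsync request queue is overflowing and backends are
-- having to perform their own syncs -- a sign of the slow-sync problem.
SELECT checkpoints_timed, checkpoints_req, buffers_backend_fsync
FROM pg_stat_bgwriter;
```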