Re: BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues

Поиск
Список
Период
Сортировка
От Jeff Frost
Тема Re: BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues
Дата
Msg-id D2DE538F-32D2-4F70-9260-1FDAAE0F219D@pgexperts.com
обсуждение исходный текст
Ответ на Re: BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues  (Jeff Frost <jeff@pgexperts.com>)
Ответы Re: BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues  (Jeff Janes <jeff.janes@gmail.com>)
Список pgsql-bugs
On Feb 24, 2013, at 1:05 PM, Jeff Frost <jeff@pgexperts.com> wrote:

>=20
> On Feb 24, 2013, at 7:16 AM, Rafael Martinez Guerrero =
<r.m.guerrero@usit.uio.no> wrote:
>=20
>> We reported this back in 2011, but we did not get to any conclusion:
>> http://www.postgresql.org/message-id/4DE89072.7070305@usit.uio.no
>>=20
>> In our case, we had this problem when creating a GIN index.
>>=20
>> I think the problem has something to do with checkpoints, I think the
>> number of WAL files will grow beyond the limit defined because the
>> system can not be finish with checkpoints. A good start to try to
>> explain what it is happening would be to try to identify or describe =
the
>> situations where checkpoints can take very long to complete or fail
>> altogether.
>>=20
>> We are interested in this subject because we have had this problem a =
few
>> times. But we have not been able to identify the reason that triggers =
this.
>=20
> Rafael, did you do a followup post on pgsql-performance, or did the =
thread die at the end of the one you post above?

This is how the log_checkpoint output looks during the run:

2013-02-24 21:06:31.156 UTC,,,1624,,51282598.658,114,,2013-02-23 =
02:12:40 UTC,,0,LOG,00000,"checkpoint starting: immediate force =
wait",,,,,,,,,""
2013-02-24 21:06:31.216 UTC,,,1624,,51282598.658,115,,2013-02-23 =
02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 108 buffers =
(0.0%); 0 transaction log file(s) added, 0 removed, 4 recycled; =
write=3D0.054 s, sync=3D0.002 s, total=3D0.059 s; sync files=3D10, =
longest=3D0.000 s, average=3D0.000 s",,,,,,,,,""

So, this ^^^^^ is the manual checkpoint prior to the pgbench run and =
everything else is part of the pgbench run.

2013-02-24 21:06:45.227 UTC,,,1624,,51282598.658,116,,2013-02-23 =
02:12:40 UTC,,0,LOG,00000,"checkpoints are occurring too frequently (14 =
seconds apart)",,"Consider increasing the configuration parameter =
""checkpoint_segments"".",,,,,,,""
2013-02-24 21:06:45.227 UTC,,,1624,,51282598.658,117,,2013-02-23 =
02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2013-02-24 21:08:35.239 UTC,,,1624,,51282598.658,118,,2013-02-23 =
02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 266935 buffers =
(12.7%); 0 transaction log file(s) added, 0 removed, 1 recycled; =
write=3D10.889 s, sync=3D99.079 s, total=3D110.011 s; sync files=3D29, =
longest=3D12.888 s, average=3D3.416 s",,,,,,,,,""
2013-02-24 21:08:35.241 UTC,,,1624,,51282598.658,119,,2013-02-23 =
02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2013-02-24 21:11:19.506 UTC,,,1624,,51282598.658,120,,2013-02-23 =
02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 801517 buffers =
(38.2%); 0 transaction log file(s) added, 0 removed, 128 recycled; =
write=3D15.484 s, sync=3D148.739 s, total=3D164.266 s; sync files=3D32, =
longest=3D11.773 s, average=3D4.648 s",,,,,,,,,""
2013-02-24 21:11:19.506 UTC,,,1624,,51282598.658,121,,2013-02-23 =
02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2013-02-24 21:14:20.486 UTC,,,1624,,51282598.658,122,,2013-02-23 =
02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 1009466 buffers =
(48.1%); 0 transaction log file(s) added, 140 removed, 257 recycled; =
write=3D13.690 s, sync=3D167.245 s, total=3D180.980 s; sync files=3D33, =
longest=3D14.442 s, average=3D5.067 s",,,,,,,,,""
2013-02-24 21:14:20.487 UTC,,,1624,,51282598.658,123,,2013-02-23 =
02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2013-02-24 21:17:31.507 UTC,,,1624,,51282598.658,124,,2013-02-23 =
02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 1055059 buffers =
(50.3%); 0 transaction log file(s) added, 259 removed, 257 recycled; =
write=3D22.731 s, sync=3D168.239 s, total=3D191.020 s; sync files=3D32, =
longest=3D13.650 s, average=3D5.257 s",,,,,,,,,""
2013-02-24 21:17:31.507 UTC,,,1624,,51282598.658,125,,2013-02-23 =
02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2013-02-24 21:23:36.201 UTC,,,1624,,51282598.658,126,,2013-02-23 =
02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 814767 buffers =
(38.9%); 0 transaction log file(s) added, 546 removed, 0 recycled; =
write=3D269.553 s, sync=3D95.042 s, total=3D364.693 s; sync files=3D31, =
longest=3D11.211 s, average=3D3.065 s",,,,,,,,,""

They're all number of xlog based and the first one doesn't remove any, =
that's not surprising, but I'm surprised that the second one doesn't =
remove any.  It does recycle quite a few.  I guess the issue is that it =
can generate WAL files way faster than it can checkpoint and cleanup.  I =
think I need to write a script to output the number of WAL files each =
minute so we can better correlate what's going on.=

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Jeff Frost
Дата:
Сообщение: Re: BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues
Следующее
От: Rafael Martinez Guerrero
Дата:
Сообщение: Re: BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues