Обсуждение: could not truncate directory "pg_serial": apparent wraparound

Поиск
Список
Период
Сортировка

could not truncate directory "pg_serial": apparent wraparound

От
"Kevin Grittner"
Дата:
We had a report of the subject message during testing a while back
and attempted to address the issue.  It can result in a LOG level
message and the accumulation of files in the pg_serial SLRU
subdirectory.  We haven't seen a recurrence, until I hit it during
testing of the just-posted patch for SSI DDL.  I re-read the code
and believe that the attached is the correct fix.

-Kevin


Вложения

Re: could not truncate directory "pg_serial": apparent wraparound

От
Heikki Linnakangas
Дата:
On 08.06.2011 03:28, Kevin Grittner wrote:
> We had a report of the subject message during testing a while back
> and attempted to address the issue.  It can result in a LOG level
> message and the accumulation of files in the pg_serial SLRU
> subdirectory.  We haven't seen a recurrence, until I hit it during
> testing of the just-posted patch for SSI DDL.  I re-read the code
> and believe that the attached is the correct fix.

Doesn't this patch bring back the issue mentioned in the comment: the 
slru page might not get used again until we wrap-around?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: could not truncate directory "pg_serial": apparent wraparound

От
"Kevin Grittner"
Дата:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> On 08.06.2011 03:28, Kevin Grittner wrote:
>> We had a report of the subject message during testing a while
>> back and attempted to address the issue.  It can result in a LOG
>< level message and the accumulation of files in the pg_serial SLRU
>> subdirectory.  We haven't seen a recurrence, until I hit it
>> during testing of the just-posted patch for SSI DDL.  I re-read
>> the code and believe that the attached is the correct fix.
> 
> Doesn't this patch bring back the issue mentioned in the comment:
> the slru page might not get used again until we wrap-around?
In the previous attempt I thought it was looking at the allocated
files to assess that it was approaching wraparound.  In looking at
the SLRU code last night, it seemed that it was really using the
"last zeroed" page as the comparison point, and wanted the
truncation point to precede that.  Given that last night didn't seem
to be my sharpest hour, it would be wise to confirm this.  This code
is pretty hard to test, since it only comes into play on overflow of
the predicate locking shared memory structures, so desk-checking is
important.
So, to directly address your question, if we don't overflow again
and go to the SLRU summary data, one file could sit in the pg_serial
directory indefinitely; but we should avoid the LOG message and the
accumulation of large numbers of such files -- if I'm reading the
SLRU code correctly....
-Kevin


Re: could not truncate directory "pg_serial": apparent wraparound

От
Heikki Linnakangas
Дата:
While testing this, I noticed another serious bug in the OldSerXidSLRU 
handling: we never set the dirty-flag on any page. I believe the reason 
we haven't bumped into this in testing before is that when a new page is 
initialized, it's marked as dirty, so everything goes smoothly when we 
modify recently-zeroed pages. But if a page falls out of the cache, and 
is later read back in and modified, the modifications are lost.

The comments in SLRU could be more explicit about this. It was 
coincidental that I started to wonder where the pages are marked as 
dirty, I somehow thought the SLRU functions do that for you.

Fortunately the fix is very simple, we just need to set the page_dirty 
flag whenever we modify an slru page. But clearly this slru stuff needs 
more testing. It's pretty hard to write good repeatable test cases for 
these things, though.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: could not truncate directory "pg_serial": apparent wraparound

От
Alvaro Herrera
Дата:
Excerpts from Heikki Linnakangas's message of jue jun 09 04:56:41 -0400 2011:

> Fortunately the fix is very simple, we just need to set the page_dirty 
> flag whenever we modify an slru page. But clearly this slru stuff needs 
> more testing. It's pretty hard to write good repeatable test cases for 
> these things, though.

Maybe reduce the number of SLRU buffers used?

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: could not truncate directory "pg_serial": apparent wraparound

От
"Kevin Grittner"
Дата:
Alvaro Herrera <alvherre@commandprompt.com> wrote:
> Excerpts from Heikki Linnakangas's message of jue jun 09 04:56:41
-0400 2011:
> 
>> Fortunately the fix is very simple, we just need to set the
>> page_dirty flag whenever we modify an slru page. But clearly this
>> slru stuff needs more testing. It's pretty hard to write good
>> repeatable test cases for these things, though.
> 
> Maybe reduce the number of SLRU buffers used?
That's a thought.  I'll see about getting a build with
TEST_OLDSERXID defined and just one or two SLRU buffers for this,
and see if anything pops out of that.
.... after I finish reviewing Dan's patch from last night.
Thanks,
-Kevin


Re: could not truncate directory "pg_serial": apparent wraparound

От
Heikki Linnakangas
Дата:
On 08.06.2011 22:40, Kevin Grittner wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  wrote:
>> On 08.06.2011 03:28, Kevin Grittner wrote:
>>> We had a report of the subject message during testing a while
>>> back and attempted to address the issue.  It can result in a LOG
>> <  level message and the accumulation of files in the pg_serial SLRU
>>> subdirectory.  We haven't seen a recurrence, until I hit it
>>> during testing of the just-posted patch for SSI DDL.  I re-read
>>> the code and believe that the attached is the correct fix.
>>
>> Doesn't this patch bring back the issue mentioned in the comment:
>> the slru page might not get used again until we wrap-around?
>
> In the previous attempt I thought it was looking at the allocated
> files to assess that it was approaching wraparound.  In looking at
> the SLRU code last night, it seemed that it was really using the
> "last zeroed" page as the comparison point, and wanted the
> truncation point to precede that.

I've committed your patch for now. It does in fact bring back the 
original problem the clever but broken code was trying to address: if a 
pg_serial is not needed for a very long time, the last segment that we 
leave behind will eventually appear to be new again, and won't be 
cleaned up until a lot more XIDs pass.

> So, to directly address your question, if we don't overflow again
> and go to the SLRU summary data, one file could sit in the pg_serial
> directory indefinitely; but we should avoid the LOG message and the
> accumulation of large numbers of such files -- if I'm reading the
> SLRU code correctly....

Yeah, as far as I can see it's harmless except for the small waste of 
disk space. We probably want to revisit this later, maybe still for 9.1 
or maybe not, but for now I just put in a comment about it.

I also fixed the broken warning logic. Please double-check that too when 
you get a chance.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: could not truncate directory "pg_serial": apparent wraparound

От
"Kevin Grittner"
Дата:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> I've committed your patch for now.
Thanks!
I don't see it yet on the public git repo, nor on the -commiters
list.  I'll keep an eye out for it.
> as far as I can see it's harmless except for the small waste of 
> disk space. We probably want to revisit this later, maybe still
> for 9.1 or maybe not, but for now I just put in a comment about
> it.
I'm inclined to think it's not worth the trouble to revisit it.  At
most it would be one segment.  In fact, in line with your desire to
flush the SLRU pages so they're useful for debugging, we could call
this a feature -- it provides a quick way to check the date and time
of the last need for SLRU summarization.  I can see that being a
useful piece of information for forensics.
-Kevin


Re: could not truncate directory "pg_serial": apparent wraparound

От
Heikki Linnakangas
Дата:
On 09.06.2011 21:15, Kevin Grittner wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  wrote:
>
>> I've committed your patch for now.
>
> Thanks!
>
> I don't see it yet on the public git repo, nor on the -commiters
> list.  I'll keep an eye out for it.

Oh, rats! Forgot to push.. Will do so now.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: could not truncate directory "pg_serial": apparent wraparound

От
"Kevin Grittner"
Дата:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> I also fixed the broken warning logic. Please double-check that
> too when you get a chance.
As usual, I like your code better than mine.  I don't think my code
was broken in the sense that it would generate different results
than yours, but it was too clever by half, and depended on knowledge
that a TransactionId is 32 bits.  Yours is more tolerant of changes
to implementation details, and much easier to read and understand.
While I'm *pretty sure* my code for this worked, I am *positive*
that yours does.  :-)
Thanks again.
-Kevin