Re: Write Ahead Logging for Hash Indexes

From Jeff Janes
Subject Re: Write Ahead Logging for Hash Indexes
Date
Msg-id CAMkU=1z=NzD5XC+q1+qanzTdJx5i7vZkji36rRPYMU=mGcgibQ@mail.gmail.com
In reply to Re: Write Ahead Logging for Hash Indexes  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Write Ahead Logging for Hash Indexes  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Tue, Sep 20, 2016 at 10:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>>
>> Okay, Thanks for pointing out the same.  I have fixed it.  Apart from
>> that, I have changed _hash_alloc_buckets() to initialize the page
>> instead of making it completely Zero because of problems discussed in
>> another related thread [1].  I have also updated README.
>>
>
> With v7 of the concurrent hash patch, v4 of the write-ahead log patch, and
> the latest relcache patch (I don't know how important that is to reproducing
> this; I suspect it is not), I once got this error:
>
>
> 38422  00000 2016-09-19 16:25:50.055 PDT:LOG:  database system was
> interrupted; last known up at 2016-09-19 16:25:49 PDT
> 38422  00000 2016-09-19 16:25:50.057 PDT:LOG:  database system was not
> properly shut down; automatic recovery in progress
> 38422  00000 2016-09-19 16:25:50.057 PDT:LOG:  redo starts at 3F/2200DE90
> 38422  01000 2016-09-19 16:25:50.061 PDT:WARNING:  page verification failed,
> calculated checksum 65067 but expected 21260
> 38422  01000 2016-09-19 16:25:50.061 PDT:CONTEXT:  xlog redo at 3F/22053B50
> for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
> 38422  XX001 2016-09-19 16:25:50.071 PDT:FATAL:  invalid page in block 9 of
> relation base/16384/17334
> 38422  XX001 2016-09-19 16:25:50.071 PDT:CONTEXT:  xlog redo at 3F/22053B50
> for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
>
>
> The original page with the invalid checksum is:
>

I think this is an example of the torn page problem, which seems to be
happening because of the code below in your test.

!         if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress()) {
!             nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
!             ereport(FATAL,
!                     (errcode(ERRCODE_DISK_FULL),
!                      errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
!                             blocknum,
!                             relpath(reln->smgr_rnode, forknum),
!                             nbytes, BLCKSZ),
!                      errhint("JJ is screwing with the database.")));
!         } else {
!             nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
!         }

If you are running the above test with JJ_torn_page disabled, then it
is a different matter and we need to investigate it, but I assume you
are running with it enabled.

I think this could happen if the actual change to the page is in the
last 2/3 of the page, which the above code does not write.  The checksum
in the page header, which is written as part of the partial page write
(the first 1/3 of the page), would have covered the change you made,
whereas after restart, when the page is read again to apply redo, the
checksum calculation won't include the change made in the last 2/3.
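
The mismatch described above can be reproduced in miniature. The following is only a sketch under stated assumptions: the checksum is a toy position-weighted sum (not PostgreSQL's actual checksum algorithm), PAGE_SIZE stands in for BLCKSZ, CKSUM_OFF is an assumed location for the checksum field, and all function names are hypothetical. It shows that when only the first third of a page reaches disk, the stored checksum describes the new page while the remaining bytes are still the old ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192   /* stands in for BLCKSZ */
#define CKSUM_OFF 8      /* assumed offset of the 16-bit checksum in the header */

/* Toy checksum over the whole page, skipping the checksum field itself.
 * This is NOT PostgreSQL's real algorithm; it only needs to depend on
 * every byte of the page. */
uint16_t toy_checksum(const uint8_t *page)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < PAGE_SIZE; i++) {
        if (i == CKSUM_OFF || i == CKSUM_OFF + 1)
            continue;
        sum += (uint32_t) page[i] * (uint32_t) (i + 1);
    }
    return (uint16_t) (sum ^ (sum >> 16));
}

/* Store the checksum of the current page contents in the header. */
void toy_set_checksum(uint8_t *page)
{
    uint16_t c = toy_checksum(page);

    memcpy(page + CKSUM_OFF, &c, sizeof c);
}

/* Return 1 if the stored checksum matches the page contents. */
int toy_checksum_ok(const uint8_t *page)
{
    uint16_t stored;

    memcpy(&stored, page + CKSUM_OFF, sizeof stored);
    return stored == toy_checksum(page);
}

/* Simulate the test harness: only the first third of the new page
 * overwrites what is already on "disk". */
void toy_torn_write(uint8_t *disk, const uint8_t *new_page)
{
    memcpy(disk, new_page, PAGE_SIZE / 3);
}
```

If a byte near the end of the in-memory page is changed, the checksum is recomputed, and toy_torn_write() is then applied to an older on-disk copy, toy_checksum_ok() on the disk copy returns 0: the header carries the new checksum, but the changed byte never arrived.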

Correct.  But any torn page write must be covered by the restoration of a full page image during replay, shouldn't it?  And that restoration should happen blindly, without first reading in the old page and verifying the checksum.  Failure to restore the page from a FPI would be a bug.  (That was the purpose for which I wrote this testing harness in the first place, to verify that the restoration of FPI happens correctly; although most of the bugs it happens to uncover have been unrelated to that.)
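
To make that expectation concrete, here is a toy model of the replay decision. It is a sketch, not PostgreSQL's redo code: ToyRecord, RedoResult, toy_redo() and the XOR-based validity check are all hypothetical names and simplifications. The only point is the branch order: a record carrying a full page image overwrites the page without ever looking at the old contents, while a record without one must first read and verify the page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192   /* stands in for BLCKSZ */

/* Toy validity check: the last byte must equal the XOR of all other bytes. */
int toy_page_valid(const uint8_t *page)
{
    uint8_t x = 0;

    for (size_t i = 0; i < PAGE_SIZE - 1; i++)
        x ^= page[i];
    return x == page[PAGE_SIZE - 1];
}

/* Recompute the toy check byte after modifying the page. */
void toy_page_seal(uint8_t *page)
{
    uint8_t x = 0;

    for (size_t i = 0; i < PAGE_SIZE - 1; i++)
        x ^= page[i];
    page[PAGE_SIZE - 1] = x;
}

typedef struct ToyRecord {
    const uint8_t *fpi;   /* full page image, or NULL if the record has none */
    size_t         off;   /* otherwise: offset of a one-byte change */
    uint8_t        val;   /* otherwise: the new byte value */
} ToyRecord;

typedef enum { REDO_RESTORED, REDO_APPLIED, REDO_INVALID_PAGE } RedoResult;

RedoResult toy_redo(uint8_t *disk_page, const ToyRecord *rec)
{
    if (rec->fpi != NULL) {
        /* Blind restore: the old page is never read or verified. */
        memcpy(disk_page, rec->fpi, PAGE_SIZE);
        return REDO_RESTORED;
    }

    /* No image: the change is a delta against the existing page,
     * so a torn or corrupt page cannot be repaired here. */
    if (!toy_page_valid(disk_page))
        return REDO_INVALID_PAGE;

    assert(rec->off < PAGE_SIZE - 1);
    disk_page[rec->off] = rec->val;
    toy_page_seal(disk_page);
    return REDO_APPLIED;
}
```

In this model a torn page is harmless as long as the first record that touches it after the checkpoint carries an image; it becomes fatal only when a delta record reaches it first.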

 

Today, Ashutosh shared the logs of his test run, which show a
similar problem for a HEAP page.  I think this could happen, though
rarely, for any page with the above kind of test.

I think Ashutosh's examples are of warnings, not errors.  I think the warnings occur when replay needs to read in the block (for reasons I don't understand yet) but then doesn't care whether it passes the checksum or not, because it will just be blown away by the replay anyway.


Does this explanation explain the reason for the problem you are seeing?

If it can't survive artificial torn page writes, then it probably can't survive real ones either.  So I am pretty sure it is a bug of some sort.  Perhaps the bug is that it is generating an ERROR when it should just be a WARNING?
 

>
> If I ignore the checksum failure and re-start the system, the page gets
> restored to be a bitmap page.
>

Okay, but have you ensured in some way that redo is applied to the bitmap page?


I haven't done that yet.  I can't start the system without destroying the evidence, and I haven't figured out yet how to import a specific block from a shut-down server into a bytea of a running server, in order to inspect it using pageinspect.
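
One possible way to get at such a block without starting the server is to read it straight out of the relation's segment file and print it as a bytea hex literal. The sketch below assumes the default 8192-byte block size and that the block lies in the segment file being read; read_block() and page_to_bytea_hex() are hypothetical helper names, not part of any PostgreSQL tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192   /* PostgreSQL's default block size; an assumption here */

/* Read block number blkno from an already-open relation segment file.
 * Returns 0 on success, -1 on a seek error or a short read. */
int read_block(FILE *fp, uint32_t blkno, uint8_t *page)
{
    if (fseek(fp, (long) blkno * BLCKSZ, SEEK_SET) != 0)
        return -1;
    return fread(page, 1, BLCKSZ, fp) == (size_t) BLCKSZ ? 0 : -1;
}

/* Format a page as the body of a bytea hex literal: "\x" followed by
 * two hex digits per byte.  out needs room for 2 + 2*BLCKSZ + 1 bytes.
 * Returns 0 on success, -1 if the buffer is too small. */
int page_to_bytea_hex(const uint8_t *page, char *out, size_t outlen)
{
    static const char digits[] = "0123456789abcdef";

    if (outlen < (size_t) (2 + 2 * BLCKSZ + 1))
        return -1;

    out[0] = '\\';
    out[1] = 'x';
    for (size_t i = 0; i < BLCKSZ; i++) {
        out[2 + 2 * i] = digits[page[i] >> 4];
        out[3 + 2 * i] = digits[page[i] & 0x0F];
    }
    out[2 + 2 * BLCKSZ] = '\0';
    return 0;
}
```

For the failure in the log, that would mean opening base/16384/17334 from a copy of the data directory and reading block 9. The resulting string could then be quoted and cast to bytea on a running server and handed to pageinspect functions that accept a raw page, such as page_header().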

Today, while thinking about this problem, I realized that the patch
currently uses REGBUF_NO_IMAGE for the bitmap page, to address one of
the problems you reported [1].  That change fixes the problem you
reported, but it exposes bitmap pages to torn-page hazards.  I think
the right fix there is to make pd_lower equal to pd_upper for the
bitmap page, so that full page writes don't exclude the data in the
bitmap page.
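
For readers unfamiliar with why pd_lower and pd_upper matter here: a full page image of a page treated as having the standard layout may omit the "hole" between pd_lower and pd_upper, and that range is zero-filled on restore. The sketch below is a simplified model of that behaviour, not the actual WAL code; ToyImage, toy_take_image(), toy_restore_image() and the 24-byte HEADER_SIZE are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   8192   /* stands in for BLCKSZ */
#define HEADER_SIZE 24     /* assumed page header size for this sketch */

typedef struct ToyImage {
    uint16_t hole_offset;      /* where the omitted range starts */
    uint16_t hole_length;      /* how many bytes were omitted */
    uint16_t length;           /* bytes actually stored in data[] */
    uint8_t  data[PAGE_SIZE];
} ToyImage;

/* Take an image of the page, leaving out the bytes in [lower, upper)
 * when that range looks like a valid hole. */
void toy_take_image(const uint8_t *page, uint16_t lower, uint16_t upper,
                    ToyImage *img)
{
    if (lower >= HEADER_SIZE && upper > lower && upper <= PAGE_SIZE) {
        img->hole_offset = lower;
        img->hole_length = (uint16_t) (upper - lower);
    } else {
        /* No usable hole: store the whole page. */
        img->hole_offset = 0;
        img->hole_length = 0;
    }
    img->length = (uint16_t) (PAGE_SIZE - img->hole_length);

    memcpy(img->data, page, img->hole_offset);
    memcpy(img->data + img->hole_offset,
           page + img->hole_offset + img->hole_length,
           (size_t) PAGE_SIZE - img->hole_offset - img->hole_length);
}

/* Rebuild the page from an image; the omitted range comes back as zeros. */
void toy_restore_image(const ToyImage *img, uint8_t *page)
{
    size_t tail = (size_t) img->hole_offset + img->hole_length;

    memcpy(page, img->data, img->hole_offset);
    memset(page + img->hole_offset, 0, img->hole_length);
    memcpy(page + tail, img->data + img->hole_offset, PAGE_SIZE - tail);
}
```

In this model, a bitmap byte stored at offset 4000 is lost when the image is taken with lower = 24 and upper = 8000, because it sits inside the hole. With lower equal to upper the hole is empty, and the same byte survives a take/restore round trip.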

I'm afraid that is over my head.  I can study it until it makes sense, but it will take me a while.

Cheers,

Jeff


[1] - https://www.postgresql.org/message-id/CAA4eK1KJOfVvFUmi6dcX9Y2-0PFHkomDzGuyoC%3DaD3Qj9WPpFA%40mail.gmail.com
