Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

From: Fujii Masao
Subject: Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)
Msg-id: CAHGQGwF0uBtRoWOObmYjH_Mpi5CJardA8TM89XhCyXYc=1-ewQ@mail.gmail.com
In reply to: Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses: Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List: pgsql-hackers
On Thu, Feb 16, 2012 at 1:01 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 13.02.2012 19:13, Fujii Masao wrote:
>>
>> On Mon, Feb 13, 2012 at 8:37 PM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com>  wrote:
>>>
>>> On 13.02.2012 01:04, Jeff Janes wrote:
>>>>
>>>>
>>>> Attached is my quick and dirty attempt to set XLP_FIRST_IS_CONTRECORD.
>>>>  I have no idea if I did it correctly, in particular if calling
>>>> GetXLogBuffer(CurrPos) twice is OK or if GetXLogBuffer has side
>>>> effects that make that a bad thing to do.  I'm not proposing it as the
>>>> real fix, I just wanted to get around this problem in order to do more
>>>> testing.
>>>
>>>
>>>
>>> Thanks. That's basically the right approach. Attached patch contains a
>>> cleaned up version of that.
>>>
>>>
>>>> It does get rid of the "there is no contrecord flag" errors, but
>>>> recover still does not work.
>>>>
>>>> Now the count of tuples in the table is always correct (I never
>>>> provoke a crash during the initial table load), but sometimes updates
>>>> to those tuples that were reported to have been committed are lost.
>>>>
>>>> This is more subtle, it does not happen on every crash.
>>>>
>>>> It seems that when recovery ends on "record with zero length at...",
>>>> that recovery is correct.
>>>>
>>>> But when it ends on "invalid magic number 0000 in log file.." then the
>>>> recovery is screwed up.
>>>
>>>
>>>
>>> Can you write a self-contained test case for that? I've been trying to
>>> reproduce that by running the regression tests and pgbench with a
>>> streaming
>>> replication standby, which should be pretty much the same as crash
>>> recovery.
>>> No luck this far.
>>
>>
>> Probably I could reproduce the same problem as Jeff got. Here is the test
>> case:
>>
>> $ initdb -D data
>> $ pg_ctl -D data start
>> $ psql -c "create table t (i int); insert into t
>> values(generate_series(1,10000)); delete from t"
>> $ pg_ctl -D data stop -m i
>> $ pg_ctl -D data start
>>
>> The crash recovery emitted the following server logs:
>>
>> LOG:  database system was interrupted; last known up at 2012-02-14
>> 02:07:01 JST
>> LOG:  database system was not properly shut down; automatic recovery in
>> progress
>> LOG:  redo starts at 0/179CC90
>> LOG:  invalid magic number 0000 in log file 0, segment 1, offset 8060928
>> LOG:  redo done at 0/17AD858
>> LOG:  database system is ready to accept connections
>> LOG:  autovacuum launcher started
>>
>> After recovery, I could not see the table "t" which I created before:
>>
>> $ psql -c "select count(*) from t"
>> ERROR:  relation "t" does not exist
>
>
> Are you still seeing this failure with the latest patch I posted
> (http://archives.postgresql.org/message-id/4F38F5E5.8050203@enterprisedb.com)?

Yes. Just to be safe, I applied the latest patch to HEAD again, compiled it,
and reran the same test. Unfortunately, I got the same failure.

I ran configure with '--enable-debug' '--enable-cassert'
'CPPFLAGS=-DWAL_DEBUG', and ran make with the -j 2 option.

When I ran the test with wal_debug = on, I got the following assertion failure:

LOG:  INSERT @ 0/17B3F90: prev 0/17B3F10; xid 998; len 31: Heap -
insert: rel 1663/12277/16384; tid 0/197
STATEMENT:  create table t (i int); insert into t
values(generate_series(1,10000)); delete from t
LOG:  INSERT @ 0/17B3FD0: prev 0/17B3F50; xid 998; len 31: Heap -
insert: rel 1663/12277/16384; tid 0/198
STATEMENT:  create table t (i int); insert into t
values(generate_series(1,10000)); delete from t
TRAP: FailedAssertion("!(((bool) (((void*)(&(target->tid)) != ((void
*)0)) && ((&(target->tid))->ip_posid != 0))))", File: "heapam.c",
Line: 5578)
LOG:  xlog bg flush request 0/17B4000; write 0/17A6000; flush 0/179D5C0
LOG:  xlog bg flush request 0/17B4000; write 0/17B0000; flush 0/17B0000
LOG:  server process (PID 16806) was terminated by signal 6: Abort trap

This might be related to the original problem which Jeff and I saw.
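
For reference when reading the logs above: the "invalid magic number" message
reports its position as a log file, segment, and offset, while the other
messages use the X/Y form. A minimal sketch of the conversion, assuming the
default 16 MB WAL segment size (the helper name seg_to_lsn is hypothetical,
not part of PostgreSQL):

```sh
#!/bin/sh
# Hypothetical helper, not a PostgreSQL tool: convert the
# "log file X, segment Y, offset Z" form into the X/HEX form
# used by the other log messages.
# Assumption: default WAL segment size of 16 MB.
WAL_SEG_SIZE=$((16 * 1024 * 1024))

seg_to_lsn() {
    # $1 = log file id, $2 = segment number, $3 = byte offset in the segment
    printf '%X/%X\n' "$1" $(( $2 * WAL_SEG_SIZE + $3 ))
}

# Position from the message in the first test run
seg_to_lsn 0 1 8060928    # prints 0/17B0000
```

Under that assumption, the page that failed the magic number check starts at
0/17B0000, just past the last replayed record at 0/17AD858.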

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

