Обсуждение: [sqlsmith] crashes in RestoreSnapshot on hot standby

Поиск
Список
Период
Сортировка

[sqlsmith] crashes in RestoreSnapshot on hot standby

От
Andreas Seltenreich
Дата:
Running sqlsmith on a streaming slave (master as of f8c5855) is
inconspicuous as long as the master is idle.  As soon as I start it on
the master as well, the standby frequently crashes in RestoreSnapshot.
It doesn't seem to be specific to the queries, as they don't trigger a
crash when re-run.

Backtraces always look like the ones below.

regards,
Andreas

**** BEGIN BACKTRACE OF CORE FILE ./slave/postgres.9826@.core ON doombat ****
Core was generated by `postgres: smith regression [local] SELECT                         '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:167
167    ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Datei oder Verzeichnis nicht gefunden.
(gdb) bt
#0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:167
#1  0x0000000000822032 in RestoreSnapshot (start_address=start_address@entry=0x7f2701d5a110 <error: Cannot access
memoryat address 0x7f2701d5a110>) at snapmgr.c:2020
 
#2  0x00000000004a934a in heap_beginscan_parallel (relation=0x2060a90,
parallel_scan=parallel_scan@entry=0x7f2701d5a0f8)at heapam.c:1657
 
#3  0x00000000005fbedf in ExecSeqScanInitializeDSM (node=0x1f5b470, pcxt=0x221af88) at nodeSeqscan.c:327
#4  0x00000000005dd0ad in ExecParallelInitializeDSM (planstate=planstate@entry=0x1f5b470, d=d@entry=0x7ffd4ba200d0) at
execParallel.c:245
#5  0x00000000005dd425 in ExecInitParallelPlan (planstate=0x1f5b470, estate=estate@entry=0x1f5ab28, nworkers=2) at
execParallel.c:477
#6  0x00000000005ef4a4 in ExecGather (node=node@entry=0x1f5b048) at nodeGather.c:159
#7  0x00000000005dda48 in ExecProcNode (node=node@entry=0x1f5b048) at execProcnode.c:515
#8  0x00000000005f4b30 in ExecLimit (node=node@entry=0x1f5acd0) at nodeLimit.c:91
#9  0x00000000005dd9d8 in ExecProcNode (node=node@entry=0x1f5acd0) at execProcnode.c:531
#10 0x00000000005fef7c in ExecSetParamPlan (node=<optimized out>, econtext=0x1f5c138) at nodeSubplan.c:999
#11 0x00000000005e28b5 in ExecEvalParamExec (exprstate=<optimized out>, econtext=<optimized out>, isNull=0x22045b0 "",
isDone=<optimizedout>) at execQual.c:1135
 
#12 0x00000000005deb6d in ExecMakeFunctionResultNoSets (fcache=0x2204200, econtext=0x1f5c138, isNull=0x2203d98 "",
isDone=<optimizedout>) at execQual.c:2015
 
#13 0x00000000005de29a in ExecEvalCoalesce (coalesceExpr=<optimized out>, econtext=0x1f5c138, isNull=0x2203d98 "",
isDone=<optimizedout>) at execQual.c:3446
 
#14 0x00000000005deb6d in ExecMakeFunctionResultNoSets (fcache=0x22039e8, econtext=0x1f5c138, isNull=0x7ffd4ba203df "",
isDone=<optimizedout>) at execQual.c:2015
 
#15 0x00000000005e4939 in ExecQual (qual=<optimized out>, econtext=econtext@entry=0x1f5c138,
resultForNull=resultForNull@entry=0'\000') at execQual.c:5269
 
#16 0x00000000005faef1 in ExecResult (node=node@entry=0x1f5c020) at nodeResult.c:82
#17 0x00000000005ddbf8 in ExecProcNode (node=node@entry=0x1f5c020) at execProcnode.c:392
#18 0x00000000005d9c1f in ExecutePlan (dest=0x1ebb7d0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized
out>,operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1f5c020, estate=0x1f5ab28) at
execMain.c:1567
#19 standard_ExecutorRun (queryDesc=0x1f5a718, direction=<optimized out>, count=0) at execMain.c:338
#20 0x00000000006f7238 in PortalRunSelect (portal=portal@entry=0x1d13be8, forward=forward@entry=1 '\001', count=0,
count@entry=9223372036854775807,dest=dest@entry=0x1ebb7d0) at pquery.c:946
 
#21 0x00000000006f875e in PortalRun (portal=0x1d13be8, count=9223372036854775807, isTopLevel=<optimized out>,
dest=0x1ebb7d0,altdest=0x1ebb7d0, completionTag=0x7ffd4ba20840 "") at pquery.c:787
 
#22 0x00000000006f6003 in exec_simple_query (query_string=<optimized out>) at postgres.c:1094
#23 PostgresMain (argc=30489576, argv=0x1ecfb08, dbname=0x1cf5a00 "regression", username=0x1ecfc20 "\b\373\354\001") at
postgres.c:4074
#24 0x000000000046ca67 in BackendRun (port=0x1d17b50) at postmaster.c:4262
#25 BackendStartup (port=0x1d17b50) at postmaster.c:3936
#26 ServerLoop () at postmaster.c:1693
#27 0x0000000000690ab7 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1cf45e0) at postmaster.c:1301
#28 0x000000000046d9cd in main (argc=3, argv=0x1cf45e0) at main.c:228
(gdb) p debug_query_string
$1 = 0x1d68a78 "select  \n  sample_0.j as c0\nfrom \n  public.testjsonb as sample_0 tablesample system (8) \nwhere
cast(coalesce(pg_catalog.char_length(\n     cast((select comment from public.room limit 1 offset 20)\n         as
text)),\n   pg_catalog.pg_trigger_depth()) as integer) <> 3"
 

**** BEGIN BACKTRACE OF CORE FILE ./slave/postgres.8104@.core ON marbit ****
Core was generated by `postgres: bgworker: parallel worker for PID 2610                  '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:167
167    ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Datei oder Verzeichnis nicht gefunden.
(gdb) bt
#0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:167
#1  0x0000000000822032 in RestoreSnapshot (start_address=0x7f7b1ee4fa58 <error: Cannot access memory at address
0x7f7b1ee4fa58>)at snapmgr.c:2020
 
#2  0x00000000004e6142 in ParallelWorkerMain (main_arg=<optimized out>) at parallel.c:1010
#3  0x0000000000683b62 in StartBackgroundWorker () at bgworker.c:726
#4  0x000000000068ec32 in do_start_bgworker (rw=0x22a8900) at postmaster.c:5535
#5  maybe_start_bgworker () at postmaster.c:5709
#6  0x000000000068f686 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:4971
#7  <signal handler called>
#8  0x00007f7b1e3f3ac3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
#9  0x000000000046c381 in ServerLoop () at postmaster.c:1657
#10 0x0000000000690ab7 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x22825e0) at postmaster.c:1301
#11 0x000000000046d9cd in main (argc=3, argv=0x22825e0) at main.c:228



Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

От
Michael Paquier
Дата:
On Fri, Jul 1, 2016 at 6:26 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> #1  0x0000000000822032 in RestoreSnapshot (start_address=start_address@entry=0x7f2701d5a110 <error: Cannot access
memoryat address 0x7f2701d5a110>) at snapmgr.c:2020
 
       memcpy(snapshot->subxip, serialized_xids + serialized_snapshot->xcnt,              serialized_snapshot->subxcnt
*sizeof(TransactionId));
 
So this is choking here? Is one of those pointers NULL?
-- 
Michael



Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

От
Thomas Munro
Дата:
On Fri, Jul 1, 2016 at 2:17 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Jul 1, 2016 at 6:26 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
>> #1  0x0000000000822032 in RestoreSnapshot (start_address=start_address@entry=0x7f2701d5a110 <error: Cannot access
memoryat address 0x7f2701d5a110>) at snapmgr.c:2020 
>
>         memcpy(snapshot->subxip, serialized_xids + serialized_snapshot->xcnt,
>                serialized_snapshot->subxcnt * sizeof(TransactionId));
> So this is choking here? Is one of those pointers NULL?

Theory 1:
If serialized_snapshot->xcnt == 0, then snapshot->xip never gets
initialized to a non-NULL value.  Then if serialized_snapshot->subxcnt
> 0, we set snapshot->subxip = snapshot->xip +
serialized_snapshot->xcnt (so that's NULL too).  Then in line the line
you show we call memcpy(snapshot->subxip, ...).  The fix might be
something like the attached.

Theory 2:
The DSM segment was deleted underneath us.  We can see that it was not
mapped by the time GDB dumped that (start_address is not accessible).

Theory 3:
Somehow the xcnt or xsubcnt was wrong or the serialized snapshot was
truncated, and we read past the end of the DSM, who knows...

--
Thomas Munro
http://www.enterprisedb.com

Вложения

Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

От
Amit Kapila
Дата:
On Fri, Jul 1, 2016 at 8:48 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> On Fri, Jul 1, 2016 at 2:17 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Fri, Jul 1, 2016 at 6:26 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> >> #1  0x0000000000822032 in RestoreSnapshot (start_address=start_address@entry=0x7f2701d5a110 <error: Cannot access memory at address 0x7f2701d5a110>) at snapmgr.c:2020
> >
> >         memcpy(snapshot->subxip, serialized_xids + serialized_snapshot->xcnt,
> >                serialized_snapshot->subxcnt * sizeof(TransactionId));
> > So this is choking here? Is one of those pointers NULL?
>
> Theory 1:
> If serialized_snapshot->xcnt == 0, then snapshot->xip never gets
> initialized to a non-NULL value.  Then if serialized_snapshot->subxcnt
> > 0, we set snapshot->subxip = snapshot->xip +
> serialized_snapshot->xcnt (so that's NULL too).  Then in line the line
> you show we call memcpy(snapshot->subxip, ...).  The fix might be
> something like the attached.
>

I was just typing the mail, when I see this mail.  I also reached to the conclusion that this is the reason of crash.  You can see how CopySnapshot calculates the subxipoff, may be writing code that way will be more consistent.  In case of recovery, I think serialized_snapshot->xcnt will always be zero as we fill everything in subxip array (refer below code in GetSnapshotData).

GetSnapshotData()
{
/*
* We're in hot standby, so get XIDs from KnownAssignedXids.
..
..
}


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

От
Thomas Munro
Дата:
On Fri, Jul 1, 2016 at 3:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Jul 1, 2016 at 8:48 AM, Thomas Munro <thomas.munro@enterprisedb.com>
> wrote:
>> If serialized_snapshot->xcnt == 0, then snapshot->xip never gets
>> initialized to a non-NULL value.  Then if serialized_snapshot->subxcnt
>> > 0, we set snapshot->subxip = snapshot->xip +
>> serialized_snapshot->xcnt (so that's NULL too).  Then in line the line
>> you show we call memcpy(snapshot->subxip, ...).  The fix might be
>> something like the attached.
>
> I was just typing the mail, when I see this mail.  I also reached to the
> conclusion that this is the reason of crash.  You can see how CopySnapshot
> calculates the subxipoff, may be writing code that way will be more
> consistent.

Or maybe just like this?

-               snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt;
+               snapshot->subxip = ((TransactionId *) (snapshot + 1)) +
+                       serialized_snapshot->xcnt;

--
Thomas Munro
http://www.enterprisedb.com

Вложения

Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

От
Amit Kapila
Дата:
On Fri, Jul 1, 2016 at 9:38 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Jul 1, 2016 at 3:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, Jul 1, 2016 at 8:48 AM, Thomas Munro <thomas.munro@enterprisedb.com>
>> wrote:
>>> If serialized_snapshot->xcnt == 0, then snapshot->xip never gets
>>> initialized to a non-NULL value.  Then if serialized_snapshot->subxcnt
>>> > 0, we set snapshot->subxip = snapshot->xip +
>>> serialized_snapshot->xcnt (so that's NULL too).  Then in line the line
>>> you show we call memcpy(snapshot->subxip, ...).  The fix might be
>>> something like the attached.
>>
>> I was just typing the mail, when I see this mail.  I also reached to the
>> conclusion that this is the reason of crash.  You can see how CopySnapshot
>> calculates the subxipoff, may be writing code that way will be more
>> consistent.
>
> Or maybe just like this?
>
> -               snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt;
> +               snapshot->subxip = ((TransactionId *) (snapshot + 1)) +
> +                       serialized_snapshot->xcnt;
>

This way it looks better to me.  Thanks for the patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

От
Andreas Seltenreich
Дата:
Amit Kapila writes:
> On Fri, Jul 1, 2016 at 9:38 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> Or maybe just like this?
>>
>> -               snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt;
>> +               snapshot->subxip = ((TransactionId *) (snapshot + 1)) +
>> +                       serialized_snapshot->xcnt;
>>
>
> This way it looks better to me.  Thanks for the patch.

I no longer see these crashes when testing with the patch applied.

thanks,
Andreas



Re: [sqlsmith] crashes in RestoreSnapshot on hot standby

От
Robert Haas
Дата:
On Fri, Jul 1, 2016 at 1:41 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> Amit Kapila writes:
>> On Fri, Jul 1, 2016 at 9:38 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>>> Or maybe just like this?
>>>
>>> -               snapshot->subxip = snapshot->xip + serialized_snapshot->xcnt;
>>> +               snapshot->subxip = ((TransactionId *) (snapshot + 1)) +
>>> +                       serialized_snapshot->xcnt;
>>>
>>
>> This way it looks better to me.  Thanks for the patch.
>
> I no longer see these crashes when testing with the patch applied.

Committed and back-patched to 9.5 where this code was added.  Thanks
to all for the report, patch, review, and testing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company