Thread: BUG #14134: segmentation fault with large table with gist index

BUG #14134: segmentation fault with large table with gist index

From
yjh0502@gmail.com
Date:
The following bug has been logged on the website:

Bug reference:      14134
Logged by:          Jihyun Yu
Email address:      yjh0502@gmail.com
PostgreSQL version: 9.5.2
Operating system:   FreeBSD 10.3
Description:

Here's a stacktrace from a coredump:

#0  0x0000000000a020b9 in MemoryContextAlloc (context=0x0, size=984) at
mcxt.c:680
680             context->isReset = false;
[New Thread 802006400 (LWP 100856/<unknown>)]
(gdb) bt
#0  0x0000000000a020b9 in MemoryContextAlloc (context=0x0, size=984) at
mcxt.c:680
#1  0x0000000000a08198 in PrepareSortSupportComparisonShim (cmpFunc=1315,
ssup=0x8022cd108) at sortsupport.c:73
#2  0x0000000000a08484 in FinishSortSupportFunction (opfamily=1982,
opcintype=1186, ssup=0x8022cd108) at sortsupport.c:123
#3  0x0000000000a0837e in PrepareSortSupportFromOrderingOp (orderingOp=1332,
ssup=0x8022cd108) at sortsupport.c:150
#4  0x00000000006bc69e in ExecInitIndexScan (node=0x8022bcc00,
estate=0x8022cb030, eflags=16) at nodeIndexscan.c:970
#5  0x000000000069cec6 in ExecInitNode (node=0x8022bcc00,
estate=0x8022cb030, eflags=16) at execProcnode.c:200
#6  0x00000000006bf6c2 in ExecInitLimit (node=0x8022bf830,
estate=0x8022cb030, eflags=16) at nodeLimit.c:415
#7  0x000000000069d166 in ExecInitNode (node=0x8022bf830,
estate=0x8022cb030, eflags=16) at execProcnode.c:326
#8  0x000000000069898a in InitPlan (queryDesc=0x8022bfc90, eflags=16) at
execMain.c:957
#9  0x00000000006982bc in standard_ExecutorStart (queryDesc=0x8022bfc90,
eflags=16) at execMain.c:237
#10 0x0000000802403072 in _PG_init () from
/usr/local/lib/postgresql/pg_stat_statements.so
#11 0x00000000006980a2 in ExecutorStart (queryDesc=0x8022bfc90, eflags=0) at
execMain.c:137
#12 0x000000000061ad75 in ExplainOnePlan (plannedstmt=0x8022bfc00, into=0x0,
es=0x8021ef9e8,
    queryString=0x80208f030 "explain analyze select order_at from d_article
where deleted=false and root_id = parent_id order by order_at <->
'2016-03-01' limit 10;",
    params=0x0, planduration=0x7fffffffdc60) at explain.c:489
#13 0x000000000061a6e5 in ExplainOneQuery (query=0x8021efaa8, into=0x0,
es=0x8021ef9e8,
    queryString=0x80208f030 "explain analyze select order_at from d_article
where deleted=false and root_id = parent_id order by order_at <->
'2016-03-01' limit 10;", params=0x0)
    at explain.c:357
#14 0x000000000061a365 in ExplainQuery (stmt=0x802090720,
    queryString=0x80208f030 "explain analyze select order_at from d_article
where deleted=false and root_id = parent_id order by order_at <->
'2016-03-01' limit 10;",
    params=0x0, dest=0x8021ef958) at explain.c:245
#15 0x0000000000844379 in standard_ProcessUtility (parsetree=0x802090720,
    queryString=0x80208f030 "explain analyze select order_at from d_article
where deleted=false and root_id = parent_id order by order_at <->
'2016-03-01' limit 10;",
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x8021ef958,
completionTag=0x7fffffffe240 "") at utility.c:658
#16 0x0000000802403492 in _PG_init () from
/usr/local/lib/postgresql/pg_stat_statements.so
#17 0x0000000000843b22 in ProcessUtility (parsetree=0x802090720,
    queryString=0x80208f030 "explain analyze select order_at from d_article
where deleted=false and root_id = parent_id order by order_at <->
'2016-03-01' limit 10;",
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x8021ef958,
completionTag=0x7fffffffe240 "") at utility.c:330
#18 0x0000000000843688 in PortalRunUtility (portal=0x802207030,
utilityStmt=0x802090720, isTopLevel=1 '\001', dest=0x8021ef958,
completionTag=0x7fffffffe240 "") at pquery.c:1183
#19 0x0000000000842217 in FillPortalStore (portal=0x802207030, isTopLevel=1
'\001') at pquery.c:1057
#20 0x0000000000841e6a in PortalRun (portal=0x802207030,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x802090f20,
altdest=0x802090f20, completionTag=0x7fffffffe460 "")
    at pquery.c:781
#21 0x000000000083d9a3 in exec_simple_query (
    query_string=0x80208f030 "explain analyze select order_at from d_article
where deleted=false and root_id = parent_id order by order_at <->
'2016-03-01' limit 10;")
    at postgres.c:1104
#22 0x000000000083ccdb in PostgresMain (argc=1, argv=0x80201de10,
dbname=0x80201dcc8 "board", username=0x80201dcb0 "root") at postgres.c:4030
#23 0x00000000007b1b7c in BackendRun (port=0x8020e51c0) at
postmaster.c:4239
#24 0x00000000007b11a4 in BackendStartup (port=0x8020e51c0) at
postmaster.c:3913
#25 0x00000000007adc87 in ServerLoop () at postmaster.c:1684
#26 0x00000000007ab5ff in PostmasterMain (argc=3, argv=0x7fffffffebd8) at
postmaster.c:1292
#27 0x00000000006f3f98 in main (argc=3, argv=0x7fffffffebd8) at main.c:228


The postgresql server crashed with segfault. Here's an index which causes
the crash:

create index concurrently d_article_gist_idx on d_article using gist
(order_at) where deleted=false and parent_id = root_id;

The 'd_article' table contains ~4M rows, and the server runs on a machine
with 2GB of RAM.

Re: BUG #14134: segmentation fault with large table with gist index

From
Euler Taveira
Date:
On 11-05-2016 12:49, yjh0502@gmail.com wrote:
> The postgresql server crashed with segfault. Here's an index which causes
> the crash:
>
It seems someone (5ea86e6e?) forgot to assign CurrentMemoryContext to
ssup_cxt. I am not sure if it should be done at
PrepareSortSupportFromOrderingOp or elsewhere.


--
   Euler Taveira                   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

Re: BUG #14134: segmentation fault with large table with gist index

From
Peter Geoghegan
Date:
On Wed, May 11, 2016 at 9:47 AM, Euler Taveira <euler@timbira.com.br> wrote:
> On 11-05-2016 12:49, yjh0502@gmail.com wrote:
>> The postgresql server crashed with segfault. Here's an index which causes
>> the crash:
>>
> It seems someone (5ea86e6e?) forgot to assign CurrentMemoryContext to
> ssup_cxt. I am not sure if it should be done at
> PrepareSortSupportFromOrderingOp or elsewhere.

That commit did not change anything about memory contexts, and did not
add new functionality to the SortSupport path taken here.

The bug is in commit 35fcb1b3, which failed to initialize ssup_cxt.
I'm surprised that it took this long for there to be trouble, because
that commit doesn't initialize anything at all in the sortsupport
object.

--
Peter Geoghegan

Re: BUG #14134: segmentation fault with large table with gist index

From
Peter Geoghegan
Date:
On Wed, May 11, 2016 at 12:37 PM, Peter Geoghegan <pg@heroku.com> wrote:
> The bug is in commit 35fcb1b3, which failed to initialize ssup_cxt.
> I'm surprised that it took this long for there to be trouble, because
> that commit doesn't initialize anything at all in the sortsupport
> object.

It makes some sense that it took this long, actually.

Consider that most types have SortSupport, and so will use their own
memory context if memory allocation is needed for nodeIndexScan.c's
new-to-9.5 sort-like merging involving SortSupport. The field
ssup_collation was also not correctly initialized, for example, but
the SortSupport was allocated with palloc0(), so any problems that
that causes for collatable types will be relatively subtle. According
to convention, we won't attempt allocation with the ssup_cxt "parent"
memory context (which is NULL here), or we won't allocate any memory
at all in the case of simple pass-by-value types like int4.
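
The effect of a zero-filled state can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `DemoContext`, `DemoSortSupport`, `demo_context_alloc` and `demo_zeroed_state_can_allocate` are hypothetical stand-ins, not PostgreSQL's real definitions, and `calloc()` plays the role of palloc0(). Unlike the server, the model reports a failure instead of dereferencing the NULL context.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a memory context (not PostgreSQL's struct). */
typedef struct DemoContext
{
    bool   is_reset;
    size_t bytes_requested;
} DemoContext;

/* Hypothetical, trimmed-down model of a sort-support state. */
typedef struct DemoSortSupport
{
    DemoContext *ssup_cxt;          /* context for comparator allocations */
    unsigned int ssup_collation;
    bool         ssup_nulls_first;
} DemoSortSupport;

/*
 * Models an allocator that updates the context before allocating.
 * The server writes through the pointer unconditionally (frame #0 of the
 * backtrace); this sketch returns NULL instead so it can run safely.
 */
void *
demo_context_alloc(DemoContext *context, size_t size)
{
    if (context == NULL)
        return NULL;                /* the real code would segfault here */
    context->is_reset = false;
    context->bytes_requested += size;
    return malloc(size);
}

/* Returns true if a zero-filled state can serve an allocation request. */
bool
demo_zeroed_state_can_allocate(void)
{
    /* calloc() stands in for palloc0(): every field starts as zero/NULL. */
    DemoSortSupport *ssup = calloc(1, sizeof(DemoSortSupport));
    void       *chunk;
    bool        ok;

    if (ssup == NULL)
        return false;
    /* Nobody assigned ssup_cxt, so the allocator receives NULL. */
    chunk = demo_context_alloc(ssup->ssup_cxt, 984);
    ok = (chunk != NULL);
    free(chunk);
    free(ssup);
    return ok;
}
```

In this model, `demo_zeroed_state_can_allocate()` returns false, while the same allocator succeeds once it is handed a valid context.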

I'll look into a patch to fix this.

--
Peter Geoghegan

Re: BUG #14134: segmentation fault with large table with gist index

From
Peter Geoghegan
Date:
On Wed, May 11, 2016 at 12:56 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Consider that most types have SortSupport, and so will use their own
> memory context if memory allocation is needed for nodeIndexScan.c's
> new-to-9.5 sort-like merging involving SortSupport. The field
> ssup_collation was also not correctly initialized, for example, but
> the SortSupport was allocated with palloc0(), so any problems that
> that causes for collatable types will be relatively subtle. According
> to convention, we won't attempt allocation with the ssup_cxt "parent"
> memory context (which is NULL here), or we won't allocate any memory
> at all in the case of simple pass-by-value types like int4.

On second thought, any memory allocation would almost certainly lead
to a segfault in practice. That just leaves pass-by-value types as
space for the bug to hide.

--
Peter Geoghegan

Re: BUG #14134: segmentation fault with large table with gist index

From
Peter Geoghegan
Date:
On Wed, May 11, 2016 at 12:37 PM, Peter Geoghegan <pg@heroku.com> wrote:
> The bug is in commit 35fcb1b3, which failed to initialize ssup_cxt.
> I'm surprised that it took this long for there to be trouble, because
> that commit doesn't initialize anything at all in the sortsupport
> object.

Here are simple steps to reproduce the bug:

postgres=# create table bug as select (now() - (current_date + i))
intv from generate_series(0,10000) i;
SELECT 10001
postgres=# set enable_indexonlyscan = off;
SET
postgres=# set enable_sort = off;
SET
postgres=# create extension btree_gist;
CREATE EXTENSION
postgres=# create index sortsupport_bug on bug using gist (intv);
CREATE INDEX
postgres=# SELECT * FROM bug ORDER BY intv <-> '1 days' LIMIT 10;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

My previous analysis on why this occurred so infrequently as to only
see a problem report months after a stable release was wrong. This bug
only happens in narrow situations where a distance function exists
that is indexable by GiST, while that also lacks SortSupport. GiST
isn't doing anything with any other SortSupport attribute that lacks a
distance operator.

The lack of SortSupport will make SortSupport use a shim comparator,
which tries to use caller's memory context, which was found to be NULL
(since we palloc0()). So, this bug is fairly narrow in practice,
because you had to be using the distance operator for interval, which
looks like the only example of where this is possible.
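
The path selection described above can be modeled in a few lines of standalone C. This is a rough sketch, not the sortsupport.c code: `SketchContext`, `SketchSortState`, `SketchResult` and `sketch_finish_sort_support` are hypothetical names, the native path is simplified to "no allocation from the caller's context", and the 984-byte request is simply the size seen in the backtrace.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the caller's memory context. */
typedef struct SketchContext
{
    size_t bytes_requested;
} SketchContext;

/* Hypothetical model of the state handed to the sort-support setup. */
typedef struct SketchSortState
{
    SketchContext *ssup_cxt;    /* caller's context; NULL if never set */
    bool           uses_shim;   /* true when the comparison shim is used */
} SketchSortState;

typedef enum SketchResult
{
    SKETCH_PREPARED,            /* setup succeeded */
    SKETCH_NULL_CONTEXT         /* the server would have crashed here */
} SketchResult;

/*
 * Chooses between a native sort-support routine and the comparison shim.
 * Only the shim path needs memory from the caller's context in this model.
 */
SketchResult
sketch_finish_sort_support(bool type_has_sortsupport, SketchSortState *state)
{
    if (type_has_sortsupport)
    {
        /* Native path: simplified here to allocate nothing. */
        state->uses_shim = false;
        return SKETCH_PREPARED;
    }

    /* Shim path: needs extra space in the caller's context. */
    state->uses_shim = true;
    if (state->ssup_cxt == NULL)
        return SKETCH_NULL_CONTEXT;
    state->ssup_cxt->bytes_requested += 984;
    return SKETCH_PREPARED;
}
```

With a NULL `ssup_cxt`, only the call with `type_has_sortsupport == false` reports `SKETCH_NULL_CONTEXT`, which mirrors why the crash needed a type such as interval that reaches the shim.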

Attached patch fixes the bug by initializing the SortSupport states used.

--
Peter Geoghegan

Attachments

Re: BUG #14134: segmentation fault with large table with gist index

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> My previous analysis on why this occurred so infrequently as to only
> see a problem report months after a stable release was wrong. This bug
> only happens in narrow situations where a distance function exists
> that is indexable by GiST, while that also lacks SortSupport. GiST
> isn't doing anything with any other SortSupport attribute that lacks a
> distance operator.

> The lack of SortSupport will make SortSupport use a shim comparator,
> which tries to use caller's memory context, which was found to be NULL
> (since we palloc0()). So, this bug is fairly narrow in practice,
> because you had to be using the distance operator for interval, which
> looks like the only example of where this is possible.

> Attached patch fixes the bug by initializing the SortSupport states used.

Pushed.  I added an explicit initialization of orderbysort->abbreviate,
because all the other callers of PrepareSortSupportFromOrderingOp make
a point of setting that.  Also a regression test.
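
The general shape of such a fix, setting every field explicitly before the state is prepared, might look like the standalone sketch below. It is not the committed patch: `FixContext`, `FixSortState` and `fix_init_sort_state` are hypothetical names, and the values chosen for `ssup_nulls_first` and `ssup_attno` are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory context. */
typedef struct FixContext
{
    const char *name;
} FixContext;

/* Hypothetical model of the per-ORDER BY sort-support state. */
typedef struct FixSortState
{
    FixContext  *ssup_cxt;
    unsigned int ssup_collation;
    bool         ssup_nulls_first;
    short        ssup_attno;
    bool         abbreviate;
} FixSortState;

/*
 * Sets every field explicitly instead of relying on zero-filled memory,
 * so later setup code never sees a NULL context or a stale flag.
 */
void
fix_init_sort_state(FixSortState *state, FixContext *current_cxt,
                    unsigned int collation)
{
    state->ssup_cxt = current_cxt;      /* the field that was left NULL */
    state->ssup_collation = collation;
    state->ssup_nulls_first = false;    /* assumption for this sketch */
    state->ssup_attno = 0;              /* assumption for this sketch */
    state->abbreviate = false;          /* explicit, as other callers do */
}
```

The initializer would run once per ordering operator, immediately before the call that prepares the comparator.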

            regards, tom lane

Re: BUG #14134: segmentation fault with large table with gist index

From
Peter Geoghegan
Date:
On Sun, Jun 5, 2016 at 8:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Pushed.  I added an explicit initialization of orderbysort->abbreviate,
> because all the other callers of PrepareSortSupportFromOrderingOp make
> a point of setting that.  Also a regression test.

Thanks.

It hardly matters, but this bug did not occur because interval is
pass-by-reference (I withdrew my previous remarks on typbyval-ness
being a factor here). As it happens, the built-in pass-by-value types
have almost all had SortSupport for as long as there has been a
sortsupport.h, but that's not automatically true. The shim code path
would still be taken and would still dereference a NULL memory context
pointer if there was an affected type. (A pass-by-value type with a
distance operator usable by GiST, but no SortSupport, used in the same
way as the test case shows.)


--
Peter Geoghegan