Обсуждение: BUG #15378: SP-GIST memory context screwup?

Поиск
Список
Период
Сортировка

BUG #15378: SP-GIST memory context screwup?

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      15378
Logged by:          Andrew Gierth
Email address:      andrew@tao11.riddles.org.uk
PostgreSQL version: 11beta3
Operating system:   Debian
Description:

I found this while analyzing a report from IRC that initially looked like a
PostGIS bug, but which I now think is breakage in spgist:

spgrescan starts out by doing
    MemoryContextReset(so->traversalCxt);

then later it calls resetSpGistScanOpaque(so);
which calls freeScanStack(so)
which calls freeScanStackEntry(so)
which does:

    if (stackEntry->traversalValue)
        pfree(stackEntry->traversalValue);

But stackEntry->traversalValue, if not NULL, is supposed to have been
allocated in so->traversalCxt, and so it's already gone.

The specific case we were looking at involved a query using a LATERAL
subselect with a LIMIT 1, with conditions that would have used the spgist
index (on a PostGIS geometry column). The PostGIS code seems to be correctly
allocating the traversalValues in the correct context, but we were getting
this crash (on pg11 beta3):

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000561e38427e1c in freeScanStackEntry (stackEntry=0x561e396ef718,
so=<optimized out>) at ./build/../src/backend/access/spgist/spgscan.c:47
#2  0x0000561e38428a6e in freeScanStack (so=0x561e396ed508) at
./build/../src/backend/access/spgist/spgscan.c:60
#3  resetSpGistScanOpaque (so=0x561e396ed508) at
./build/../src/backend/access/spgist/spgscan.c:75
#4  spgrescan (scan=<optimized out>, scankey=0x561e3969eab8,
nscankeys=<optimized out>, orderbys=<optimized out>, norderbys=<optimized
out>)
    at ./build/../src/backend/access/spgist/spgscan.c:229

which is consistent with a bad pfree(), I think.

The original reporter's data set is proprietary, but this is the query
that died:

SELECT b.idhu, layer, max(dba)
  FROM (SELECT b."IDHU" AS idhu, first(geom) AS geom
          FROM do_buildings b JOIN do_data d ON b."IDHU"::text =
d."IDHU"::text
         GROUP BY 1 LIMIT 100) b
       CROSS JOIN unnest('{str_isof_den}'::text[]) AS lyr
       CROSS JOIN LATERAL (
          SELECT idhu, layer, dba
            FROM do_laerm l
           WHERE layer =lyr AND (b.geom && l.geom AND ST_Intersects(l.geom,
b.geom))
           LIMIT 1
       ) AS target
GROUP by 1, 2;

Note that do_laerm(geom) has an SP-GIST index, and the LATERAL placement
means that this is going to be rescanned at a fairly arbitrary point in the
scan (due to the LIMIT). Unfortunately I don't think this can be
demonstrated with the built-in spgist opclasses, which don't allocate
traversalValues.

I /think/ the offending pfree could just be removed...


Re: BUG #15378: SP-GIST memory context screwup?

От
Andrew Gierth
Дата:
[CCing Teodor as committer of ccd6eb49a]

>>>>> "PG" == PG Bug reporting form <noreply@postgresql.org> writes:

 PG> I found this while analyzing a report from IRC that initially
 PG> looked like a PostGIS bug, but which I now think is breakage in
 PG> spgist:

 PG> spgrescan starts out by doing
 PG>     MemoryContextReset(so->traversalCxt);

 PG> then later it calls resetSpGistScanOpaque(so);
 PG> which calls freeScanStack(so)
 PG> which calls freeScanStackEntry(so)
 PG> which does:

 PG>     if (stackEntry->traversalValue)
 PG>         pfree(stackEntry->traversalValue);

 PG> But stackEntry->traversalValue, if not NULL, is supposed to have
 PG> been allocated in so->traversalCxt, and so it's already gone.
 [...]
 PG> Unfortunately I don't think this can be demonstrated with the
 PG> built-in spgist opclasses, which don't allocate traversalValues.

Turns out I was looking in the wrong place, and in fact we can reproduce
this easily (at least on an assert build):

create table boxes (b box);
insert into boxes
  select box(point(i,j),point(i+s,j+s))
    from generate_series(1,100,5) i,
         generate_series(1,100,5) j,
         generate_series(1,10) s;
create index on boxes using spgist (b);
select *
  from (values (box(point(5,5),point(8,8))),(box(point(2,2),point(12,12)))) v(b)
       cross join lateral (select * from boxes where boxes.b && v.b limit 1) pt;

So this logic was added in ccd6eb49a and was wrong from the start.
Testing suggests that removing the offending pfree does indeed fix the
issue; any objections?

-- 
Andrew (irc:RhodiumToad)


Re: BUG #15378: SP-GIST memory context screwup?

От
Alexander Korotkov
Дата:
On Tue, Sep 11, 2018 at 6:13 AM Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
[CCing Teodor as committer of ccd6eb49a]

>>>>> "PG" == PG Bug reporting form <noreply@postgresql.org> writes:

 PG> I found this while analyzing a report from IRC that initially
 PG> looked like a PostGIS bug, but which I now think is breakage in
 PG> spgist:

 PG> spgrescan starts out by doing
 PG>     MemoryContextReset(so->traversalCxt);

 PG> then later it calls resetSpGistScanOpaque(so);
 PG> which calls freeScanStack(so)
 PG> which calls freeScanStackEntry(so)
 PG> which does:

 PG>     if (stackEntry->traversalValue)
 PG>         pfree(stackEntry->traversalValue);

 PG> But stackEntry->traversalValue, if not NULL, is supposed to have
 PG> been allocated in so->traversalCxt, and so it's already gone.
 [...]
 PG> Unfortunately I don't think this can be demonstrated with the
 PG> built-in spgist opclasses, which don't allocate traversalValues.

Turns out I was looking in the wrong place, and in fact we can reproduce
this easily (at least on an assert build):

create table boxes (b box);
insert into boxes
  select box(point(i,j),point(i+s,j+s))
    from generate_series(1,100,5) i,
         generate_series(1,100,5) j,
         generate_series(1,10) s;
create index on boxes using spgist (b);
select *
  from (values (box(point(5,5),point(8,8))),(box(point(2,2),point(12,12)))) v(b)
       cross join lateral (select * from boxes where boxes.b && v.b limit 1) pt;

So this logic was added in ccd6eb49a and was wrong from the start.
Testing suggests that removing the offending pfree does indeed fix the
issue; any objections?

No objections from me.  What about moving MemoryContextReset(so->traversalCxt) into resetSpGistScanOpaque()?  For me it seems that resetting of traversal memory context is part of opaque reset.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: BUG #15378: SP-GIST memory context screwup?

От
Andrew Gierth
Дата:
>>>>> "Alexander" == Alexander Korotkov <a.korotkov@postgrespro.ru> writes:

 >> So this logic was added in ccd6eb49a and was wrong from the start.
 >> Testing suggests that removing the offending pfree does indeed fix
 >> the issue; any objections?

 Alexander> No objections from me. What about moving
 Alexander> MemoryContextReset(so->traversalCxt) into
 Alexander> resetSpGistScanOpaque()? For me it seems that resetting of
 Alexander> traversal memory context is part of opaque reset.

Looking at all those retail pfrees makes me itch... is there some reason
why reconstructedValue couldn't go in the traversal context too, and
then the list nodes could go there as well, and the whole thing freed
with a context reset?

-- 
Andrew (irc:RhodiumToad)


Re: BUG #15378: SP-GIST memory context screwup?

От
Andrew Gierth
Дата:
>>>>> "Alexander" == Alexander Korotkov <a.korotkov@postgrespro.ru> writes:

 >> So this logic was added in ccd6eb49a and was wrong from the start.
 >> Testing suggests that removing the offending pfree does indeed fix
 >> the issue; any objections?

 Alexander> No objections from me.

But it turns out that removing the pfree will cause transient leakage
within the scan, since ScanStackEntry objects are also freed retail
during a walk. ugh.

So the simplest fix would be to move the memory context reset to just
after the freeScanStack in resetSpGistScanOpaque. And with retail
freeing going on during the scan, it makes less sense to try and avoid
it during rescan, double ugh.

-- 
Andrew (irc:RhodiumToad)


Re: BUG #15378: SP-GIST memory context screwup?

От
Andrew Gierth
Дата:
>>>>> "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

 Andrew> So the simplest fix would be to move the memory context reset
 Andrew> to just after the freeScanStack in resetSpGistScanOpaque. And
 Andrew> with retail freeing going on during the scan, it makes less
 Andrew> sense to try and avoid it during rescan, double ugh.

Proposed patch attached.

-- 
Andrew (irc:RhodiumToad)

diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index c728080812..5260d5017d 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -74,6 +74,13 @@ resetSpGistScanOpaque(SpGistScanOpaque so)
 
     freeScanStack(so);
 
+    /*
+     * clear traversal context before proceeding to the next scan; this must
+     * not happen before the freeScanStack above, else we get double-free
+     * crashes.
+     */
+    MemoryContextReset(so->traversalCxt);
+
     if (so->searchNulls)
     {
         /* Stack a work item to scan the null index entries */
@@ -212,9 +219,6 @@ spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 {
     SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
 
-    /* clear traversal context before proceeding to the next scan */
-    MemoryContextReset(so->traversalCxt);
-
     /* copy scankeys into local storage */
     if (scankey && scan->numberOfKeys > 0)
     {
diff --git a/src/test/regress/expected/spgist.out b/src/test/regress/expected/spgist.out
index 2d75bbf8dc..9364b88bc2 100644
--- a/src/test/regress/expected/spgist.out
+++ b/src/test/regress/expected/spgist.out
@@ -23,6 +23,24 @@ delete from spgist_point_tbl where id % 2 = 1;
 -- would exercise it)
 delete from spgist_point_tbl where id < 10000;
 vacuum spgist_point_tbl;
+-- Test rescan paths (cf. bug #15378)
+-- use box and && rather than point, so that rescan happens when the
+-- traverse stack is non-empty
+create table spgist_box_tbl(id serial, b box);
+insert into spgist_box_tbl(b)
+select box(point(i,j),point(i+s,j+s))
+  from generate_series(1,100,5) i,
+       generate_series(1,100,5) j,
+       generate_series(1,10) s;
+create index spgist_box_idx on spgist_box_tbl using spgist (b);
+select count(*)
+  from (values (point(5,5)),(point(8,8)),(point(12,12))) v(p)
+ where exists(select * from spgist_box_tbl b where b.b && box(v.p,v.p));
+ count 
+-------
+     3
+(1 row)
+
 -- The point opclass's choose method only uses the spgMatchNode action,
 -- so the other actions are not tested by the above. Create an index using
 -- text opclass, which uses the others actions.
diff --git a/src/test/regress/sql/spgist.sql b/src/test/regress/sql/spgist.sql
index 77b43a2d3e..c72cf42a33 100644
--- a/src/test/regress/sql/spgist.sql
+++ b/src/test/regress/sql/spgist.sql
@@ -30,6 +30,21 @@ delete from spgist_point_tbl where id < 10000;
 
 vacuum spgist_point_tbl;
 
+-- Test rescan paths (cf. bug #15378)
+-- use box and && rather than point, so that rescan happens when the
+-- traverse stack is non-empty
+
+create table spgist_box_tbl(id serial, b box);
+insert into spgist_box_tbl(b)
+select box(point(i,j),point(i+s,j+s))
+  from generate_series(1,100,5) i,
+       generate_series(1,100,5) j,
+       generate_series(1,10) s;
+create index spgist_box_idx on spgist_box_tbl using spgist (b);
+
+select count(*)
+  from (values (point(5,5)),(point(8,8)),(point(12,12))) v(p)
+ where exists(select * from spgist_box_tbl b where b.b && box(v.p,v.p));
 
 -- The point opclass's choose method only uses the spgMatchNode action,
 -- so the other actions are not tested by the above. Create an index using

Re: BUG #15378: SP-GIST memory context screwup?

От
Andrew Gierth
Дата:
>>>>> "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

 Andrew> Proposed patch attached.

Pushed this.

-- 
Andrew (irc:RhodiumToad)


Re: BUG #15378: SP-GIST memory context screwup?

От
Alexander Korotkov
Дата:
On Tue, Sep 11, 2018 at 6:47 PM Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
But it turns out that removing the pfree will cause transient leakage
within the scan, since ScanStackEntry objects are also freed retail
during a walk. ugh.

Right, I've missed code path freeScanStackEntry() call from spgWalk().

On Tue, Sep 11, 2018 at 9:59 PM Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
 Andrew> Proposed patch attached.

Pushed this.

Pushed version looks correct to me.  Thank you!

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company