Re: Fix for seg picksplit function

Поиск
Список
Период
Сортировка
От Yeb Havinga
Тема Re: Fix for seg picksplit function
Дата
Msg-id 4CD42860.4070706@gmail.com
обсуждение исходный текст
Ответ на Fix for seg picksplit function  (Alexander Korotkov <aekorotkov@gmail.com>)
Ответы Re: Fix for seg picksplit function  (Alexander Korotkov <aekorotkov@gmail.com>)
Re: Fix for seg picksplit function  (Alexander Korotkov <aekorotkov@gmail.com>)
Список pgsql-hackers
Hello Alexander,

Here follows a review of your patch.
> Hackers,
>
> Seg contrib module contains the same bug in  picksplit function as
> cube  contrib module.
Good catch! :-)
> Also, Guttman's split algorithm is not needed in unidimensional case,
> because sorting based algorithm is good in this case.
I had some doubts whether this is true in the general case, instead of
the given example. I increased the interval width in your example to
0.25*b instead of 0.00005*b, with the purpose to increase overlaps
between intervals. Though the performance gain was less, it was still
faster than Guttmans algorithm. To make things worse I also tested with
an interval with of 1*b, resulting in a lot of overlaps and compared
several overlap queries. The sorting algorithm was 25% to 40% faster on
searches. Index creation time with the sorting algorithm is also a
fraction of the original creation time.

Since this testing could be part of a review, I looked at the code as
well and listed myself as reviewer on the commitfest.

Comparing with gbt_num_picksplit reveals some differences with sort
array intialization and size, the former's sort array starts at index 1
(FirstOffsetNumber), your implementation starts at 0 for sorting and
hence the size of the sorting array can be one element less. I prefer
your way of sort array initialization; gbt_num_pickplits's use of
FirstOffsetNumber of the qsort array seems to mix a define from the
gist/btree namespace for no reason and might even lead to confusion.

The remaining part of the new picksplit function puts the segs into left
or right, I think the code is easier to understand if there was only one
for loop from i=1 to 1 < maxoff, for the current code I had to verify
that all sort array entries were really used with the two seperate loops
that also skipped the first value. I edited the code a bit, and also
used seg_union to initialize/palloc the datum values. Finally, waste and
firsttime variables were initialized but not used anymore, so removed.

Attached is a revised patch.

regards,
Yeb Havinga

PS: when comparing with gbt_num_picksplit, I noticed that that one does
not update v->spl_ldatum and spl_rdatum to the union datums, but
initializes these to 0 at the beginning and never seems to update them.
Not sure if this is a problem since the num_picksplit stuff seems to
work well.

diff --git a/contrib/seg/seg.c b/contrib/seg/seg.c
index 930a35b..93895ef 100644
--- a/contrib/seg/seg.c
+++ b/contrib/seg/seg.c
@@ -292,38 +292,42 @@ gseg_penalty(GISTENTRY *origentry, GISTENTRY *newentry, float *result)
     return (result);
 }

+/*
+ * Auxiliary structure for picksplit method.
+ */
+typedef struct
+{
+    int index;
+    SEG *data;
+} PickSplitSortItem;

+/*
+ * Compare function for PickSplitSortItem based on seg_cmp.
+ */
+static int
+sort_item_cmp(const void *a, const void *b)
+{
+    PickSplitSortItem *i1 = (PickSplitSortItem *)a;
+    PickSplitSortItem *i2 = (PickSplitSortItem *)b;
+    return seg_cmp(i1->data, i2->data);
+}

 /*
 ** The GiST PickSplit method for segments
-** We use Guttman's poly time split algorithm
+** Algorithm based on sorting. Incoming array of segs is sorting using seg_cmp
+** function. After that first half of segs goes to the left datum, and the
+** second half of segs goes to the right datum.
 */
 GIST_SPLITVEC *
 gseg_picksplit(GistEntryVector *entryvec,
                GIST_SPLITVEC *v)
 {
-    OffsetNumber i,
-                j;
-    SEG           *datum_alpha,
-               *datum_beta;
+    OffsetNumber i;
     SEG           *datum_l,
                *datum_r;
-    SEG           *union_d,
-               *union_dl,
-               *union_dr;
-    SEG           *inter_d;
-    bool        firsttime;
-    float        size_alpha,
-                size_beta,
-                size_union,
-                size_inter;
-    float        size_waste,
-                waste;
-    float        size_l,
-                size_r;
+    PickSplitSortItem    *sortItems;
     int            nbytes;
-    OffsetNumber seed_1 = 1,
-                seed_2 = 2;
+    OffsetNumber seed_2;
     OffsetNumber *left,
                *right;
     OffsetNumber maxoff;
@@ -332,111 +336,52 @@ gseg_picksplit(GistEntryVector *entryvec,
     fprintf(stderr, "picksplit\n");
 #endif

-    maxoff = entryvec->n - 2;
-    nbytes = (maxoff + 2) * sizeof(OffsetNumber);
+    maxoff = entryvec->n - 1;
+    nbytes = (maxoff + 1) * sizeof(OffsetNumber);
+    sortItems = (PickSplitSortItem *)palloc(maxoff * sizeof(PickSplitSortItem));
     v->spl_left = (OffsetNumber *) palloc(nbytes);
     v->spl_right = (OffsetNumber *) palloc(nbytes);

-    firsttime = true;
-    waste = 0.0;
-
-    for (i = FirstOffsetNumber; i < maxoff; i = OffsetNumberNext(i))
+    /*
+     * Preparing auxiliary array and sorting.
+     */
+    for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
     {
-        datum_alpha = (SEG *) DatumGetPointer(entryvec->vector[i].key);
-        for (j = OffsetNumberNext(i); j <= maxoff; j = OffsetNumberNext(j))
-        {
-            datum_beta = (SEG *) DatumGetPointer(entryvec->vector[j].key);
-
-            /* compute the wasted space by unioning these guys */
-            /* size_waste = size_union - size_inter; */
-            union_d = seg_union(datum_alpha, datum_beta);
-            rt_seg_size(union_d, &size_union);
-            inter_d = seg_inter(datum_alpha, datum_beta);
-            rt_seg_size(inter_d, &size_inter);
-            size_waste = size_union - size_inter;
-
-            /*
-             * are these a more promising split that what we've already seen?
-             */
-            if (size_waste > waste || firsttime)
-            {
-                waste = size_waste;
-                seed_1 = i;
-                seed_2 = j;
-                firsttime = false;
-            }
-        }
+        sortItems[i - 1].index = i;
+        sortItems[i - 1].data = (SEG *) DatumGetPointer(entryvec->vector[i].key);
     }
+    qsort(sortItems, maxoff, sizeof(PickSplitSortItem), sort_item_cmp);
+    seed_2 = maxoff / 2;

     left = v->spl_left;
     v->spl_nleft = 0;
     right = v->spl_right;
     v->spl_nright = 0;

-    datum_alpha = (SEG *) DatumGetPointer(entryvec->vector[seed_1].key);
-    datum_l = seg_union(datum_alpha, datum_alpha);
-    rt_seg_size(datum_l, &size_l);
-    datum_beta = (SEG *) DatumGetPointer(entryvec->vector[seed_2].key);
-    datum_r = seg_union(datum_beta, datum_beta);
-    rt_seg_size(datum_r, &size_r);
-
-    /*
-     * Now split up the regions between the two seeds.    An important property
-     * of this split algorithm is that the split vector v has the indices of
-     * items to be split in order in its left and right vectors.  We exploit
-     * this property by doing a merge in the code that actually splits the
-     * page.
-     *
-     * For efficiency, we also place the new index tuple in this loop. This is
-     * handled at the very end, when we have placed all the existing tuples
-     * and i == maxoff + 1.
-     */
-
-    maxoff = OffsetNumberNext(maxoff);
-    for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+    for (i = 0; i < maxoff; i++)
     {
-        /*
-         * If we've already decided where to place this item, just put it on
-         * the right list.    Otherwise, we need to figure out which page needs
-         * the least enlargement in order to store the item.
-         */
-
-        if (i == seed_1)
+        /* First half of segs goes to the left datum. */
+        if (i < seed_2)
         {
-            *left++ = i;
-            v->spl_nleft++;
-            continue;
-        }
-        else if (i == seed_2)
-        {
-            *right++ = i;
-            v->spl_nright++;
-            continue;
-        }
-
-        /* okay, which page needs least enlargement? */
-        datum_alpha = (SEG *) DatumGetPointer(entryvec->vector[i].key);
-        union_dl = seg_union(datum_l, datum_alpha);
-        union_dr = seg_union(datum_r, datum_alpha);
-        rt_seg_size(union_dl, &size_alpha);
-        rt_seg_size(union_dr, &size_beta);
-
-        /* pick which page to add it to */
-        if (size_alpha - size_l < size_beta - size_r)
-        {
-            datum_l = union_dl;
-            size_l = size_alpha;
-            *left++ = i;
+            datum_l = seg_union(sortItems[i].data,
+                                (i == 0
+                                 ? sortItems[i].data /* union with self at start */
+                                 : datum_l) /* union with existing value */ );
+            *left++ = sortItems[i].index;
             v->spl_nleft++;
         }
+        /* The other half of segs goes to the right datum. */
         else
         {
-            datum_r = union_dr;
-            size_r = size_alpha;
-            *right++ = i;
+            datum_r = seg_union(sortItems[i].data,
+                                (i == seed_2
+                                 ? sortItems[i].data /* union with self at start */
+                                 : datum_r) /* union with existing value */ );
+            *right++ = sortItems[i].index;
             v->spl_nright++;
         }
     }
+
     *left = *right = FirstOffsetNumber; /* sentinel value, see dosplit() */

     v->spl_ldatum = PointerGetDatum(datum_l);

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: ALTER TABLE ... IF EXISTS feature?
Следующее
От: Alvaro Herrera
Дата:
Сообщение: Re: ALTER OBJECT any_name SET SCHEMA name