Re: Fix for seg picksplit function
От | Yeb Havinga |
---|---|
Тема | Re: Fix for seg picksplit function |
Дата | |
Msg-id | 4CD42860.4070706@gmail.com обсуждение исходный текст |
Ответ на | Fix for seg picksplit function (Alexander Korotkov <aekorotkov@gmail.com>) |
Ответы |
Re: Fix for seg picksplit function
(Alexander Korotkov <aekorotkov@gmail.com>)
Re: Fix for seg picksplit function (Alexander Korotkov <aekorotkov@gmail.com>) |
Список | pgsql-hackers |
Hello Alexander, Here follows a review of your patch. > Hackers, > > Seg contrib module contains the same bug in picksplit function as > cube contrib module. Good catch! :-) > Also, Guttman's split algorithm is not needed in unidimensional case, > because sorting based algorithm is good in this case. I had some doubts whether this is true in the general case, instead of the given example. I increased the interval width in your example to 0.25*b instead of 0.00005*b, with the purpose to increase overlaps between intervals. Though the performance gain was less, it was still faster than Guttmans algorithm. To make things worse I also tested with an interval with of 1*b, resulting in a lot of overlaps and compared several overlap queries. The sorting algorithm was 25% to 40% faster on searches. Index creation time with the sorting algorithm is also a fraction of the original creation time. Since this testing could be part of a review, I looked at the code as well and listed myself as reviewer on the commitfest. Comparing with gbt_num_picksplit reveals some differences with sort array intialization and size, the former's sort array starts at index 1 (FirstOffsetNumber), your implementation starts at 0 for sorting and hence the size of the sorting array can be one element less. I prefer your way of sort array initialization; gbt_num_pickplits's use of FirstOffsetNumber of the qsort array seems to mix a define from the gist/btree namespace for no reason and might even lead to confusion. The remaining part of the new picksplit function puts the segs into left or right, I think the code is easier to understand if there was only one for loop from i=1 to 1 < maxoff, for the current code I had to verify that all sort array entries were really used with the two seperate loops that also skipped the first value. I edited the code a bit, and also used seg_union to initialize/palloc the datum values. Finally, waste and firsttime variables were initialized but not used anymore, so removed. Attached is a revised patch. regards, Yeb Havinga PS: when comparing with gbt_num_picksplit, I noticed that that one does not update v->spl_ldatum and spl_rdatum to the union datums, but initializes these to 0 at the beginning and never seems to update them. Not sure if this is a problem since the num_picksplit stuff seems to work well. diff --git a/contrib/seg/seg.c b/contrib/seg/seg.c index 930a35b..93895ef 100644 --- a/contrib/seg/seg.c +++ b/contrib/seg/seg.c @@ -292,38 +292,42 @@ gseg_penalty(GISTENTRY *origentry, GISTENTRY *newentry, float *result) return (result); } +/* + * Auxiliary structure for picksplit method. + */ +typedef struct +{ + int index; + SEG *data; +} PickSplitSortItem; +/* + * Compare function for PickSplitSortItem based on seg_cmp. + */ +static int +sort_item_cmp(const void *a, const void *b) +{ + PickSplitSortItem *i1 = (PickSplitSortItem *)a; + PickSplitSortItem *i2 = (PickSplitSortItem *)b; + return seg_cmp(i1->data, i2->data); +} /* ** The GiST PickSplit method for segments -** We use Guttman's poly time split algorithm +** Algorithm based on sorting. Incoming array of segs is sorting using seg_cmp +** function. After that first half of segs goes to the left datum, and the +** second half of segs goes to the right datum. */ GIST_SPLITVEC * gseg_picksplit(GistEntryVector *entryvec, GIST_SPLITVEC *v) { - OffsetNumber i, - j; - SEG *datum_alpha, - *datum_beta; + OffsetNumber i; SEG *datum_l, *datum_r; - SEG *union_d, - *union_dl, - *union_dr; - SEG *inter_d; - bool firsttime; - float size_alpha, - size_beta, - size_union, - size_inter; - float size_waste, - waste; - float size_l, - size_r; + PickSplitSortItem *sortItems; int nbytes; - OffsetNumber seed_1 = 1, - seed_2 = 2; + OffsetNumber seed_2; OffsetNumber *left, *right; OffsetNumber maxoff; @@ -332,111 +336,52 @@ gseg_picksplit(GistEntryVector *entryvec, fprintf(stderr, "picksplit\n"); #endif - maxoff = entryvec->n - 2; - nbytes = (maxoff + 2) * sizeof(OffsetNumber); + maxoff = entryvec->n - 1; + nbytes = (maxoff + 1) * sizeof(OffsetNumber); + sortItems = (PickSplitSortItem *)palloc(maxoff * sizeof(PickSplitSortItem)); v->spl_left = (OffsetNumber *) palloc(nbytes); v->spl_right = (OffsetNumber *) palloc(nbytes); - firsttime = true; - waste = 0.0; - - for (i = FirstOffsetNumber; i < maxoff; i = OffsetNumberNext(i)) + /* + * Preparing auxiliary array and sorting. + */ + for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i)) { - datum_alpha = (SEG *) DatumGetPointer(entryvec->vector[i].key); - for (j = OffsetNumberNext(i); j <= maxoff; j = OffsetNumberNext(j)) - { - datum_beta = (SEG *) DatumGetPointer(entryvec->vector[j].key); - - /* compute the wasted space by unioning these guys */ - /* size_waste = size_union - size_inter; */ - union_d = seg_union(datum_alpha, datum_beta); - rt_seg_size(union_d, &size_union); - inter_d = seg_inter(datum_alpha, datum_beta); - rt_seg_size(inter_d, &size_inter); - size_waste = size_union - size_inter; - - /* - * are these a more promising split that what we've already seen? - */ - if (size_waste > waste || firsttime) - { - waste = size_waste; - seed_1 = i; - seed_2 = j; - firsttime = false; - } - } + sortItems[i - 1].index = i; + sortItems[i - 1].data = (SEG *) DatumGetPointer(entryvec->vector[i].key); } + qsort(sortItems, maxoff, sizeof(PickSplitSortItem), sort_item_cmp); + seed_2 = maxoff / 2; left = v->spl_left; v->spl_nleft = 0; right = v->spl_right; v->spl_nright = 0; - datum_alpha = (SEG *) DatumGetPointer(entryvec->vector[seed_1].key); - datum_l = seg_union(datum_alpha, datum_alpha); - rt_seg_size(datum_l, &size_l); - datum_beta = (SEG *) DatumGetPointer(entryvec->vector[seed_2].key); - datum_r = seg_union(datum_beta, datum_beta); - rt_seg_size(datum_r, &size_r); - - /* - * Now split up the regions between the two seeds. An important property - * of this split algorithm is that the split vector v has the indices of - * items to be split in order in its left and right vectors. We exploit - * this property by doing a merge in the code that actually splits the - * page. - * - * For efficiency, we also place the new index tuple in this loop. This is - * handled at the very end, when we have placed all the existing tuples - * and i == maxoff + 1. - */ - - maxoff = OffsetNumberNext(maxoff); - for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i)) + for (i = 0; i < maxoff; i++) { - /* - * If we've already decided where to place this item, just put it on - * the right list. Otherwise, we need to figure out which page needs - * the least enlargement in order to store the item. - */ - - if (i == seed_1) + /* First half of segs goes to the left datum. */ + if (i < seed_2) { - *left++ = i; - v->spl_nleft++; - continue; - } - else if (i == seed_2) - { - *right++ = i; - v->spl_nright++; - continue; - } - - /* okay, which page needs least enlargement? */ - datum_alpha = (SEG *) DatumGetPointer(entryvec->vector[i].key); - union_dl = seg_union(datum_l, datum_alpha); - union_dr = seg_union(datum_r, datum_alpha); - rt_seg_size(union_dl, &size_alpha); - rt_seg_size(union_dr, &size_beta); - - /* pick which page to add it to */ - if (size_alpha - size_l < size_beta - size_r) - { - datum_l = union_dl; - size_l = size_alpha; - *left++ = i; + datum_l = seg_union(sortItems[i].data, + (i == 0 + ? sortItems[i].data /* union with self at start */ + : datum_l) /* union with existing value */ ); + *left++ = sortItems[i].index; v->spl_nleft++; } + /* The other half of segs goes to the right datum. */ else { - datum_r = union_dr; - size_r = size_alpha; - *right++ = i; + datum_r = seg_union(sortItems[i].data, + (i == seed_2 + ? sortItems[i].data /* union with self at start */ + : datum_r) /* union with existing value */ ); + *right++ = sortItems[i].index; v->spl_nright++; } } + *left = *right = FirstOffsetNumber; /* sentinel value, see dosplit() */ v->spl_ldatum = PointerGetDatum(datum_l);
В списке pgsql-hackers по дате отправления: