Thread: GIN improvements part 1: additional information

GIN improvements part 1: additional information

From: Alexander Korotkov
Date:
Hackers,

Revised version of patch for additional information storage in GIN is attached. Changes are mostly bug fixes. 

Let me recap the GIN interface changes that this patch introduces.
The patch modifies the GIN interface as follows:
1) Two arguments are added to extractValue
Datum **addInfo, bool **addInfoIsNull
2) Two arguments are added to consistent
Datum addInfo[], bool addInfoIsNull[]
3) A new method, config, is introduced, which returns the datatype oid of the additional information (analogous to the SP-GiST config method).
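
To make the new interface concrete, here is a rough sketch of the resulting support-function signatures (the addInfo arguments are the ones described above; the rest follows the usual GIN extension interface, and the exact argument names are illustrative, not necessarily the patch's):

Datum *extractValue(Datum itemValue, int32 *nkeys, bool **nullFlags,
                    Datum **addInfo, bool **addInfoIsNull);

bool consistent(bool check[], StrategyNumber n, Datum query, int32 nkeys,
                Pointer extra_data[], bool *recheck, Datum queryKeys[],
                bool nullFlags[], Datum addInfo[], bool addInfoIsNull[]);

Oid config(void);    /* datatype oid of the additional information */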

Additionally, there is another patch which demonstrates the benefits of the additional information storage itself (because this patch alone doesn't accelerate tsearch). It comes in a separate thread.

------
With best regards,
Alexander Korotkov.
Attachments

Re: GIN improvements part 1: additional information

From: Alexander Korotkov
Date:
On Fri, Jun 14, 2013 at 12:09 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Revised version of patch for additional information storage in GIN is attached. Changes are mostly bug fixes. 

New version of patch is attached with some more refactoring and bug fixes.

------
With best regards,
Alexander Korotkov.
Attachments

Re: GIN improvements part 1: additional information

From: Alexander Korotkov
Date:
On Mon, Jun 17, 2013 at 4:54 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Fri, Jun 14, 2013 at 12:09 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Revised version of patch for additional information storage in GIN is attached. Changes are mostly bug fixes. 

New version of patch is attached with some more refactoring and bug fixes.

Another revision with more bug fixes.

------
With best regards,
Alexander Korotkov.
Attachments

Re: GIN improvements part 1: additional information

From: Alexander Korotkov
Date:

On Wed, Jun 19, 2013 at 12:44 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Mon, Jun 17, 2013 at 4:54 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Fri, Jun 14, 2013 at 12:09 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Revised version of patch for additional information storage in GIN is attached. Changes are mostly bug fixes. 

New version of patch is attached with some more refactoring and bug fixes.

Another revision with more bug fixes.

New revision of patch is attached. Now it includes some docs.

------
With best regards,
Alexander Korotkov.
Attachments

Re: GIN improvements part 1: additional information

From: Antonin Houska
Date:
On 06/25/2013 12:03 AM, Alexander Korotkov wrote:
>
>
> New revision of patch is attached. Now it includes some docs.
>
>

Hi,
I was curious about the new layout of the data page, so I spent a while 
looking into the code.
It's interesting, but I suspect 2 things are not o.k.:

* gindatapage.c:dataIsEnoughSpace() - 'i++' in the for loop should 
probably be 'j++', otherwise it loops forever

* gin_private.h:ginDataPageLeafRead() - fetch_att() is used to retrieve 
the additional info, but per the definition and comments in tupmacs.h it 
expects an aligned pointer (see the sketch after this list).

* gindatapage.c:ginCheckPlaceToDataPageLeaf() -  comment "if leaf data 
page" should probably be "on a leaf data page" or so.

Regards,
Antonin Houska (Tony)



Re: GIN improvements part 1: additional information

From: Heikki Linnakangas
Date:
On 25.06.2013 01:03, Alexander Korotkov wrote:
> New revision of patch is attached. Now it includes some docs.

Thanks! I'm looking into this in detail now.

First, this patch actually contains two major things:

1. Pack item pointers more tightly on posting data leaf pages.
2. Allow opclass implementation to attach "additional information" to
each item pointer.

These are two very distinct features, so this patch needs to be split
into two. I extracted the 1st part into a separate patch, attached, and
am going to focus on that now.

I made one significant change: I removed the 'freespace' field you added
to GinPageOpaque. Instead, on data leaf pages the offset from the
beginning of the page where the packed items end is stored in place of
the 'maxoff' field. This allows for quick calculation of the free space,
but there is no count of item pointers stored on the page anymore, so
some code that looped through all the item pointers relying on 'maxoff'
had to be changed to work with the end offset instead. I'm not 100%
wedded on this, but I'd like to avoid adding the redundant freespace
field on pages that don't need it, because it's confusing and you have
to remember to keep them in sync.
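
To illustrate the bookkeeping, a sketch with hypothetical field and macro names (not necessarily what the patch uses):

/* On a data leaf page, the old 'maxoff' slot is reinterpreted as the end
 * offset of the packed items, so the free space is just the gap between
 * that offset and the special area. */
#define GinDataLeafPageGetEndOffset(page) \
    (GinPageGetOpaque(page)->maxoff)
#define GinDataLeafPageGetFreeSpace(page) \
    (((PageHeader) (page))->pd_special - GinDataLeafPageGetEndOffset(page))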

The patch needs a lot of cleanup still, and I may well have broken some
stuff, but I'm quite pleased with the performance. I tested this with
two tables; one is the titles from the DBLP dataset. Another is integer
arrays, created with this:

create function randomintarr() returns int[] as $$ select
array_agg((random() * 1000.0)::int4) from generate_series(1,10) $$
language sql;
create table intarrtbl as select randomintarr() as ii from
generate_series(1, 10000000);

The effect on the index sizes is quite dramatic:

postgres=# \di+
                                  List of relations
  Schema |        Name        | Type  | Owner  |    Table    |  Size  |
--------+--------------------+-------+--------+-------------+--------+
  public | gin_intarr_master  | index | heikki | intarrtbl   | 585 MB |
  public | gin_intarr_patched | index | heikki | intarrtbl   | 211 MB |
  public | gin_title          | index | heikki | dblp_titles | 93 MB  |
  public | gin_title_master   | index | heikki | dblp_titles | 180 MB |
(4 rows)

Tomas Vondra tested the search performance of an earlier version of this
patch: http://www.postgresql.org/message-id/50BFF89A.7080908@fuzzy.cz.
He initially saw a huge slowdown, but could not reproduce it with a
later version of the patch. I did not see much difference in a few quick
queries I ran, so we're probably good on that front.

There's a few open questions:

1. How are we going to handle pg_upgrade? It would be nice to be able to
read the old page format, or convert on-the-fly. OTOH, if it gets too
complicated, might not be worth it. The indexes are much smaller with
the patch, so anyone using GIN probably wants to rebuild them anyway,
sooner or later. Still, I'd like to give it a shot.

2. The patch introduces a small fixed 32-entry index into the packed
items. Is that an optimal number?

3. I'd like to see some performance testing of insertions, deletions,
and vacuum. I suspect that maintaining the 32-entry index might be
fairly expensive, as it's rewritten on every update to a leaf page.
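
To make questions 2 and 3 concrete, here is a standalone toy model of such a fixed skip-index (illustrative only, not the patch's actual data structures):

#include <stdint.h>

/* Each of the 32 entries remembers the first item and the byte offset of
 * one chunk of packed item pointers, so a search has to decode at most
 * ~1/32 of the page sequentially. */
typedef struct
{
    uint64_t    firstValue;     /* first (decoded) item of the chunk */
    uint16_t    byteOffset;     /* where the chunk starts in the packed data */
} SkipEntry;

/* Binary search for the chunk that may contain 'target'; sequential
 * decoding of the packed items then starts at that chunk's byteOffset. */
static int
find_chunk(const SkipEntry ix[32], int nused, uint64_t target)
{
    int         lo = 0,
                hi = nused - 1;

    while (lo < hi)
    {
        int         mid = (lo + hi + 1) / 2;

        if (ix[mid].firstValue <= target)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

The cost concern in question 3 is the other side of the same coin: an insertion into one chunk shifts the byte offsets of every later chunk, so the index has to be refreshed on each leaf-page update.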

- Heikki

Attachments

Re: GIN improvements part 1: additional information

From: Heikki Linnakangas
Date:
On 27.06.2013 17:20, Antonin Houska wrote:
> I was curious about the new layout of the data page, so I spent a while
> looking into the code.
> It's interesting, but I suspect 2 things are not o.k.:
>
> * gindatapage.c:dataIsEnoughSpace() - 'i++' in the for loop should
> probably be 'j++', otherwise it loops forever

Hmm. It won't loop forever, i is counting the number of entries that fit 
on the page, while j is used to loop through the items being added. 
However, i isn't actually used for anything (and isn't initialized 
either), so it's just dead code.

- Heikki



Re: GIN improvements part 1: additional information

From: Antonin Houska
Date:
On 06/29/2013 11:00 AM, Heikki Linnakangas wrote:
> On 27.06.2013 17:20, Antonin Houska wrote:
>> I was curious about the new layout of the data page, so I spent a while
>> looking into the code.
>> It's interesting, but I suspect 2 things are not o.k.:
>>
>> * gindatapage.c:dataIsEnoughSpace() - 'i++' in the for loop should
>> probably be 'j++', otherwise it loops forever
>
> Hmm. It won't loop forever, i is counting the number of entries that 
> fit on the page, while j is used to loop through the items being 
> added. However, i isn't actually used for anything (and isn't 
> initialized either), so it's just dead code.
>
> - Heikki
You're right. While thinking about the possible meaning of 'i', I didn't 
notice that j++ is in the 'for' construct. A stupid mistake on my side.

Tony



Re: GIN improvements part 1: additional information

From: Heikki Linnakangas
Date:
On 29.06.2013 11:56, Heikki Linnakangas wrote:
> On 25.06.2013 01:03, Alexander Korotkov wrote:
>> New revision of patch is attached. Now it includes some docs.
>
> Thanks! I'm looking into this in detail now.
>
> First, this patch actually contains two major things:
>
> 1. Pack item pointers more tightly on posting data leaf pages.
> 2. Allow opclass implementation to attach "additional information" to
> each item pointer.
>
> These are two very distinct features, so this patch needs to be split
> into two. I extracted the 1st part into a separate patch, attached, and
> am going to focus on that now.
>
> I made one significant change: I removed the 'freespace' field you added
> to GinPageOpaque. Instead, on data leaf pages the offset from the
> beginning of the page where the packed items end is stored in place of
> the 'maxoff' field. This allows for quick calculation of the free space,
> but there is no count of item pointers stored on the page anymore, so
> some code that looped through all the item pointers relying on 'maxoff'
> had to be changed to work with the end offset instead. I'm not 100%
> wedded on this, but I'd like to avoid adding the redundant freespace
> field on pages that don't need it, because it's confusing and you have
> to remember to keep them in sync.

Ah, crap, looks like I sent the wrong patch, and now I can't find the 
correct one anymore. The patch I sent didn't include the changes to store 
the end offset instead of freespace. I'll rewrite that part.

- Heikki



Re: GIN improvements part 1: additional information

From: Heikki Linnakangas
Date:
On 29.06.2013 20:08, Heikki Linnakangas wrote:
> On 29.06.2013 11:56, Heikki Linnakangas wrote:
>> I made one significant change: I removed the 'freespace' field you added
>> to GinPageOpaque. Instead, on data leaf pages the offset from the
>> beginning of the page where the packed items end is stored in place of
>> the 'maxoff' field. This allows for quick calculation of the free space,
>> but there is no count of item pointers stored on the page anymore, so
>> some code that looped through all the item pointers relying on 'maxoff'
>> had to be changed to work with the end offset instead. I'm not 100%
>> wedded on this, but I'd like to avoid adding the redundant freespace
>> field on pages that don't need it, because it's confusing and you have
>> to remember to keep them in sync.
>
> Ah, crap, looks like I sent the wrong patch, and now I can't find the
> correct one anymore. The patch I sent didn't include the changes to store
> the end offset instead of freespace. I'll rewrite that part.

Here's the correct version. I've probably broken things, sorry about that.

I'm going to mark this as "returned with feedback" now. The code still
needs a lot of general cleanup, including comment and README updates.
Also, please take some time to consider the open questions I listed
here: archives.postgresql.org/message-id/51CEA13C.8040103@vmware.com.

- Heikki

Attachments

Re: GIN improvements part 1: additional information

From: Alexander Korotkov
Date:
On Sun, Jun 30, 2013 at 2:50 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 29.06.2013 20:08, Heikki Linnakangas wrote:
On 29.06.2013 11:56, Heikki Linnakangas wrote:
I made one significant change: I removed the 'freespace' field you added
to GinPageOpaque. Instead, on data leaf pages the offset from the
beginning of the page where the packed items end is stored in place of
the 'maxoff' field. This allows for quick calculation of the free space,
but there is no count of item pointers stored on the page anymore, so
some code that looped through all the item pointers relying on 'maxoff'
had to be changed to work with the end offset instead. I'm not 100%
wedded on this, but I'd like to avoid adding the redundant freespace
field on pages that don't need it, because it's confusing and you have
to remember to keep them in sync.

Ah, crap, looks like I sent the wrong patch, and now I can't find the
correct one anymore. The patch I sent didn't include the changes to store
the end offset instead of freespace. I'll rewrite that part.

Here's the correct version. I've probably broken things, sorry about that.

I'm going to mark this as "returned with feedback" now. The code still needs a lot of general cleanup, including comment and README updates. Also, please take some time to consider the open questions I listed here: archives.postgresql.org/message-id/51CEA13C.8040103@vmware.com.

Thanks! So, we have a lot of material, and you've given the points for further work. Could you please verify my plan of work on these patches:
1) Resolving the questions of archives.postgresql.org/message-id/51CEA13C.8040103@vmware.com for packed posting lists.
2) Extract the additional info patch based on packed posting lists.
3) Rewrite the interface of fast scan. Do CPU and IO benchmarking.
4) Do IO benchmarking of index ordering.
Cleanup, comments, and READMEs are assumed for each item.

------
With best regards,
Alexander Korotkov.

Re: GIN improvements part 1: additional information

From: Alexander Korotkov
Date:
On Thu, Jun 27, 2013 at 6:20 PM, Antonin Houska <antonin.houska@gmail.com> wrote:
On 06/25/2013 12:03 AM, Alexander Korotkov wrote:

New revision of patch is attached. Now it includes some docs.

Hi,
I was curious about the new layout of the data page, so I spent a while looking into the code.
It's interesting, but I suspect 2 things are not o.k.:

* gindatapage.c:dataIsEnoughSpace() - 'i++' in the for loop should probably be 'j++', otherwise it loops forever

* gin_private.h:ginDataPageLeafRead() - fetch_att() is used to retrieve the additional info, but per the definition and comments in tupmacs.h it expects an aligned pointer.

* gindatapage.c:ginCheckPlaceToDataPageLeaf() - comment "if leaf data page" should probably be "on a leaf data page" or so.

Hi!

Thanks for pointing these out.

------
With best regards,
Alexander Korotkov.

Re: GIN improvements part 1: additional information

From: Heikki Linnakangas
Date:
On 01.07.2013 13:28, Alexander Korotkov wrote:
> Thanks! So, we have a lot of material, and you've given the points for further
> work. Could you please verify my plan of work on these patches:
> 1) Resolving the questions of
> http://archives.postgresql.org/message-id/51CEA13C.8040103@vmware.com for
> packed posting lists.
> 2) Extract the additional info patch based on packed posting lists.
> 3) Rewrite the interface of fast scan. Do CPU and IO benchmarking.
> 4) Do IO benchmarking of index ordering.
> Cleanup, comments, and READMEs are assumed for each item.

Yep, sounds good!

- Heikki



Re: GIN improvements part 1: additional information

From: Tomas Vondra
Date:
Hi,

I've done a fair amount of testing by loading pgsql-general archives
into a database and running a bunch of simple ts queries that use a GIN
index.

I've tested this as well as the two other patches, but as I was able to
get meaningful results only from this patch, I'll post the results here
and info about segfaults and other observed errors to the other threads.

First of all - please update the commitfest page whenever you submit a new
patch version. I've spent two or three hours testing and debugging patches
linked from those pages, only to find out that there were newer versions.
I should have checked that initially, but let's keep it updated.

I wasn't able to apply the patches to the current head, so I've used
b8fd1a09 (from 17/06) as the base commit.

The following table shows these metrics:

* data load
  - how long it took to import ~200k messages from the list archive
  - includes a lot of time spent in Python (parsing), checking FKs ...
  - so unless this is significantly higher, it's probably OK

* index size
  - size of the main GIN index on the message body

* 1/2/3-word(s)
  - number of queries of the form

      SELECT id FROM messages
      WHERE body_tsvector @@ plainto_tsquery('english', 'w1 w2')
      LIMIT 100

    (executed over 60 seconds, and the 'per second' speed)

All the scripts are available at https://bitbucket.org/tvondra/archie

Now, the results:

no patches:
  data load:  710 s
  index size: 545 MB
  1 word:      37500 (630/s)
  2 words:     49800 (800/s)
  3 words:     40000 (660/s)

additional info (ginaddinfo.7.patch):
  data load:  693 s
  index size: 448 MB
  1 word:     135000 (2250/s)
  2 words:     85000 (1430/s)
  3 words:     54000 ( 900/s)

additional info + fast scan (gin_fast_scan.4.patch):
  data load:  720 s
  index size: 455 MB
  1 word:     FAIL
  2 words:    FAIL
  3 words:    FAIL

additional info + fast scan + ordering (gin_ordering.4.patch):
  data load:  FAIL
  index size: N/A
  1 word:     N/A
  2 words:    N/A
  3 words:    N/A

So the speedup after adding info into GIN seems very promising, although
I don't quite understand why searching for two words is so much slower.
Also the index size seems to decrease significantly.

After applying 'fast scan', things started to break down, so I wasn't
able to run the queries, and then even the load failed consistently.

I'll post the info into the appropriate threads.

Tomas



Re: GIN improvements part 1: additional information

From: Alexander Korotkov
Date:
On Sat, Jun 29, 2013 at 12:56 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
There's a few open questions:

1. How are we going to handle pg_upgrade? It would be nice to be able to read the old page format, or convert on-the-fly. OTOH, if it gets too complicated, might not be worth it. The indexes are much smaller with the patch, so anyone using GIN probably wants to rebuild them anyway, sooner or later. Still, I'd like to give it a shot.

2. The patch introduces a small fixed 32-entry index into the packed items. Is that an optimal number?

3. I'd like to see some performance testing of insertions, deletions, and vacuum. I suspect that maintaining the 32-entry index might be fairly expensive, as it's rewritten on every update to a leaf page.

It appears that maintaining the 32-entry index was really expensive, because it required re-decoding the whole page. This issue is fixed in the attached version of the patch by introducing incremental updating of that index; a sketch of the idea follows below. Benchmarks will be posted later today.
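
Continuing the toy SkipEntry model sketched earlier in the thread, the incremental idea amounts to something like this (illustrative sketch only, not the patch's code):

/* After an insertion grows chunk k by sizeDelta bytes, only the byte
 * offsets of the chunks after k move; earlier chunks, and the page data
 * before the insertion point, need not be re-decoded at all. */
static void
skipindex_shift(SkipEntry ix[32], int nused, int k, int sizeDelta)
{
    for (int i = k + 1; i < nused; i++)
        ix[i].byteOffset = (uint16_t) (ix[i].byteOffset + sizeDelta);
}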

------
With best regards,
Alexander Korotkov.  
Attachments

Re: GIN improvements part 1: additional information

From: Peter Eisentraut
Date:
Please fix compiler warnings:

gindatapage.c: In function ‘dataPlaceToPage’:
gindatapage.c:706:24: warning: unused variable ‘pageBackup’ [-Wunused-variable]
gindatapage.c: In function ‘updateItemIndexes’:
gindatapage.c:1182:6: warning: variable ‘totalsize’ set but not used [-Wunused-but-set-variable]
gindatapage.c: In function ‘dataPlaceToPage’:
gindatapage.c:633:28: warning: ‘startI’ may be used uninitialized in this function [-Wuninitialized]
gindatapage.c:617:21: note: ‘startI’ was declared here





Re: GIN improvements part 1: additional information

From: Heikki Linnakangas
Date:
On 15.09.2013 12:14, Alexander Korotkov wrote:
> On Sat, Jun 29, 2013 at 12:56 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>
>> There's a few open questions:
>>
>> 1. How are we going to handle pg_upgrade? It would be nice to be able to
>> read the old page format, or convert on-the-fly. OTOH, if it gets too
>> complicated, might not be worth it. The indexes are much smaller with the
>> patch, so anyone using GIN probably wants to rebuild them anyway, sooner or
>> later. Still, I'd like to give it a shot.
>>
>> 2. The patch introduces a small fixed 32-entry index into the packed
>> items. Is that an optimal number?
>>
>> 3. I'd like to see some performance testing of insertions, deletions, and
>> vacuum. I suspect that maintaining the 32-entry index might be fairly
>> expensive, as it's rewritten on every update to a leaf page.
>
> It appears that maintaining the 32-entry index was really expensive, because
> it required re-decoding the whole page. This issue is fixed in the attached
> version of the patch by introducing incremental updating of that index.
> Benchmarks will be posted later today.

Great! Please also benchmark WAL replay; you're still doing 
non-incremental update of the 32-entry index in ginRedoInsert.

A couple of quick comments after a quick glance (in addition to the above):

The varbyte encoding stuff is a quite isolated piece of functionality. 
And potentially useful elsewhere, too. It would be good to separate that 
into a separate .c/.h files. I'm thinking of 
src/backend/access/gin/packeditemptr.c, which would contain 
ginDataPageLeafReadItemPointer, ginDataPageLeafWriteItemPointer and 
ginDataPageLeafGetItemPointerSize functions. A new typedef for 
varbyte-encoded things would probably be good too, something like 
"typedef char *PackedItemPointer". In the new .c file, please also add a 
file-header comment explaining the encoding.
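
For reference, the general shape of such a varbyte encoding looks like this (an illustrative sketch, not the patch's exact code):

#include <stdint.h>

/* 7 payload bits per byte; the high bit is set on every byte except the
 * last one of each value. Since the values are deltas between consecutive
 * item pointers, most of them fit in one or two bytes. */
static char *
varbyte_encode(char *ptr, uint64_t val)
{
    while (val >= 0x80)
    {
        *ptr++ = (char) ((val & 0x7F) | 0x80);
        val >>= 7;
    }
    *ptr++ = (char) val;
    return ptr;
}

static char *
varbyte_decode(char *ptr, uint64_t *val)
{
    uint64_t    result = 0;
    int         shift = 0;

    while (((unsigned char) *ptr) & 0x80)
    {
        result |= (uint64_t) (*ptr & 0x7F) << shift;
        shift += 7;
        ptr++;
    }
    result |= (uint64_t) (*ptr & 0x7F) << shift;
    *val = result;
    return ptr + 1;
}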

The README really needs to be updated. The posting tree page structure 
is a lot more complicated now, there needs to be a whole new section to 
explain it. A picture would be nice, similar to the one in bufpage.h.

It's a bit funny that we've clumped together all different kinds of GIN 
insertions into one WAL record type. ginRedoInsert does completely 
different things depending on what kind of a page it is. And the 
ginXlogInsert struct also contains different kinds of things depending 
on what kind of an insertion it is. It would be cleaner to split it into 
three. (this isn't your patch's fault - it was like that before, too.)

- Heikki



Re: GIN improvements part 1: additional information

From: Alexander Korotkov
Date:
On Mon, Sep 16, 2013 at 11:43 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 15.09.2013 12:14, Alexander Korotkov wrote:
On Sat, Jun 29, 2013 at 12:56 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

There's a few open questions:

1. How are we going to handle pg_upgrade? It would be nice to be able to
read the old page format, or convert on-the-fly. OTOH, if it gets too
complicated, might not be worth it. The indexes are much smaller with the
patch, so anyone using GIN probably wants to rebuild them anyway, sooner or
later. Still, I'd like to give it a shot.

2. The patch introduces a small fixed 32-entry index into the packed
items. Is that an optimal number?

3. I'd like to see some performance testing of insertions, deletions, and
vacuum. I suspect that maintaining the 32-entry index might be fairly
expensive, as it's rewritten on every update to a leaf page.

It appears that maintaining the 32-entry index was really expensive, because it
required re-decoding the whole page. This issue is fixed in the attached version
of the patch by introducing incremental updating of that index. Benchmarks will
be posted later today.

Great! Please also benchmark WAL replay; you're still doing non-incremental update of the 32-entry index in ginRedoInsert.

A couple of quick comments after a quick glance (in addition to the above):

The varbyte encoding stuff is a quite isolated piece of functionality. And potentially useful elsewhere, too. It would be good to separate that into a separate .c/.h files. I'm thinking of src/backend/access/gin/packeditemptr.c, which would contain ginDataPageLeafReadItemPointer, ginDataPageLeafWriteItemPointer and ginDataPageLeafGetItemPointerSize functions. A new typedef for varbyte-encoded things would probably be good too, something like "typedef char *PackedItemPointer". In the new .c file, please also add a file-header comment explaining the encoding.

The README really needs to be updated. The posting tree page structure is a lot more complicated now, there needs to be a whole new section to explain it. A picture would be nice, similar to the one in bufpage.h.

It's a bit funny that we've clumped together all different kinds of GIN insertions into one WAL record type. ginRedoInsert does completely different things depending on what kind of a page it is. And the ginXlogInsert struct also contains different kinds of things depending on what kind of an insertion it is. It would be cleaner to split it into three. (this isn't your patch's fault - it was like that before, too.)

Apparently some bug breaks the index on huge updates, independent of the incremental update of the 32-entry index. I'm debugging it now; that's why the benchmarks are delayed.

------
With best regards,
Alexander Korotkov.

Re: GIN improvements part 1: additional information

From: Alexander Korotkov
Date:
On Mon, Sep 16, 2013 at 11:43 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 15.09.2013 12:14, Alexander Korotkov wrote:
On Sat, Jun 29, 2013 at 12:56 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

There's a few open questions:

1. How are we going to handle pg_upgrade? It would be nice to be able to
read the old page format, or convert on-the-fly. OTOH, if it gets too
complicated, might not be worth it. The indexes are much smaller with the
patch, so anyone using GIN probably wants to rebuild them anyway, sooner or
later. Still, I'd like to give it a shot.

2. The patch introduces a small fixed 32-entry index into the packed
items. Is that an optimal number?

3. I'd like to see some performance testing of insertions, deletions, and
vacuum. I suspect that maintaining the 32-entry index might be fairly
expensive, as it's rewritten on every update to a leaf page.

It appears that maintaining the 32-entry index was really expensive, because it
required re-decoding the whole page. This issue is fixed in the attached version
of the patch by introducing incremental updating of that index. Benchmarks will
be posted later today.

Great! Please also benchmark WAL replay; you're still doing non-incremental update of the 32-entry index in ginRedoInsert.

Yes
 
A couple of quick comments after a quick glance (in addition to the above):

The varbyte encoding stuff is a quite isolated piece of functionality. And potentially useful elsewhere, too. It would be good to separate that into a separate .c/.h files. I'm thinking of src/backend/access/gin/packeditemptr.c, which would contain ginDataPageLeafReadItemPointer, ginDataPageLeafWriteItemPointer and ginDataPageLeafGetItemPointerSize functions. A new typedef for varbyte-encoded things would probably be good too, something like "typedef char *PackedItemPointer". In the new .c file, please also add a file-header comment explaining the encoding.

The README really needs to be updated. The posting tree page structure is a lot more complicated now, there needs to be a whole new section to explain it. A picture would be nice, similar to the one in bufpage.h.

It's a bit funny that we've clumped together all different kinds of GIN insertions into one WAL record type. ginRedoInsert does completely different things depending on what kind of a page it is. And the ginXlogInsert struct also contains different kinds of things depending on what kind of an insertion it is. It would be cleaner to split it into three. (this isn't your patch's fault - it was like that before, too.)

Finally, I've debugged the index update.

The benchmark scripts I used for testing are attached. bench.sh does the following:
1) Switches ~/postgres to the given branch, configures and compiles it.
2) Initializes the cluster, runs postgres, and imports the mailing list archives, which can be downloaded from here:
3) Runs index creation, measuring the time taken and the index size.
4) Runs a set of index search queries, measuring the overall time taken and the number of blocks used. The queries were extracted from the mailing list web server logs, so they are real-life queries.
5) Runs huge updates and vacuums, measuring the overall time taken and the final index size.
6) Reruns the set of queries.

The results of the master branch:

Time taken by index build, update and search:
     event      |     period
----------------+-----------------
 index_build    | 00:01:52.154299
 index_update   | 00:10:42.982413
 search_new     | 00:26:14.294872
 search_updated | 00:27:06.10298
(4 rows)

Numbers of blocks used in search (not very representative, because it's mostly consumed by heap fetches):
     label      | blocks_mark
----------------+-------------
 search_new     |   848156708
 search_updated |   885122373
(2 rows)

Size of index newly created and after updates:
     label     |    size
---------------+------------
 new           |  884514816
 after_updates | 1595252736
(2 rows)
 
The results of the packed posting lists branch:

Time taken by index build, update and search:
     event      |     period
----------------+-----------------
 index_build    | 00:01:54.363988
 index_update   | 00:08:55.291099
 search_new     | 00:26:06.262403
 search_updated | 00:27:07.501142
(4 rows)

Numbers of blocks used in search:
     label      | blocks_mark
----------------+-------------
 search_new     |   847591514
 search_updated |   883928608
(2 rows)

Size of index newly created and after updates:
     label     |   size
---------------+-----------
 new           | 421330944
 after_updates | 718921728
(2 rows)

We can see there is no significant slowdown. Updates even become faster, probably because of the reduced index size.

Now, I'm going to take care of WAL, refactoring, and documentation.

------
With best regards,
Alexander Korotkov.
Attachments

Re: GIN improvements part 1: additional information

From: Alexander Korotkov
Date:
The latest version of the patch is attached.
WAL is debugged here; now it's actually working.
There was also some refactoring, commenting, and a README update.
The benchmark scripts are also updated; the updated version is attached.
I did a benchmark comparison with different counts of index entries. See the results in the tables below.


         event         |    master       |  16-entries     |  32-entries     |  64-entries     |  128-entries    |
-----------------------+-----------------+-----------------+-----------------+-----------------+-----------------+
 index_build           | 00:01:50.042658 | 00:01:53.79182  | 00:01:55.647561 | 00:01:52.677095 | 00:01:58.723898 |
 index_build_recovery  | 00:00:19        | 00:00:06        | 00:00:05        | 00:00:06        | 00:00:06        |
 index_update          | 00:05:18.215707 | 00:06:09.404842 | 00:05:49.015441 | 00:05:39.987697 | 00:05:38.723376 |
 index_update_recovery | 00:01:48        | 00:01:51        | 00:01:48        | 00:01:47        | 00:01:47        |
 search_new            | 00:25:21.481699 | 00:23:23.59775  | 00:25:13.943362 | 00:23:58.633514 | 00:22:30.763075 |
 search_updated        | 00:25:57.622592 | 00:25:29.867388 | 00:27:33.683614 | 00:25:17.565714 | 00:26:29.333003 |


     label     |    size    | 16-entries | 32-entries | 64-entries | 128-entries |
---------------+------------+------------+------------+------------+-------------+
 new           |  884514816 |  417013760 |  421240832 |  430350336 |   450994176 |
 after_updates | 1595252736 |  711368704 |  719380480 |  735682560 |   774275072 |

Selecting 64 entries instead of 32 is probably an option.
There is still some regression in update speed. However, there is also room for improvement in the patch. It searches the item index entries twice on insert: in dataLocateLeafItem and in dataPlaceToPage. We can save the full results of dataLocateLeafItem, but that requires some rework of the gin btree interface: storing not only the offset of the item.

------
With best regards,
Alexander Korotkov.
Attachments

Re: GIN improvements part 1: additional information

From: Bruce Momjian
Date:
On Sun, Sep 15, 2013 at 01:14:45PM +0400, Alexander Korotkov wrote:
> On Sat, Jun 29, 2013 at 12:56 PM, Heikki Linnakangas <hlinnakangas@vmware.com>
> wrote:
> 
>     There's a few open questions:
> 
>     1. How are we going to handle pg_upgrade? It would be nice to be able to
>     read the old page format, or convert on-the-fly. OTOH, if it gets too
>     complicated, might not be worth it. The indexes are much smaller with the
>     patch, so anyone using GIN probably wants to rebuild them anyway, sooner or
>     later. Still, I'd like to give it a shot.

We have broken pg_upgrade index compatibility in the past.
Specifically, the hash and GIN index binary formats changed from PG 8.3 to
8.4.  I handled it by invalidating the indexes and providing a
post-upgrade script to REINDEX all the changed indexes.  The user
message is:

    Your installation contains hash and/or GIN indexes.  These indexes have
    different internal formats between your old and new clusters, so they
    must be reindexed with the REINDEX command.  The file:

        ...

    when executed by psql by the database superuser will recreate all invalid
    indexes; until then, none of these indexes will be used.

It would be very easy to do this from a pg_upgrade perspective.
However, I know there have been complaints from others about making
pg_upgrade more restrictive.

In this specific case, even if you write code to read the old file
format, we might want to create the REINDEX script to allow _optional_
reindexing to shrink the index files.

If we do require the REINDEX, --check will clearly warn the user that
this will be required.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: GIN improvements part 1: additional information

From: Alexander Korotkov
Date:
On Mon, Sep 23, 2013 at 12:47 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Selecting 64 entries instead of 32 is probably an option.
There is still some regression in update speed. However, there is also room for improvement in the patch. It searches the item index entries twice on insert: in dataLocateLeafItem and in dataPlaceToPage. We can save the full results of dataLocateLeafItem, but that requires some rework of the gin btree interface: storing not only the offset of the item.

In the attached version of the patch, the double lookup of the ItemPointer during insert is avoided. The overhead becomes lower, as expected.

         event         |    master       |  16-entries     |  32-entries     |  64-entries     |  128-entries    |
-----------------------+-----------------+-----------------+-----------------+-----------------+-----------------+
 index_build           | 00:01:50.042658 | 00:01:54.130873 | 00:01:59.37302  | 00:01:55.959693 | 00:01:58.126407 |
 index_build_recovery  | 00:00:19        | 00:00:06        | 00:00:06        | 00:00:06        | 00:00:06        |
 index_update          | 00:05:18.215707 | 00:05:38.40231  | 00:05:30.658786 | 00:05:27.664312 | 00:05:30.815876 |
 index_update_recovery | 00:01:48        | 00:01:53        | 00:01:50        | 00:01:44        | 00:01:46        |
 search_new            | 00:25:21.481699 | 00:23:20.324152 | 00:24:02.120438 | 00:22:50.989723 | 00:23:05.703824 |
 search_updated        | 00:25:57.622592 | 00:26:43.531979 | 00:26:08.003085 | 00:24:36.669028 | 00:26:09.175243 |


     label     |    size    | 16-entries | 32-entries | 64-entries | 128-entries |
---------------+------------+------------+------------+------------+-------------+
 new           |  884514816 |  417013760 |  421240832 |  430350336 |   450994176 |
 after_updates | 1595252736 |  711368704 |  719380480 |  735682560 |   774275072 |

------
With best regards,
Alexander Korotkov.
Attachments

Re: GIN improvements part 1: additional information

From: Peter Eisentraut
Date:
On 9/23/13 5:36 PM, Alexander Korotkov wrote:
> In the attached version of the patch, the double lookup of the ItemPointer
> during insert is avoided. The overhead becomes lower, as expected.

Fails cpluspluscheck:

./src/include/access/gin_private.h: In function ‘char*
ginDataPageLeafReadItemPointer(char*, ItemPointer)’:
./src/include/access/gin_private.h:797:2: warning: comparison between
signed and unsigned integer expressions [-Wsign-compare]




Re: GIN improvements part 1: additional information

From: Alexander Korotkov
Date:
On Wed, Sep 25, 2013 at 5:22 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
On 9/23/13 5:36 PM, Alexander Korotkov wrote:
> In the attached version of the patch, the double lookup of the ItemPointer
> during insert is avoided. The overhead becomes lower, as expected.

Fails cpluspluscheck:

./src/include/access/gin_private.h: In function ‘char*
ginDataPageLeafReadItemPointer(char*, ItemPointer)’:
./src/include/access/gin_private.h:797:2: warning: comparison between
signed and unsigned integer expressions [-Wsign-compare]

Thanks. Fixed in the attached version of the patch.

------
With best regards,
Alexander Korotkov.
Attachments

Re: GIN improvements part 1: additional information

From: Heikki Linnakangas
Date:
On 23.09.2013 18:35, Bruce Momjian wrote:
> On Sun, Sep 15, 2013 at 01:14:45PM +0400, Alexander Korotkov wrote:
>> On Sat, Jun 29, 2013 at 12:56 PM, Heikki Linnakangas<hlinnakangas@vmware.com>
>> wrote:
>>
>>      There's a few open questions:
>>
>>      1. How are we going to handle pg_upgrade? It would be nice to be able to
>>      read the old page format, or convert on-the-fly. OTOH, if it gets too
>>      complicated, might not be worth it. The indexes are much smaller with the
>>      patch, so anyone using GIN probably wants to rebuild them anyway, sooner or
>>      later. Still, I'd like to give it a shot.
>
> We have broken pg_upgrade index compatibility in the past.
> Specifically, hash and GIN index binary format changed from PG 8.3 to
> 8.4.  I handled it by invalidating the indexes and providing a
> post-upgrade script to REINDEX all the changed indexes.  The user
> message is:
>
>        Your installation contains hash and/or GIN indexes.  These indexes have
>        different internal formats between your old and new clusters, so they
>        must be reindexed with the REINDEX command.  The file:
>
>        ...
>
>        when executed by psql by the database superuser will recreate all invalid
>         indexes; until then, none of these indexes will be used.
>
> It would be very easy to do this from a pg_upgrade perspective.
> However, I know there has been complaints from others about making
> pg_upgrade more restrictive.
>
> In this specific case, even if you write code to read the old file
> format, we might want to create the REINDEX script to allow _optional_
> reindexing to shrink the index files.
>
> If we do require the REINDEX, --check will clearly warn the user that
> this will be required.

It seems we've all but decided that we'll require reindexing GIN indexes 
in 9.4. Let's take the opportunity to change some other annoyances with 
the current GIN on-disk format:

1. There's no explicit "page id" field in the opaque struct, like there 
is in other index types. This is for the benefit of debugging tools like 
pg_filedump. We've managed to tell GIN pages apart from other index 
types by the fact that the special size of GIN pages is 8 and it's not 
using all the high-order bits in the last byte on the page. But an 
explicit page id field would be nice, so let's add that.

2. I'd like to change the way "incomplete splits" are handled. 
Currently, WAL recovery keeps track of incomplete splits, and fixes any 
that remain at the end of recovery. That concept is slightly broken; 
it's not guaranteed that after you've split a leaf page, for example, 
you will succeed in inserting the downlink to its parent. You might e.g 
run out of disk space. To fix that, I'd like to add a flag to the page 
header to indicate if the split has been completed, ie. if the page's 
downlink has been inserted to the parent, and fix them lazily on the 
next insert. I did a similar change to GiST back in 9.1. (Strictly 
speaking this doesn't require changing the on-disk format, though.)

3. I noticed that the GIN b-trees, the main key entry tree and the 
posting trees, use a slightly different arrangement of the downlink than 
our regular nbtree code does. In nbtree, the downlink for a page is the 
*low* key of that page, ie. if the downlink is 10, all the items on that 
child page must be >= 10. But in GIN, we store the *high* key in the 
downlink, ie. all the items on the child page must be <= 10. That makes 
inserting new downlinks at a page split slightly more complicated. For 
example, when splitting a page containing keys between 1-10 into 1-5 and 
5-10, you need to insert a new downlink with key 10 for the new right 
page, and also update the existing downlink to 5. The nbtree code 
doesn't require updating existing entries.

Anything else?

- Heikki



Re: GIN improvements part 1: additional information

From: Robert Haas
Date:
On Thu, Oct 3, 2013 at 2:43 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> It seems we've all but decided that we'll require reindexing GIN indexes in
> 9.4.

I thought the consensus in Ottawa was strongly against that.  I'm not
aware that anyone has subsequently changed their position on the
topic.  Bruce is right to point out that we've done such things before
and can therefore do it again, but just because we have the technical
means to do it doesn't make it good policy.

That having been said, if we do decide to break it...

> Let's take the opportunity to change some other annoyances with the
> current GIN on-disk format:

...then fixing as much as possible in one go-round is clearly a good plan.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: GIN improvements part 1: additional information

From: Alexander Korotkov
Date:
On Thu, Oct 3, 2013 at 10:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Oct 3, 2013 at 2:43 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> It seems we've all but decided that we'll require reindexing GIN indexes in
> 9.4.

I thought the consensus in Ottawa was strongly against that.  I'm not
aware that anyone has subsequently changed their position on the
topic.  Bruce is right to point out that we've done such things before
and can therefore do it again, but just because we have the technical
means to do it doesn't make it good policy.

That having been said, if we do decide to break it...

> Let's take the opportunity to change some other annoyances with the
> current GIN on-disk format:

...then fixing as much as possible in one go-round is clearly a good plan.

Let's see what options we have at all. I see the following:
1) Drop support for the old GIN on-disk format. But users will have to reindex after pg_upgrade.
2) Insert kluges into GIN to support both the old and new formats. So, kluges are kluges :) I don't see an elegant way to do it for now, because the formats are very different.
3) Upgrade the GIN on-disk format in pg_upgrade. However, that would mean rewriting almost the whole index. Is it much better than just reindexing?
4) Fork GIN2, leave GIN as is. It would lead to a lot of duplicated code.
Any other options?

------
With best regards,
Alexander Korotkov.

Re: GIN improvements part 1: additional information

From: Heikki Linnakangas
Date:
On 03.10.2013 23:37, Alexander Korotkov wrote:
> 2) Insert kluges into GIN to support both the old and new formats. So, kluges
> are kluges :) I don't see an elegant way to do it for now, because the formats
> are very different.

Hmm. All you need is some code to read the old format, and a function to 
convert a page to the new format before updating it. It doesn't seem *that* 
kludgey. It's a fair amount of work, for sure, but not insurmountable.
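
A sketch of what that could look like on the write path (all names here are hypothetical):

/* On the first modification of an old-format leaf page, repack it in the
 * new format before applying the change. */
if (!GinPageIsCompressed(page))             /* hypothetical format flag */
{
    START_CRIT_SECTION();
    ginConvertOldLeafPage(page);            /* hypothetical: rewrite packed */
    MarkBufferDirty(buffer);
    /* ... WAL-log the conversion, e.g. as a full-page image ... */
    END_CRIT_SECTION();
}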

- Heikki



Re: GIN improvements part 1: additional information

From: Alexander Korotkov
Date:
On Fri, Oct 4, 2013 at 12:41 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 03.10.2013 23:37, Alexander Korotkov wrote:
2) Insert kluges into GIN to support both the old and new formats. So, kluges
are kluges :) I don't see an elegant way to do it for now, because the formats
are very different.

Hmm. All you need is some code to read the old format, and a function to convert a page to the new format before updating it. It doesn't seem *that* kludgey. It's a fair amount of work, for sure, but not insurmountable.

My note was not so much about the amount of work as about the result.
ItemPointer compression reduces occupied space in all normal cases. It's not very realistic, but it could increase space in the worst case. That would lead to a page split after conversion. Are we going to support such a case?

------
With best regards,
Alexander Korotkov. 

Re: GIN improvements part 1: additional information

From: Bruce Momjian
Date:
On Thu, Oct  3, 2013 at 02:48:20PM -0400, Robert Haas wrote:
> On Thu, Oct 3, 2013 at 2:43 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
> > It seems we've all but decided that we'll require reindexing GIN indexes in
> > 9.4.
> 
> I thought the consensus in Ottawa was strongly against that.  I'm not
> aware that anyone has subsequently changed their position on the
> topic.  Bruce is right to point out that we've done such things before
> and can therefore do it again, but just because we have the technical
> means to do it doesn't make it good policy.
> 
> That having been said, if we do decide to break it...

Agreed.  I was stating only that this is easy for pg_upgrade.  One cool
thing is that the upgrade completes, and the indexes are there, but
just marked as invalid until the REINDEX.

One other point Alexander made is that the new GIN indexes will be
smaller so most people would want the new format in the new cluster
anyway.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: GIN improvements part 1: additional information

From: Heikki Linnakangas
Date:
On 03.10.2013 23:54, Alexander Korotkov wrote:
> ItemPointer compression reduces occupied space in all normal cases. It's
> not very realistic, but it could increase space in the worst case. That would
> lead to a page split after conversion. Are we going to support such a case?

Hmm, that's probably rare enough that the number of such indexes in the 
real world where that could happen is exactly 0. A compressed item 
requires 7 bytes in the worst case; that is an offset > 127, and 
distance to previous item > 2^(4*7) = 268435456 blocks. With the default 
block size, that requires an index larger than 2TB. And that's just for 
one such item to appear - to actually cause a page to overflow, a page 
would need to be full of other items widely apart each other to take up 
6 bytes each.
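
The arithmetic can be sanity-checked with a few lines of the varbyte scheme sketched earlier in the thread (illustrative):

#include <stdint.h>

/* Bytes needed to varbyte-encode a value, at 7 payload bits per byte. */
static int
varbyte_len(uint64_t v)
{
    int         len = 1;

    while (v >= 0x80)
    {
        v >>= 7;
        len++;
    }
    return len;
}

/* varbyte_len(UINT64_C(1) << 28) == 5 (a block delta that no longer fits
 * in four bytes) and varbyte_len(128) == 2 (an offset > 127), giving the
 * 5 + 2 = 7 bytes worst case described above. */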

So I think if you can make the conversion work with the assumption that 
the compressed format always fits in the old space, and throw an error 
if it doesn't, that's good enough. (That's for the posting trees - the 
posting lists attached to entry tuples are a different story.)

Besides, if you convert the page when you insert into it, you might need 
to split it anyway. So it might not be very difficult to split if required.

IMHO the main argument for not bothering with pg_upgrade is that the 
gain from the patch is so great that you'll want to REINDEX after the 
upgrade anyway, to shrink the index. I really don't have an opinion on 
whether we should attempt reading the old format. On one hand, it would 
be really nice to not have that caveat when you pg_upgrade (oh, you have 
GIN indexes, you have to reindex..). On the other hand, supporting the 
old format is a fair amount of extra code to maintain.

- Heikki



Re: GIN improvements part 1: additional information

From: Alvaro Herrera
Date:
Bruce Momjian wrote:

> Agreed.  I was stating only that this is easy for pg_upgrade.  One cool
> thing is that the upgrade completes, and the indexes are there, but
> just marked as invalid until the REINDEX.
> 
> One other point Alexander made is that the new GIN indexes will be
> smaller so most people would want the new format in the new cluster
> anyway.

But they're nonfunctional until after the reindex, which is bad for
people who want a quick upgrade and return to operational mode
immediately.  If you could just keep the old indexes around, in working
state, until they are REINDEX CONCURRENTLY'ed, that would be more
practical than just marking them invalid.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: GIN improvements part 1: additional information

From: Alexander Korotkov
Date:
On Fri, Oct 4, 2013 at 12:37 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Thu, Oct 3, 2013 at 10:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Oct 3, 2013 at 2:43 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> It seems we've all but decided that we'll require reindexing GIN indexes in
> 9.4.

I thought the consensus in Ottawa was strongly against that.  I'm not
aware that anyone has subsequently changed their position on the
topic.  Bruce is right to point out that we've done such things before
and can therefore do it again, but just because we have the technical
means to do it doesn't make it good policy.

That having been said, if we do decide to break it...

> Let's take the opportunity to change some other annoyances with the
> current GIN on-disk format:

...then fixing as much as possible in one go-round is clearly a good plan.

Let's see what options we have at all. I see the following:
1) Drop support for the old GIN on-disk format. But users will have to reindex after pg_upgrade.
2) Insert kluges into GIN to support both the old and new formats. So, kluges are kluges :) I don't see an elegant way to do it for now, because the formats are very different.
3) Upgrade the GIN on-disk format in pg_upgrade. However, that would mean rewriting almost the whole index. Is it much better than just reindexing?
4) Fork GIN2, leave GIN as is. It would lead to a lot of duplicated code.
Any other options?

I came to the idea that I like option #4 more than option #2.
If we try to make the new GIN work with the old page format, we have to maintain 3 use cases:
1) old GIN with the old page format (because of old releases)
2) new GIN with the old page format
3) new GIN with the new page format

If we create GIN2, we maintain only 2 use cases:
1) old GIN with the old page format
2) new GIN with the new page format
The old GIN code would be additional code in 9.4, but not additional code to maintain, because we maintain exactly the same code in the old releases anyway.

The problem I see is how to migrate users to GIN2. We can't expect them to read the release notes, create GIN2 indexes, and drop the GIN1 indexes. A lot of users will still use GIN1, because they don't care :)
Ideally, any new GIN index would be GIN2, and a reindex would turn GIN1 into GIN2.

------
With best regards,
Alexander Korotkov. 

Re: GIN improvements part 1: additional information

From: Bruce Momjian
Date:
On Fri, Oct  4, 2013 at 02:23:33AM +0400, Alexander Korotkov wrote:
> I came to the idea that I like option #4 more than option #2.
> If we try to make the new GIN work with the old page format, we have to
> maintain 3 use cases:
> 1) old GIN with the old page format (because of old releases)
> 2) new GIN with the old page format
> 3) new GIN with the new page format
> 
> If we create GIN2, we maintain only 2 use cases:
> 1) old GIN with the old page format
> 2) new GIN with the new page format
> The old GIN code would be additional code in 9.4, but not additional code to
> maintain, because we maintain exactly the same code in the old releases anyway.
> 
> The problem I see is how to migrate users to GIN2. We can't expect them to
> read the release notes, create GIN2 indexes, and drop the GIN1 indexes. A lot
> of users will still use GIN1, because they don't care :)
> Ideally, any new GIN index would be GIN2, and a reindex would turn GIN1 into
> GIN2.

I am not sure I like the complexity of a GIN2, but we should give this
problem some serious thought as it will affect how we deal with other
on-disk index changes in the future.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: GIN improvements part 1: additional information

From: Heikki Linnakangas
Date:
Aside from the pg_upgrade discussion, here's an updated version of the
patch, rebased over master. It also contains some other misc refactoring
I've done while reading through the patch. I haven't tested this much, I
may well have also broken something, but I wanted to post an update
before the weekend.

Thinking about the page format, I think we should start using the
pd_lower/upper pointers in the data page format. For a non-leaf page,
pd_upper would always point to the beginning of the special area, and
pd_lower would indicate the end of PostingItems. For a leaf page,
pd_lower would indicate the end of the compressed posting list, and
pd_upper would point to the "leaf-index" at the end of the page. That
matches the standard page layout in the sense that the space between
pd_lower and pd_upper is free, although the data stored in the non-free
areas would be quite different. That would allow us to mark full-page
images with buffer_std, allowing the "gap" to be left out. I think that
would be a more natural way to keep track of the used/unused space on
the page, anyway, compared to the current maxoff/endoffset field in the
special area.
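
Under that scheme, the free-space computation on a data leaf page reduces to the standard-page idiom (sketch; the function name is hypothetical):

#include "postgres.h"
#include "storage/bufpage.h"

/* With the compressed posting list ending at pd_lower and the leaf index
 * starting at pd_upper, the free space is the standard gap between them. */
static Size
ginDataLeafPageGetFreeSpace(Page page)
{
    PageHeader  phdr = (PageHeader) page;

    return phdr->pd_upper - phdr->pd_lower;
}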

In the attached patch, I in fact already did that for data leaf pages,
but didn't change the format of non-leaf pages yet. If we want to
support pg_upgrade, we might want to refrain from changing the non-leaf
format.

- Heikki

Attachments

Re: GIN improvements part 1: additional information

From: Alexander Korotkov
Date:
On Fri, Oct 4, 2013 at 12:28 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
Aside from the pg_upgrade discussion, here's an updated version of the patch, rebased over master. It also contains some other misc refactoring I've done while reading through the patch. I haven't tested this much, I may well have also broken something, but I wanted to post an update before the weekend.

Thinking about the page format, I think we should start using the pd_lower/upper pointers in the data page format. For a non-leaf page, pd_upper would always point to the beginning of the special area, and pd_lower would indicate the end of PostingItems. For a leaf page, pd_lower would indicate the end of the compressed posting list, and pd_upper would point to the "leaf-index" at the end of the page. That matches the standard page layout in the sense that the space between pd_lower and pd_upper is free, although the data stored in the non-free areas would be quite different. That would allow us to mark full-page images with buffer_std, allowing the "gap" to be left out. I think that would be a more natural way to keep track of the used/unused space on the page, anyway, compared to the current maxoff/endoffset field in the special area.

In the attached patch, I in fact already did that for data leaf pages, but didn't change the format of non-leaf pages yet. If we want to support pg_upgrade, we might want to refrain from changing the non-leaf format.

In GinDataLeafPageGetPostingList* you use sizeof(ItemPointerData) without MAXALIGN. Is it an error, or do you deliberately use 2 extra bytes on the leaf page (sizeof(ItemPointerData) is 6, while MAXALIGN would round it up to 8)?

------
With best regards,
Alexander Korotkov.

Re: GIN improvements part 1: additional information

From: Tomas Vondra
Date:
On 4.10.2013 10:28, Heikki Linnakangas wrote:
> Aside from the pg_upgrade discussion, here's an updated version of the
> patch, rebased over master. It also contains some other misc refactoring
> I've done while reading through the patch. I haven't tested this much, I
> may well have also broken something, but I wanted to post an update
> before the weekend.
>
> Thinking about the page format, I think we should start using the
> pd_lower/upper pointers in the data page format. For a non-leaf page,
> pd_upper would always point to the beginning of the special area, and
> pd_lower would indicate the end of PostingItems. For a leaf page,
> pd_lower would indicate the end of the compressed posting list, and
> pd_upper would point to the "leaf-index" at the end of the page. That
> matches the standard page layout in the sense that the space between
> pd_lower and pd_upper is free, although the data stored in the non-free
> areas would be quite different. That would allow us to mark full-page
> images with buffer_std, allowing the "gap" to be left out. I think that
> would be a more natural way to keep track of the used/unused space on
> the page, anyway, compared to the current maxoff/endoffset field in the
> special area.
>
> In the attached patch, I in fact already did that for data leaf pages,
> but didn't change the format of non-leaf pages yet. If we want to
> support pg_upgrade, we might want to refrain from changing the non-leaf
> format.
>
> - Heikki

Hi,

I've attempted to rerun the benchmarks tests I did a few weeks ago, but
I got repeated crashes when loading the data (into a table with
tsvector+gin index).

Right before a crash, there's this message in the log:

   PANIC:  not enough space in leaf page!

The exact crash varies though - for example this is the backtrace on the
first gdb attempt (second crash):

Program received signal SIGABRT, Aborted.
0x00007fb3c0b40b15 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007fb3c0b40b15 in raise () from /lib64/libc.so.6
#1  0x00007fb3c0b41f96 in abort () from /lib64/libc.so.6
#2  0x000000000072753b in errfinish ()
#3  0x0000000000729c1a in elog_finish ()
#4  0x00000000004721d6 in dataPlaceToPage ()
#5  0x0000000000473d51 in ginInsertValue ()
#6  0x0000000000473262 in ginInsertItemPointers ()
#7  0x000000000046d915 in ginEntryInsert ()
#8  0x00000000004792bf in ginInsertCleanup ()
#9  0x0000000000479b32 in ginHeapTupleFastInsert ()
#10 0x000000000046e0b4 in gininsert ()
#11 0x000000000072d733 in FunctionCall6Coll ()
#12 0x00000000004977bc in index_insert ()
#13 0x00000000005952ad in ExecInsertIndexTuples ()
#14 0x00000000005a235d in ExecModifyTable ()
#15 0x000000000058c4e8 in ExecProcNode ()
#16 0x0000000000589ab0 in standard_ExecutorRun ()
#17 0x00000000006603cf in ProcessQuery ()
#18 0x00000000006605fc in PortalRunMulti ()
#19 0x0000000000661142 in PortalRun ()
#20 0x000000000065e4c3 in PostgresMain ()
#21 0x0000000000465446 in ServerLoop ()
#22 0x000000000061f553 in PostmasterMain ()
#23 0x0000000000465cd5 in main ()

Then I recompiled the sources with "-ggdb" and I got this:

Program received signal SIGQUIT, Quit.
ResourceOwnerEnlargeSnapshots (owner=0x200000001) at resowner.c:1077
1077    {
(gdb) bt
#0  ResourceOwnerEnlargeSnapshots (owner=0x200000001) at resowner.c:1077
#1  0x0000000000872228 in RegisterSnapshotOnOwner (snapshot=0x1f9ac60,
snapshot@entry=<error reading variable: Cannot access memory at address
0x7fff6b6151c8>, owner=0x1e81960) at snapmgr.c:588

and then this:

Program received signal SIGQUIT, Quit.
0x00007f48d298b8f2 in recv () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f48d298b8f2 in recv () from /lib64/libc.so.6
#1  0x000000000062ec6b in secure_read (port=0x1a110b0, ptr=0xc5f840
<PqRecvBuffer>, len=8192) at be-secure.c:304
#2  0x0000000000639e31 in pq_recvbuf () at pqcomm.c:854
#3  0x0000000000639ec9 in pq_getbyte () at pqcomm.c:895
#4  0x0000000000723743 in SocketBackend (inBuf=0x7fff3522f9f0) at
postgres.c:344
#5  0x0000000000723b22 in ReadCommand (inBuf=0x7fff3522f9f0) at
postgres.c:492
#6  0x0000000000728218 in PostgresMain (argc=1, argv=0x19f4b50,
dbname=0x19f4b38 "archie", username=0x19f4b20 "tomas") at postgres.c:3953
#7  0x00000000006d27ae in BackendRun (port=0x1a110b0) at postmaster.c:4083
#8  0x00000000006d1f84 in BackendStartup (port=0x1a110b0) at
postmaster.c:3772
#9  0x00000000006cea47 in ServerLoop () at postmaster.c:1583
#10 0x00000000006ce150 in PostmasterMain (argc=3, argv=0x19f2a10) at
postmaster.c:1239
#11 0x000000000063c0a8 in main (argc=3, argv=0x19f2a10) at main.c:196

So it crashes at various times, although now that I'm looking into the
log I see this:

LOG:  server process (PID 12326) was terminated by signal 6: Aborted
DETAIL:  Failed process was running: autovacuum: ANALYZE public.messages
LOG:  terminating any other active server processes

So in some cases it was autovacuum that crashed, and the other processes
were killed because of corrupted shared memory.

But in all cases there's 'not enough space in leaf page!' right before
the crash.

The behavior after the crash is pretty consistent too - I do get this:

LOG:  startup process (PID 26259) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure
LOG:  database system was interrupted while in recovery at 2013-10-06
01:26:48 CEST
HINT:  This probably means that some data is corrupted and you will have
to use the last backup for recovery.
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  redo starts at 0/2B0094B0
LOG:  startup process (PID 12237) was terminated by signal 11:
Segmentation fault

or this:

LOG:  select() failed in postmaster: Nepřípustný argument ("Invalid argument")
LOG:  database system was interrupted; last known up at 2013-10-06
01:39:38 CEST
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  redo starts at 0/A013300
LOG:  startup process (PID 12441) was terminated by signal 11:
Segmentation fault
LOG:  aborting startup due to startup process failure

and then I have to reinit the whole cluster because the xlog is corrupted.

Attached is a log from the multiple runs (and crashes). The test
basically parses mailing list archives and inserts them into a table.
The unique constraint violations are OK (i.e. expected).

Tomas

Attachments

Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
Hi, Tomas!

On Sun, Oct 6, 2013 at 3:58 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
I've attempted to rerun the benchmarks tests I did a few weeks ago, but
I got repeated crashes when loading the data (into a table with
tsvector+gin index).

Right before a crash, theres this message in the log:

   PANIC:  not enough space in leaf page!

Thanks for testing. Heikki's version of the patch doesn't work for me either, even on much simpler examples. I can try to get it working once he answers my question about the GinDataLeafPageGetPostingList* macros.

------
With best regards,
Alexander Korotkov.

Re: GIN improvements part 1: additional information

From
Heikki Linnakangas
Date:
On 04.10.2013 14:13, Alexander Korotkov wrote:
> On Fri, Oct 4, 2013 at 12:28 PM, Heikki Linnakangas<hlinnakangas@vmware.com
>> wrote:
>
>> In the attached patch, I in fact already did that for data leaf pages, but
>> didn't change the format of non-leaf pages yet. If we want to support
>> pg_upgrade, we might want to refrain from changing the non-leaf format.
>
> In GinDataLeafPageGetPostingList* you use sizeof(ItemPointerData) without
> MAXALIGN. Is it an error or you especially use 2 extra bytes on leaf page?

I didn't even think of it. Now that I do think of it, I don't see a 
reason to use MAXALIGN there. PostingItems only require 2-byte 
alignment. It's a bit fragile and underdocumented though. It probably 
would be good to have a struct to represent that layout. Something like:

struct
{
   ItemPointerData rightBound;
   PostingItem postingItems[1]; /* variable length array */
} PostingItemPageContent;

And use that struct in the macros.
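
For instance, something along these lines (just a sketch; PageGetContents
is the standard bufpage.h accessor, but the macro names here are made up):

#define GinDataPageGetRightBound(page) \
    (&((PostingItemPageContent *) PageGetContents(page))->rightBound)
#define GinDataPageGetPostingItem(page, i) \
    (&((PostingItemPageContent *) PageGetContents(page))->postingItems[(i) - 1])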

Then again, we do currently use MAXALIGN there, so if we want to avoid 
changing the on-disk format, we have to keep it...

- Heikki



Re: GIN improvements part 1: additional information

From
Heikki Linnakangas
Date:
On 08.10.2013 17:47, Alexander Korotkov wrote:
> Hi, Tomas!
>
> On Sun, Oct 6, 2013 at 3:58 AM, Tomas Vondra<tv@fuzzy.cz>  wrote:
>
>> I've attempted to rerun the benchmarks tests I did a few weeks ago, but
>>   I got repeated crashes when loading the data (into a table with
>> tsvector+gin index).
>>
>> Right before a crash, theres this message in the log:
>>
>>     PANIC:  not enough space in leaf page!
>>
>
> Thanks for testing. Heikki's version of patch don't works for me too on
> even much more simplier examples. I can try to get it working if he answer
> my question about GinDataLeafPageGetPostingList* macros.

The new macros in that patch version were quite botched. Here's a new
attempt.

- Heikki

Attachments

Re: GIN improvements part 1: additional information

From
Tomas Vondra
Date:
On 8.10.2013 21:59, Heikki Linnakangas wrote:
> On 08.10.2013 17:47, Alexander Korotkov wrote:
>> Hi, Tomas!
>>
>> On Sun, Oct 6, 2013 at 3:58 AM, Tomas Vondra<tv@fuzzy.cz>  wrote:
>>
>>> I've attempted to rerun the benchmarks tests I did a few weeks ago, but
>>>   I got repeated crashes when loading the data (into a table with
>>> tsvector+gin index).
>>>
>>> Right before a crash, theres this message in the log:
>>>
>>>     PANIC:  not enough space in leaf page!
>>>
>>
>> Thanks for testing. Heikki's version of patch don't works for me too on
>> even much more simplier examples. I can try to get it working if he
>> answer
>> my question about GinDataLeafPageGetPostingList* macros.
> 
> The new macros in that patch version were quite botched. Here's a new
> attempt.

Nope, still the same errors :-(

PANIC:  not enough space in leaf page!
LOG:  server process (PID 29722) was terminated by signal 6: Aborted
DETAIL:  Failed process was running: autovacuum: ANALYZE public.messages

regards
Tomas



Re: GIN improvements part 1: additional information

From
Heikki Linnakangas
Date:
On 09.10.2013 02:04, Tomas Vondra wrote:
> On 8.10.2013 21:59, Heikki Linnakangas wrote:
>> On 08.10.2013 17:47, Alexander Korotkov wrote:
>>> Hi, Tomas!
>>>
>>> On Sun, Oct 6, 2013 at 3:58 AM, Tomas Vondra<tv@fuzzy.cz>   wrote:
>>>
>>>> I've attempted to rerun the benchmarks tests I did a few weeks ago, but
>>>>    I got repeated crashes when loading the data (into a table with
>>>> tsvector+gin index).
>>>>
>>>> Right before a crash, theres this message in the log:
>>>>
>>>>      PANIC:  not enough space in leaf page!
>>>>
>>>
>>> Thanks for testing. Heikki's version of patch don't works for me too on
>>> even much more simplier examples. I can try to get it working if he
>>> answer
>>> my question about GinDataLeafPageGetPostingList* macros.
>>
>> The new macros in that patch version were quite botched. Here's a new
>> attempt.
>
> Nope, still the same errors :-(
>
> PANIC:  not enough space in leaf page!
> LOG:  server process (PID 29722) was terminated by signal 6: Aborted
> DETAIL:  Failed process was running: autovacuum: ANALYZE public.messages

I've continued hacking away at the patch, here's yet another version.
I've done a lot of cleanup and refactoring to make the code more
readable (I hope). I'm not sure what caused the panic you saw, but it's
probably fixed now.  Let me know if not.

Some notable changes since my last patch version:

* I changed the ginbtree interface so that the isEnoughSpace method is no
more. Instead, placeToPage returns false without doing anything if the
item doesn't fit (see the sketch after this list). It was slightly
simpler this way when working with the compressed posting lists, as
placeToPage had to calculate tuple sizes anyway to decide how many items
fit on the page.

* I rewrote incrUpdateItemIndexes(). It now operates in two stages: 1.
adjust the pageOffsets and 2. split the bin. I found it more readable
this way.

* I changed the WAL format of insert records so that there is a separate
struct for data-leaf, data-internal, and entry pages. The information
recorded for each of those was so different that it was confusing to
cram them all into the same struct.
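
To illustrate the revised placeToPage contract, a rough sketch (the real
signature has more arguments, and the helper names here are invented):

static bool
dataPlaceToPageLeaf(GinBtree btree, Page page, void *insertdata)
{
    /* hypothetical helper: content size of the page after the insert */
    Size    newsize = computeSizeAfterInsert(page, insertdata);

    if (newsize > GinDataLeafMaxContentSize)    /* hypothetical limit */
        return false;    /* doesn't fit; the caller splits the page */

    /* ... place the item, update pd_lower, and return success ... */
    return true;
}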


I'm going to step back a bit from the details of the patch now. I think
there's enough work for you, Alexander, until the next commitfest:

* Try to make the code also work with the old page format, so that you
don't need to REINDEX after pg_upgrade.

* Documentation. The new compressed posting list format needs to be
explained somewhere. At the top of ginpostinglist.c would be a good
place. The README should be improved too.

* Fix whatever I broke (sorry)

Are we committed to going ahead with this patch in the 9.4 timeframe, in
one form or another? If we are, then I'd like to start committing
refactoring patches that just move code around in preparation for the
patch, to make the review of the core of the patch easier. For example,
merging the isEnoughSpace and placeToPage methods in the ginbtree
interface. Without the patch, it's unnecessary code churn - it won't do
any harm but it won't do any good either. I'm pretty confident that this
patch will land in the 9.4 timeframe, so barring objections I'll start
committing such refactorings.

- Heikki

Attachments

Re: GIN improvements part 1: additional information

From
Peter Eisentraut
Date:
You linked to this email from the commitfest entry, but there is no
patch here.  You probably meant a different email.  Please check.


On Tue, 2013-10-08 at 21:48 +0300, Heikki Linnakangas wrote:
> On 04.10.2013 14:13, Alexander Korotkov wrote:
> > On Fri, Oct 4, 2013 at 12:28 PM, Heikki Linnakangas<hlinnakangas@vmware.com
> >> wrote:
> >
> >> In the attached patch, I in fact already did that for data leaf pages, but
> >> didn't change the format of non-leaf pages yet. If we want to support
> >> pg_upgrade, we might want to refrain from changing the non-leaf format.
> >
> > In GinDataLeafPageGetPostingList* you use sizeof(ItemPointerData) without
> > MAXALIGN. Is it an error or you especially use 2 extra bytes on leaf page?
> 
> I didn't even think of it. Now that I do think of it, I don't see a 
> reason to use MAXALIGN there. PostingItems only require 2-byte 
> alignment. It's a bit fragile and underdocumented though. It probably 
> would be good to have a struct to represent that layout. Something like:
> 
> struct
> {
>    ItemPointerData rightBound;
>    PostingItem postingItems[1]; /* variable length array */
> } PostingItemPageContent;
> 
> And use that struct in the macros.
> 
> Then again, we do currently use MAXALIGN there, so if we want to avoid 
> changing the on-disk format, we have to keep it...
> 
> - Heikki
> 
> 






Re: GIN improvements part 1: additional information

From
Tomas Vondra
Date:
On 10.10.2013 13:57, Heikki Linnakangas wrote:
> On 09.10.2013 02:04, Tomas Vondra wrote:
>> On 8.10.2013 21:59, Heikki Linnakangas wrote:
>>> On 08.10.2013 17:47, Alexander Korotkov wrote:
>>>> Hi, Tomas!
>>>>
>>>> On Sun, Oct 6, 2013 at 3:58 AM, Tomas Vondra<tv@fuzzy.cz>   wrote:
>>>>
>>>>> I've attempted to rerun the benchmarks tests I did a few weeks ago,
>>>>> but
>>>>>    I got repeated crashes when loading the data (into a table with
>>>>> tsvector+gin index).
>>>>>
>>>>> Right before a crash, theres this message in the log:
>>>>>
>>>>>      PANIC:  not enough space in leaf page!
>>>>>
>>>>
>>>> Thanks for testing. Heikki's version of patch don't works for me too on
>>>> even much more simplier examples. I can try to get it working if he
>>>> answer
>>>> my question about GinDataLeafPageGetPostingList* macros.
>>>
>>> The new macros in that patch version were quite botched. Here's a new
>>> attempt.
>>
>> Nope, still the same errors :-(
>>
>> PANIC:  not enough space in leaf page!
>> LOG:  server process (PID 29722) was terminated by signal 6: Aborted
>> DETAIL:  Failed process was running: autovacuum: ANALYZE public.messages
> 
> I've continued hacking away at the patch, here's yet another version.
> I've done a lot of cleanup and refactoring to make the code more
> readable (I hope). I'm not sure what caused the panic you saw, but it's
> probably fixed now.  Let me know if not.

Yup, this version fixed the issues. I haven't been able to do any
benchmarks yet, all I have is some basic stats
              |   HEAD   |  patched
======================================
load duration  |  1084 s  |   1086 s
subject index  |   96 MB  |     96 MB
body index     | 2349 MB  |   2051 MB

So there's virtually no difference in speed (which is expected, AFAIK)
and the large index on full message bodies is significantly smaller.

regards
Tomas





Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Sat, Oct 12, 2013 at 1:55 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 10.10.2013 13:57, Heikki Linnakangas wrote:
> On 09.10.2013 02:04, Tomas Vondra wrote:
>> On 8.10.2013 21:59, Heikki Linnakangas wrote:
>>> On 08.10.2013 17:47, Alexander Korotkov wrote:
>>>> Hi, Tomas!
>>>>
>>>> On Sun, Oct 6, 2013 at 3:58 AM, Tomas Vondra<tv@fuzzy.cz>   wrote:
>>>>
>>>>> I've attempted to rerun the benchmarks tests I did a few weeks ago,
>>>>> but
>>>>>    I got repeated crashes when loading the data (into a table with
>>>>> tsvector+gin index).
>>>>>
>>>>> Right before a crash, theres this message in the log:
>>>>>
>>>>>      PANIC:  not enough space in leaf page!
>>>>>
>>>>
>>>> Thanks for testing. Heikki's version of patch don't works for me too on
>>>> even much more simplier examples. I can try to get it working if he
>>>> answer
>>>> my question about GinDataLeafPageGetPostingList* macros.
>>>
>>> The new macros in that patch version were quite botched. Here's a new
>>> attempt.
>>
>> Nope, still the same errors :-(
>>
>> PANIC:  not enough space in leaf page!
>> LOG:  server process (PID 29722) was terminated by signal 6: Aborted
>> DETAIL:  Failed process was running: autovacuum: ANALYZE public.messages
>
> I've continued hacking away at the patch, here's yet another version.
> I've done a lot of cleanup and refactoring to make the code more
> readable (I hope). I'm not sure what caused the panic you saw, but it's
> probably fixed now.  Let me know if not.

Yup, this version fixed the issues. I haven't been able to do any
benchmarks yet, all I have is some basic stats

               |   HEAD   |  patched
======================================
load duration  |  1084 s  |   1086 s
subject index  |   96 MB  |     96 MB
body index     | 2349 MB  |   2051 MB

So there's virtually no difference in speed (which is expected, AFAIK)
and the large index on full message bodies is significantly smaller.

Yes, there should be no significant difference in speed. But the difference in index sizes seems too small. Could you share the database dump somewhere?

------
With best regards,
Alexander Korotkov.

Re: GIN improvements part 1: additional information

From
Tomas Vondra
Date:
On 12.10.2013 12:11, Alexander Korotkov wrote:
> On Sat, Oct 12, 2013 at 1:55 AM, Tomas Vondra <tv@fuzzy.cz
> <mailto:tv@fuzzy.cz>> wrote:
> 
>     Yup, this version fixed the issues. I haven't been able to do any
>     benchmarks yet, all I have is some basic stats
> 
>                    |   HEAD   |  patched
>     ======================================
>     load duration  |  1084 s  |   1086 s
>     subject index  |   96 MB  |     96 MB
>     body index     | 2349 MB  |   2051 MB
> 
>     So there's virtually no difference in speed (which is expected, AFAIK)
>     and the large index on full message bodies is significantly smaller.
> 
> 
> Yes, it should be no significant difference in speed. But difference in
> index sizes seems to be too small. Could you share database dump somewhere?

Turns out that if I do VACUUM FULL after loading the data (a sequence of
INSERT commands), the index sizes drop significantly.
               |   HEAD   |  patched
======================================
subject index  |   42 MB  |    15 MB
body index     |  624 MB  |   328 MB

So there's a significant improvement, as expected. I'm wondering if the
bloat is expected too? Is that the consequence of incremental index
updates vs. rebuilding the whole index at once during VACUUM FULL?

Tomas



Re: GIN improvements part 1: additional information

From
Bruce Momjian
Date:
On Thu, Oct  3, 2013 at 05:24:49PM -0400, Bruce Momjian wrote:
> On Thu, Oct  3, 2013 at 02:48:20PM -0400, Robert Haas wrote:
> > On Thu, Oct 3, 2013 at 2:43 PM, Heikki Linnakangas
> > <hlinnakangas@vmware.com> wrote:
> > > It seems we've all but decided that we'll require reindexing GIN indexes in
> > > 9.4.
> > 
> > I thought the consensus in Ottawa was strongly against that.  I'm not
> > aware that anyone has subsequently changed their position on the
> > topic.  Bruce is right to point out that we've done such things before
> > and can therefore do it again, but just because we have the technical
> > means to do it doesn't make it good policy.
> > 
> > That having been said, if we do decide to break it...
> 
> Agreed.  I was stating only that this is easy for pg_upgrade.  One cool
> thing is that the upgrades completes, and the indexes are there, but
> just marked as invalid until the REINDEX.
> 
> One other point Alexander made is that the new GIN indexes will be
> smaller so most people would want the new format in the new cluster
> anyway.

I am in Moscow with Alexander and we were discussing GIN pg_upgrade
issues.  One option we have already discussed was to take the old GIN
index code and put it in a separate directory, and call the new GIN
index something different, but that got hung up on how users were going
to create indexes of the new type.

One nice trick would be for the index creation code to check the global
variable IsBinaryUpgrade that pg_upgrade sets.  If CREATE INDEX ... GIN
finds IsBinaryUpgrade set, it should create an (empty) index of type
gin-v1/old.  If it is not set, it should create a new gin index.  This
will allow pg_upgrade to work, and allow REINDEX to create a new-type
GIN index from an old one.
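
As a sketch of that idea (the gin_v1 access method name and the exact
hook point are invented here):

/* during CREATE INDEX, around the access method lookup: */
if (IsBinaryUpgrade && strcmp(accessMethodName, "gin") == 0)
    accessMethodName = "gin_v1";    /* keep the old on-disk format */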

We would need to append "-v1" to the old index directory, system
catalog, and function names.  We could then remove the old GIN index
code in some later major release, and pg_upgrade will then mark the
indexes as invalid and create a REINDEX script.

This allows users to reindex their GIN indexes over time, but it doesn't
lock us into keeping the gin-v1 code around forever.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: GIN improvements part 1: additional information

From
Bruce Momjian
Date:
On Sat, Oct 19, 2013 at 05:11:55PM -0400, Bruce Momjian wrote:
> I am in Moscow with Alexander and we were discussing GIN pg_upgrade
> issues.  One option we have already discussed was to take the old GIN
> index code and put it in a separate directory, and call the new GIN
> index something different, but that got hung up on how users were going
> to create indexes of the new type.
> 
> One nice trick would be for the index creation code to check the global
> variable IsBinaryUpgrade that pg_upgrade sets.  If CREATE INDEX ... GIN
> finds IsBinaryUpgrade set, it should create an (empty) index of type
> gin-v1/old.  If it is not set, it should create a new gin index.  This
> will allow pg_upgrade to work, and allow REINDEX to create a new-type
> GIN index from an old one.
> 
> We would need to append "-v1" to the old index directory, system
> catalog, and function names.  We could then remove the old GIN index
> code in some later major release, and pg_upgrade will then mark the
> indexes as invalid and create a REINDEX script.
> 
> This allows users to reindex their GIN indexes over time, but it doesn't
> lock us into keeping the gin-v1 code around forever.

Correction --- the above is not going to work because the backend isn't
going to know, during a binary upgrade, if CREATE INDEX ... GIN is
coming from a 9.3 or 9.4 database.  We are going to need pg_dump code to
do the mapping, and also pg_dump code to map gin_v1 back to gin once we
remove the gin_v1 code.  We will also need index creation code to map
gin_v1 to gin to properly handle REINDEX and CLUSTER, but only if not in
binary upgrade mode.

If it would help, I would be happy to write a simple patch for the above
items once I return from Europe in November.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Thu, Oct 10, 2013 at 3:57 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 09.10.2013 02:04, Tomas Vondra wrote:
On 8.10.2013 21:59, Heikki Linnakangas wrote:
On 08.10.2013 17:47, Alexander Korotkov wrote:
Hi, Tomas!

On Sun, Oct 6, 2013 at 3:58 AM, Tomas Vondra<tv@fuzzy.cz>   wrote:

I've attempted to rerun the benchmarks tests I did a few weeks ago, but
   I got repeated crashes when loading the data (into a table with
tsvector+gin index).

Right before a crash, theres this message in the log:

     PANIC:  not enough space in leaf page!


Thanks for testing. Heikki's version of patch don't works for me too on
even much more simplier examples. I can try to get it working if he
answer
my question about GinDataLeafPageGetPostingList* macros.

The new macros in that patch version were quite botched. Here's a new
attempt.

Nope, still the same errors :-(

PANIC:  not enough space in leaf page!
LOG:  server process (PID 29722) was terminated by signal 6: Aborted
DETAIL:  Failed process was running: autovacuum: ANALYZE public.messages

I've continued hacking away at the patch, here's yet another version. I've done a lot of cleanup and refactoring to make the code more readable (I hope). I'm not sure what caused the panic you saw, but it's probably fixed now.  Let me know if not.

Some notable changes since my last patch version:

* I changed the ginbtree interface so that isEnoughSpace method is no more. Instead, placeToPage returns false without doing anything if the item doesn't fit. It was slightly simpler this way when working with the compressed posting lists, as placeToPage had to calculate tuple sizes anyway to decide how many items fit on the page.

* I rewrote incrUpdateItemIndexes(). It now operates in two stages: 1. adjust the pageOffsets and 2. split the bin. I found it more readable this way.

* I changed the WAL format of insert records so that there is a separate struct for data-leaf, data-internal, and entry pages. The information recorded for each of those was so different that it was confusing to cram them all into the same struct.


I'm going to step back a bit from the details of the patch now. I think there's enough work for you, Alexander, until the next commitfest:

* Try to make the code also work with the old page format, so that you don't need to REINDEX after pg_upgrade.

* Documentation. The new compressed posting list format needs to be explained somewhere. At the top of ginpostinglist.c would be a good place. The README should be improved too.

* Fix whatever I broke (sorry)

Are we committed to go ahead with this patch in 9.4 timeframe, in one form or another? If we are, then I'd like to start committing refactoring patches that just move code around in preparation for the Patch, to make the review of the core of the patch easier. For example, merging the isEnoughSpace and placeToPage methods in the ginbtree interface. Without the patch, it's unnecessary code churn - it won't do any harm but it won't do any good either. I'm pretty confident that this patch will land in the 9.4 timeframe, so barring objections I'll start committing such refactorings.

The attached version of the patch is debugged. I would like to note that the number of bugs was low and it wasn't very hard to debug. I've rerun the tests on it. You can see that the numbers have improved as a result of your refactoring.

         event         |     period
-----------------------+-----------------
 index_build           | 00:01:45.416822
 index_build_recovery  | 00:00:06
 index_update          | 00:05:17.263606
 index_update_recovery | 00:01:22
 search_new            | 00:24:07.706482
 search_updated        | 00:26:25.364494
(6 rows)

     label      | blocks_mark
----------------+-------------
 search_new     |   847587636
 search_updated |   881210486
(2 rows)

     label     |   size
---------------+-----------
 new           | 419299328
 after_updates | 715915264
(2 rows)

Besides simple bugs, there was a subtle bug in the incremental item-indexes update. I've added some more comments, including an ASCII picture of how item indexes are shifted.

Now I'm trying to implement support for the old page format. Then we can decide which approach to use.

------
With best regards,
Alexander Korotkov.
Attachments

Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Mon, Oct 21, 2013 at 11:12 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Attached version of patch is debugged. I would like to note that number of bugs was low and it wasn't very hard to debug. I've rerun tests on it. You can see that numbers are improved as the result of your refactoring.

         event         |     period
-----------------------+-----------------
 index_build           | 00:01:45.416822
 index_build_recovery  | 00:00:06
 index_update          | 00:05:17.263606
 index_update_recovery | 00:01:22
 search_new            | 00:24:07.706482
 search_updated        | 00:26:25.364494
(6 rows)

     label      | blocks_mark
----------------+-------------
 search_new     |   847587636
 search_updated |   881210486
(2 rows)

     label     |   size
---------------+-----------
 new           | 419299328
 after_updates | 715915264
(2 rows)

Beside simple bugs, there was a subtle bug in incremental item indexes update. I've added some more comments including ASCII picture about how item indexes are shifted.

Now, I'm trying to implement support of old page format. Then we can decide which approach to use.

The attached version of the patch supports the old page format. The patch still needs more documentation and probably some refactoring, but I believe the idea is clear and it can be reviewed. In this patch I had to revert the change of null category placement, for compatibility with the old posting list format.

------
With best regards,
Alexander Korotkov.
Attachments

Re: GIN improvements part 1: additional information

От
Heikki Linnakangas
Дата:
On 04.11.2013 23:44, Alexander Korotkov wrote:
> On Mon, Oct 21, 2013 at 11:12 PM, Alexander Korotkov
> <aekorotkov@gmail.com>wrote:
>
>> Attached version of patch is debugged. I would like to note that number of
>> bugs was low and it wasn't very hard to debug. I've rerun tests on it. You
>> can see that numbers are improved as the result of your refactoring.
>>
>>           event         |     period
>> -----------------------+-----------------
>>   index_build           | 00:01:45.416822
>>   index_build_recovery  | 00:00:06
>>   index_update          | 00:05:17.263606
>>   index_update_recovery | 00:01:22
>>   search_new            | 00:24:07.706482
>>   search_updated        | 00:26:25.364494
>> (6 rows)
>>
>>       label      | blocks_mark
>> ----------------+-------------
>>   search_new     |   847587636
>>   search_updated |   881210486
>> (2 rows)
>>
>>       label     |   size
>> ---------------+-----------
>>   new           | 419299328
>>   after_updates | 715915264
>> (2 rows)
>>
>> Beside simple bugs, there was a subtle bug in incremental item indexes
>> update. I've added some more comments including ASCII picture about how
>> item indexes are shifted.
>>
>> Now, I'm trying to implement support of old page format. Then we can
>> decide which approach to use.
>>
>
> Attached version of patch has support of old page format. Patch still needs
> more documentation and probably refactoring, but I believe idea is clear
> and it can be reviewed. In the patch I have to revert change of null
> category placement for compatibility with old posting list format.

Thanks, just glanced at this quickly.

If I'm reading the patch correctly, old-style non-compressed posting 
tree leaf pages are not vacuumed at all; that's clearly not right.

I argued earlier that we don't need to handle the case that compressing 
a page makes it larger, so that the existing items don't fit on the page 
anymore. I'm having some second thoughts on that; I didn't take into 
account the fact that the "mini-index" in the new page format takes up 
some space. I think it's still highly unlikely that there isn't enough 
space on an 8k page, but it's not totally crazy with a non-standard small
page size. So at least that situation needs to be checked for with an 
ereport(), rather than Assert.
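
Roughly along these lines (a sketch; the limit macro and the message
wording are assumptions):

if (newSize > GinDataLeafMaxContentSize)    /* hypothetical limit */
    ereport(ERROR,
            (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
             errmsg("not enough space on GIN page to recompress posting list")));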

To handle splitting a non-compressed page, it seems that you're relying 
on the fact that before splitting, we try to insert, which compresses 
the page. The problem with that is that it's not correctly WAL-logged. 
The split record that follows includes a full copy of both page halves, 
but if the split fails for some reason, e.g. you run out of disk space,
there is no WAL record at all of the compression. I'd suggest doing
the compression in the insert phase on a temporary copy of the page, 
leaving the original page untouched if there isn't enough space for the 
insertion to complete. (You could argue that this case can't happen 
because the compression must always create enough space to insert one 
more item. Maybe so, but at least there should be an explicit check for
that.)
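
In pseudo-C, the suggestion amounts to something like this
(ginCompressLeafPage and placeItemOnPage are invented names, not patch
code; PageGetTempPageCopy is the standard bufpage.h helper):

tmppage = PageGetTempPageCopy(page);    /* work on a scratch copy */
ginCompressLeafPage(tmppage);
if (!placeItemOnPage(tmppage, newItem))
    return false;    /* original page untouched, nothing to WAL-log */
/* on success: enter a critical section, copy tmppage back, WAL-log the insert */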

- Heikki



Re: GIN improvements part 1: additional information

From
Heikki Linnakangas
Date:
On 04.11.2013 23:44, Alexander Korotkov wrote:
> Attached version of patch has support of old page format. Patch still needs
> more documentation and probably refactoring, but I believe idea is clear
> and it can be reviewed. In the patch I have to revert change of null
> category placement for compatibility with old posting list format.

I committed some of the refactorings and cleanup included in this patch.
Attached is a rebased version that applies over current head. I hope I
didn't joggle your elbow...

- Heikki


Attachments

Re: GIN improvements part 1: additional information

From
Alvaro Herrera
Date:
Heikki Linnakangas escribió:
> On 04.11.2013 23:44, Alexander Korotkov wrote:
> >Attached version of patch has support of old page format. Patch still needs
> >more documentation and probably refactoring, but I believe idea is clear
> >and it can be reviewed. In the patch I have to revert change of null
> >category placement for compatibility with old posting list format.
> 
> I committed some of the refactorings and cleanup included in this
> patch. Attached is a rebased version that applies over current head.
> I hope I didn't joggle your elbow..

Just for my own illumination, can someone explain this bit?

+ If a posting list is too large to store in-line in a key entry, a posting tree
+ is created. A posting tree is a B-tree structure, where the ItemPointer is
+ used as the key. At the leaf-level, item pointers are stored compressed, in
+ "varbyte encoding".

I think the first ItemPointer mentioned (the key) refers to a TID
pointing to the index, and "item pointers stored compressed" refers to
the TIDs pointing to the heap (the data).  Is that correct?


I'm also interested in the "FIXME explain varbyte encoding" explanation
currently missing, if somebody can write it down ...

Thanks,

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Tue, Nov 5, 2013 at 9:49 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 04.11.2013 23:44, Alexander Korotkov wrote:
On Mon, Oct 21, 2013 at 11:12 PM, Alexander Korotkov
<aekorotkov@gmail.com>wrote:

Attached version of patch is debugged. I would like to note that number of
bugs was low and it wasn't very hard to debug. I've rerun tests on it. You
can see that numbers are improved as the result of your refactoring.

          event         |     period
-----------------------+-----------------
  index_build           | 00:01:45.416822
  index_build_recovery  | 00:00:06
  index_update          | 00:05:17.263606
  index_update_recovery | 00:01:22
  search_new            | 00:24:07.706482
  search_updated        | 00:26:25.364494
(6 rows)

      label      | blocks_mark
----------------+-------------
  search_new     |   847587636
  search_updated |   881210486
(2 rows)

      label     |   size
---------------+-----------
  new           | 419299328
  after_updates | 715915264
(2 rows)

Beside simple bugs, there was a subtle bug in incremental item indexes
update. I've added some more comments including ASCII picture about how
item indexes are shifted.

Now, I'm trying to implement support of old page format. Then we can
decide which approach to use.


Attached version of patch has support of old page format. Patch still needs
more documentation and probably refactoring, but I believe idea is clear
and it can be reviewed. In the patch I have to revert change of null
category placement for compatibility with old posting list format.

Thanks, just glanced at this quickly.

If I'm reading the patch correctly, old-style non-compressed posting tree leaf pages are not vacuumed at all; that's clearly not right.

Fixed. Now a separate function handles uncompressed posting lists and compresses them if at least one TID is deleted.
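
In outline (assuming a page flag that distinguishes the two formats, and
with invented function names), the vacuum path becomes:

if (GinPageIsCompressed(page))
    vacuumCompressedLeaf(page, gvs);     /* new-format path */
else
    vacuumUncompressedLeaf(page, gvs);   /* recompresses the page if any
                                          * TID was deleted */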
  
I argued earlier that we don't need to handle the case that compressing a page makes it larger, so that the existing items don't fit on the page anymore. I'm having some second thoughts on that; I didn't take into account the fact that the "mini-index" in the new page format takes up some space. I think it's still highly unlikely that there isn't enough space on a 8k page, but it's not totally crazy with a non-standard small page size. So at least that situation needs to be checked for with an ereport(), rather than Assert.
 
Now this situation is ereported before any change is made to the page.

To handle splitting a non-compressed page, it seems that you're relying on the fact that before splitting, we try to insert, which compresses the page. The problem with that is that it's not correctly WAL-logged. The split record that follows includes a full copy of both page halves, but if the split fails for some reason, e.g you run out of disk space, there is no WAL record at all of the the compression. I'd suggest doing the compression in the insert phase on a temporary copy of the page, leaving the original page untouched if there isn't enough space for the insertion to complete. (You could argue that this case can't happen because the compression must always create enough space to insert one more item. maybe so, but at least there should be an explicit check for that.)

Good point. Yes, if we don't specially handle inserting item indexes, I see no point in special handling for a single TID, which is much smaller. In the attached patch dataCompressLeafPage just reserves space for a single TID.

Also, many changes in comments and README.

Unfortunately, I didn't understand what the FIXME in ginVacuumEntryPage is about. So it's not fixed :)

------
With best regards,
Alexander Korotkov.
Attachments

Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Thu, Nov 14, 2013 at 2:17 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Nov 5, 2013 at 9:49 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 04.11.2013 23:44, Alexander Korotkov wrote:
On Mon, Oct 21, 2013 at 11:12 PM, Alexander Korotkov
<aekorotkov@gmail.com>wrote:

Attached version of patch is debugged. I would like to note that number of
bugs was low and it wasn't very hard to debug. I've rerun tests on it. You
can see that numbers are improved as the result of your refactoring.

          event         |     period
-----------------------+-----------------
  index_build           | 00:01:45.416822
  index_build_recovery  | 00:00:06
  index_update          | 00:05:17.263606
  index_update_recovery | 00:01:22
  search_new            | 00:24:07.706482
  search_updated        | 00:26:25.364494
(6 rows)

      label      | blocks_mark
----------------+-------------
  search_new     |   847587636
  search_updated |   881210486
(2 rows)

      label     |   size
---------------+-----------
  new           | 419299328
  after_updates | 715915264
(2 rows)

Beside simple bugs, there was a subtle bug in incremental item indexes
update. I've added some more comments including ASCII picture about how
item indexes are shifted.

Now, I'm trying to implement support of old page format. Then we can
decide which approach to use.


Attached version of patch has support of old page format. Patch still needs
more documentation and probably refactoring, but I believe idea is clear
and it can be reviewed. In the patch I have to revert change of null
category placement for compatibility with old posting list format.

Thanks, just glanced at this quickly.

If I'm reading the patch correctly, old-style non-compressed posting tree leaf pages are not vacuumed at all; that's clearly not right.

Fixed. Now separate function handles uncompressed posting lists and compress them if as least one TID is deleted.
  
I argued earlier that we don't need to handle the case that compressing a page makes it larger, so that the existing items don't fit on the page anymore. I'm having some second thoughts on that; I didn't take into account the fact that the "mini-index" in the new page format takes up some space. I think it's still highly unlikely that there isn't enough space on a 8k page, but it's not totally crazy with a non-standard small page size. So at least that situation needs to be checked for with an ereport(), rather than Assert.
 
Now this situation is ereported before any change in page.

To handle splitting a non-compressed page, it seems that you're relying on the fact that before splitting, we try to insert, which compresses the page. The problem with that is that it's not correctly WAL-logged. The split record that follows includes a full copy of both page halves, but if the split fails for some reason, e.g you run out of disk space, there is no WAL record at all of the the compression. I'd suggest doing the compression in the insert phase on a temporary copy of the page, leaving the original page untouched if there isn't enough space for the insertion to complete. (You could argue that this case can't happen because the compression must always create enough space to insert one more item. maybe so, but at least there should be an explicit check for that.)

Good point. Yes, if we don't handle specially inserting item indexes, I see no point to do special handling for single TID which is much smaller. In the attached patch dataCompressLeafPage just reserves space for single TID. 

Also, many changes in comments and README.

Unfortunally, I didn't understand what is FIXME about in ginVacuumEntryPage. So, it's not fixed :)

Sorry, I posted a buggy version of the patch. The attached version is correct.

------
With best regards,
Alexander Korotkov.
Attachments

Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Thu, Nov 14, 2013 at 3:00 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Thu, Nov 14, 2013 at 2:17 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Tue, Nov 5, 2013 at 9:49 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 04.11.2013 23:44, Alexander Korotkov wrote:
On Mon, Oct 21, 2013 at 11:12 PM, Alexander Korotkov
<aekorotkov@gmail.com>wrote:

Attached version of patch is debugged. I would like to note that number of
bugs was low and it wasn't very hard to debug. I've rerun tests on it. You
can see that numbers are improved as the result of your refactoring.

          event         |     period
-----------------------+-----------------
  index_build           | 00:01:45.416822
  index_build_recovery  | 00:00:06
  index_update          | 00:05:17.263606
  index_update_recovery | 00:01:22
  search_new            | 00:24:07.706482
  search_updated        | 00:26:25.364494
(6 rows)

      label      | blocks_mark
----------------+-------------
  search_new     |   847587636
  search_updated |   881210486
(2 rows)

      label     |   size
---------------+-----------
  new           | 419299328
  after_updates | 715915264
(2 rows)

Beside simple bugs, there was a subtle bug in incremental item indexes
update. I've added some more comments including ASCII picture about how
item indexes are shifted.

Now, I'm trying to implement support of old page format. Then we can
decide which approach to use.


Attached version of patch has support of old page format. Patch still needs
more documentation and probably refactoring, but I believe idea is clear
and it can be reviewed. In the patch I have to revert change of null
category placement for compatibility with old posting list format.

Thanks, just glanced at this quickly.

If I'm reading the patch correctly, old-style non-compressed posting tree leaf pages are not vacuumed at all; that's clearly not right.

Fixed. Now separate function handles uncompressed posting lists and compress them if as least one TID is deleted.
  
I argued earlier that we don't need to handle the case that compressing a page makes it larger, so that the existing items don't fit on the page anymore. I'm having some second thoughts on that; I didn't take into account the fact that the "mini-index" in the new page format takes up some space. I think it's still highly unlikely that there isn't enough space on a 8k page, but it's not totally crazy with a non-standard small page size. So at least that situation needs to be checked for with an ereport(), rather than Assert.
 
Now this situation is ereported before any change in page.

To handle splitting a non-compressed page, it seems that you're relying on the fact that before splitting, we try to insert, which compresses the page. The problem with that is that it's not correctly WAL-logged. The split record that follows includes a full copy of both page halves, but if the split fails for some reason, e.g you run out of disk space, there is no WAL record at all of the the compression. I'd suggest doing the compression in the insert phase on a temporary copy of the page, leaving the original page untouched if there isn't enough space for the insertion to complete. (You could argue that this case can't happen because the compression must always create enough space to insert one more item. maybe so, but at least there should be an explicit check for that.)

Good point. Yes, if we don't handle specially inserting item indexes, I see no point to do special handling for single TID which is much smaller. In the attached patch dataCompressLeafPage just reserves space for single TID. 

Also, many changes in comments and README.

Unfortunally, I didn't understand what is FIXME about in ginVacuumEntryPage. So, it's not fixed :)

Sorry, I posted buggy version of patch. Attached version is correct.

Another bug fixed, following a report from Rod Taylor.

------
With best regards,
Alexander Korotkov.

Attachments

Re: GIN improvements part 1: additional information

From
Peter Eisentraut
Date:
On 11/14/13, 6:00 AM, Alexander Korotkov wrote:
> Sorry, I posted buggy version of patch. Attached version is correct.

This patch crashes the hstore and pg_trgm regression tests.




Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Fri, Nov 15, 2013 at 8:58 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
On 11/14/13, 6:00 AM, Alexander Korotkov wrote:
> Sorry, I posted buggy version of patch. Attached version is correct.

This patch crashes the hstore the pg_trgm regression tests.

Which exact version did you try, 14 or 16? If it was 16, could you please post a stack trace, because it doesn't crash for me.

------
With best regards,
Alexander Korotkov. 

Re: GIN improvements part 1: additional information

From
Peter Eisentraut
Date:
On 11/15/13, 12:24 PM, Alexander Korotkov wrote:
> On Fri, Nov 15, 2013 at 8:58 PM, Peter Eisentraut <peter_e@gmx.net
> <mailto:peter_e@gmx.net>> wrote:
> 
>     On 11/14/13, 6:00 AM, Alexander Korotkov wrote:
>     > Sorry, I posted buggy version of patch. Attached version is correct.
> 
>     This patch crashes the hstore the pg_trgm regression tests.
> 
> 
> What exact version did you try 14 or 16? If it was 16, could you please
> post a stacktrace, because it doesn't crash for me.

The one that's the latest in the commitfest:
http://www.postgresql.org/message-id/CAPpHfdvEQ-JhE_2z9pvw22Bp6h_o8XOaXCbjAGf87gs4p4Jmuw@mail.gmail.com

If you have a newer one, please add it there.




Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Fri, Nov 15, 2013 at 9:56 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
On 11/15/13, 12:24 PM, Alexander Korotkov wrote:
> On Fri, Nov 15, 2013 at 8:58 PM, Peter Eisentraut <peter_e@gmx.net
> <mailto:peter_e@gmx.net>> wrote:
>
>     On 11/14/13, 6:00 AM, Alexander Korotkov wrote:
>     > Sorry, I posted buggy version of patch. Attached version is correct.
>
>     This patch crashes the hstore the pg_trgm regression tests.
>
>
> What exact version did you try 14 or 16? If it was 16, could you please
> post a stacktrace, because it doesn't crash for me.

The one that's the latest in the commitfest: http://www.postgresql.org/message-id/CAPpHfdvEQ-JhE_2z9pvw22Bp6h_o8XOaXCbjAGf87gs4p4Jmuw@mail.gmail.com

If you have a newer one, please add it there.

OK, I've updated the information in the commitfest app.

------
With best regards,
Alexander Korotkov.  

Re: GIN improvements part 1: additional information

From
Heikki Linnakangas
Date:
On 06.11.2013 17:36, Alvaro Herrera wrote:
> Just for my own illumination, can someone explain this bit?
>
> + If a posting list is too large to store in-line in a key entry, a posting tree
> + is created. A posting tree is a B-tree structure, where the ItemPointer is
> + used as the key. At the leaf-level, item pointers are stored compressed, in
> + "varbyte encoding".
>
> I think the first ItemPointer mentioned (the key) refers to a TID
> pointing to the index, and "item pointers stored compressed" refers to
> the TIDs pointing to the heap (the data).  Is that correct?

No, they both refer to TIDs pointing to the heap.

> I'm also interested in the "FIXME explain varbyte encoding" explanation
> currently missing, if somebody can write it down ...

Alexander's latest version filled in that explanation (haven't read it 
myself yet)
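
For readers following along, the general shape of varbyte encoding is
easy to sketch (this is the textbook scheme; the patch's exact details
may differ):

/*
 * 7 payload bits per byte; a set high bit means "another byte follows".
 * Item pointers are stored as deltas from the previous one, so small
 * gaps between TIDs encode into just one or two bytes.
 */
static unsigned char *
encode_varbyte(uint64 val, unsigned char *ptr)
{
    while (val > 0x7F)
    {
        *ptr++ = 0x80 | (val & 0x7F);
        val >>= 7;
    }
    *ptr++ = (unsigned char) val;    /* last byte: high bit clear */
    return ptr;
}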

- Heikki



Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
Hi!

On Wed, Nov 20, 2013 at 9:02 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 06.11.2013 17:36, Alvaro Herrera wrote:
Just for my own illumination, can someone explain this bit?

+ If a posting list is too large to store in-line in a key entry, a posting tree
+ is created. A posting tree is a B-tree structure, where the ItemPointer is
+ used as the key. At the leaf-level, item pointers are stored compressed, in
+ "varbyte encoding".

I think the first ItemPointer mentioned (the key) refers to a TID
pointing to the index, and "item pointers stored compressed" refers to
the TIDs pointing to the heap (the data).  Is that correct?

No, they both refer to TIDs pointing to the heap.


I'm also interested in the "FIXME explain varbyte encoding" explanation
currently missing, if somebody can write it down ...

Alexander's latest version filled in that explanation (haven't read it myself yet)

off-list

What are your plans for GIN now? I tried to rebase packed posting lists onto head, but I found that you've changed the interface of the placeToPage function. That conflicts with packed posting lists, because dataPlaceToPageLeaf needs more than an offset number to describe the place to insert an item pointer. Would you like to commit the rework of handling GIN incomplete splits first?

------
With best regards,
Alexander Korotkov.  

Re: GIN improvements part 1: additional information

От
Alexander Korotkov
Дата:
On Tue, Nov 26, 2013 at 5:34 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Wed, Nov 20, 2013 at 9:02 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 06.11.2013 17:36, Alvaro Herrera wrote:
Just for my own illumination, can someone explain this bit?

+ If a posting list is too large to store in-line in a key entry, a posting tree
+ is created. A posting tree is a B-tree structure, where the ItemPointer is
+ used as the key. At the leaf-level, item pointers are stored compressed, in
+ "varbyte encoding".

I think the first ItemPointer mentioned (the key) refers to a TID
pointing to the index, and "item pointers stored compressed" refers to
the TIDs pointing to the heap (the data).  Is that correct?

No, they both refer to TIDs pointing to the heap.


I'm also interested in the "FIXME explain varbyte encoding" explanation
currently missing, if somebody can write it down ...

Alexander's latest version filled in that explanation (haven't read it myself yet)

off-list

It turns out this wasn't actually off-list, sorry :)
 
What's your plans about GIN now? I tried to rebase packed posting lists with head. But I found that you've changed interface of placeToPage function. That's conflicts with packed posting lists, because dataPlaceToPageLeaf needs not only offset number to describe place to insert item pointer. Do you like to commit rework of handling GIN incomplete splits before?
 
------
With best regards,
Alexander Korotkov.  

Re: GIN improvements part 1: additional information

From
Heikki Linnakangas
Date:
On 11/26/13 15:34, Alexander Korotkov wrote:
> What's your plans about GIN now? I tried to rebase packed posting lists
> with head. But I found that you've changed interface of placeToPage
> function. That's conflicts with packed posting lists, because
> dataPlaceToPageLeaf needs not only offset number to describe place to
> insert item pointer. Do you like to commit rework of handling GIN
> incomplete splits before?

Yeah, I'm planning to get back to this patch after committing the 
incomplete splits patch. I think the refactoring of the WAL-logging that 
I did in that patch will simplify this patch, too. I'll take a look at 
Michael's latest comments on the incomplete splits patch tomorrow, so I 
should get back to this on Thursday or Friday.

- Heikki



Re: GIN improvements part 1: additional information

From
Peter Eisentraut
Date:
This patch needs to be rebased.






Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Wed, Nov 27, 2013 at 1:14 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 11/26/13 15:34, Alexander Korotkov wrote:
What's your plans about GIN now? I tried to rebase packed posting lists
with head. But I found that you've changed interface of placeToPage
function. That's conflicts with packed posting lists, because
dataPlaceToPageLeaf needs not only offset number to describe place to
insert item pointer. Do you like to commit rework of handling GIN
incomplete splits before?

Yeah, I'm planning to get back to this patch after committing the incomplete splits patch. I think the refactoring of the WAL-logging that I did in that patch will simplify this patch, too. I'll take a look at Michael's latest comments on the incomplete splits patch tomorrow, so I should get back to this on Thursday or Friday.

Should I try to rebase this patch now, or do you plan to do it yourself? Some changes, like the "void *insertdata" argument, make me think you have a particular plan for rebasing this patch, but I didn't quite get it.

------
With best regards,
Alexander Korotkov. 

Re: GIN improvements part 1: additional information

From
Heikki Linnakangas
Date:
On 11/28/2013 09:19 AM, Alexander Korotkov wrote:
> On Wed, Nov 27, 2013 at 1:14 AM, Heikki Linnakangas <hlinnakangas@vmware.com
>> wrote:
>
>> On 11/26/13 15:34, Alexander Korotkov wrote:
>>
>>> What's your plans about GIN now? I tried to rebase packed posting lists
>>> with head. But I found that you've changed interface of placeToPage
>>> function. That's conflicts with packed posting lists, because
>>> dataPlaceToPageLeaf needs not only offset number to describe place to
>>> insert item pointer. Do you like to commit rework of handling GIN
>>> incomplete splits before?
>>
>> Yeah, I'm planning to get back to this patch after committing the
>> incomplete splits patch. I think the refactoring of the WAL-logging that I
>> did in that patch will simplify this patch, too. I'll take a look at
>> Michael's latest comments on the incomplete splits patch tomorrow, so I
>> should get back to this on Thursday or Friday.
>
> Should I try to rebase this patch now or you plan to do it yourself? Some
> changes like "void *insertdata" argument make me think you have some
> particular plan to rebase this patch, but I didn't get it exactly.

Here's a rebased version. I'll continue reviewing it now...

- Heikki

Attachments

Re: GIN improvements part 1: additional information

От
Heikki Linnakangas
Дата:
On 11/29/2013 11:41 AM, Heikki Linnakangas wrote:
> On 11/28/2013 09:19 AM, Alexander Korotkov wrote:
>> On Wed, Nov 27, 2013 at 1:14 AM, Heikki Linnakangas
>> <hlinnakangas@vmware.com
>>> wrote:
>>
>>> On 11/26/13 15:34, Alexander Korotkov wrote:
>>>
>>>> What's your plans about GIN now? I tried to rebase packed posting lists
>>>> with head. But I found that you've changed interface of placeToPage
>>>> function. That's conflicts with packed posting lists, because
>>>> dataPlaceToPageLeaf needs not only offset number to describe place to
>>>> insert item pointer. Do you like to commit rework of handling GIN
>>>> incomplete splits before?
>>>
>>> Yeah, I'm planning to get back to this patch after committing the
>>> incomplete splits patch. I think the refactoring of the WAL-logging
>>> that I
>>> did in that patch will simplify this patch, too. I'll take a look at
>>> Michael's latest comments on the incomplete splits patch tomorrow, so I
>>> should get back to this on Thursday or Friday.
>>
>> Should I try to rebase this patch now or you plan to do it yourself? Some
>> changes like "void *insertdata" argument make me think you have some
>> particular plan to rebase this patch, but I didn't get it exactly.
>
> Here's rebased version. I'll continue reviewing it now..

Another update. Fixes a bunch of bugs. Mostly introduced by me, but a
couple were present in your v16:

* When allocating the entry->list buffer in a scan, it must be large
enough for the max number of items that can fit on a compressed page,
whether the current page is compressed or not. That's because the same
buffer is reused on subsequent pages, which might be compressed.

* When splitting a leaf page during index creation, missed the trick
that's present in current master, to choose the split point so that left
page is packed as full as possible. I put that back, it makes
newly-built indexes somewhat smaller. (I wonder if we should leave some
free space for future updates. But that's a separate patch, let's keep
the current behavior in this patch)

I'll continue reviewing next week..

- Heikki

Вложения

Re: GIN improvements part 1: additional information

От
Alexander Korotkov
Дата:
On Fri, Nov 29, 2013 at 11:17 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 11/29/2013 11:41 AM, Heikki Linnakangas wrote:
On 11/28/2013 09:19 AM, Alexander Korotkov wrote:
On Wed, Nov 27, 2013 at 1:14 AM, Heikki Linnakangas
<hlinnakangas@vmware.com
wrote:

On 11/26/13 15:34, Alexander Korotkov wrote:

What's your plans about GIN now? I tried to rebase packed posting lists
with head. But I found that you've changed interface of placeToPage
function. That's conflicts with packed posting lists, because
dataPlaceToPageLeaf needs not only offset number to describe place to
insert item pointer. Do you like to commit rework of handling GIN
incomplete splits before?

Yeah, I'm planning to get back to this patch after committing the
incomplete splits patch. I think the refactoring of the WAL-logging
that I
did in that patch will simplify this patch, too. I'll take a look at
Michael's latest comments on the incomplete splits patch tomorrow, so I
should get back to this on Thursday or Friday.

Should I try to rebase this patch now or you plan to do it yourself? Some
changes like "void *insertdata" argument make me think you have some
particular plan to rebase this patch, but I didn't get it exactly.

Here's rebased version. I'll continue reviewing it now..

Another update. Fixes a bunch of bugs. Mostly introduced by me, but a couple were present in your v16:

* When allocating the entry->list buffer in a scan, it must be large enough for the max number of items that can fit on a compressed page, whether the current page is compressed or not. That's because the same buffer is reused on subsequent pages, which might be compressed.

* When splitting a leaf page during index creation, missed the trick that's present in current master, to choose the split point so that left page is packed as full as possible. I put that back, it makes newly-built indexes somewhat smaller. (I wonder if we should leave some free space for future updates. But that's a separate patch, let's keep the current behavior in this patch)

I'll continue reviewing next week..

Good. Thanks for debugging and fixing the bugs.
Can I do anything for this patch now?

------
With best regards,
Alexander Korotkov. 

Re: GIN improvements part 1: additional information

От
Heikki Linnakangas
Дата:
On 12/08/2013 09:56 PM, Alexander Korotkov wrote:
> On Fri, Nov 29, 2013 at 11:17 PM, Heikki Linnakangas <
> hlinnakangas@vmware.com> wrote:
>
>> I'll continue reviewing next week..

Got dragged into other things and didn't make any progress on this last 
week. I'm trying again now.

> Good. Thanks for debug and fixing bugs.
> Can I do anything for this patch now?

I wonder if we're leaving some money on the table, by using varbyte 
encoding. Googling around, there are many other compression methods out 
there for compressing integer deltas that compress better, and/or 
decompress faster.

Even if we use varbyte encoding, I wonder if it would be better to treat 
block + offset number as a single 48-bit integer, rather than encode 
them separately. That would allow the delta of two items on the same 
page to be stored as a single byte, rather than two bytes. Naturally it 
would be a loss on other values, but would be nice to see some kind of 
an analysis on that. I suspect it might make the code simpler, too.
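
For illustration, a minimal sketch of that combined-key approach could look
like the following; the 11-bit offset width and the helper names here are
assumptions for the example, not the patch's actual code:

#include <stdint.h>

#define OFFSET_BITS 11    /* assumed wide enough for MaxHeapTuplesPerPage */

/* Map (block, offset) to a single 48-bit integer key. */
static uint64_t
itemptr_to_key(uint32_t blkno, uint16_t offnum)
{
    return ((uint64_t) blkno << OFFSET_BITS) | offnum;
}

/* Varbyte-encode the delta between consecutive keys: 7 bits per byte,
 * low-order bits first, high bit set while more bytes follow. */
static unsigned char *
encode_varbyte(uint64_t delta, unsigned char *ptr)
{
    while (delta >= 0x80)
    {
        *ptr++ = (unsigned char) ((delta & 0x7F) | 0x80);
        delta >>= 7;
    }
    *ptr++ = (unsigned char) delta;
    return ptr;
}

With this layout, two items on the same heap page that are at most 127
offsets apart encode as a single-byte delta.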

- Heikki



Re: GIN improvements part 1: additional information

От
Alexander Korotkov
Дата:
On Mon, Dec 9, 2013 at 1:18 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 12/08/2013 09:56 PM, Alexander Korotkov wrote:
On Fri, Nov 29, 2013 at 11:17 PM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:

I'll continue reviewing next week..

Got dragged into other things and didn't make any progress on this last week. I'm trying again now.


Good. Thanks for debug and fixing bugs.
Can I do anything for this patch now?

I wonder if we're leaving some money on the table, by using varbyte encoding. Googling around, there are many other compression methods out there for compressing integer deltas that compress better, and/or decompress faster.

OK, I'll try to google around. There are probably some better options. 
 
Even if we use varbyte encoding, I wonder if it would be better to treat block + offset number as a single 48-bit integer, rather than encode them separately. That would allow the delta of two items on the same page to be stored as a single byte, rather than two bytes. Naturally it would be a loss on other values, but would be nice to see some kind of an analysis on that. I suspect it might make the code simpler, too.

Yeah, I had that idea, but I didn't think it was the better option. I'll try to do some analysis.

------
With best regards,
Alexander Korotkov.

Re: GIN improvements part 1: additional information

От
Heikki Linnakangas
Дата:
On 12/09/2013 11:34 AM, Alexander Korotkov wrote:
> On Mon, Dec 9, 2013 at 1:18 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com>wrote:
>
>> Even if we use varbyte encoding, I wonder if it would be better to treat
>> block + offset number as a single 48-bit integer, rather than encode them
>> separately. That would allow the delta of two items on the same page to be
>> stored as a single byte, rather than two bytes. Naturally it would be a
>> loss on other values, but would be nice to see some kind of an analysis on
>> that. I suspect it might make the code simpler, too.
>
> Yeah, I had that idea, but I thought it's not a better option. Will try to
> do some analysis.

The more I think about that, the more convinced I am that it's a good
idea. I don't think it will ever compress worse than the current
approach of treating block and offset numbers separately, and, although
I haven't actually tested it, I doubt it's any slower. About the same
amount of arithmetic is required in both versions.

Attached is a version that does that. Plus some other minor cleanup.

(we should still investigate using a completely different algorithm, though)

- Heikki

Вложения

Re: GIN improvements part 1: additional information

От
Alexander Korotkov
Дата:
On Tue, Dec 10, 2013 at 12:26 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 12/09/2013 11:34 AM, Alexander Korotkov wrote:
On Mon, Dec 9, 2013 at 1:18 PM, Heikki Linnakangas
<hlinnakangas@vmware.com>wrote:

Even if we use varbyte encoding, I wonder if it would be better to treat
block + offset number as a single 48-bit integer, rather than encode them
separately. That would allow the delta of two items on the same page to be
stored as a single byte, rather than two bytes. Naturally it would be a
loss on other values, but would be nice to see some kind of an analysis on
that. I suspect it might make the code simpler, too.

Yeah, I had that idea, but I thought it's not a better option. Will try to
do some analysis.

The more I think about that, the more convinced I am that it's a good idea. I don't think it will ever compress worse than the current approach of treating block and offset numbers separately, and, although I haven't actually tested it, I doubt it's any slower. About the same amount of arithmetic is required in both versions.

Attached is a version that does that. Plus some other minor cleanup.

(we should still investigate using a completely different algorithm, though)

Yes, when I thought about that, I didn't realize that we could reserve fewer than 16 bits for the offset number.
I reran my benchmark and got the following results:

         event         |     period      
-----------------------+-----------------
 index_build           | 00:01:46.39056
 index_build_recovery  | 00:00:05
 index_update          | 00:06:01.557708
 index_update_recovery | 00:01:23
 search_new            | 00:24:05.600366
 search_updated        | 00:25:29.520642
(6 rows)

     label      | blocks_mark 
----------------+-------------
 search_new     |   847509920
 search_updated |   883789826
(2 rows)

     label     |   size    
---------------+-----------
 new           | 364560384
 after_updates | 642736128
(2 rows)

Speed is the same while the index size is smaller. In the previous format it was:

     label     |   size    
---------------+-----------
 new           | 419299328
 after_updates | 715915264
(2 rows)

Good optimization, thanks. I'll try other datasets, but I expect similar results.
However, the patch didn't apply to head. A corrected version is attached.

------
With best regards,
Alexander Korotkov.
Вложения

Re: GIN improvements part 1: additional information

От
Alexander Korotkov
Дата:
On Tue, Dec 10, 2013 at 12:26 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 12/09/2013 11:34 AM, Alexander Korotkov wrote:
On Mon, Dec 9, 2013 at 1:18 PM, Heikki Linnakangas
<hlinnakangas@vmware.com>wrote:

Even if we use varbyte encoding, I wonder if it would be better to treat
block + offset number as a single 48-bit integer, rather than encode them
separately. That would allow the delta of two items on the same page to be
stored as a single byte, rather than two bytes. Naturally it would be a
loss on other values, but would be nice to see some kind of an analysis on
that. I suspect it might make the code simpler, too.

Yeah, I had that idea, but I thought it's not a better option. Will try to
do some analysis.

The more I think about that, the more convinced I am that it's a good idea. I don't think it will ever compress worse than the current approach of treating block and offset numbers separately, and, although I haven't actually tested it, I doubt it's any slower. About the same amount of arithmetic is required in both versions.

Attached is a version that does that. Plus some other minor cleanup.

(we should still investigate using a completely different algorithm, though)

I've thought about different algorithms a little more. The general problem I see is online update. We need it, while it is typically not covered by the research at all. We already have to invent a small index at the end of the page. Different encoding methods add more challenges. In general, the methods can be classified into two groups:
1) Values aren't aligned by bytes (gamma codes, PFOR etc.)
2) Multiple values are packed together in small groups (simple-9, simple-18)
For the first group of methods, when inserting in the middle of the page we would have to do a non-byte-aligned shift of the right part of the values. I don't know how expensive this shift is, but I expect it would be much slower than memmove.
When values are packed into small groups, we have to either insert an inefficiently encoded value or re-encode the whole right part of the values.

The option I see is to encode the bins between item indexes separately. This still might be slower and would require much more complicated maintenance. It would also much complicate further work on storing additional information in GIN.

Any other thoughts?

------
With best regards,
Alexander Korotkov. 

Re: GIN improvements part 1: additional information

От
Heikki Linnakangas
Дата:
On 12/12/2013 06:44 PM, Alexander Korotkov wrote:
> I've thought about different algorithms little more. General problem I see
> is online update. We need it while it is typically not covered by
> researches at all. We already have to invent small index in the end of
> page. Different encoding methods adds more challenges. In general, methods
> can be classified in two groups:
> 1) Values aren't aligned by bytes (gamma-codes, PFOR etc.)
> 2) Multiple values are packed together in small group (simple-9, simple-18)

Ok.

> For the first group of methods when inserting in the middle of the page we
> would have to do not byte-aligned shift of right part of values. I don't
> know how expensive is this shift but I expect that it would be much slower
> than memmove.

Agreed.

> When values are packed into small groups, we have to either insert
> inefficiently encoded value or re-encode whole right part of values.

It would probably be simplest to store newly inserted items 
uncompressed, in a separate area in the page. For example, grow the list 
of uncompressed items downwards from pg_upper, and the compressed items 
upwards from pg_lower. When the page fills up, re-encode the whole page.
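
As a sketch of the intended layout (field names modeled on the standard
page header; this is an illustration, not the patch itself):

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/itemptr.h"

/*
 * +-------------+---------------------+----------+---------------------+
 * | page header | compressed segments | free gap | uncompressed items  |
 * +-------------+---------------------+----------+---------------------+
 *                pd_lower grows right -->   <-- pd_upper grows left
 *
 * A newly inserted TID is appended to the uncompressed area while it
 * fits; once it no longer fits, the whole page is re-encoded.
 */
static bool
uncompressed_item_fits(Page page)
{
    PageHeader  phdr = (PageHeader) page;

    return (Size) (phdr->pd_upper - phdr->pd_lower) >= sizeof(ItemPointerData);
}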

- Heikki



Re: GIN improvements part 1: additional information

От
Alexander Korotkov
Дата:
On Mon, Dec 16, 2013 at 3:30 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 12/12/2013 06:44 PM, Alexander Korotkov wrote:
I've thought about different algorithms little more. General problem I see
is online update. We need it while it is typically not covered by
researches at all. We already have to invent small index in the end of
page. Different encoding methods adds more challenges. In general, methods
can be classified in two groups:
1) Values aren't aligned by bytes (gamma-codes, PFOR etc.)
2) Multiple values are packed together in small group (simple-9, simple-18)

Ok.


For the first group of methods when inserting in the middle of the page we
would have to do not byte-aligned shift of right part of values. I don't
know how expensive is this shift but I expect that it would be much slower
than memmove.

Agreed.


When values are packed into small groups, we have to either insert
inefficiently encoded value or re-encode whole right part of values.

It would probably be simplest to store newly inserted items uncompressed, in a separate area in the page. For example, grow the list of uncompressed items downwards from pg_upper, and the compressed items upwards from pg_lower. When the page fills up, re-encode the whole page.

Good idea. But:
1) We'll still need item indexes at the end of the page for fast scan.
2) The storage would be easily extendable to hold additional information as well.
Better compression shouldn't block more serious improvements.

------
With best regards,
Alexander Korotkov.  

Re: GIN improvements part 1: additional information

От
Heikki Linnakangas
Дата:
On 12/17/2013 12:22 AM, Alexander Korotkov wrote:
> On Mon, Dec 16, 2013 at 3:30 PM, Heikki Linnakangas <hlinnakangas@vmware.com
>> wrote:
>
>> On 12/12/2013 06:44 PM, Alexander Korotkov wrote:
>>
>>   When values are packed into small groups, we have to either insert
>>> inefficiently encoded value or re-encode whole right part of values.
>>
>> It would probably be simplest to store newly inserted items uncompressed,
>> in a separate area in the page. For example, grow the list of uncompressed
>> items downwards from pg_upper, and the compressed items upwards from
>> pg_lower. When the page fills up, re-encode the whole page.

I hacked together an implementation of a variant of Simple9, to see how
it performs. Insertions are handled per the above scheme.

In a limited pg_trgm test case I've been using a lot for this, this
reduces the index size about 20%, compared to varbyte encoding. It might
be possible to squeeze it a bit more, I handcrafted the "selectors" in
the encoding algorithm to suit our needs, but I don't actually have a
good idea of how to choose them optimally. Also, the encoding can encode
0 values, but we never need to do that, so you could take advantage of
that to pack items tighter.

Compression and decompression speed seems to be about the same.

Patch attached if you want to play with it. WAL replay is still broken,
and there are probably bugs.
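
For reference, classic Simple-9 packs each 32-bit word as a 4-bit selector
plus 28 data bits split into n fields of b bits each (n * b <= 28). The
sketch below uses the textbook mode table rather than the handcrafted
selectors mentioned above:

#include <stdbool.h>
#include <stdint.h>

static const struct { int n; int bits; } s9_modes[] = {
    {28, 1}, {14, 2}, {9, 3}, {7, 4}, {5, 5}, {4, 7}, {3, 9}, {2, 14}, {1, 28}
};

/*
 * Pack a prefix of the 'nvalues' deltas in 'in' into *out, choosing the
 * densest mode whose fields can hold them.  Returns the number of values
 * consumed, or 0 if the first delta needs more than 28 bits.
 */
static int
s9_pack_word(const uint32_t *in, int nvalues, uint32_t *out)
{
    int         s, i;

    for (s = 0; s < 9; s++)
    {
        int         n = s9_modes[s].n;
        int         bits = s9_modes[s].bits;
        uint32_t    word;
        bool        fits = true;

        if (n > nvalues)
            continue;           /* not enough values left for this mode */
        for (i = 0; i < n; i++)
        {
            if (in[i] >= ((uint32_t) 1 << bits))
            {
                fits = false;
                break;
            }
        }
        if (!fits)
            continue;
        word = (uint32_t) s << 28;
        for (i = 0; i < n; i++)
            word |= in[i] << (i * bits);
        *out = word;
        return n;
    }
    return 0;
}

Decoding just reads the selector and unpacks n fixed-width fields, which is
what makes the scheme fast.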

> Good idea. But:
> 1) We'll still need item indexes in the end of page for fast scan.

Sure.

> 2) Storage would be easily extendable to hold additional information as
> well.
> Better compression shouldn't block more serious improvements.

I'm not sure I agree with that. For all the cases where you don't care
about additional information - which covers all existing users for
example - reducing disk size is pretty important. How are you planning
to store the additional information, and how does using another encoding
get in the way of that?

- Heikki

Вложения

Re: GIN improvements part 1: additional information

От
Alexander Korotkov
Дата:
On Tue, Dec 17, 2013 at 2:49 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 12/17/2013 12:22 AM, Alexander Korotkov wrote:
On Mon, Dec 16, 2013 at 3:30 PM, Heikki Linnakangas <hlinnakangas@vmware.com
wrote:

On 12/12/2013 06:44 PM, Alexander Korotkov wrote:

  When values are packed into small groups, we have to either insert
inefficiently encoded value or re-encode whole right part of values.

It would probably be simplest to store newly inserted items uncompressed,
in a separate area in the page. For example, grow the list of uncompressed
items downwards from pg_upper, and the compressed items upwards from
pg_lower. When the page fills up, re-encode the whole page.

I hacked together an implementation of a variant of Simple9, to see how it performs. Insertions are handled per the above scheme.

In a limited pg_trgm test case I've been using a lot for this, this reduces the index size about 20%, compared to varbyte encoding. It might be possible to squeeze it a bit more, I handcrafted the "selectors" in the encoding algorithm to suite our needs, but I don't actually have a good idea of how to choose them optimally. Also, the encoding can encode 0 values, but we never need to do that, so you could take advantage of that to pack items tighter.

Compression and decompression speed seems to be about the same.

Patch attached if you want to play with it. WAL replay is still broken, and there are probably bugs.


Good idea. But:
1) We'll still need item indexes in the end of page for fast scan.

Sure.


2) Storage would be easily extendable to hold additional information as
well.
Better compression shouldn't block more serious improvements.

I'm not sure I agree with that. For all the cases where you don't care about additional information - which covers all existing users for example - reducing disk size is pretty important. How are you planning to store the additional information, and how does using another encoding gets in the way of that?

I planned to store the additional information datums between the varbyte-encoded TIDs. I expected it would be hard to do with PFOR. However, I don't see significant problems in your implementation of Simple9 encoding. I'm going to dig deeper into your version of the patch.

------
With best regards,
Alexander Korotkov.

Re: GIN improvements part 1: additional information

От
Heikki Linnakangas
Дата:
On 12/18/2013 01:45 PM, Alexander Korotkov wrote:
> On Tue, Dec 17, 2013 at 2:49 AM, Heikki Linnakangas <hlinnakangas@vmware.com
>> wrote:
>
>> On 12/17/2013 12:22 AM, Alexander Korotkov wrote:
>>   2) Storage would be easily extendable to hold additional information as
>>> well.
>>> Better compression shouldn't block more serious improvements.
>>>
>>
>> I'm not sure I agree with that. For all the cases where you don't care
>> about additional information - which covers all existing users for example
>> - reducing disk size is pretty important. How are you planning to store the
>> additional information, and how does using another encoding gets in the way
>> of that?
>
> I was planned to store additional information datums between
> varbyte-encoded tids. I was expected it would be hard to do with PFOR.
> However, I don't see significant problems in your implementation of Simple9
> encoding. I'm going to dig deeper in your version of patch.

Ok, thanks.

I had another idea about the page format this morning. Instead of having 
the item-indexes at the end of the page, it would be more flexible to 
store a bunch of self-contained posting list "segments" on the page. So 
I propose that we get rid of the item-indexes, and instead store a bunch 
of independent posting lists on the page:

typedef struct
{
    ItemPointerData first;   /* first item in this segment (unpacked) */
    uint16      nwords;      /* number of words that follow */
    uint64      words[1];    /* var length */
} PostingListSegment;

Each segment can be encoded and decoded independently. When searching 
for a particular item (like on insertion), you skip over segments where 
'first' > the item you're searching for.
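
A simplified sketch of that walk over segments (with stand-in types: the
real segment header starts with an ItemPointerData rather than a plain
integer key):

#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint64_t    first;      /* first item in this segment, as an integer key */
    uint16_t    nwords;     /* number of 64-bit words that follow */
    uint64_t    words[];    /* var length */
} Segment;

static Segment *
next_segment(Segment *seg)
{
    return (Segment *) ((char *) seg
                        + offsetof(Segment, words)
                        + seg->nwords * sizeof(uint64_t));
}

/* Return the segment whose range may contain 'item'; 'end' points just
 * past the last segment on the page. */
static Segment *
find_segment(Segment *seg, char *end, uint64_t item)
{
    Segment    *next;

    while ((char *) (next = next_segment(seg)) < end && next->first <= item)
        seg = next;
    return seg;
}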

This format offers a lot more flexibility compared to the separate item 
indexes. First, we don't need to have another fixed sized area on the 
page, which simplifies the page format. Second, we can more easily 
re-encode only one segment on the page, on insertion or vacuum. The 
latter is particularly important with the Simple-9 encoding, which 
operates one word at a time rather than one item at a time; removing or 
inserting an item in the middle can require a complete re-encoding of 
everything that follows. Third, when a page is being inserted into and 
contains only uncompressed items, you don't waste any space for unused 
item indexes.

While we're at it, I think we should use the above struct in the inline 
posting lists stored directly in entry tuples. That wastes a few bytes 
compared to the current approach in the patch (more alignment, and 
'words' is redundant with the number of items stored on the tuple 
header), but it simplifies the functions handling these lists.

- Heikki



Re: GIN improvements part 1: additional information

От
Oleg Bartunov
Дата:
Guys,

before digging deep into the art of the comp/decomp world I'd like to know
whether you are familiar with the results of the
http://wwwconference.org/www2008/papers/pdf/p387-zhangA.pdf paper and
some newer research. Do we agree on what we really want? Basically,
there are three main features: size, compression speed, decompression
speed - we should take two :)

Should we design some sort of plugin which could support independent
storage on disk, so users can apply different techniques depending on
the data?

What I want to say is that we certainly can play with this very
challenging task, but we have limited time before 9.4 and we should
think in a positive direction.

Oleg

On Wed, Dec 18, 2013 at 6:50 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 12/18/2013 01:45 PM, Alexander Korotkov wrote:
>>
>> On Tue, Dec 17, 2013 at 2:49 AM, Heikki Linnakangas
>> <hlinnakangas@vmware.com
>>>
>>> wrote:
>>
>>
>>> On 12/17/2013 12:22 AM, Alexander Korotkov wrote:
>>>   2) Storage would be easily extendable to hold additional information as
>>>>
>>>> well.
>>>> Better compression shouldn't block more serious improvements.
>>>>
>>>
>>> I'm not sure I agree with that. For all the cases where you don't care
>>> about additional information - which covers all existing users for
>>> example
>>> - reducing disk size is pretty important. How are you planning to store
>>> the
>>> additional information, and how does using another encoding gets in the
>>> way
>>> of that?
>>
>>
>> I was planned to store additional information datums between
>> varbyte-encoded tids. I was expected it would be hard to do with PFOR.
>> However, I don't see significant problems in your implementation of
>> Simple9
>> encoding. I'm going to dig deeper in your version of patch.
>
>
> Ok, thanks.
>
> I had another idea about the page format this morning. Instead of having the
> item-indexes at the end of the page, it would be more flexible to store a
> bunch of self-contained posting list "segments" on the page. So I propose
> that we get rid of the item-indexes, and instead store a bunch of
> independent posting lists on the page:
>
> typedef struct
> {
>     ItemPointerData first;   /* first item in this segment (unpacked) */
>     uint16      nwords;      /* number of words that follow */
>     uint64      words[1];    /* var length */
> } PostingListSegment;
>
> Each segment can be encoded and decoded independently. When searching for a
> particular item (like on insertion), you skip over segments where 'first' >
> the item you're searching for.
>
> This format offers a lot more flexibility compared to the separate item
> indexes. First, we don't need to have another fixed sized area on the page,
> which simplifies the page format. Second, we can more easily re-encode only
> one segment on the page, on insertion or vacuum. The latter is particularly
> important with the Simple-9 encoding, which operates one word at a time
> rather than one item at a time; removing or inserting an item in the middle
> can require a complete re-encoding of everything that follows. Third, when a
> page is being inserted into and contains only uncompressed items, you don't
> waste any space for unused item indexes.
>
> While we're at it, I think we should use the above struct in the inline
> posting lists stored directly in entry tuples. That wastes a few bytes
> compared to the current approach in the patch (more alignment, and 'words'
> is redundant with the number of items stored on the tuple header), but it
> simplifies the functions handling these lists.
>
>
> - Heikki



Re: GIN improvements part 1: additional information

От
Heikki Linnakangas
Дата:
On 12/19/2013 08:37 AM, Oleg Bartunov wrote:
> Guys,
>
> before digging deep into the art of comp/decomp world I'd like to know
> if you familiar with results of
> http://wwwconference.org/www2008/papers/pdf/p387-zhangA.pdf paper and
> some newer research ?

Yeah, I saw that paper.

> Do we agree in what we really want ? Basically,
> there are three main features: size, compression, decompression speed
> - we should take two :)

According to that Zhang et al paper you linked, the Vbyte method 
actually performs the worst on all of those measures. The other 
algorithms are quite similar in terms of size (PForDelta being the most 
efficient), while PForDelta is significantly faster to compress/decompress.

Just by looking at those numbers, PForDelta looks like a clear winner. 
However, it operates on much bigger batches than the other algorithms; I 
haven't looked at it in detail, but Zhang et al used 128 integer 
batches, and they say that 32 integers is the minimum batch size. If we 
want to use it for the inline posting lists stored in entry tuples, that 
would be quite wasteful if there are only a few item pointers on the tuple.

Also, in the tests I've run, the compression/decompression speed is not 
a significant factor in total performance, with either varbyte encoding 
or Simple9-like encoding I hacked together.

Actually, now that I think about this a bit more, maybe we should go 
with Rice encoding after all? It's the most efficient in terms of size, 
and I believe it would be fast enough.
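
For concreteness, one common Rice-coding convention writes the quotient
delta >> k in unary followed by the k low-order bits; the exact bit
patterns depend on the convention and parameter chosen, so this sketch is
only an illustration (it assumes a zero-filled output buffer):

#include <stdint.h>

typedef struct
{
    uint8_t    *buf;        /* zero-initialized output buffer */
    long        nbits;      /* bits written so far */
} BitWriter;

static void
put_bit(BitWriter *bw, int bit)
{
    if (bit)
        bw->buf[bw->nbits / 8] |= 1 << (7 - bw->nbits % 8);
    bw->nbits++;
}

/* Rice-code 'delta' with parameter k: (delta >> k) 1-bits, a 0-bit
 * terminator, then the k low-order bits of delta. */
static void
rice_encode(BitWriter *bw, uint64_t delta, int k)
{
    uint64_t    q = delta >> k;
    int         i;

    while (q-- > 0)
        put_bit(bw, 1);
    put_bit(bw, 0);
    for (i = k - 1; i >= 0; i--)
        put_bit(bw, (int) ((delta >> i) & 1));
}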

> Should we design sort of plugin, which could support independent
> storage on disk, so users can apply different techniques, depending on
> data.
>
> What I want to say is that we certainly can play with this very
> challenged task, but we have limited time  before 9.4 and we should
> think in positive direction.

Once we have the code in place to deal with one encoding, it's easy to 
switch the implementation. Making it user-configurable or pluggable 
would be overkill IMHO.

What I'm saying is that we should make sure we get the page format right 
(in particular, I strongly feel we should use the self-contained 
PostingListSegment struct instead of item-indexes that I mentioned in the 
other post), with the implementation details hidden in the functions in 
ginpostinglist.c. Then we can easily experiment with different algorithms.
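
The isolation can be as simple as a narrow interface behind which the
encoding lives; the declarations below only illustrate the shape, they are
not the patch's actual signatures:

#include "postgres.h"
#include "storage/itemptr.h"

typedef struct GinPostingList GinPostingList;   /* opaque on-disk format */

/* Encode as many of the given item pointers as fit in 'maxsize' bytes,
 * reporting how many were consumed in *nwritten. */
extern GinPostingList *ginCompressPostingList(const ItemPointerData *items,
                                              int nitems, int maxsize,
                                              int *nwritten);

/* Decode a posting list back into an array of item pointers. */
extern ItemPointer ginPostingListDecode(GinPostingList *plist, int *ndecoded);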

- Heikki



Re: GIN improvements part 1: additional information

От
Heikki Linnakangas
Дата:
On 12/17/2013 12:49 AM, Heikki Linnakangas wrote:
> On 12/17/2013 12:22 AM, Alexander Korotkov wrote:
>> On Mon, Dec 16, 2013 at 3:30 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com
>>> wrote:
>>
>>> On 12/12/2013 06:44 PM, Alexander Korotkov wrote:
>>>
>>>   When values are packed into small groups, we have to either insert
>>>> inefficiently encoded value or re-encode whole right part of values.
>>>
>>> It would probably be simplest to store newly inserted items
>>> uncompressed,
>>> in a separate area in the page. For example, grow the list of
>>> uncompressed
>>> items downwards from pg_upper, and the compressed items upwards from
>>> pg_lower. When the page fills up, re-encode the whole page.
>
> I hacked together an implementation of a variant of Simple9, to see how
> it performs. Insertions are handled per the above scheme.

Here's an updated version of that, using the page layout without
item-indexes that I described in the other post. This is much less buggy
than that earlier crude version I posted - and unfortunately it doesn't
compress as well. The earlier version lost some items :-(.

Nevertheless, I think this page layout and code formatting is better,
even if we switch the encoding back to the varbyte encoding in the end.

I haven't tested WAL replay or VACUUM with this version yet, so those
are likely broken.

- Heikki

Вложения

Re: GIN improvements part 1: additional information

От
Heikki Linnakangas
Дата:
On 12/19/2013 10:44 AM, Heikki Linnakangas wrote:
> On 12/19/2013 08:37 AM, Oleg Bartunov wrote:
>> Guys,
>>
>> before digging deep into the art of comp/decomp world I'd like to know
>> if you familiar with results of
>> http://wwwconference.org/www2008/papers/pdf/p387-zhangA.pdf paper and
>> some newer research ?
>
> Yeah, I saw that paper.
>
>> Do we agree in what we really want ? Basically,
>> there are three main features: size, compression, decompression speed
>> - we should take two :)
>
> According to that Zhang et al paper you linked, the Vbyte method
> actually performs the worst on all of those measures. The other
> algorithms are quite similar in terms of size (PForDelta being the most
> efficient), while PForDelta is significantly faster to compress/decompress.
>
> Just by looking at those numbers, PForDelta looks like a clear winner.
> However, it operates on much bigger batches than the other algorithms; I
> haven't looked at it in detail, but Zhang et al used 128 integer
> batches, and they say that 32 integers is the minimum batch size. If we
> want to use it for the inline posting lists stored in entry tuples, that
> would be quite wasteful if there are only a few item pointers on the tuple.
>
> Also, in the tests I've run, the compression/decompression speed is not
> a significant factor in total performance, with either varbyte encoding
> or Simple9-like encoding I hacked together.

One disadvantage of Simple9 and other encodings that operate in batches 
is that removing a value from the middle can increase the number of 
bytes required for the remaining values. That's a problem during vacuum; 
it's possible that after vacuuming away one item pointer, the remaining 
items no longer fit on the page.

- Heikki



Re: GIN improvements part 1: additional information

От
Heikki Linnakangas
Дата:
On 12/19/2013 03:33 PM, Heikki Linnakangas wrote:
> On 12/17/2013 12:49 AM, Heikki Linnakangas wrote:
>> On 12/17/2013 12:22 AM, Alexander Korotkov wrote:
>>> On Mon, Dec 16, 2013 at 3:30 PM, Heikki Linnakangas
>>> <hlinnakangas@vmware.com
>>>> wrote:
>>>
>>>> On 12/12/2013 06:44 PM, Alexander Korotkov wrote:
>>>>
>>>>   When values are packed into small groups, we have to either insert
>>>>> inefficiently encoded value or re-encode whole right part of values.
>>>>
>>>> It would probably be simplest to store newly inserted items
>>>> uncompressed,
>>>> in a separate area in the page. For example, grow the list of
>>>> uncompressed
>>>> items downwards from pg_upper, and the compressed items upwards from
>>>> pg_lower. When the page fills up, re-encode the whole page.
>>
>> I hacked together an implementation of a variant of Simple9, to see how
>> it performs. Insertions are handled per the above scheme.
>
> Here's an updated version of that, using the page layout without
> item-indexes that I described in the other post. This is much less buggy
> than that earlier crude version I posted - and unfortunately it doesn't
> compress as well. The earlier version lost some items :-(.
>
> Nevertheless, I think this page layout and code formatting is better,
> even if we switch the encoding back to the varbyte encoding in the end.

Yet another version. The encoding/decoding code is now quite isolated in
ginpostinglist.c, so it's easy to experiment with different encodings.
This patch uses varbyte encoding again.

I got a bit carried away, experimented with a bunch of different
encodings. I tried rice encoding, rice encoding with block and offset
number delta stored separately, the simple9 variant, and varbyte encoding.

The compressed size obviously depends a lot on the distribution of the
items, but in the test set I used, the differences between different
encodings were quite small.

One fatal problem with many encodings is VACUUM. If a page is completely
full and you remove one item, the result must still fit. In other words,
removing an item must never enlarge the space needed. Otherwise we have
to be able to split on vacuum, which adds a lot of code, and also makes
it possible for VACUUM to fail if there is no disk space left. That's
unpleasant if you're trying to run VACUUM to release disk space. (gin
fast updates already has that problem BTW, but let's not make it worse)

I believe that eliminates all encodings in the Simple family, as well as
PForDelta, and surprisingly also Rice encoding. For example, if you have
three items in consecutive offsets, the differences between them are
encoded as 11 in rice encoding. If you remove the middle item, the
encoding for the next item becomes 010, which takes more space than the
original.

AFAICS varbyte encoding is safe from that. (a formal proof would be nice
though).

So, I'm happy to go with varbyte encoding now, indeed I don't think we
have much choice unless someone can come up with an alternative that's
VACUUM-safe. I have to put this patch aside for a while now, I spent a
lot more time on these encoding experiments than I intended. If you
could take a look at this latest version, spend some time reviewing it
and cleaning up any obsolete comments, and re-run the performance tests
you did earlier, that would be great. One thing I'm slightly worried
about is the overhead of merging the compressed and uncompressed posting
lists in a scan. This patch will be in good shape for the final
commitfest, or even before that.

- Heikki

Вложения

Re: GIN improvements part 1: additional information

От
Alvaro Herrera
Дата:
Heikki Linnakangas escribió:

> I believe that eliminates all encodings in the Simple family, as
> well as PForDelta, and surprisingly also Rice encoding. For example,
> if you have three items in consecutive offsets, the differences
> between them are encoded as 11 in rice encoding. If you remove the
> middle item, the encoding for the next item becomes 010, which takes
> more space than the original.

I don't understand this.  If you have three consecutive entries, and the
differences between them are 11, you need to store two 11s.  But if you
have two items, you only need to store 010 once.  So the difference is
larger, but since you need to store only one of them then overall it's
still shorter than the original.  No?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: GIN improvements part 1: additional information

От
Alexander Korotkov
Дата:
On Fri, Dec 20, 2013 at 11:43 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Heikki Linnakangas escribió:

> I believe that eliminates all encodings in the Simple family, as
> well as PForDelta, and surprisingly also Rice encoding. For example,
> if you have three items in consecutive offsets, the differences
> between them are encoded as 11 in rice encoding. If you remove the
> middle item, the encoding for the next item becomes 010, which takes
> more space than the original.

I don't understand this.  If you have three consecutive entries, and the
differences between them are 11, you need to store two 11s.  But if you
have two items, you only need to store 010 once.  So the difference is
larger, but since you need to store only one of them then overall it's
still shorter than the original.  No?

I believe Heikki means that both differences together are encoded as 11, each one being a single 1.

------
With best regards,
Alexander Korotkov. 

Re: GIN improvements part 1: additional information

От
Alvaro Herrera
Дата:
Alexander Korotkov escribió:
> On Fri, Dec 20, 2013 at 11:43 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com>wrote:
> 
> > Heikki Linnakangas escribió:
> >
> > > I believe that eliminates all encodings in the Simple family, as
> > > well as PForDelta, and surprisingly also Rice encoding. For example,
> > > if you have three items in consecutive offsets, the differences
> > > between them are encoded as 11 in rice encoding. If you remove the
> > > middle item, the encoding for the next item becomes 010, which takes
> > > more space than the original.
> >
> > I don't understand this.  If you have three consecutive entries, and the
> > differences between them are 11, you need to store two 11s.  But if you
> > have two items, you only need to store 010 once.  So the difference is
> > larger, but since you need to store only one of them then overall it's
> > still shorter than the original.  No?
> 
> I believe Heikki mean both differences are encoded as 11, each one is 1.

Oh, that sucks (or it's great, depending on perspective).

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: GIN improvements part 1: additional information

От
Alexander Korotkov
Дата:
On Fri, Dec 20, 2013 at 11:36 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 12/19/2013 03:33 PM, Heikki Linnakangas wrote:
On 12/17/2013 12:49 AM, Heikki Linnakangas wrote:
On 12/17/2013 12:22 AM, Alexander Korotkov wrote:
On Mon, Dec 16, 2013 at 3:30 PM, Heikki Linnakangas
<hlinnakangas@vmware.com
wrote:

On 12/12/2013 06:44 PM, Alexander Korotkov wrote:

  When values are packed into small groups, we have to either insert
inefficiently encoded value or re-encode whole right part of values.

It would probably be simplest to store newly inserted items
uncompressed,
in a separate area in the page. For example, grow the list of
uncompressed
items downwards from pg_upper, and the compressed items upwards from
pg_lower. When the page fills up, re-encode the whole page.

I hacked together an implementation of a variant of Simple9, to see how
it performs. Insertions are handled per the above scheme.

Here's an updated version of that, using the page layout without
item-indexes that I described in the other post. This is much less buggy
than that earlier crude version I posted - and unfortunately it doesn't
compress as well. The earlier version lost some items :-(.

Nevertheless, I think this page layout and code formatting is better,
even if we switch the encoding back to the varbyte encoding in the end.

Yet another version. The encoding/decoding code is now quite isolated in ginpostinglist.c, so it's easy to experiment with different encodings. This patch uses varbyte encoding again.

I got a bit carried away, experimented with a bunch of different encodings. I tried rice encoding, rice encoding with block and offset number delta stored separately, the simple9 variant, and varbyte encoding.

The compressed size obviously depends a lot on the distribution of the items, but in the test set I used, the differences between different encodings were quite small.

One fatal problem with many encodings is VACUUM. If a page is completely full and you remove one item, the result must still fit. In other words, removing an item must never enlarge the space needed. Otherwise we have to be able to split on vacuum, which adds a lot of code, and also makes it possible for VACUUM to fail if there is no disk space left. That's unpleasant if you're trying to run VACUUM to release disk space. (gin fast updates already has that problem BTW, but let's not make it worse)

I believe that eliminates all encodings in the Simple family, as well as PForDelta, and surprisingly also Rice encoding. For example, if you have three items in consecutive offsets, the differences between them are encoded as 11 in rice encoding. If you remove the middle item, the encoding for the next item becomes 010, which takes more space than the original.

AFAICS varbyte encoding is safe from that. (a formal proof would be nice though).

The formal proof goes like this. Removing a number is actually the replacement of two numbers with their sum. We have to prove that the varbyte encoding of the sum can't be longer than the combined varbyte encodings of the summands. The bit length of the sum is at most one more than the bit length of the larger summand, so the varbyte encoding of the sum is at most one byte longer than the varbyte encoding of the larger summand, while the encoding of the lesser summand takes at least one byte.
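
In symbols, writing $\ell(x)$ for the varbyte length of a positive integer $x$ and $\mathrm{bits}(x) = \lfloor \log_2 x \rfloor + 1$ for its bit length, the argument is:

\ell(x) = \lceil \mathrm{bits}(x) / 7 \rceil
\mathrm{bits}(m + n) \le \max(\mathrm{bits}(m), \mathrm{bits}(n)) + 1
\ell(m + n) \le \ell(\max(m, n)) + 1 \le \ell(m) + \ell(n)

where the last inequality holds because $\ell(\cdot) \ge 1$ for the smaller summand.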
 
So, I'm happy to go with varbyte encoding now, indeed I don't think we have much choice unless someone can come up with an alternative that's VACUUM-safe. I have to put this patch aside for a while now, I spent a lot more time on these encoding experiments than I intended. If you could take a look at this latest version, spend some time reviewing it and cleaning up any obsolete comments, and re-run the performance tests you did earlier, that would be great. One thing I'm slightly worried about is the overhead of merging the compressed and uncompressed posting lists in a scan. This patch will be in good shape for the final commitfest, or even before that.

OK. I'll run tests for scans on a mix of compressed and uncompressed posting lists. However, I don't expect it to be slower than either pure compressed or pure uncompressed posting lists. The overhead of compressing the uncompressed posting lists is evident, but it would also be nice to measure it.

------
With best regards,
Alexander Korotkov.

Re: GIN improvements part 1: additional information

От
Peter Eisentraut
Дата:
On 12/10/13, 2:44 PM, Alexander Korotkov wrote:
> However, patch didn't apply to head. Corrected version is attached.

Update the pgstattuple regression test results.




Re: GIN improvements part 1: additional information

От
Amit Langote
Дата:
On Sat, Dec 21, 2013 at 4:36 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>
> Yet another version. The encoding/decoding code is now quite isolated in
> ginpostinglist.c, so it's easy to experiment with different encodings. This
> patch uses varbyte encoding again.
>
> I got a bit carried away, experimented with a bunch of different encodings.
> I tried rice encoding, rice encoding with block and offset number delta
> stored separately, the simple9 variant, and varbyte encoding.
>
> The compressed size obviously depends a lot on the distribution of the
> items, but in the test set I used, the differences between different
> encodings were quite small.
>
> One fatal problem with many encodings is VACUUM. If a page is completely
> full and you remove one item, the result must still fit. In other words,
> removing an item must never enlarge the space needed. Otherwise we have to
> be able to split on vacuum, which adds a lot of code, and also makes it
> possible for VACUUM to fail if there is no disk space left. That's
> unpleasant if you're trying to run VACUUM to release disk space. (gin fast
> updates already has that problem BTW, but let's not make it worse)
>
> I believe that eliminates all encodings in the Simple family, as well as
> PForDelta, and surprisingly also Rice encoding. For example, if you have
> three items in consecutive offsets, the differences between them are encoded
> as 11 in rice encoding. If you remove the middle item, the encoding for the
> next item becomes 010, which takes more space than the original.
>
> AFAICS varbyte encoding is safe from that. (a formal proof would be nice
> though).
>
> So, I'm happy to go with varbyte encoding now, indeed I don't think we have
> much choice unless someone can come up with an alternative that's
> VACUUM-safe. I have to put this patch aside for a while now, I spent a lot
> more time on these encoding experiments than I intended. If you could take a
> look at this latest version, spend some time reviewing it and cleaning up
> any obsolete comments, and re-run the performance tests you did earlier,
> that would be great. One thing I'm slightly worried about is the overhead of
> merging the compressed and uncompressed posting lists in a scan. This patch
> will be in good shape for the final commitfest, or even before that.
>


I just tried out the patch "gin-packed-postinglists-varbyte2.patch"
(which looks like the latest one in this thread) as follows:

1) Applied patch to the HEAD (on commit
94b899b829657332bda856ac3f06153d09077bd1)
2) Created a test table and index

create table test (a text);
copy test from '/usr/share/dict/words';
create index test_trgm_idx on test using gin (a gin_trgm_ops);

3) Got the following error on a wildcard query:

postgres=# explain (buffers, analyze) select count(*) from test where
a like '%tio%';
ERROR:  lock 9447 is not held
STATEMENT:  explain (buffers, analyze) select count(*) from test where
a like '%tio%';
ERROR:  lock 9447 is not held

Following is the "bt":

#0  LWLockRelease (lockid=9447) at lwlock.c:747
#1  0x000000000074a356 in LockBuffer (buffer=4638, mode=0) at bufmgr.c:2760
#2  0x0000000000749d6e in UnlockReleaseBuffer (buffer=4638) at bufmgr.c:2551
#3  0x0000000000478bcc in entryGetNextItem (ginstate=0x2380100,
entry=0x23832a8) at ginget.c:555
#4  0x0000000000479346 in entryGetItem (ginstate=0x2380100,
entry=0x23832a8) at ginget.c:695
#5  0x000000000047987e in scanGetItem (scan=0x237fee8,
advancePast=0x7fffa1a46b80, item=0x7fffa1a46b80,
recheck=0x7fffa1a46b7f "\001") at ginget.c:925
#6  0x000000000047ae3f in gingetbitmap (fcinfo=0x7fffa1a46be0) at ginget.c:1540
#7  0x00000000008a9486 in FunctionCall2Coll (flinfo=0x236f558,
collation=0, arg1=37224168, arg2=37236160) at fmgr.c:1323
#8  0x00000000004bd091 in index_getbitmap (scan=0x237fee8,
bitmap=0x2382dc0) at indexam.c:649
#9  0x000000000064827c in MultiExecBitmapIndexScan (node=0x237fce0) at
nodeBitmapIndexscan.c:89
#10 0x00000000006313b6 in MultiExecProcNode (node=0x237fce0) at
execProcnode.c:550
#11 0x000000000064713a in BitmapHeapNext (node=0x237e998) at
nodeBitmapHeapscan.c:104
#12 0x000000000063c348 in ExecScanFetch (node=0x237e998,
accessMtd=0x6470ac <BitmapHeapNext>, recheckMtd=0x647cbc
<BitmapHeapRecheck>) at execScan.c:82
#13 0x000000000063c3bc in ExecScan (node=0x237e998, accessMtd=0x6470ac
<BitmapHeapNext>, recheckMtd=0x647cbc <BitmapHeapRecheck>) at
execScan.c:132
#14 0x0000000000647d3a in ExecBitmapHeapScan (node=0x237e998) at
nodeBitmapHeapscan.c:436
#15 0x0000000000631121 in ExecProcNode (node=0x237e998) at execProcnode.c:414
#16 0x0000000000644992 in agg_retrieve_direct (aggstate=0x237e200) at
nodeAgg.c:1073
#17 0x00000000006448cc in ExecAgg (node=0x237e200) at nodeAgg.c:1026
#18 0x0000000000631247 in ExecProcNode (node=0x237e200) at execProcnode.c:476
#19 0x000000000062ef2a in ExecutePlan (estate=0x237e0e8,
planstate=0x237e200, operation=CMD_SELECT, sendTuples=1 '\001',
numberTuples=0, direction=ForwardScanDirection, dest=0xd0c360) at
execMain.c:1474
#20 0x000000000062cfeb in standard_ExecutorRun (queryDesc=0x237c208,
direction=ForwardScanDirection, count=0) at execMain.c:308
#21 0x000000000062ce51 in ExecutorRun (queryDesc=0x237c208,
direction=ForwardScanDirection, count=0) at execMain.c:256
#22 0x00000000005ccc46 in ExplainOnePlan (plannedstmt=0x237c170,
into=0x0, es=0x7fffa1a47580, queryString=0x231d158 "explain (buffers,
analyze) select count(*) from test where a like '%tio%';", params=0x0) at explain.c:471
#23 0x00000000005cc8f1 in ExplainOneQuery (query=0x2353c10, into=0x0,
es=0x7fffa1a47580, queryString=0x231d158 "explain (buffers, analyze)
select count(*) from test where a like '%tio%';", params=0x0) at explain.c:327
#24 0x00000000005cc5d7 in ExplainQuery (stmt=0x231e2c8,
queryString=0x231d158 "explain (buffers, analyze) select count(*) from
test where a like '%tio%';", params=0x0, dest=0x2353b40) at explain.c:225
#25 0x0000000000784101 in standard_ProcessUtility (parsetree=0x231e2c8,
queryString=0x231d158 "explain (buffers, analyze) select count(*) from
test where a like '%tio%';", context=PROCESS_UTILITY_TOPLEVEL,
params=0x0, dest=0x2353b40, completionTag=0x7fffa1a477e0 "") at utility.c:687
#26 0x0000000000783835 in ProcessUtility (parsetree=0x231e2c8,
queryString=0x231d158 "explain (buffers, analyze) select count(*) from
test where a like '%tio%';", context=PROCESS_UTILITY_TOPLEVEL,
params=0x0, dest=0x2353b40, completionTag=0x7fffa1a477e0 "") at utility.c:352
#27 0x0000000000782771 in PortalRunUtility (portal=0x235aec8,
utilityStmt=0x231e2c8, isTopLevel=1 '\001', dest=0x2353b40,
completionTag=0x7fffa1a477e0 "") at pquery.c:1187
#28 0x00000000007824c0 in FillPortalStore (portal=0x235aec8,
isTopLevel=1 '\001') at pquery.c:1061
#29 0x0000000000781dc6 in PortalRun (portal=0x235aec8,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x231eef8,
altdest=0x231eef8, completionTag=0x7fffa1a479c0 "") at pquery.c:785
#30 0x000000000077be51 in exec_simple_query (query_string=0x231d158
"explain (buffers, analyze) select count(*) from test where a like
'%tio%';") at postgres.c:1054
#31 0x000000000078004f in PostgresMain (argc=1, argv=0x22b85d8,
dbname=0x22b8438 "postgres", username=0x22b8418 "amit") at
postgres.c:4011
#32 0x000000000071a811 in BackendRun (port=0x22d5c00) at postmaster.c:4085
#33 0x0000000000719f9b in BackendStartup (port=0x22d5c00) at postmaster.c:3774
#34 0x00000000007167c0 in ServerLoop () at postmaster.c:1585
#35 0x0000000000715eed in PostmasterMain (argc=3, argv=0x22b6350) at
postmaster.c:1240
#36 0x0000000000678890 in main (argc=3, argv=0x22b6350) at main.c:194


--
Amit



Re: GIN improvements part 1: additional information

От
Alexander Korotkov
Дата:
On Mon, Jan 6, 2014 at 12:35 PM, Amit Langote <amitlangote09@gmail.com> wrote:
On Sat, Dec 21, 2013 at 4:36 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
>
> Yet another version. The encoding/decoding code is now quite isolated in
> ginpostinglist.c, so it's easy to experiment with different encodings. This
> patch uses varbyte encoding again.
>
> I got a bit carried away, experimented with a bunch of different encodings.
> I tried rice encoding, rice encoding with block and offset number delta
> stored separately, the simple9 variant, and varbyte encoding.
>
> The compressed size obviously depends a lot on the distribution of the
> items, but in the test set I used, the differences between different
> encodings were quite small.
>
> One fatal problem with many encodings is VACUUM. If a page is completely
> full and you remove one item, the result must still fit. In other words,
> removing an item must never enlarge the space needed. Otherwise we have to
> be able to split on vacuum, which adds a lot of code, and also makes it
> possible for VACUUM to fail if there is no disk space left. That's
> unpleasant if you're trying to run VACUUM to release disk space. (gin fast
> updates already has that problem BTW, but let's not make it worse)
>
> I believe that eliminates all encodings in the Simple family, as well as
> PForDelta, and surprisingly also Rice encoding. For example, if you have
> three items in consecutive offsets, the differences between them are encoded
> as 11 in rice encoding. If you remove the middle item, the encoding for the
> next item becomes 010, which takes more space than the original.
>
> AFAICS varbyte encoding is safe from that. (a formal proof would be nice
> though).
>
> So, I'm happy to go with varbyte encoding now, indeed I don't think we have
> much choice unless someone can come up with an alternative that's
> VACUUM-safe. I have to put this patch aside for a while now, I spent a lot
> more time on these encoding experiments than I intended. If you could take a
> look at this latest version, spend some time reviewing it and cleaning up
> any obsolete comments, and re-run the performance tests you did earlier,
> that would be great. One thing I'm slightly worried about is the overhead of
> merging the compressed and uncompressed posting lists in a scan. This patch
> will be in good shape for the final commitfest, or even before that.
>


I just tried out the patch "gin-packed-postinglists-varbyte2.patch"
(which looks like the latest one in this thread) as follows:

1) Applied patch to the HEAD (on commit
94b899b829657332bda856ac3f06153d09077bd1)
2) Created a test table and index

create table test (a text);
copy test from '/usr/share/dict/words';
create index test_trgm_idx on test using gin (a gin_trgm_ops);

3) Got the following error on a wildcard query:

postgres=# explain (buffers, analyze) select count(*) from test where
a like '%tio%';
ERROR:  lock 9447 is not held
STATEMENT:  explain (buffers, analyze) select count(*) from test where
a like '%tio%';
ERROR:  lock 9447 is not held

Thanks for reporting. Fixed version is attached.

------
With best regards,
Alexander Korotkov.
Вложения

Re: GIN improvements part 1: additional information

От
Tomas Vondra
Дата:
On 8.1.2014 22:58, Alexander Korotkov wrote:
> Thanks for reporting. Fixed version is attached.

I've tried to rerun the 'archie' benchmark with the current patch, and
once again I got
  PANIC:  could not split GIN page, didn't fit

I reran it with '--enable-cassert' and with that I got

TRAP: FailedAssertion("!(ginCompareItemPointers(&items[i - 1], &items[i]) < 0)", File: "gindatapage.c", Line: 149)
LOG:  server process (PID 5364) was terminated by signal 6: Aborted
DETAIL:  Failed process was running: INSERT INTO messages ...

so the assert in GinDataLeafPageGetUncompressed fails for some reason.

I can easily reproduce it, but my knowledge in this area is rather
limited so I'm not entirely sure what to look for.

regards
Tomas



Re: GIN improvements part 1: additional information

От
Alexander Korotkov
Дата:
On Sat, Jan 11, 2014 at 6:15 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 8.1.2014 22:58, Alexander Korotkov wrote:
> Thanks for reporting. Fixed version is attached.

I've tried to rerun the 'archie' benchmark with the current patch, and
once again I got

   PANIC:  could not split GIN page, didn't fit

I reran it with '--enable-cassert' and with that I got

TRAP: FailedAssertion("!(ginCompareItemPointers(&items[i - 1],
                   &items[i]) < 0)", File: "gindatapage.c", Line: 149)
LOG:  server process (PID 5364) was terminated by signal 6: Aborted
DETAIL:  Failed process was running: INSERT INTO messages ...

so the assert in GinDataLeafPageGetUncompressed fails for some reason.

I can easily reproduce it, but my knowledge in this area is rather
limited so I'm not entirely sure what to look for.

I've fixed this bug and many other bugs. Now the patch passes the test suite that I used earlier. The results are as follows:

Operations time:
         event         |     period      
-----------------------+-----------------
 index_build           | 00:01:47.53915
 index_build_recovery  | 00:00:04
 index_update          | 00:05:24.388163
 index_update_recovery | 00:00:53
 search_new            | 00:24:02.289384
 search_updated        | 00:27:09.193343
(6 rows)

Index sizes:
     label     |   size    
---------------+-----------
 new           | 384761856
 after_updates | 667942912
(2 rows)

Also, I made the following changes in the algorithms:
  • Now there is a limit on the number of uncompressed TIDs on a page. After reaching this limit, they are encoded regardless of whether they would still fit on the page uncompressed. That seems like more desirable behaviour to me, and it somehow accelerates search speed. Before this change the times were as follows:
         event         |     period      
-----------------------+-----------------
 index_build           | 00:01:51.467888
 index_build_recovery  | 00:00:04
 index_update          | 00:05:03.315155
 index_update_recovery | 00:00:51
 search_new            | 00:24:43.194882
 search_updated        | 00:28:36.316784
(6 rows)
  • Pages are not fully re-encoded if it's enough to re-encode just the last segment.
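
A hypothetical sketch of the rule in the first item above; the threshold constant and helper names are illustrative, not the patch's actual identifiers:

#define GIN_MAX_UNCOMPRESSED_ITEMS 64   /* assumed threshold */

/* Once the uncompressed area has grown past the limit, re-encode even
 * though the plain TIDs would still fit on the page. */
if (nuncompressed + nnewitems > GIN_MAX_UNCOMPRESSED_ITEMS)
    recompressLeafPage(page, newItems, nnewitems);        /* hypothetical */
else
    appendToUncompressedArea(page, newItems, nnewitems);  /* hypothetical */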

README is updated.

------
With best regards,
Alexander Korotkov.

Attachments

Re: GIN improvements part 1: additional information

From
Tomas Vondra
Date:
On 13.1.2014 18:07, Alexander Korotkov wrote:
> On Sat, Jan 11, 2014 at 6:15 AM, Tomas Vondra <tv@fuzzy.cz
> <mailto:tv@fuzzy.cz>> wrote:
> 
>     On 8.1.2014 22:58, Alexander Korotkov wrote:
>     > Thanks for reporting. Fixed version is attached.
> 
>     I've tried to rerun the 'archie' benchmark with the current patch, and
>     once again I got
> 
>        PANIC:  could not split GIN page, didn't fit
> 
>     I reran it with '--enable-cassert' and with that I got
> 
>     TRAP: FailedAssertion("!(ginCompareItemPointers(&items[i - 1],
>                        &items[i]) < 0)", File: "gindatapage.c", Line: 149)
>     LOG:  server process (PID 5364) was terminated by signal 6: Aborted
>     DETAIL:  Failed process was running: INSERT INTO messages ...
> 
>     so the assert in GinDataLeafPageGetUncompressed fails for some reason.
> 
>     I can easily reproduce it, but my knowledge in this area is rather
>     limited so I'm not entirely sure what to look for.
> 
> 
> I've fixed this bug and many other bug. Now patch passes test suite that
> I've used earlier. The results are so:

OK, it seems the bug is gone. However now there's a memory leak
somewhere. I'm loading pgsql mailing list archives (~600k messages)
using this script
  https://bitbucket.org/tvondra/archie/src/1bbeb920/bin/load.py

And after loading about 1/5 of the data, all the memory gets filled by
the pgsql backends (loading the data in parallel) and the DB gets killed
by the OOM killer.

Tomas




Re: GIN improvements part 1: additional information

From
"Etsuro Fujita"
Date:
Peter Eisentraut wrote:
> On 12/10/13, 2:44 PM, Alexander Korotkov wrote:
> > However, patch didn't apply to head. Corrected version is attached.

> Update the pgstattuple regression test results.

The latest version of the patch still doesn't pass the test.

I'll look at the patch in further detail.

Thanks,

Best regards,
Etsuro Fujita




Re: GIN improvements part 1: additional information

From
Heikki Linnakangas
Date:
On 01/13/2014 07:07 PM, Alexander Korotkov wrote:
> I've fixed this bug and many other bug. Now patch passes test suite that
> I've used earlier. The results are so:
>
> Operations time:
>           event         |     period
> -----------------------+-----------------
>   index_build           | 00:01:47.53915
>   index_build_recovery  | 00:00:04
>   index_update          | 00:05:24.388163
>   index_update_recovery | 00:00:53
>   search_new            | 00:24:02.289384
>   search_updated        | 00:27:09.193343
> (6 rows)
>
> Index sizes:
>       label     |   size
> ---------------+-----------
>   new           | 384761856
>   after_updates | 667942912
> (2 rows)
>
> Also, I made following changes in algorithms:
>
>     - Now, there is a limit to number of uncompressed TIDs in the page.
>     After reaching this limit, they are encoded independent on if they can fit
>     page. That seems to me more desirable behaviour and somehow it accelerates
>     search speed. Before this change times were following:
>
>           event         |     period
> -----------------------+-----------------
>   index_build           | 00:01:51.467888
>   index_build_recovery  | 00:00:04
>   index_update          | 00:05:03.315155
>   index_update_recovery | 00:00:51
>   search_new            | 00:24:43.194882
>   search_updated        | 00:28:36.316784
> (6 rows)

Hmm, that's strange. Any idea why that's happening? One theory is that 
when you re-encode the pages more aggressively, there are fewer pages 
with a mix of packed and unpacked items. Mixed pages are somewhat slower 
to scan than fully packed or fully unpacked pages, because 
GinDataLeafPageGetItems() has to merge the packed and unpacked items 
into a single list. But I wouldn't expect that effect to be large enough 
to explain the results you got.

>     - Page are not fully re-encoded if it's enough to re-encode just last
>     segment.

Great! We should also take advantage of that in the WAL record that's 
written; no point in WAL-logging all the segments, if we know that only 
the last one was modified.

- Heikki



Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Tue, Jan 14, 2014 at 12:34 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 01/13/2014 07:07 PM, Alexander Korotkov wrote:
I've fixed this bug and many other bug. Now patch passes test suite that
I've used earlier. The results are so:

Operations time:
          event         |     period
-----------------------+-----------------
  index_build           | 00:01:47.53915
  index_build_recovery  | 00:00:04
  index_update          | 00:05:24.388163
  index_update_recovery | 00:00:53
  search_new            | 00:24:02.289384
  search_updated        | 00:27:09.193343
(6 rows)

Index sizes:
      label     |   size
---------------+-----------
  new           | 384761856
  after_updates | 667942912
(2 rows)

Also, I made following changes in algorithms:

    - Now, there is a limit to number of uncompressed TIDs in the page.

    After reaching this limit, they are encoded independent on if they can fit
    page. That seems to me more desirable behaviour and somehow it accelerates
    search speed. Before this change times were following:

          event         |     period
-----------------------+-----------------
  index_build           | 00:01:51.467888
  index_build_recovery  | 00:00:04
  index_update          | 00:05:03.315155
  index_update_recovery | 00:00:51
  search_new            | 00:24:43.194882
  search_updated        | 00:28:36.316784
(6 rows)

Hmm, that's strange. Any idea why that's happening? One theory is that when you re-encode the pages more aggressively, there are fewer pages with a mix of packed and unpacked items. Mixed pages are somewhat slower to scan than fully packed or fully unpacked pages, because GinDataLeafPageGetItems() has to merge the packed and unpacked items into a single list. But I wouldn't expect that effect to be large enough to explain the results you got.

Probably, it's because of less work in ginMergeItemPointers.

    - Page are not fully re-encoded if it's enough to re-encode just last
    segment.

Great! We should also take advantage of that in the WAL record that's written; no point in WAL-logging all the segments, if we know that only the last one was modified.

It already does.

------
With best regards,
Alexander Korotkov.

Re: GIN improvements part 1: additional information

From
Tomas Vondra
Date:
On 14.1.2014 00:38, Tomas Vondra wrote:
> On 13.1.2014 18:07, Alexander Korotkov wrote:
>> On Sat, Jan 11, 2014 at 6:15 AM, Tomas Vondra <tv@fuzzy.cz
>> <mailto:tv@fuzzy.cz>> wrote:
>>
>>     On 8.1.2014 22:58, Alexander Korotkov wrote:
>>     > Thanks for reporting. Fixed version is attached.
>>
>>     I've tried to rerun the 'archie' benchmark with the current patch, and
>>     once again I got
>>
>>        PANIC:  could not split GIN page, didn't fit
>>
>>     I reran it with '--enable-cassert' and with that I got
>>
>>     TRAP: FailedAssertion("!(ginCompareItemPointers(&items[i - 1],
>>                        &items[i]) < 0)", File: "gindatapage.c", Line: 149)
>>     LOG:  server process (PID 5364) was terminated by signal 6: Aborted
>>     DETAIL:  Failed process was running: INSERT INTO messages ...
>>
>>     so the assert in GinDataLeafPageGetUncompressed fails for some reason.
>>
>>     I can easily reproduce it, but my knowledge in this area is rather
>>     limited so I'm not entirely sure what to look for.
>>
>>
>> I've fixed this bug and many other bug. Now patch passes test suite that
>> I've used earlier. The results are so:
> 
> OK, it seems the bug is gone. However now there's a memory leak
> somewhere. I'm loading pgsql mailing list archives (~600k messages)
> using this script
> 
>    https://bitbucket.org/tvondra/archie/src/1bbeb920/bin/load.py
> 
> And after loading about 1/5 of the data, all the memory gets filled by
> the pgsql backends (loading the data in parallel) and the DB gets killed
> by the OOM killer.

I've spent a fair amount of time trying to locate the memory leak, but
so far no luck. I'm not sufficiently familiar with the GIN code.

I can however demonstrate that it's there, and I have a rather simple test
case to reproduce it - basically just a CREATE INDEX on a table with ~1M
email message bodies (in a tsvector column). The data is available here
(360MB compressed, 1GB raw):
  http://www.fuzzy.cz/tmp/message-b.data.gz

Simply create a single-column table, load data and create the index
   CREATE TABLE test ( body_tsvector TSVECTOR );
   COPY test FROM '/tmp/message-b.data';
   CREATE INDEX test_idx ON test USING gin ( body_tsvector );
 

I'm running this on a machine with 8GB of RAM, with these settings
   shared_buffers=1GB
   maintenance_work_mem=1GB

According to top, CREATE INDEX from the current HEAD never consumes more
than ~25% of RAM:
    PID USER      PR  NI    VIRT    RES    SHR  %CPU %MEM  COMMAND
  32091 tomas     20   0 2026032 1,817g 1,040g  56,2 23,8  postgres
 

which is about right, as (shared_buffers + maintenance_work_mem) is
about 1/4 of RAM.

With the v5 patch version applied, the CREATE INDEX process eventually
goes crazy and allocates almost all the available memory (but sometimes
finishes, mostly by pure luck). This is what I was able to get from top
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM  COMMAND
  14090 tomas     20   0 7913820 6,962g 955036 D   4,3 91,1  postgres
 

while the system was still reasonably responsive.

regards
Tomas



Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Wed, Jan 15, 2014 at 5:17 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 14.1.2014 00:38, Tomas Vondra wrote:
> On 13.1.2014 18:07, Alexander Korotkov wrote:
>> On Sat, Jan 11, 2014 at 6:15 AM, Tomas Vondra <tv@fuzzy.cz
>> <mailto:tv@fuzzy.cz>> wrote:
>>
>>     On 8.1.2014 22:58, Alexander Korotkov wrote:
>>     > Thanks for reporting. Fixed version is attached.
>>
>>     I've tried to rerun the 'archie' benchmark with the current patch, and
>>     once again I got
>>
>>        PANIC:  could not split GIN page, didn't fit
>>
>>     I reran it with '--enable-cassert' and with that I got
>>
>>     TRAP: FailedAssertion("!(ginCompareItemPointers(&items[i - 1],
>>                        &items[i]) < 0)", File: "gindatapage.c", Line: 149)
>>     LOG:  server process (PID 5364) was terminated by signal 6: Aborted
>>     DETAIL:  Failed process was running: INSERT INTO messages ...
>>
>>     so the assert in GinDataLeafPageGetUncompressed fails for some reason.
>>
>>     I can easily reproduce it, but my knowledge in this area is rather
>>     limited so I'm not entirely sure what to look for.
>>
>>
>> I've fixed this bug and many other bug. Now patch passes test suite that
>> I've used earlier. The results are so:
>
> OK, it seems the bug is gone. However now there's a memory leak
> somewhere. I'm loading pgsql mailing list archives (~600k messages)
> using this script
>
>    https://bitbucket.org/tvondra/archie/src/1bbeb920/bin/load.py
>
> And after loading about 1/5 of the data, all the memory gets filled by
> the pgsql backends (loading the data in parallel) and the DB gets killed
> by the OOM killer.

I've spent a fair amount of time trying to locate the memory leak, but
so far no luck. I'm not sufficiently familiar with the GIN code.

I can however demonstrate that it's there, and I have rather simple test
case to reproduce it - basically just a CREATE INDEX on a table with ~1M
email message bodies (in a tsvector column). The data is available here
(360MB compressed, 1GB raw):

   http://www.fuzzy.cz/tmp/message-b.data.gz

Simply create a single-column table, load data and create the index

   CREATE TABLE test ( body_tsvector TSVECTOR );
   COPY test FROM '/tmp/message-b.data';
   CREATE INDEX test_idx ON test USING gin ( body_tsvector );

I'm running this on a machine with 8GB of RAM, with these settings

   shared_buffers=1GB
   maintenance_work_mem=1GB

According to top, CREATE INDEX from the current HEAD never consumes more
than ~25% of RAM:

    PID USER      PR  NI    VIRT    RES    SHR  %CPU %MEM  COMMAND
  32091 tomas     20   0 2026032 1,817g 1,040g  56,2 23,8  postgres

which is about right, as (shared_buffers + maintenance_work_mem) is
about 1/4 of RAM.

With the v5 patch version applied, the CREATE INDEX process eventually
goes crazy and allocates almost all the available memory (but somesimes
finishes, mostly by pure luck). This is what I was able to get from top

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM  COMMAND
  14090 tomas     20   0 7913820 6,962g 955036 D   4,3 91,1  postgres

while the system was still reasonably responsive.

Thanks a lot for your help! I believe the problem is that each decompressed item pointer array is palloc'd but not freed. I hope to fix it today. 
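
A minimal sketch of the kind of fix implied here, assuming a decode helper along the lines of the eventual ginPostingListDecode; the names may differ from the patch:

/* Each decode palloc's a fresh array; without an explicit pfree these
 * accumulate in a long-lived context for the whole CREATE INDEX. */
int         ndecoded;
ItemPointer items = ginPostingListDecode(seg, &ndecoded);

/* ... merge/scan items ... */

pfree(items);       /* the suspected missing free */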

------
With best regards,
Alexander Korotkov.

Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Wed, Jan 15, 2014 at 10:47 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Wed, Jan 15, 2014 at 5:17 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 14.1.2014 00:38, Tomas Vondra wrote:
> On 13.1.2014 18:07, Alexander Korotkov wrote:
>> On Sat, Jan 11, 2014 at 6:15 AM, Tomas Vondra <tv@fuzzy.cz
>> <mailto:tv@fuzzy.cz>> wrote:
>>
>>     On 8.1.2014 22:58, Alexander Korotkov wrote:
>>     > Thanks for reporting. Fixed version is attached.
>>
>>     I've tried to rerun the 'archie' benchmark with the current patch, and
>>     once again I got
>>
>>        PANIC:  could not split GIN page, didn't fit
>>
>>     I reran it with '--enable-cassert' and with that I got
>>
>>     TRAP: FailedAssertion("!(ginCompareItemPointers(&items[i - 1],
>>                        &items[i]) < 0)", File: "gindatapage.c", Line: 149)
>>     LOG:  server process (PID 5364) was terminated by signal 6: Aborted
>>     DETAIL:  Failed process was running: INSERT INTO messages ...
>>
>>     so the assert in GinDataLeafPageGetUncompressed fails for some reason.
>>
>>     I can easily reproduce it, but my knowledge in this area is rather
>>     limited so I'm not entirely sure what to look for.
>>
>>
>> I've fixed this bug and many other bug. Now patch passes test suite that
>> I've used earlier. The results are so:
>
> OK, it seems the bug is gone. However now there's a memory leak
> somewhere. I'm loading pgsql mailing list archives (~600k messages)
> using this script
>
>    https://bitbucket.org/tvondra/archie/src/1bbeb920/bin/load.py
>
> And after loading about 1/5 of the data, all the memory gets filled by
> the pgsql backends (loading the data in parallel) and the DB gets killed
> by the OOM killer.

I've spent a fair amount of time trying to locate the memory leak, but
so far no luck. I'm not sufficiently familiar with the GIN code.

I can however demonstrate that it's there, and I have rather simple test
case to reproduce it - basically just a CREATE INDEX on a table with ~1M
email message bodies (in a tsvector column). The data is available here
(360MB compressed, 1GB raw):

   http://www.fuzzy.cz/tmp/message-b.data.gz

Simply create a single-column table, load data and create the index

   CREATE TABLE test ( body_tsvector TSVECTOR );
   COPY test FROM '/tmp/message-b.data';
   CREATE INDEX test_idx ON test USING gin ( body_tsvector );

I'm running this on a machine with 8GB of RAM, with these settings

   shared_buffers=1GB
   maintenance_work_mem=1GB

According to top, CREATE INDEX from the current HEAD never consumes more
than ~25% of RAM:

    PID USER      PR  NI    VIRT    RES    SHR  %CPU %MEM  COMMAND
  32091 tomas     20   0 2026032 1,817g 1,040g  56,2 23,8  postgres

which is about right, as (shared_buffers + maintenance_work_mem) is
about 1/4 of RAM.

With the v5 patch version applied, the CREATE INDEX process eventually
goes crazy and allocates almost all the available memory (but somesimes
finishes, mostly by pure luck). This is what I was able to get from top

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM  COMMAND
  14090 tomas     20   0 7913820 6,962g 955036 D   4,3 91,1  postgres

while the system was still reasonably responsive.

Thanks a lot for your help! I believe the problem is that each decompressed item pointer array is palloc'd but not freed. I hope to fix it today. 

Seems to be fixed in the attached version of the patch.
Another improvement in this version: only the segments from the leftmost modified one onwards are re-encoded; the part of the page to the left of that is left untouched.

------
With best regards,
Alexander Korotkov. 
Attachments

Re: GIN improvements part 1: additional information

From
Heikki Linnakangas
Date:
On 01/17/2014 01:05 PM, Alexander Korotkov wrote:
> Seems to be fixed in the attached version of patch.
> Another improvement in this version: only left-most segments and further
> are re-encoded. Left part of page are left untouched.

I'm looking into this now. A few quick notes:

* Even when you don't re-encode the whole page, you still WAL-log the 
whole page. While correct, it'd be a pretty obvious optimization to only 
WAL-log the modified part.

* When new items are appended to the end of the page so that only the 
last existing compressed segment is re-encoded, and the page has to be 
split, the items aren't divided 50/50 on the pages. The loop that moves 
segments destined for the left page to the right won't move any 
existing, untouched, segments.

> !     /*
> !      * Didn't fit uncompressed. We'll have to encode them. Check if both
> !      * new items and uncompressed items can be placed starting from last
> !      * segment of page. Then re-encode only last segment of page.
> !      */
> !     minNewItem = newItems[0];
> !     if (nolduncompressed == 0 &&
> !             ginCompareItemPointers(&olduncompressed[0], &minNewItem) < 0)
> !         minNewItem = olduncompressed[0];

That looks wrong. If I'm understanding it right, it's trying to do 
minNewItem = Min(newItems[0], olduncompressed[0]). The test should be 
"nolduncompressed > 0 && ..."

No need to send a new patch, I'll just fix those while reviewing...

- Heikki



Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Fri, Jan 17, 2014 at 10:38 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 01/17/2014 01:05 PM, Alexander Korotkov wrote:
Seems to be fixed in the attached version of patch.
Another improvement in this version: only left-most segments and further
are re-encoded. Left part of page are left untouched.

I'm looking into this now. A few quick notes:

* Even when you don't re-encode the whole page, you still WAL-log the whole page. While correct, it'd be a pretty obvious optimization to only WAL-log the modified part.

Oh, I overlooked it. I wrote code in ginRedoInsertData which finds the appropriate place for the data, but didn't notice that I still wrote the whole page to the xlog record.
 
* When new items are appended to the end of the page so that only the last existing compressed segment is re-encoded, and the page has to be split, the items aren't divided 50/50 on the pages. The loop that moves segments destined for the left page to the right won't move any existing, untouched, segments.

I think this loop will move one segment. However, that's too little.
 
!       /*
!        * Didn't fit uncompressed. We'll have to encode them. Check if both
!        * new items and uncompressed items can be placed starting from last
!        * segment of page. Then re-encode only last segment of page.
!        */
!       minNewItem = newItems[0];
!       if (nolduncompressed == 0 &&
!                       ginCompareItemPointers(&olduncompressed[0], &minNewItem) < 0)
!               minNewItem = olduncompressed[0];

That looks wrong. If I'm understanding it right, it's trying to do minNewItem = Min(newItems[0], olduncompressed[0]). The test should be "nolduncompressed > 0 && ..."

Yes, definitely a bug.
 
No need to send a new patch, I'll just fix those while reviewing...

Ok, thanks!

------
With best regards,
Alexander Korotkov.

Re: GIN improvements part 1: additional information

From
Heikki Linnakangas
Date:
On 01/17/2014 08:49 PM, Alexander Korotkov wrote:
> On Fri, Jan 17, 2014 at 10:38 PM, Heikki Linnakangas <
> hlinnakangas@vmware.com> wrote:
>
>> On 01/17/2014 01:05 PM, Alexander Korotkov wrote:
>>
>>> Seems to be fixed in the attached version of patch.
>>> Another improvement in this version: only left-most segments and further
>>> are re-encoded. Left part of page are left untouched.
>>
>> I'm looking into this now. A few quick notes:
>>
>> * Even when you don't re-encode the whole page, you still WAL-log the
>> whole page. While correct, it'd be a pretty obvious optimization to only
>> WAL-log the modified part.
>
> Oh, I overlooked it. I wrote code in ginRedoInsertData which finds
> appropriate place fro data but didn't notice that I still wrote whole page
> to xlog record.

Yeah. I think ginRedoInsertData was too cute to be bug-free. I added an
explicit unmodifiedsize field to the WAL record, so that
ginRedoInsertData doesn't need to calculate it.
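
Roughly the record layout being described; the struct and field names below are a guess at this stage of the thread, not a committed WAL format:

typedef struct
{
    OffsetNumber length;          /* size of the posting-list area after insert */
    OffsetNumber unmodifiedsize;  /* leading bytes left untouched */
    /* (length - unmodifiedsize) bytes of re-encoded data follow */
} ginxlogInsertDataLeaf;

/* redo side: overwrite only the modified tail, keep the untouched prefix */
memcpy((char *) GinDataLeafPageGetPostingList(page) + xlrec->unmodifiedsize,
       payload, xlrec->length - xlrec->unmodifiedsize);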

>> * When new items are appended to the end of the page so that only the last
>> existing compressed segment is re-encoded, and the page has to be split,
>> the items aren't divided 50/50 on the pages. The loop that moves segments
>> destined for the left page to the right won't move any existing, untouched,
>> segments.
>
> I think this loop will move one segment. However, it's too small.

Implemented this.

I noticed that the gin vacuum redo routine is dead code, except for the
data-leaf page handling, because we never remove entries or internal
nodes (page deletion is a separate wal record type). And the data-leaf
case is functionally equivalent to heap newpage records. I removed the
dead code and made it more clear that it resembles heap newpage.

Attached is a yet another version, with more bugs fixed and more
comments added and updated. I would appreciate some heavy-testing of
this patch now. If you could re-run the tests you've been using, that
could be great. I've tested the WAL replay by replicating GIN operations
over streaming replication. That doesn't guarantee it's correct, but
it's a good smoke test.

- Heikki

Attachments

Re: GIN improvements part 1: additional information

From
Tomas Vondra
Date:
On 20.1.2014 19:30, Heikki Linnakangas wrote:
> 
> Attached is a yet another version, with more bugs fixed and more 
> comments added and updated. I would appreciate some heavy-testing of 
> this patch now. If you could re-run the tests you've been using,
> that could be great. I've tested the WAL replay by replicating GIN
> operations over streaming replication. That doesn't guarantee it's
> correct, but it's a good smoke test.

I gave it a try - the OOM error seems to be gone, but now I get this
  PANIC:  cannot insert duplicate items to GIN index page

This only happens when building the index incrementally (i.e. using a
sequence of INSERT statements into a table with GIN index). When I
create a new index on a table (already containing the same dataset) it
works just fine.

Also, I tried to reproduce the issue by running a simple plpgsql loop
(instead of a complex python script):

DO LANGUAGE plpgsql $$
DECLARE
    r tsvector;
BEGIN
    FOR r IN SELECT body_tsvector FROM data_table LOOP
        INSERT INTO idx_table (body_tsvector) VALUES (r);
    END LOOP;
END$$;

where data_table is the table with imported data (the same data I
mentioned in the post about OOM errors), and idx_table is an empty
table with a GIN index. And indeed it fails, but only if I run the block
in multiple sessions in parallel.

regards
Tomas



Re: GIN improvements part 1: additional information

From
Heikki Linnakangas
Date:
On 01/21/2014 04:02 AM, Tomas Vondra wrote:
> On 20.1.2014 19:30, Heikki Linnakangas wrote:
>>
>> Attached is a yet another version, with more bugs fixed and more
>> comments added and updated. I would appreciate some heavy-testing of
>> this patch now. If you could re-run the tests you've been using,
>> that could be great. I've tested the WAL replay by replicating GIN
>> operations over streaming replication. That doesn't guarantee it's
>> correct, but it's a good smoke test.
>
> I gave it a try - the OOM error seems to be gone, but now get this
>
>     PANIC:  cannot insert duplicate items to GIN index page
>
> This only happens when building the index incrementally (i.e. using a
> sequence of INSERT statements into a table with GIN index). When I
> create a new index on a table (already containing the same dataset) it
> works just fine.
>
> Also, I tried to reproduce the issue by running a simple plpgsql loop
> (instead of a complex python script):
>
> DO LANGUAGE plpgsql $$
> DECLARE
>      r tsvector;
> BEGIN
>      FOR r IN SELECT body_tsvector FROM data_table LOOP
>          INSERT INTO idx_table (body_tsvector) VALUES (r);
>      END LOOP;
> END$$;
>
> where data_table is the table with imported data (the same data I
> mentioned in the post about OOM errors), and index_table is an empty
> table with a GIN index. And indeed it fails, but only if I run the block
> in multiple sessions in parallel.

Oh, I see what's going on. I had assumed that there cannot be duplicate 
insertions into the posting tree, but that's dead wrong. The fast 
insertion mechanism depends on a duplicate insertion to do nothing.

Will fix, thanks for the testing!

- Heikki



Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Mon, Jan 20, 2014 at 10:30 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 01/17/2014 08:49 PM, Alexander Korotkov wrote:
On Fri, Jan 17, 2014 at 10:38 PM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:

On 01/17/2014 01:05 PM, Alexander Korotkov wrote:

Seems to be fixed in the attached version of patch.
Another improvement in this version: only left-most segments and further
are re-encoded. Left part of page are left untouched.

I'm looking into this now. A few quick notes:

* Even when you don't re-encode the whole page, you still WAL-log the
whole page. While correct, it'd be a pretty obvious optimization to only
WAL-log the modified part.

Oh, I overlooked it. I wrote code in ginRedoInsertData which finds
appropriate place fro data but didn't notice that I still wrote whole page
to xlog record.

Yeah. I think ginRedoInsertData was too cute to be bug-free. I added an explicit unmodifiedsize field to the WAL record, so that ginRedoInsertData doesn't need to calculate it.


* When new items are appended to the end of the page so that only the last
existing compressed segment is re-encoded, and the page has to be split,
the items aren't divided 50/50 on the pages. The loop that moves segments
destined for the left page to the right won't move any existing, untouched,
segments.

I think this loop will move one segment. However, it's too small.

Implemented this.

I noticed that the gin vacuum redo routine is dead code, except for the data-leaf page handling, because we never remove entries or internal nodes (page deletion is a separate wal record type). And the data-leaf case is functionally equivalent to heap newpage records. I removed the dead code and made it more clear that it resembles heap newpage.

Attached is a yet another version, with more bugs fixed and more comments added and updated. I would appreciate some heavy-testing of this patch now. If you could re-run the tests you've been using, that could be great. I've tested the WAL replay by replicating GIN operations over streaming replication. That doesn't guarantee it's correct, but it's a good smoke test.

I tried my test suite but it hangs in an infinite loop during an index scan. I re-tried it on my laptop with -O0. I found it crashes on update and vacuum in some random places like:
Assert(GinPageIsData(page)); in xlogVacuumPage
Assert(ndecoded == totalpacked); in ginCompressPostingList
Trying to debug it.

------
With best regards,
Alexander Korotkov.  

Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Tue, Jan 21, 2014 at 4:28 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
I noticed that the gin vacuum redo routine is dead code, except for the data-leaf page handling, because we never remove entries or internal nodes (page deletion is a separate wal record type). And the data-leaf case is functionally equivalent to heap newpage records. I removed the dead code and made it more clear that it resembles heap newpage.

Attached is a yet another version, with more bugs fixed and more comments added and updated. I would appreciate some heavy-testing of this patch now. If you could re-run the tests you've been using, that could be great. I've tested the WAL replay by replicating GIN operations over streaming replication. That doesn't guarantee it's correct, but it's a good smoke test.

I tried my test suite but it hangs in an infinite loop during an index scan. I re-tried it on my laptop with -O0. I found it crashes on update and vacuum in some random places like:
Assert(GinPageIsData(page)); in xlogVacuumPage
Assert(ndecoded == totalpacked); in ginCompressPostingList
Trying to debug it.

Another question is about dataPlaceToPageLeaf:

while ((Pointer) seg < segend)
{
    if (ginCompareItemPointers(&minNewItem, &seg->first) < 0)
        break;

Shouldn't we adjust seg to the previous segment? If minNewItem is less than seg->first, we should insert it into the previous segment.
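
A sketch of the suggested adjustment, assuming a GinNextPostingListSegment-style iterator; purely illustrative:

/* Track the previous segment while scanning, so the insertion lands in
 * the segment whose key range actually covers minNewItem. */
GinPostingList *prevseg = NULL;

while ((Pointer) seg < segend)
{
    if (ginCompareItemPointers(&minNewItem, &seg->first) < 0)
        break;                  /* minNewItem sorts before this segment */
    prevseg = seg;
    seg = GinNextPostingListSegment(seg);
}
if (prevseg != NULL)
    seg = prevseg;              /* insert into the covering segment */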

------
With best regards,
Alexander Korotkov.  

Re: GIN improvements part 1: additional information

From
Heikki Linnakangas
Date:
On 01/21/2014 11:35 AM, Heikki Linnakangas wrote:
> On 01/21/2014 04:02 AM, Tomas Vondra wrote:
>> On 20.1.2014 19:30, Heikki Linnakangas wrote:
>>>
>>> Attached is a yet another version, with more bugs fixed and more
>>> comments added and updated. I would appreciate some heavy-testing of
>>> this patch now. If you could re-run the tests you've been using,
>>> that could be great. I've tested the WAL replay by replicating GIN
>>> operations over streaming replication. That doesn't guarantee it's
>>> correct, but it's a good smoke test.
>>
>> I gave it a try - the OOM error seems to be gone, but now get this
>>
>>      PANIC:  cannot insert duplicate items to GIN index page
>>
>> This only happens when building the index incrementally (i.e. using a
>> sequence of INSERT statements into a table with GIN index). When I
>> create a new index on a table (already containing the same dataset) it
>> works just fine.
>>
>> Also, I tried to reproduce the issue by running a simple plpgsql loop
>> (instead of a complex python script):
>>
>> DO LANGUAGE plpgsql $$
>> DECLARE
>>       r tsvector;
>> BEGIN
>>       FOR r IN SELECT body_tsvector FROM data_table LOOP
>>           INSERT INTO idx_table (body_tsvector) VALUES (r);
>>       END LOOP;
>> END$$;
>>
>> where data_table is the table with imported data (the same data I
>> mentioned in the post about OOM errors), and index_table is an empty
>> table with a GIN index. And indeed it fails, but only if I run the block
>> in multiple sessions in parallel.
>
> Oh, I see what's going on. I had assumed that there cannot be duplicate
> insertions into the posting tree, but that's dead wrong. The fast
> insertion mechanism depends on a duplicate insertion to do nothing.

Ok, this turned out to be a much bigger change than I thought...

In principle, it's not difficult to eliminate duplicates. However, to
detect a duplicate, you have to check if the item is present in the
uncompressed items array, or in the compressed lists. That requires
decoding the segment where it should be.

But if we decode the segment, what's the purpose of the uncompressed
items array? Its purpose was to speed up insertions, by buffering them
so that we don't need to decode and re-encode the page/segment on every
inserted item. But if we're paying the price of decoding it anyway, we
might as well include the new item and re-encode the segment. The
uncompressed area saves some effort in WAL-logging, as the record of
inserting an entry into the uncompressed area is much smaller than that
of re-encoding part of the page, but if that really is a concern, we
could track more carefully which parts of the page are modified, and only
WAL-log the required parts. And hopefully, the fast-update lists buffer
inserts so that you insert many items at a time to the posting tree, and
the price of re-encoding is only paid once.

So, now that I think about this once more, I don't think the
uncompressed area in every leaf page is a good idea.

I refactored the way the recompression routine works again. It is now
more clearly a multi-step process. First, the existing page is
"disassembled" into an in-memory struct, which is a linked list of all
the segments. In-memory each segment can be represented as an array of
item pointers, or in the compressed format. In the next phase, all the
new items are added to the segments where they belong. This naturally
can lead to overly large segments; in the third phase, the items are
redistributed among the segments, and compressed back to the compressed
format. Finally, the finished segments are written back to the page, or
pages if it had to be split.

The same subroutines to disassemble and recompress are also used in vacuum.
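
In outline, the multi-step flow just described looks like this; the struct and function names follow the spirit of the description but are assumptions, not necessarily the attached code's identifiers:

/* One list node per segment; each can be held compressed or decoded. */
typedef struct
{
    dlist_node      node;
    GinPostingList *seg;        /* compressed form, or NULL          */
    ItemPointer     items;      /* decoded form, or NULL             */
    int             nitems;
    bool            modified;   /* needs re-encoding and WAL-logging */
} leafSegmentInfo;

dlist_head  segments;

disassembleLeaf(page, &segments);              /* 1: page -> segment list  */
addItemsToLeaf(&segments, newItems, nitems);   /* 2: merge in the new TIDs */
leafRepackItems(&segments);                    /* 3: redistribute + encode */
writeSegmentsBack(buf, &segments);             /* 4: write back, splitting the page if needed */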

Attached is what I've got now. This is again quite heavily-changed from
the previous version - sorry for repeatedly rewriting this. I'll
continue testing and re-reviewing this tomorrow.

- Heikki

Attachments

Re: GIN improvements part 1: additional information

From
Tomas Vondra
Date:
On 21.1.2014 22:21, Heikki Linnakangas wrote:
> On 01/21/2014 11:35 AM, Heikki Linnakangas wrote:
>> On 01/21/2014 04:02 AM, Tomas Vondra wrote:
>>> On 20.1.2014 19:30, Heikki Linnakangas wrote:
>>>>
>>>> Attached is a yet another version, with more bugs fixed and more
>>>> comments added and updated. I would appreciate some heavy-testing of
>>>> this patch now. If you could re-run the tests you've been using,
>>>> that could be great. I've tested the WAL replay by replicating GIN
>>>> operations over streaming replication. That doesn't guarantee it's
>>>> correct, but it's a good smoke test.
>>>
>>> I gave it a try - the OOM error seems to be gone, but now get this
>>>
>>>      PANIC:  cannot insert duplicate items to GIN index page
>>>
>>> This only happens when building the index incrementally (i.e. using a
>>> sequence of INSERT statements into a table with GIN index). When I
>>> create a new index on a table (already containing the same dataset) it
>>> works just fine.
>>>
>>> Also, I tried to reproduce the issue by running a simple plpgsql loop
>>> (instead of a complex python script):
>>>
>>> DO LANGUAGE plpgsql $$
>>> DECLARE
>>>       r tsvector;
>>> BEGIN
>>>       FOR r IN SELECT body_tsvector FROM data_table LOOP
>>>           INSERT INTO idx_table (body_tsvector) VALUES (r);
>>>       END LOOP;
>>> END$$;
>>>
>>> where data_table is the table with imported data (the same data I
>>> mentioned in the post about OOM errors), and index_table is an empty
>>> table with a GIN index. And indeed it fails, but only if I run the block
>>> in multiple sessions in parallel.
>>
>> Oh, I see what's going on. I had assumed that there cannot be duplicate
>> insertions into the posting tree, but that's dead wrong. The fast
>> insertion mechanism depends on a duplicate insertion to do nothing.
> 
> Ok, this turned out to be a much bigger change than I thought...
> 
> In principle, it's not difficult to eliminate duplicates. However, to
> detect a duplicate, you have to check if the item is present in the
> uncompressed items array, or in the compressed lists. That requires
> decoding the segment where it should be.
> 
> But if we decode the segment, what's the purpose of the uncompressed
> items array? Its purpose was to speed up insertions, by buffering them
> so that we don't need to decode and re-encode the page/segment on every
> inserted item. But if we're paying the price of decoding it anyway, we
> might as well include the new item and re-encode the segment. The
> uncompressed area saves some effort in WAL-logging, as the record of
> inserting an entry into the uncompressed area is much smaller than that
> of re-encoding part of the page, but if that really is a concern, we
> could track more carefully which parts of the page is modified, and only
> WAL-log the required parts. And hopefully, the fast-update lists buffer
> inserts so that you insert many items at a time to the posting tree, and
> the price of re-encoding is only paid once.
> 
> So, now that I think about this once more, I don't think the
> uncompressed area in every leaf page is a good idea.
> 
> I refactored the way the recompression routine works again. It is now
> more clearly a multi-step process. First, the existing page is
> "disassembled" into an in-memory struct, which is a linked list of all
> the segments. In-memory each segment can be represented as an array of
> item pointers, or in the compressed format. In the next phase, all the
> new items are added to the segments where they belong. This naturally
> can lead to overly large segments; in the third phase, the items are
> redistributed among the segments, and compressed back to the compressed
> format. Finally, the finished segments are written back to the page, or
> pages if it had to be split.
> 
> The same subroutines to disassemble and recompress are also used in vacuum.
> 
> Attached is what I've got now. This is again quite heavily-changed from
> the previous version - sorry for repeatedly rewriting this. I'll
> continue testing and re-reviewing this tomorrow.

I've repeated the tests, and it actually finishes OK with this patch.
The incremental load works OK, does not fail because of OOM or PANIC
errors, or any other issues. Subsequent querying seems to work fine too.

The sizes before / after applying the patch look like this:
  object  | raw size (old) | raw size (new) |   diff
 ---------+----------------+----------------+--------
  table   |           6093 |           6093 |  00.0%
  index A |           2288 |           2035 | -11.0%
  index B |             96 |             96 |  00.0%
 

and after running VACUUM FULL
  object  | vacuumed (old) | vacuumed (new) |   diff
 ---------+----------------+----------------+--------
  table   |           6416 |           6416 |  00.0%
  index A |            678 |            415 | -38.8%
  index B |             42 |             20 | -52.4%
 

The results are slightly different than in my previous runs - for
example in my post from 2013/10/12 I posted this (after vacuuming)
                  |   HEAD   |  patched |
   =========================================
   subject index  |   42 MB  |   15 MB  | -75%
   body index     |  624 MB  |  328 MB  | -47%
 

It was collected with slightly different test code / data, but I believe
the ratios should be about the same. However the differences are rather
significant. Is that because of the recent changes in encoding? Could
that be improved somehow?

What annoys me a bit is the huge size difference between the index
updated incrementally (by a sequence of INSERT commands), and the index
rebuilt from scratch using VACUUM FULL. It's a bit better with the patch
(2288 vs. 2035 MB), but is there a chance to improve this?

Anyway, is it safe to apply the second patch on top of this one?

regards
Tomas



Re: GIN improvements part 1: additional information

От
Alexander Korotkov
Дата:
On Wed, Jan 22, 2014 at 1:21 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 01/21/2014 11:35 AM, Heikki Linnakangas wrote:
On 01/21/2014 04:02 AM, Tomas Vondra wrote:
On 20.1.2014 19:30, Heikki Linnakangas wrote:

Attached is a yet another version, with more bugs fixed and more
comments added and updated. I would appreciate some heavy-testing of
this patch now. If you could re-run the tests you've been using,
that could be great. I've tested the WAL replay by replicating GIN
operations over streaming replication. That doesn't guarantee it's
correct, but it's a good smoke test.

I gave it a try - the OOM error seems to be gone, but now get this

     PANIC:  cannot insert duplicate items to GIN index page

This only happens when building the index incrementally (i.e. using a
sequence of INSERT statements into a table with GIN index). When I
create a new index on a table (already containing the same dataset) it
works just fine.

Also, I tried to reproduce the issue by running a simple plpgsql loop
(instead of a complex python script):

DO LANGUAGE plpgsql $$
DECLARE
      r tsvector;
BEGIN
      FOR r IN SELECT body_tsvector FROM data_table LOOP
          INSERT INTO idx_table (body_tsvector) VALUES (r);
      END LOOP;
END$$;

where data_table is the table with imported data (the same data I
mentioned in the post about OOM errors), and index_table is an empty
table with a GIN index. And indeed it fails, but only if I run the block
in multiple sessions in parallel.

Oh, I see what's going on. I had assumed that there cannot be duplicate
insertions into the posting tree, but that's dead wrong. The fast
insertion mechanism depends on a duplicate insertion to do nothing.

Ok, this turned out to be a much bigger change than I thought...

In principle, it's not difficult to eliminate duplicates. However, to detect a duplicate, you have to check if the item is present in the uncompressed items array, or in the compressed lists. That requires decoding the segment where it should be.

But if we decode the segment, what's the purpose of the uncompressed items array? Its purpose was to speed up insertions, by buffering them so that we don't need to decode and re-encode the page/segment on every inserted item. But if we're paying the price of decoding it anyway, we might as well include the new item and re-encode the segment. The uncompressed area saves some effort in WAL-logging, as the record of inserting an entry into the uncompressed area is much smaller than that of re-encoding part of the page, but if that really is a concern, we could track more carefully which parts of the page is modified, and only WAL-log the required parts. And hopefully, the fast-update lists buffer inserts so that you insert many items at a time to the posting tree, and the price of re-encoding is only paid once.

So, now that I think about this once more, I don't think the uncompressed area in every leaf page is a good idea.
 
I didn't get it: why not insert duplicate TIDs into the uncompressed area and eliminate them on re-encoding? It would be somewhat more work during updates, more work during scans, and more WAL records. But is it really significant for real-life workloads?

I refactored the way the recompression routine works again. It is now more clearly a multi-step process. First, the existing page is "disassembled" into an in-memory struct, which is a linked list of all the segments. In-memory each segment can be represented as an array of item pointers, or in the compressed format. In the next phase, all the new items are added to the segments where they belong. This naturally can lead to overly large segments; in the third phase, the items are redistributed among the segments, and compressed back to the compressed format. Finally, the finished segments are written back to the page, or pages if it had to be split.

The same subroutines to disassemble and recompress are also used in vacuum.

Attached is what I've got now. This is again quite heavily-changed from the previous version - sorry for repeatedly rewriting this. I'll continue testing and re-reviewing this tomorrow.

Let's clarify where we are. We had a quite thoroughly debugged and tested version with hard-wired varbyte encoding. Now we're reimplementing and debugging a segmented version of varbyte encoding, in the hope that one day we can easily replace it with something better that meets our restrictions (though we haven't found that something yet). Is that right?

------
With best regards,
Alexander Korotkov.  

Re: GIN improvements part 1: additional information

From
Heikki Linnakangas
Date:
On 01/22/2014 09:25 AM, Alexander Korotkov wrote:
> On Wed, Jan 22, 2014 at 1:21 AM, Heikki Linnakangas <hlinnakangas@vmware.com
>> wrote:
>
>> On 01/21/2014 11:35 AM, Heikki Linnakangas wrote:
>>
>>> Oh, I see what's going on. I had assumed that there cannot be duplicate
>>> insertions into the posting tree, but that's dead wrong. The fast
>>> insertion mechanism depends on a duplicate insertion to do nothing.
>>
>> Ok, this turned out to be a much bigger change than I thought...
>>
>> In principle, it's not difficult to eliminate duplicates. However, to
>> detect a duplicate, you have to check if the item is present in the
>> uncompressed items array, or in the compressed lists. That requires
>> decoding the segment where it should be.
>>
>> But if we decode the segment, what's the purpose of the uncompressed items
>> array? Its purpose was to speed up insertions, by buffering them so that we
>> don't need to decode and re-encode the page/segment on every inserted item.
>> But if we're paying the price of decoding it anyway, we might as well
>> include the new item and re-encode the segment. The uncompressed area saves
>> some effort in WAL-logging, as the record of inserting an entry into the
>> uncompressed area is much smaller than that of re-encoding part of the
>> page, but if that really is a concern, we could track more carefully which
>> parts of the page is modified, and only WAL-log the required parts. And
>> hopefully, the fast-update lists buffer inserts so that you insert many
>> items at a time to the posting tree, and the price of re-encoding is only
>> paid once.
>>
>> So, now that I think about this once more, I don't think the uncompressed
>> area in every leaf page is a good idea.
>
> I didn't get why insertion of duplicate TIDs to uncompressed area and
> eliminate them of re-encoding? It would be somewhat more work during
> updates, more work during scan, more WAL records. But is it really
> significant for real-life workloads?

Hmm, so you would merrily insert duplicate TIDs into the uncompressed 
area, and weed them out when reading or recompressing the page? I had 
not thought of that. Yeah, it might be a good trade-off, duplicates are 
not expected to happen very often.
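
Weeding them out then amounts to a standard merge of two sorted TID arrays that emits each value once; a minimal sketch, not code from the patch:

/* Merge sorted ItemPointer arrays a[] and b[] into dst[], dropping
 * duplicates. Returns the number of distinct items written. */
static int
mergeDedup(ItemPointer dst, ItemPointer a, int na, ItemPointer b, int nb)
{
    int         i = 0, j = 0, n = 0;

    while (i < na && j < nb)
    {
        int         cmp = ginCompareItemPointers(&a[i], &b[j]);

        if (cmp < 0)
            dst[n++] = a[i++];
        else if (cmp > 0)
            dst[n++] = b[j++];
        else
        {
            dst[n++] = a[i];    /* duplicate TID: emit only once */
            i++;
            j++;
        }
    }
    while (i < na)
        dst[n++] = a[i++];
    while (j < nb)
        dst[n++] = b[j++];
    return n;
}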

> I refactored the way the recompression routine works again. It is now more
>> clearly a multi-step process. First, the existing page is "disassembled"
>> into an in-memory struct, which is a linked list of all the segments.
>> In-memory each segment can be represented as an array of item pointers, or
>> in the compressed format. In the next phase, all the new items are added to
>> the segments where they belong. This naturally can lead to overly large
>> segments; in the third phase, the items are redistributed among the
>> segments, and compressed back to the compressed format. Finally, the
>> finished segments are written back to the page, or pages if it had to be
>> split.
>>
>> The same subroutines to disassemble and recompress are also used in vacuum.
>>
>> Attached is what I've got now. This is again quite heavily-changed from
>> the previous version - sorry for repeatedly rewriting this. I'll continue
>> testing and re-reviewing this tomorrow.
>
>
> Let's clarify where we are. We had quite debugged and tested version with
> hard-wired varbyte encoding. Now we're reimplementing and debugging
> segmented version of varbyte encoding in a hope that one day we can easily
> replace it with something better that meets our restrictions (but we didn't
> find it already). Is it right?

The segmentation was a sensible change on code-readability grounds 
alone. Yes, it made it easier to experiment with different encodings, 
and will make it easier to replace the encoding in the future, but that 
wasn't the only reason for doing it. It's nice to have the 
encoding/decoding stuff in ginpostinglists.c, so that the rest of the 
code just passes around opaque GinPostingList structs (previously known 
as PostingListSegment).
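
For readers following along, the encoding inside a GinPostingList is delta-plus-varbyte: each TID is packed into a single integer, and the difference from the previous TID is written 7 bits per byte with the high bit set on continuation bytes. A sketch; the 11-bit offset shift is an assumption of this illustration:

/* Pack a TID into one integer; offsets are assumed to fit in 11 bits. */
static uint64
itemptr_to_uint64(const ItemPointerData *iptr)
{
    uint64      val = ItemPointerGetBlockNumber(iptr);

    return (val << 11) | ItemPointerGetOffsetNumber(iptr);
}

/* Append 'val' (the delta from the previous TID) in varbyte form. */
static void
encode_varbyte(uint64 val, unsigned char **ptr)
{
    unsigned char *p = *ptr;

    while (val > 0x7F)
    {
        *(p++) = 0x80 | (val & 0x7F);   /* 7 payload bits, continuation set */
        val >>= 7;
    }
    *(p++) = (unsigned char) val;       /* final byte: high bit clear */
    *ptr = p;
}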

One thing I have been pondering, though: Instead of storing the posting 
lists one after each other on the leaf page, it might be better to store 
them as Items on the page, with normal ItemIds pointing to each. So the 
page layout would be more standard, and you could use 
PageAddItem/PageIndexMultiDelete to add/remove posting lists to page. 
The immediate advantage of that would be that it would make it possible 
to do a binary search of the segments, to quickly locate the right 
segment where a tuple belongs to. That might not be significant in 
practice - linearly scanning 32 items is not slow either. And it would 
add some overhead, one line pointer per segment, 4 * 32 = 128 bytes per 
page with the current segment size of 256 bytes. But then again, it 
might be a good idea just because it would make the pages look more like 
any other page, which is generally a good thing.
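
For what it's worth, the binary search would look roughly like this under the pondered layout, one line pointer per segment; a sketch, not an implemented API:

/* Find the last segment whose first TID is <= item, i.e. the only
 * segment that can contain it. Assumes each ItemId points at a
 * GinPostingList and segments are ordered by their first TID. */
OffsetNumber lo = FirstOffsetNumber;
OffsetNumber hi = PageGetMaxOffsetNumber(page);

while (lo < hi)
{
    OffsetNumber mid = lo + (hi - lo + 1) / 2;      /* round up */
    GinPostingList *seg = (GinPostingList *)
        PageGetItem(page, PageGetItemId(page, mid));

    if (ginCompareItemPointers(&item, &seg->first) >= 0)
        lo = mid;
    else
        hi = mid - 1;
}
/* 'lo' is now the offset of the candidate segment */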

- Heikki



GIN pending list pages not recycled promptly (was Re: GIN improvements part 1: additional information)

From
Heikki Linnakangas
Date:
On 01/22/2014 03:39 AM, Tomas Vondra wrote:
> What annoys me a bit is the huge size difference between the index
> updated incrementally (by a sequence of INSERT commands), and the index
> rebuilt from scratch using VACUUM FULL. It's a bit better with the patch
> (2288 vs. 2035 MB), but is there a chance to improve this?

Hmm. What seems to be happening is that pending item list pages that the
fast update mechanism uses are not getting recycled. When enough list
pages are filled up, they are flushed into the main index and the list
pages are marked as deleted. But they are not recorded in the FSM, so
they won't be recycled until the index is vacuumed. Almost all of the
difference can be attributed to deleted pages left behind like that.

So this isn't actually related to the packed postinglists patch at all.
It just makes the bloat more obvious, because it makes the actual size
of the index, excluding deleted pages, smaller. But it can be
observed on git master as well:

I created a simple test table and index like this:

create table foo (intarr int[]);
create index i_foo on foo using gin(intarr) with (fastupdate=on);

I filled the table like this:

insert into foo select array[-1] from generate_series(1, 10000000) g;

postgres=# \d+ foo
                    List of relations
  Schema | Name | Type  | Owner  |  Size  | Description
--------+------+-------+--------+--------+-------------
  public | foo  | table | heikki | 575 MB |
(1 row)

postgres=# \di+
                        List of relations
  Schema | Name  | Type  | Owner  | Table |  Size  | Description
--------+-------+-------+--------+-------+--------+-------------
  public | i_foo | index | heikki | foo   | 251 MB |
(1 row)

I wrote a little utility that scans all pages in a gin index, and prints
out the flags indicating what kind of a page it is. The distribution
looks like this:

      19 DATA
    7420 DATA LEAF
   24701 DELETED
       1 LEAF
       1 META

I think we need to add the deleted pages to the FSM more aggressively.

I tried simply adding calls to RecordFreeIndexPage, after the list pages
have been marked as deleted, but unfortunately that didn't help. The
problem is that the FSM is organized into a three-level tree, and
RecordFreeIndexPage only updates the bottom level. The upper levels are
not updated until the FSM is vacuumed, so the pages are still not
visible to GetFreeIndexPage calls until next vacuum. The simplest fix
would be to add a call to IndexFreeSpaceMapVacuum after flushing the
pending list, per attached patch. I'm slightly worried about the
performance impact of the IndexFreeSpaceMapVacuum() call. It scans the
whole FSM of the index, which isn't exactly free. So perhaps we should
teach RecordFreeIndexPage to update the upper levels of the FSM in a
retail-fashion instead.
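
The simplest-fix variant in outline, assuming the flush code has the deleted block numbers at hand; RecordFreeIndexPage and IndexFreeSpaceMapVacuum are the existing FSM calls mentioned above:

/* After flushing the pending list: record each deleted list page in the
 * FSM, then vacuum the index's FSM so the upper levels are updated and
 * GetFreeIndexPage() can find the pages before the next VACUUM. */
for (i = 0; i < ndeletedpages; i++)
    RecordFreeIndexPage(index, deletedblknos[i]);

IndexFreeSpaceMapVacuum(index);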

- Heikki

Attachments

Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Wed, Jan 22, 2014 at 12:30 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 01/22/2014 09:25 AM, Alexander Korotkov wrote:
On Wed, Jan 22, 2014 at 1:21 AM, Heikki Linnakangas <hlinnakangas@vmware.com
wrote:

On 01/21/2014 11:35 AM, Heikki Linnakangas wrote:

Oh, I see what's going on. I had assumed that there cannot be duplicate
insertions into the posting tree, but that's dead wrong. The fast
insertion mechanism depends on a duplicate insertion to do nothing.

Ok, this turned out to be a much bigger change than I thought...

In principle, it's not difficult to eliminate duplicates. However, to
detect a duplicate, you have to check if the item is present in the
uncompressed items array, or in the compressed lists. That requires
decoding the segment where it should be.

But if we decode the segment, what's the purpose of the uncompressed items
array? Its purpose was to speed up insertions, by buffering them so that we
don't need to decode and re-encode the page/segment on every inserted item.
But if we're paying the price of decoding it anyway, we might as well
include the new item and re-encode the segment. The uncompressed area saves
some effort in WAL-logging, as the record of inserting an entry into the
uncompressed area is much smaller than that of re-encoding part of the
page, but if that really is a concern, we could track more carefully which
parts of the page is modified, and only WAL-log the required parts. And
hopefully, the fast-update lists buffer inserts so that you insert many
items at a time to the posting tree, and the price of re-encoding is only
paid once.

So, now that I think about this once more, I don't think the uncompressed
area in every leaf page is a good idea.

I didn't get why insertion of duplicate TIDs to uncompressed area and
eliminate them of re-encoding? It would be somewhat more work during
updates, more work during scan, more WAL records. But is it really
significant for real-life workloads?

Hmm, so you would merrily insert duplicate TIDs into the uncompressed area, and weed them out when reading or recompressing the page? I had not thought of that. Yeah, it might be a good trade-off, duplicates are not expected to happen very often.


I refactored the way the recompression routine works again. It is now more
clearly a multi-step process. First, the existing page is "disassembled"
into an in-memory struct, which is a linked list of all the segments.
In-memory each segment can be represented as an array of item pointers, or
in the compressed format. In the next phase, all the new items are added to
the segments where they belong. This naturally can lead to overly large
segments; in the third phase, the items are redistributed among the
segments, and compressed back to the compressed format. Finally, the
finished segments are written back to the page, or pages if it had to be
split.

The same subroutines to disassemble and recompress are also used in vacuum.

Attached is what I've got now. This is again quite heavily-changed from
the previous version - sorry for repeatedly rewriting this. I'll continue
testing and re-reviewing this tomorrow.


Let's clarify where we are. We had quite debugged and tested version with
hard-wired varbyte encoding. Now we're reimplementing and debugging
segmented version of varbyte encoding in a hope that one day we can easily
replace it with something better that meets our restrictions (but we didn't
find it already). Is it right?

The segmentation was a sensible change on code-readability grounds alone. Yes, it made it easier to experiment with different encodings, and will make it easier to replace the encoding in the future, but that wasn't the only reason for doing it. It's nice to have the encoding/decoding stuff in ginpostinglists.c, so that the rest of the code just passes around opaque GinPostingList structs (previously known as PostingListSegment).

One thing I have been pondering, though: Instead of storing the posting lists one after each other on the leaf page, it might be better to store them as Items on the page, with normal ItemIds pointing to each. So the page layout would be more standard, and you could use PageAddItem/PageIndexMultiDelete to add/remove posting lists to page. The immediate advantage of that would be that it would make it possible to do a binary search of the segments, to quickly locate the right segment where a tuple belongs to. That might not be significant in practice - linearly scanning 32 items is not slow either. And it would add some overhead, one line pointer per segment, 4 * 32 = 128 bytes per page with the current segment size of 256 bytes. But then again, it might be a good idea just because it would make the pages look more like any other page, which is generally a good thing.

We have already spent a lot of time on compression. Now we need to decide what result we want to see. I spent quite a long time debugging the varbyte encoding without segments. Finally, it passed a great many tests without any problems. Now it is just a piece of junk. I'm afraid we will have to reimplement everything from scratch another two or three times because the code doesn't look perfect. That is surely incompatible with getting something into 9.4. Remember that we also have the subsequent fast-scan patch, which is badly needed for hstore and other applications.
I propose making the final decisions now and concentrating our efforts on producing a committable patch based on them. And let's not change those decisions even if somebody has an idea for improving code readability 100 times over and potential extendability 1000 times over.
I propose the following decisions:
1) Leave the uncompressed area, and allow duplicates in it.
2) Don't introduce Items on the page.

------
With best regards,
Alexander Korotkov.  

Re: GIN improvements part 1: additional information

From
Álvaro Herrera
Date:
Heikki Linnakangas wrote:

> I wrote a little utility that scans all pages in a gin index, and
> prints out the flags indicating what kind of a page it is. The
> distribution looks like this:
> 
>      19 DATA
>    7420 DATA LEAF
>   24701 DELETED
>       1 LEAF
>       1 META

Hah.

> I think we need to add the deleted pages to the FSM more aggressively.
>
> I tried simply adding calls to RecordFreeIndexPage, after the list
> pages have been marked as deleted, but unfortunately that didn't
> help. The problem is that the FSM is organized into a three-level
> tree, and RecordFreeIndexPage only updates the bottom level.

Interesting.  I think the idea of having an option for RecordFreeIndexPage
to update the upper levels makes sense (no need to force it for other
users).
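
For illustration, a minimal sketch of that idea under the simplest possible approach: piggy-backing an FSM vacuum onto the existing call. RecordFreeIndexPageEager is hypothetical; RecordFreeIndexPage and IndexFreeSpaceMapVacuum are the existing functions from indexfsm.c.

#include "storage/indexfsm.h"

void
RecordFreeIndexPageEager(Relation rel, BlockNumber blkno)
{
    RecordFreeIndexPage(rel, blkno);

    /*
     * Propagate the change to the upper levels of the FSM tree so that
     * GetFreeIndexPage can see the page before the next VACUUM.  This
     * scans the index's whole FSM, which is not free; a retail update
     * of just the covering upper-level pages would be cheaper.
     */
    IndexFreeSpaceMapVacuum(rel);
}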

Some time ago I proposed an index-only cleanup for vacuum.  That would
help GIN get this kind of treatment (vacuuming its FSM and processing
the pending list) separately from vacuuming the index.  It's probably
too late for 9.4 though.

One other thing worth considering in this area is that making the
pending list size depend on work_mem appears to have been a really bad
idea.  I know one case where the server is really large and seems to run
mostly OLAP-type stuff with occasional updates, so they globally set
work_mem=2GB; they have GIN indexes for text search, and the result is
horrible performance 90% of the time; then a vacuum cleans the pending
list and it is blazing fast until the pending list starts getting big
again.  Now you can argue that setting work_mem to that value is a bad
idea, but as it turns out, in this case other than the GIN pending list
it seems to work fine.

Not related to the patch at hand, but I thought I would put it out for
consideration, 'cause I'm not gonna start a new thread about it.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: GIN improvements part 1: additional information

From
Heikki Linnakangas
Date:
On 01/22/2014 02:17 PM, Alexander Korotkov wrote:
> We have already spent a lot of time on compression. Now we need to decide
> what result we want to see. I spent quite a long time debugging the varbyte
> encoding without segments. Finally, it passed a great many tests without any
> problems. Now it is just a piece of junk. I'm afraid we will have to
> reimplement everything from scratch another two or three times because the
> code doesn't look perfect. That is surely incompatible with getting
> something into 9.4.

That's a bit harsh :-). But yes, I understand what you're saying. It's 
quite common for large patches like this to be rewritten several times 
before being committed; you just don't know what works best until you've 
tried many designs.

> Remember that we also have the subsequent fast-scan patch, which is badly
> needed for hstore and other applications.
> I propose making the final decisions now and concentrating our efforts on
> producing a committable patch based on them. And let's not change those
> decisions even if somebody has an idea for improving code readability 100
> times over and potential extendability 1000 times over.
> I propose the following decisions:
> 1) Leave the uncompressed area, and allow duplicates in it.
> 2) Don't introduce Items on the page.

Well, I got the insertions to work now without the uncompressed area, so 
in the spirit of Just Getting This Damn Patch Committed, I'm going to go 
ahead with that. We can add the uncompressed area later if performance 
testing shows it to be necessary. And I agree about the Items on page 
idea - we can come back to that too in the 9.4 timeframe if necessary, 
but probably not.

So, committed. It's the same design as in the most recent patch I 
posted, but with a bunch of bugs fixed, and cleaned up all over. I'm 
going to move to the fast scan patch now, but please do test and review 
the committed version to see if I missed something.

- Heikki



Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Wed, Jan 22, 2014 at 9:28 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 01/22/2014 02:17 PM, Alexander Korotkov wrote:
We have already spent a lot of time on compression. Now we need to decide
what result we want to see. I spent quite a long time debugging the varbyte
encoding without segments. Finally, it passed a great many tests without any
problems. Now it is just a piece of junk. I'm afraid we will have to
reimplement everything from scratch another two or three times because the
code doesn't look perfect. That is surely incompatible with getting something
into 9.4.

That's a bit harsh :-). But yes, I understand what you're saying. It's quite common for large patches like this to be rewritten several times before being committed; you just don't know what works best until you've tried many designs.


Remember that we also have the subsequent fast-scan patch, which is badly
needed for hstore and other applications.
I propose making the final decisions now and concentrating our efforts on
producing a committable patch based on them. And let's not change those
decisions even if somebody has an idea for improving code readability 100
times over and potential extendability 1000 times over.
I propose the following decisions:
1) Leave the uncompressed area, and allow duplicates in it.
2) Don't introduce Items on the page.

Well, I got the insertions to work now without the uncompressed area, so in the spirit of Just Getting This Damn Patch Committed, I'm going to go ahead with that. We can add the uncompressed area later if performance testing shows it to be necessary. And I agree about the Items on page idea - we can come back to that too in the 9.4 timeframe if necessary, but probably not.

So, committed. It's the same design as in the most recent patch I posted, but with a bunch of bugs fixed, and cleaned up all over. I'm going to move to the fast scan patch now, but please do test and review the committed version to see if I missed something.

Great! Thanks a lot!
The assertion in dataPlaceToPageLeafRecompress is too strong: a page can contain up to GinDataLeafMaxContentSize bytes. A patch is attached.
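
For illustration only (totalsize is a stand-in name; this is not the actual patch), the boundary case in question is a recompression result that fills the page completely:

Assert(totalsize < GinDataLeafMaxContentSize);   /* too strong */
Assert(totalsize <= GinDataLeafMaxContentSize);  /* the page may be exactly full */
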
My test suite doesn't run correctly. I'm debugging it now.

------
With best regards,
Alexander Korotkov.  
Attachments

Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Thu, Jan 23, 2014 at 9:27 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Wed, Jan 22, 2014 at 9:28 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 01/22/2014 02:17 PM, Alexander Korotkov wrote:
We have already spent a lot of time on compression. Now we need to decide
what result we want to see. I spent quite a long time debugging the varbyte
encoding without segments. Finally, it passed a great many tests without any
problems. Now it is just a piece of junk. I'm afraid we will have to
reimplement everything from scratch another two or three times because the
code doesn't look perfect. That is surely incompatible with getting something
into 9.4.

That's a bit harsh :-). But yes, I understand what you're saying. It's quite common for large patches like this to be rewritten several times before being committed; you just don't know what works best until you've tried many designs.


Remember that we also have the subsequent fast-scan patch, which is badly
needed for hstore and other applications.
I propose making the final decisions now and concentrating our efforts on
producing a committable patch based on them. And let's not change those
decisions even if somebody has an idea for improving code readability 100
times over and potential extendability 1000 times over.
I propose the following decisions:
1) Leave the uncompressed area, and allow duplicates in it.
2) Don't introduce Items on the page.

Well, I got the insertions to work now without the uncompressed area, so in the spirit of Just Getting This Damn Patch Committed, I'm going to go ahead with that. We can add the uncompressed area later if performance testing shows it to be necessary. And I agree about the Items on page idea - we can come back to that too in the 9.4 timeframe if necessary, but probably not.

So, committed. It's the same design as in the most recent patch I posted, but with a bunch of bugs fixed, and cleaned up all over. I'm going to move to the fast scan patch now, but please do test and review the committed version to see if I missed something.

Great! Thanks a lot!
The assertion in dataPlaceToPageLeafRecompress is too strong: a page can contain up to GinDataLeafMaxContentSize bytes. A patch is attached.
My test suite doesn't run correctly. I'm debugging it now.

ISTM I found this bug: ginVacuumPostingTreeLeaf re-encodes only some segments. The others are not even re-palloc'd; they are moved left in dataPlaceToPageLeafRecompress by memcpy. But it's incorrect to do that with memcpy, because the source and destination can overlap; the proper solution is memmove. Using memcpy there can, depending on the platform, lead to page corruption. Another solution is to re-palloc the segments in ginVacuumPostingTreeLeaf.
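
To illustrate the bug class with hypothetical names (this is not the actual GIN code): when the unmodified segments are shifted left over their own old location, the source and destination regions overlap, which is undefined behavior for memcpy but guaranteed safe for memmove.

#include <string.h>

static void
shift_segments_left(char *pageContent, size_t newOffset,
                    size_t oldOffset, size_t len)
{
    /*
     * WRONG: the regions [oldOffset, oldOffset + len) and
     * [newOffset, newOffset + len) overlap whenever
     * oldOffset - newOffset < len, so this memcpy is undefined behavior:
     *
     *     memcpy(pageContent + newOffset, pageContent + oldOffset, len);
     */

    /* RIGHT: memmove copies as if through a temporary buffer */
    memmove(pageContent + newOffset, pageContent + oldOffset, len);
}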

------
With best regards,
Alexander Korotkov.  
Attachments

Re: GIN improvements part 1: additional information

From
Heikki Linnakangas
Date:
On 01/24/2014 10:03 AM, Alexander Korotkov wrote:
> ISTM I found this bug: ginVacuumPostingTreeLeaf re-encodes only some
> segments. The others are not even re-palloc'd; they are moved left
> in dataPlaceToPageLeafRecompress by memcpy. But it's incorrect to do that
> with memcpy, because the source and destination can overlap; the proper
> solution is memmove. Using memcpy there can, depending on the platform,
> lead to page corruption. Another solution is to re-palloc the segments in
> ginVacuumPostingTreeLeaf.

Good catch. Thanks, committed, changing memcpy to memmove. Will have to 
switch to pallocing everything in the future, if we make leafRepackItems 
smarter, so that it doesn't rewrite all the segments after the first 
modified one.

- Heikki



Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Fri, Jan 24, 2014 at 12:50 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 01/24/2014 10:03 AM, Alexander Korotkov wrote:
ISTM I found this bug: ginVacuumPostingTreeLeaf re-encodes only some
segments. The others are not even re-palloc'd; they are moved left
in dataPlaceToPageLeafRecompress by memcpy. But it's incorrect to do that
with memcpy, because the source and destination can overlap; the proper
solution is memmove. Using memcpy there can, depending on the platform,
lead to page corruption. Another solution is to re-palloc the segments in
ginVacuumPostingTreeLeaf.

Good catch. Thanks, committed, changing memcpy to memmove. Will have to switch to pallocing everything in the future, if we make leafRepackItems smarter, so that it doesn't rewrite all the segments after the first modified one.

OK. What about the previous fix to the assert? 

------
With best regards,
Alexander Korotkov.  

Re: GIN improvements part 1: additional information

From
Heikki Linnakangas
Date:
On 01/24/2014 10:53 AM, Alexander Korotkov wrote:
> OK. What about the previous fix to the assert?

Ah right, fixed that too now.

- Heikki



Re: GIN improvements part 1: additional information

From
Alexander Korotkov
Date:
On Fri, Jan 24, 2014 at 1:19 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 01/24/2014 10:53 AM, Alexander Korotkov wrote:
OK. What about the previous fix to the assert?

Ah right, fixed that too now.

Good, now my test suite passes. The results are as follows.

Time of operations
         event         |     period      
-----------------------+-----------------
 index_build           | 00:01:45.709546
 index_build_recovery  | 00:00:09
 index_update          | 00:05:45.732214
 index_update_recovery | 00:01:17
 search_new            | 00:24:02.191049
 search_updated        | 00:26:54.122852
(6 rows)

Index sizes
     label     |   size    
---------------+-----------
 new           | 387547136
 after_updates | 610877440
(2 rows) 

Before compression, the results were the following.

Time of operations
         event         |     period      
-----------------------+-----------------
 index_build           | 00:01:51.116722
 index_build_recovery  | 00:00:08
 index_update          | 00:05:12.124758
 index_update_recovery | 00:01:43
 search_new            | 00:23:44.281424
 search_updated        | 00:25:41.96251
(6 rows)

Index sizes
     label     |    size    
---------------+------------
 new           |  884514816
 after_updates | 1595252736
(2 rows)

So, we can see a small regression in the update and search times, while the index is less than half its previous size. However, I'm not sure it's very significant.

------
With best regards,
Alexander Korotkov.  

Re: GIN improvements part 1: additional information

From
Amit Langote
Date:
On Wed, Jan 22, 2014 at 9:12 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 01/22/2014 03:39 AM, Tomas Vondra wrote:
>>
>> What annoys me a bit is the huge size difference between the index
>> updated incrementally (by a sequence of INSERT commands), and the index
>> rebuilt from scratch using VACUUM FULL. It's a bit better with the patch
>> (2288 vs. 2035 MB), but is there a chance to improve this?
>
>
> Hmm. What seems to be happening is that pending item list pages that the
> fast update mechanism uses are not getting recycled. When enough list pages
> are filled up, they are flushed into the main index and the list pages are
> marked as deleted. But they are not recorded in the FSM, so they won't be
> recycled until the index is vacuumed. Almost all of the difference can be
> attributed to deleted pages left behind like that.
>
> So this isn't actually related to the packed postinglists patch at all. It
> just makes the bloat more obvious, because it makes the actual size of the
> index, excluding deleted pages, smaller. But it can be observed on git
> master as well:
>
> I created a simple test table and index like this:
>
> create table foo (intarr int[]);
> create index i_foo on foo using gin(intarr) with (fastupdate=on);
>
> I filled the table like this:
>
> insert into foo select array[-1] from generate_series(1, 10000000) g;
>
> postgres=# \dt+
>                    List of relations
>  Schema | Name | Type  | Owner  |  Size  | Description
> --------+------+-------+--------+--------+-------------
>  public | foo  | table | heikki | 575 MB |
> (1 row)
>
> postgres=# \di+
>                        List of relations
>  Schema | Name  | Type  | Owner  | Table |  Size  | Description
> --------+-------+-------+--------+-------+--------+-------------
>  public | i_foo | index | heikki | foo   | 251 MB |
> (1 row)
>
> I wrote a little utility that scans all pages in a gin index, and prints out
> the flags indicating what kind of a page it is. The distribution looks like
> this:
>
>      19 DATA
>    7420 DATA LEAF
>   24701 DELETED
>       1 LEAF
>       1 META
>
> I think we need to add the deleted pages to the FSM more aggressively.
>
> I tried simply adding calls to RecordFreeIndexPage, after the list pages
> have been marked as deleted, but unfortunately that didn't help. The problem
> is that the FSM is organized into a three-level tree, and
> RecordFreeIndexPage only updates the bottom level. The upper levels are not
> updated until the FSM is vacuumed, so the pages are still not visible to
> GetFreeIndexPage calls until next vacuum. The simplest fix would be to add a
> call to IndexFreeSpaceMapVacuum after flushing the pending list, per
> attached patch. I'm slightly worried about the performance impact of the
> IndexFreeSpaceMapVacuum() call. It scans the whole FSM of the index, which
> isn't exactly free. So perhaps we should teach RecordFreeIndexPage to update
> the upper levels of the FSM in a retail-fashion instead.
>

I wonder if you pursued this further?

You recently added a number of TODO items related to GIN indexes; is it
worth adding this to the list?

--
Amit



Re: GIN improvements part 1: additional information

From
Amit Langote
Date:
On Thu, Jun 19, 2014 at 2:09 PM, Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Jan 22, 2014 at 9:12 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>>
>> I think we need to add the deleted pages to the FSM more aggressively.
>>
>> I tried simply adding calls to RecordFreeIndexPage, after the list pages
>> have been marked as deleted, but unfortunately that didn't help. The problem
>> is that the FSM is organized into a three-level tree, and
>> RecordFreeIndexPage only updates the bottom level. The upper levels are not
>> updated until the FSM is vacuumed, so the pages are still not visible to
>> GetFreeIndexPage calls until next vacuum. The simplest fix would be to add a
>> call to IndexFreeSpaceMapVacuum after flushing the pending list, per
>> attached patch. I'm slightly worried about the performance impact of the
>> IndexFreeSpaceMapVacuum() call. It scans the whole FSM of the index, which
>> isn't exactly free. So perhaps we should teach RecordFreeIndexPage to update
>> the upper levels of the FSM in a retail-fashion instead.
>>
>
> I wonder if you pursued this further?
>
> You recently added a number of TODO items related to GIN indexes; is it
> worth adding this to the list?
>

In fact, I forgot to mention that I tried your patch, and it seems to
work for the particular example I am working with using pg_bigm (bigram
indexing); the indexed data is maybe not so realistic.

postgres=# CREATE TABLE test(contents text);
CREATE TABLE

postgres=# create or replace function rnd_str(length integer) returns text as
$$
declare
  chars text[] :=
    '{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,あ,い,う,え,お,か,き,く,け,こ,さ,し,す,せ,そ,た,ち,つ,て,と,な,に,ぬ,ね,の,は,ひ,ふ,へ,ほ,ま,み,む,め,も,や,ゆ,よ,ら,り,る,れ,ろ,わ,ゐ,ゑ,を}';
  result text := '';
  i integer := 0;
  arr_len integer;
begin
  chars := array_append(chars, ' ');
  arr_len := array_length(chars, 1);
  if length < 0 then
    raise exception 'Given length cannot be less than 0';
  end if;
  for i in 1..length loop
    result := result || chars[1 + random() * (arr_len - 1)];
  end loop;
  return result;
end;
$$ language plpgsql;
CREATE FUNCTION


-- HEAD

postgres=# INSERT INTO test SELECT rnd_str((random() * 500)::int) FROM
generate_series(1, 100000);
INSERT 0 100000

Time: 76296.641 ms

postgres=# SELECT pg_size_pretty(pg_table_size('test'));
 pg_size_pretty
----------------
 49 MB
(1 row)

postgres=# CREATE INDEX test_bigm_idx ON test USING GIN (contents gin_bigm_ops);
CREATE INDEX
Time: 50517.912 ms

postgres=# SELECT pg_size_pretty(pg_relation_size('test_bigm_idx'));
 pg_size_pretty
----------------
 118 MB
(1 row)

postgres=# TRUNCATE test;
TRUNCATE TABLE

postgres=# INSERT INTO test SELECT rnd_str((random() * 500)::int) FROM
generate_series(1, 100000);
INSERT 0 100000
Time: 233369.366 ms

postgres=# SELECT pg_size_pretty(pg_relation_size('test_bigm_idx'));
 pg_size_pretty
----------------
 747 MB
(1 row)


-- Whereas, patched (gin-recycle-deleted-list-pages-1.patch) HEAD

postgres=# INSERT INTO test SELECT rnd_str((random() * 500)::int) FROM
generate_series(1, 100000);
INSERT 0 100000
Time: 32808.779 ms

postgres=# CREATE INDEX test_bigm_idx ON test USING GIN (contents gin_bigm_ops);
CREATE INDEX
Time: 24490.945 ms

postgres=# SELECT pg_size_pretty(pg_relation_size('test_bigm_idx'));
 pg_size_pretty
----------------
 118 MB
(1 row)

postgres=# TRUNCATE test;
TRUNCATE TABLE

postgres=# INSERT INTO test SELECT rnd_str((random() * 500)::int) FROM
generate_series(1, 100000);
INSERT 0 100000
Time: 153878.163 ms

postgres=# SELECT pg_size_pretty(pg_relation_size('test_bigm_idx'));
 pg_size_pretty
----------------
 119 MB
(1 row)

That sure looks good.

--
Amit