Thread: HEAD seems to generate larger WAL regarding GIN index
Hi,

I executed the following statements in HEAD and 9.3, and compared
the size of WAL which were generated by data insertion in GIN index.

---------------------
CREATE EXTENSION pg_trgm;
CREATE TABLE hoge (col1 text);
CREATE INDEX hogeidx ON hoge USING gin (col1 gin_trgm_ops) WITH
(FASTUPDATE = off);

CHECKPOINT;
SELECT pg_switch_xlog();
SELECT pg_switch_xlog();

SELECT pg_current_xlog_location();
INSERT INTO hoge SELECT 'POSTGRESQL' FROM generate_series(1, 1000000);
SELECT pg_current_xlog_location();
---------------------

The results of WAL size are

    960 MB (9.3)
    2113 MB (HEAD)

The WAL size in HEAD was more than two times bigger than that in 9.3.
Recently the source code of GIN index has been changed dramatically.
Is the increase in GIN-related WAL intentional or a bug?

Regards,

--
Fujii Masao
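The WAL figures above are just the difference between the two pg_current_xlog_location() readings. A minimal sketch of that arithmetic (helper names are illustrative, not from the thread), using PostgreSQL's "hi/lo" hexadecimal LSN format, where the part before the slash is the high 32 bits and the part after is the low 32 bits:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN like '0/2A7F458' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def wal_generated(start: str, end: str) -> int:
    """WAL bytes written between two pg_current_xlog_location() readings."""
    return lsn_to_bytes(end) - lsn_to_bytes(start)

# Hypothetical readings one GiB apart:
print(wal_generated("0/02000000", "0/42000000") // (1024 * 1024), "MB")  # 1024 MB
```

(9.3 also provides pg_xlog_location_diff() to do this subtraction server-side.)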
On 03/15/2014 08:40 PM, Fujii Masao wrote:
> Hi,
>
> I executed the following statements in HEAD and 9.3, and compared
> the size of WAL which were generated by data insertion in GIN index.
>
> ---------------------
> CREATE EXTENSION pg_trgm;
> CREATE TABLE hoge (col1 text);
> CREATE INDEX hogeidx ON hoge USING gin (col1 gin_trgm_ops) WITH
> (FASTUPDATE = off);
>
> CHECKPOINT;
> SELECT pg_switch_xlog();
> SELECT pg_switch_xlog();
>
> SELECT pg_current_xlog_location();
> INSERT INTO hoge SELECT 'POSTGRESQL' FROM generate_series(1, 1000000);
> SELECT pg_current_xlog_location();
> ---------------------
>
> The results of WAL size are
>
> 960 MB (9.3)
> 2113 MB (HEAD)
>
> The WAL size in HEAD was more than two times bigger than that in 9.3.
> Recently the source code of GIN index has been changed dramatically.
> Is the increase in GIN-related WAL intentional or a bug?

It was somewhat expected. Updating individual items on the new-format GIN
pages requires decompressing and recompressing the page, and the
recompressed posting lists need to be WAL-logged. Which generates much
larger WAL records.

That said, I didn't expect the difference to be quite that big when you're
appending to the end of the table. When the new entries go to the end of
the posting lists, you only need to recompress and WAL-log the last
posting list, which is max 256 bytes long. But I guess that's still a lot
more WAL than in the old format.

That could be optimized, but I figured we can live with it, thanks to the
fastupdate feature. Fastupdate allows amortizing that cost over several
insertions. But of course, you explicitly disabled that...

- Heikki
On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 03/15/2014 08:40 PM, Fujii Masao wrote:
>> Hi,
>>
>> I executed the following statements in HEAD and 9.3, and compared
>> the size of WAL which were generated by data insertion in GIN index.
>>
>> ---------------------
>> CREATE EXTENSION pg_trgm;
>> CREATE TABLE hoge (col1 text);
>> CREATE INDEX hogeidx ON hoge USING gin (col1 gin_trgm_ops) WITH
>> (FASTUPDATE = off);
>>
>> CHECKPOINT;
>> SELECT pg_switch_xlog();
>> SELECT pg_switch_xlog();
>>
>> SELECT pg_current_xlog_location();
>> INSERT INTO hoge SELECT 'POSTGRESQL' FROM generate_series(1, 1000000);
>> SELECT pg_current_xlog_location();
>> ---------------------
>>
>> The results of WAL size are
>>
>> 960 MB (9.3)
>> 2113 MB (HEAD)
>>
>> The WAL size in HEAD was more than two times bigger than that in 9.3.
>> Recently the source code of GIN index has been changed dramatically.
>> Is the increase in GIN-related WAL intentional or a bug?
>
> It was somewhat expected. Updating individual items on the new-format GIN
> pages requires decompressing and recompressing the page, and the
> recompressed posting lists need to be WAL-logged. Which generates much
> larger WAL records.
>
> That said, I didn't expect the difference to be quite that big when
> you're appending to the end of the table. When the new entries go to the
> end of the posting lists, you only need to recompress and WAL-log the
> last posting list, which is max 256 bytes long. But I guess that's still
> a lot more WAL than in the old format.
>
> That could be optimized, but I figured we can live with it, thanks to the
> fastupdate feature. Fastupdate allows amortizing that cost over several
> insertions. But of course, you explicitly disabled that...

Let me know if you want me to write patch addressing this issue.

------
With best regards,
Alexander Korotkov.
On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> On 03/15/2014 08:40 PM, Fujii Masao wrote:
>>> Hi,
>>>
>>> I executed the following statements in HEAD and 9.3, and compared
>>> the size of WAL which were generated by data insertion in GIN index.
>>>
>>> ---------------------
>>> CREATE EXTENSION pg_trgm;
>>> CREATE TABLE hoge (col1 text);
>>> CREATE INDEX hogeidx ON hoge USING gin (col1 gin_trgm_ops) WITH
>>> (FASTUPDATE = off);
>>>
>>> CHECKPOINT;
>>> SELECT pg_switch_xlog();
>>> SELECT pg_switch_xlog();
>>>
>>> SELECT pg_current_xlog_location();
>>> INSERT INTO hoge SELECT 'POSTGRESQL' FROM generate_series(1, 1000000);
>>> SELECT pg_current_xlog_location();
>>> ---------------------
>>>
>>> The results of WAL size are
>>>
>>> 960 MB (9.3)
>>> 2113 MB (HEAD)
>>>
>>> The WAL size in HEAD was more than two times bigger than that in 9.3.
>>> Recently the source code of GIN index has been changed dramatically.
>>> Is the increase in GIN-related WAL intentional or a bug?
>>
>> It was somewhat expected. Updating individual items on the new-format GIN
>> pages requires decompressing and recompressing the page, and the
>> recompressed posting lists need to be WAL-logged. Which generates much
>> larger WAL records.
>>
>> That said, I didn't expect the difference to be quite that big when you're
>> appending to the end of the table. When the new entries go to the end of the
>> posting lists, you only need to recompress and WAL-log the last posting
>> list, which is max 256 bytes long. But I guess that's still a lot more WAL
>> than in the old format.

I ran "pg_xlogdump | grep Gin" and checked the size of GIN-related WAL,
and then found its max seems to be more than 256B. Am I missing something?

What I observed is:

[In HEAD]
At first, the size of GIN-related WAL gradually increases up to about 1400B.

rmgr: Gin   len (rec/tot):    48/    80, tx: 1813, lsn: 0/020020D8, prev 0/02000070, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: F
rmgr: Gin   len (rec/tot):    56/    88, tx: 1813, lsn: 0/02002440, prev 0/020023F8, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
rmgr: Gin   len (rec/tot):    64/    96, tx: 1813, lsn: 0/020044D8, prev 0/02004490, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
...
rmgr: Gin   len (rec/tot):  1376/  1408, tx: 1813, lsn: 0/02A7EE90, prev 0/02A7E910, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 2 isdata: F isleaf: T isdelete: T
rmgr: Gin   len (rec/tot):  1392/  1424, tx: 1813, lsn: 0/02A7F458, prev 0/02A7F410, bkp: 0000, desc: Create posting tree, node: 1663/12945/16441 blkno: 4

Then the size decreases to about 100B and gradually increases again up to 320B.

rmgr: Gin   len (rec/tot):   116/   148, tx: 1813, lsn: 0/02A7F9E8, prev 0/02A7F458, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length: 1372 (compressed)
rmgr: Gin   len (rec/tot):    40/    72, tx: 1813, lsn: 0/02A7FA80, prev 0/02A7F9E8, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 3 isdata: F isleaf: T isdelete: T
...
rmgr: Gin   len (rec/tot):   118/   150, tx: 1813, lsn: 0/02A83BA0, prev 0/02A83B58, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length: 1374 (compressed)
...
rmgr: Gin   len (rec/tot):   288/   320, tx: 1813, lsn: 0/02AEDE28, prev 0/02AEDCE8, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 14 isdata: T isleaf: T unmodified: 1280 length: 1544 (compressed)

Then the size decreases to 66B and gradually increases again up to 320B.
This increase and decrease of the WAL size seems to continue.

[In 9.3]
At first, the size of GIN-related WAL gradually increases up to about 2700B.

rmgr: Gin   len (rec/tot):    52/    84, tx: 1812, lsn: 0/02000430, prev 0/020003D8, bkp: 0000, desc: Insert item, node: 1663/12896/16441 blkno: 1 offset: 11 nitem: 1 isdata: F isleaf T isdelete F updateBlkno:4294967295
rmgr: Gin   len (rec/tot):    60/    92, tx: 1812, lsn: 0/020004D0, prev 0/02000488, bkp: 0000, desc: Insert item, node: 1663/12896/16441 blkno: 1 offset: 1 nitem: 1 isdata: F isleaf T isdelete T updateBlkno:4294967295
...
rmgr: Gin   len (rec/tot):  2740/  2772, tx: 1812, lsn: 0/026D1670, prev 0/026D0B98, bkp: 0000, desc: Insert item, node: 1663/12896/16441 blkno: 5 offset: 2 nitem: 1 isdata: F isleaf T isdelete T updateBlkno:4294967295
rmgr: Gin   len (rec/tot):  2714/  2746, tx: 1812, lsn: 0/026D21A8, prev 0/026D2160, bkp: 0000, desc: Create posting tree, node: 1663/12896/16441 blkno: 6

The size then decreases to 66B and never changes again.

rmgr: Gin   len (rec/tot):    34/    66, tx: 1812, lsn: 0/026D9F00, prev 0/026D9EB8, bkp: 0000, desc: Insert item, node: 1663/12896/16441 blkno: 6 offset: 451 nitem: 1 isdata: T isleaf T isdelete F updateBlkno:4294967295
rmgr: Gin   len (rec/tot):    34/    66, tx: 1812, lsn: 0/026D9F48, prev 0/026D9F00, bkp: 0000, desc: Insert item, node: 1663/12896/16441 blkno: 7 offset: 451 nitem: 1 isdata: T isleaf T isdelete F updateBlkno:4294967295
rmgr: Gin   len (rec/tot):    34/    66, tx: 1812, lsn: 0/026D9F90, prev 0/026D9F48, bkp: 0000, desc: Insert item, node: 1663/12896/16441 blkno: 8 offset: 451 nitem: 1 isdata: T isleaf T isdelete F updateBlkno:4294967295
...

This difference in GIN-related WAL size seems to be what causes HEAD to
generate more than two times bigger WAL. Unfortunately the gap in WAL size
would keep increasing :(

>> That could be optimized, but I figured we can live with it, thanks to the
>> fastupdate feature. Fastupdate allows amortizing that cost over several
>> insertions. But of course, you explicitly disabled that...
>
> Let me know if you want me to write patch addressing this issue.

Yeah, I really want you to address this problem! That's definitely useful
for all users who disable FASTUPDATE for some reason.

Regards,

--
Fujii Masao
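The per-record sizes above were read off pg_xlogdump output by eye; a rough sketch of summing the Gin totals from such output (the regex and column layout are assumed from the sample lines above, not from any pg_xlogdump contract):

```python
import re

# Matches the "len (rec/tot):   48/    80" part of a pg_xlogdump line.
LEN_RE = re.compile(r"len \(rec/tot\):\s*(\d+)/\s*(\d+)")

def total_gin_wal(lines):
    """Sum the total lengths of Gin-related records in pg_xlogdump output."""
    total = 0
    for line in lines:
        if not line.startswith("rmgr: Gin"):
            continue
        m = LEN_RE.search(line)
        if m:
            total += int(m.group(2))  # total length, incl. header/backup blocks
    return total

sample = [
    "rmgr: Gin   len (rec/tot):    48/    80, tx: 1813, lsn: 0/020020D8, ...",
    "rmgr: Heap  len (rec/tot):    35/    67, tx: 1813, lsn: 0/02002128, ...",
    "rmgr: Gin   len (rec/tot):  1376/  1408, tx: 1813, lsn: 0/02A7EE90, ...",
]
print(total_gin_wal(sample))  # 1488
```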
Fujii Masao wrote:
> On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov
> <aekorotkov@gmail.com> wrote:
> >> That could be optimized, but I figured we can live with it, thanks to the
> >> fastupdate feature. Fastupdate allows amortizing that cost over several
> >> insertions. But of course, you explicitly disabled that...
> >
> > Let me know if you want me to write patch addressing this issue.
>
> Yeah, I really want you to address this problem! That's definitely useful
> for all users who disable FASTUPDATE for some reason.

Users that disable FASTUPDATE, in my experience, do so because their
stock work_mem is way too high and GIN searches become too slow due to
having to scan too large a list.  I think it might make sense to invest
a modest amount of time in getting FASTUPDATE to be sized completely
differently from today -- perhaps base it on a hardcoded factor of
BLCKSZ, rather than work_mem.  Or, if we really need to make it
configurable, then let it have its own parameter.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 03/17/2014 03:20 PM, Fujii Masao wrote:
> On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov
> <aekorotkov@gmail.com> wrote:
>> On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>
> I ran "pg_xlogdump | grep Gin" and checked the size of GIN-related WAL,
> and then found its max seems to be more than 256B. Am I missing something?
>
> What I observed is:
>
> [In HEAD]
> At first, the size of GIN-related WAL gradually increases up to about 1400B.
>
> rmgr: Gin   len (rec/tot):    48/    80, tx: 1813, lsn: 0/020020D8, prev 0/02000070, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: F
> rmgr: Gin   len (rec/tot):    56/    88, tx: 1813, lsn: 0/02002440, prev 0/020023F8, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
> rmgr: Gin   len (rec/tot):    64/    96, tx: 1813, lsn: 0/020044D8, prev 0/02004490, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
> ...
> rmgr: Gin   len (rec/tot):  1376/  1408, tx: 1813, lsn: 0/02A7EE90, prev 0/02A7E910, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 2 isdata: F isleaf: T isdelete: T
> rmgr: Gin   len (rec/tot):  1392/  1424, tx: 1813, lsn: 0/02A7F458, prev 0/02A7F410, bkp: 0000, desc: Create posting tree, node: 1663/12945/16441 blkno: 4

This corresponds to the stage where the items are stored in-line in the
entry tree. After it reaches a certain size, a posting tree is created.

> Then the size decreases to about 100B and gradually increases again up to 320B.
>
> rmgr: Gin   len (rec/tot):   116/   148, tx: 1813, lsn: 0/02A7F9E8, prev 0/02A7F458, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length: 1372 (compressed)
> rmgr: Gin   len (rec/tot):    40/    72, tx: 1813, lsn: 0/02A7FA80, prev 0/02A7F9E8, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 3 isdata: F isleaf: T isdelete: T
> ...
> rmgr: Gin   len (rec/tot):   118/   150, tx: 1813, lsn: 0/02A83BA0, prev 0/02A83B58, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length: 1374 (compressed)
> ...
> rmgr: Gin   len (rec/tot):   288/   320, tx: 1813, lsn: 0/02AEDE28, prev 0/02AEDCE8, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 14 isdata: T isleaf: T unmodified: 1280 length: 1544 (compressed)
>
> Then the size decreases to 66B and gradually increases again up to 320B.
> This increase and decrease of the WAL size seems to continue.

Here the new items are appended to posting tree pages. This is where the
maximum of 256 bytes I mentioned applies. 256 bytes is the max size of one
compressed posting list; the WAL record containing it includes some other
stuff too, which adds up to that 320 bytes.

> [In 9.3]
> At first, the size of GIN-related WAL gradually increases up to about 2700B.
>
> rmgr: Gin   len (rec/tot):    52/    84, tx: 1812, lsn: 0/02000430, prev 0/020003D8, bkp: 0000, desc: Insert item, node: 1663/12896/16441 blkno: 1 offset: 11 nitem: 1 isdata: F isleaf T isdelete F updateBlkno:4294967295
> rmgr: Gin   len (rec/tot):    60/    92, tx: 1812, lsn: 0/020004D0, prev 0/02000488, bkp: 0000, desc: Insert item, node: 1663/12896/16441 blkno: 1 offset: 1 nitem: 1 isdata: F isleaf T isdelete T updateBlkno:4294967295
> ...
> rmgr: Gin   len (rec/tot):  2740/  2772, tx: 1812, lsn: 0/026D1670, prev 0/026D0B98, bkp: 0000, desc: Insert item, node: 1663/12896/16441 blkno: 5 offset: 2 nitem: 1 isdata: F isleaf T isdelete T updateBlkno:4294967295
> rmgr: Gin   len (rec/tot):  2714/  2746, tx: 1812, lsn: 0/026D21A8, prev 0/026D2160, bkp: 0000, desc: Create posting tree, node: 1663/12896/16441 blkno: 6
>
> The size then decreases to 66B and never changes again.

Same mechanism on 9.3, but the insertions to the posting tree pages are
constant size.

>>> That could be optimized, but I figured we can live with it, thanks to the
>>> fastupdate feature. Fastupdate allows amortizing that cost over several
>>> insertions. But of course, you explicitly disabled that...
>>
>> Let me know if you want me to write patch addressing this issue.
>
> Yeah, I really want you to address this problem! That's definitely useful
> for all users who disable FASTUPDATE for some reason.

Ok, let's think about it a little bit. I think there are three fairly
simple ways to address this:

1. The GIN data leaf "recompress" record contains an offset called
"unmodifiedlength", and the data that comes after that offset. Currently,
the record is written so that unmodifiedlength points to the end of the
last compressed posting list on the page that was not modified, followed
by all the modified ones. The straightforward way to cut down the WAL
record size would be to be more fine-grained than that, and for the
posting lists that were modified, only store the difference between the
old and new versions. To make this approach work well for random
insertions, not just appending to the end, we would also need to make the
logic in leafRepackItems a bit smarter, so that it would not re-encode all
the posting lists after the first modified one.

2. Instead of storing the new compressed posting list in the WAL record,
store only the new item pointers added to the page. WAL replay would then
have to duplicate the work done in the main insertion code path: find the
right posting lists to insert to, decode them, add the new items, and
re-encode. The upside of that would be that the WAL format would be very
compact. It would be quite simple to implement - you just need to call the
same functions we use in the main insertion codepath to insert the new
items. It could be more expensive, CPU-wise, to replay the records,
however.

This record format would be higher-level, in the sense that we would not
store the physical copy of the compressed posting list as it was formed
originally. The same work would be done at WAL replay. As the code stands,
it will produce exactly the same result, but that's not guaranteed if we
make bugfixes to the code later, and a master and standby are running
different minor versions. There's not necessarily anything wrong with
that, but it's something to keep in mind.

3. Just reduce the GinPostingListSegmentMaxSize constant from 256 to, say,
128. That would halve the typical size of a WAL record that appends to the
end. However, it would not help with insertions in the middle of a posting
list, only appends to the end, and it would bloat the pages somewhat, as
you would waste more space on the posting list headers.

I'm leaning towards option 2. Alexander, what do you think?

- Heikki
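A toy model of option 2's idea (illustrative only, not the actual C code): a "page" is modeled as segments of sorted integers, the WAL record carries just the new items, and replay re-runs the same merge-and-split logic. Master and standby then converge only as long as both run identical re-encoding code, which is exactly the version-skew concern raised above:

```python
def insert_items(segments, new_items, max_seg=256):
    """Flatten the page, merge in the new items, and re-split into segments
    of up to max_seg items (a stand-in for the 256-byte posting-list cap)."""
    items = sorted(set(x for seg in segments for x in seg) | set(new_items))
    return [items[i:i + max_seg] for i in range(0, len(items), max_seg)]

# Master applies the insert ...
master = insert_items([[1, 2, 3], [10, 11]], [4, 12])
# ... and the standby replays the same record with the same code:
standby = insert_items([[1, 2, 3], [10, 11]], [4, 12])
assert master == standby  # holds only while both sides run identical logic
```

If a later minor release changed how insert_items splits segments, a standby replaying old WAL would reconstruct a physically different page, which is the hazard discussed in the replies below.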
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> 2. Instead of storing the new compressed posting list in the WAL record,
> store only the new item pointers added to the page. WAL replay would
> then have to duplicate the work done in the main insertion code path:
> find the right posting lists to insert to, decode them, add the new
> items, and re-encode.

That sounds fairly dangerous ... is any user-defined code involved in
those decisions?

> This record format would be higher-level, in the sense that we would not
> store the physical copy of the compressed posting list as it was formed
> originally. The same work would be done at WAL replay. As the code
> stands, it will produce exactly the same result, but that's not
> guaranteed if we make bugfixes to the code later, and a master and
> standby are running different minor versions. There's not necessarily
> anything wrong with that, but it's something to keep in mind.

Version skew would be a hazard too, all right.  I think it's important
that WAL replay be a pretty mechanical, predictable process.

			regards, tom lane
On 03/17/2014 04:33 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> 2. Instead of storing the new compressed posting list in the WAL record,
>> store only the new item pointers added to the page. WAL replay would
>> then have to duplicate the work done in the main insertion code path:
>> find the right posting lists to insert to, decode them, add the new
>> items, and re-encode.
>
> That sounds fairly dangerous ... is any user-defined code involved in
> those decisions?

No.

>> This record format would be higher-level, in the sense that we would not
>> store the physical copy of the compressed posting list as it was formed
>> originally. The same work would be done at WAL replay. As the code
>> stands, it will produce exactly the same result, but that's not
>> guaranteed if we make bugfixes to the code later, and a master and
>> standby are running different minor versions. There's not necessarily
>> anything wrong with that, but it's something to keep in mind.
>
> Version skew would be a hazard too, all right.  I think it's important
> that WAL replay be a pretty mechanical, predictable process.

Yeah. One particular point to note is that if in one place we do the more
"high level" thing and have WAL replay re-encode the page as it sees fit,
then we can *not* rely on the page being byte-by-byte identical in other
places. Like in vacuum, where items are deleted.

Heap and B-tree WAL records also rely on PageAddItem etc. to reconstruct
the page, instead of making a physical copy of the modified parts. And
_bt_restore_page even inserts the items physically in a different order
than the normal codepath does. So for good or bad, there is some precedent
for this.

The imminent danger I see is if we change the logic on how the items are
divided into posting lists, and end up in a situation where a master
server adds an item to a page, and it just fits, but with the compression
logic the standby version has, it cannot make it fit.

As an escape hatch for that, we could have the WAL replay code try the
compression again, with a larger max. posting list size, if it doesn't fit
at first. And/or always leave something like 10 bytes of free space on
every data page to make up for small differences in the logic.

- Heikki
On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Heap and B-tree WAL records also rely on PageAddItem etc. to reconstruct
> the page, instead of making a physical copy of the modified parts. And
> _bt_restore_page even inserts the items physically in a different order
> than the normal codepath does. So for good or bad, there is some precedent
> for this.

Yikes.

> The imminent danger I see is if we change the logic on how the items are
> divided into posting lists, and end up in a situation where a master
> server adds an item to a page, and it just fits, but with the compression
> logic the standby version has, it cannot make it fit. As an escape hatch
> for that, we could have the WAL replay code try the compression again,
> with a larger max. posting list size, if it doesn't fit at first. And/or
> always leave something like 10 bytes of free space on every data page to
> make up for small differences in the logic.

That scares the crap out of me. I don't see any intrinsic problem with
relying on the existing page contents to figure out how to roll forward,
as PageAddItem does; after all, we do FPIs precisely so that the page is
in a known good state when we start. However, I really think we ought to
try hard to make this deterministic in terms of what the resulting state
of the page is; anything else seems like it's playing with fire, and I
bet we'll get burned sooner rather than later.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Heap and B-tree WAL records also rely on PageAddItem etc. to reconstruct
>> the page, instead of making a physical copy of the modified parts. And
>> _bt_restore_page even inserts the items physically in a different order
>> than the normal codepath does. So for good or bad, there is some
>> precedent for this.

> Yikes.

Yeah.  I think it's arguably a bug that _bt_restore_page works like that,
even though we've not been burnt up to now.

>> The imminent danger I see is if we change the logic on how the items are
>> divided into posting lists, and end up in a situation where a master
>> server adds an item to a page, and it just fits, but with the compression
>> logic the standby version has, it cannot make it fit. As an escape hatch
>> for that, we could have the WAL replay code try the compression again,
>> with a larger max. posting list size, if it doesn't fit at first. And/or
>> always leave something like 10 bytes of free space on every data page to
>> make up for small differences in the logic.

> That scares the crap out of me.

Likewise.  Saving some WAL space is *not* worth this kind of risk.

			regards, tom lane
On 03/17/2014 05:35 PM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>> The imminent danger I see is if we change the logic on how the items are
>>> divided into posting lists, and end up in a situation where a master
>>> server adds an item to a page, and it just fits, but with the compression
>>> logic the standby version has, it cannot make it fit. As an escape hatch
>>> for that, we could have the WAL replay code try the compression again,
>>> with a larger max. posting list size, if it doesn't fit at first. And/or
>>> always leave something like 10 bytes of free space on every data page to
>>> make up for small differences in the logic.
>
>> That scares the crap out of me.
>
> Likewise.  Saving some WAL space is *not* worth this kind of risk.

One fairly good compromise would be to only include the new items, not the
whole modified compressed lists, and let the replay logic do the
re-encoding of the posting lists. But also include the cutoff points of
each posting list in the WAL record. That way the replay code would have
no freedom in how it decides to split the items into compressed lists;
that would be fully specified by the WAL record.

Here's a refresher for those who didn't follow the development of the new
page format: The data page basically contains a list of ItemPointers. The
items are compressed, to save disk space. However, to make random access
faster, all the items on the page are not compressed as one big list.
Instead, the big array of items is split into roughly equal chunks, and
each chunk is compressed separately. The chunks are stored on the page one
after another. (The chunks are called "posting lists" in the code; the
struct is called GinPostingListData.)

The compression is completely deterministic (each item is stored as a
varbyte-encoded delta from the previous item), but there are no hard rules
on how the items on the page ought to be divided into the posting lists.
Currently, the code tries to maintain a max size of 256 bytes per list -
but it will cope with any size it finds on disk. This is where the danger
lies: we could end up with a different physical page after WAL replay, if
we just include the new items in the WAL record. The WAL replay might
decide to split the items into posting lists differently than was
originally done. (As the code stands, it would always make the same
decision, completely deterministically, but that might change in a minor
version if we're not careful.)

We can tie WAL replay's hands about that, if we include a list of the
items that form the posting lists in the WAL record. That adds some bloat,
compared to only including the new items, but not too much. (And we still
only need to do that for posting lists following the first modified one.)

Alexander, would you like to give that a shot, or will I?

- Heikki
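A simplified sketch of the varbyte delta encoding described above (illustrative only: the real code encodes 48-bit ItemPointers and caps each segment at 256 encoded bytes, not at an item count):

```python
def encode_varbyte(value):
    """Encode a non-negative integer as a varbyte sequence:
    7 data bits per byte, high bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_posting_list(items):
    """Delta-compress a sorted list of item numbers (a toy stand-in for
    the ItemPointers in a GinPostingListData segment)."""
    out = bytearray()
    prev = 0
    for item in items:
        out += encode_varbyte(item - prev)  # deltas are small, so bytes are few
        prev = item
    return bytes(out)

# Dense ascending items compress to roughly one byte each:
print(len(encode_posting_list(list(range(1, 201)))))  # 200
```

The encoding itself is fully deterministic; the only degree of freedom, as noted above, is where the segment boundaries are drawn.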
On Mon, Mar 17, 2014 at 10:44 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Fujii Masao wrote:
>> On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov
>> <aekorotkov@gmail.com> wrote:
>> >> That could be optimized, but I figured we can live with it, thanks to the
>> >> fastupdate feature. Fastupdate allows amortizing that cost over several
>> >> insertions. But of course, you explicitly disabled that...
>> >
>> > Let me know if you want me to write patch addressing this issue.
>>
>> Yeah, I really want you to address this problem! That's definitely useful
>> for all users who disable FASTUPDATE for some reason.
>
> Users that disable FASTUPDATE, in my experience, do so because their
> stock work_mem is way too high and GIN searches become too slow due to
> having to scan too large a list.

Yes. Another reason that I've heard from users so far is that the size of
a GIN index with FASTUPDATE=off is likely to be smaller than that with
FASTUPDATE=on.

> I think it might make sense to invest
> a modest amount of time in getting FASTUPDATE to be sized completely
> differently from today -- perhaps base it on a hardcoded factor of
> BLCKSZ, rather than work_mem.  Or, if we really need to make it
> configurable, then let it have its own parameter.

I prefer to have the parameter. When users create multiple GIN indexes
for various uses, they might want to use different thresholds of the
pending list for each index. So, a GIN index parameter might be better
than a GUC one.

Regards,

--
Fujii Masao
On 15/03/14 20:27, Heikki Linnakangas wrote:
> That said, I didn't expect the difference to be quite that big when
> you're appending to the end of the table. When the new entries go to
> the end of the posting lists, you only need to recompress and WAL-log
> the last posting list, which is max 256 bytes long. But I guess that's
> still a lot more WAL than in the old format.
>
> That could be optimized, but I figured we can live with it, thanks to
> the fastupdate feature. Fastupdate allows amortizing that cost over
> several insertions. But of course, you explicitly disabled that...

In a concurrent update environment, fastupdate as it is in 9.2 is not
really useful. It may be that you can bulk up insertion, but you have no
control over who ends up paying the debt. Doubling the amount of WAL from
GIN indexing would be pretty tough for us; in 9.2 we generate roughly 1 TB
of WAL per day, keeping it for some weeks to be able to do PITR. The WAL
is mainly due to GIN index updates as new data is added and needs to be
searchable by users. We do run gzip, which cuts it down to 25-30% before
we keep it for the longer term, but doubling this is going to be a
migration challenge.

If fastupdate could be made to work in an environment where we both have
users searching the index and manually updating it, and 4+ backend
processes updating the index concurrently, then it would be a good
benefit to gain.

The GIN index currently contains 70+ million records with an average
tsvector of 124 terms.

--
Jesper .. trying to add some real-world info
I came up with the attached patch, to reduce the WAL volume of GIN
insertions. It became fairly large, but I guess that's not too surprising,
as the old WAL-logging method was basically to dump the whole page into
the WAL record. This is now a lot more fine-grained and smarter. I
separated constructing the WAL record from copying the changes back to the
disk page, which IMHO is a readability improvement even though it's more
code.

There are two parts to this patch:

* leafRepackItems has been rewritten. The previous coding basically
searched for the first modified item, and decoded and re-encoded
everything on the page after that. Now it tries harder to avoid
re-encoding segments that are still reasonably sized (between 128 and 384
bytes, with the target for new segments being 256 bytes). This ought to
make random updates faster as a bonus, but I didn't performance test that.

* Track more carefully which segments on the page have been modified. The
in-memory structure used to manipulate a page now keeps an action code for
each segment, indicating if the segment is completely new, deleted, or
replaced with new content, or if just some new items have been added to
it. These same actions are WAL-logged, and replayed in the redo routine.

This brings the WAL volume back into the same ballpark as 9.3. Or better,
depending on the operation.

Fujii, Alexander, how does this look to you?

- Heikki
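A rough model of the per-segment action codes described above, and of why "add items" records stay small when appending. The action names and byte costs here are placeholders chosen to mirror the description, not the patch's actual identifiers:

```python
from enum import Enum

class SegmentAction(Enum):
    UNMODIFIED = 0   # not included in the WAL record at all
    DELETE = 1       # segment removed from the page
    INSERT = 2       # brand-new segment, full content logged
    REPLACE = 3      # segment re-encoded, full new content logged
    ADDITEMS = 4     # only the appended items logged

def wal_payload_size(actions, seg_size=256, item_size=6):
    """Rough WAL payload for one page update: full segment content for
    INSERT/REPLACE, only the new items for ADDITEMS, one byte for DELETE."""
    total = 0
    for action, nitems in actions:
        if action in (SegmentAction.INSERT, SegmentAction.REPLACE):
            total += seg_size
        elif action is SegmentAction.ADDITEMS:
            total += nitems * item_size
        elif action is SegmentAction.DELETE:
            total += 1
    return total

# Appending a few items touches only the last segment of the page:
print(wal_payload_size([(SegmentAction.UNMODIFIED, 0)] * 5 +
                       [(SegmentAction.ADDITEMS, 3)]))  # 18
```

Under this model, the common append case logs a handful of bytes per insertion instead of a full re-encoded segment, which matches the claim that the patch brings the WAL volume back into the 9.3 ballpark.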
Attachments
On Thu, Mar 20, 2014 at 1:12 PM, Jesper Krogh <jesper@krogh.cc> wrote: > On 15/03/14 20:27, Heikki Linnakangas wrote: >> That said, I didn't expect the difference to be quite that big when you're >> appending to the end of the table. When the new entries go to the end of the >> posting lists, you only need to recompress and WAL-log the last posting >> list, which is max 256 bytes long. But I guess that's still a lot more WAL >> than in the old format. >> >> That could be optimized, but I figured we can live with it, thanks to the >> fastupdate feature. Fastupdate allows amortizing that cost over several >> insertions. But of course, you explicitly disabled that... > > In a concurrent update environment, fastupdate as it is in 9.2 is not really > useful. It may be that you can bulk up insertion, but you have no control > over who ends up paying the debt. Doubling the amount of wal from > gin-indexing would be pretty tough for us, in 9.2 we generate roughly 1TB > wal / day, keeping it > for some weeks to be able to do PITR. The wal are mainly due to gin-index > updates as new data is added and needs to be searchable by users. We do run > gzip that cuts it down to 25-30% before keeping the for too long, but > doubling this is going to be a migration challenge. > > If fast-update could be made to work in an environment where we both have > users searching the index and manually updating it and 4+ backend processes > updating the index concurrently then it would be a good benefit to gain. > > the gin index currently contains 70+ million records with and average > tsvector of 124 terms. Should we try to install some hack around fastupdate for 9.4? I fear the divergence between reasonable values of work_mem and reasonable sizes for that list is only going to continue to get bigger. I'm sure there's somebody out there who has work_mem = 16GB, and stuff like 263865a48973767ce8ed7b7788059a38a24a9f37 is only going to increase the appeal of large values. 
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
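Jesper's and Robert's point about amortization can be made concrete with a back-of-envelope model. Aside from the 256-byte segment cap and the 124-term average tsvector, which come from the thread, all constants below are invented for illustration; in particular, the 40-byte pending-list entry cost is an assumption, not a measured value:

```python
# Back-of-envelope model of why fastupdate amortizes GIN WAL volume:
# without it, every heap insert re-logs one recompressed posting-list
# segment per key it touches; with it, inserts append cheap pending-list
# entries and each touched segment is logged once per batch merge.

SEGMENT_BYTES = 256        # max recompressed segment size (from the thread)
KEYS_PER_ROW = 124         # Jesper's average tsvector length
ROWS = 1_000_000

def wal_without_fastupdate():
    # one recompressed segment logged per (row, key) pair
    return ROWS * KEYS_PER_ROW * SEGMENT_BYTES

def wal_with_fastupdate(batch_rows, pending_entry_bytes=40):
    # pending-list appends are small WAL records; merges pay the
    # per-key segment cost only once per batch (assumed entry size)
    pending = ROWS * KEYS_PER_ROW * pending_entry_bytes
    merges = (ROWS // batch_rows) * KEYS_PER_ROW * SEGMENT_BYTES
    return pending + merges

# Larger batches mean fewer per-key segment writes, hence less WAL.
assert wal_with_fastupdate(1000) < wal_without_fastupdate()
```

This is also why disabling fastupdate, as in Fujii's test, exposes the full per-insertion recompression cost, and why whoever triggers the pending-list flush "pays the debt" for everyone, as Jesper complains.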
On Tue, Apr 1, 2014 at 1:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 20, 2014 at 1:12 PM, Jesper Krogh <jesper@krogh.cc> wrote:
>> On 15/03/14 20:27, Heikki Linnakangas wrote:
>>> That said, I didn't expect the difference to be quite that big when
>>> you're appending to the end of the table. When the new entries go to
>>> the end of the posting lists, you only need to recompress and WAL-log
>>> the last posting list, which is max 256 bytes long. But I guess that's
>>> still a lot more WAL than in the old format.
>>>
>>> That could be optimized, but I figured we can live with it, thanks to
>>> the fastupdate feature. Fastupdate allows amortizing that cost over
>>> several insertions. But of course, you explicitly disabled that...
>>
>> In a concurrent update environment, fastupdate as it is in 9.2 is not
>> really useful. It may be that you can bulk up insertion, but you have no
>> control over who ends up paying the debt. Doubling the amount of WAL from
>> GIN indexing would be pretty tough for us: in 9.2 we generate roughly 1TB
>> of WAL per day, keeping it for some weeks to be able to do PITR. The WAL
>> is mainly due to GIN index updates as new data is added and needs to be
>> searchable by users. We do run gzip, which cuts it down to 25-30% before
>> we keep it for long, but doubling this is going to be a migration
>> challenge.
>>
>> If fastupdate could be made to work in an environment where we both have
>> users searching the index and manually updating it, and 4+ backend
>> processes updating the index concurrently, then it would be a good
>> benefit to gain.
>>
>> The GIN index currently contains 70+ million records with an average
>> tsvector of 124 terms.
>
> Should we try to install some hack around fastupdate for 9.4? I fear
> the divergence between reasonable values of work_mem and reasonable
> sizes for that list is only going to continue to get bigger. I'm sure
> there's somebody out there who has work_mem = 16GB, and stuff like
> 263865a48973767ce8ed7b7788059a38a24a9f37 is only going to increase the
> appeal of large values.

Controlling the threshold of the pending-list size only by a GUC doesn't
seem reasonable. Users may want to increase the threshold only for GIN
indexes which are updated heavily, and decrease it otherwise. So I think
it's better to add a new storage parameter for GIN indexes to control the
threshold, or both a storage parameter and a GUC.

Regards,

-- 
Fujii Masao
Fujii Masao <masao.fujii@gmail.com> writes:
> On Tue, Apr 1, 2014 at 1:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Should we try to install some hack around fastupdate for 9.4? I fear
>> the divergence between reasonable values of work_mem and reasonable
>> sizes for that list is only going to continue to get bigger. I'm sure
>> there's somebody out there who has work_mem = 16GB, and stuff like
>> 263865a48973767ce8ed7b7788059a38a24a9f37 is only going to increase the
>> appeal of large values.

> Controlling the threshold of the pending-list size only by a GUC doesn't
> seem reasonable. Users may want to increase the threshold only for GIN
> indexes which are updated heavily, and decrease it otherwise. So I think
> it's better to add a new storage parameter for GIN indexes to control the
> threshold, or both a storage parameter and a GUC.

Yeah, -1 for a GUC. A GIN-index-specific storage parameter seems more
appropriate. Or we could just hard-wire some maximum limit. Is it really
likely that users would take the trouble to set such a parameter if it
existed?

			regards, tom lane
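The precedence Fujii proposes mirrors how existing reloptions such as fillfactor interact with server-wide defaults: the per-index setting, if present, wins. A minimal sketch of that lookup, assuming a hypothetical reloption name "pending_list_limit" (no such parameter existed at the time of this thread):

```python
# Hypothetical per-index override of a global pending-list threshold.
# "pending_list_limit" is an invented reloption name for illustration.

def effective_pending_list_limit(guc_kb, index_reloptions):
    """index_reloptions: dict modeling CREATE INDEX ... WITH (...) options.
    The per-index value, if set, overrides the global (GUC-style) default."""
    return index_reloptions.get("pending_list_limit", guc_kb)

# heavily-updated index gets a bigger pending list; others keep the default
assert effective_pending_list_limit(4096, {"pending_list_limit": 16384}) == 16384
assert effective_pending_list_limit(4096, {}) == 4096
```

Tom's alternative, a hard-wired maximum, would correspond to dropping the dict lookup entirely and clamping to a compile-time constant.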