Обсуждение: [PATCH] XLogReadRecord returns pointer to currently read page
Hi, hackers! I propose the patch for fix one small code defect. The XLogReadRecord() function reads the pages of a WAL segment that contain a WAL-record. Then it creates a readRecordBuf buffer in private memory of a backend and copy record from the pages to the readRecordBuf buffer. Pointer 'record' is set to the beginning of the readRecordBuf buffer. But if the WAL-record is fully placed on one page, the XLogReadRecord() function forgets to bind the "record" pointer with the beginning of the readRecordBuf buffer. In this case, XLogReadRecord() returns a pointer to an internal xlog page. This patch fixes the defect. Previously, in all cases of using WAL this was not a problem. However if you plan to perform some decoding operations before returning the WAL record to the caller (this is my case), this can lead to bugs that are difficult to catch. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Вложения
On Fri, Aug 17, 2018 at 08:47:15AM +0500, Andrey Lepikhov wrote: > Previously, in all cases of using WAL this was not a problem. However if you > plan to perform some decoding operations before returning the WAL record to > the caller (this is my case), this can lead to bugs that are difficult to > catch. What kind of cases with logical decoding are you referring to? If any of them can be reproduced with upstream, could you send a reproducer, or a way to see it? -- Michael
Вложения
17.08.2018 08:55, Michael Paquier пишет: > On Fri, Aug 17, 2018 at 08:47:15AM +0500, Andrey Lepikhov wrote: >> Previously, in all cases of using WAL this was not a problem. However if you >> plan to perform some decoding operations before returning the WAL record to >> the caller (this is my case), this can lead to bugs that are difficult to >> catch. > > What kind of cases with logical decoding are you referring to? If > any of them can be reproduced with upstream, could you send a > reproducer, or a way to see it? > -- > Michael > I'm working on the problem of a WAL-record size reducing. One of the PostgreSQL experimental extensions uses a 64-bit xid, and the XLogRecord size is increased to 32 bytes. Representing a 64-bit xid in WAL Segment as a pair (16-bit base, 8-bit offset) reduces the size of the WAL record header and its body. So, I encode a WAL-record in XLogInsertRecord() (decreasing the size) just before writing to a shared memory pages and performs the decoding just before processing DecodeXLogRecord() operation. It looks like a layer in the xlog subsystem for additional WAL compression. However, this is a long story ... -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
On 17/08/2018 06:47, Andrey Lepikhov wrote: > I propose the patch for fix one small code defect. > The XLogReadRecord() function reads the pages of a WAL segment that > contain a WAL-record. Then it creates a readRecordBuf buffer in private > memory of a backend and copy record from the pages to the readRecordBuf > buffer. Pointer 'record' is set to the beginning of the readRecordBuf > buffer. > > But if the WAL-record is fully placed on one page, the XLogReadRecord() > function forgets to bind the "record" pointer with the beginning of the > readRecordBuf buffer. In this case, XLogReadRecord() returns a pointer > to an internal xlog page. This patch fixes the defect. Hmm. I agree it looks weird the way it is. But I wonder, why do we even copy the record to readRecordBuf, if it fits on the WAL page? Returning a pointer to the xlog page buffer seems OK in that case. What if we remove the memcpy(), instead? - Heikki
On 22.10.2018 2:06, Heikki Linnakangas wrote: > On 17/08/2018 06:47, Andrey Lepikhov wrote: >> I propose the patch for fix one small code defect. >> The XLogReadRecord() function reads the pages of a WAL segment that >> contain a WAL-record. Then it creates a readRecordBuf buffer in private >> memory of a backend and copy record from the pages to the readRecordBuf >> buffer. Pointer 'record' is set to the beginning of the readRecordBuf >> buffer. >> >> But if the WAL-record is fully placed on one page, the XLogReadRecord() >> function forgets to bind the "record" pointer with the beginning of the >> readRecordBuf buffer. In this case, XLogReadRecord() returns a pointer >> to an internal xlog page. This patch fixes the defect. > > Hmm. I agree it looks weird the way it is. But I wonder, why do we even > copy the record to readRecordBuf, if it fits on the WAL page? Returning > a pointer to the xlog page buffer seems OK in that case. What if we > remove the memcpy(), instead? It depends on the PostgreSQL roadmap. If we want to compress main data and encode something to reduce a WAL-record size, than the representation of the WAL-record in shared buffers (packed) and in local memory (unpacked) will be different and the patch is needed. If something like this should not be in the PostgreSQL core, then it is more efficient to remove memcpy(), of course. I vote for the patch. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
On 22/10/2018 20:54, Andrey Lepikhov wrote: > > > On 22.10.2018 2:06, Heikki Linnakangas wrote: >> On 17/08/2018 06:47, Andrey Lepikhov wrote: >>> I propose the patch for fix one small code defect. >>> The XLogReadRecord() function reads the pages of a WAL segment that >>> contain a WAL-record. Then it creates a readRecordBuf buffer in private >>> memory of a backend and copy record from the pages to the readRecordBuf >>> buffer. Pointer 'record' is set to the beginning of the readRecordBuf >>> buffer. >>> >>> But if the WAL-record is fully placed on one page, the XLogReadRecord() >>> function forgets to bind the "record" pointer with the beginning of the >>> readRecordBuf buffer. In this case, XLogReadRecord() returns a pointer >>> to an internal xlog page. This patch fixes the defect. >> >> Hmm. I agree it looks weird the way it is. But I wonder, why do we even >> copy the record to readRecordBuf, if it fits on the WAL page? Returning >> a pointer to the xlog page buffer seems OK in that case. What if we >> remove the memcpy(), instead? > > It depends on the PostgreSQL roadmap. If we want to compress main data > and encode something to reduce a WAL-record size, than the > representation of the WAL-record in shared buffers (packed) and in local > memory (unpacked) will be different and the patch is needed. I'd expect the decompression to read from the on-disk buffer, and unpack to readRecordBuf, I still don't see a need to copy the packed record to readRecordBuf. If there is a need for that, though, the patch that implements the packing or compression can add the memcpy() where it needs it. - Heikki
On 23.10.2018 0:53, Heikki Linnakangas wrote: > I'd expect the decompression to read from the on-disk buffer, and unpack > to readRecordBuf, I still don't see a need to copy the packed record to > readRecordBuf. If there is a need for that, though, the patch that > implements the packing or compression can add the memcpy() where it > needs it. I agree with it. Eventually, placement of the WAL-record can be defined by comparison the record, readBuf and readRecordBuf pointers. In attachment new version of the patch. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Вложения
Hello. At Tue, 23 Oct 2018 10:25:27 +0500, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote in <2553f2b0-0e39-eb0e-d382-6c0ed08ca8ae@postgrespro.ru> > > On 23.10.2018 0:53, Heikki Linnakangas wrote: > > I'd expect the decompression to read from the on-disk buffer, and > > unpack to readRecordBuf, I still don't see a need to copy the packed > > record to readRecordBuf. If there is a need for that, though, the > > patch that implements the packing or compression can add the memcpy() > > where it needs it. > > I agree with it. Eventually, placement of the WAL-record can be > defined by comparison the record, readBuf and readRecordBuf pointers. > In attachment new version of the patch. This looks quite clear and what it does is reasonable to me. Applies cleanly on top of current master and no regression seen. Just one comment. This patch leaves the following code. > /* Record does not cross a page boundary */ > if (!ValidXLogRecord(state, record, RecPtr)) > goto err; > > state->EndRecPtr = RecPtr + MAXALIGN(total_len); !> > state->ReadRecPtr = RecPtr; > } The empty line (marked by '!') looks a bit too much standing out after this patch. Could you remove the line? Then could you transpose the two assignments if you don't mind? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 26.10.2018 10:33, Kyotaro HORIGUCHI wrote: > Hello. > > At Tue, 23 Oct 2018 10:25:27 +0500, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote in <2553f2b0-0e39-eb0e-d382-6c0ed08ca8ae@postgrespro.ru> >> >> On 23.10.2018 0:53, Heikki Linnakangas wrote: >>> I'd expect the decompression to read from the on-disk buffer, and >>> unpack to readRecordBuf, I still don't see a need to copy the packed >>> record to readRecordBuf. If there is a need for that, though, the >>> patch that implements the packing or compression can add the memcpy() >>> where it needs it. >> >> I agree with it. Eventually, placement of the WAL-record can be >> defined by comparison the record, readBuf and readRecordBuf pointers. >> In attachment new version of the patch. > > This looks quite clear and what it does is reasonable to me. > Applies cleanly on top of current master and no regression seen. > > > Just one comment. This patch leaves the following code. > > > /* Record does not cross a page boundary */ > > if (!ValidXLogRecord(state, record, RecPtr)) > > goto err; > > > > state->EndRecPtr = RecPtr + MAXALIGN(total_len); > !> > > state->ReadRecPtr = RecPtr; > > } > > The empty line (marked by '!') looks a bit too much standing out > after this patch. Could you remove the line? Then could you > transpose the two assignments if you don't mind? Thanks, see attachment. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Вложения
At Fri, 26 Oct 2018 11:43:14 +0500, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote in <ea3dca6a-b898-e7a8-dcfa-b3ad227a6349@postgrespro.ru> > > > On 26.10.2018 10:33, Kyotaro HORIGUCHI wrote: > > Hello. > > At Tue, 23 Oct 2018 10:25:27 +0500, Andrey Lepikhov > > <a.lepikhov@postgrespro.ru> wrote in > > <2553f2b0-0e39-eb0e-d382-6c0ed08ca8ae@postgrespro.ru> > >> > >> On 23.10.2018 0:53, Heikki Linnakangas wrote: > >>> I'd expect the decompression to read from the on-disk buffer, and > >>> unpack to readRecordBuf, I still don't see a need to copy the packed > >>> record to readRecordBuf. If there is a need for that, though, the > >>> patch that implements the packing or compression can add the memcpy() > >>> where it needs it. > >> > >> I agree with it. Eventually, placement of the WAL-record can be > >> defined by comparison the record, readBuf and readRecordBuf pointers. > >> In attachment new version of the patch. > > This looks quite clear and what it does is reasonable to me. > > Applies cleanly on top of current master and no regression seen. > > Just one comment. This patch leaves the following code. > > > /* Record does not cross a page boundary */ > > > if (!ValidXLogRecord(state, record, RecPtr)) > > > goto err; > > > > > > state->EndRecPtr = RecPtr + MAXALIGN(total_len); > > !> > > > state->ReadRecPtr = RecPtr; > > > } > > The empty line (marked by '!') looks a bit too much standing out > > after this patch. Could you remove the line? Then could you > > transpose the two assignments if you don't mind? > > Thanks, see attachment. Thanks. This patch eliminates unnecessary copying that was done for non-continued records. Now the return value of XLogReadRecord directly points into page buffer holded in XLogReaderStats. It is safe because no caller site uses the returned pointer beyond the replacement of buffer content at the next call to the same function. The code looks fine and correct. This applied on HEAD@master cleanly. Passed all regression tests. Passed all tests under src/test/recovery I marked this "Ready for Committer". regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Nov 15, 2018 at 06:12:38PM +0900, Kyotaro HORIGUCHI wrote: > This patch eliminates unnecessary copying that was done for > non-continued records. Now the return value of XLogReadRecord > directly points into page buffer holded in XLogReaderStats. It is > safe because no caller site uses the returned pointer beyond the > replacement of buffer content at the next call to the same > function. I was looking at this patch, and shouldn't we worry about compatibility with plugins or utilities which look directly at what's in readRecordBuf for the record contents? Let's not forget that the contents of XLogReaderState are public. -- Michael
Вложения
On Fri, Nov 16, 2018 at 03:23:55PM +0900, Michael Paquier wrote: > I was looking at this patch, and shouldn't we worry about compatibility > with plugins or utilities which look directly at what's in readRecordBuf > for the record contents? Let's not forget that the contents of > XLogReaderState are public. And a couple of days later, committed. I did not notice first that the comments of xlogreader.h mention that a couple of areas are private, and readRecordBuf is part of that, so objection withdrawn. There is a comment in xlog.c (on top of ReadRecord) referring to readRecordBuf which has not been removed as part of 7fcbf6a4 when WAL reading has been moved to its own facility. The original comment was from 0ffe11a. So I have removed the comment at the same time. -- Michael
Вложения
Thank you for taking this. At Mon, 19 Nov 2018 10:27:29 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20181119012728.GA1578@paquier.xyz> > On Fri, Nov 16, 2018 at 03:23:55PM +0900, Michael Paquier wrote: > > I was looking at this patch, and shouldn't we worry about compatibility > > with plugins or utilities which look directly at what's in readRecordBuf > > for the record contents? Let's not forget that the contents of > > XLogReaderState are public. > > And a couple of days later, committed. I did not notice first that the > comments of xlogreader.h mention that a couple of areas are private, and > readRecordBuf is part of that, so objection withdrawn. It wasn't changed the way the function replaces the content of the buffer. (Regardless of whether it is private or not, I believe no one tries to write directly there outside the function.) Also the memory pointed by the retuned pointer from the previous code was regarded as read-only. The only point to notice was the lifetime of the returned pointer, I suppose. > There is a comment in xlog.c (on top of ReadRecord) referring to > readRecordBuf which has not been removed as part of 7fcbf6a4 when WAL > reading has been moved to its own facility. The original comment was > from 0ffe11a. So I have removed the comment at the same time. - * The record is copied into readRecordBuf, so that on successful return, - * the returned record pointer always points there. Yeah, it is incorrect in some sense, but the comment was suggesting the lifetime of the pointed record. So I'd like to have a comment like this instead. + * The record is copied into xlogreader, so that on successful return, + * the returned record pointer always points there. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Nov 19, 2018 at 12:28:06PM +0900, Kyotaro HORIGUCHI wrote: > Yeah, it is incorrect in some sense, but the comment was > suggesting the lifetime of the pointed record. So I'd like to > have a comment like this instead. I think that we can still live without as it is not the business of this routine to be careful how the lifetime of a record read is handled, but that's part of the internals of XLogReader. -- Michael
Вложения
On 16.11.2018 11:23, Michael Paquier wrote: > On Thu, Nov 15, 2018 at 06:12:38PM +0900, Kyotaro HORIGUCHI wrote: >> This patch eliminates unnecessary copying that was done for >> non-continued records. Now the return value of XLogReadRecord >> directly points into page buffer holded in XLogReaderStats. It is >> safe because no caller site uses the returned pointer beyond the >> replacement of buffer content at the next call to the same >> function. > > I was looking at this patch, and shouldn't we worry about compatibility > with plugins or utilities which look directly at what's in readRecordBuf > for the record contents? Let's not forget that the contents of > XLogReaderState are public. According to my experience, I clarify some comments to avoid this mistakes in the future (see attachment). -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Вложения
On Mon, Nov 19, 2018 at 10:48:06AM +0500, Andrey Lepikhov wrote: > According to my experience, I clarify some comments to avoid this mistakes > in the future (see attachment). No objections from here. > - * The returned pointer (or *errormsg) points to an internal buffer that's > - * valid until the next call to XLogReadRecord. > + * The returned pointer (or *errormsg) points to an internal read-only buffer > + * that's valid until the next call to XLogReadRecord. Not sure that this bit adds much. > - /* Buffer for current ReadRecord result (expandable) */ > + /* > + * Buffer for current ReadRecord result (expandable). > + * Used in the case, than current ReadRecord result don't fit on the > + * currently read WAL page. > + */ Yeah, this one is not entirely true now. What about the following instead: - /* Buffer for current ReadRecord result (expandable) */ + /* + * Buffer for current ReadRecord result (expandable), used when a record + * crosses a page boundary. + */ -- Michael
Вложения
On 20.11.2018 6:30, Michael Paquier wrote: > On Mon, Nov 19, 2018 at 10:48:06AM +0500, Andrey Lepikhov wrote: >> According to my experience, I clarify some comments to avoid this mistakes >> in the future (see attachment). > > No objections from here. > >> - * The returned pointer (or *errormsg) points to an internal buffer that's >> - * valid until the next call to XLogReadRecord. >> + * The returned pointer (or *errormsg) points to an internal read-only buffer >> + * that's valid until the next call to XLogReadRecord. > > Not sure that this bit adds much. Ok > >> - /* Buffer for current ReadRecord result (expandable) */ >> + /* >> + * Buffer for current ReadRecord result (expandable). >> + * Used in the case, than current ReadRecord result don't fit on the >> + * currently read WAL page. >> + */ > > Yeah, this one is not entirely true now. What about the following > instead: > - /* Buffer for current ReadRecord result (expandable) */ > + /* > + * Buffer for current ReadRecord result (expandable), used when a record > + * crosses a page boundary. > + */ I agree -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
On Tue, Nov 20, 2018 at 06:37:36AM +0500, Andrey Lepikhov wrote: > On 20.11.2018 6:30, Michael Paquier wrote: >> Yeah, this one is not entirely true now. What about the following >> instead: >> - /* Buffer for current ReadRecord result (expandable) */ >> + /* >> + * Buffer for current ReadRecord result (expandable), used when a record >> + * crosses a page boundary. >> + */ > > I agree Okay, I have committed this version, after checking that there were no other spots. -- Michael