Discussion: Re: [HACKERS] data on devel code perf dip
Mark Wong wrote:
> On Thu, 11 Aug 2005 22:11:42 -0400 (EDT)
> Bruce Momjian <pgman@candle.pha.pa.us> wrote:
>
> > Tom Lane wrote:
> > > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > >> O_DIRECT is only being used for WAL page writes (or I sure hope so
> > > >> anyway), so shared_buffers should be irrelevant.
> > >
> > > > Uh, O_DIRECT really just enables when open_sync is used, and I assume
> > > > that is not used for writing dirty buffers during a checkpoint.
> > >
> > > I double-checked that O_DIRECT is really just used for WAL, and only
> > > when the sync mode is open_sync or open_datasync. So it seems
> > > impossible that it affected a run with mode fdatasync. What seems the
> > > best theory at the moment is that the grouped-WAL-write part of the
> > > patch doesn't work so well as we thought.
> >
> > Yes, that's my only guess. Let us know if you want the patch to test,
> > rather than pulling CVS before and after the patch was applied.
>
> Yeah, a patch would be a little easier. :)

OK, patch attached. The code has been cleaned up a little since then but
this is the basic change that should be tested. It is based on CVS of
2005/07/29 03:22:33 GMT.

--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.210
retrieving revision 1.211
diff -c -r1.210 -r1.211
*** src/backend/access/transam/xlog.c 23 Jul 2005 15:31:16 -0000 1.210
--- src/backend/access/transam/xlog.c 29 Jul 2005 03:22:33 -0000 1.211
***************
*** 48,77 ****


/*
* This chunk of hackery attempts to determine which file sync methods
* are available on the current platform, and to choose an appropriate
* default method. We assume that fsync() is always available, and that
* configure determined whether fdatasync() is.
*/
#if defined(O_SYNC)
! #define OPEN_SYNC_FLAG O_SYNC
#else
#if defined(O_FSYNC)
! #define OPEN_SYNC_FLAG O_FSYNC
#endif
#endif

#if defined(O_DSYNC)
#if defined(OPEN_SYNC_FLAG)
! #if O_DSYNC != OPEN_SYNC_FLAG
! #define OPEN_DATASYNC_FLAG O_DSYNC
#endif
#else /* !defined(OPEN_SYNC_FLAG) */
/* Win32 only has O_DSYNC */
! #define OPEN_DATASYNC_FLAG O_DSYNC
#endif
#endif

#if defined(OPEN_DATASYNC_FLAG)
#define DEFAULT_SYNC_METHOD_STR "open_datasync"
#define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
--- 48,117 ----


/*
+ * Becauase O_DIRECT bypasses the kernel buffers, and because we never
+ * read those buffers except during crash recovery, it is a win to use
+ * it in all cases where we sync on each write(). We could allow O_DIRECT
+ * with fsync(), but because skipping the kernel buffer forces writes out
+ * quickly, it seems best just to use it for O_SYNC. It is hard to imagine
+ * how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
+ */
+ #ifdef O_DIRECT
+ #define PG_O_DIRECT O_DIRECT
+ #else
+ #define PG_O_DIRECT 0
+ #endif
+
+ /*
* This chunk of hackery attempts to determine which file sync methods
* are available on the current platform, and to choose an appropriate
* default method. We assume that fsync() is always available, and that
* configure determined whether fdatasync() is.
*/
#if defined(O_SYNC)
! #define CMP_OPEN_SYNC_FLAG O_SYNC
#else
#if defined(O_FSYNC)
! #define CMP_OPEN_SYNC_FLAG O_FSYNC
#endif
#endif
+ #define OPEN_SYNC_FLAG (CMP_OPEN_SYNC_FLAG | PG_O_DIRECT)

#if defined(O_DSYNC)
#if defined(OPEN_SYNC_FLAG)
! #if O_DSYNC != CMP_OPEN_SYNC_FLAG
! #define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
#endif
#else /* !defined(OPEN_SYNC_FLAG) */
/* Win32 only has O_DSYNC */
! #define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
#endif
#endif

+ /*
+ * Limitation of buffer-alignment for direct io depend on OS and filesystem,
+ * but BLCKSZ is assumed to be enough for it.
+ */
+ #ifdef O_DIRECT
+ #define ALIGNOF_XLOG_BUFFER BLCKSZ
+ #else
+ #define ALIGNOF_XLOG_BUFFER MAXIMUM_ALIGNOF
+ #endif
+
+ /*
+ * Switch the alignment routine because ShmemAlloc() returns a max-aligned
+ * buffer and ALIGNOF_XLOG_BUFFER may be greater than MAXIMUM_ALIGNOF.
+ */
+ #if ALIGNOF_XLOG_BUFFER <= MAXIMUM_ALIGNOF
+ #define XLOG_BUFFER_ALIGN(LEN) MAXALIGN((LEN))
+ #else
+ #define XLOG_BUFFER_ALIGN(LEN) ((LEN) + (ALIGNOF_XLOG_BUFFER))
+ #endif
+ /* assume sizeof(ptrdiff_t) == sizeof(void*) */
+ #define POINTERALIGN(ALIGNVAL,PTR) \
+ ((char *)(((ptrdiff_t) (PTR) + (ALIGNVAL-1)) & ~((ptrdiff_t) (ALIGNVAL-1))))
+ #define XLOG_BUFFER_POINTERALIGN(PTR) \
+ POINTERALIGN((ALIGNOF_XLOG_BUFFER), (PTR))
+
#if defined(OPEN_DATASYNC_FLAG)
#define DEFAULT_SYNC_METHOD_STR "open_datasync"
#define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
***************
*** 469,474 ****
--- 509,525 ----
static char *str_time(time_t tnow);
static void issue_xlog_fsync(void);

+ /* XLog gather-write staffs */
+ typedef struct XLogPages
+ {
+ char *head; /* Head of first page */
+ int size; /* Total bytes of pages == count(pages) * BLCKSZ */
+ int offset; /* Offset in xlog segment file */
+ } XLogPages;
+ static void XLogPageReset(XLogPages *pages);
+ static void XLogPageWrite(XLogPages *pages, int index);
+ static void XLogPageFlush(XLogPages *pages, int index);
+
#ifdef WAL_DEBUG
static void xlog_outrec(char *buf, XLogRecord *record);
#endif
***************
*** 1245,1253 ****
XLogWrite(XLogwrtRqst WriteRqst)
{
XLogCtlWrite *Write = &XLogCtl->Write;
- char *from;
bool ispartialpage;
bool use_existent;

/* We should always be inside a critical section here */
Assert(CritSectionCount > 0);
--- 1296,1305 ----
XLogWrite(XLogwrtRqst WriteRqst)
{
XLogCtlWrite *Write = &XLogCtl->Write;
bool ispartialpage;
bool use_existent;
+ int currentIndex = Write->curridx;
+ XLogPages pages;

/* We should always be inside a critical section here */
Assert(CritSectionCount > 0);
***************
*** 1258,1263 ****
--- 1310,1317 ----
*/
LogwrtResult = Write->LogwrtResult;

+ XLogPageReset(&pages);
+
while (XLByteLT(LogwrtResult.Write, WriteRqst.Write))
{
/*
***************
*** 1266,1279 ****
* end of the last page that's been initialized by
* AdvanceXLInsertBuffer.
*/
! if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[Write->curridx]))
elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! XLogCtl->xlblocks[Write->curridx].xlogid,
! XLogCtl->xlblocks[Write->curridx].xrecoff);

/* Advance LogwrtResult.Write to end of current buffer page */
! LogwrtResult.Write = XLogCtl->xlblocks[Write->curridx];
ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);

if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
--- 1320,1333 ----
* end of the last page that's been initialized by
* AdvanceXLInsertBuffer.
*/
! if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[currentIndex]))
elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! XLogCtl->xlblocks[currentIndex].xlogid,
! XLogCtl->xlblocks[currentIndex].xrecoff);

/* Advance LogwrtResult.Write to end of current buffer page */
! LogwrtResult.Write = XLogCtl->xlblocks[currentIndex];
ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);

if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
***************
*** 1281,1286 ****
--- 1335,1341 ----
/*
* Switch to new logfile segment.
*/
+ XLogPageFlush(&pages, currentIndex);
if (openLogFile >= 0)
{
if (close(openLogFile))
***************
*** 1354,1384 ****
openLogOff = 0;
}

! /* Need to seek in the file? */
! if (openLogOff != (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize)
! {
! openLogOff = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
! if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
! ereport(PANIC,
! (errcode_for_file_access(),
! errmsg("could not seek in log file %u, segment %u to offset %u: %m",
! openLogId, openLogSeg, openLogOff)));
! }
!
! /* OK to write the page */
! from = XLogCtl->pages + Write->curridx * BLCKSZ;
! errno = 0;
! if (write(openLogFile, from, BLCKSZ) != BLCKSZ)
! {
! /* if write didn't set errno, assume problem is no disk space */
! if (errno == 0)
! errno = ENOSPC;
! ereport(PANIC,
! (errcode_for_file_access(),
! errmsg("could not write to log file %u, segment %u at offset %u: %m",
! openLogId, openLogSeg, openLogOff)));
! }
! openLogOff += BLCKSZ;

/*
* If we just wrote the whole last page of a logfile segment,
--- 1409,1416 ----
openLogOff = 0;
}

! /* Add a page to buffer */
! XLogPageWrite(&pages, currentIndex);

/*
* If we just wrote the whole last page of a logfile segment,
***************
*** 1390,1397 ****
* This is also the right place to notify the Archiver that the
* segment is ready to copy to archival storage.
*/
! if (openLogOff >= XLogSegSize && !ispartialpage)
{
issue_xlog_fsync();
LogwrtResult.Flush = LogwrtResult.Write; /* end of current page */

--- 1422,1430 ----
* This is also the right place to notify the Archiver that the
* segment is ready to copy to archival storage.
*/
! if (openLogOff + pages.size >= XLogSegSize && !ispartialpage)
{
+ XLogPageFlush(&pages, currentIndex);
issue_xlog_fsync();
LogwrtResult.Flush = LogwrtResult.Write; /* end of current page */

***************
*** 1405,1412 ****
LogwrtResult.Write = WriteRqst.Write;
break;
}
! Write->curridx = NextBufIdx(Write->curridx);
}

/*
* If asked to flush, do so
--- 1438,1446 ----
LogwrtResult.Write = WriteRqst.Write;
break;
}
! currentIndex = NextBufIdx(currentIndex);
}
+ XLogPageFlush(&pages, currentIndex);

/*
* If asked to flush, do so
***************
*** 3584,3590 ****
if (XLOGbuffers < MinXLOGbuffers)
XLOGbuffers = MinXLOGbuffers;

! return MAXALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
+ BLCKSZ * XLOGbuffers +
MAXALIGN(sizeof(ControlFileData));
}
--- 3618,3624 ----
if (XLOGbuffers < MinXLOGbuffers)
XLOGbuffers = MinXLOGbuffers;

! return XLOG_BUFFER_ALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
+ BLCKSZ * XLOGbuffers +
MAXALIGN(sizeof(ControlFileData));
}
***************
*** 3601,3607 ****

XLogCtl = (XLogCtlData *)
ShmemInitStruct("XLOG Ctl",
! MAXALIGN(sizeof(XLogCtlData) +
sizeof(XLogRecPtr) * XLOGbuffers)
+ BLCKSZ * XLOGbuffers,
&foundXLog);
--- 3635,3641 ----

XLogCtl = (XLogCtlData *)
ShmemInitStruct("XLOG Ctl",
! XLOG_BUFFER_ALIGN(sizeof(XLogCtlData) +
sizeof(XLogRecPtr) * XLOGbuffers)
+ BLCKSZ * XLOGbuffers,
&foundXLog);
***************
*** 3630,3638 ****
* Here, on the other hand, we must MAXALIGN to ensure the page
* buffers have worst-case alignment.
*/
! XLogCtl->pages =
! ((char *) XLogCtl) + MAXALIGN(sizeof(XLogCtlData) +
! sizeof(XLogRecPtr) * XLOGbuffers);
memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);

/*
--- 3664,3672 ----
* Here, on the other hand, we must MAXALIGN to ensure the page
* buffers have worst-case alignment.
*/
! XLogCtl->pages = XLOG_BUFFER_POINTERALIGN(
! ((char *) XLogCtl)
! + sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers);
memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);

/*
***************
*** 3690,3699 ****
/* First timeline ID is always 1 */
ThisTimeLineID = 1;

! /* Use malloc() to ensure buffer is MAXALIGNED */
! buffer = (char *) malloc(BLCKSZ);
! page = (XLogPageHeader) buffer;
! memset(buffer, 0, BLCKSZ);

/* Set up information for the initial checkpoint record */
checkPoint.redo.xlogid = 0;
--- 3724,3732 ----
/* First timeline ID is always 1 */
ThisTimeLineID = 1;

! buffer = (char *) malloc(BLCKSZ + ALIGNOF_XLOG_BUFFER);
! page = (XLogPageHeader) XLOG_BUFFER_POINTERALIGN(buffer);
! memset(page, 0, BLCKSZ);

/* Set up information for the initial checkpoint record */
checkPoint.redo.xlogid = 0;
***************
*** 3745,3751 ****

/* Write the first page with the initial record */
errno = 0;
! if (write(openLogFile, buffer, BLCKSZ) != BLCKSZ)
{
/* if write didn't set errno, assume problem is no disk space */
if (errno == 0)
--- 3778,3784 ----

/* Write the first page with the initial record */
errno = 0;
! if (write(openLogFile, page, BLCKSZ) != BLCKSZ)
{
/* if write didn't set errno, assume problem is no disk space */
if (errno == 0)
***************
*** 5837,5839 ****
--- 5870,5940 ----
errmsg("could not remove file \"%s\": %m",
BACKUP_LABEL_FILE)));
}
+
+
+ /* XLog gather-write staffs */
+
+ static void
+ XLogPageReset(XLogPages *pages)
+ {
+ memset(pages, 0, sizeof(*pages));
+ }
+
+ static void
+ XLogPageWrite(XLogPages *pages, int index)
+ {
+ char *page = XLogCtl->pages + index * BLCKSZ;
+ int size = BLCKSZ;
+ int offset = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
+
+ if (pages->head + pages->size == page
+ && pages->offset + pages->size == offset)
+ { /* Pages are continuous. Append new page. */
+ pages->size += size;
+ }
+ else
+ { /* Pages are not continuous. Flush and clear. */
+ XLogPageFlush(pages, PrevBufIdx(index));
+ pages->head = page;
+ pages->size = size;
+ pages->offset = offset;
+ }
+ }
+
+ static void
+ XLogPageFlush(XLogPages *pages, int index)
+ {
+ if (!pages->head)
+ { /* No needs to write pages. */
+ XLogCtl->Write.curridx = index;
+ return;
+ }
+
+ /* Need to seek in the file? */
+ if (openLogOff != pages->offset)
+ {
+ openLogOff = pages->offset;
+ if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not seek in log file %u, segment %u to offset %u: %m",
+ openLogId, openLogSeg, openLogOff)));
+ }
+
+ /* OK to write the page */
+ errno = 0;
+ if (write(openLogFile, pages->head, pages->size) != pages->size)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write to log file %u, segment %u at offset %u: %m",
+ openLogId, openLogSeg, openLogOff)));
+ }
+
+ openLogOff += pages->size;
+ XLogCtl->Write.curridx = index;
+ XLogPageReset(pages);
+ }
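
The buffer-alignment part of the patch can be tried in isolation. What follows is a minimal sketch, not code from the patch: it over-allocates with malloc() and rounds the pointer up to a page boundary, which is the same idea as POINTERALIGN above. The names sketch_pointer_align, sketch_alloc_aligned_page and SKETCH_BLCKSZ are made up for illustration, the 8192-byte page size is an assumption, and uintptr_t is used where the patch uses ptrdiff_t.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_BLCKSZ 8192	/* assumed page size; must be a power of two */

/* Round ptr up to the next multiple of align (a power of two). */
static char *
sketch_pointer_align(void *ptr, uintptr_t align)
{
	uintptr_t	p = (uintptr_t) ptr;

	assert(align != 0 && (align & (align - 1)) == 0);
	return (char *) ((p + (align - 1)) & ~(align - 1));
}

/*
 * Allocate one page-aligned page.  *raw_out receives the pointer that
 * must later be passed to free(); the return value is the aligned page.
 * Returns NULL if malloc() fails.
 */
static char *
sketch_alloc_aligned_page(void **raw_out)
{
	/* one page of payload plus one page of slack for the round-up */
	void	   *raw = malloc(SKETCH_BLCKSZ + SKETCH_BLCKSZ);

	if (raw == NULL)
		return NULL;
	*raw_out = raw;
	return sketch_pointer_align(raw, SKETCH_BLCKSZ);
}
```

Note that the caller has to keep the raw pointer around for free(), since the aligned pointer generally differs from what malloc() returned.
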
Just to be certain I know what I have and how to use it,
please confirm the following is correct.

According to Markw, the tarball we have at OSDL that is
dated 7/29 already has the O_DIRECT patch + wal grouping
applied. Hence, I will use the one from the day before:
postgresql-20050728.tar.bz2 and apply your patch.

I expect to have, after applying the patch, the O_DIRECT patch
for the log (which I should not be using given the config
parameters I have), but I will _not have the wal grouping?

Is that correct?

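
For reference, the reason O_DIRECT should not come into play with a sync mode of fdatasync: per the discussion quoted in this thread, it is only OR'ed into the open() flags when the sync mode is open_sync or open_datasync. A rough sketch of that selection follows, assuming a POSIX <fcntl.h>. The enum and function names are hypothetical and this is not the xlog.c code; only the "define to 0 when O_DIRECT is missing" trick is taken from the patch.

```c
#include <assert.h>
#include <fcntl.h>

/* Same fallback idea as PG_O_DIRECT in the patch: a no-op flag if absent. */
#ifdef O_DIRECT
#define SKETCH_O_DIRECT O_DIRECT
#else
#define SKETCH_O_DIRECT 0
#endif

/* Hypothetical stand-in for the configured WAL sync mode. */
typedef enum
{
	SKETCH_FSYNC,
	SKETCH_FDATASYNC,
	SKETCH_OPEN_SYNC
} SketchSyncMethod;

/* Flags a WAL segment would be opened with under each sync mode. */
static int
sketch_wal_open_flags(SketchSyncMethod method)
{
	int			flags = O_RDWR;

	/* Only the sync-on-open mode gets O_SYNC, and with it O_DIRECT. */
	if (method == SKETCH_OPEN_SYNC)
		flags |= O_SYNC | SKETCH_O_DIRECT;
	return flags;
}
```

With SKETCH_FDATASYNC the result is plain O_RDWR, so neither O_SYNC nor O_DIRECT reaches open().
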
On Fri, 2005-08-12 at 12:12 -0400, Bruce Momjian wrote:
> Mark Wong wrote:
> > On Thu, 11 Aug 2005 22:11:42 -0400 (EDT)
> > Bruce Momjian <pgman@candle.pha.pa.us> wrote:
> >
> > > Tom Lane wrote:
> > > > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > > >> O_DIRECT is only being used for WAL page writes (or I sure hope so
> > > > >> anyway), so shared_buffers should be irrelevant.
> > > >
> > > > > Uh, O_DIRECT really just enables when open_sync is used, and I assume
> > > > > that is not used for writing dirty buffers during a checkpoint.
> > > >
> > > > I double-checked that O_DIRECT is really just used for WAL, and only
> > > > when the sync mode is open_sync or open_datasync. So it seems
> > > > impossible that it affected a run with mode fdatasync. What seems the
> > > > best theory at the moment is that the grouped-WAL-write part of the
> > > > patch doesn't work so well as we thought.
> > >
> > > Yes, that's my only guess. Let us know if you want the patch to test,
> > > rather than pulling CVS before and after the patch was applied.
> >
> > Yeah, a patch would be a little easier. :)
>
> OK, patch attached. The code has been cleaned up a little since then but
> this is the basic change that should be tested. It is based on CVS of
> 2005/07/29 03:22:33 GMT.
>
> Plain text document attachment (/bjm/diff)
> Index: src/backend/access/transam/xlog.c
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
> retrieving revision 1.210
> retrieving revision 1.211
> diff -c -r1.210 -r1.211
> *** src/backend/access/transam/xlog.c 23 Jul 2005 15:31:16 -0000 1.210
> --- src/backend/access/transam/xlog.c 29 Jul 2005 03:22:33 -0000 1.211
> ***************
> *** 48,77 ****
>
>
> /*
> * This chunk of hackery attempts to determine which file sync methods
> * are available on the current platform, and to choose an appropriate
> * default method. We assume that fsync() is always available, and that
> * configure determined whether fdatasync() is.
> */
> #if defined(O_SYNC)
> ! #define OPEN_SYNC_FLAG O_SYNC
> #else
> #if defined(O_FSYNC)
> ! #define OPEN_SYNC_FLAG O_FSYNC
> #endif
> #endif
>
> #if defined(O_DSYNC)
> #if defined(OPEN_SYNC_FLAG)
> ! #if O_DSYNC != OPEN_SYNC_FLAG
> ! #define OPEN_DATASYNC_FLAG O_DSYNC
> #endif
> #else /* !defined(OPEN_SYNC_FLAG) */
> /* Win32 only has O_DSYNC */
> ! #define OPEN_DATASYNC_FLAG O_DSYNC
> #endif
> #endif
>
> #if defined(OPEN_DATASYNC_FLAG)
> #define DEFAULT_SYNC_METHOD_STR "open_datasync"
> #define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
> --- 48,117 ----
>
>
> /*
> + * Becauase O_DIRECT bypasses the kernel buffers, and because we never
> + * read those buffers except during crash recovery, it is a win to use
> + * it in all cases where we sync on each write(). We could allow O_DIRECT
> + * with fsync(), but because skipping the kernel buffer forces writes out
> + * quickly, it seems best just to use it for O_SYNC. It is hard to imagine
> + * how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
> + */
> + #ifdef O_DIRECT
> + #define PG_O_DIRECT O_DIRECT
> + #else
> + #define PG_O_DIRECT 0
> + #endif
> +
> + /*
> * This chunk of hackery attempts to determine which file sync methods
> * are available on the current platform, and to choose an appropriate
> * default method. We assume that fsync() is always available, and that
> * configure determined whether fdatasync() is.
> */
> #if defined(O_SYNC)
> ! #define CMP_OPEN_SYNC_FLAG O_SYNC
> #else
> #if defined(O_FSYNC)
> ! #define CMP_OPEN_SYNC_FLAG O_FSYNC
> #endif
> #endif
> + #define OPEN_SYNC_FLAG (CMP_OPEN_SYNC_FLAG | PG_O_DIRECT)
>
> #if defined(O_DSYNC)
> #if defined(OPEN_SYNC_FLAG)
> ! #if O_DSYNC != CMP_OPEN_SYNC_FLAG
> ! #define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
> #endif
> #else /* !defined(OPEN_SYNC_FLAG) */
> /* Win32 only has O_DSYNC */
> ! #define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
> #endif
> #endif
>
> + /*
> + * Limitation of buffer-alignment for direct io depend on OS and filesystem,
> + * but BLCKSZ is assumed to be enough for it.
> + */
> + #ifdef O_DIRECT
> + #define ALIGNOF_XLOG_BUFFER BLCKSZ
> + #else
> + #define ALIGNOF_XLOG_BUFFER MAXIMUM_ALIGNOF
> + #endif
> +
> + /*
> + * Switch the alignment routine because ShmemAlloc() returns a max-aligned
> + * buffer and ALIGNOF_XLOG_BUFFER may be greater than MAXIMUM_ALIGNOF.
> + */
> + #if ALIGNOF_XLOG_BUFFER <= MAXIMUM_ALIGNOF
> + #define XLOG_BUFFER_ALIGN(LEN) MAXALIGN((LEN))
> + #else
> + #define XLOG_BUFFER_ALIGN(LEN) ((LEN) + (ALIGNOF_XLOG_BUFFER))
> + #endif
> + /* assume sizeof(ptrdiff_t) == sizeof(void*) */
> + #define POINTERALIGN(ALIGNVAL,PTR) \
> + ((char *)(((ptrdiff_t) (PTR) + (ALIGNVAL-1)) & ~((ptrdiff_t) (ALIGNVAL-1))))
> + #define XLOG_BUFFER_POINTERALIGN(PTR) \
> + POINTERALIGN((ALIGNOF_XLOG_BUFFER), (PTR))
> +
> #if defined(OPEN_DATASYNC_FLAG)
> #define DEFAULT_SYNC_METHOD_STR "open_datasync"
> #define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
> ***************
> *** 469,474 ****
> --- 509,525 ----
> static char *str_time(time_t tnow);
> static void issue_xlog_fsync(void);
>
> + /* XLog gather-write staffs */
> + typedef struct XLogPages
> + {
> + char *head; /* Head of first page */
> + int size; /* Total bytes of pages == count(pages) * BLCKSZ */
> + int offset; /* Offset in xlog segment file */
> + } XLogPages;
> + static void XLogPageReset(XLogPages *pages);
> + static void XLogPageWrite(XLogPages *pages, int index);
> + static void XLogPageFlush(XLogPages *pages, int index);
> +
> #ifdef WAL_DEBUG
> static void xlog_outrec(char *buf, XLogRecord *record);
> #endif
> ***************
> *** 1245,1253 ****
> XLogWrite(XLogwrtRqst WriteRqst)
> {
> XLogCtlWrite *Write = &XLogCtl->Write;
> - char *from;
> bool ispartialpage;
> bool use_existent;
>
> /* We should always be inside a critical section here */
> Assert(CritSectionCount > 0);
> --- 1296,1305 ----
> XLogWrite(XLogwrtRqst WriteRqst)
> {
> XLogCtlWrite *Write = &XLogCtl->Write;
> bool ispartialpage;
> bool use_existent;
> + int currentIndex = Write->curridx;
> + XLogPages pages;
>
> /* We should always be inside a critical section here */
> Assert(CritSectionCount > 0);
> ***************
> *** 1258,1263 ****
> --- 1310,1317 ----
> */
> LogwrtResult = Write->LogwrtResult;
>
> + XLogPageReset(&pages);
> +
> while (XLByteLT(LogwrtResult.Write, WriteRqst.Write))
> {
> /*
> ***************
> *** 1266,1279 ****
> * end of the last page that's been initialized by
> * AdvanceXLInsertBuffer.
> */
> ! if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[Write->curridx]))
> elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
> LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
> ! XLogCtl->xlblocks[Write->curridx].xlogid,
> ! XLogCtl->xlblocks[Write->curridx].xrecoff);
>
> /* Advance LogwrtResult.Write to end of current buffer page */
> ! LogwrtResult.Write = XLogCtl->xlblocks[Write->curridx];
> ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
>
> if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
> --- 1320,1333 ----
> * end of the last page that's been initialized by
> * AdvanceXLInsertBuffer.
> */
> ! if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[currentIndex]))
> elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
> LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
> ! XLogCtl->xlblocks[currentIndex].xlogid,
> ! XLogCtl->xlblocks[currentIndex].xrecoff);
>
> /* Advance LogwrtResult.Write to end of current buffer page */
> ! LogwrtResult.Write = XLogCtl->xlblocks[currentIndex];
> ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
>
> if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
> ***************
> *** 1281,1286 ****
> --- 1335,1341 ----
> /*
> * Switch to new logfile segment.
> */
> + XLogPageFlush(&pages, currentIndex);
> if (openLogFile >= 0)
> {
> if (close(openLogFile))
> ***************
> *** 1354,1384 ****
> openLogOff = 0;
> }
>
> ! /* Need to seek in the file? */
> ! if (openLogOff != (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize)
> ! {
> ! openLogOff = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
> ! if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
> ! ereport(PANIC,
> ! (errcode_for_file_access(),
> ! errmsg("could not seek in log file %u, segment %u to offset %u: %m",
> ! openLogId, openLogSeg, openLogOff)));
> ! }
> !
> ! /* OK to write the page */
> ! from = XLogCtl->pages + Write->curridx * BLCKSZ;
> ! errno = 0;
> ! if (write(openLogFile, from, BLCKSZ) != BLCKSZ)
> ! {
> ! /* if write didn't set errno, assume problem is no disk space */
> ! if (errno == 0)
> ! errno = ENOSPC;
> ! ereport(PANIC,
> ! (errcode_for_file_access(),
> ! errmsg("could not write to log file %u, segment %u at offset %u: %m",
> ! openLogId, openLogSeg, openLogOff)));
> ! }
> ! openLogOff += BLCKSZ;
>
> /*
> * If we just wrote the whole last page of a logfile segment,
> --- 1409,1416 ----
> openLogOff = 0;
> }
>
> ! /* Add a page to buffer */
> ! XLogPageWrite(&pages, currentIndex);
>
> /*
> * If we just wrote the whole last page of a logfile segment,
> ***************
> *** 1390,1397 ****
> * This is also the right place to notify the Archiver that the
> * segment is ready to copy to archival storage.
> */
> ! if (openLogOff >= XLogSegSize && !ispartialpage)
> {
> issue_xlog_fsync();
> LogwrtResult.Flush = LogwrtResult.Write; /* end of current page */
>
> --- 1422,1430 ----
> * This is also the right place to notify the Archiver that the
> * segment is ready to copy to archival storage.
> */
> ! if (openLogOff + pages.size >= XLogSegSize && !ispartialpage)
> {
> + XLogPageFlush(&pages, currentIndex);
> issue_xlog_fsync();
> LogwrtResult.Flush = LogwrtResult.Write; /* end of current page */
>
> ***************
> *** 1405,1412 ****
> LogwrtResult.Write = WriteRqst.Write;
> break;
> }
> ! Write->curridx = NextBufIdx(Write->curridx);
> }
>
> /*
> * If asked to flush, do so
> --- 1438,1446 ----
> LogwrtResult.Write = WriteRqst.Write;
> break;
> }
> ! currentIndex = NextBufIdx(currentIndex);
> }
> + XLogPageFlush(&pages, currentIndex);
>
> /*
> * If asked to flush, do so
> ***************
> *** 3584,3590 ****
> if (XLOGbuffers < MinXLOGbuffers)
> XLOGbuffers = MinXLOGbuffers;
>
> ! return MAXALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
> + BLCKSZ * XLOGbuffers +
> MAXALIGN(sizeof(ControlFileData));
> }
> --- 3618,3624 ----
> if (XLOGbuffers < MinXLOGbuffers)
> XLOGbuffers = MinXLOGbuffers;
>
> ! return XLOG_BUFFER_ALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
> + BLCKSZ * XLOGbuffers +
> MAXALIGN(sizeof(ControlFileData));
> }
> ***************
> *** 3601,3607 ****
>
> XLogCtl = (XLogCtlData *)
> ShmemInitStruct("XLOG Ctl",
> ! MAXALIGN(sizeof(XLogCtlData) +
> sizeof(XLogRecPtr) * XLOGbuffers)
> + BLCKSZ * XLOGbuffers,
> &foundXLog);
> --- 3635,3641 ----
>
> XLogCtl = (XLogCtlData *)
> ShmemInitStruct("XLOG Ctl",
> ! XLOG_BUFFER_ALIGN(sizeof(XLogCtlData) +
> sizeof(XLogRecPtr) * XLOGbuffers)
> + BLCKSZ * XLOGbuffers,
> &foundXLog);
> ***************
> *** 3630,3638 ****
> * Here, on the other hand, we must MAXALIGN to ensure the page
> * buffers have worst-case alignment.
> */
> ! XLogCtl->pages =
> ! ((char *) XLogCtl) + MAXALIGN(sizeof(XLogCtlData) +
> ! sizeof(XLogRecPtr) * XLOGbuffers);
> memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);
>
> /*
> --- 3664,3672 ----
> * Here, on the other hand, we must MAXALIGN to ensure the page
> * buffers have worst-case alignment.
> */
> ! XLogCtl->pages = XLOG_BUFFER_POINTERALIGN(
> ! ((char *) XLogCtl)
> ! + sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers);
> memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);
>
> /*
> ***************
> *** 3690,3699 ****
> /* First timeline ID is always 1 */
> ThisTimeLineID = 1;
>
> ! /* Use malloc() to ensure buffer is MAXALIGNED */
> ! buffer = (char *) malloc(BLCKSZ);
> ! page = (XLogPageHeader) buffer;
> ! memset(buffer, 0, BLCKSZ);
>
> /* Set up information for the initial checkpoint record */
> checkPoint.redo.xlogid = 0;
> --- 3724,3732 ----
> /* First timeline ID is always 1 */
> ThisTimeLineID = 1;
>
> ! buffer = (char *) malloc(BLCKSZ + ALIGNOF_XLOG_BUFFER);
> ! page = (XLogPageHeader) XLOG_BUFFER_POINTERALIGN(buffer);
> ! memset(page, 0, BLCKSZ);
>
> /* Set up information for the initial checkpoint record */
> checkPoint.redo.xlogid = 0;
> ***************
> *** 3745,3751 ****
>
> /* Write the first page with the initial record */
> errno = 0;
> ! if (write(openLogFile, buffer, BLCKSZ) != BLCKSZ)
> {
> /* if write didn't set errno, assume problem is no disk space */
> if (errno == 0)
> --- 3778,3784 ----
>
> /* Write the first page with the initial record */
> errno = 0;
> ! if (write(openLogFile, page, BLCKSZ) != BLCKSZ)
> {
> /* if write didn't set errno, assume problem is no disk space */
> if (errno == 0)
> ***************
> *** 5837,5839 ****
> --- 5870,5940 ----
> errmsg("could not remove file \"%s\": %m",
> BACKUP_LABEL_FILE)));
> }
> +
> +
> + /* XLog gather-write staffs */
> +
> + static void
> + XLogPageReset(XLogPages *pages)
> + {
> + memset(pages, 0, sizeof(*pages));
> + }
> +
> + static void
> + XLogPageWrite(XLogPages *pages, int index)
> + {
> + char *page = XLogCtl->pages + index * BLCKSZ;
> + int size = BLCKSZ;
> + int offset = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
> +
> + if (pages->head + pages->size == page
> + && pages->offset + pages->size == offset)
> + { /* Pages are continuous. Append new page. */
> + pages->size += size;
> + }
> + else
> + { /* Pages are not continuous. Flush and clear. */
> + XLogPageFlush(pages, PrevBufIdx(index));
> + pages->head = page;
> + pages->size = size;
> + pages->offset = offset;
> + }
> + }
> +
> + static void
> + XLogPageFlush(XLogPages *pages, int index)
> + {
> + if (!pages->head)
> + { /* No needs to write pages. */
> + XLogCtl->Write.curridx = index;
> + return;
> + }
> +
> + /* Need to seek in the file? */
> + if (openLogOff != pages->offset)
> + {
> + openLogOff = pages->offset;
> + if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
> + ereport(PANIC,
> + (errcode_for_file_access(),
> + errmsg("could not seek in log file %u, segment %u to offset %u: %m",
> + openLogId, openLogSeg, openLogOff)));
> + }
> +
> + /* OK to write the page */
> + errno = 0;
> + if (write(openLogFile, pages->head, pages->size) != pages->size)
> + {
> + /* if write didn't set errno, assume problem is no disk space */
> + if (errno == 0)
> + errno = ENOSPC;
> + ereport(PANIC,
> + (errcode_for_file_access(),
> + errmsg("could not write to log file %u, segment %u at offset %u: %m",
> + openLogId, openLogSeg, openLogOff)));
> + }
> +
> + openLogOff += pages->size;
> + XLogCtl->Write.curridx = index;
> + XLogPageReset(pages);
> + }
Mary Edie Meredith wrote:
> Just to be certain I know what I have and how to use it,
> please confirm the following is correct.
>
> According to Markw, the tarball we have at OSDL that is
> dated 7/29 already has the O_DIRECT patch + wal grouping
> applied. Hence, I will use the one from the day before:
> postgresql-20050728.tar.bz2 and apply your patch.

Right, or use patch -R to reverse out the changes and revert to a
version with O_DIRECT.

>
> I expect to have, after applying the patch, the O_DIRECT patch
> for the log (which I should not be using given the config
> parameters I have), but I will _not have the wal grouping?

The patch adds O_DIRECT (which is not being used given your configuration
parameters), and grouped WAL writes.
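
To make "grouped WAL writes" concrete: the patch accumulates pages that are adjacent both in memory and in the segment file, and issues one write() for the whole run instead of one per page. Below is a minimal sketch of that bookkeeping, assuming an 8192-byte page. PageRun, page_run_add and page_run_flush are illustrative names rather than the patch's XLogPages functions, and a counter stands in for the actual lseek()/write() calls.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BLCKSZ 8192	/* assumed page size */

typedef struct PageRun
{
	const char *head;			/* first byte of the pending run, NULL if empty */
	size_t		size;			/* bytes pending in the run */
	size_t		offset;			/* file offset where the run starts */
	int			writes_issued;	/* counter standing in for real write() calls */
} PageRun;

/* Emit the pending run as a single write and clear it. */
static void
page_run_flush(PageRun *run)
{
	if (run->head == NULL)
		return;					/* nothing pending */

	/* A real implementation would lseek() if needed and write() here. */
	run->writes_issued++;
	run->head = NULL;
	run->size = 0;
	run->offset = 0;
}

/* Append a page if it continues the run; otherwise flush and start anew. */
static void
page_run_add(PageRun *run, const char *page, size_t file_offset)
{
	if (run->head != NULL &&
		run->head + run->size == page &&
		run->offset + run->size == file_offset)
	{
		run->size += SKETCH_BLCKSZ;
		return;
	}

	page_run_flush(run);
	run->head = page;
	run->size = SKETCH_BLCKSZ;
	run->offset = file_offset;
}
```

Three adjacent pages followed by a wrap-around to the start of the buffer would produce two writes in this sketch: one for the three-page run and one for the page after the wrap.
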
---------------------------------------------------------------------------
>
> Is that correct?
>
> On Fri, 2005-08-12 at 12:12 -0400, Bruce Momjian wrote:
> > Mark Wong wrote:
> > > On Thu, 11 Aug 2005 22:11:42 -0400 (EDT)
> > > Bruce Momjian <pgman@candle.pha.pa.us> wrote:
> > >
> > > > Tom Lane wrote:
> > > > > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > > > >> O_DIRECT is only being used for WAL page writes (or I sure hope so
> > > > > >> anyway), so shared_buffers should be irrelevant.
> > > > >
> > > > > > Uh, O_DIRECT really just enables when open_sync is used, and I assume
> > > > > > that is not used for writing dirty buffers during a checkpoint.
> > > > >
> > > > > I double-checked that O_DIRECT is really just used for WAL, and only
> > > > > when the sync mode is open_sync or open_datasync. So it seems
> > > > > impossible that it affected a run with mode fdatasync. What seems the
> > > > > best theory at the moment is that the grouped-WAL-write part of the
> > > > > patch doesn't work so well as we thought.
> > > >
> > > > Yes, that's my only guess. Let us know if you want the patch to test,
> > > > rather than pulling CVS before and after the patch was applied.
> > >
> > > Yeah, a patch would be a little easier. :)
> >
> > OK, patch attached. The code has been cleaned up a little since then but
> > this is the basic change that should be tested. It is based on CVS of
> > 2005/07/29 03:22:33 GMT.
> >
> > Plain text document attachment (/bjm/diff)
> > Index: src/backend/access/transam/xlog.c
> > ===================================================================
> > RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
> > retrieving revision 1.210
> > retrieving revision 1.211
> > diff -c -r1.210 -r1.211
> > *** src/backend/access/transam/xlog.c 23 Jul 2005 15:31:16 -0000 1.210
> > --- src/backend/access/transam/xlog.c 29 Jul 2005 03:22:33 -0000 1.211
> > ***************
> > *** 48,77 ****
> >
> >
> > /*
> > * This chunk of hackery attempts to determine which file sync methods
> > * are available on the current platform, and to choose an appropriate
> > * default method. We assume that fsync() is always available, and that
> > * configure determined whether fdatasync() is.
> > */
> > #if defined(O_SYNC)
> > ! #define OPEN_SYNC_FLAG O_SYNC
> > #else
> > #if defined(O_FSYNC)
> > ! #define OPEN_SYNC_FLAG O_FSYNC
> > #endif
> > #endif
> >
> > #if defined(O_DSYNC)
> > #if defined(OPEN_SYNC_FLAG)
> > ! #if O_DSYNC != OPEN_SYNC_FLAG
> > ! #define OPEN_DATASYNC_FLAG O_DSYNC
> > #endif
> > #else /* !defined(OPEN_SYNC_FLAG) */
> > /* Win32 only has O_DSYNC */
> > ! #define OPEN_DATASYNC_FLAG O_DSYNC
> > #endif
> > #endif
> >
> > #if defined(OPEN_DATASYNC_FLAG)
> > #define DEFAULT_SYNC_METHOD_STR "open_datasync"
> > #define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
> > --- 48,117 ----
> >
> >
> > /*
> > + * Because O_DIRECT bypasses the kernel buffers, and because we never
> > + * read those buffers except during crash recovery, it is a win to use
> > + * it in all cases where we sync on each write(). We could allow O_DIRECT
> > + * with fsync(), but because skipping the kernel buffer forces writes out
> > + * quickly, it seems best just to use it for O_SYNC. It is hard to imagine
> > + * how fsync() could be a win for O_DIRECT compared to O_SYNC and O_DIRECT.
> > + */
> > + #ifdef O_DIRECT
> > + #define PG_O_DIRECT O_DIRECT
> > + #else
> > + #define PG_O_DIRECT 0
> > + #endif
> > +
> > + /*
> > * This chunk of hackery attempts to determine which file sync methods
> > * are available on the current platform, and to choose an appropriate
> > * default method. We assume that fsync() is always available, and that
> > * configure determined whether fdatasync() is.
> > */
> > #if defined(O_SYNC)
> > ! #define CMP_OPEN_SYNC_FLAG O_SYNC
> > #else
> > #if defined(O_FSYNC)
> > ! #define CMP_OPEN_SYNC_FLAG O_FSYNC
> > #endif
> > #endif
> > + #define OPEN_SYNC_FLAG (CMP_OPEN_SYNC_FLAG | PG_O_DIRECT)
> >
> > #if defined(O_DSYNC)
> > #if defined(OPEN_SYNC_FLAG)
> > ! #if O_DSYNC != CMP_OPEN_SYNC_FLAG
> > ! #define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
> > #endif
> > #else /* !defined(OPEN_SYNC_FLAG) */
> > /* Win32 only has O_DSYNC */
> > ! #define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
> > #endif
> > #endif
> >
> > + /*
> > + * Buffer-alignment requirements for direct I/O depend on the OS and filesystem,
> > + * but BLCKSZ is assumed to be sufficient.
> > + */
> > + #ifdef O_DIRECT
> > + #define ALIGNOF_XLOG_BUFFER BLCKSZ
> > + #else
> > + #define ALIGNOF_XLOG_BUFFER MAXIMUM_ALIGNOF
> > + #endif
> > +
> > + /*
> > + * Switch the alignment routine because ShmemAlloc() returns a max-aligned
> > + * buffer and ALIGNOF_XLOG_BUFFER may be greater than MAXIMUM_ALIGNOF.
> > + */
> > + #if ALIGNOF_XLOG_BUFFER <= MAXIMUM_ALIGNOF
> > + #define XLOG_BUFFER_ALIGN(LEN) MAXALIGN((LEN))
> > + #else
> > + #define XLOG_BUFFER_ALIGN(LEN) ((LEN) + (ALIGNOF_XLOG_BUFFER))
> > + #endif
> > + /* assume sizeof(ptrdiff_t) == sizeof(void*) */
> > + #define POINTERALIGN(ALIGNVAL,PTR) \
> > + ((char *)(((ptrdiff_t) (PTR) + (ALIGNVAL-1)) & ~((ptrdiff_t) (ALIGNVAL-1))))
> > + #define XLOG_BUFFER_POINTERALIGN(PTR) \
> > + POINTERALIGN((ALIGNOF_XLOG_BUFFER), (PTR))
> > +
> > #if defined(OPEN_DATASYNC_FLAG)
> > #define DEFAULT_SYNC_METHOD_STR "open_datasync"
> > #define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
> > ***************
> > *** 469,474 ****
> > --- 509,525 ----
> > static char *str_time(time_t tnow);
> > static void issue_xlog_fsync(void);
> >
> > + /* XLog gather-write stuff */
> > + typedef struct XLogPages
> > + {
> > + char *head; /* Head of first page */
> > + int size; /* Total bytes of pages == count(pages) * BLCKSZ */
> > + int offset; /* Offset in xlog segment file */
> > + } XLogPages;
> > + static void XLogPageReset(XLogPages *pages);
> > + static void XLogPageWrite(XLogPages *pages, int index);
> > + static void XLogPageFlush(XLogPages *pages, int index);
> > +
> > #ifdef WAL_DEBUG
> > static void xlog_outrec(char *buf, XLogRecord *record);
> > #endif
> > ***************
> > *** 1245,1253 ****
> > XLogWrite(XLogwrtRqst WriteRqst)
> > {
> > XLogCtlWrite *Write = &XLogCtl->Write;
> > - char *from;
> > bool ispartialpage;
> > bool use_existent;
> >
> > /* We should always be inside a critical section here */
> > Assert(CritSectionCount > 0);
> > --- 1296,1305 ----
> > XLogWrite(XLogwrtRqst WriteRqst)
> > {
> > XLogCtlWrite *Write = &XLogCtl->Write;
> > bool ispartialpage;
> > bool use_existent;
> > + int currentIndex = Write->curridx;
> > + XLogPages pages;
> >
> > /* We should always be inside a critical section here */
> > Assert(CritSectionCount > 0);
> > ***************
> > *** 1258,1263 ****
> > --- 1310,1317 ----
> > */
> > LogwrtResult = Write->LogwrtResult;
> >
> > + XLogPageReset(&pages);
> > +
> > while (XLByteLT(LogwrtResult.Write, WriteRqst.Write))
> > {
> > /*
> > ***************
> > *** 1266,1279 ****
> > * end of the last page that's been initialized by
> > * AdvanceXLInsertBuffer.
> > */
> > ! if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[Write->curridx]))
> > elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
> > LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
> > ! XLogCtl->xlblocks[Write->curridx].xlogid,
> > ! XLogCtl->xlblocks[Write->curridx].xrecoff);
> >
> > /* Advance LogwrtResult.Write to end of current buffer page */
> > ! LogwrtResult.Write = XLogCtl->xlblocks[Write->curridx];
> > ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
> >
> > if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
> > --- 1320,1333 ----
> > * end of the last page that's been initialized by
> > * AdvanceXLInsertBuffer.
> > */
> > ! if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[currentIndex]))
> > elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
> > LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
> > ! XLogCtl->xlblocks[currentIndex].xlogid,
> > ! XLogCtl->xlblocks[currentIndex].xrecoff);
> >
> > /* Advance LogwrtResult.Write to end of current buffer page */
> > ! LogwrtResult.Write = XLogCtl->xlblocks[currentIndex];
> > ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);
> >
> > if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
> > ***************
> > *** 1281,1286 ****
> > --- 1335,1341 ----
> > /*
> > * Switch to new logfile segment.
> > */
> > + XLogPageFlush(&pages, currentIndex);
> > if (openLogFile >= 0)
> > {
> > if (close(openLogFile))
> > ***************
> > *** 1354,1384 ****
> > openLogOff = 0;
> > }
> >
> > ! /* Need to seek in the file? */
> > ! if (openLogOff != (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize)
> > ! {
> > ! openLogOff = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
> > ! if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
> > ! ereport(PANIC,
> > ! (errcode_for_file_access(),
> > ! errmsg("could not seek in log file %u, segment %u to offset %u: %m",
> > ! openLogId, openLogSeg, openLogOff)));
> > ! }
> > !
> > ! /* OK to write the page */
> > ! from = XLogCtl->pages + Write->curridx * BLCKSZ;
> > ! errno = 0;
> > ! if (write(openLogFile, from, BLCKSZ) != BLCKSZ)
> > ! {
> > ! /* if write didn't set errno, assume problem is no disk space */
> > ! if (errno == 0)
> > ! errno = ENOSPC;
> > ! ereport(PANIC,
> > ! (errcode_for_file_access(),
> > ! errmsg("could not write to log file %u, segment %u at offset %u: %m",
> > ! openLogId, openLogSeg, openLogOff)));
> > ! }
> > ! openLogOff += BLCKSZ;
> >
> > /*
> > * If we just wrote the whole last page of a logfile segment,
> > --- 1409,1416 ----
> > openLogOff = 0;
> > }
> >
> > ! /* Add a page to buffer */
> > ! XLogPageWrite(&pages, currentIndex);
> >
> > /*
> > * If we just wrote the whole last page of a logfile segment,
> > ***************
> > *** 1390,1397 ****
> > * This is also the right place to notify the Archiver that the
> > * segment is ready to copy to archival storage.
> > */
> > ! if (openLogOff >= XLogSegSize && !ispartialpage)
> > {
> > issue_xlog_fsync();
> > LogwrtResult.Flush = LogwrtResult.Write; /* end of current page */
> >
> > --- 1422,1430 ----
> > * This is also the right place to notify the Archiver that the
> > * segment is ready to copy to archival storage.
> > */
> > ! if (openLogOff + pages.size >= XLogSegSize && !ispartialpage)
> > {
> > + XLogPageFlush(&pages, currentIndex);
> > issue_xlog_fsync();
> > LogwrtResult.Flush = LogwrtResult.Write; /* end of current page */
> >
> > ***************
> > *** 1405,1412 ****
> > LogwrtResult.Write = WriteRqst.Write;
> > break;
> > }
> > ! Write->curridx = NextBufIdx(Write->curridx);
> > }
> >
> > /*
> > * If asked to flush, do so
> > --- 1438,1446 ----
> > LogwrtResult.Write = WriteRqst.Write;
> > break;
> > }
> > ! currentIndex = NextBufIdx(currentIndex);
> > }
> > + XLogPageFlush(&pages, currentIndex);
> >
> > /*
> > * If asked to flush, do so
> > ***************
> > *** 3584,3590 ****
> > if (XLOGbuffers < MinXLOGbuffers)
> > XLOGbuffers = MinXLOGbuffers;
> >
> > ! return MAXALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
> > + BLCKSZ * XLOGbuffers +
> > MAXALIGN(sizeof(ControlFileData));
> > }
> > --- 3618,3624 ----
> > if (XLOGbuffers < MinXLOGbuffers)
> > XLOGbuffers = MinXLOGbuffers;
> >
> > ! return XLOG_BUFFER_ALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
> > + BLCKSZ * XLOGbuffers +
> > MAXALIGN(sizeof(ControlFileData));
> > }
> > ***************
> > *** 3601,3607 ****
> >
> > XLogCtl = (XLogCtlData *)
> > ShmemInitStruct("XLOG Ctl",
> > ! MAXALIGN(sizeof(XLogCtlData) +
> > sizeof(XLogRecPtr) * XLOGbuffers)
> > + BLCKSZ * XLOGbuffers,
> > &foundXLog);
> > --- 3635,3641 ----
> >
> > XLogCtl = (XLogCtlData *)
> > ShmemInitStruct("XLOG Ctl",
> > ! XLOG_BUFFER_ALIGN(sizeof(XLogCtlData) +
> > sizeof(XLogRecPtr) * XLOGbuffers)
> > + BLCKSZ * XLOGbuffers,
> > &foundXLog);
> > ***************
> > *** 3630,3638 ****
> > * Here, on the other hand, we must MAXALIGN to ensure the page
> > * buffers have worst-case alignment.
> > */
> > ! XLogCtl->pages =
> > ! ((char *) XLogCtl) + MAXALIGN(sizeof(XLogCtlData) +
> > ! sizeof(XLogRecPtr) * XLOGbuffers);
> > memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);
> >
> > /*
> > --- 3664,3672 ----
> > * Here, on the other hand, we must MAXALIGN to ensure the page
> > * buffers have worst-case alignment.
> > */
> > ! XLogCtl->pages = XLOG_BUFFER_POINTERALIGN(
> > ! ((char *) XLogCtl)
> > ! + sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers);
> > memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);
> >
> > /*
> > ***************
> > *** 3690,3699 ****
> > /* First timeline ID is always 1 */
> > ThisTimeLineID = 1;
> >
> > ! /* Use malloc() to ensure buffer is MAXALIGNED */
> > ! buffer = (char *) malloc(BLCKSZ);
> > ! page = (XLogPageHeader) buffer;
> > ! memset(buffer, 0, BLCKSZ);
> >
> > /* Set up information for the initial checkpoint record */
> > checkPoint.redo.xlogid = 0;
> > --- 3724,3732 ----
> > /* First timeline ID is always 1 */
> > ThisTimeLineID = 1;
> >
> > ! buffer = (char *) malloc(BLCKSZ + ALIGNOF_XLOG_BUFFER);
> > ! page = (XLogPageHeader) XLOG_BUFFER_POINTERALIGN(buffer);
> > ! memset(page, 0, BLCKSZ);
> >
> > /* Set up information for the initial checkpoint record */
> > checkPoint.redo.xlogid = 0;
> > ***************
> > *** 3745,3751 ****
> >
> > /* Write the first page with the initial record */
> > errno = 0;
> > ! if (write(openLogFile, buffer, BLCKSZ) != BLCKSZ)
> > {
> > /* if write didn't set errno, assume problem is no disk space */
> > if (errno == 0)
> > --- 3778,3784 ----
> >
> > /* Write the first page with the initial record */
> > errno = 0;
> > ! if (write(openLogFile, page, BLCKSZ) != BLCKSZ)
> > {
> > /* if write didn't set errno, assume problem is no disk space */
> > if (errno == 0)
> > ***************
> > *** 5837,5839 ****
> > --- 5870,5940 ----
> > errmsg("could not remove file \"%s\": %m",
> > BACKUP_LABEL_FILE)));
> > }
> > +
> > +
> > + /* XLog gather-write stuff */
> > +
> > + static void
> > + XLogPageReset(XLogPages *pages)
> > + {
> > + memset(pages, 0, sizeof(*pages));
> > + }
> > +
> > + static void
> > + XLogPageWrite(XLogPages *pages, int index)
> > + {
> > + char *page = XLogCtl->pages + index * BLCKSZ;
> > + int size = BLCKSZ;
> > + int offset = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
> > +
> > + if (pages->head + pages->size == page
> > + && pages->offset + pages->size == offset)
> > + { /* Pages are continuous. Append new page. */
> > + pages->size += size;
> > + }
> > + else
> > + { /* Pages are not continuous. Flush and clear. */
> > + XLogPageFlush(pages, PrevBufIdx(index));
> > + pages->head = page;
> > + pages->size = size;
> > + pages->offset = offset;
> > + }
> > + }
> > +
> > + static void
> > + XLogPageFlush(XLogPages *pages, int index)
> > + {
> > + if (!pages->head)
> > + { /* No need to write pages. */
> > + XLogCtl->Write.curridx = index;
> > + return;
> > + }
> > +
> > + /* Need to seek in the file? */
> > + if (openLogOff != pages->offset)
> > + {
> > + openLogOff = pages->offset;
> > + if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
> > + ereport(PANIC,
> > + (errcode_for_file_access(),
> > + errmsg("could not seek in log file %u, segment %u to offset %u: %m",
> > + openLogId, openLogSeg, openLogOff)));
> > + }
> > +
> > + /* OK to write the page */
> > + errno = 0;
> > + if (write(openLogFile, pages->head, pages->size) != pages->size)
> > + {
> > + /* if write didn't set errno, assume problem is no disk space */
> > + if (errno == 0)
> > + errno = ENOSPC;
> > + ereport(PANIC,
> > + (errcode_for_file_access(),
> > + errmsg("could not write to log file %u, segment %u at offset %u: %m",
> > + openLogId, openLogSeg, openLogOff)));
> > + }
> > +
> > + openLogOff += pages->size;
> > + XLogCtl->Write.curridx = index;
> > + XLogPageReset(pages);
> > + }
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match
>
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
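The grouped-write idea in the patch quoted above can be sketched in isolation: pages that are adjacent both in the buffer ring and in the segment file are accumulated into one run, and the run is written with a single call. This is a simplified model, not the patch itself. All `demo_*` names, the four-page ring, and the write counter (used in place of a real write()) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 8192     /* stand-in for BLCKSZ */
#define DEMO_NBUFFERS  4        /* hypothetical ring size */

static char demo_ring[DEMO_NBUFFERS * DEMO_PAGE_SIZE];

/* A pending run of pages adjacent both in memory and in the file. */
typedef struct DemoRun
{
    const char *head;    /* first byte of the run, NULL when empty */
    size_t      size;    /* bytes accumulated so far */
    size_t      offset;  /* file offset where the run starts */
    int         writes;  /* number of write calls issued so far */
} DemoRun;

/* Issue one write for the whole run; this sketch only counts it. */
static void
demo_flush(DemoRun *run)
{
    if (run->head == NULL)
        return;                 /* nothing pending */
    run->writes++;
    run->head = NULL;
    run->size = 0;
    run->offset = 0;
}

/* Add a page: extend the run if contiguous, otherwise flush and restart. */
static void
demo_add_page(DemoRun *run, const char *page, size_t file_offset)
{
    if (run->head != NULL &&
        run->head + run->size == page &&
        run->offset + run->size == file_offset)
    {
        run->size += DEMO_PAGE_SIZE;
        return;
    }
    demo_flush(run);
    run->head = page;
    run->size = DEMO_PAGE_SIZE;
    run->offset = file_offset;
}

/* Write `count` pages starting at ring slot `first`; return the number
 * of write calls that were needed. */
static int
demo_write_pages(int first, int count)
{
    DemoRun run = {NULL, 0, 0, 0};
    int     i;

    for (i = 0; i < count; i++)
    {
        int idx = (first + i) % DEMO_NBUFFERS;

        demo_add_page(&run, demo_ring + (size_t) idx * DEMO_PAGE_SIZE,
                      (size_t) i * DEMO_PAGE_SIZE);
    }
    demo_flush(&run);
    return run.writes;
}
```

With this model, three pages starting at slot 0 go out in one call, while three pages starting at slot 2 need two calls because the ring wraps after slot 3. Whether fewer, larger writes actually help depends on the workload and sync method, which is the question the benchmark runs in this thread are meant to answer.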
On Fri, 2005-08-12 at 14:53 -0400, Bruce Momjian wrote:
> Mary Edie Meredith wrote:
> > Just to be certain I know what I have and how to use it,
> > please confirm the following is correct.
> >
> > According to Markw, the tarball we have at OSDL that is
> > dated 7/29 already has the O_DIRECT patch + wal grouping
> > applied. Hence, I will use the one from the day before:
> > postgresql-20050728.tar.bz2 and apply your patch.
>
> Right, or use patch -R to reverse out the changes and revert to a
> version with O_DIRECT.
> >
> > I expect to have, after applying the patch, the O_DIRECT patch
> > for the log (which I should not be using given the config
> > parameters I have), but I will _not have the wal grouping?
>
> The patch adds O_DIRECT (which is not being used given your configuration
> parameters), and grouped WAL writes.
So I'll use postgresql-20050728, I'll run with it to confirm it is
running similarly to my good run (42). If so, then I'll apply the patch
and see what happens....
This may take me a while just because I'm backed up with other
things ....
Mary Edie Meredith wrote:
> On Fri, 2005-08-12 at 14:53 -0400, Bruce Momjian wrote:
> > Mary Edie Meredith wrote:
> > > Just to be certain I know what I have and how to use it,
> > > please confirm the following is correct.
> > >
> > > According to Markw, the tarball we have at OSDL that is
> > > dated 7/29 already has the O_DIRECT patch + wal grouping
> > > applied. Hence, I will use the one from the day before:
> > > postgresql-20050728.tar.bz2 and apply your patch.
> >
> > Right, or use patch -R to reverse out the changes and revert to a
> > version with O_DIRECT.
> > >
> > > I expect to have, after applying the patch, the O_DIRECT patch
> > > for the log (which I should not be using given the config
> > > parameters I have), but I will _not have the wal grouping?
> >
> > The patch adds O_DIRECT (which is not being used given your configuration
> > parameters), and grouped WAL writes.
> So I'll use postgresql-20050728, I'll run with it to confirm it is
> running similarly to my good run (42). If so, then I'll apply the patch
> and see what happens....
>
> This may take me a while just because I'm backed up with other
> things ....

Right.

--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
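For readers following the O_DIRECT half of the patch discussed in this thread: direct I/O generally requires the write buffer to be aligned, and the patch handles this by over-allocating and rounding the pointer up to a BLCKSZ boundary. The sketch below shows that technique on its own. The `demo_*` helpers are hypothetical, and `uintptr_t` is used here in place of the `ptrdiff_t` cast that appears in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Round ptr up to the next multiple of align (which must be a power of two). */
static char *
demo_pointer_align(void *ptr, size_t align)
{
    uintptr_t p = (uintptr_t) ptr;

    assert(align != 0 && (align & (align - 1)) == 0);
    return (char *) ((p + (align - 1)) & ~((uintptr_t) align - 1));
}

/* Allocate `size` usable bytes aligned to `align`.  The extra `align`
 * bytes leave room to round up; *raw receives the pointer to free(). */
static char *
demo_alloc_aligned(size_t size, size_t align, void **raw)
{
    *raw = malloc(size + align);
    if (*raw == NULL)
        return NULL;
    return demo_pointer_align(*raw, align);
}
```

Note that the caller must pass the raw pointer, not the aligned one, to free(). The cost of this approach is up to one extra alignment unit of memory per allocation.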