Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?
From | Bruce Momjian |
---|---|
Subject | Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables? |
Date | |
Msg-id | 199803121410.JAA18969@candle.pha.pa.us |
In reply to | Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables? (dg@illustra.com (David Gould)) |
List | pgsql-hackers |
Here is an archive of the pg_log discussion.

---------------------------------------------------------------------------

From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711170542.AAA24561@candle.pha.pa.us>
Subject: [HACKERS] Bufferd loggins/pg_log
To: hackers@postgreSQL.org (PostgreSQL-development)
Date: Mon, 17 Nov 1997 00:42:18 -0500 (EST)
Cc: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

Here is my current idea for doing buffered logging, which sits between the normal fsync-on-every-transaction and no-fsync options. I believe it will be very popular, because it mimics the Unix file system reliability structure.

---------------------------------------------------------------------------

On startup, the postmaster makes a copy of pg_log, called pg_log_live. Each postgres backend mmaps() this new file into its address space. A lock is gotten to make changes to the file. All backends use pg_log_live rather than pg_log. Only the postmaster writes to pg_log. (I will someday remove the exec() from postmaster, so backends will get this address space automatically.)

The first 512 bytes of pg_log and pg_log_live are used for log management information. We add a new field to pg_log_live called min_xid_commit which records the lowest transaction id that any backend has committed since the start of the last sync pass of the postmaster. We also add fields to record current pg_variable oid and xid at the same time. (xid may have to be moved into pg_variable so backends can fsync it (see below).)

Every 60-90 seconds, the postmaster gets a write lock on pg_log, takes the minimum xid set in the start of pg_log, and resets its value. It records the current oid and xid from pg_variable.
It then clears the lock, and starts reading from the minimum recorded xid changed to the end of pg_log_live, and copies it into allocated memory. It then does a sync (twice?), waits for completion, and then writes the pg_log_live partial copy it made to pg_log. We update the copies of oid and xid we saved before the sync to the bottom of pg_log_live. We can change the 60-90 seconds to be longer, but the system does it every 30 seconds anyway.

When the postmaster stops, it does this same operation before shutting down, and pg_log_live is removed.

We make a copy of the current xid and oid in the front of pg_log_live, so that if the postmaster starts up, and pg_log_live exists, the postmaster adds 10,000 to xid and oid of pg_variable, so no previously used but unsynced values are used. We know that the current values of pg_variable could not have been exceeded by 10,000, because each backend consults the pg_log copies of these variables to make sure they do not exceed 10,000 from the value before the last sync. They exceed those values only by fsync'ing every 10,000 increments.

Said another way, if a postgres backend exceeds the pg_log last xid or oid of pg_log, or any 10,000 multiple, it must fsync the change to pg_variable. This way, a crash skips over any unsynced oid/xid's used, and this is done without having to keep fsyncing pg_variable. In most cases, the 10,000 will never be exceeded by a backend before the postmaster does a sync and increases the last xid/oid again.

I think this is a very clean way to give us no-fsync performance with full-rollback buffered logging. The specification is clean and almost complete enough for coding.

I think this gives us what we need, by having a mmap'ed() pg_log_live, which backends can use, and a postmaster-controlled pg_log, which is used on startup, with xid/oid controls in a crash situation to skip over partially committed transactions.

Comments?

-- 
Bruce Momjian
maillist@candle.pha.pa.us

---------------------------------------------------------------------------

Sender: root@www.krasnet.ru
Message-ID: <346FF895.167EB0E7@sable.krasnoyarsk.su>
Date: Mon, 17 Nov 1997 14:56:05 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
MIME-Version: 1.0
To: Bruce Momjian <maillist@candle.pha.pa.us>
CC: PostgreSQL-development <hackers@postgreSQL.org>, "Vadim B. Mikheev" <vadim@post.krasnet.ru>
Subject: Re: Bufferd loggins/pg_log
References: <199711170542.AAA24561@candle.pha.pa.us>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: OR

Bruce Momjian wrote:
>
> On startup, the postmaster makes a copy of pg_log, called pg_log_live.
> Each postgres backend mmaps() this new file into its address space. A
> lock is gotten to make changes to the file. All backends use pg_log_live
> rather than pg_log. Only the postmaster writes to pg_log. (I will
> someday remove the exec() from postmaster, so backends will get this
> address space automatically.)

What are advantages of mmaping entire pg_log over "online" pg_log
pages ?
pg_log may be very big (tens of Mb) - why we have to spend
process address space for tens of Mb of mostly unused data ?
Also, do all systems have mmap ?

>
> Every 60-90 seconds, the postmaster gets a write lock on pg_log, takes
> the minimum xid set in the start of pg_log, and resets its value. It
> records the current oid and xid from pg_variable. It then clears the
> lock, and starts reading from the minimum recorded xid changed to the
> end of pg_log_live, and copies it into allocated memory. It then does a
> sync (twice?), waits for completion, and then writes the pg_log_live
        ^^^^^
man sync:

     The sync() function forces a write of dirty (modified) buffers in the
                         ^^^^^^
     block buffer cache out to disk...
...

BUGS
     Sync() may return before the buffers are completely flushed.

Vadim

---------------------------------------------------------------------------

From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711171346.IAA01964@candle.pha.pa.us>
Subject: [HACKERS] Re: Bufferd loggins/pg_log
To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
Date: Mon, 17 Nov 1997 08:46:29 -0500 (EST)
Cc: hackers@postgreSQL.org (PostgreSQL-development)
In-Reply-To: <346FF895.167EB0E7@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 17, 97 02:56:05 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

>
> Bruce Momjian wrote:
> >
> > On startup, the postmaster makes a copy of pg_log, called pg_log_live.
> > Each postgres backend mmaps() this new file into its address space. A
> > lock is gotten to make changes to the file. All backends use pg_log_live
> > rather than pg_log. Only the postmaster writes to pg_log. (I will
> > someday remove the exec() from postmaster, so backends will get this
> > address space automatically.)
>
> What are advantages of mmaping entire pg_log over "online" pg_log
> pages ?
> pg_log may be very big (tens of Mb) - why we have to spend
> process address space for tens of Mb of mostly unused data ?
> Also, do all systems have mmap ?

I believe you are correct that it would be better keeping the last few pages of pg_log in shared memory rather than using mmap().

I think the important new ideas are keeping track of the oid/xid before sync so we can accurately add 10,000 after a crash.

I am a little foggy on race conditions of growing the pg_log region while other backends are running, and modifying non-shared memory pages, but you seem to have a handle on it.

We don't need pg_log_live if only the postmaster writes those last two pages to pg_log, and if we keep track of a crash status somewhere else, perhaps at the start of pg_log.

>
> >
> > Every 60-90 seconds, the postmaster gets a write lock on pg_log, takes
> > the minimum xid set in the start of pg_log, and resets its value. It
> > records the current oid and xid from pg_variable. It then clears the
> > lock, and starts reading from the minimum recorded xid changed to the
> > end of pg_log_live, and copies it into allocated memory. It then does a
> > sync (twice?), waits for completion, and then writes the pg_log_live
>         ^^^^^
> man sync:
>
>      The sync() function forces a write of dirty (modified) buffers in the
>                          ^^^^^^
>      block buffer cache out to disk...
> ...
>
> BUGS
>      Sync() may return before the buffers are completely flushed.
>
> Vadim
>

My BSD/OS doesn't mention this, but twice is a good idea.

-- 
Bruce Momjian
maillist@candle.pha.pa.us

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
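
Editorial sketch, not part of the original thread: the "add 10,000 after a crash" rule discussed above can be illustrated with a small in-memory model. Everything here is an assumption made for illustration: the names `DurableCounter`, `Backend`, `recover_after_crash`, and `SAFETY_MARGIN` are hypothetical, the "disk" is just a Python object, and the postmaster's periodic sync pass (which would also advance the durable value) is left out. It is not PostgreSQL code.

```python
from dataclasses import dataclass

# The margin named in the thread; the constant name is hypothetical.
SAFETY_MARGIN = 10_000


@dataclass
class DurableCounter:
    """Stands in for the xid (or oid) value of pg_variable that is on disk."""
    synced_value: int      # last value known to have been fsync'ed
    fsync_count: int = 0   # how many forced writes this model has performed

    def fsync(self, value: int) -> None:
        # In a real system this would be a write followed by fsync().
        self.synced_value = value
        self.fsync_count += 1


class Backend:
    """Hands out ids in memory and forces a write only every SAFETY_MARGIN."""

    def __init__(self, disk: DurableCounter, start: int) -> None:
        self.disk = disk
        self.current = start

    def next_id(self) -> int:
        self.current += 1
        # Before an id that is SAFETY_MARGIN past the durable value is
        # handed out, make the new value durable.
        if self.current - self.disk.synced_value >= SAFETY_MARGIN:
            self.disk.fsync(self.current)
        return self.current


def recover_after_crash(disk: DurableCounter) -> int:
    """Restart point after a crash: skip past every id that may have been used."""
    return disk.synced_value + SAFETY_MARGIN


if __name__ == "__main__":
    disk = DurableCounter(synced_value=500)
    backend = Backend(disk, start=500)
    for _ in range(25_000):
        backend.next_id()
    print("last id handed out:", backend.current)
    print("forced writes:", disk.fsync_count)
    print("restart point after a crash:", recover_after_crash(disk))
```

In this model a crash at any moment loses at most the ids handed out since the last forced write, and the restart point is always above every id that was handed out, which is the property the thread relies on to avoid an fsync per transaction.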