Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?

From: Bruce Momjian
Subject: Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?
Msg-id 199803121410.JAA18969@candle.pha.pa.us
In reply to: Re: [HACKERS] Re: [QUESTIONS] Does Storage Manager support >2GB tables?  (dg@illustra.com (David Gould))
List: pgsql-hackers
Here is an archive of the pg_log discussion.

---------------------------------------------------------------------------

From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711170542.AAA24561@candle.pha.pa.us>
Subject: [HACKERS] Bufferd loggins/pg_log
To: hackers@postgreSQL.org (PostgreSQL-development)
Date: Mon, 17 Nov 1997 00:42:18 -0500 (EST)
Cc: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

Here is my current idea for buffered logging, which sits between the
normal fsync-on-every-transaction and no-fsync options.  I believe
it will be very popular, because it mimics the Unix file system
reliability structure.

---------------------------------------------------------------------------

On startup, the postmaster makes a copy of pg_log, called pg_log_live.
Each postgres backend mmaps() this new file into its address space.  A
lock is gotten to make changes to the file.  All backends use pg_log_live
rather than pg_log.  Only the postmaster writes to pg_log.  (I will
someday remove the exec() from postmaster, so backends will get this
address space automatically.)
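
The startup step can be sketched as a small Python simulation (this is
not PostgreSQL code).  Only the file names pg_log and pg_log_live come
from the proposal; the helper names, the 1024-byte file size, and the
offset written to are assumptions made for illustration.

```python
import mmap
import os
import shutil
import tempfile


def open_live_copy(pg_log_path, live_path):
    """Copy pg_log to pg_log_live and map the copy read/write.

    Returns (file object, mmap).  The source file must be non-empty,
    because a zero-length file cannot be mapped.
    """
    shutil.copyfile(pg_log_path, live_path)
    f = open(live_path, "r+b")
    # Length 0 maps the whole file; writes through the mapping go to
    # pg_log_live only, never to pg_log.
    m = mmap.mmap(f.fileno(), 0)
    return f, m


def demo():
    """Write one byte through the mapping; return (pg_log byte, live byte)."""
    with tempfile.TemporaryDirectory() as d:
        pg_log = os.path.join(d, "pg_log")
        live = os.path.join(d, "pg_log_live")
        with open(pg_log, "wb") as f:
            f.write(bytes(1024))          # hypothetical all-zero log
        f, m = open_live_copy(pg_log, live)
        m[512:513] = b"\x01"              # a "backend" changes the live copy
        m.flush()
        m.close()
        f.close()
        with open(pg_log, "rb") as a, open(live, "rb") as b:
            return a.read()[512], b.read()[512]
```

After demo() runs, the byte in pg_log is still 0 while the byte in
pg_log_live is 1, which is the separation the proposal relies on.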

The first 512 bytes of pg_log and pg_log_live are used for log management
information.  We add a new field to pg_log_live called min_xid_commit,
which records the lowest transaction id that any backend has committed
since the start of the postmaster's last sync pass.  We also add
fields to record the current pg_variable oid and xid at the same time.  (xid
may have to be moved into pg_variable so backends can fsync it (see
below).)
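
One possible shape for that 512-byte management block, as a Python
sketch.  The field order, the 32-bit little-endian encoding, and the
names saved_oid and saved_xid are assumptions; the proposal only names
min_xid_commit and says the oid and xid are recorded alongside it.

```python
import struct

HEADER_SIZE = 512
# Hypothetical layout: three unsigned 32-bit integers, little-endian.
HEADER_FMT = "<III"   # min_xid_commit, saved_oid, saved_xid


def pack_header(min_xid_commit, saved_oid, saved_xid):
    """Return a 512-byte block with the three fields at the front."""
    body = struct.pack(HEADER_FMT, min_xid_commit, saved_oid, saved_xid)
    return body.ljust(HEADER_SIZE, b"\0")   # zero-pad the unused remainder


def unpack_header(block):
    """Return (min_xid_commit, saved_oid, saved_xid) from a header block."""
    return struct.unpack_from(HEADER_FMT, block, 0)
```

For example, unpack_header(pack_header(100, 20000, 150)) gives back the
same three values, and the block is always exactly 512 bytes long.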

Every 60-90 seconds, the postmaster gets a write lock on pg_log, takes
the minimum xid set in the start of pg_log, and resets its value.  It
records the current oid and xid from pg_variable.  It then clears the
lock, and starts reading from the minimum recorded xid changed to the
end of pg_log_live, and copies it into allocated memory.  It then does a
sync (twice?), waits for completion, and then writes the pg_log_live
partial copy it made to pg_log.  We update the copies of oid and xid we
saved before the sync to the bottom of pg_log_live.
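
The sync pass above can be sketched as an in-memory Python simulation.
The LiveLog class, the dict standing in for the log body, and the sync
callback are all hypothetical.  One deliberate simplification: the
proposal releases the lock before copying, while this sketch copies
under the lock to keep the example short and safe.

```python
import threading


class LiveLog:
    """Stand-in for pg_log_live: statuses plus the management fields."""

    def __init__(self):
        self.lock = threading.Lock()
        self.status = {}            # xid -> status string
        self.min_xid_commit = None  # lowest xid committed since last pass
        self.next_oid = 1
        self.next_xid = 1
        self.saved = (1, 1)         # (oid, xid) as of the last sync pass

    def commit(self, xid):
        with self.lock:
            self.status[xid] = "committed"
            if self.min_xid_commit is None or xid < self.min_xid_commit:
                self.min_xid_commit = xid


def sync_pass(live, durable_log, sync=lambda: None):
    """Copy entries changed since the last pass into durable_log."""
    with live.lock:
        start = live.min_xid_commit
        live.min_xid_commit = None                   # reset for the next pass
        snapshot = (live.next_oid, live.next_xid)    # counters before the sync
        if start is None:
            partial = {}
        else:
            partial = {x: s for x, s in live.status.items() if x >= start}
    sync()                        # flush data pages before the log entries
    durable_log.update(partial)   # only now do the commits reach pg_log
    live.saved = snapshot
    return partial
```

The ordering is the point of the sketch: data is synced first, and the
commit records are written to the durable log only afterwards.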

We can make the 60-90 second interval longer, but the operating system
does its own sync every 30 seconds anyway.

When the postmaster stops, it does this same operation before shutting
down, and pg_log_live is removed.

We keep a copy of the current xid and oid in the front of pg_log_live.
If the postmaster starts up and pg_log_live still exists, meaning the
previous shutdown was not clean, the postmaster adds 10,000 to the xid
and oid of pg_variable, so no previously used but unsynced values are
reused.
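
The startup rule can be written as a tiny Python function.  The function
name and argument names are hypothetical; the 10,000 skip and the "does
pg_log_live exist" test come from the proposal.

```python
import os

SKIP = 10_000   # safety margin from the proposal


def startup_counters(saved_xid, saved_oid, live_path):
    """Return the (xid, oid) pair the postmaster should resume from.

    A leftover pg_log_live means the last shutdown was not clean, so
    both counters jump ahead by SKIP to avoid reusing unsynced values.
    """
    if os.path.exists(live_path):
        return saved_xid + SKIP, saved_oid + SKIP
    return saved_xid, saved_oid
```

With saved values of 500 and 900, a clean start resumes at (500, 900)
and a start after a crash resumes at (10500, 10900).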

We know that the values actually in use cannot be more than 10,000 past
the values saved in pg_variable, because each backend consults the
pg_log copies of these variables to make sure it does not go more than
10,000 beyond the value before the last sync.  A backend may go past
that limit only by fsync'ing every 10,000 increments.

Said another way, if a postgres backend goes past the last xid or oid
recorded in pg_log, or past any 10,000 multiple beyond it, it must fsync
the change to pg_variable.  This way, a crash skips over any unsynced
oid/xid's used, and this is done without having to keep fsyncing
pg_variable.  In most cases, the 10,000 will never be exceeded by a
backend before the postmaster does a sync and increases the last xid/oid
again.
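
A Python sketch of that rule for the xid counter, under one reading of
the text: the backend fsyncs pg_variable each time the counter reaches a
multiple of 10,000 past the last synced value, before handing that value
out.  The class name, method names, and the fsync_variable callback are
hypothetical.

```python
SKIP = 10_000


class XidAllocator:
    """Hands out xids, forcing pg_variable to disk every SKIP values."""

    def __init__(self, last_synced, fsync_variable):
        self.last_synced = last_synced        # value at the last sync pass
        self.next_xid = last_synced
        self.fsync_variable = fsync_variable  # called with the value to persist

    def allocate(self):
        self.next_xid += 1
        # Persist the counter before any value at or past a SKIP boundary
        # is used, so on-disk state never lags by SKIP or more.
        if (self.next_xid - self.last_synced) % SKIP == 0:
            self.fsync_variable(self.next_xid)
        return self.next_xid

    def postmaster_synced(self):
        """The postmaster finished a sync pass; move the baseline up."""
        self.last_synced = self.next_xid
```

Under this reading every value handed out is less than the last
persisted value plus 10,000, so adding 10,000 after a crash always lands
beyond anything that was used.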

I think this is a very clean way to give us no-fsync performance with
full-rollback buffered logging.  The specification is clean and almost
complete enough for coding.

I think this gives us what we need, by having a mmap'ed() pg_log_live,
which backends can use, and a postmaster-controlled pg_log, which is
used on startup, with xid/oid controls in a crash situation to skip over
partially committed transactions.

Comments?

--
Bruce Momjian
maillist@candle.pha.pa.us


---------------------------------------------------------------------------

Sender: root@www.krasnet.ru
Message-ID: <346FF895.167EB0E7@sable.krasnoyarsk.su>
Date: Mon, 17 Nov 1997 14:56:05 +0700
From: "Vadim B. Mikheev" <vadim@sable.krasnoyarsk.su>
Organization: ITTS (Krasnoyarsk)
X-Mailer: Mozilla 3.01 (X11; I; FreeBSD 2.2.5-RELEASE i386)
MIME-Version: 1.0
To: Bruce Momjian <maillist@candle.pha.pa.us>
CC: PostgreSQL-development <hackers@postgreSQL.org>,
        "Vadim B. Mikheev" <vadim@post.krasnet.ru>
Subject: Re: Bufferd loggins/pg_log
References: <199711170542.AAA24561@candle.pha.pa.us>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Status: OR

Bruce Momjian wrote:
>
> On startup, the postmaster makes a copy of pg_log, called pg_log_live.
> Each postgres backend mmaps() this new file into its address space.  A
> lock is gotten to make changes to the file.  All backends use pg_log_live
> rather than pg_log.  Only the postmaster writes to pg_log.  (I will
> someday remove the exec() from postmaster, so backends will get this
> address space automatically.)

What are the advantages of mmap'ing the entire pg_log over keeping
"online" pg_log pages?
pg_log may be very big (tens of MB), so why should we spend
process address space on tens of MB of mostly unused data?
Also, do all systems have mmap?

>
> Every 60-90 seconds, the postmaster gets a write lock on pg_log, takes
> the minimum xid set in the start of pg_log, and resets its value.  It
> records the current oid and xid from pg_variable.  It then clears the
> lock, and starts reading from the minimum recorded xid changed to the
> end of pg_log_live, and copies it into allocated memory.  It then does a
> sync (twice?), waits for completion, and then writes the pg_log_live
        ^^^^^
man sync:

     The sync() function forces a write of dirty (modified) buffers in the
                         ^^^^^^
     block buffer cache out to disk...
...

BUGS
     Sync() may return before the buffers are completely flushed.
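
For contrast, POSIX specifies that fsync() on a single file descriptor
does not return until the data has been transferred or an error is
detected.  A minimal Python sketch of a write that waits for its own
file (durable_write is a hypothetical helper name, not part of any
PostgreSQL interface):

```python
import os


def durable_write(path, data):
    """Write data to path and block until the OS reports it flushed."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while view:                 # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                # flush this one file, unlike sync()
    finally:
        os.close(fd)
```

This only covers the one file; it says nothing about other dirty
buffers in the system, which is what the sync() call in the proposal is
meant to handle.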

Vadim

---------------------------------------------------------------------------

From: Bruce Momjian <maillist@candle.pha.pa.us>
Message-Id: <199711171346.IAA01964@candle.pha.pa.us>
Subject: [HACKERS] Re: Bufferd loggins/pg_log
To: vadim@sable.krasnoyarsk.su (Vadim B. Mikheev)
Date: Mon, 17 Nov 1997 08:46:29 -0500 (EST)
Cc: hackers@postgreSQL.org (PostgreSQL-development)
In-Reply-To: <346FF895.167EB0E7@sable.krasnoyarsk.su> from "Vadim B. Mikheev" at Nov 17, 97 02:56:05 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hackers@hub.org
Precedence: bulk
Status: OR

>
> Bruce Momjian wrote:
> >
> > On startup, the postmaster makes a copy of pg_log, called pg_log_live.
> > Each postgres backend mmaps() this new file into its address space.  A
> > lock is gotten to make changes to the file.  All backends use pg_log_live
> > rather than pg_log.  Only the postmaster writes to pg_log.  (I will
> > someday remove the exec() from postmaster, so backends will get this
> > address space automatically.)
>
> What are the advantages of mmap'ing the entire pg_log over keeping
> "online" pg_log pages?
> pg_log may be very big (tens of MB), so why should we spend
> process address space on tens of MB of mostly unused data?
> Also, do all systems have mmap?

I believe you are correct that it would be better to keep the last few
pages of pg_log in shared memory rather than using mmap().
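
To make "the last few pages" concrete, here is a Python sketch of which
pg_log pages would need to stay in shared memory.  The 8192-byte block
size, the 2 status bits per transaction, and the default of keeping two
pages are assumptions for illustration, as are the function names.

```python
BLOCK_SIZE = 8192                                # assumed page size in bytes
BITS_PER_XID = 2                                 # assumed status bits per xid
XIDS_PER_PAGE = BLOCK_SIZE * 8 // BITS_PER_XID   # 32768 under these assumptions


def page_of(xid):
    """Return the pg_log page number that holds the status of xid."""
    return xid // XIDS_PER_PAGE


def tail_pages(next_xid, keep=2):
    """Return the page numbers of the last `keep` pages up to next_xid."""
    last = page_of(next_xid)
    return list(range(max(0, last - keep + 1), last + 1))
```

Under these assumptions a log that has reached xid 100000 spans pages 0
to 3, and only pages 2 and 3 would be held in shared memory.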

I think the important new ideas are keeping track of the oid/xid before
sync so we can accurately add 10,000 after a crash.

I am a little foggy on the race conditions of growing the pg_log region
while other backends are running, and modifying non-shared memory pages,
but you seem to have a handle on it.

We don't need pg_log_live if only the postmaster writes those last two
pages to pg_log, and if we keep track of a crash status somewhere else,
perhaps at the start of pg_log.

>
> >
> > Every 60-90 seconds, the postmaster gets a write lock on pg_log, takes
> > the minimum xid set in the start of pg_log, and resets its value.  It
> > records the current oid and xid from pg_variable.  It then clears the
> > lock, and starts reading from the minimum recorded xid changed to the
> > end of pg_log_live, and copies it into allocated memory.  It then does a
> > sync (twice?), waits for completion, and then writes the pg_log_live
>         ^^^^^
> man sync:
>
>      The sync() function forces a write of dirty (modified) buffers in the
>                          ^^^^^^
>      block buffer cache out to disk...
> ...
>
> BUGS
>      Sync() may return before the buffers are completely flushed.
>
> Vadim
>

My BSD/OS doesn't mention this, but twice is a good idea.



--
Bruce Momjian
maillist@candle.pha.pa.us




--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
