Thread: File IO - why does PG do things in pages?

File IO - why does PG do things in pages?

From: Dan Eloff
Date:
Hi Hackers,

I've familiarized myself a little with the architecture of postgresql,
largely because it's interesting. There's one thing I can't quite
figure out though, and it seems that there's no better group of people
in the world to ask about it.

At the lower levels in PG, reading from the disk into cache, and
writing from the cache to the disk is always done in pages.

Why does PG work this way? Is it any slower to write whole pages
rather than just the region of the page that changed? Conversely, is
it faster? From what I think I know of operating systems, reading
should bring the whole page into the os buffers anyway, so reading the
whole page instead of just part of it isn't much more expensive.
Perhaps writing works similarly?

Thanks,
-Dan


Re: File IO - why does PG do things in pages?

From: "Florian G. Pflug"
Date:
Dan Eloff wrote:
> At the lower levels in PG, reading from the disk into cache, and 
> writing from the cache to the disk is always done in pages.
> 
> Why does PG work this way? Is it any slower to write whole pages 
> rather than just the region of the page that changed? Conversely, is 
> it faster? From what I think I know of operating systems, reading 
> should bring the whole page into the os buffers anyway, so reading
> the whole page instead of just part of it isn't much more expensive. 
> Perhaps writing works similarly?

First, data fetched from the disk is (except for data in temporary
tables, I believe) not stored in private memory of the backend process
doing the read(), but instead in a shared memory segment accessible by
all backend processes. This allows two different backend processes to
modify the data concurrently without stepping on each other's toes.
Note that immediately writing back any changes is *not* an option,
since WAL logging mandates that all changes go to the WAL *first*.
Hence, if you were to write out each changed tuple immediately, you'd
have to first write the changes to the WAL *and* fsync the WAL to
guarantee they hit the disk first.
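
A minimal sketch of that ordering rule, using plain POSIX calls and
made-up helper names (this is not PostgreSQL's actual code, and error
handling is omitted):

    #include <sys/types.h>
    #include <unistd.h>

    #define PAGE_SIZE 8192

    /* WAL-before-data: the WAL record describing a change must reach
     * stable storage before the modified data page may be written out. */
    void flush_dirty_page(int wal_fd, const void *wal_record, size_t wal_len,
                          int data_fd, const void *page, off_t page_offset)
    {
        write(wal_fd, wal_record, wal_len);            /* 1. append WAL record        */
        fsync(wal_fd);                                 /* 2. force WAL to disk first  */
        pwrite(data_fd, page, PAGE_SIZE, page_offset); /* 3. now the page may go out  */
    }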

Sharing the data between backend processes requires a fair amount of
infrastructure. You need a way to locate a given chunk of on-disk data
in the shared memory buffer cache, and be able to acquire and release
locks on those buffers to prevent two backends from wreaking havoc when
they try to update the same piece of information. Organizing data in
fixed-sized chunks (which is what pages are) helps with keeping the
complexity of that infrastructure manageable, and the overhead
reasonably low.
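
As a rough illustration, the bookkeeping boils down to mapping a
(relation, block number) tag to a fixed-size slot in shared memory. The
names below are invented for this sketch; PostgreSQL's real definitions
live in src/include/storage/buf_internals.h:

    /* Invented names, for illustration only. */
    typedef struct BufferTagSketch
    {
        unsigned int rel_id;     /* which relation (table or index)    */
        unsigned int block_no;   /* which 8K page within that relation */
    } BufferTagSketch;

    typedef struct BufferSlotSketch
    {
        BufferTagSketch tag;     /* identifies the on-disk page held here   */
        int  lock_state;         /* placeholder for a shared/exclusive lock */
        int  is_dirty;           /* modified since it was read from disk?   */
        char data[8192];         /* the page itself, in shared memory       */
    } BufferSlotSketch;

    /* A hash table keyed by BufferTagSketch then locates the slot (if any)
     * that currently holds a given on-disk page. */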

There are also things like tracking the free space in a data file, which
likewise gets easier if you only have to track it page-wise (is there free
space on this page or not?) instead of having to track arbitrary ranges
of free space.
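
For instance, a page-granular free space map can be as simple as one byte
per page. PostgreSQL's actual free space map is more elaborate, but the
idea is the same; the following is only a sketch:

    #define NUM_PAGES 32768

    /* One byte per page: roughly how much free space that page has,
     * scaled to 0..255.  Finding room for a new tuple is a simple scan. */
    static unsigned char free_space[NUM_PAGES];

    int find_page_with_space(unsigned char needed)
    {
        for (int i = 0; i < NUM_PAGES; i++)
            if (free_space[i] >= needed)
                return i;
        return -1;          /* no page has enough room; extend the file */
    }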

Finally, since data writes happen in units of blocks (and not bytes),
you need to guarantee that you do your IO in some multiple of that unit
anyway, otherwise you'd have a very hard time guaranteeing data
consistency after a crash. Google for "torn page writes", that should
give you more details about this problem.

Note, however, that a postgres page (usually 8K) is usually larger than
the filesystem's block size (typically 4K) and the disk's sector size
(traditionally 512 bytes). So always reading in full pages induces *some*
IO overhead. Just not that much - especially since the blocks comprising
a page are extremely likely to be arranged consecutively on disk, so
there is no extra seeking involved.

These, at least, are what I believe to be the main reasons for doing
things in units of pages - hope this helps at least somewhat.

best regards,
Florian Pflug

Re: File IO - why does PG do things in pages?

From: Robert Haas
Date:
On Thu, Nov 26, 2009 at 4:14 PM, Dan Eloff <dan.eloff@gmail.com> wrote:
> Hi Hackers,
>
> I've familiarized myself a little with the architecture of postgresql,
> largely because it's interesting. There's one thing I can't quite
> figure out though, and it seems that there's no better group of people
> in the world to ask about it.
>
> At the lower levels in PG, reading from the disk into cache, and
> writing from the cache to the disk is always done in pages.
>
> Why does PG work this way? Is it any slower to write whole pages
> rather than just the region of the page that changed? Conversely, is
> it faster? From what I think I know of operating systems, reading
> should bring the whole page into the os buffers anyway, so reading the
> whole page instead of just part of it isn't much more expensive.
> Perhaps writing works similarly?

Yep.  It's not just PG that organizes things into pages - disks and
disk caches and main memory and kernel buffers are similarly
organized.  So, for example, to write 5 bytes to the disk, some part
of the stack (disk drive or kernel or application) must read the whole
page (however large it is) into memory, modify the 5 bytes, and write
it back out.  Changing a larger fraction of the page contents is
essentially free.
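
In other words, a small in-place update is effectively a read-modify-write
of the surrounding block. Roughly, as a sketch in plain POSIX calls
(illustrative names, error handling omitted):

    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define PAGE_SIZE 8192

    /* Change 5 bytes "inside" a page: the whole page is read, patched in
     * memory, and written back.  Touching more of the page costs the same. */
    void update_five_bytes(int fd, off_t page_offset, size_t off_in_page,
                           const char new_bytes[5])
    {
        char page[PAGE_SIZE];

        pread(fd, page, PAGE_SIZE, page_offset);   /* read whole page  */
        memcpy(page + off_in_page, new_bytes, 5);  /* modify 5 bytes   */
        pwrite(fd, page, PAGE_SIZE, page_offset);  /* write whole page */
    }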

Another reason to use pages is that a lot of operations have a
substantial "per request" overhead.  It would be extremely inefficient
to have the operating system request each desired byte from the disk
individually, or likewise for the application to request each byte
individually from the kernel.  In
fact, I think that disks are typically addressed in fixed-size
sectors, and the atomic operation is actually to read or write a
sector rather than a byte range.  But even at the operating system
level, where the kernel interface will LET you read or write bytes one
at a time, it's dreadfully slow.
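
The difference is easy to see with a toy comparison (a sketch only):
issuing one read() per byte pays the per-call overhead 8192 times per
page, while a single block-sized read() pays it once.

    #include <unistd.h>

    #define PAGE_SIZE 8192

    /* Dreadfully slow: one system call per byte. */
    void read_page_bytewise(int fd, char *buf)
    {
        for (int i = 0; i < PAGE_SIZE; i++)
            read(fd, &buf[i], 1);
    }

    /* Sensible: one system call for the whole page. */
    void read_page_at_once(int fd, char *buf)
    {
        read(fd, buf, PAGE_SIZE);
    }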

Organizing things into pages also simplifies bookkeeping.  For
example, suppose you want to keep track of which parts of your 256M
shared-memory buffer need to be written out to disk.  If you organize
your data into 8K pages, you can use a bitmap with 1 meaning dirty
(needs to be written) and 0 meaning clean, and the whole data
structure fits in 4K of memory.  If you need to track arbitrary ranges
of dirty bytes, you'll need a pair of 32-bit integers for each range
(starting offset and ending offset, or starting offset and length).
There's no reasonable fixed-size data structure which is guaranteed to
be large enough to hold all the ranges you might have (and 4K is not
even close to adequate), and inserting additional ranges, or clearing
out ranges that have been written, will be FAR more expensive than
under the bitmap implementation.  You'd likely have to set up some kind
of balanced tree structure to make the performance reasonable, which
will be more complex and therefore more likely to have bugs - and
it'll still be slower.
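
A sketch of that bitmap approach, with made-up names and sized for the
256M / 8K example above (32768 pages, so 4096 bytes of bitmap):

    #define BUFFER_BYTES  (256 * 1024 * 1024)
    #define PAGE_SIZE     8192
    #define NUM_PAGES     (BUFFER_BYTES / PAGE_SIZE)    /* 32768 pages */

    /* One bit per page: 32768 bits = 4096 bytes. */
    static unsigned char dirty_map[NUM_PAGES / 8];

    void mark_dirty(int page) { dirty_map[page / 8] |=  (1 << (page % 8)); }
    void mark_clean(int page) { dirty_map[page / 8] &= ~(1 << (page % 8)); }
    int  is_dirty(int page)   { return (dirty_map[page / 8] >> (page % 8)) & 1; }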

...Robert