Discussion: Linux max on shared buffers?
We are trying to throw a lot of memory at PostgreSQL to try to boost
performance. In an attempt to put our entire database into memory, I want
to allocate 2 to 3 GB out of 4 GB on a dual processor server running Red
Hat Linux 7.3 and PostgreSQL 7.2.1. We only expect 4 or 5 concurrent
backends.
When I try to allocate 2 GB or more, I get the following error when I try
to start PostgreSQL (after setting kernel.shmall and kernel.shmmax
appropriately):
    IpcMemoryCreate: shmat(id=163840) failed: Cannot allocate memory
I can safely allocate a little under 2 GB. Is this a Linux upper bound on
how much memory can be allocated to a single program? Is there another
kernel parameter besides kernel.shmall and kernel.shmmax that can be set
to allow more memory to be allocated?
How does one increase the SHMMAX? Does it require recompiling the kernel?
Terry Fielder
Network Engineer
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org]On Behalf Of Martin Dillard
> Sent: Wednesday, July 10, 2002 1:45 PM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] Linux max on shared buffers?
At 02:17 PM 7/10/2002, terry@greatgulfhomes.com wrote:
>How does one increase the SHMMAX?  Does it require recompiling the kernel?
To answer both questions: no recompile is needed.
Increase it with /etc/sysctl.conf entries such as: (Linux 2.2 and up)
    kernel.shmall = 1073741824
    kernel.shmmax = 1073741824
Or:
    echo 1073741824 > /proc/sys/kernel/shmall
etc.
Linux on i386 has a problem with large amounts of memory: a 4 GB limit on
addressable memory per process (at least, without using bank switching).
The kernel usually reserves 1 or 2 GB for itself, leaving only 2 or 3 GB
for the process. Hence, it's unlikely you'd ever be able to map a larger
shared memory segment into a process anyway, so I figure the 2 GB limit is
pretty reasonable.
On my 8 GB system, I give only 256-512 MB to shared memory, and also set
    effective_cache_size = 625000
telling PostgreSQL that there's probably 5 GB of data in the OS-level
cache. I'm not sure if that does anything, but since there's more like 6.5
or 7 GB in the OS cache, I figure it can't hurt.
Cheers,
Doug
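As a rough check on the numbers in this message, here is a small Python sketch. It converts a page-based setting such as effective_cache_size to bytes and, on Linux, reads the current kernel.shmmax from /proc. The 8 kB page size is an assumption (PostgreSQL's usual default block size), and the helper names are made up for illustration.

```python
from pathlib import Path

BLOCK_SIZE = 8192  # assumed PostgreSQL block size in bytes (the usual default)


def pages_to_bytes(pages, block_size=BLOCK_SIZE):
    """Convert a setting expressed in disk pages to bytes."""
    return pages * block_size


def read_kernel_limit(name, root="/proc/sys/kernel"):
    """Return the integer value of a kernel parameter, or None if unreadable."""
    try:
        return int((Path(root) / name).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


if __name__ == "__main__":
    cache_bytes = pages_to_bytes(625000)
    print(f"625000 pages = {cache_bytes / 10**9:.2f} GB")  # 5.12 GB
    print("kernel.shmmax =", read_kernel_limit("shmmax"))  # None off Linux
```

With 8 kB pages, 625000 pages comes to about 5.12 GB, which agrees with the "probably 5 GB" estimate above.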
On Wed, 10 Jul 2002, Martin Dillard wrote:
> When I try to allocate 2 GB or more....
If I recall correctly, under normal circumstances a process under
Linux has an address space of only 2 GB. Therefore you will never
be able to allocate more memory than that. I think there's a patch
(maybe from SGI?) that lets you increase this to 3 GB, but at any
rate it's always going to be well under 4 GB, no matter what you
do, unless you move to a 64-bit processor.
But really, as discussed just in the last week on this list, you want to
allocate more like 10 MB or so to postgres' shared memory area. Then the
rest of your memory will be used as buffer cache and you will be happy.
If you want to know why, go back through the archives of this list.
cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
Curt Sampson wrote:
> On Wed, 10 Jul 2002, Martin Dillard wrote:
>
> > When I try to allocate 2 GB or more....
>
> If I recall correctly, under normal circumstances a process under
> Linux has an address space of only 2 GB. Therefore you will never
> be able to allocate more memory than that. I think there's a patch
> (maybe from SGI?) that lets you increase this to 3 GB, but at any
> rate it's always going to be well under 4 GB, no matter what you
> do, unless you move to a 64-bit processor.
>
> But really, as discussed just in the last week on this list, you want to
> allocate more like 10 MB or so to postgres' shared memory area. Then the
> rest of your memory will be used as buffer cache and you will be happy.
> If you want to know why, go back through the archives of this list.
Woh, 10MB is clearly too low. Remember, there is copying overhead of
moving data from the kernel to the PostgreSQL shared buffers.
--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
On Wed, 10 Jul 2002, Bruce Momjian wrote:
> Woh, 10MB is clearly too low. Remember, there is copying overhead of
> moving data from the kernel to the PostgreSQL shared buffers.
Yes, but the cost of copying between a postgres buffer and an OS buffer
is much, much less than the cost of copying between an OS buffer and disk.
However, it all depends on your working set, doesn't it? So I can't make
a strong argument either way. What do you think is better? 20 MB? 100
MB? Do you allocate based on the number of connections, or a proportion
of the machine's memory, or something else? I was estimating based on
the number of connections.
cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
Curt Sampson wrote:
> On Wed, 10 Jul 2002, Bruce Momjian wrote:
>
> > Woh, 10MB is clearly too low. Remember, there is copying overhead of
> > moving data from the kernel to the PostgreSQL shared buffers.
>
> Yes, but the cost of copying between a postgres buffer and an OS buffer
> is much, much less than the cost of copying between an OS buffer and disk.
>
> However, it all depends on your working set, doesn't it? So I can't make
> a strong argument either way. What do you think is better? 20 MB? 100
> MB? Do you allocate based on the number of connections, or a proportion
> of the machine's memory, or something else? I was estimating based on
> the number of connections.
If it is a dedicated machine, I would think some percentage of total RAM
would be good, perhaps 25%. If it isn't dedicated, the working set becomes
a major issue, yes.
--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
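To put the sizes being debated side by side, the following Python sketch converts them to a buffer count. It assumes that shared_buffers is given as a number of 8 kB buffers, as in PostgreSQL releases of that era, and the function names are invented for this example.

```python
BLOCK_SIZE = 8192  # assumed size of one shared buffer in bytes
MIB = 1024 ** 2
GIB = 1024 ** 3


def buffers_for_bytes(nbytes, block_size=BLOCK_SIZE):
    """How many whole buffers fit into nbytes of shared memory."""
    return nbytes // block_size


def buffers_for_ram_fraction(ram_bytes, fraction, block_size=BLOCK_SIZE):
    """Buffer count for a given fraction of total RAM."""
    return buffers_for_bytes(int(ram_bytes * fraction), block_size)


print(buffers_for_bytes(10 * MIB))              # 1280   (the 10 MB suggestion)
print(buffers_for_ram_fraction(4 * GIB, 0.25))  # 131072 (25% of a 4 GB machine)
```

On the 4 GB server from the original question, the 25% rule of thumb works out to 1 GB, comfortably below the roughly 2 GB point where shmat() started failing.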
What I have found, on a machine that is doing mostly DBMS serving:
if you have a small shared buffer you will see that the CPU is
not fully utilized as it is waiting on I/O a lot of the time.
I keep adding buffer space until the CPU utilization goes to
~100%. Adding more buffer after this does little to speed up
queries. I use the "top" display under either Linux or Solaris.
This is purely empirical and based on observation, but it makes
sense: if the point of the buffer is to avoid disk I/O, then once
disk I/O is no longer the bottleneck, adding more buffers does
little good. On a typical 1-CPU PC, the CPU becomes the bottleneck
after about 200 MB of buffer.
=====
Chris Albertson
  Home:   310-376-1029  chrisalbertson90278@yahoo.com
  Cell:   310-990-7550
  Office: 310-336-5189  Christopher.J.Albertson@aero.org
On Wed, 10 Jul 2002, Chris Albertson wrote:
> What I have found, on a machine that is doing mostly DBMS serving:
> if you have a small shared buffer you will see that the CPU is
> not fully utilized as it is waiting on I/O a lot of the time.
What OS are you using? Unless it's really strange or really old, you
are probably just removing an operating system buffer for each shared
memory buffer you add. (But this is not good; two smaller buffers in
series are not as good as one big buffer.)
> I keep adding buffer space until the CPU utilization goes to
> ~100%. Adding more buffer after this does little to speed up
> queries. I use the "top" display under either Linux or Solaris.
Um...I just don't quite feel up to explaining why I think this is
quite wrong...
cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
> > What I have found, on a machine that is doing mostly DBMS serving:
> > if you have a small shared buffer you will see that the CPU is
> > not fully utilized as it is waiting on I/O a lot of the time.
Just to repeat what Curt said :-) : This doesn't make sense to me.
Whatever memory you give to Postgres for caching *could* be used by the
OS to cache. It doesn't seem that changing PG's buffer count should make
any real difference in real disk IO; the disk pages will get cached one
way or the other.
> What OS are you using? Unless it's really strange or really old, you
> are probably just removing an operating system buffer for each shared
> memory buffer you add. (But this is not good; two smaller buffers in
> series are not as good as one big buffer.)
When are PG's buffers *not* in series with the OS's? *Any* page that
postgres has read from disk has been in the OS cache. When reading a page
that's not in either cache, you get:
[Disk] => <read> => [OS cache] => <mem copy> => [PG cache]
And when it's in only the OS cache, you get:
[OS cache] => <mem copy> => [PG cache]
The bigger your PG cache, the less often you have to ask the OS for a
page.  That <mem copy> (plus syscall overhead) that happens between OS
cache and PG cache is expensive compared to not having to do it.  Not
horribly so, I realize, but every little bit helps...
Unless I'm just completely missing something here, the bigger the PG
cache the better, within reason. Please don't come back with a remark
about other processes needing memory and possible swap storms, etc.; I
know to avoid that, and my approach would need to be *heavily* curbed on
machines that do things besides run Postgres. But as long as there is no
danger of starving the machine for memory, I don't see what the harm is
in giving more memory to Postgres.
<snip snip>
--Glen Parker
--- Curt Sampson <cjs@cynic.net> wrote:
> On Wed, 10 Jul 2002, Chris Albertson wrote:
<SNIP>
> > I keep adding buffer space until the CPU utilization goes to
> > ~100%. Adding more buffer after this does little to speed up
> > queries. I use the "top" display under either Linux or Solaris.
>
> Um...I just don't quite feel up to explaining why I think this is
> quite wrong...
It is easy to explain why I am wrong: you have run timing tests
and have measurements that disagree with what I wrote. If you do have
such data, good; then we can find out the differences in the test
methods.
I am not so interested in theory, but possibly the O/S (Linux and
Solaris in my case) buffers and PostgreSQL buffers are not
interchangeable, as they are managed using different algorithms and
the OS-level buffer is shared with other applications.
=====
Chris Albertson
  Home:   310-376-1029  chrisalbertson90278@yahoo.com
  Cell:   310-990-7550
  Office: 310-336-5189  Christopher.J.Albertson@aero.org
On Thu, 11 Jul 2002, Chris Albertson wrote:
> --- Curt Sampson <cjs@cynic.net> wrote:
> > On Wed, 10 Jul 2002, Chris Albertson wrote:
> <SNIP>
> > > I keep adding buffer space until the CPU utilization goes to
> > > ~100%. Adding more buffer after this does little to speed up
> > > queries. I use the "top" display under either Linux or Solaris.
> >
> > Um...I just don't quite feel up to explaining why I think this is
> > quite wrong...
>
> It is easy to explain why I am wrong: you have run timing tests
> and have measurements that disagree with what I wrote.
You're measuring the wrong thing. The point is not to make your CPU
usage as high as possible; the point is to make your queries as fast
as possible.
> I am not so interested in theory, but possibly the O/S (Linux and
> Solaris in my case) buffers and PostgreSQL buffers are not
> interchangeable, as they are managed using different algorithms and
> the OS-level buffer is shared with other applications.
They are interchanged: memory you use for postgres buffering is memory
your OS doesn't get to use for its own buffering. The OS buffers are
always used. Worse yet, it appears that many systems page shared memory
anyway, so your buffers may be swapped out to disk, which sort of
negates the point of them.
Anyway, maybe one day I'll have the time and energy to write up a
big long explanation of a data block's trip between disk and postgres
buffer, so that people will understand. There seems to be a lot of
confusion here.
cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
On Thu, 11 Jul 2002, Glen Parker wrote:
> When are PG's buffers *not* in series with the OS's?
Never. That's the problem.
> [Disk] => <read> => [OS cache] => <mem copy> => [PG cache]
>
> And when it's in only the OS cache, you get:
> [OS cache] => <mem copy> => [PG cache]
Right.
> The bigger your PG cache, the less often you have to ask the OS for a
> page.  That <mem copy> (plus syscall overhead) that happens between OS
> cache and PG cache is expensive compared to not having to do it.  Not
> horribly so, I realize, but every little bit helps...
Yes, but it's much, much cheaper than reading a page from disk.
> Unless I'm just completely missing something here, the bigger the PG
> cache the better, within reason.
Let's walk through an example. I have four pages total for caching.
Let's look at a read scenario based on two for postgres and two for the
OS, and one for postgres and three for the OS. Pn is a postgres buffer
and OSn is an OS buffer; the numbers below those show which disk blocks
are in which caches. We'll use an LRU algorithm for both caches and read
the blocks in this order: 1 2 3 2 1 2 3.
    OS1 OS2 P1  P2   Action
    -   -   -   -    Start with all caches empty.
    1   -   1   -    Read block 1 (replaces -/-). disk + memory copy.
    1   2   1   2    Read block 2 (replaces -/-). disk + memory copy.
    3   2   3   2    Read block 3 (replaces 1/1). disk + memory copy.
    3   2   3   2    Read block 2 (in cache). no copies.
    3   1   3   1    Read block 1 (replaces 2/2). disk + memory copy.
    2   1   2   1    Read block 2 (replaces 3/3). disk + memory copy.
    2   3   2   3    Read block 3 (replaces 1/1). disk + memory copy.
Now with postgres getting one buffer and the OS getting three:
    OS1 OS2 OS3 P1   Action
    -   -   -   -    Start with all caches empty.
    1   -   -   1    Read block 1 (replaces -/-). disk + memory copy.
    1   2   -   2    Read block 2 (replaces -/1). disk + memory copy.
    1   2   3   3    Read block 3 (replaces -/2). disk + memory copy.
    1   2   3   2    Read block 2 (in cache, replaces 3). memory copy.
    1   2   3   1    Read block 1 (in cache, replaces 2). memory copy.
    1   2   3   2    Read block 2 (in cache, replaces 1). memory copy.
    1   2   3   3    Read block 3 (in cache, replaces 2). memory copy.
So with 2 and 2 buffers for the OS and postgres, we end up doing
six disk reads and six memory copies. With 3 and 1, we do three
disk reads and seven memory copies. This is why you do not want to
take buffers from the OS in order to give them to postgres.
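The walkthrough above can be replayed with a small simulation. This is only a sketch of a two-level cache under simplifying assumptions: the class and function names are invented here, every postgres miss costs one memory copy, and the OS cache only sees requests that miss in the postgres cache. The counts in the two tables are reproduced when both caches evict in insertion order (FIFO). Under strict LRU, where a hit refreshes a block's recency, this model gives four disk reads and five copies for the 2/2 split, which is still more disk I/O than the 3/1 split.

```python
from collections import OrderedDict


class Cache:
    """Fixed-size block cache with FIFO or LRU eviction."""

    def __init__(self, capacity, policy="fifo"):
        self.capacity = capacity
        self.policy = policy
        self.blocks = OrderedDict()  # oldest entry first

    def lookup(self, block):
        if block not in self.blocks:
            return False
        if self.policy == "lru":
            self.blocks.move_to_end(block)  # a hit refreshes recency
        return True

    def insert(self, block):
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the oldest entry
        self.blocks[block] = True


def simulate(reads, os_size, pg_size, policy="fifo"):
    """Return (disk_reads, memory_copies) for a sequence of block reads."""
    os_cache, pg_cache = Cache(os_size, policy), Cache(pg_size, policy)
    disk_reads = memory_copies = 0
    for block in reads:
        if pg_cache.lookup(block):
            continue  # found in the postgres cache: no copy at all
        if not os_cache.lookup(block):
            disk_reads += 1  # not in the OS cache either: go to disk
            os_cache.insert(block)
        memory_copies += 1  # copy from the OS cache into postgres
        pg_cache.insert(block)
    return disk_reads, memory_copies


reads = [1, 2, 3, 2, 1, 2, 3]
print(simulate(reads, os_size=2, pg_size=2))                # (6, 6)
print(simulate(reads, os_size=3, pg_size=1))                # (3, 7)
print(simulate(reads, os_size=2, pg_size=2, policy="lru"))  # (4, 5)
```

Changing the read sequence or the split is an easy way to see how sensitive the result is to the working set, which is the caveat raised earlier in the thread.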
So, you ask, why not devote almost all your memory to postgres cache,
and minimize the OS's caching? Answer: if you can do that without
causing any of your programs to swap, that's fine. But I doubt you can,
unless you have, say, perfect control over how many sorts will ever be
occurring at once, and so on.
cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
			
On Thu, Jul 11, 2002 at 05:07:29PM +0900, Curt Sampson wrote:
> Let's walk through an example. I have four pages total for caching.
> Let's look at a read scenario based on two for postgres and two for the
> OS, and one for postgres and three for the OS. Pn is a postgres buffer
> and OSn is an OS buffer; the numbers below those show which disk blocks
> are in which caches. We'll use an LRU algorithm for both caches and read
> the blocks in this order: 1 2 3 2 1 2 3.
Hmm, what about OS's that swap shared memory to disk. Wouldn't that change
things somewhat? Probably more in favour of giving more memory to the OS.
The other possibility would be to use mmap instead. That way you avoid the
double buffering altogether. Do you have any ideas about that?
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.
On Thu, 11 Jul 2002, Martijn van Oosterhout wrote:
> Hmm, what about OS's that swap shared memory to disk. Wouldn't that change
> things somewhat?
It just makes things worse. Paging a buffer to disk does the I/O you
were trying not to do.
> Probably more in favour of giving more memory to the OS.
Yup.
> The other possibility would be to use mmap instead. That way you avoid the
> double buffering altogether. Do you have any ideas about that?
On this page
    http://archives.postgresql.org/pgsql-hackers/2002-06/threads.php
there's a thread about "Buffer management", and I have some posts to that
about using mmap. Unfortunately, I can't point out the exact posts, because
the php that's generating these pages is probably missing an end table
tag, and so the pages render as completely blank on Linux Netscape 4.78.
cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
			
On Thu, Jul 11, 2002 at 06:04:20PM +0900, Curt Sampson wrote:
> On Thu, 11 Jul 2002, Martijn van Oosterhout wrote:
> > The other possibility would be to use mmap instead. That way you avoid the
> > double buffering altogether. Do you have any ideas about that?
>
> On this page
>
>     http://archives.postgresql.org/pgsql-hackers/2002-06/threads.php
>
> there's a thread about "Buffer management", and I have some posts to that
> about using mmap. Unfortunately, I can't point out the exact posts, because
> the php that's generating these pages is probably missing an end table
> tag, and so the pages render as completely blank on Linux Netscape 4.78.
Yeah, I thought of the writeback issue also. I was thinking that you
might have to keep the current shared memory scheme for written pages
but use mmap for reading them in. Since the number of written pages
is much smaller than the number of read pages, you can avoid a lot of
double buffering.
But that does increase complexity and possibly cause portability
problems, since that could make assumptions about how buffers, shared
memory and mmaps are shared. Not totally straightforward. Needs more
thought, obviously.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.
By my understanding, there is a limit to that. Since no more than 4 GB
of memory can be addressed PER process on an ix86 system (you can have
16 of these 4 GB partitions with a Xeon machine), and the kernel
allocates 1 GB of memory for its own purposes (this can be adjusted up
or down during kernel compilation), it leaves you with 3 GB of memory
maximum per process.
Of course, it is possible that the shared memory handler has a max of
2 GB; it is a "pretty number" ;)
What you should try first is whether you gain any performance going from
1 GB to 1.5 GB and from 1.5 GB to 2 GB... If that doesn't improve
anything, why bother about the 3 GB?
Martin Dillard wrote:
> We are trying to throw a lot of memory at PostgreSQL to try to boost
> performance. In an attempt to put our entire database into memory, I want to
> allocate 2 to 3 GB out of 4 GB on a dual processor server running Red Hat
> Linux 7.3 and PostgreSQL 7.2.1. We only expect 4 or 5 concurrent backends.
>
> When I try to allocate 2 GB or more, I get the following error when I try to
> start PostgreSQL (after setting kernel.shmall and kernel.shmmax
> appropriately):
>
> IpcMemoryCreate: shmat(id=163840) failed: Cannot allocate memory
>
> I can safely allocate a little under 2 GB. Is this a Linux upper bound on
> how much memory can be allocated to a single program? Is there another
> kernel parameter besides kernel.shmall and kernel.shmmax that can be set to
> allow more memory to be allocated?
Martijn van Oosterhout wrote:
> The other possibility would be to use mmap instead. That way you avoid the
> double buffering altogether. Do you have any ideas about that?
>
I ran across a link last night on the R list to the TPIE (Transparent
Parallel I/O Environment) project at Duke University. It looks
interesting and somewhat related to this thread:
    http://www.cs.duke.edu/TPIE/
It is C++ and GPL'd, but there are links to some related papers.
Joe
			
Doug Fields wrote:
> At 02:17 PM 7/10/2002, terry@greatgulfhomes.com wrote:
> >How does one increase the SHMMAX?  Does it require recompiling the kernel?
>
> To answer both questions:
>
> Increase it with /etc/sysctl.conf entries such as: (Linux 2.2 and up)
>
> kernel.shmall = 1073741824
> kernel.shmmax = 1073741824
Hello,
here on Debian, those settings are done *after* PostgreSQL is started.
  $ ls /etc/rc2.d/*{proc,post}* | sort
  /etc/rc2.d/S20postgresql
  /etc/rc2.d/S20procps.sh
Is that not a problem, or should I file a report with reportbug?
Thanks,
Knut Sübert
			
On Thu, 11 Jul 2002, Martijn van Oosterhout wrote:
> Yeah, I thought of the writeback issue also. I was thinking that you
> might have to keep the current shared memory scheme for written pages
> but use mmap for reading them in. Since the number of written pages
> is much smaller than the number of read pages, you can avoid a lot of
> double buffering.
Actually, come to think of it, since we write the entire new page
to the WAL, we'd have to copy the page anyway. So this basically
solves the problem; you mark the page as having a "to-write" copy,
copy it, modify it, write the log file, and copy the new page back
to the mmap buffer. Not too difficult at all, really.
If we ever did stop writing the entire page to WAL, it's not too
difficult anyway, because we just have to do the actual write to
the page *after* the log entry is on stable storage, as I pointed
out, which means keeping a record of the change to be written. But
we have that record of course, because it's exactly what we just
wrote to the log. And we also, of course, have logic to take that
record and apply it to the table...
> ...and possibly cause portability
> problems, since that could make assumptions about how buffers, shared
> memory and mmaps are shared.
No, portability problems are very unlikely. We already use shared
memory and it works. Mmap has very well defined behaviour and, in
my experience, works very consistently across all modern systems.
cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
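The copy, modify, log, copy-back sequence described above might look roughly like the following Python sketch. It is an illustration of the ordering only, not PostgreSQL code: the file names, the log record layout, and the update_page helper are all hypothetical. It also leaves out the step of marking the page as having a "to-write" copy, along with any locking between backends.

```python
import mmap
import os
import tempfile

PAGE_SIZE = 8192  # assumed page size in bytes


def update_page(data_map, log_file, page_no, offset, new_bytes):
    """Apply a change to one mapped page, logging the full page first."""
    if offset < 0 or offset + len(new_bytes) > PAGE_SIZE:
        raise ValueError("change does not fit inside the page")
    start = page_no * PAGE_SIZE
    # 1. Take a private copy; the mapped page stays untouched for now.
    page = bytearray(data_map[start:start + PAGE_SIZE])
    # 2. Modify the copy.
    page[offset:offset + len(new_bytes)] = new_bytes
    # 3. Write the whole new page to the log and force it to stable storage.
    log_file.write(page_no.to_bytes(4, "big") + bytes(page))
    log_file.flush()
    os.fsync(log_file.fileno())
    # 4. Only now copy the new page back into the mapped buffer.
    data_map[start:start + PAGE_SIZE] = bytes(page)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_path = os.path.join(tmp, "table.dat")  # hypothetical data file
        log_path = os.path.join(tmp, "wal.log")     # hypothetical log file
        with open(data_path, "wb") as f:
            f.write(bytes(2 * PAGE_SIZE))           # two zero-filled pages
        with open(data_path, "r+b") as data, open(log_path, "ab") as log:
            with mmap.mmap(data.fileno(), 0) as mapped:
                update_page(mapped, log, 1, 16, b"hello")
                print(mapped[PAGE_SIZE + 16:PAGE_SIZE + 21])  # b'hello'
```

The point of the ordering is that the mapped page, which the OS may write back at any time, never holds a change that is not already in the log.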
On Thu, 11 Jul 2002 18:43:08 +1000
Martijn van Oosterhout <kleptog@svana.org> wrote:
> On Thu, Jul 11, 2002 at 05:07:29PM +0900, Curt Sampson wrote:
> > Let's walk through an example. I have four pages total for caching.
> > Let's look at a read scenario based on two for postgres and two for the
> > OS, and one for postgres and three for the OS. Pn is a postgres buffer
> > and OSn is an OS buffer; the numbers below those show which disk blocks
> > are in which caches. We'll use an LRU algorithm for both caches and read
> > the blocks in this order: 1 2 3 2 1 2 3.
>
> Hmm, what about OS's that swap shared memory to disk. Wouldn't that change
> things somewhat? Probably more in favour of giving more memory to the OS.
I don't know about Linux, but under FreeBSD you can tell the OS to lock
shared mem in core. This DOES help out.
> The other possibility would be to use mmap instead. That way you avoid the
> double buffering altogether. Do you have any ideas about that?
Not all platforms have mmap. This has been discussed before, I believe.
--
GB Clark II             | Roaming FreeBSD Admin
gclarkii@VSServices.COM | General Geek
           CTHULU for President - Why choose the lesser of two evils?
On Wed, 17 Jul 2002, GB Clark wrote:
> Not all platforms have mmap. This has been discussed before, I believe.
I've now heard several times here that not all platforms have mmap
and/or mmap is not compatible across all platforms. I've yet to
see any solid evidence of this, however, and I'm inclined to believe
that mmap compatibility is no worse than compatibility with the
System V shared memory we're using already, since both are fairly
specifically defined by POSIX.
cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
Curt Sampson wrote:
>
> On Wed, 17 Jul 2002, GB Clark wrote:
>
> > Not all platforms have mmap. This has been discussed before, I believe.
>
> I've now heard several times here that not all platforms have mmap
> and/or mmap is not compatible across all platforms. I've yet to
> see any solid evidence of this, however, and I'm inclined to believe
> that mmap compatibility is no worse than compatibility with the
> System V shared memory we're using already, since both are fairly
> specifically defined by POSIX.
Curt,
I still don't completely understand what you are proposing. What I
understood so far is that you want to avoid double buffering (OS buffer
plus SHMEM). Wouldn't that require that the access to a block in the
file (table, index, sequence, ...) has to go directly through a mmapped
region of that file?
Let's create a little test case to discuss. I have two tables, 2
gigabytes in size each (making 4 segments of 1 GB total), plus a 512 MB
index for each. Now I join them in a query that results in a nestloop
doing index scans.
On a 32-bit system you cannot mmap both tables plus the indexes at the
same time completely. But the execution plan reads one table's index,
fetches the heap tuples from it by random access, and inside of that
loop does the same for the second table. So chances are that this plan
randomly peeks around in the entire 5 gigabytes; at least you cannot
predict which blocks it will need.
So far so good. Now what do you map when? Can you map multiple
noncontiguous 8K blocks out of each file? If so, how do you coordinate
that all backends in summary use at maximum the number of blocks you
want PostgreSQL to use (each unique block counts, regardless of how many
backends have it mmap()'d, right?). And if a backend needs a block and
the max is reached already, how does it tell the other backends to unmap
something?
I assume I am missing something very important here, or I am far off
with my theory and the solution looks totally different. So could you
please tell me how this is going to work?
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
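The sizes in this test case can be checked with a few lines of Python. This is plain arithmetic on the figures given above, taking "2 gigabytes" as 2 GiB and a 32-bit address space as 2^32 bytes; the names are only for illustration.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
BLOCK_SIZE = 8192               # the 8K blocks mentioned above
ADDRESS_SPACE_32BIT = 2 ** 32   # upper bound, before any kernel reservation


def dataset_bytes(n_tables=2, table_bytes=2 * GIB, index_bytes=512 * MIB):
    """Total size of the tables plus one index per table."""
    return n_tables * (table_bytes + index_bytes)


total = dataset_bytes()
print(total / GIB)                  # 5.0
print(total > ADDRESS_SPACE_32BIT)  # True: it cannot all be mapped at once
print(total // BLOCK_SIZE)          # 655360 distinct 8K blocks
```

So even before the kernel's share of the address space is subtracted, the data set is larger than anything a single 32-bit process can map, and mapping block by block would mean managing up to 655360 separate candidates.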
Jan Wieck <JanWieck@Yahoo.com> writes:
> So far so good. Now what do you map when? Can you map multiple
> noncontiguous 8K blocks out of each file? If so, how do you coordinate
> that all backends in summary use at maximum the number of blocks you
> want PostgreSQL to use (each unique block counts, regardless of how many
> backends have it mmap()'d, right?). And if a backend needs a block and
> the max is reached already, how does it tell the other backends to unmap
> something?
Just to throw some additional wrenches into the gears: some platforms
(eg HPPA) have strong restrictions on where you can mmap stuff.
I quote some interesting material from the HPUX mmap(2) man page below.
Possibly these restrictions could be worked around, but it looks
painful.
Note that platforms where these restrictions don't exist are not
automatically better: that just means that they're willing to swap a
larger number of address translation table entries for each process
dispatch.  If we tried to mmap disk file blocks individually, our
process dispatch time would go to hell; but trying to map large ranges
instead (to hold down the number of translation entries) is going to
have a bunch of problems too.
            regards, tom lane
     If the size of the mapped file changes after the call to mmap(), the
     effect of references to portions of the mapped region that correspond
     to added or removed portions of the file is unspecified.
    [ hence, any extension of a file requires re-mmaping; moreover
      it appears that you cannot extend a file *at all* via mmap,
      but must do so via write, after which you can re-mmap -- tgl ]
     ...
     Because the PA-RISC memory architecture utilizes a globally shared
     virtual address space between processes, and discourages multiple
     virtual address translations to the same physical address, all
     concurrently existing MAP_SHARED mappings of a file range must share
     the same virtual address offsets and hardware translations. PA-RISC-
     based HP-UX systems allocate virtual address ranges for shared memory
     and shared mapped files in the range 0x80000000 through 0xefffffff.
     This address range is used globally for all memory objects shared
     between processes.
     This implies the following:
          o    Any single range of a file cannot be mapped multiply into
               different virtual address ranges.
          o    After the initial MAP_SHARED mmap() of a file range, all
               subsequent MAP_SHARED calls to mmap() to map the same range
               of a file must either specify MAP_VARIABLE in flags and
               inherit the virtual address range the system has chosen for
               this range, or specify MAP_FIXED with an addr that
               corresponds exactly to the address chosen by the system for
               the initial mapping.  Only after all mappings for a file
               range have been destroyed can that range be mapped to a
               different virtual address.
          o    In most cases, two separate calls to mmap() cannot map
               overlapping ranges in a file.
        [ that statement holds GLOBALLY, not per-process -- tgl ]
               The virtual address range
               reserved for a file range is determined at the time of the
               initial mapping of the file range into a process address
               space.  The system allocates only the virtual address range
               necessary to represent the initial mapping.  As long as the
               initial mapping exists, subsequent attempts to map a
               different file range that includes any portion of the
               initial range may fail with an ENOMEM error if an extended
               contiguous address range that preserves the mappings of the
               initial range cannot be allocated.
          o    Separate calls to mmap() to map contiguous ranges of a file
               do not necessarily return contiguous virtual address ranges.
               The system may allocate virtual addresses for each call to
               mmap() on a first available basis.
			
On Fri, Jul 19, 2002 at 10:41:28AM -0400, Jan Wieck wrote:
> Curt Sampson wrote:
> >
> > On Wed, 17 Jul 2002, GB Clark wrote:
> >
> > > Not all platforms have mmap. This has been discussed before I believe.
> >
> > I've now heard several times here that not all platforms have mmap
> > and/or mmap is not compatible across all platforms. I've yet to
> > see any solid evidence of this, however, and I'm inclined to believe
> > that mmap compatibility is no worse than compatibility with the
> > System V shared memory we're using already, since both are fairly
> > specifically defined by POSIX.
>
> Curt,
>
> I still don't completely understand what you are proposing. What I
> understood so far is that you want to avoid double buffering (OS buffer
> plus SHMEM). Wouldn't that require that the access to a block in the
> file (table, index, sequence, ...) has to go directly through a mmapped
> region of that file?

Well, you would have to deal with the fact that writing changes to a mmap()
is allowed, but you have no guarantee when it will be finally written. Given
WAL I would suggest using mmap() for reading only and using write() to
update the file.

> Let's create a little test case to discuss. I have two tables, 2
> Gigabyte in size each (making 4 segments of 1 GB total) plus a 512 MB
> index for each. Now I join them in a query, that results in a nestloop
> doing index scans.
>
> On a 32 bit system you cannot mmap both tables plus the indexes at the
> same time completely. But the access of the execution plan is reading
> one table's index, fetching the heap tuples from it by random access, and
> inside of that loop doing the same for the second table. So chances are,
> that this plan randomly peeks around in the entire 5 Gigabyte, at least
> you cannot predict which blocks it will need.

Correct.

> So far so good. Now what do you map when? Can you map multiple
> noncontiguous 8K blocks out of each file? If so, how do you coordinate
> that all backends in summary use at maximum the number of blocks you
> want PostgreSQL to use (each unique block counts, regardless of how many
> backends have it mmap()'d, right?). And if a backend needs a block and
> the max is reached already, how does it tell the other backends to unmap
> something?

You can mmap() any portions anywhere (except for PA-RISC as Tom pointed
out). I was thinking in 8MB lots to avoid doing too many system calls (also
on i386, this chunk size could save the kernel making many page tables).

You don't need any coordination between backends over memory usage. The
mmap() is merely a window into the kernel's disk cache. You are not
currently limiting the disk cache of the kernel, nor would it be sensible
to do so.

If you need a block, you simply dereference the appropriate pointer (after
checking you have mmap()ed it in). If the data is in memory, the
dereference succeeds. If it's not, you get a page fault, the data is
fetched and the dereference succeeds on the second try. If in that process
the kernel needed to throw out another page, who cares? If another backend
needs that page it'll get read back in.

One case where this would be useful would be an i386 machine with 64GB of
memory. Then you are in effect simply mapping different parts of the cache
at different times. No blocks are copied *ever*.

> I assume I am missing something very important here, or I am far off
> with my theory and the solution looks totally different. So could you
> please tell me how this is going to work?

It is different. I believe you would still need some form of shared memory
to co-ordinate write()s.

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.
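The chunked-mapping idea described above can be sketched in a few lines. This is only an illustration, written in Python rather than backend C, and the names `read_block`, `CHUNK_SIZE` and `demo` are made up for the example; nothing here is PostgreSQL code. It maps a file in 8 MB windows on demand and serves 8K blocks out of whichever window covers them. Note that slicing a Python mmap copies the bytes, whereas C code would simply dereference a pointer into the mapping.

```python
import mmap
import os
import tempfile

BLOCK_SIZE = 8192                 # PostgreSQL's default block size
CHUNK_SIZE = 8 * 1024 * 1024      # the 8 MB mapping unit suggested in the message


def read_block(fd, block_no, chunks):
    """Return one 8K block, mapping its enclosing 8 MB chunk on first use."""
    offset = block_no * BLOCK_SIZE
    chunk_no = offset // CHUNK_SIZE
    if chunk_no not in chunks:
        start = chunk_no * CHUNK_SIZE
        remaining = os.fstat(fd).st_size - start
        if remaining <= 0:
            raise ValueError("block lies beyond the end of the file")
        # One mmap() call covers up to 1024 blocks; later reads in the same
        # chunk cost no system call at all.
        chunks[chunk_no] = mmap.mmap(fd, min(CHUNK_SIZE, remaining),
                                     access=mmap.ACCESS_READ, offset=start)
    within = offset - chunk_no * CHUNK_SIZE
    return chunks[chunk_no][within:within + BLOCK_SIZE]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "segment")
        with open(path, "wb") as f:
            f.truncate(2 * CHUNK_SIZE)          # 16 MB file, mostly holes
            f.seek(1500 * BLOCK_SIZE)           # block 1500 lies in the second chunk
            f.write(b"x" * BLOCK_SIZE)
        fd = os.open(path, os.O_RDONLY)
        chunks = {}
        try:
            first = read_block(fd, 0, chunks)
            later = read_block(fd, 1500, chunks)
            again = read_block(fd, 1501, chunks)    # same chunk, no new mapping
            return first, later, again, sorted(chunks)
        finally:
            for m in chunks.values():
                m.close()
            os.close(fd)


if __name__ == "__main__":
    first, later, again, mapped = demo()
    print("chunks mapped:", mapped)
```

The sketch says nothing about write ordering or address-space limits, which are the objections raised elsewhere in the thread.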
On Fri, 19 Jul 2002, Jan Wieck wrote:

> I still don't completely understand what you are proposing. What I
> understood so far is that you want to avoid double buffering (OS buffer
> plus SHMEM). Wouldn't that require that the access to a block in the
> file (table, index, sequence, ...) has to go directly through a mmapped
> region of that file?

Right.

> Let's create a little test case to discuss. I have two tables, 2
> Gigabyte in size each (making 4 segments of 1 GB total) plus a 512 MB
> index for each. Now I join them in a query, that results in a nestloop
> doing index scans.
>
> On a 32 bit system you cannot mmap both tables plus the indexes at the
> same time completely. But the access of the execution plan is reading
> one table's index, fetching the heap tuples from it by random access, and
> inside of that loop doing the same for the second table. So chances are,
> that this plan randomly peeks around in the entire 5 Gigabyte, at least
> you cannot predict which blocks it will need.

Well, you can certainly predict the index blocks. So after some initial
reads to get to the bottom level of the index, you might map a few
megabytes of it contiguously because you know you'll need it. While
you're at it, you can inform the OS you're using it sequentially (so it
can do read-ahead--even though it otherwise looks to the OS like the
process is doing random reads) by doing an madvise() with MADV_SEQUENTIAL.

> So far so good. Now what do you map when? Can you map multiple
> noncontiguous 8K blocks out of each file?

Sure. You can just map one 8K block at a time, and when you've got lots
of mappings, start dropping the ones you've not used for a while,
LRU-style. How many mappings you want to keep "cached" for your process
would depend on the overhead of doing the map versus the overhead of
having a lot of system calls. Personally, I think that the overhead of
having tons of mappings is pretty low, but I'll have to read through
some kernel code to make sure. At any rate, it's no problem to change
the figure depending on any factor you like.

> If so, how do you coordinate that all backends in summary use at
> maximum the number of blocks you want PostgreSQL to use....

You don't. Just map as much as you like; the operating system takes
care of what blocks will remain in memory or be written out to disk (or
dropped if they're clean), bringing in a block from disk when you
reference one that's not currently in physical memory, and so on.

> And if a backend needs a block and the max is reached already, how
> does it tell the other backends to unmap something?

You don't. The mappings are completely separate for every process.

> I assume I am missing something very important here....

Yeah, you're missing that the OS does all of the work for you. :-)

Of course, this only works on systems with a POSIX mmap, which those
particular HP systems Tom mentioned obviously don't have. For those
systems, though, I expect running as a 64-bit program fixes the problem
(because you've got a couple billion times as much address space). But
if postgres runs on 32-bit systems with the same restrictions, we'd
probably just have to keep the option of using read/write instead, and
take the performance hit that we do now.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
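A rough sketch of the per-process "cache of mappings" proposed above, again in Python for brevity. The class name `MappingCache` and its `capacity` parameter are invented for this illustration. Each block gets its own mapping, and once the cache is full the least recently used mapping is unmapped first. One detail the message does not mention: mmap offsets must be aligned to the system's allocation granularity, which can be larger than 8K, so the sketch rounds the offset down and remembers the slack.

```python
import mmap
import os
import tempfile
from collections import OrderedDict

BLOCK_SIZE = 8192


class MappingCache:
    """Keep at most `capacity` block mappings, unmapping the least recently used."""

    def __init__(self, fd, capacity):
        self.fd = fd
        self.capacity = capacity
        self.maps = OrderedDict()     # block number -> (mmap object, slack bytes)
        self.evictions = 0

    def get(self, block_no):
        if block_no in self.maps:
            self.maps.move_to_end(block_no)          # mark as most recently used
        else:
            if len(self.maps) >= self.capacity:
                _, (old, _slack) = self.maps.popitem(last=False)
                old.close()                          # munmap() the coldest block
                self.evictions += 1
            offset = block_no * BLOCK_SIZE
            # Round down to the allocation granularity and keep the remainder.
            base = offset - offset % mmap.ALLOCATIONGRANULARITY
            slack = offset - base
            m = mmap.mmap(self.fd, slack + BLOCK_SIZE,
                          access=mmap.ACCESS_READ, offset=base)
            self.maps[block_no] = (m, slack)
        m, slack = self.maps[block_no]
        return m[slack:slack + BLOCK_SIZE]

    def close(self):
        for m, _slack in self.maps.values():
            m.close()
        self.maps.clear()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "relation")
        with open(path, "wb") as f:
            for i in range(8):
                f.write(bytes([i]) * BLOCK_SIZE)     # block i is filled with byte i
        fd = os.open(path, os.O_RDONLY)
        cache = MappingCache(fd, capacity=3)
        try:
            blocks = [cache.get(n) for n in (0, 1, 2, 0, 3)]
            return blocks, list(cache.maps), cache.evictions
        finally:
            cache.close()
            os.close(fd)


if __name__ == "__main__":
    _blocks, resident, evictions = demo()
    print("resident blocks:", resident, "evictions:", evictions)
```

In the demo, block 1 is the one evicted, because block 0 was touched again before block 3 was requested. How large `capacity` should be is exactly the open question in the message: it trades mapping overhead against system-call overhead.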
On Sat, 20 Jul 2002, Martijn van Oosterhout wrote:

> Well, you would have to deal with the fact that writing changes to a mmap()
> is allowed, but you have no guarantee when it will be finally written. Given
> WAL I would suggest using mmap() for reading only and using write() to
> update the file.

You can always do an msync to force a block out. But I don't think
you'd ever bother; the transaction log is the only thing for which
you need to force writes, and that's probably better done with
regular file I/O (read/write) anyway.

The real problem is that you can't make sure a block is *not* written
until you want it to be, which is why you need to write the log entry
before you can update the block.

> If in that process the kernel needed to throw out another page, who
> cares? If another backend needs that page it'll get read back in.

Right. And all of the standard kernel strategies for deciding which
blocks to throw out will be in place, so commonly hit pages will be
thrown out after more rarely hit ones.

You also have the advantage that if you're doing, say, a sequential
scan, you can madvise the pages MADV_WILLNEED when you first map them,
and madvise each one MADV_DONTNEED after you're done with it, and avoid
blowing out your entire buffer cache and replacing it with data you
know you're not likely to read again any time soon.

> One case where this would be useful would be i386 machine with 64GB of
> memory. Then you are in effect simply mapping different parts of the cache
> at different times. No blocks are copied *ever*.

Right.

> It is different. I believe you would still need some form of shared memory
> to co-ordinate write()s.

Sure. For that, you can just mmap an anonymous memory area and share it
amongst all your processes, or use sysv shared memory.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
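The madvise pattern for a sequential scan, as described in the message above, might look roughly like this. It is a sketch under assumptions: it uses Python's `mmap.madvise`, which exists only on Python 3.8+ and on platforms that provide madvise(2), and the helper names `advise` and `sequential_scan` are made up. The advice calls are hints, so the sketch silently skips them where they are unavailable or rejected.

```python
import mmap
import os
import tempfile

BLOCK_SIZE = 8192


def advise(mapping, name, start, length):
    """Apply madvise() advice if this platform and Python build expose it."""
    option = getattr(mmap, name, None)
    if option is None or not hasattr(mapping, "madvise"):
        return False
    try:
        mapping.madvise(option, start, length)
    except OSError:
        return False                 # advice is only a hint; ignore refusals
    return True


def sequential_scan(path):
    """Sum every byte of the file, telling the kernel what we need and when."""
    total = 0
    # madvise() start offsets must be page-aligned, so step in units that are
    # multiples of both the block size and the page size.
    step = max(BLOCK_SIZE, mmap.PAGESIZE)
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        size = len(m)
        advise(m, "MADV_WILLNEED", 0, size)            # ask for read-ahead up front
        for start in range(0, size, step):
            length = min(step, size - start)
            total += sum(m[start:start + length])      # stand-in for real tuple work
            advise(m, "MADV_DONTNEED", start, length)  # finished with these pages
    return total


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "heap")
        with open(path, "wb") as f:
            f.write(b"\x01" * (4 * BLOCK_SIZE))
        return sequential_scan(path)


if __name__ == "__main__":
    print("sum of bytes:", demo())
```

Whether MADV_DONTNEED actually keeps the scan from displacing hotter pages depends on the kernel; the sketch only shows where the calls would go.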
On Fri, 19 Jul 2002, Tom Lane wrote:

> Just to throw some additional wrenches into the gears: some platforms
> (eg HPPA) have strong restrictions on where you can mmap stuff.
> I quote some interesting material from the HPUX mmap(2) man page below.
> Possibly these restrictions could be worked around, but it looks
> painful.

Very painful indeed. Probably it would be much easier to build a little
mmap-type interface on top of the current system and use that instead of
mmap on such a system. I wonder how many other systems are this screwed up?

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
Martijn van Oosterhout <kleptog@svana.org> writes:
> Well, you would have to deal with the fact that writing changes to a mmap()
> is allowed, but you have no guarantee when it will be finally written. Given
> WAL I would suggest using mmap() for reading only and using write() to
> update the file.
This is surely NOT workable; every mmap man page I've looked at is very
clear that you cannot expect predictable behavior if you use both
filesystem and mmap access to the same file.  For instance, HP says
     It is also unspecified whether write references to a memory region
     mapped with MAP_SHARED are visible to processes reading the file and
     whether writes to a file are visible to processes that have mapped the
     modified portion of that file, except for the effect of msync().
So unless you want to msync after every write I do not think this can fly.
> If in that process the kernel needed
> to throw out another page, who cares?
We do, because we have to control write ordering.
> It is different. I believe you would still need some form of shared memory
> to co-ordinate write()s.
The whole idea becomes less workable the more we look at it.
            regards, tom lane
			
Curt Sampson <cjs@cynic.net> writes:
> You can always do an msync to force a block out. But I don't think
> you'd ever bother; the transaction log is the only thing for which
> you need to force writes,
This is completely wrong.  One of the fatal problems with mmap is that
there's no equivalent of sync(), but only fsync --- a process can only
msync those mmap regions that it has itself got mmap'd.  That means that
CHECKPOINT doesn't work, because it has no way to force out all
currently dirty data blocks before placing a "checkpoint done" record
in WAL.  You could perhaps get around that with centralized control of
mmap'ing, but at that point you've just got a klugy, not-very-portable
reimplementation of the existing shared buffer manager.
> Sure. For that, you can just mmap an anonymous memory area and share it
> amongst all your processes, or use sysv shared memory.
AFAICS the only thing we'll ever use mmap for is as a direct substitute
for SYSV shared memory on platforms that haven't got it.
            regards, tom lane
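To make the checkpoint objection concrete, here is a small sketch of what "msync what you yourself have mapped" looks like. It is an illustration only: the class `MappedFile` and its methods are hypothetical, and Python's `mmap.flush()` stands in for msync(). The point to notice is that `checkpoint()` can only flush blocks dirtied through this process's own mapping; pages dirtied by another process through its own mapping are out of reach, which is why some central coordination would be needed.

```python
import mmap
import os
import tempfile

BLOCK_SIZE = 8192


class MappedFile:
    """One process's writable shared view of a file, with dirty-block tracking."""

    def __init__(self, path):
        self.file = open(path, "r+b")
        # ACCESS_WRITE gives a shared, write-through mapping (MAP_SHARED on Unix).
        self.map = mmap.mmap(self.file.fileno(), 0, access=mmap.ACCESS_WRITE)
        self.dirty = set()

    def write_block(self, block_no, data):
        start = block_no * BLOCK_SIZE
        self.map[start:start + BLOCK_SIZE] = data
        self.dirty.add(block_no)

    def checkpoint(self):
        """Flush this process's dirty blocks; returns how many were covered."""
        count = len(self.dirty)
        if count:
            self.map.flush()         # msync() over the whole mapping
            self.dirty.clear()
        return count

    def close(self):
        self.map.close()
        self.file.close()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "relation")
        with open(path, "wb") as f:
            f.write(bytes(4 * BLOCK_SIZE))
        mf = MappedFile(path)
        try:
            mf.write_block(2, b"\x07" * BLOCK_SIZE)
            flushed = mf.checkpoint()
        finally:
            mf.close()
        with open(path, "rb") as f:
            f.seek(2 * BLOCK_SIZE)
            on_disk = f.read(BLOCK_SIZE)
        return flushed, on_disk


if __name__ == "__main__":
    flushed, _ = demo()
    print("blocks flushed by this process:", flushed)
```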
			
On Sat, Jul 20, 2002 at 09:09:59AM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > Well, you would have to deal with the fact that writing changes to a mmap()
> > is allowed, but you have no guarantee when it will be finally written. Given
> > WAL I would suggest using mmap() for reading only and using write() to
> > update the file.
>
> This is surely NOT workable; every mmap man page I've looked at is very
> clear that you cannot expect predictable behavior if you use both
> filesystem and mmap access to the same file. For instance, HP says
>
>      It is also unspecified whether write references to a memory region
>      mapped with MAP_SHARED are visible to processes reading the file and
>      whether writes to a file are visible to processes that have mapped the
>      modified portion of that file, except for the effect of msync().
>
> So unless you want to msync after every write I do not think this can fly.

Well of course. The entire speed improvement is based on the fact that mmap()
is giving you a window into the system disk cache. If the OS isn't built
that way then it's not going to work. It does work on Linux and is fairly
easy to test for. I've even attached a simple program to try it out.

Of course it's not complete. You'd need to try multiple processes to see what
happens, but I'd be interested how diverse the mmap() implementations are.

> > If in that process the kernel needed
> > to throw out another page, who cares?
>
> We do, because we have to control write ordering.

Which is why you use write() to control that.

> > It is different. I believe you would still need some form of shared memory
> > to co-ordinate write()s.
>
> The whole idea becomes less workable the more we look at it.

I guess this is one of those cases where working code would be needed to
convince anybody. In the hypothetical case someone had time, the appropriate
place to add this would be src/backend/storage/buffer, since all buffer
loads go through there, right?

The only other question is whether there is any way to know when a buffer
will be modified. I get the impression sometimes bits are twiddled without
the buffer being marked dirty.

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.
Whoopsie. Here's the program :)

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.
Attachments
Martijn van Oosterhout <kleptog@svana.org> writes:
>> For instance, HP says
>>
>> It is also unspecified whether write references to a memory region
>> mapped with MAP_SHARED are visible to processes reading the file and
>> whether writes to a file are visible to processes that have mapped the
>> modified portion of that file, except for the effect of msync().
>>
>> So unless you want to msync after every write I do not think this can fly.
> Well of course. The entire speed improvement is based on the fact that mmap()
> is giving you a window into the system disk cache. If the OS isn't built
> that way then it's not going to work. It does work on Linux and is fairly
> easy to test for.
You mean "today, using the kernel version I have and allowing for the
current phase of the moon, it appears to work".  I read the following in
the mmap(2) man page supplied with RH Linux 7.2:
       MAP_SHARED Share this mapping  with  all  other  processes
                  that map this object.  Storing to the region is
                  equivalent to writing to the  file.   The  file
                  may  not  actually be updated until msync(2) or
                  munmap(2) are called.
That last seems to make it fairly clear that the mmap region should not
be taken as exactly equivalent to the file, despite the preceding
sentence.  I also note a complete lack of commitment as to whether
writes to the file reflect instantly into the mmap'd region...
            regards, tom lane
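Whether write() updates show up in an existing shared mapping is something one can at least probe on a given system. The following is a sketch of such a probe, not the program attached earlier in the thread (which is not reproduced here). The function name `probe_write_visibility` and the constants are invented. It writes single bytes with write() and checks whether a read-only shared mapping of the same file sees them immediately. An empty result only says the behaviour was not observed to fail on this kernel and filesystem; it is not the guarantee that the man pages decline to give.

```python
import mmap
import os
import tempfile

SIZE = 64 * 1024


def probe_write_visibility(path, rounds=64):
    """write() to a file; return offsets where a shared mapping looked stale."""
    stale = []
    fd = os.open(path, os.O_RDWR)
    try:
        with mmap.mmap(fd, SIZE, access=mmap.ACCESS_READ) as m:
            for i in range(rounds):
                offset = (i * 997) % SIZE        # scatter writes across pages
                value = (i % 255) + 1            # never zero, unlike the initial data
                os.lseek(fd, offset, os.SEEK_SET)
                os.write(fd, bytes([value]))
                if m[offset] != value:           # does the mapping see the write?
                    stale.append(offset)
    finally:
        os.close(fd)
    return stale


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "probe")
        with open(path, "wb") as f:
            f.write(bytes(SIZE))
        return probe_write_visibility(path)


if __name__ == "__main__":
    stale = demo()
    print("stale offsets:", [hex(o) for o in stale])
```

A single-process probe like this is also weaker than what was asked for in the thread, which was to try it with multiple processes.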
			
Here's a ridiculous hack of Martijn's program that runs on Windows (Win2K
in my case), using the sorta-mmap-like calls in Windows.

Several runs on my box produced errors at offsets 0x048D and 0x159E.

Glen Parker.
Attachments
Curt Sampson wrote:
>
> On Fri, 19 Jul 2002, Tom Lane wrote:
>
> > Just to throw some additional wrenches into the gears: some platforms
> > (eg HPPA) have strong restrictions on where you can mmap stuff.
> > I quote some interesting material from the HPUX mmap(2) man page below.
> > Possibly these restrictions could be worked around, but it looks
> > painful.
>
> Very painful indeed. Probably it would be much easier to build a little
> mmap-type interface on top of the current system and use that instead of
> mmap on such a system. I wonder how many other systems are this screwed up?

I have some more wrinkles to iron out as well. We can hold blocks of
hundreds of different files in our buffer cache without the need to keep
an open file descriptor (there is a reason for our VFD system). Access
to those blocks requires a spinlock and hash lookup in the buffer cache.
In a complicated schema where you cannot keep all files open anymore,
access to your kernel buffered blocks requires open(), mmap(), munmap()
and close() then? Four system calls to get access to a cached block
where we get away with a TAS today?

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Mon, 22 Jul 2002, Jan Wieck wrote:

> I have some more wrinkles to iron out as well. We can hold blocks of
> hundreds of different files in our buffer cache without the need to keep
> an open file descriptor...

As you can with mmap as well. The mapping remains after the file
descriptor is closed.

> In a complicated schema where you cannot keep all files open anymore...

You can keep just as many files open when you use mmap as you can with
the current scheme. The limit is the number of file descriptors you have
available.

> ...access to your kernel buffered blocks requires open(), mmap(), munmap()
> and close() then? Four system calls to get access to a cached block
> where we get away with a TAS today?

If you leave the block mapped, you do not need to do any of that. Your
limit on the number of mapped blocks is basically just the limit on
address space. So it would not be unreasonable to have your "cache" of
mappings be, say, a gigabyte in size. On systems which normally allocate
less shared memory than that, this would mean you would actually make
fewer syscalls than you would with shared memory.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
Curt Sampson wrote:
>
> On Mon, 22 Jul 2002, Jan Wieck wrote:
>
> > I have some more wrinkles to iron out as well. We can hold blocks of
> > hundreds of different files in our buffer cache without the need to keep
> > an open file descriptor...
>
> As you can with mmap as well. The mapping remains after the file
> descriptor is closed.

No resource limit on that? So my 200 backends all map some random
16,000 blocks from 400 files and the kernel juggles it like it
never did anything else?

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Tue, Jul 23, 2002 at 07:52:56AM -0400, Jan Wieck wrote:
> Curt Sampson wrote:
> >
> > On Mon, 22 Jul 2002, Jan Wieck wrote:
> >
> > > I have some more wrinkles to iron out as well. We can hold blocks of
> > > hundreds of different files in our buffer cache without the need to keep
> > > an open file descriptor...
> >
> > As you can with mmap as well. The mapping remains after the file
> > descriptor is closed.
>
> No resource limit on that? So my 200 backends all map some random
> 16,000 blocks from 400 files and the kernel jiggles with it like
> it's never did anything else?

My machine here is idling not doing much and the kernel is managing 53
processes with 223 open files currently caching some 52,000 blocks and 1140
mmap()ed areas. I'm sure if I actually did some work I could make that much
higher.

In short, if you have a machine capable of running 200 backends, I wouldn't
be surprised if the kernel was already managing that kind of load.
Remember, brk() is really just a special kind of mmap().

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.
Martijn van Oosterhout wrote:
>
> On Tue, Jul 23, 2002 at 07:52:56AM -0400, Jan Wieck wrote:
> >
> > No resource limit on that? So my 200 backends all map some random
> > 16,000 blocks from 400 files and the kernel jiggles with it like
> > it's never did anything else?
>
> My machine here is idling not doing much and the kernel is managing 53
> processes with 223 open files currently caching some 52,000 blocks and 1140
> mmap()ed areas. I'm sure if I actually did some work I could make that much
> higher.
>
> In short, if you have a machine capable of running 200 backends, I wouldn't
> be surprised if the kernel wasn't already managing that kind of load.
> Remember, brk() is really just a special kind of mmap().

Okay,

I'm running out of arguments here. So where is the patch?

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On Tue, 23 Jul 2002, Jan Wieck wrote:

> No resource limit on that? So my 200 backends all map some random
> 16,000 blocks from 400 files and the kernel jiggles with it like
> it's never did anything else?

Do you know how much of your process is already mmaped in exactly this
way? If you take a look at the source code for ld.so, you'll likely find
that all of the shared libraries your executable is using are mapped
into your process space with mmap. And in fact, it's all mmap under
the hood as well; the kernel effectively makes internal calls to mmap to
map the executable and ld.so into memory in the first place. On BSD
systems, at least, but probably others, all of the shared memory is just
an anonymous (i.e., using swap space as the backing store) bunch of
mmapped pages.

So in theory, having several hundred processes each with a gigabyte
or two of mmaped 8K blocks is not going to be a problem. However,
I've not tested this in practice (though I will do so one day) so
that's not to say it won't fail on certain OSes. In that case you'd
want to map much smaller amounts, and live with the extra syscall
overhead. But even so, it would still be far more efficient than
copying blocks into and out of shared memory.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
On Tue, 23 Jul 2002, Jan Wieck wrote:

> I'm running out of arguments here. So where is the patch?

Oh, I thought you were going to write it! :-)

This is something I'd like to work on, actually, but I doubt I'd have
enough time before autumn.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
Is it significant that these shared libraries remain mapped for the
duration of the process, while blocks are moved in and out?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
On Tue, 23 Jul 2002, Bruce Momjian wrote:
> Is it significant that these shared libraries remain mapped for the
> duration of the process, while blocks are moved in and out?
No; that's just syscall overhead to add and remove the mappings.
The only thing I would be worried about, really, would be OS overhead
beyond the standard page mapping tables to keep track of mmaped data.
This might limit you to keeping a "cache" of mappings of just a few
hundred or thousand, say, rather than a few hundred thousand. But this
would only cost us more syscalls, which are relatively inexpensive
(compared to things like memory copies) anyway.
cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
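To make the "cache of mappings" idea concrete, here is a minimal Python sketch of a bounded, least-recently-used cache of per-block mappings. It is not PostgreSQL code or a proposed patch. The class name MappingCache, the capacity of 2, and the scratch file are all made up for the illustration. The block size is rounded up to the platform's mmap granularity, because mmap offsets must be a multiple of it.

```python
import mmap
import tempfile
from collections import OrderedDict

# Hypothetical block size: 8K, or the mmap offset granularity if that is larger.
BLOCK = max(8192, mmap.ALLOCATIONGRANULARITY)

class MappingCache:
    """Keep at most `capacity` blocks of one file mapped, evicting in LRU order."""

    def __init__(self, fd, block_size, capacity):
        self.fd = fd
        self.block_size = block_size
        self.capacity = capacity
        self.maps = OrderedDict()   # block number -> mmap object, oldest first
        self.map_calls = 0          # how many times we had to create a mapping

    def get(self, blkno):
        m = self.maps.get(blkno)
        if m is not None:
            self.maps.move_to_end(blkno)      # mark as most recently used
            return m
        if len(self.maps) >= self.capacity:
            _, oldest = self.maps.popitem(last=False)
            oldest.close()                    # drop the oldest mapping
        m = mmap.mmap(self.fd, self.block_size, access=mmap.ACCESS_READ,
                      offset=blkno * self.block_size)
        self.map_calls += 1
        self.maps[blkno] = m
        return m

    def close(self):
        for m in self.maps.values():
            m.close()
        self.maps.clear()

# Demo on a scratch file of four blocks; block i is filled with the byte value i.
with tempfile.TemporaryFile() as f:
    for i in range(4):
        f.write(bytes([i]) * BLOCK)
    f.flush()                                 # the file must have its full size before mapping
    cache = MappingCache(f.fileno(), BLOCK, capacity=2)
    first_bytes = [cache.get(b)[0] for b in (0, 1, 0, 2, 1)]
    map_calls = cache.map_calls
    cache.close()

print(first_bytes, map_calls)
```

A smaller capacity means more map and unmap calls for the same access pattern, which is the extra syscall cost described above.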
Curt Sampson <cjs@cynic.net> writes:
> But this would only cost us more syscalls, which are relatively
> inexpensive (compared to things like memory copies) anyway.
Run that by me again?
I'd take memcpy over a kernel call any day.  If you want to assert
that the latter is cheaper, you'd better supply some evidence.
            regards, tom lane
			
Tom Lane wrote:
> Curt Sampson <cjs@cynic.net> writes:
> > But this would only cost us more syscalls, which are relatively
> > inexpensive (compared to things like memory copies) anyway.
>
> Run that by me again?
>
> I'd take memcpy over a kernel call any day.  If you want to assert
> that the latter is cheaper, you'd better supply some evidence.
I assume he meant memory copies from kernel to process address space.
--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
On Wed, 24 Jul 2002, Tom Lane wrote:
> Curt Sampson <cjs@cynic.net> writes:
> > But this would only cost us more syscalls, which are relatively
> > inexpensive (compared to things like memory copies) anyway.
>
> Run that by me again?
>
> I'd take memcpy over a kernel call any day.  If you want to assert
> that the latter is cheaper, you'd better supply some evidence.
So on my Athlon system:
    ironic $ /usr/pkg/bin/lmbench/mhz
    1533 MHz, 0.65 nanosec clock
    ironic $ /usr/pkg/bin/lmbench/lat_syscall null
    Simple syscall: 0.2026 microseconds
    ironic $ /usr/pkg/bin/lmbench/bw_mem 32m bcopy
    33.55 352.69
352.69 MB/sec works out to about 370 bytes per microsecond. Thus,
copying an 8K page should take a bit over 22 microseconds, enough
time for the overhead of 110 syscalls.
    ironic $ /usr/pkg/bin/lmbench/bw_mem 64k bcopy
    0.065536 2038.58
Even entirely in L1 cache, I still get only about 2138 bytes per
microsecond, thus taking almost 4 microseconds to copy a page, the same
as the overhead for about 19 syscalls.
On older systems, such as a P5-133, a syscall was about 4 microseconds,
and bcopy was 42 MB/sec, or 44 bytes per microsecond. Thus a page copy
was 186 microseconds, enough time for the overhead of only 46 syscalls.
I expect that the long, long trend of CPU speed growing faster than
memory speed will continue, and memory copies will become ever more
expensive relative to more CPU-bound activity such as syscalls.
All this aside, remember that when you copy a block under the
current system, you're doing it with read() or write(), and thus
paying the syscall overhead anyway.
cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
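The arithmetic in the message above can be replayed with a few lines of Python. This is only a sketch that uses the figures quoted there. The function names are made up, and the MB-to-bytes conversion (2**20 bytes per MB) is an assumption inferred from the "352.69 MB/sec ... about 370 bytes per microsecond" step, not something lmbench is known to guarantee.

```python
PAGE_BYTES = 8 * 1024  # one 8K block

def bytes_per_us(mb_per_sec):
    """Convert a bandwidth in MB/sec to bytes per microsecond (assumes 1 MB = 2**20 bytes)."""
    return mb_per_sec * 2**20 / 1e6

def copy_cost(mb_per_sec, syscall_us=0.2026, page=PAGE_BYTES):
    """Return (microseconds to copy one page, the same time expressed in syscalls)."""
    us = page / bytes_per_us(mb_per_sec)
    return us, us / syscall_us

# (label, bcopy bandwidth in MB/sec, simple syscall latency in microseconds)
cases = [
    ("Athlon, 32m bcopy", 352.69, 0.2026),
    ("Athlon, 64k bcopy", 2038.58, 0.2026),
    ("P5-133", 42.0, 4.0),
]

for label, bw, syscall_us in cases:
    us, n = copy_cost(bw, syscall_us)
    print(f"{label}: {us:.2f} us per 8K page, about {n:.0f} syscalls")
```

The ratio depends only on the two measured numbers, so the comparison can be redone for any machine where lat_syscall and bw_mem have been run.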
Curt Sampson <cjs@cynic.net> writes:
> ironic $ /usr/pkg/bin/lmbench/lat_syscall null
> Simple syscall: 0.2026 microseconds
> ironic $ /usr/pkg/bin/lmbench/bw_mem 32m bcopy
> 33.55 352.69
> 352.69 MB/sec works out to about 370 bytes per microsecond. Thus,
> copying an 8K page should take a bit over 22 microseconds, enough
> time for the overhead of 110 syscalls.
> ironic $ /usr/pkg/bin/lmbench/bw_mem 64k bcopy
> 0.065536 2038.58
> Even entirely in L1 cache, I still get only about 2138 bytes per
> microsecond, thus taking almost 4 microseconds to copy a page, the same
> as the overhead for about 19 syscalls.
Hm.  What's the particular syscall being used for reference here?
And how does it compare to the sorts of activities we'd actually be
concerned about (open, close, mmap)?
> All this aside, remember that when you copy a block under the
> current system, you're doing it with read() or write(), and thus
> paying the syscall overhead anyway.
Certainly.  But as you just demonstrated, the overhead of entering the
kernel is negligible compared to doing the copying (to say nothing of
any actual I/O required).  I'm not convinced that futzing with a
process' memory mapping tables is free, however ... especially not if
you're creating a large number of separate small mappings.  If mmap
provokes a TLB flush for your process, it's going to be expensive
(just how expensive will be hard to measure, too, since most of the
cycles will be expended after returning from mmap).
            regards, tom lane
			
On Sun, 28 Jul 2002, Tom Lane wrote:
> Hm.  What's the particular syscall being used for reference here?
It's a one-byte write() to /dev/null.
> And how does it compare to the sorts of activities we'd actually be
> concerned about (open, close, mmap)?
Well, I don't see that open and close are relevant, since that part of
the file handling would be exactly the same if you continued to use the
same file handle caching code we use now.
lmbench does have a test for mmap latency which tells you how long it
takes, on average, to mmap the first given number of bytes of a file.
Unfortunately, it's not giving me output for anything smaller than about
half a megabyte (perhaps because it's too fast to measure accurately?),
but here are the times, in microseconds, for sizes from that to 1 GB on
my 1533 MHz Athlon:
    size (MB)      time (microseconds)
    0.524288       7.688
    1.048576       15
    2.097152       22
    4.194304       40
    16.777216      169
    33.554432      358
    67.108864      740
    134.217728     2245
    268.435456     5080
    536.870912     9971
    805.306368     14927
    1073.741824    19898
It seems roughly linear, so I'm guessing that an 8k mmap would be
around 0.1-0.2 microseconds, or roughly the cost of one simple syscall.
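The extrapolation can be checked by scaling each row of the table down to a single 8K block. The Python sketch below does only that. It assumes the first column is in units of 10**6 bytes (so 0.524288 is 512 KB), and it assumes cost proportional to size with no fixed per-call overhead, which presumably makes it a lower bound rather than a prediction. For comparison, the simple syscall measured earlier in the thread took 0.2026 microseconds.

```python
# (size in MB, mmap latency in microseconds), copied from the lmbench output above
SAMPLES = [
    (0.524288, 7.688), (1.048576, 15), (2.097152, 22), (4.194304, 40),
    (16.777216, 169), (33.554432, 358), (67.108864, 740),
    (134.217728, 2245), (268.435456, 5080), (536.870912, 9971),
    (805.306368, 14927), (1073.741824, 19898),
]
BLOCK_BYTES = 8192

def per_block_estimates(samples, block_bytes=BLOCK_BYTES):
    """Scale every measured (size, time) pair linearly down to one block."""
    estimates = []
    for size_mb, usec in samples:
        size_bytes = size_mb * 1e6            # assumption: sizes are in 10**6 bytes
        estimates.append(usec * block_bytes / size_bytes)
    return estimates

estimates = per_block_estimates(SAMPLES)
low, high = min(estimates), max(estimates)
print(f"naive 8K estimate: {low:.3f} to {high:.3f} microseconds")
```

The spread between rows shows how rough "roughly linear" is; a direct measurement at 8K would settle it.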
Really, I need to write a better benchmark for this. I'm a bit busy
this week, but I'll try to find time to do that.
Keep in mind, though, that mmap is generally quite heavily optimized,
because it's so heavily used. Almost all programs in the system are
dynamically linked (on some systems, such as Linux and Solaris, they
essentially all are), and thus they all use mmap to map in their
libraries.
> I'm not convinced that futzing with a process' memory mapping tables
> is free, however ... especially not if you're creating a large number
> of separate small mappings.
It's not free, no. But then again, memory copies are really, really
expensive.
In NetBSD, at least, you probably don't want to keep a huge number of
mappings around because they're stored as a linked list (ordered by
address) that's searched linearly when you need to add or delete a
mapping (though there's a hint for the most recent entry).
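As a toy model of why a long list of mappings gets expensive, here is an address-ordered singly linked list with a "most recent entry" hint, written in Python. It is not NetBSD's code and makes no claim about how the kernel actually implements its hint. The names Mapping, MappingList, and steps are invented for the illustration, and overlap checks and deletion are left out.

```python
import random

class Mapping:
    def __init__(self, start, length):
        self.start = start
        self.length = length
        self.next = None

class MappingList:
    """Address-ordered linked list; `hint` remembers the most recently added node."""

    def __init__(self):
        self.head = None
        self.hint = None
        self.steps = 0   # nodes walked past during inserts, to expose the linear cost

    def insert(self, start, length):
        node = Mapping(start, length)
        # Start the search at the hint if the new address is at or beyond it,
        # otherwise fall back to walking from the head of the list.
        prev = self.hint if (self.hint is not None and self.hint.start <= start) else None
        cur = prev.next if prev is not None else self.head
        while cur is not None and cur.start < start:
            self.steps += 1
            prev, cur = cur, cur.next
        node.next = cur
        if prev is None:
            self.head = node
        else:
            prev.next = node
        self.hint = node
        return node

    def addresses(self):
        out, cur = [], self.head
        while cur is not None:
            out.append(cur.start)
            cur = cur.next
        return out

N, BLOCK = 1000, 8192

sequential = MappingList()
for i in range(N):
    sequential.insert(i * BLOCK, BLOCK)       # ascending addresses: the hint always helps

scattered = MappingList()
order = list(range(N))
random.Random(0).shuffle(order)
for i in order:
    scattered.insert(i * BLOCK, BLOCK)        # random addresses: long walks

print("sequential steps:", sequential.steps)
print("scattered steps:", scattered.steps)
```

With mappings added at random addresses the walk length grows with the number of entries, which is the argument for keeping the cache of mappings small.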
> If mmap provokes a TLB flush for your process, it's going to be
> expensive (just how expensive will be hard to measure, too, since most
> of the cycles will be expended after returning from mmap).
True enough, though blowing out your cache with copies is also not
cheap. But measuring this should not be hard; writing a little
program to do a bunch of copies versus a bunch of mmaps of random blocks
from a file should only be a couple of hours' work. I'll work on this in my
spare time and report the results.
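A rough shape for such a test program, sketched in Python rather than C: read random blocks with pread() (which copies into a fresh buffer) versus mapping and unmapping the same blocks with mmap. This is not the benchmark promised above and its numbers should not be taken as evidence either way, since interpreter overhead will likely dominate both loops and the scratch file will sit entirely in the OS cache. The file size, round count, and function names are arbitrary choices for the sketch, and os.pread limits it to POSIX systems.

```python
import mmap
import os
import random
import tempfile
import time

# 8K blocks, or the mmap offset granularity if the platform's is larger.
BLOCK = max(8192, mmap.ALLOCATIONGRANULARITY)
NBLOCKS = 256     # scratch file size in blocks (arbitrary)
ROUNDS = 2000     # random block accesses per method (arbitrary)

def make_file(nblocks=NBLOCKS):
    """Create a scratch file where block i is filled with the byte value i % 256."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for i in range(nblocks):
            f.write(bytes([i % 256]) * BLOCK)
        return f.name

def bench_copy(fd, offsets):
    """Copy each block into a new buffer with pread() and touch its first byte."""
    t0 = time.perf_counter()
    total = 0
    for off in offsets:
        buf = os.pread(fd, BLOCK, off)
        total += buf[0]
    return time.perf_counter() - t0, total

def bench_mmap(fd, offsets):
    """Map each block, touch its first byte, then unmap it again."""
    t0 = time.perf_counter()
    total = 0
    for off in offsets:
        m = mmap.mmap(fd, BLOCK, access=mmap.ACCESS_READ, offset=off)
        total += m[0]
        m.close()
    return time.perf_counter() - t0, total

path = make_file()
fd = os.open(path, os.O_RDONLY)
try:
    rng = random.Random(0)
    offsets = [rng.randrange(NBLOCKS) * BLOCK for _ in range(ROUNDS)]
    copy_secs, copy_sum = bench_copy(fd, offsets)
    mmap_secs, mmap_sum = bench_mmap(fd, offsets)
finally:
    os.close(fd)
    os.unlink(path)

print(f"pread copies: {copy_secs * 1e6 / ROUNDS:.2f} us per block")
print(f"mmap/munmap:  {mmap_secs * 1e6 / ROUNDS:.2f} us per block")
```

Both loops touch only the first byte of each block, so the mmap side pays for at most one page fault per block; a fairer comparison would read the whole block in both cases.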
cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC