Discussion: [Fwd: Re: 8192 BLCKSZ ?]
Tom Samplonius wrote:
> On Tue, 28 Nov 2000, mlw wrote:
> > Tom Samplonius wrote:
> > > On Mon, 27 Nov 2000, mlw wrote:
> > > > This is just a curiosity.
> > > >
> > > > Why is the default postgres block size 8192? These days, with caching
> > > > file systems, high speed DMA disks, hundreds of megabytes of RAM, maybe
> > > > even gigabytes. Surely, 8K is inefficient.
> > >
> > > I think it is a pretty wild assumption to say that 32k is more efficient
> > > than 8k. Considering how blocks are used, 32k may in fact be quite a bit
> > > slower than 8k blocks.
> >
> > I'm not so sure I agree. Perhaps I am off base here, but I did a bit of
> > OS profiling a while back when I was doing a DICOM server. I
> > experimented with block sizes and found that the best throughput on
> > Linux and Windows NT was at 32K. The graph I created showed a steady
> > increase in performance and a drop just after 32K, then steady from
> > there. In Windows NT it was more pronounced than it was in Linux, but
> > Linux still exhibited a similar trait.
>
> You are a bit off base here. The typical access pattern is random IO,
> not sequential. If you use a large block size in Postgres, Postgres
> will read and write more data than necessary. Which is faster? 1000 x 8K
> IOs? Or 1000 x 32K IOs?

I can sort of see your point, but the relationship between 8K and 32K is not
linear. The big hit is the disk I/O operation itself, more so than the amount
of data. It may be almost as efficient to write 32K as it is to write 8K.
While I do not know the exact numbers, and it varies by OS and disk
subsystem, I am sure that writing 32K is not even close to 4x more expensive
than writing 8K. Think about seek times: writing anything to the disk is
expensive regardless of the amount of data. Most disks today have many heads
and are RLL encoded. The extra data may only add 10us (approx. 1-2 sectors of
a 64 sector drive spinning at 7200 rpm) to a disk operation that takes an
order of magnitude longer positioning the heads.

The overhead of an additional 24K is minute compared to the cost of a disk
operation. So if any measurable benefit can come from having bigger buffers,
i.e. having more data available per disk operation, it will probably be
faster.
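[A back-of-the-envelope model may make the argument above concrete. The
decomposition and its symbols are mine, not something from the thread. The
cost of one random disk access is roughly

\[ T_{io} \approx T_{seek} + T_{rot} + B / R \]

where T_{seek} is head positioning time, T_{rot} is rotational latency (on
average half a revolution, about 4.2 ms at 7200 rpm), B is the amount of data
transferred and R is the sustained media transfer rate. Only the last term
changes when B grows from 8K to 32K, so how close "almost as efficient" comes
to the truth depends entirely on how large B/R is relative to the positioning
terms for the particular drive.]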
Kevin O'Gorman wrote:
> mlw wrote:
> > [...]
> >
> > The overhead of an additional 24K is minute compared to the cost of a
> > disk operation. So if any measurable benefit can come from having bigger
> > buffers, i.e. having more data available per disk operation, it will
> > probably be faster.
>
> This is only part of the story. It applies best when you're going
> to use sequential scans, for instance, or otherwise use all the info
> in any block that you fetch. However, when your blocks are 8x bigger,
> your number of blocks in the disk cache is 8x fewer. If you're
> accessing random blocks, your hopes of finding the block in the
> cache are affected (probably not 8x, but there is an effect).
>
> So don't just blindly think that bigger blocks are better. It
> ain't necessarily so.

First, the difference between 8K and 32K is a factor of 4, not 8. The problem
is that you are looking at these numbers as if there were a linear
relationship between the 8 and the 32. You are thinking that 8 is 1/4 the
size of 32, so it must be 1/4 the amount of work. This is not true at all.

Many operating systems use a fixed memory block size allocation for their
disk cache. They do not allocate a new block for every disk request; they
maintain a pool of fixed-sized buffer blocks. So if you use fewer bytes than
the OS block size, you waste the difference between your block size and the
block size of the OS cache entry.

I'm pretty sure Linux uses a 32K buffer size in its cache, and I'm pretty
confident from my previous tests that NT does as well. So, in effect, an 8K
block may waste 3/4 of the memory in the disk cache.

http://www.mohawksoft.com
Kevin O'Gorman wrote:
> mlw wrote:
> > Many operating systems use a fixed memory block size allocation for
> > their disk cache. They do not allocate a new block for every disk
> > request; they maintain a pool of fixed-sized buffer blocks. So if you
> > use fewer bytes than the OS block size you waste the difference between
> > your block size and the block size of the OS cache entry.
> >
> > I'm pretty sure Linux uses a 32K buffer size in its cache, and I'm
> > pretty confident that NT does as well from my previous tests.
>
> I dunno about NT, but here's a quote from "Linux Kernel Internals"
> 2nd Ed, page 92-93:
>
>   ... The block size for any given device may be 512, 1024, 2048 or
>   4096 bytes. ...
>
>   ... the buffer cache manages individual block buffers of
>   varying size. For this, every block is given a 'buffer_head' data
>   structure. ... The definition of the buffer head is in linux/fs.h
>
>   ... the size of this area exactly matches the block size 'b_size' ...
>
> The quote goes on to describe how the data structures are designed to
> be processor-cache-aware.

I double checked the kernel source, and you are right. I stand corrected
about the disk caching.

My assertion stands: it is a negligible difference to read 32K vs 8K from a
disk, and the probability of the data you want being within a 4 times larger
block is 4 times better, even though the probability of having the correct
block in memory is 4 times less. So I don't think it is a numerically
significant issue.

--
http://www.mohawksoft.com
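[As a side note on what block size the OS itself prefers: one quick way to
check (this little program is mine, not something posted in the thread) is to
look at the st_blksize field that stat(2) reports, which is the filesystem's
suggested I/O transfer size.

/*
 * Hedged sketch, not code from the thread: print the I/O block size the
 * filesystem suggests for a file, via the st_blksize field of stat(2).
 */
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat st;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    if (stat(argv[1], &st) != 0)
    {
        perror(argv[1]);
        return 1;
    }
    printf("%s: preferred I/O block size = %ld bytes\n",
           argv[1], (long) st.st_blksize);
    return 0;
}

On the Linux filesystems of that era this should typically print something in
the 1K-4K range, consistent with the book quote above; treat that expectation
as an assumption rather than a measurement.]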
Kevin O'Gorman wrote:
> mlw wrote:
> > [...]
> >
> > My assertion stands: it is a negligible difference to read 32K vs 8K
> > from a disk, and the probability of the data you want being within a 4
> > times larger block is 4 times better, even though the probability of
> > having the correct block in memory is 4 times less. So I don't think it
> > is a numerically significant issue.
>
> My point is that it's going to depend strongly on what you're doing.
> If you're getting only one item from each block, you pay a cost in cache
> flushing even if the disk I/O time isn't much different. You're carrying
> 3x unused bytes and displacing other, possibly useful, things from the
> cache.
>
> So whether it's a good thing or not is something you have to measure, not
> argue about, because it will vary depending on your workload. That's
> where a DBA begins to earn his/her pay.

I would tend to disagree "in general." One can always find more optimal ways
to search data if one knows the nature of the data and the nature of the
search beforehand. The nature of the data could be knowledge of whether it is
sorted along the lines of the type of search you want to do. It could be
knowledge of the entirety of the data, and so on.

The cost difference between 32K and 8K disk reads/writes is so small these
days, compared with the overall cost of the disk operation itself, that you
can barely measure it: well below 1%. Remember that the seek times advertised
for disks are an average.

SQL itself is a compromise between a hand-coded search program and a general
purpose solution. As a general purpose search system, one cannot conclude
that data is less likely to be in a larger block than it is to be in a
smaller block that remains in cache. There are just as many cases where one
could make an argument one way versus the other based on the nature of the
data and the nature of the search.

That being said, memory DIMMs are 256M for $100 and time is priceless. The 8K
default has been there as long as I can remember having to think about it,
and only recently did I learn it can be changed. I have been using Postgres
since about 1996. I argue that reading or writing 32K is, for all practical
purposes, not measurably different from 8K.

The sole point in your argument is that with a 4x larger block you have a 1/4
chance that the block will be in memory. I argue that with a 4x greater block
size you have a 4x greater chance that the data will be in a given block, and
that this offsets the 1/4 chance of something being in cache. The likelihood
of something being in a cache is directly proportional to the ratio of the
size of the whole object being cached to the size of the cache itself, plus
the algorithm used to decide what remains in cache; typically a combination
of LRU, frequency, and some predictive analysis. Small databases may, in
fact, reside entirely in the disk cache because of the amount of RAM on
modern machines. Large databases cannot be entirely cached, and only some
small percentage of them will be in cache. Depending on the "randomness" of
the search criteria, the probability of the item you wish to locate being in
cache has, as far as I can see, little to do with the block size.

I am going to see if I can get some time together this weekend and see if the
benchmark programs measure a difference between block sizes, and if so,
compare. I will try to test 8K, 16K, 24K and 32K.

--
http://www.mohawksoft.com
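[Since mlw mentions above that the 8K default can be changed: in PostgreSQL
sources of that era the page size is a compile-time constant. My recollection
(an assumption worth verifying against your own tree) is that the define
lives in src/include/config.h. A sketch of the change:

/*
 * Hedged sketch, not a patch from the thread: raise the page size to 32K.
 * The exact header is from memory (src/include/config.h in 7.0-era trees).
 * The value must be a power of 2 and, because of the on-page item pointer
 * layout, cannot exceed 32768.  A rebuild and a fresh initdb are required
 * afterwards, since existing data files keep the old block size.
 */
#define BLCKSZ  32768    /* default: 8192 */
]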
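[And since a weekend benchmark is mentioned just above, here is a minimal
sketch of the kind of test I would expect (my own sketch with made-up
defaults, not anything mlw posted): time a fixed number of random,
block-aligned reads of a given size from a large file, once per candidate
block size.

/*
 * Hedged sketch, not code from the thread: time N random block-aligned
 * reads of a given size from an existing file.
 *
 *   usage: randread <file> <blocksize> [nreads]
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    struct stat     st;
    struct timeval  t0, t1;
    char           *buf;
    const char     *path;
    long            nreads = 1000;
    long            blksz;
    long            i;
    int             fd;
    double          elapsed;

    if (argc < 3)
    {
        fprintf(stderr, "usage: %s file blocksize [nreads]\n", argv[0]);
        return 1;
    }
    path = argv[1];
    blksz = atol(argv[2]);
    if (argc > 3)
        nreads = atol(argv[3]);

    fd = open(path, O_RDONLY);
    if (fd < 0 || fstat(fd, &st) != 0)
    {
        perror(path);
        return 1;
    }
    if (st.st_size < blksz)
    {
        fprintf(stderr, "file smaller than one block\n");
        return 1;
    }
    buf = malloc((size_t) blksz);
    if (buf == NULL)
    {
        fprintf(stderr, "out of memory\n");
        return 1;
    }

    srandom(12345);                     /* fixed seed: runs stay comparable */
    gettimeofday(&t0, NULL);
    for (i = 0; i < nreads; i++)
    {
        /* pick a random block-aligned offset within the file */
        off_t   nblocks = st.st_size / blksz;
        off_t   offset = (off_t) (random() % nblocks) * blksz;

        if (lseek(fd, offset, SEEK_SET) < 0 ||
            read(fd, buf, (size_t) blksz) != (ssize_t) blksz)
        {
            perror("read");
            return 1;
        }
    }
    gettimeofday(&t1, NULL);

    elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%ld reads of %ld bytes: %.3f s total, %.3f ms per read\n",
           nreads, blksz, elapsed, elapsed * 1000.0 / nreads);

    free(buf);
    close(fd);
    return 0;
}

The usual caveat applies: the file has to be much larger than RAM (and the
machine otherwise idle), or the OS cache answers most of the reads and all
block sizes look identical.]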
> The cost difference between 32K and 8K disk reads/writes is so small these
> days, compared with the overall cost of the disk operation itself, that you
> can barely measure it: well below 1%. Remember that the seek times
> advertised for disks are an average.

It has been said how small the difference is - therefore, in my opinion, it
should remain at 8KB to maintain the best average performance across all
existing platforms.

I say it's best to let the OS and mass storage subsystem worry about
read-ahead caching and whether they actually read 8KB off the disk, or 32KB
or 64KB, when we ask for 8.


- Andrew
At 10:52 AM 12/2/00 +1100, Andrew Snow wrote:
> > The cost difference between 32K and 8K disk reads/writes is so small
> > these days, compared with the overall cost of the disk operation itself,
> > that you can barely measure it: well below 1%. Remember that the seek
> > times advertised for disks are an average.
>
> It has been said how small the difference is - therefore, in my opinion, it
> should remain at 8KB to maintain the best average performance across all
> existing platforms.

With versions <= PG 7.0, the motivation that's been stated isn't performance
based as much as an option to let you stick relatively big chunks of text
(~40k-ish+ for lzText) in a single row without resorting to classic PG's ugly
LOB interface or something almost as ugly as the built-in LOB handler I did
for AOLserver many months ago.

The performance arguments have mostly been of the form "it won't really cost
you much and you can use rows that are so much longer ..."

I think there's been recognition that 8KB is a reasonable default, along with
lamenting (at least on my part) that the fact that this is just a DEFAULT
hasn't been well-communicated, leading many casual surveyors of DB
alternatives to believe that it is truly a hard-wired limitation, causing
PG's reputation to suffer as a result.

One could argue that PG's reputation would've been enhanced in past years if
a 32KB block size limit, rather than an 8KB block size default, had been
emphasized. But you wouldn't have to change the DEFAULT in order to make this
claim! It would've been just a matter of emphasizing the limit rather than
the default.

PG 7.1 will pretty much end any confusion. The segmented approach used by
TOAST should work well (the AOLserver LOB handler I wrote months ago works
well in the OpenACS context, and uses a very similar segmentation scheme, so
I expect TOAST to work even better). Users will still be able to change to
larger block sizes (perhaps a wise thing to do if a large percentage of their
data won't fit into a single PG block). Users using the default will be able
to store rows of *awesome* length, efficiently.



- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.
Don Baccus wrote:
> ...
> I expect TOAST to work even better). Users will still be able to change to
> larger block sizes (perhaps a wise thing to do if a large percentage of
> their data won't fit into a single PG block). Users using the default will
> be able to store rows of *awesome* length, efficiently.

Depends...

Actually the toaster already jumps in if your tuples exceed BLCKSZ/4, so with
the default of 8K blocks it tries to keep all tuples smaller than 2K. The
reasons behind that are:

1. An average tuple size of 8K means an average of 4K unused space at the end
   of each block. Wasting space means wasting IO bandwidth.

2. Since big items are unlikely to be search criteria, needing to read them
   into memory for every check for a match on other columns is a waste again.
   So the more big items are moved off the main tuple, the smaller the main
   table becomes, the more likely it is that the main tuples (holding the
   keys) are cached, and the cheaper a sequential scan becomes.

Of course, especially for 2. there is a break-even point. That is when the
extra fetches to send toast values to the client cost more than was saved by
not reading them during the main scan. A full table SELECT * definitely costs
more if TOAST is involved. But who does an unqualified SELECT * from a
multi-gig table without problems anyway? Usually you pick a single row or a
few based on some other key attributes - don't you?

Let's make an example. You have a forum server that displays one article plus
the date and sender of all follow-ups. The article bodies are usually big
(1-10K). So you do a SELECT * to fetch the actually displayed article, and
another SELECT sender, date_sent just to get the info for the follow-ups. If
we assume a uniform distribution of body size and an average of 10
follow-ups, that'd mean we save 52K of IO and cache usage for each article
displayed.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
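[A rough check of the closing 52K figure, my arithmetic rather than Jan's:
with ten follow-ups and bodies drawn uniformly from the stated 1-10K range,
the body data that no longer has to be read and cached just to fetch sender
and date is roughly

\[ 10 \times 5\text{K to } 5.5\text{K} \approx 50\text{K to } 55\text{K}, \]

which is in line with the 52K he quotes.]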