Re: 7.3.1 New install, large queries are slow
От | Sean Chittenden |
---|---|
Тема | Re: 7.3.1 New install, large queries are slow |
Дата | |
Msg-id | 20030127081745.GK15936@perrin.int.nxad.com обсуждение исходный текст |
Ответ на | Re: 7.3.1 New install, large queries are slow (Tom Lane <tgl@sss.pgh.pa.us>) |
Ответы |
Re: 7.3.1 New install, large queries are slow
(Curt Sampson <cjs@cynic.net>)
|
Список | pgsql-performance |
> >> Yeah, but isn't that theory a hangover from pre-Unix operating systems? > > > Informix, oracle, etc all do raw device access bypassing the kernels > > buffering, etc. So they need heaping gobules of memory to do the same > > thing the kernel does.. > > D'oh, I believe Jeff's put his finger on it. You need lotsa RAM if you > are trying to bypass the OS. But Postgres would like to work with the > OS, not bypass it. > > > but since they know the exact patterns of data and > > how things will be done they can fine tune their buffer caches to get much > > better performance than the kernel (15-20% in informix's case) since the > > kernel needs to be a "works good generally" > > They go to all that work for 15-20% ??? Remind me not to follow that > primrose path. I can think of lots of places where we can buy 20% for > less work than implementing (and maintaining) our own raw-device access > layer. This is related somewhat to the raw device access discussion. This is a quote from Matt Dillion (FreeBSD VM guru) on the topic of disk caches (Message-Id: <200301270657.h0R6v2qH071774@apollo.backplane.com>) and a few bits at the end: ### Begin quote Mmmmm. Basically what it comes down to is that without foreknowledge of the data locations being accessed, it is not possible for any cache algorithm to adapt to all the myriad ways data might be accessed. If you focus the cache on one methodology it will probably perform terribly when presented with some other methodology. What this means is that for the cases where you *KNOW* how a program intends to access a dataset larger then main memory, it is better to have the program explicitly cache/not-cache the data under program control rather then trying to force the system cache to adapt. I'll also be nice and decode some of Terry's Jargon for the rest of the readers. :will result in significant failure of random page replacement to :result in cache hits; likewise, going to 85% overage will practically :guarantee an almost 100% failure rate, as cyclical access with random :replacement is statistically more likely, in aggregate, to replace :the pages which are there longer (the probability is iterative and :additive: it's effectively a permutation). What Terry is saying is that if you have a dataset that is 2x the size of your cache, the cache hit rate on that data with random page replacement is NOT going to be 50%. This is because with random page replacement the likelihood of a piece of data being found in the cache depends on how long the data has been sitting in the cache. The longer the data has been sitting in the cache, the less likely you will find it when you need it (because it is more likely to have been replaced by the random replacement algorithm over time). So, again, the best caching methodology to use in the case where you *know* how much data you will be accessing and how you will access it is to build the caching directly into your program and not depend on system caching at all (for datasets which are far larger then main memory). This is true of all systems, not just BSD. This is one reason why databases do their own caching (another is so databases know when an access will require I/O for scheduling reasons, but that's a different story). The FreeBSD VM caching system does prevent one process from exhausting another process's cached data due to a sequential access, but the FreeBSD VM cache does not try to outsmart sequential data accesses to datasets which are larger then available cache space because it's an insanely difficult (impossible) problem to solve properly without foreknowledge of what data elements will be accessed when. This isn't to say that we can't improve on what we have now. I certainly think we can. But there is no magic bullet that will deal with every situation. -Matt ### End quote So if there really is only a 15-20% performance gain to be had from using raw disk access, that 15-20% loss comes from not being able to tell the OS what to cache, what not to cache, and what order to have the pages in... which only really matters if there is RAM available to the kernel to cache, and that it is able to determine what is valuable to cache in the course of its operations. Predictive caching by the OS isn't done because it understands PostgreSQL, because it understands a generic algorithm for page hits/misses. What is interesting after reading this, however, is the prospect of a 15-20% speed up on certain tables that we know are accessed frequently by implicitly specifying a set of data to be preferred in a user space cache. It's impossible for the OS to cache the pages that make the biggest impact on user visible performance given the OS has no understanding of what pages make a big difference on user visible performance, a user land database process, however, would. As things stand, it's entirely possible for a set of large queries to come through and wipe the kernel's cache that smaller queries were using. Once a cache misses, the kernel then has to fetch the data again which could slow down over all number of transactions per second. That said, this is something that an in-database scheduler could avoid by placing a lower priority on larger, more complex queries with the assumption being that having the smaller queries continue to process and get in/out is more important than shaving a few seconds off of a larger query that would deplete the cache used by the smaller queries. Oh to be a DBA and being able to make those decisions instead of the kernel... Hrm, so two ideas or questions come to mind: 1) On some of my really large read only queries, it would be SUUUPER nice to be able to re-nice the process from SQL land to 5, 10, or even 20. IIRC, BSD's VM system is smart enough to prevent lower priority jobs from monopolizing the disk cache, which would let the smaller faster turn around queries, continue to exist with their data in the kernel's disk cache. (some kind of query complexity threshold that results in a reduction of priority or an explicit directive to run at a lower priority) 2) Is there any way of specifying that a particular set of tables should be kept in RAM or some kind of write through cache? I know data is selected into a backend out of the catalogs, but would it be possible to have them kept in memory and only re-read on change with some kind of semaphore? Now that all system tables are in their own schemas (pg_catalog and pg_toast), would it be hard to set a flag on a change to those tables that would cause the postmaster, or children, to re-read then instead of rely on their cache? With copy-on-write forking, this could be pretty efficient if the postmaster did this and forked off a copy with the tables already in memory instead of on disk. Just a few ideas/ramblings, hope someone finds them interesting... the renice function is one that I think I'll spend some time looking into here shortly actually. -sc -- Sean Chittenden
В списке pgsql-performance по дате отправления: