Discussion: Raw device I/O for large objects
Hello, I am a graduate student of computer science and I have been looking at PostgreSQL for my master's thesis work. I am looking into implementing raw device I/O for large objects into PostgreSQL (maybe for all storage, I'm not sure which would be easier/better). I am extremely new to the codebase, however. Could someone please point me to the right places to look at, and how/where to get started? Would such a development be useful at all? Is anyone working on anything related? Any feedback / information would be highly appreciated! Thanks, Georgi
On 9/17/07, Georgi Chulkov <godji@metapenguin.org> wrote:
> Could someone please point me to the right places to look at, and how/where to get started? Would such a development be useful at all? Is anyone working on anything related?
>
> Any feedback / information would be highly appreciated!

http://www.postgresql.org/docs/techdocs
http://www.postgresql.org/docs/faq/

The PostgreSQL documentation:
http://www.postgresql.org/docs/8.2/interactive/index.html

Also, if you have the source, the src/tools/backend directory has some useful material for starters.

regards,
--
Sibte Abbas
Georgi Chulkov <godji@metapenguin.org> writes:
> I am looking into implementing raw device I/O for large objects into PostgreSQL (maybe for all storage, I'm not sure which would be easier/better).

We've heard this idea proposed before, and it's been shot down as a poor use of development effort every time. Check the archives for previous threads, but the basic argument goes like this: when Oracle et al did that twenty years ago, it was a good idea because (1) operating systems tended to have sucky filesystems, (2) performance and reliability properties of same were not very consistent across platforms, and (3) being large commercial software vendors they could afford to throw lots of warm bodies at anything that seemed like a bottleneck. None of those arguments holds up well for us today however. If you think you want to reimplement a filesystem you need to have some pretty concrete reasons why you can outsmart all the smart folks who have worked on your-favorite-OS's filesystems for lo these many years. There's also the fact that on any reasonably modern disk hardware, "raw I/O" is anything but.

My opinion is that there is lots of lower-hanging fruit elsewhere. You can find some ideas on our TODO list, or troll the pghackers list archives for other discussions.

regards, tom lane
Hi,

> We've heard this idea proposed before, and it's been shot down as a poor use of development effort every time. Check the archives for previous threads, but the basic argument goes like this: when Oracle et al did that twenty years ago, it was a good idea because (1) operating systems tended to have sucky filesystems, (2) performance and reliability properties of same were not very consistent across platforms, and (3) being large commercial software vendors they could afford to throw lots of warm bodies at anything that seemed like a bottleneck. None of those arguments holds up well for us today however. If you think you want to reimplement a filesystem you need to have some pretty concrete reasons why you can outsmart all the smart folks who have worked on your-favorite-OS's filesystems for lo these many years. There's also the fact that on any reasonably modern disk hardware, "raw I/O" is anything but.

Thanks, I agree with all your arguments.

Here's the reason why I'm looking at raw device storage for large objects only (as opposed to all tables): with raw device I/O I can control, to an extent, spatial locality. So, if I have an application that wants to store N large objects (totaling several gigabytes) and read them back in some order that is well-known in advance, I could store my large objects in that order on the raw device.* Sequentially reading them back would then be very efficient. With a file system underneath, I don't have that freedom. (Such a scenario occurs with raster databases, for example.)

* assuming I have a way to communicate these requirements; that's a whole new problem

Please allow me to ask then:
1. In your opinion, would the above scenario indeed benefit from a raw-device interface for large objects?
2. How feasible is it to decouple general table storage from large object storage?

Thank you for your time,
Georgi
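A minimal sketch of the layout idea in the message above, not PostgreSQL code: objects are written back to back in the order they will later be read, with a small offset index, so the read pass is a single forward scan. An in-memory `io.BytesIO` buffer stands in for the raw device, and the names `pack_in_read_order`, `read_sequentially`, and the tile data are hypothetical, chosen only for illustration.

```python
import io


def pack_in_read_order(device, objects, read_order):
    """Write objects back to back in read_order; return {name: (offset, length)}."""
    index = {}
    for name in read_order:
        data = objects[name]
        index[name] = (device.tell(), len(data))
        device.write(data)
    return index


def read_sequentially(device, index, read_order):
    """Yield (name, offset, data) in read_order, seeking only when not already in place."""
    for name in read_order:
        offset, length = index[name]
        if device.tell() != offset:
            device.seek(offset)
        yield name, offset, device.read(length)


# Hypothetical raster tiles; the application knows in advance it will read a, b, c.
tiles = {"tile_b": b"B" * 4, "tile_a": b"A" * 8, "tile_c": b"C" * 2}
order = ["tile_a", "tile_b", "tile_c"]

device = io.BytesIO()  # assumption: stand-in for a raw device
index = pack_in_read_order(device, tiles, order)

device.seek(0)
offsets = [offset for _, offset, _ in read_sequentially(device, index, order)]
```

Because the offsets increase monotonically in read order, the read pass never seeks backward. The sketch sidesteps deletion, growth, and free-space tracking, which a real raw-device store would have to handle.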
Index organized tables would do this and it would be a generic capability.

- Luke

Msg is shrt cuz m on ma treo

-----Original Message-----
From: Georgi Chulkov [mailto:godji@metapenguin.org]
Sent: Monday, September 17, 2007 11:50 PM Eastern Standard Time
To: Tom Lane
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Raw device I/O for large objects
Hi,

Georgi Chulkov wrote:
> Please allow me to ask then:
> 1. In your opinion, would the above scenario indeed benefit from a raw-device interface for large objects?

No, because file systems also try to do what you outline above. They certainly don't split sequential data up into blocks and distribute them randomly over the device, at least not without having a pretty good reason to do so (with which you'd also have to fight). The achievable gain is pretty minimal, especially in conjunction with a (hopefully battery-backed) write cache.

> 2. How feasible is it to decouple general table storage from large object storage?

I think that would be the easiest part. I would go for a pluggable storage implementation, selectable per tablespace.

But then again, I wouldn't do it at all. After all, this is what MySQL is doing. And we certainly don't want to repeat their mistakes! Or do you know anybody who goes like: "Yippee, multiple storage engines to choose from for my (un)valuable data, let's put some here and others there..."?

Let's optimize the *one* storage engine we have and try to make that work well together with the various filesystems it uses. Because filesystems are already very good at what they are used for. (And we are glad we can use a filesystem and don't need to implement one ourselves.)

Regards

Markus
Georgi Chulkov <godji@metapenguin.org> writes:
> Here's the reason why I'm looking at raw device storage for large objects only (as opposed to all tables): with raw device I/O I can control, to an extent, spatial locality. So, if I have an application that wants to store N large objects (totaling several gigabytes) and read them back in some order that is well-known in advance, I could store my large objects in that order on the raw device.* Sequentially reading them back would then be very efficient. With a file system underneath, I don't have that freedom. (Such a scenario occurs with raster databases, for example.)

Not sure I buy that argument. If you have loaded these large objects in the desired order, then the data will be consecutively located in pg_largeobject, and if the underlying filesystem is at all sane about where it extends a growing file, the data will be pretty much consecutive on disk too. You could probably get marginal improvements by cutting out the middleman but I'm not sure there's reason to think there'd be spectacular improvements.

> Please allow me to ask then:
> 1. In your opinion, would the above scenario indeed benefit from a raw-device interface for large objects?

I don't say it wouldn't benefit. What I'm questioning is the size of the benefit compared to the amount of work required to get it. "Supporting raw I/O" is not some trivial bit of work --- you essentially have to reimplement your own filesystem, because like it or not you *do* have to think about space management. If we went in this direction we'd be buying into a lot of work, not to mention a lot of ongoing portability headaches. So far no one's been able to make a case that it's worth that level of effort.

> 2. How feasible it is to decouple general table storage from large object storage?

You might try digging into the original POSTGRES sources --- at one time there were several different large-object APIs. I'm not sure if they exposed them just as different sets of access functions or if there was something more elegant. My own feeling though is that you probably don't want to go that way, because with outside-the-database storage you lose transactional behavior (unless you're up for reinventing that wheel too). I'd try replacing md.c, or maybe resurrecting smgr.c as something that can really switch between more than one underlying storage manager.

regards, tom lane
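The storage-manager switch suggested above can be pictured as a dispatch table keyed by a manager id, with callers going through one entry point. The sketch below is Python rather than PostgreSQL's C, and every name in it (`StorageManager`, `FileManager`, `InMemoryManager`, `SMGR_SWITCH`, `smgr_write`, `smgr_read`, the 8192-byte block size) is an assumption made for illustration; it does not reproduce the actual smgr.c or md.c interfaces.

```python
import os
import tempfile
from typing import Dict, Protocol

BLOCK_SIZE = 8192  # assumed block size for this sketch


class StorageManager(Protocol):
    """Interface every (hypothetical) storage manager must provide."""

    def write_block(self, blockno: int, data: bytes) -> None: ...
    def read_block(self, blockno: int) -> bytes: ...


class FileManager:
    """Toy manager backed by an ordinary file; block placement is left to the OS."""

    def __init__(self, path: str) -> None:
        self.path = path
        open(path, "ab").close()  # create the file if it does not exist

    def write_block(self, blockno: int, data: bytes) -> None:
        with open(self.path, "r+b") as f:
            f.seek(blockno * BLOCK_SIZE)
            f.write(data)

    def read_block(self, blockno: int) -> bytes:
        with open(self.path, "rb") as f:
            f.seek(blockno * BLOCK_SIZE)
            return f.read(BLOCK_SIZE)


class InMemoryManager:
    """Toy manager that keeps blocks in a dict."""

    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {}

    def write_block(self, blockno: int, data: bytes) -> None:
        self.blocks[blockno] = data

    def read_block(self, blockno: int) -> bytes:
        return self.blocks[blockno]


def smgr_write(switch: Dict[int, StorageManager], which: int, blockno: int, data: bytes) -> None:
    """Validate the block, then dispatch to the manager selected by `which`."""
    if len(data) != BLOCK_SIZE:
        raise ValueError("data must be exactly one block")
    switch[which].write_block(blockno, data)


def smgr_read(switch: Dict[int, StorageManager], which: int, blockno: int) -> bytes:
    """Dispatch a block read to the manager selected by `which`."""
    return switch[which].read_block(blockno)


tmp_path = os.path.join(tempfile.mkdtemp(), "relation.dat")
SMGR_SWITCH: Dict[int, StorageManager] = {0: FileManager(tmp_path), 1: InMemoryManager()}

page = b"x" * BLOCK_SIZE
for which in SMGR_SWITCH:
    smgr_write(SMGR_SWITCH, which, 2, page)
results = [smgr_read(SMGR_SWITCH, which, 2) == page for which in SMGR_SWITCH]
```

Callers never see which manager is behind an id, so a raw-device manager could in principle be added as another table entry. It would still have to do its own space management, which is the cost the message above warns about.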
Thank you everyone for your valuable input! I will have a look at some other part of PostgreSQL, and maybe find something else to do instead. Best, Georgi