Discussion: mdnblocks() sabotages error checking in _mdfd_getseg()
This is per a report by an EnterpriseDB customer and a bunch of off-list analysis by Kevin Grittner and Rahila Syed.

Suppose you have a large relation with OID 123456. There are segment files 123456, 123456.1, and 123456.2. Due to some kind of operating system malfeasance, 123456.1 disappears; you are officially in trouble.

Now, a funny thing happens. The next time you call mdnblocks() on this relation, which will probably happen pretty quickly since every sequential scan does one, it will see that 123456 is a complete segment and it will create an empty 123456.1. It and any future mdnblocks() calls will report that the length of the relation is equal to the length of one full segment, because they don't check for the next segment unless the current segment is completely full.

Now, if subsequent to this an index scan happens to sweep through and try to fetch a block in 123456.2, it will work! This happens because _mdfd_getseg() doesn't care about the length of the segments; it only cares whether or not they exist. If 123456.1 were actually missing, then we'd never test whether 123456.2 exists and we'd get an error. But because mdnblocks() created 123456.1, _mdfd_getseg() is now quite happy to fetch blocks in 123456.2; it considers the empty 123456.1 file to be a sufficient reason to look for 123456.2, and seeing that file, and finding the block it wants therein, it happily returns that block, blithely ignoring the fact that it passed over a completely empty .1 segment before returning a block from .2. This is maybe not the smartest thing ever.

The comment in mdnblocks.c says this:

	 * Because we pass O_CREAT, we will create the next segment (with
	 * zero length) immediately, if the last segment is of length
	 * RELSEG_SIZE.  While perhaps not strictly necessary, this keeps
	 * the logic simple.

I don't really see how this "keeps the logic simple".
What it does is allow sequential scans and index scans to have two completely different notions of how many accessible blocks there are in the relation. Granted, this kind of thing really shouldn't happen, but sometimes bad things do happen. Therefore, I propose the attached patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
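For readers following along, the segment arithmetic underlying this scenario is plain division and remainder; a minimal sketch (the constant is an illustrative default, not taken from any particular build, and the function names are hypothetical):

```c
#include <assert.h>

/* Illustrative default: 1GB segments of 8kB blocks = 131072 blocks each. */
#define RELSEG_SIZE 131072

/* Which numbered segment file holds a given block number... */
static unsigned seg_for_block(unsigned blkno)
{
    return blkno / RELSEG_SIZE;
}

/* ...and the block's offset within that segment file. */
static unsigned off_in_seg(unsigned blkno)
{
    return blkno % RELSEG_SIZE;
}
```

So a block in 123456.2 is any block number in [2*RELSEG_SIZE, 3*RELSEG_SIZE), which an index scan can ask for directly without ever consulting mdnblocks().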
Attachments
Robert Haas <robertmhaas@gmail.com> writes:
> The comment in mdnblocks.c says this:
>  * Because we pass O_CREAT, we will create the next segment (with
>  * zero length) immediately, if the last segment is of length
>  * RELSEG_SIZE.  While perhaps not strictly necessary, this keeps
>  * the logic simple.
> I don't really see how this "keeps the logic simple".

My (vague) recollection is that this is defending some assumptions elsewhere in md.c. You sure you haven't broken anything with this? Relation extension across a segment boundary, perhaps?

regards, tom lane
On Wed, Dec 9, 2015 at 4:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> The comment in mdnblocks.c says this:
>>  * Because we pass O_CREAT, we will create the next segment (with
>>  * zero length) immediately, if the last segment is of length
>>  * RELSEG_SIZE.  While perhaps not strictly necessary, this keeps
>>  * the logic simple.
>> I don't really see how this "keeps the logic simple".
>
> My (vague) recollection is that this is defending some assumptions
> elsewhere in md.c. You sure you haven't broken anything with this?
> Relation extension across a segment boundary, perhaps?

It doesn't look like that particular thing gets broken here. The key to extending across a segment boundary is that _mdfd_openseg() needs to get O_CREAT as its fourth argument, but _mdfd_getseg() thankfully does not rely on mdnblocks() to have already done that: it passes O_CREAT itself. I did also test it by creating a 13GB pgbench_accounts table, and at least in that simple case it works fine. It's possible that there is some other thing I've missed, but in general I think that if there is some dependency on mdnblocks() creating new segments, that's a bad idea and we should try to get rid of it. A call that supposedly reports the number of blocks in a relation really has no business changing on-disk state.

(More broadly, as Kevin was pointing out to me yesterday, md.c looks like it could do with a face lift. Keeping a linked list of 1GB segments and chasing down the list to find the length of the file may have been fine when relations over 1GB were rare, but that's now routine. Some relations may be over 1TB, and using a linked list of 1000 entries to keep track of all of those segments does not seem like an especially good choice. In fact, having no way to get the relation length other than scanning 1000 files doesn't seem like an especially good choice even if we used a better data structure.
Putting a header page in the heap would make getting the length of a relation O(1) instead of O(segments), and for a bonus, we'd be able to reliably detect it if a relation file disappeared out from under us. That's a difficult project and definitely not my top priority, but this code is old and crufty all the same.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
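The O(segments) walk, and the way an O_CREAT-created empty segment truncates the reported length, can be modelled in a few lines. This is a sketch of the stopping rule only, not md.c's actual code; the array stands in for stat() calls on the segment files:

```c
#include <stddef.h>

#define RELSEG_SIZE 131072      /* blocks per segment; illustrative */

/*
 * Model of mdnblocks()'s walk: add up segment lengths, stopping at the
 * first segment that is not completely full.  seg_blocks[] holds each
 * segment file's length in blocks, in on-disk order.
 */
static unsigned long model_mdnblocks(const unsigned *seg_blocks, size_t nsegs)
{
    unsigned long total = 0;

    for (size_t i = 0; i < nsegs; i++)
    {
        total += seg_blocks[i];
        if (seg_blocks[i] < RELSEG_SIZE)
            break;              /* a partial (or empty) segment ends the scan */
    }
    return total;
}
```

With segments {full, full, 1000} this reports the true length; with {full, 0, 1000} - a lost .1 recreated empty - it reports exactly one segment, and the 1000 blocks in .2 become invisible to sequential scans while remaining reachable via _mdfd_getseg().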
On 2015-12-10 11:10:10 -0500, Robert Haas wrote:
> (More broadly, as Kevin was pointing out to me yesterday, md.c looks
> like it could do with a face lift. Keeping a linked list of 1GB
> segments and chasing down the list to find the length of the file may
> have been fine when relations over 1GB were rare, but that's now
> routine. Some relations may be over 1TB, and using a linked list of
> 1000 entries to keep track of all of those segments does not seem like
> an especially good choice.

Yes, that sucks. I've posted a patch for that at http://archives.postgresql.org/message-id/20141031223210.GN13584%40awork2.anarazel.de - I don't have access to either the workload or a good testing machine anymore though, so I've kinda lost interest for a while. I'll try to push the patch with the renaming suggested downthread by Tom soonish.

> In fact, having no way to get the relation length other than scanning
> 1000 files doesn't seem like an especially good choice even if we used
> a better data structure. Putting a header page in the heap would make
> getting the length of a relation O(1) instead of O(segments), and for
> a bonus, we'd be able to reliably detect it if a relation file
> disappeared out from under us. That's a difficult project and
> definitely not my top priority, but this code is old and crufty all
> the same.)

The md layer doesn't really know whether it's dealing with a heap, or with an index, or ... So handling this via a metapage doesn't seem particularly straightforward.

Andres
On Thu, Dec 10, 2015 at 11:36 AM, Andres Freund <andres@anarazel.de> wrote:
>> In fact, having no way to get the relation length other than scanning
>> 1000 files doesn't seem like an especially good choice even if we used
>> a better data structure. Putting a header page in the heap would make
>> getting the length of a relation O(1) instead of O(segments), and for
>> a bonus, we'd be able to reliably detect it if a relation file
>> disappeared out from under us. That's a difficult project and
>> definitely not my top priority, but this code is old and crufty all
>> the same.)
>
> The md layer doesn't really know whether it's dealing with a heap, or
> with an index, or ... So handling this via a metapage doesn't seem
> particularly straightforward.

It's not straightforward, but I don't think that's the reason. What we could do is look at the call sites that use RelationGetNumberOfBlocks() and change some of them to get the information some other way instead. I believe get_relation_info() and initscan() are the primary culprits, accounting for some enormous percentage of the system calls we do on a read-only pgbench workload. Those functions certainly know enough to consult a metapage if we had such a thing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Dec 10, 2015 at 4:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > It's not straightforward, but I don't think that's the reason. What > we could do is look at the call sites that use > RelationGetNumberOfBlocks() and change some of them to get the > information some other way instead. I believe get_relation_info() and > initscan() are the primary culprits, accounting for some enormous > percentage of the system calls we do on a read-only pgbench workload. > Those functions certainly know enough to consult a metapage if we had > such a thing. Would this not run into a chicken and egg problem with recovery? Unless you're going to fsync the meta page whenever the file is extended you'll have to xlog any updates to it and treat the values in memory as authoritative. But when replaying xlog you'll see obsolete inconsistent versions on disk and won't have the correct values in memory either. It seems to me that if you want to fix the linked lists of files that's orthogonal to whether the file lengths on disk are authoritative. You can always keep the lengths or at least the number of files cached and updated in shared memory in a more efficient storage format. -- greg
On 2015-12-10 17:55:37 +0000, Greg Stark wrote: > It seems to me that if you want to fix the linked lists of files > that's orthogonal to whether the file lengths on disk are > authoritative. You can always keep the lengths or at least the number > of files cached and updated in shared memory in a more efficient > storage format. FWIW, I've a very preliminary patch doing exactly that, keeping the file size in shared memory. I'm not sure it'd end up nice if relation extension would have to grab an exclusive lock on the metapage.
On Thu, Dec 10, 2015 at 12:55 PM, Greg Stark <stark@mit.edu> wrote: > On Thu, Dec 10, 2015 at 4:47 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> It's not straightforward, but I don't think that's the reason. What >> we could do is look at the call sites that use >> RelationGetNumberOfBlocks() and change some of them to get the >> information some other way instead. I believe get_relation_info() and >> initscan() are the primary culprits, accounting for some enormous >> percentage of the system calls we do on a read-only pgbench workload. >> Those functions certainly know enough to consult a metapage if we had >> such a thing. > > Would this not run into a chicken and egg problem with recovery? No. > Unless you're going to fsync the meta page whenever the file is > extended you'll have to xlog any updates to it and treat the values in > memory as authoritative. But when replaying xlog you'll see obsolete > inconsistent versions on disk and won't have the correct values in > memory either. All true, but irrelevant. The places I just mentioned wouldn't get called during recovery. They're only invoked from queries. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 10 December 2015 at 16:47, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Dec 10, 2015 at 11:36 AM, Andres Freund <andres@anarazel.de> wrote:
>> In fact, having no way to get the relation length other than scanning
>> 1000 files doesn't seem like an especially good choice even if we used
>> a better data structure. Putting a header page in the heap would make
>> getting the length of a relation O(1) instead of O(segments), and for
>> a bonus, we'd be able to reliably detect it if a relation file
>> disappeared out from under us. That's a difficult project and
>> definitely not my top priority, but this code is old and crufty all
>> the same.)
>
> The md layer doesn't really know whether it's dealing with a heap, or
> with an index, or ... So handling this via a metapage doesn't seem
> particularly straightforward.
It's not straightforward, but I don't think that's the reason. What
we could do is look at the call sites that use
RelationGetNumberOfBlocks() and change some of them to get the
information some other way instead. I believe get_relation_info() and
initscan() are the primary culprits, accounting for some enormous
percentage of the system calls we do on a read-only pgbench workload.
Those functions certainly know enough to consult a metapage if we had
such a thing.
It looks pretty straightforward to me...
The number of relations with >1 file is likely to be fairly small, so we can just have an in-memory array to record that. 8 bytes per relation >1 GB isn't going to take much shmem, but we can extend using dynshmem as needed. We can seq scan the array at relcache build time and invalidate relcache when we extend. WAL log any extension to a new segment and write the table to disk at checkpoint.
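A toy model of the array described above might look like this (every name here is hypothetical; real code would live in shared memory with locking, WAL-logging, and dynshmem growth, none of which is shown):

```c
#include <stddef.h>

/* Hypothetical entry: only relations with more than one segment file. */
typedef struct BigRelEntry
{
    unsigned    relNode;        /* relfilenode */
    unsigned    nsegments;      /* number of segment files on disk */
} BigRelEntry;

#define MAX_BIG_RELS 128        /* fixed cap; real code would grow via dynshmem */

static BigRelEntry big_rels[MAX_BIG_RELS];
static size_t n_big_rels = 0;

/* Seq scan of the array, as at relcache build time; 0 means "one segment". */
static unsigned lookup_nsegments(unsigned relNode)
{
    for (size_t i = 0; i < n_big_rels; i++)
        if (big_rels[i].relNode == relNode)
            return big_rels[i].nsegments;
    return 0;
}

/* Called when an extension crosses into a new segment (and WAL-logged). */
static int record_segment_count(unsigned relNode, unsigned nsegments)
{
    for (size_t i = 0; i < n_big_rels; i++)
        if (big_rels[i].relNode == relNode)
        {
            big_rels[i].nsegments = nsegments;
            return 1;
        }
    if (n_big_rels == MAX_BIG_RELS)
        return 0;               /* full; real code would extend the array */
    big_rels[n_big_rels].relNode = relNode;
    big_rels[n_big_rels].nsegments = nsegments;
    n_big_rels++;
    return 1;
}
```

At 8 bytes per entry this stays tiny even with thousands of multi-segment relations, which is the point of the proposal.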
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Dec 10, 2015 at 1:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 10 December 2015 at 16:47, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Dec 10, 2015 at 11:36 AM, Andres Freund <andres@anarazel.de> wrote:
>> >> In fact, having no way to get the relation length other than scanning
>> >> 1000 files doesn't seem like an especially good choice even if we used
>> >> a better data structure. Putting a header page in the heap would make
>> >> getting the length of a relation O(1) instead of O(segments), and for
>> >> a bonus, we'd be able to reliably detect it if a relation file
>> >> disappeared out from under us. That's a difficult project and
>> >> definitely not my top priority, but this code is old and crufty all
>> >> the same.)
>> >
>> > The md layer doesn't really know whether it's dealing with a heap, or
>> > with an index, or ... So handling this via a metapage doesn't seem
>> > particularly straightforward.
>>
>> It's not straightforward, but I don't think that's the reason. What
>> we could do is look at the call sites that use
>> RelationGetNumberOfBlocks() and change some of them to get the
>> information some other way instead. I believe get_relation_info() and
>> initscan() are the primary culprits, accounting for some enormous
>> percentage of the system calls we do on a read-only pgbench workload.
>> Those functions certainly know enough to consult a metapage if we had
>> such a thing.
>
> It looks pretty straightforward to me...
>
> The number of relations with >1 file is likely to be fairly small, so we can
> just have an in-memory array to record that. 8 bytes per relation >1 GB
> isn't going to take much shmem, but we can extend using dynshmem as needed.
> We can seq scan the array at relcache build time and invalidate relcache
> when we extend. WAL log any extension to a new segment and write the table
> to disk at checkpoint.
Invalidating the relcache when we extend would be extremely expensive, but we could probably come up with some variant of this that would work. I'm not very excited about this design, though; I think actually putting a metapage on each relation would be better.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Dec 10, 2015 at 1:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> We can seq scan the array at relcache build time and invalidate relcache >> when we extend. WAL log any extension to a new segment and write the table >> to disk at checkpoint. > Invaliding the relcache when we extend would be extremely expensive, ... and I think it would be too late anyway, if backends are relying on the relcache to tell the truth. You can't require an exclusive lock on a rel just to extend it, which means there cannot be a guarantee that what a backend has in its relcache will be up to date with current reality. I really don't like Robert's proposal of a metapage though. We've got too darn many forks per relation already. It strikes me that this discussion is perhaps conflating two different issues. Robert seems to be concerned about how we'd detect (not recover from, just detect) filesystem misfeasance in the form of complete loss of a non-last segment file. The other issue is a desire to reduce the cost of mdnblocks() calls. It may be worth thinking about those two things together, but we shouldn't lose sight of these being separate goals, assuming that anybody besides Robert thinks that the segment file loss issue is worth worrying about. regards, tom lane
On Thu, Dec 10, 2015 at 1:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Dec 10, 2015 at 1:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> We can seq scan the array at relcache build time and invalidate relcache >>> when we extend. WAL log any extension to a new segment and write the table >>> to disk at checkpoint. > >> Invaliding the relcache when we extend would be extremely expensive, > > ... and I think it would be too late anyway, if backends are relying on > the relcache to tell the truth. You can't require an exclusive lock on > a rel just to extend it, which means there cannot be a guarantee that > what a backend has in its relcache will be up to date with current reality. True. > I really don't like Robert's proposal of a metapage though. We've got too > darn many forks per relation already. Oh, I wasn't thinking of adding a fork, just repurposing block 0 of the main fork, as we do for some index types. > It strikes me that this discussion is perhaps conflating two different > issues. Robert seems to be concerned about how we'd detect (not recover > from, just detect) filesystem misfeasance in the form of complete loss > of a non-last segment file. The other issue is a desire to reduce the > cost of mdnblocks() calls. It may be worth thinking about those two > things together, but we shouldn't lose sight of these being separate > goals, assuming that anybody besides Robert thinks that the segment > file loss issue is worth worrying about. Don't get me wrong, I'm not willing to expend *any* extra cycles to notice a problem here. But all things being equal, code that notices broken stuff is better than code that doesn't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Dec 10, 2015 at 7:09 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I really don't like Robert's proposal of a metapage though. We've got too >> darn many forks per relation already. > > Oh, I wasn't thinking of adding a fork, just repurposing block 0 of > the main fork, as we do for some index types. Not to mention that it should probably be done for all forks. In practice none of the other forks are going to go over a gigabyte these days but having different extension logic depending on the fork would be a pain and only limit future uses of forks. It doesn't seem like an easy change to make work with binary upgrades though. I suppose you could just support the metapage being absent in perpetuity but then it's hard to test and would make the logic in md even more complex than now rather than simpler. Incidentally, for comparison Oracle has a header page on every data file with various information. Data files are added like we add segments to tables but they are actually pieces of a tablespace rather than a table. The most significant piece of metadata that they store in this header block is actually the commit sequence number -- effectively equivalent to our LSN -- that the file is consistent with. That lets you see how far back you need to start recovery and you can narrow down which data files actually need recovery. -- greg
On 2015-12-09 16:50:06 -0500, Robert Haas wrote:
> Now, if subsequent to this an index scan happens to sweep through and
> try to fetch a block in 123456.2, it will work! This happens because
> _mdfd_getseg() doesn't care about the length of the segments; it only
> cares whether or not they exist.

Unless in recovery in the startup process, or when EXTENSION_CREATE is passed to it. Depending on whether it or mdnblocks were called first, and depending on which segment is missing. In that case it'd *possibly* pad the last block in a segment with zeroes. Yes, only the last block:

	/*
	 * We have to maintain the invariant that segments before the last
	 * active segment are of size RELSEG_SIZE; therefore, pad them out
	 * with zeroes if needed.  (This only matters if caller is
	 * extending the relation discontiguously, but that can happen in
	 * hash indexes.)
	 */
	if (behavior == EXTENSION_CREATE || InRecovery)
	{
		if (_mdnblocks(reln, forknum, v) < RELSEG_SIZE)
		{
			char	   *zerobuf = palloc0(BLCKSZ);

			mdextend(reln, forknum,
					 nextsegno * ((BlockNumber) RELSEG_SIZE) - 1,
					 zerobuf, skipFsync);
			pfree(zerobuf);
		}
		v->mdfd_chain = _mdfd_openseg(reln, forknum, nextsegno, O_CREAT);
	}

> If 123456.1 were actually missing,
> then we'd never test whether 123456.2 exists and we'd get an error.
> But because mdnblocks() created 123456.1, _mdfd_getseg() is now quite
> happy to fetch blocks in 123456.2; it considers the empty 123456.1
> file to be a sufficient reason to look for 123456.2, and seeing that
> file, and finding the block it wants therein, it happily returns that
> block, blithely ignoring the fact that it passed over a completely
> empty .1 segment before returning a block from .2. This is maybe not
> the smartest thing ever.

I think it might be worthwhile to add a check against this. It'd not be very hard to ensure the segment is indeed large enough iff further segments exist. Possibly not an ERROR, but a WARNING might be useful.
To avoid overhead and absurd log spamming we should make that check only if the segment isn't already open.

> The comment in mdnblocks.c says this:
>
>  * Because we pass O_CREAT, we will create the next segment (with
>  * zero length) immediately, if the last segment is of length
>  * RELSEG_SIZE.  While perhaps not strictly necessary, this keeps
>  * the logic simple.
>
> I don't really see how this "keeps the logic simple". What it does is
> allow sequential scans and index scans to have two completely
> different notions of how many accessible blocks there are in the
> relation. Granted, this kind of thing really shouldn't happen, but
> sometimes bad things do happen. Therefore, I propose the attached
> patch.

I'm in favor of changing this behaviour - it's indeed weird and somewhat error prone.

Do you want to change this in master, or backpatch a change? To me the code in md.c is subtle and complicated enough that I'm rather inclined to only change this in master.

This code is really hard to test effectively :(.

Greetings,

Andres Freund
On Tue, Dec 15, 2015 at 9:51 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-12-09 16:50:06 -0500, Robert Haas wrote:
>> Now, if subsequent to this an index scan happens to sweep through and
>> try to fetch a block in 123456.2, it will work! This happens because
>> _mdfd_getseg() doesn't care about the length of the segments; it only
>> cares whether or not they exist.
>
> Unless in recovery in the startup process, or when EXTENSION_CREATE is
> passed to it. Depending on whether it or mdnblocks were called first,
> and depending on which segment is missing. In that case it'd *possibly*
> pad the last block in a segment with zeroes. Yes, only the last block:

Yes, that's clearly inadequate. I think the fact that it only pads the last block, possibly creating a sparse file, should also be fixed, but as a separate patch.

>> If 123456.1 were actually missing,
>> then we'd never test whether 123456.2 exists and we'd get an error.
>> But because mdnblocks() created 123456.1, _mdfd_getseg() is now quite
>> happy to fetch blocks in 123456.2; it considers the empty 123456.1
>> file to be a sufficient reason to look for 123456.2, and seeing that
>> file, and finding the block it wants therein, it happily returns that
>> block, blithely ignoring the fact that it passed over a completely
>> empty .1 segment before returning a block from .2. This is maybe not
>> the smartest thing ever.
>
> I think it might be worthwhile to add a check against this. It'd not be
> very hard to ensure the segment is indeed large enough iff further
> segments exist. Possibly not an ERROR, but a WARNING might be useful. To
> avoid overhead and absurd log spamming we should make that check only if
> the segment isn't already open.

We could:

1. Before extending a relation to RELSEG_SIZE, verify that the next segment of the relation doesn't already exist on disk. If we find out that it does, then throw an error and refuse the extension.

2.
Teach _mdfd_getseg() that a segment of a size less than RELSEG_SIZE is, by definition, the last segment, and don't even check whether further files exist on disk. mdnblocks() already behaves this way, so it would just be making things consistent.

If we did these things, then if you got a hole in your relation, the entire system would agree that the data after the hole doesn't exist any more, and if the segment prior to the hole got re-extended, it would be prevented from "bridging" the hole. We wouldn't be incurring any additional overhead, either, because I bet #2 would save more cycles than #1 costs.

However, I don't think this is exactly what you are proposing. I'm skeptical of the idea that _mdfd_getseg() should probe ahead to see whether we're dealing with a malformed relation where the intermediate segments still exist but have zero length. If we fix mdnblocks() as I've proposed, then that scenario will become much less probable. Of course, the operating system could still decide to truncate a segment other than the last for some reason or other, but it could also decide to remove the file completely, or to remove several files, or lots of other things.

>> The comment in mdnblocks.c says this:
>>
>>  * Because we pass O_CREAT, we will create the next segment (with
>>  * zero length) immediately, if the last segment is of length
>>  * RELSEG_SIZE.  While perhaps not strictly necessary, this keeps
>>  * the logic simple.
>>
>> I don't really see how this "keeps the logic simple". What it does is
>> allow sequential scans and index scans to have two completely
>> different notions of how many accessible blocks there are in the
>> relation. Granted, this kind of thing really shouldn't happen, but
>> sometimes bad things do happen. Therefore, I propose the attached
>> patch.
>
> I'm in favor of changing this behaviour - it's indeed weird and somewhat
> error prone.
>
> Do you want to change this in master, or backpatch a change?
> To me the code in md.c is subtle and complicated enough that I'm rather
> inclined to only change this in master.

Yeah, that's my inclination as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
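The rule in point 2 upthread reduces to a one-line predicate; a sketch of the decision only, not the surrounding md.c plumbing (the function name is hypothetical):

```c
#define RELSEG_SIZE 131072      /* blocks per segment; illustrative */

/*
 * Proposed rule: a segment shorter than RELSEG_SIZE is by definition the
 * last segment, so a caller should never probe for a successor file
 * behind it.  mdnblocks() already stops its walk this way; applying the
 * same test in _mdfd_getseg() would keep the two code paths consistent.
 */
static int should_probe_next_segment(unsigned seg_len_blocks)
{
    return seg_len_blocks == RELSEG_SIZE;
}
```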
On 2015-12-15 10:53:58 -0500, Robert Haas wrote:
> On Tue, Dec 15, 2015 at 9:51 AM, Andres Freund <andres@anarazel.de> wrote:
> > Unless in recovery in the startup process, or when EXTENSION_CREATE is
> > passed to it. Depending on whether it or mdnblocks were called first,
> > and depending on which segment is missing. In that case it'd *possibly*
> > pad the last block in a segment with zeroes. Yes, only the last block:
>
> Yes, that's clearly inadequate. I think the fact that it only pads
> the last block, possibly creating a sparse file, should also be fixed,
> but as a separate patch.

I was wondering about that - but after a bit of thinking I'm disinclined to go that way. Consider the scenario where we write to the last block in a humongous relation, drop that relation, and then crash, before a checkpoint. Right now we'll write the last block of each segment during recovery. After such a change we'd end up rewriting the whole file... I think we should primarily update that comment.

> 1. Before extending a relation to RELSEG_SIZE, verify that the next
> segment of the relation doesn't already exist on disk. If we find out
> that it does, then throw an error and refuse the extension.

That bit wouldn't work directly that way IIRC - the segment is allowed to exist as an 'inactive' segment. We only truncate segments to 0 in mdtruncate()... But a size 0 check should do the trick.

> 2. Teach _mdfd_getseg() that a segment of a size less than RELSEG_SIZE
> is, by definition, the last segment, and don't even check whether
> further files exist on disk. mdnblocks() already behaves this way, so
> it would just be making things consistent.
>
> However, I don't think this is exactly what you are proposing. I'm
> skeptical of the idea that _mdfd_getseg() should probe ahead to see
> whether we're dealing with a malformed relation where the intermediate
> segments still exist but have zero length.

That's not exactly what I was thinking of.
I was thinking of doing a _mdnblocks(reln, forknum, v) == RELSEG_SIZE check in _mdfd_getseg()'s main loop, whenever nextsegno < targetseg. That'll make that check rather cheap. Which sounds pretty much like your 2).

Andres
On Tue, Dec 15, 2015 at 11:08 AM, Andres Freund <andres@anarazel.de> wrote: >> However, I don't think this is exactly what you are proposing. I'm >> skeptical of the idea that _mdfd_getseg() should probe ahead to see >> whether we're dealing with a malformed relation where the intermediate >> segments still exist but have zero length. > > That's not exactly what I was thinking of. I'm was thinking of doing a > _mdnblocks(reln, forknum, v) == RELSEG_SIZE check in _mdfd_getseg()'s > main loop, whenever nextsegno < targetseg. That'll make that check > rather cheap. Which sounds pretty much like your 2). Hmm, yes it does. But now that I think about it, we're not otherwise doing _mdnblocks() in that loop. So that would add a system call per loop iteration. That doesn't seem like such a swell idea. If you're OK with it, I think I'll commit the original patch. That seems like a good thing to do regardless of what we decide about the rest of this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-12-15 11:28:10 -0500, Robert Haas wrote:
> Hmm, yes it does. But now that I think about it, we're not otherwise
> doing _mdnblocks() in that loop. So that would add a system call per
> loop iteration. That doesn't seem like such a swell idea.

We'd do that only when initially opening the relation (that far). And after we've already executed a, much more expensive, open() on the same file. So I'm not particularly bothered by that.

> If you're OK with it, I think I'll commit the original patch. That
> seems like a good thing to do regardless of what we decide about the
> rest of this.

Ok. I thought for a moment that that patch would need a bit more editing because of:

	/*
	 * This is the last segment we want to keep.  Truncate the file to
	 * the right length, and clear chain link that points to any
	 * remaining segments (which we shall zap).  NOTE: if nblocks is
	 * exactly a multiple K of RELSEG_SIZE, we will truncate the K+1st
	 * segment to 0 length but keep it.  This adheres to the invariant
	 * given in the header comments.
	 */

in mdtruncate(). But afaics this doesn't change:

	 * On disk, a relation must consist of consecutively numbered segment
	 * files in the pattern
	 *	-- Zero or more full segments of exactly RELSEG_SIZE blocks each
	 *	-- Exactly one partial segment of size 0 <= size < RELSEG_SIZE blocks
	 *	-- Optionally, any number of inactive segments of size 0 blocks.

at all, so I think we're good.

Andres
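The on-disk invariant quoted above is mechanically checkable; a sketch in which an array of per-segment lengths stands in for the actual files (the function name is hypothetical):

```c
#include <stddef.h>

#define RELSEG_SIZE 131072      /* blocks per segment; illustrative */

/*
 * Check the md.c invariant: zero or more full segments, then exactly one
 * partial segment (0 <= size < RELSEG_SIZE blocks), then optionally any
 * number of inactive zero-size segments.
 */
static int segments_valid(const unsigned *seg_blocks, size_t nsegs)
{
    size_t i = 0;

    if (nsegs == 0)
        return 0;
    while (i < nsegs && seg_blocks[i] == RELSEG_SIZE)
        i++;                    /* leading full segments */
    if (i == nsegs)
        return 0;               /* the mandatory partial segment is missing */
    if (seg_blocks[i] > RELSEG_SIZE)
        return 0;               /* over-long segment file: corrupt */
    i++;                        /* the one partial segment (may be 0 blocks) */
    while (i < nsegs && seg_blocks[i] == 0)
        i++;                    /* trailing inactive segments */
    return i == nsegs;
}
```

The damaged relation from upthread - segment lengths {RELSEG_SIZE, 0, 1000} - fails this check, which is exactly the shape of corruption the proposed patch keeps mdnblocks() from papering over.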