Обсуждение: ALTER DATABASE SET TABLESPACE vs crash safety
So I'm looking at the patch for ALTER DATABASE SET TABLESPACE, and wondering about what happens if there's a system crash midway through. The answer doesn't look too good: if the deletion pass has started, your database is hosed. I think we can fix this along the following lines: 1. Copy all files to new directory.2. Checkpoint (this is to prevent needing to replay XLOG_DBASE_CREATE, for the samereasons mentioned in createdb()).3. Update the pg_database tuple.4. Commit transaction, start a new one.5. Remove olddirectory and files. If the system crashes before the commit in step 4, the ALTER SET didn't happen, and the only bad consequence is some orphaned files copied to the new directory. If the system crashes after the commit, the ALTER SET is committed, and the only bad consequence is some orphaned files left behind in the old directory. That is, that's true as long as the filesystem copy in fact pushed everything to disk. copydir() does an fsync() on each file it copies, so I think we have done as much as we can to protect the data being copied, but I wonder if anyone feels it's too dangerous? Note that this is significantly more dangerous than CREATE DATABASE (where presumably you have no great investment yet in the new DB) or DROP DATABASE (where presumably you don't want the data anyway). ALTER DATABASE SET TABLESPACE is pushing around what might be your only copy of valuable data. Comments? regards, tom lane
Tom Lane wrote: > That is, that's true as long as the filesystem copy in fact pushed > everything to disk. copydir() does an fsync() on each file it copies, > so I think we have done as much as we can to protect the data being > copied, but I wonder if anyone feels it's too dangerous? Do we need to fsync the directory itself? My fsync(2) manpage says Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reacheddisk. For that an explicit fsync() on a file descriptor for the directory is also needed. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Do we need to fsync the directory itself? My fsync(2) manpage says > Calling fsync() does not necessarily ensure that the entry in the directory > containing the file has also reached disk. For that an explicit fsync() on a > file descriptor for the directory is also needed. Hmm ... I see that in the Linux manpage, but not on Darwin, HPUX, or in the Single Unix Spec. I'm inclined to argue that we've always expected the filesystem to take care of its own metadata, and we've never seen any indication that that's unsafe. We don't try to "fsync the directory" after a normal table create for instance. regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Do we need to fsync the directory itself? My fsync(2) manpage says > > > Calling fsync() does not necessarily ensure that the entry in the directory > > containing the file has also reached disk. For that an explicit fsync() on a > > file descriptor for the directory is also needed. > > Hmm ... I see that in the Linux manpage, but not on Darwin, HPUX, or in > the Single Unix Spec. I'm inclined to argue that we've always expected > the filesystem to take care of its own metadata, and we've never seen > any indication that that's unsafe. We don't try to "fsync the > directory" after a normal table create for instance. I dimly recall the Postfix guys got burned by this some time ago (mails got lost after a crash because they didn't fsync the directory on which they had just created the files before acknowledging the email delivery to the remote server). I guess this is the reason we require a filesystem that journals metadata. http://osdir.com/ml/file-systems.reiserfs.general/2003-09/msg00120.html -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Nov 7, 2008, at 9:53 AM, Tom Lane wrote: > So I'm looking at the patch for ALTER DATABASE SET TABLESPACE, and > wondering about what happens if there's a system crash midway through. > The answer doesn't look too good: if the deletion pass has started, > your database is hosed. FWIW, I don't see this patch as being terribly useful in the real world until it can take place in the background, without locking stuff for a huge amount of time. That tells me that we should have a way to move objects to a new tablespace a little bit at a time. My guess is that such a facility would be something that runs in the background over many different transactions. Once everything had been moved, only then would it go and delete the old files. But it's too late to get that kind of functionality into 8.4. :( So, is there enough demand for this feature to get it into 8.4 and possibly paint ourselves into a corner, or should we just wait until 8.5? -- Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828
--On Sonntag, November 09, 2008 18:25:50 -0600 Decibel! <decibel@decibel.org> wrote: > On Nov 7, 2008, at 9:53 AM, Tom Lane wrote: > FWIW, I don't see this patch as being terribly useful in the real world > until it can take place in the background, without locking stuff for a > huge amount of time. That tells me that we should have a way to move > objects to a new tablespace a little bit at a time. My guess is that such > a facility would be something that runs in the background over many > different transactions. Once everything had been moved, only then would > it go and delete the old files. Of course, such a facility is much more complicater than what this patch does. If you don't want to exclusive lock the database you need to track all changes during copying the relations and later merge them into the new ones in the worst case. I don't see how you want to preserve a consistent state of the database otherwise. > > But it's too late to get that kind of functionality into 8.4. :( So, is > there enough demand for this feature to get it into 8.4 and possibly > paint ourselves into a corner, or should we just wait until 8.5? This patch is already committed. -- Thanks Bernd