Discussion: Why does PostgreSQL ftruncate before unlink?
When dropping lots of tables, I noticed postgresql taking longer than I would have expected.

strace seems to report that the largest contributor is the ftruncate and not the unlink. I'm curious what the logic is behind using ftruncate before unlink.

I'm using an ext4 filesystem.

--
Jon
On Fri, Feb 21, 2014 at 4:14 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> When dropping lots of tables, I noticed postgresql taking longer than
> I would have expected.
>
> strace seems to report that the largest contributor is the ftruncate
> and not the unlink. I'm curious what the logic is behind using
> ftruncate before unlink.
>
> I'm using an ext4 filesystem.

I'm guessing that this is so that it can be rolled back. Unlink is
likely issued at commit;
On Sunday, February 23, 2014, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Fri, Feb 21, 2014 at 4:14 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
>> When dropping lots of tables, I noticed postgresql taking longer than
>> I would have expected.
>>
>> strace seems to report that the largest contributor is the ftruncate
>> and not the unlink. I'm curious what the logic is behind using
>> ftruncate before unlink.
>>
>> I'm using an ext4 filesystem.
>
> I'm guessing that this is so that it can be rolled back. Unlink is
> likely issued at commit;

I would hope that ftruncate is issued at commit as well. That doesn't sound undoable.

Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> On Sunday, February 23, 2014, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>> I'm guessing that this is so that it can be rolled back. Unlink is
>> likely issued at commit;

> I would hope that ftruncate is issued at commit as well. That doesn't
> sound undoable.

It's more subtle than that. I'm too lazy to look at the comments in md.c
right now, but basically the reason for not doing an instant unlink is
to ensure that if a relation is truncated and then re-extended, open file
pointers held by other backends will still be valid. The ftruncate is
done to ensure that allocated disk space goes away as soon as that's safe
(ie, at commit of the truncation); but immediate unlink would require
forcing more cross-backend synchronization than we want to have.

If memory serves, the inode should get removed during the next checkpoint.

regards, tom lane
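[Editor's note] The file semantics this explanation relies on can be seen outside PostgreSQL. The following is a minimal Python sketch, not the md.c logic: two descriptors stand in for two backends, and the function name `demo_truncate_vs_unlink` and the file name `relation_segment` are made up for illustration. It assumes a POSIX system. It shows that a second open descriptor survives a truncate and a re-extend, and that an unlinked file stays readable (so its blocks stay allocated) for as long as some descriptor remains open.

```python
import os
import tempfile


def demo_truncate_vs_unlink(size=1 << 20):
    """Observe a second descriptor across truncate, re-extend, and unlink.

    Illustrative only: fd_a and fd_b play the role of two backends that
    both hold the same file open. This is not PostgreSQL code.
    """
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "relation_segment")  # hypothetical name
        fd_a = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        fd_b = os.open(path, os.O_RDWR)
        try:
            os.write(fd_a, b"x" * size)

            # Truncate through A. B's descriptor is still valid and now
            # reports a zero-length file, so the space is already released.
            os.ftruncate(fd_a, 0)
            results["size_after_truncate"] = os.fstat(fd_b).st_size

            # Re-extend through A. B reads the new data via its old descriptor.
            os.pwrite(fd_a, b"new", 0)
            results["reextended_read"] = os.pread(fd_b, 3, 0)

            # Unlink the name. The inode lives on while descriptors are open,
            # so the data is still there and its blocks are not yet freed.
            os.unlink(path)
            results["nlink_after_unlink"] = os.fstat(fd_b).st_nlink
            results["readable_after_unlink"] = os.pread(fd_b, 3, 0)
        finally:
            os.close(fd_a)
            os.close(fd_b)
    return results
```

Under these assumptions, unlink alone would not return disk space while another process keeps the file open, whereas ftruncate returns it immediately and leaves the other descriptors usable.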
On Sun, Feb 23, 2014 at 9:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> On Sunday, February 23, 2014, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>>> I'm guessing that this is so that it can be rolled back. Unlink is
>>> likely issued at commit;
>
>> I would hope that ftruncate is issued at commit as well. That doesn't
>> sound undoable.
>
> It's more subtle than that. I'm too lazy to look at the comments in md.c
> right now, but basically the reason for not doing an instant unlink is
> to ensure that if a relation is truncated and then re-extended, open file
> pointers held by other backends will still be valid. The ftruncate is
> done to ensure that allocated disk space goes away as soon as that's safe
> (ie, at commit of the truncation); but immediate unlink would require
> forcing more cross-backend synchronization than we want to have.
>
> If memory serves, the inode should get removed during the next checkpoint.

I was moments away from commenting to say that I had traced the flow
of the code to md.c and found the comments there quite illuminating. I
wonder if there is a different way to solve the underlying issue
without relying on ftruncate (which seems to be somewhat expensive).

--
Jon
Jon Nelson <jnelson+pgsql@jamponi.net> writes:
> On Sun, Feb 23, 2014 at 9:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If memory serves, the inode should get removed during the next checkpoint.

> I was moments away from commenting to say that I had traced the flow
> of the code to md.c and found the comments there quite illuminating. I
> wonder if there is a different way to solve the underlying issue
> without relying on ftruncate (which seems to be somewhat expensive).

Hm. The code is designed the way it is on the assumption that ftruncate
doesn't do anything that unlink wouldn't have to do anyway. If it really
is significantly slower on popular filesystems, maybe we need to revisit
that.

regards, tom lane
On Sun, Feb 23, 2014 at 10:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>> On Sun, Feb 23, 2014 at 9:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> If memory serves, the inode should get removed during the next checkpoint.
>
>> I was moments away from commenting to say that I had traced the flow
>> of the code to md.c and found the comments there quite illuminating. I
>> wonder if there is a different way to solve the underlying issue
>> without relying on ftruncate (which seems to be somewhat expensive).
>
> Hm. The code is designed the way it is on the assumption that ftruncate
> doesn't do anything that unlink wouldn't have to do anyway. If it really
> is significantly slower on popular filesystems, maybe we need to revisit
> that.
>

Here is an example.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.95    3.207681        4182       767           ftruncate
  0.05    0.001579           1      2428      2301 unlink

--
Jon
On Mon, Feb 24, 2014 at 6:38 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> Here is an example.
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  99.95    3.207681        4182       767           ftruncate
>   0.05    0.001579           1      2428      2301 unlink

Are these times for unlink after ftruncate? Because (on Linux, which is
what I use on the desktops and am familiar with) unlinks of big files are
slow too. So, to have a more meaningful comparison, you would need to time
ftruncate+unlink and plain unlink of the same files; IIRC they take nearly
equal time.

Francisco Olarte.
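[Editor's note] The comparison suggested above could be set up roughly as follows. This is a hedged sketch rather than a benchmark anyone in the thread ran: the helper names `make_file` and `time_removal` and the file name `victim` are invented for illustration, and the timings it produces depend entirely on the filesystem, file size, and cache state. It writes a file of a given size, then times either ftruncate followed by unlink, or a plain unlink.

```python
import os
import tempfile
import time


def make_file(path, size):
    """Create a file with `size` bytes actually written and synced to disk."""
    with open(path, "wb") as f:
        f.write(b"\0" * size)
        f.flush()
        os.fsync(f.fileno())


def time_removal(directory, size, truncate_first):
    """Return seconds spent removing a freshly written file.

    With truncate_first=True the timed region covers ftruncate, close, and
    unlink; otherwise it covers only unlink. Hypothetical helper, for
    illustration only.
    """
    path = os.path.join(directory, "victim")  # hypothetical file name
    make_file(path, size)
    fd = os.open(path, os.O_RDWR) if truncate_first else None
    start = time.perf_counter()
    if fd is not None:
        os.ftruncate(fd, 0)
        os.close(fd)
    os.unlink(path)
    return time.perf_counter() - start
```

To use it, pass a directory on the filesystem of interest (ext4 in this thread) rather than the default temporary directory, which may live on tmpfs, and call `time_removal(directory, size, True)` and `time_removal(directory, size, False)` several times each with the same size. If the two totals come out close, the cost seen under ftruncate in the strace summary is work that unlink would otherwise have done itself.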