Thread: Why does PostgreSQL ftruncate before unlink?

Why does PostgreSQL ftruncate before unlink?

From
Jon Nelson
Date:
When dropping lots of tables, I noticed postgresql taking longer than
I would have expected.

strace seems to report that the largest contributor is the ftruncate
and not the unlink. I'm curious what the logic is behind using
ftruncate before unlink.

I'm using an ext4 filesystem.

--
Jon


Re: Why does PostgreSQL ftruncate before unlink?

From
Scott Marlowe
Date:
On Fri, Feb 21, 2014 at 4:14 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> When dropping lots of tables, I noticed postgresql taking longer than
> I would have expected.
>
> strace seems to report that the largest contributor is the ftruncate
> and not the unlink. I'm curious what the logic is behind using
> ftruncate before unlink.
>
> I'm using an ext4 filesystem.

I'm guessing that this is so that it can be rolled back. Unlink is
likely issued at commit;


Re: Why does PostgreSQL ftruncate before unlink?

From
Jeff Janes
Date:
On Sunday, February 23, 2014, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Fri, Feb 21, 2014 at 4:14 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
>> When dropping lots of tables, I noticed postgresql taking longer than
>> I would have expected.
>>
>> strace seems to report that the largest contributor is the ftruncate
>> and not the unlink. I'm curious what the logic is behind using
>> ftruncate before unlink.
>>
>> I'm using an ext4 filesystem.
>
> I'm guessing that this is so that it can be rolled back. Unlink is
> likely issued at commit;

I would hope that ftruncate is issued at commit as well.  That doesn't sound undoable.

Cheers,

Jeff

Re: Why does PostgreSQL ftruncate before unlink?

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Sunday, February 23, 2014, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>> I'm guessing that this is so that it can be rolled back. Unlink is
>> likely issued at commit;

> I would hope that ftruncate is issued at commit as well.  That doesn't
> sound undoable.

It's more subtle than that.  I'm too lazy to look at the comments in md.c
right now, but basically the reason for not doing an instant unlink is
to ensure that if a relation is truncated and then re-extended, open file
pointers held by other backends will still be valid.  The ftruncate is
done to ensure that allocated disk space goes away as soon as that's safe
(ie, at commit of the truncation); but immediate unlink would require
forcing more cross-backend synchronization than we want to have.

If memory serves, the inode should get removed during the next checkpoint.

            regards, tom lane
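
The pattern described in this message (free the disk space at commit, remove the directory entry later) can be illustrated with a small sketch. This is not PostgreSQL's md.c code; the class `DeferredUnlinkQueue` and the names `drop_at_commit`, `checkpoint`, and `demo` are hypothetical and exist only for illustration, and the cross-backend file-handle handling that motivates the real design is not modelled here.

```python
import os
import tempfile


class DeferredUnlinkQueue:
    """Hypothetical illustration of 'truncate now, unlink later'."""

    def __init__(self):
        # Paths that were truncated but whose directory entries still exist.
        self.pending = []

    def drop_at_commit(self, path):
        # Release the file's disk space right away, but keep the name in
        # the directory; the unlink itself is deferred.
        fd = os.open(path, os.O_RDWR)
        try:
            os.ftruncate(fd, 0)
        finally:
            os.close(fd)
        self.pending.append(path)

    def checkpoint(self):
        # Stand-in for "the next checkpoint": remove the leftover names.
        removed = 0
        for path in self.pending:
            try:
                os.unlink(path)
                removed += 1
            except FileNotFoundError:
                pass
        self.pending.clear()
        return removed


def demo():
    # Create a small file, drop it, and observe both stages.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * 4096)
    finally:
        os.close(fd)

    queue = DeferredUnlinkQueue()
    queue.drop_at_commit(path)
    size_after_commit = os.path.getsize(path)
    exists_after_commit = os.path.exists(path)

    queue.checkpoint()
    exists_after_checkpoint = os.path.exists(path)
    return size_after_commit, exists_after_commit, exists_after_checkpoint
```

After `drop_at_commit` the file is empty but still present under its name; only `checkpoint` removes the name itself.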


Re: Why does PostgreSQL ftruncate before unlink?

From
Jon Nelson
Date:
On Sun, Feb 23, 2014 at 9:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> On Sunday, February 23, 2014, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>>> I'm guessing that this is so that it can be rolled back. Unlink is
>>> likely issued at commit;
>
>> I would hope that ftruncate is issued at commit as well.  That doesn't
>> sound undoable.
>
> It's more subtle than that.  I'm too lazy to look at the comments in md.c
> right now, but basically the reason for not doing an instant unlink is
> to ensure that if a relation is truncated and then re-extended, open file
> pointers held by other backends will still be valid.  The ftruncate is
> done to ensure that allocated disk space goes away as soon as that's safe
> (ie, at commit of the truncation); but immediate unlink would require
> forcing more cross-backend synchronization than we want to have.
>
> If memory serves, the inode should get removed during the next checkpoint.

I was moments away from commenting to say that I had traced the flow
of the code to md.c and found the comments there quite illuminating. I
wonder if there is a different way to solve the underlying issue
without relying on ftruncate (which seems to be somewhat expensive).

--
Jon


Re: Why does PostgreSQL ftruncate before unlink?

From
Tom Lane
Date:
Jon Nelson <jnelson+pgsql@jamponi.net> writes:
> On Sun, Feb 23, 2014 at 9:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If memory serves, the inode should get removed during the next checkpoint.

> I was moments away from commenting to say that I had traced the flow
> of the code to md.c and found the comments there quite illuminating. I
> wonder if there is a different way to solve the underlying issue
> without relying on ftruncate (which seems to be somewhat expensive).

Hm.  The code is designed the way it is on the assumption that ftruncate
doesn't do anything that unlink wouldn't have to do anyway.  If it really
is significantly slower on popular filesystems, maybe we need to revisit
that.

            regards, tom lane


Re: Why does PostgreSQL ftruncate before unlink?

From
Jon Nelson
Date:
On Sun, Feb 23, 2014 at 10:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>> On Sun, Feb 23, 2014 at 9:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> If memory serves, the inode should get removed during the next checkpoint.
>
>> I was moments away from commenting to say that I had traced the flow
>> of the code to md.c and found the comments there quite illuminating. I
>> wonder if there is a different way to solve the underlying issue
>> without relying on ftruncate (which seems to be somewhat expensive).
>
> Hm.  The code is designed the way it is on the assumption that ftruncate
> doesn't do anything that unlink wouldn't have to do anyway.  If it really
> is significantly slower on popular filesystems, maybe we need to revisit
> that.
>

Here is an example.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.95    3.207681        4182       767           ftruncate
  0.05    0.001579           1      2428      2301 unlink

--
Jon


Re: Why does PostgreSQL ftruncate before unlink?

From
Francisco Olarte
Date:
On Mon, Feb 24, 2014 at 6:38 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> Here is an example.
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  99.95    3.207681        4182       767           ftruncate
>   0.05    0.001579           1      2428      2301 unlink

Are these times for unlink after ftruncate? On Linux (which is what I
use on my desktops and am familiar with), unlinks of big files are slow
too, so for a more meaningful comparison you would need to time
ftruncate+unlink against a plain unlink of the same files. IIRC they
take nearly equal time.

Francisco Olarte.
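
The comparison suggested above could be sketched roughly as follows. This is only a minimal timing harness, not a rigorous benchmark: the function names, the default file size, and the use of a temporary directory are all assumptions for illustration, and results will depend heavily on the filesystem, file size, and cache state.

```python
import os
import tempfile
import time


def make_file(directory, size_bytes):
    # Write real data (not a sparse file) so blocks are actually allocated.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        chunk = b"\0" * (1024 * 1024)
        remaining = size_bytes
        while remaining > 0:
            written = os.write(fd, chunk[:min(remaining, len(chunk))])
            remaining -= written
        os.fsync(fd)
    finally:
        os.close(fd)
    return path


def time_plain_unlink(path):
    start = time.perf_counter()
    os.unlink(path)
    return time.perf_counter() - start


def time_truncate_then_unlink(path):
    fd = os.open(path, os.O_RDWR)
    try:
        start = time.perf_counter()
        os.ftruncate(fd, 0)
        truncate_s = time.perf_counter() - start
    finally:
        os.close(fd)
    start = time.perf_counter()
    os.unlink(path)
    unlink_s = time.perf_counter() - start
    return truncate_s, unlink_s


def compare(size_bytes=8 * 1024 * 1024, directory=None):
    # Two files of the same size: one unlinked directly, one truncated first.
    with tempfile.TemporaryDirectory(dir=directory) as workdir:
        plain_path = make_file(workdir, size_bytes)
        trunc_path = make_file(workdir, size_bytes)
        plain_s = time_plain_unlink(plain_path)
        truncate_s, unlink_s = time_truncate_then_unlink(trunc_path)
    return {
        "plain_unlink": plain_s,
        "ftruncate": truncate_s,
        "unlink_after_ftruncate": unlink_s,
    }
```

Calling `compare(directory=...)` with a directory on the filesystem of interest (ext4 in this thread) and a file size close to the real tables would give the three timings side by side; repeating the run several times would help smooth out noise.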