Обсуждение: Improve behavior of concurrent TRUNCATE

Поиск
Список
Период
Сортировка

Improve behavior of concurrent TRUNCATE

От
Michael Paquier
Дата:
Hi all,

This is a second thread I am spawning for the previous thread "Canceling
authentication due to timeout aka Denial of Service Attack":
https://www.postgresql.org/message-id/152512087100.19803.12733865831237526317%40wrigleys.postgresql.org

And this time the discussion is about TRUNCATE, as the former thread
discussed about a family of failures hence it is harder to catch the
proper attention.  The problem is that when we look for the relation OID
of the relation to truncate, we don't use a callback with
RangeVarGetRelidExtended, causing a lock acquire attempt to happen
before checking the privileges on the relation for the user running the
command.

Attached is a patch I have been working on which refactors the code of
TRUNCATE in such a way that we check for privileges before trying to
acquire a lock, without any user-facing impact (I have reworked a couple
of comments compared to the last version).  This includes a set of tests
showing the new behavior.

Like cbe24a6, perhaps we would not want to back-patch it?  Based on the
past history (and the consensus being reached for the REINDEX case would
be to patch only HEAD), I would be actually incline to not back-patch
this stuff and qualify that as an improvement.  That's also less work
for me at commit :)

Thoughts?
--
Michael

Вложения

Re: Improve behavior of concurrent TRUNCATE

От
"Bossart, Nathan"
Дата:
Hi,

On 8/6/18, 11:59 AM, "Michael Paquier" <michael@paquier.xyz> wrote:
> Attached is a patch I have been working on which refactors the code of
> TRUNCATE in such a way that we check for privileges before trying to
> acquire a lock, without any user-facing impact (I have reworked a couple
> of comments compared to the last version).  This includes a set of tests
> showing the new behavior.

Here are some comments for the latest version of the patch.

-truncate_check_rel(Relation rel)
+truncate_check_rel(Oid relid, Form_pg_class reltuple)

Could we use HeapTupleGetOid(reltuple) instead of passing the OID
separately?

-    if (rel->rd_rel->relkind != RELKIND_RELATION &&
-        rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+    if (reltuple->relkind != RELKIND_RELATION &&
+
+        reltuple->relkind != RELKIND_PARTITIONED_TABLE)

There appears to be an extra empty line here.

+# Test for lock lookup conflicts with TRUNCATE when working on unowned
+# relations, particularly catalogs should trigger an ERROR for all the
+# scenarios present here.

If possible, it might be worth it to check that TRUNCATE still blocks
when a role has privileges, too.

> Like cbe24a6, perhaps we would not want to back-patch it?  Based on the
> past history (and the consensus being reached for the REINDEX case would
> be to patch only HEAD), I would be actually incline to not back-patch
> this stuff and qualify that as an improvement.  That's also less work
> for me at commit :)

Since the only behavior this patch changes is the order of locking and
permission checks, my vote would be to back-patch.  However, since the
REINDEX piece won't be back-patched, I can see why we might not here,
either.

Nathan


Re: Improve behavior of concurrent TRUNCATE

От
Michael Paquier
Дата:
On Wed, Aug 08, 2018 at 03:38:57PM +0000, Bossart, Nathan wrote:
> Here are some comments for the latest version of the patch.

Thanks for the review, Nathan!

> -truncate_check_rel(Relation rel)
> +truncate_check_rel(Oid relid, Form_pg_class reltuple)
>
> Could we use HeapTupleGetOid(reltuple) instead of passing the OID
> separately?

If that was a HeapTuple.  And it seems to me that we had better make
truncate_check_rel() depend directly on a Form_pg_class instead of
allowing the caller pass anything and deform the tuple within the
routine.

> -    if (rel->rd_rel->relkind != RELKIND_RELATION &&
> -        rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
> +    if (reltuple->relkind != RELKIND_RELATION &&
> +
> +        reltuple->relkind != RELKIND_PARTITIONED_TABLE)
>
> There appears to be an extra empty line here.

Fixed.  I don't know where this has come from.

> +# Test for lock lookup conflicts with TRUNCATE when working on unowned
> +# relations, particularly catalogs should trigger an ERROR for all the
> +# scenarios present here.
>
> If possible, it might be worth it to check that TRUNCATE still blocks
> when a role has privileges, too.

Good idea.  I have added more tests for that.  Doing a truncate on
pg_authid for installcheck was a very bad idea anyway (even if it failed
all the time), so I have switched to a custom table, with a GRANT
command to control the access with a custom role.

> Since the only behavior this patch changes is the order of locking and
> permission checks, my vote would be to back-patch.  However, since the
> REINDEX piece won't be back-patched, I can see why we might not here,
> either.

This patch is a bit more invasive than the REINDEX one to be honest, and
as this is getting qualified as an improvement, at least on this thread,
I would be inclined to be more restrictive and just patch HEAD, but not
v11.

Attached is an updated patch with all your suggestions added.
--
Michael

Вложения

Re: Improve behavior of concurrent TRUNCATE

От
"Bossart, Nathan"
Дата:
On 8/9/18, 5:29 AM, "Michael Paquier" <michael@paquier.xyz> wrote:
>> -truncate_check_rel(Relation rel)
>> +truncate_check_rel(Oid relid, Form_pg_class reltuple)
>> 
>> Could we use HeapTupleGetOid(reltuple) instead of passing the OID
>> separately?
>
> If that was a HeapTuple.  And it seems to me that we had better make
> truncate_check_rel() depend directly on a Form_pg_class instead of
> allowing the caller pass anything and deform the tuple within the
> routine.

Got it.  I agree, it makes sense to force the caller to provide a
Form_pg_class.

>> +# Test for lock lookup conflicts with TRUNCATE when working on unowned
>> +# relations, particularly catalogs should trigger an ERROR for all the
>> +# scenarios present here.
>> 
>> If possible, it might be worth it to check that TRUNCATE still blocks
>> when a role has privileges, too.
>
> Good idea.  I have added more tests for that.  Doing a truncate on
> pg_authid for installcheck was a very bad idea anyway (even if it failed
> all the time), so I have switched to a custom table, with a GRANT
> command to control the access with a custom role.

Good call.  Accidentally truncating pg_authid might have led to some
interesting test results...

>> Since the only behavior this patch changes is the order of locking and
>> permission checks, my vote would be to back-patch.  However, since the
>> REINDEX piece won't be back-patched, I can see why we might not here,
>> either.
>
> This patch is a bit more invasive than the REINDEX one to be honest, and
> as this is getting qualified as an improvement, at least on this thread,
> I would be inclined to be more restrictive and just patch HEAD, but not
> v11.

Alright.

> Attached is an updated patch with all your suggestions added.

Thanks!  This patch builds cleanly, the new tests pass, and my manual
testing hasn't uncovered any issues.  Notably, I cannot reproduce the
originally reported authentication issue by using TRUNCATE after this
change.  Beyond a few small comment wording suggestions below, it
looks good to me.

+        /*
+         * RangeVarGetRelidExtended has done some basic checks with its
+         * callback, and only remain session-level checks.
+         */

This is definitely a nitpick, but I might rephrase this to "...so only
session-level checks remain" or to "...but we still need to do
session-level checks."

+/*
+ * Set of extra sanity checks to check if a given relation is safe to
+ * truncate.  This is split with truncate_check_rel as
+ * RangeVarCallbackForTruncate can only call the former.
+ */
+static void
+truncate_check_activity(Relation rel)

It might be worth mentioning why RangeVarCallbackForTruncate can only
call truncate_check_rel() (we haven't opened the relation yet).

+# Test for lock lookup conflicts with TRUNCATE when working on unowned
+# relations, particularly catalogs should trigger an ERROR for all the
+# scenarios present here.

Since we are testing a few more things now, perhaps this could just be
"Test for locking conflicts with TRUNCATE commands."

+# TRUNCATE will directly fail

Maybe we could say something more like this:
        If the role doesn't have privileges to TRUNCATE the table,
        TRUNCATE should immediately fail without waiting for a lock.

+# TRUNCATE will block
+permutation "s1_begin" "s1_tab_lookup" "s2_grant" "s2_auth" "s2_truncate" "s1_commit" "s2_reset"
+permutation "s1_begin" "s2_grant" "s2_auth" "s2_truncate" "s1_tab_lookup" "s1_commit" "s2_reset"
+permutation "s1_begin" "s2_grant" "s2_auth" "s1_tab_lookup" "s2_truncate" "s1_commit" "s2_reset"
+permutation "s2_grant" "s2_auth" "s2_truncate" "s1_begin" "s1_tab_lookup" "s1_commit" "s2_reset"

The second and fourth tests don't seem to actually block.  Perhaps we
could reword the comment to say something like this:
        If the role has privileges to TRUNCATE the table, TRUNCATE
        will block if another session holds a lock on the table.

Nathan


Re: Improve behavior of concurrent TRUNCATE

От
Michael Paquier
Дата:
On Thu, Aug 09, 2018 at 03:27:04PM +0000, Bossart, Nathan wrote:
> Thanks!  This patch builds cleanly, the new tests pass, and my manual
> testing hasn't uncovered any issues.  Notably, I cannot reproduce the
> originally reported authentication issue by using TRUNCATE after this
> change.  Beyond a few small comment wording suggestions below, it
> looks good to me.

Thanks, I have updated the patch as you suggested.  Any more
improvements to it that you can foresee?

> The second and fourth tests don't seem to actually block.

Yeah, that's intentional.
--
Michael

Вложения

Re: Improve behavior of concurrent TRUNCATE

От
"Bossart, Nathan"
Дата:
On 8/9/18, 11:31 AM, "Michael Paquier" <michael@paquier.xyz> wrote:
> Thanks, I have updated the patch as you suggested.  Any more
> improvements to it that you can foresee?

Looks good to me.

Nathan


Re: Improve behavior of concurrent TRUNCATE

От
Michael Paquier
Дата:
On Thu, Aug 09, 2018 at 05:55:54PM +0000, Bossart, Nathan wrote:
> On 8/9/18, 11:31 AM, "Michael Paquier" <michael@paquier.xyz> wrote:
>> Thanks, I have updated the patch as you suggested.  Any more
>> improvements to it that you can foresee?
>
> Looks good to me.

Okay, pushed to HEAD.  Now remains the cases for VACUUM and we will be
good.  I still need to look more deeply at the last patch sent..
--
Michael

Вложения

Re: Improve behavior of concurrent TRUNCATE

От
Alvaro Herrera
Дата:
On 2018-Aug-06, Michael Paquier wrote:

> Attached is a patch I have been working on which refactors the code of
> TRUNCATE in such a way that we check for privileges before trying to
> acquire a lock, without any user-facing impact (I have reworked a couple
> of comments compared to the last version).  This includes a set of tests
> showing the new behavior.
> 
> Like cbe24a6, perhaps we would not want to back-patch it?  Based on the
> past history (and the consensus being reached for the REINDEX case would
> be to patch only HEAD), I would be actually incline to not back-patch
> this stuff and qualify that as an improvement.  That's also less work
> for me at commit :)

I'm not sure I understand your arguments for not back-patching this.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Improve behavior of concurrent TRUNCATE

От
Michael Paquier
Дата:
On Fri, Aug 10, 2018 at 02:03:28PM -0400, Alvaro Herrera wrote:
> On 2018-Aug-06, Michael Paquier wrote:
>> Like cbe24a6, perhaps we would not want to back-patch it?  Based on the
>> past history (and the consensus being reached for the REINDEX case would
>> be to patch only HEAD), I would be actually incline to not back-patch
>> this stuff and qualify that as an improvement.  That's also less work
>> for me at commit :)
>
> I'm not sure I understand your arguments for not back-patching this.

Mainly consistency.  Looking at the git history for such cases we have
not really bothered back-patching fixes and those have been qualified as
improvements.  If we were to close all the holes mentioned in the
original DOS thread a back-patch to v11 could be thought as acceptable?
That's where the REINDEX fix has found its way after all, but that was
way less code churn, and we are post beta 3 for v11...
--
Michael

Вложения

Re: Improve behavior of concurrent TRUNCATE

От
Alvaro Herrera
Дата:
On 2018-Aug-10, Michael Paquier wrote:

> On Fri, Aug 10, 2018 at 02:03:28PM -0400, Alvaro Herrera wrote:
> > On 2018-Aug-06, Michael Paquier wrote:
> >> Like cbe24a6, perhaps we would not want to back-patch it?  Based on the
> >> past history (and the consensus being reached for the REINDEX case would
> >> be to patch only HEAD), I would be actually incline to not back-patch
> >> this stuff and qualify that as an improvement.  That's also less work
> >> for me at commit :)
> > 
> > I'm not sure I understand your arguments for not back-patching this.
> 
> Mainly consistency.  Looking at the git history for such cases we have
> not really bothered back-patching fixes and those have been qualified as
> improvements.  If we were to close all the holes mentioned in the
> original DOS thread a back-patch to v11 could be thought as acceptable?
> That's where the REINDEX fix has found its way after all, but that was
> way less code churn, and we are post beta 3 for v11...

I was actually thinking in applying to all back-branches, not just pg11,
considering it a fix for a pretty serious bug.  But checking the
history, it seems that Robert patched this is 9.2 as new development
(2ad36c4e4, 1489e2f26, cbe24a6dd, 1da5c1195, 74a1d4fe7); holes remained,
but none was patched until 94da2a6a in pg10 -- took some time!  And then
nobody cared about the ones you're patching now.

So I withdraw my argumentation, mostly because there's clearly not as
much interest in seeing this fixed as all that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Improve behavior of concurrent TRUNCATE

От
Robert Haas
Дата:
On Fri, Aug 10, 2018 at 5:05 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> I was actually thinking in applying to all back-branches, not just pg11,
> considering it a fix for a pretty serious bug.  But checking the
> history, it seems that Robert patched this is 9.2 as new development
> (2ad36c4e4, 1489e2f26, cbe24a6dd, 1da5c1195, 74a1d4fe7); holes remained,
> but none was patched until 94da2a6a in pg10 -- took some time!  And then
> nobody cared about the ones you're patching now.
>
> So I withdraw my argumentation, mostly because there's clearly not as
> much interest in seeing this fixed as all that.

The original patches would, I think, have been pretty scary to
back-patch, since the infrastructure didn't exist in older branches
and we were churning a fairly large amount of code.  Now that most
places are fixed and things have had five years to bake, we could
conceivably back-patch the remaining fixes.  However, I wonder if
we've really looked into how many instances of this problem remain.
If there's still ten more that we haven't bothered to fix,
back-patching one or two that we've gotten around to doing something
about doesn't make a lot of sense to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Improve behavior of concurrent TRUNCATE

От
Michael Paquier
Дата:
On Mon, Aug 13, 2018 at 01:39:06PM -0400, Robert Haas wrote:
> The original patches would, I think, have been pretty scary to
> back-patch, since the infrastructure didn't exist in older branches
> and we were churning a fairly large amount of code.  Now that most
> places are fixed and things have had five years to bake, we could
> conceivably back-patch the remaining fixes.  However, I wonder if
> we've really looked into how many instances of this problem remain.
> If there's still ten more that we haven't bothered to fix,
> back-patching one or two that we've gotten around to doing something
> about doesn't make a lot of sense to me.

If we are confident enough to say that all the holes have been patched,
then we could only back-patch down to v11 in my opinion as REINDEX
needed a change of behavior for the handling of shared catalog.  FWIW, I
have spent some time fixing all the issues reported on the original
thread, but I did not double-check all commands using an exclusive
lock, hence all the issues I have known of are:
- REINDEX with shared catalogs, fixed by 661dd23.
- TRUNCATE, with something commit only on HEAD with f841ceb2.
- VACUUM FULL, for which I have submitted a patch proposal:
https://www.postgresql.org/message-id/20180812222142.GA6097%40paquier.xyz
--
Michael

Вложения

Re: Improve behavior of concurrent TRUNCATE

От
Alvaro Herrera
Дата:
On 2018-Aug-16, Michael Paquier wrote:

> On Mon, Aug 13, 2018 at 01:39:06PM -0400, Robert Haas wrote:
> > The original patches would, I think, have been pretty scary to
> > back-patch, since the infrastructure didn't exist in older branches
> > and we were churning a fairly large amount of code.  Now that most
> > places are fixed and things have had five years to bake, we could
> > conceivably back-patch the remaining fixes.  However, I wonder if
> > we've really looked into how many instances of this problem remain.
> > If there's still ten more that we haven't bothered to fix,
> > back-patching one or two that we've gotten around to doing something
> > about doesn't make a lot of sense to me.
> 
> If we are confident enough to say that all the holes have been patched,
> then we could only back-patch down to v11 in my opinion as REINDEX
> needed a change of behavior for the handling of shared catalog.

I searched for uses of RangeVarGetRelid, as well as heap_openrv and
relation_openrv, and there are a couple that looks very suspicious.  I
don't think we can claim yet that all holes are fixed.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services