Re: [HACKERS] Shaky coding for vacuuming partitioned relations

Поиск
Список
Период
Сортировка
От Michael Paquier
Тема Re: [HACKERS] Shaky coding for vacuuming partitioned relations
Дата
Msg-id CAB7nPqQa37OUne_rJzpmWC4EXQaLX9f27-Ma_-rSFL_3mj+HFQ@mail.gmail.com
обсуждение исходный текст
Ответ на [HACKERS] Shaky coding for vacuuming partitioned relations  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: [HACKERS] Shaky coding for vacuuming partitioned relations
Re: [HACKERS] Shaky coding for vacuuming partitioned relations
Список pgsql-hackers
On Sat, Sep 23, 2017 at 4:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Somebody inserted this into vacuum.c's get_rel_oids():
>
>         tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
>         if (!HeapTupleIsValid(tuple))
>             elog(ERROR, "cache lookup failed for relation %u", relid);
>
> apparently without having read the very verbose comment two lines above,
> which points out that we're not taking any lock on the target relation.
> So, if that relation is concurrently being dropped, you're likely to
> get "cache lookup failed for relation NNNN" rather than anything more
> user-friendly.

This has been overlooked during the lookups of 3c3bb99, and by
multiple people including me. elog() should never be things users can
face, as well as cache lookups.

> A minimum-change fix would be to replace the elog() with an ereport
> that produces the same "relation does not exist" error you'd have
> gotten from RangeVarGetRelid, had the concurrent DROP TABLE committed
> a few microseconds earlier.  But that feels like its's band-aiding
> around the problem.

Yeah, that's not right. This is a cache lookup error at the end.

> What I'm wondering about is changing the RangeVarGetRelid call to take
> ShareUpdateExclusiveLock rather than no lock.  That would protect the
> syscache lookup, and it would also make the find_all_inheritors call
> a lot more meaningful.
>
> If we're doing a VACUUM, the ShareUpdateExclusiveLock would be dropped
> as soon as we close the caller's transaction, and then we'd acquire
> the same or stronger lock inside vacuum_rel().  So that seems fine.
> If we're doing an ANALYZE, then the lock would continue to be held
> and analyze_rel would merely be acquiring it an extra time, so we'd
> actually be removing a race-condition failure scenario for ANALYZE.
> This would mean a few more cycles in lock management, but since this
> only applies to a manual VACUUM or ANALYZE that specifies a table
> name, I'm not too concerned about that.

I think that I am +1 on that. Testing such a thing I am not seeing
anything wrong either. The call to find_all_inheritors should also use
ShareUpdateExclusiveLock, and the lock on top of RangeVarGetRelid()
needs to be reworked.

Attached is a proposal of patch.

> Thoughts?

As long as I don't forget... Another thing currently on HEAD and
REL_10_STABLE is that OIDs of partitioned tables are used, but the
RangeVar of the parent is used for error reports. This leads to
incorrect reports if a partition gets away in the middle of autovacuum
as only information about the parent is reported to the user. I am not
saying that this needs to be fixed in REL_10_STABLE at this stage
though as this would require some refactoring similar to what the
patch adding support for VACUUM with multiple relations does. But I
digress here.
-- 
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: [HACKERS] What's with all the fflush(stderr) calls in pg_standby.c?
Следующее
От: Andres Freund
Дата:
Сообщение: Re: [HACKERS] What's with all the fflush(stderr) calls inpg_standby.c?