Re: [HACKERS] Shaky coding for vacuuming partitioned relations

Поиск

Список

Период

Сортировка

От	Masahiko Sawada
Тема	Re: [HACKERS] Shaky coding for vacuuming partitioned relations
Дата	25 сентября 2017 г. 13:57:31
Msg-id	CAD21AoAVwsyfVpYGMnETZWNH02=nc9C=N8viakagFdX7Jij9Lw@mail.gmail.com обсуждение исходный текст
Ответ на	Re: [HACKERS] Shaky coding for vacuuming partitioned relations (Michael Paquier <michael.paquier@gmail.com>)
Ответы	Re: [HACKERS] Shaky coding for vacuuming partitioned relations
Список	pgsql-hackers

Дерево обсуждения

On Mon, Sep 25, 2017 at 12:10 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sat, Sep 23, 2017 at 4:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Somebody inserted this into vacuum.c's get_rel_oids():
>>
>>         tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
>>         if (!HeapTupleIsValid(tuple))
>>             elog(ERROR, "cache lookup failed for relation %u", relid);
>>
>> apparently without having read the very verbose comment two lines above,
>> which points out that we're not taking any lock on the target relation.
>> So, if that relation is concurrently being dropped, you're likely to
>> get "cache lookup failed for relation NNNN" rather than anything more
>> user-friendly.
>
> This has been overlooked during the lookups of 3c3bb99, and by
> multiple people including me. elog() should never be things users can
> face, as well as cache lookups.

FWIW, the same thing can happen when specifying an invalid replication
origin name to pg_replication_origin_advance() and
pg_replication_origin_progress(). These probably should fixed as well.

>
>> A minimum-change fix would be to replace the elog() with an ereport
>> that produces the same "relation does not exist" error you'd have
>> gotten from RangeVarGetRelid, had the concurrent DROP TABLE committed
>> a few microseconds earlier.  But that feels like its's band-aiding
>> around the problem.
>
> Yeah, that's not right. This is a cache lookup error at the end.
>
>> What I'm wondering about is changing the RangeVarGetRelid call to take
>> ShareUpdateExclusiveLock rather than no lock.  That would protect the
>> syscache lookup, and it would also make the find_all_inheritors call
>> a lot more meaningful.
>>
>> If we're doing a VACUUM, the ShareUpdateExclusiveLock would be dropped
>> as soon as we close the caller's transaction, and then we'd acquire
>> the same or stronger lock inside vacuum_rel().  So that seems fine.
>> If we're doing an ANALYZE, then the lock would continue to be held
>> and analyze_rel would merely be acquiring it an extra time, so we'd
>> actually be removing a race-condition failure scenario for ANALYZE.
>> This would mean a few more cycles in lock management, but since this
>> only applies to a manual VACUUM or ANALYZE that specifies a table
>> name, I'm not too concerned about that.
>
> I think that I am +1 on that. Testing such a thing I am not seeing
> anything wrong either. The call to find_all_inheritors should also use
> ShareUpdateExclusiveLock, and the lock on top of RangeVarGetRelid()
> needs to be reworked.
>
> Attached is a proposal of patch.

FWIW, I agreed the approach of this patch, and found no problems in the patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: [HACKERS] Shaky coding for vacuuming partitioned relations