Обсуждение: Concurrency bug with vacuum full (cluster) and toast

Поиск
Список
Период
Сортировка

Concurrency bug with vacuum full (cluster) and toast

От
Alexander Korotkov
Дата:
Hi all,

I've discovered bug, when vacuum full fails with error, because it
couldn't find toast chunks deleted by itself.  That happens because
cluster_rel() sets OldestXmin, but toast accesses gets snapshot later
and independently.  That causes heap_page_prune_opt() to clean chunks,
which rebuild_relation() expects to exist.  This bug very rarely
happens on busy systems which actively update toast values.  But I
found way to reliably reproduce it using debugger.

*Setup*

CREATE FUNCTION random_string(seed integer, length integer) RETURNS text
    AS $$
        SELECT substr(
                    string_agg(
                        substr(
                            encode(
                                decode(
                                    md5(seed::text || '-' || i::text),
                                    'hex'),
                                'base64'),
                            1, 21),
                        ''),
                    1, length)
        FROM generate_series(1, (length + 20) / 21) i; $$
LANGUAGE SQL;

CREATE TABLE test (val text);
INSERT INTO test (random_string(1,100000));

*Reproduction steps*

s1-s3 are three parallel PostgreSQL sessions
s3lldb is lldb connected to s1

At first s1 acquires snapshot and holds it.

s1# begin transaction isolation level repeatable read;
s1# select 1;

Then s2 makes multiple updates of our toasted value.

s2# update test set val = random_string(2,100000);
s2# update test set val = random_string(3,100000);
s2# update test set val = random_string(4,100000);
s2# update test set val = random_string(5,100000);
s2# update test set val = random_string(6,100000);
s2# update test set val = random_string(7,100000);

Then s3 starting vacuum full stopping on vacuum_set_xid_limits().

s3lldb# b vacuum_set_xid_limits
s3# vacuum full test;

We pass vacuum_set_xid_limits() making sure old tuple versions made by
s2 would be recently dead for vacuum full.

s3lldb# finish

Then s1 releases snapshot.  Then heap_page_prune_opt() called from
toast accessed would cleanup toast chunks, which vacuum full expects
to be recently dead.

s1# commit;

Finally, we continue our vacuum full and get error!

s3lldb# continue
s3#
ERROR:  unexpected chunk number 50 (expected 2) for toast value 16429
in pg_toast_16387

Attached patch contains dirty fix of this bug, which just prevents
heap_page_prune_opt() from clean tuples, when it's called from
rebuild_relation().  Actually, it's not something I'm proposing to
commit or even review, it might be just some start point for thoughts.

Any ideas?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Вложения

Re: Concurrency bug with vacuum full (cluster) and toast

От
Robert Haas
Дата:
On Mon, Mar 18, 2019 at 12:53 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> I've discovered bug, when vacuum full fails with error, because it
> couldn't find toast chunks deleted by itself.  That happens because
> cluster_rel() sets OldestXmin, but toast accesses gets snapshot later
> and independently.  That causes heap_page_prune_opt() to clean chunks,
> which rebuild_relation() expects to exist.  This bug very rarely
> happens on busy systems which actively update toast values.  But I
> found way to reliably reproduce it using debugger.

Boy, I really feel like we've talked about this before.  These are
somewhat-related discussions, but none of them are exactly the same
thing:

http://postgr.es/m/1335.1304187758@sss.pgh.pa.us
http://postgr.es/m/20362.1359747327@sss.pgh.pa.us
http://postgr.es/m/87in8nec96.fsf@news-spur.riddles.org.uk

I don't know whether we've actually talked about this precise problem
before and I just can't find the thread, or whether I'm confusing what
you've found here with some closely-related issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Concurrency bug with vacuum full (cluster) and toast

От
Alexander Korotkov
Дата:
On Tue, Mar 19, 2019 at 6:48 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Mar 18, 2019 at 12:53 PM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > I've discovered bug, when vacuum full fails with error, because it
> > couldn't find toast chunks deleted by itself.  That happens because
> > cluster_rel() sets OldestXmin, but toast accesses gets snapshot later
> > and independently.  That causes heap_page_prune_opt() to clean chunks,
> > which rebuild_relation() expects to exist.  This bug very rarely
> > happens on busy systems which actively update toast values.  But I
> > found way to reliably reproduce it using debugger.
>
> Boy, I really feel like we've talked about this before.  These are
> somewhat-related discussions, but none of them are exactly the same
> thing:
>
> http://postgr.es/m/1335.1304187758@sss.pgh.pa.us
> http://postgr.es/m/20362.1359747327@sss.pgh.pa.us
> http://postgr.es/m/87in8nec96.fsf@news-spur.riddles.org.uk
>
> I don't know whether we've actually talked about this precise problem
> before and I just can't find the thread, or whether I'm confusing what
> you've found here with some closely-related issue.

Thank you for pointing, but none of the threads you pointed describe
this exact problem.  Now I see this bug have a set of cute siblings :)

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Concurrency bug with vacuum full (cluster) and toast

От
Robert Haas
Дата:
On Tue, Mar 19, 2019 at 1:37 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Thank you for pointing, but none of the threads you pointed describe
> this exact problem.  Now I see this bug have a set of cute siblings :)

Yeah.  I really thought this precise issue -- the interlocking between
the VACUUM of the main table and the VACUUM of the TOAST table -- had
been discussed somewhere before.  But I couldn't find that discussion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Concurrency bug with vacuum full (cluster) and toast

От
Justin Pryzby
Дата:
Hi,

On Fri, Mar 22, 2019 at 02:27:07PM -0400, Robert Haas wrote:
> On Tue, Mar 19, 2019 at 1:37 PM Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
> > Thank you for pointing, but none of the threads you pointed describe
> > this exact problem.  Now I see this bug have a set of cute siblings :)
> 
> Yeah.  I really thought this precise issue -- the interlocking between
> the VACUUM of the main table and the VACUUM of the TOAST table -- had
> been discussed somewhere before.  But I couldn't find that discussion.

That also describes the longstanding issue with pg_statistic / pg_toast_2619,
no ?

I think that's maybe what Robert is remembering, and searching for
pg_toast_2619 gives a good number of results (including my own problem report).

Is this an "Opened Item" ?



Re: Concurrency bug with vacuum full (cluster) and toast

От
Stephen Frost
Дата:
Greetings,

* Justin Pryzby (pryzby@telsasoft.com) wrote:
> On Fri, Mar 22, 2019 at 02:27:07PM -0400, Robert Haas wrote:
> > On Tue, Mar 19, 2019 at 1:37 PM Alexander Korotkov
> > <a.korotkov@postgrespro.ru> wrote:
> > > Thank you for pointing, but none of the threads you pointed describe
> > > this exact problem.  Now I see this bug have a set of cute siblings :)
> >
> > Yeah.  I really thought this precise issue -- the interlocking between
> > the VACUUM of the main table and the VACUUM of the TOAST table -- had
> > been discussed somewhere before.  But I couldn't find that discussion.
>
> That also describes the longstanding issue with pg_statistic / pg_toast_2619,
> no ?
>
> I think that's maybe what Robert is remembering, and searching for
> pg_toast_2619 gives a good number of results (including my own problem report).
>
> Is this an "Opened Item" ?

If you're referring to the v12 open items list, then, no, I wouldn't
think it would be as it's not a new issue (unless I've misunderstood).
Only regressions from prior versions are appropriate for the v12 open
items list, long-standing bugs/issues should be addressed and fixed, of
course, but those would be fixed and then back-patched.

Thanks!

Stephen

Вложения

Re: Concurrency bug with vacuum full (cluster) and toast

От
Michael Paquier
Дата:
On Wed, Apr 03, 2019 at 11:26:20AM -0400, Stephen Frost wrote:
> If you're referring to the v12 open items list, then, no, I wouldn't
> think it would be as it's not a new issue (unless I've misunderstood).
> Only regressions from prior versions are appropriate for the v12 open
> items list, long-standing bugs/issues should be addressed and fixed, of
> course, but those would be fixed and then back-patched.

Please no open items which do not apply directly and only to v12.
There is a section on the page for older bugs however, which could
prove to be useful for this case (items listed in this section do not
have any impact on the release normally):
https://wiki.postgresql.org/wiki/PostgreSQL_12_Open_Items#Older_Bugs
--
Michael

Вложения