Обсуждение: BUG #10542: infinite loop in index.c when trying to reindex system tables (probably corrupted db state)

Поиск
Список
Период
Сортировка

BUG #10542: infinite loop in index.c when trying to reindex system tables (probably corrupted db state)

От
hannes.janetzek@gmail.com
Дата:
The following bug has been logged on the website:

Bug reference:      10542
Logged by:          Hannes Janetzek
Email address:      hannes.janetzek@gmail.com
PostgreSQL version: 9.3.4
Operating system:   Linux, Ubuntu 12.04
Description:

While trying to get our database working again after a forced shutdown the
reindexing of the system tables in single user mode went into an infinite
loop. I can just roughly guess that between the lines:
https://github.com/postgres/postgres/blob/REL9_3_STABLE/src/backend/catalog/index.c#L2260-
L2385
the function is assuming that another process tries to delete a tuple that
is about to be indexed (even though in single user mode this should probably
not be possible)

I found these lines by attaching gdb and stepping through. The command to
reindex was:

sudo -u postgres /usr/lib/postgresql/9.3/bin/postgres --single -D
/etc/postgresql/9.3/main -d 5 osm
backend> REINDEX SYSTEM osm;

2014-06-05 23:36:27 CEST STATEMENT:  REINDEX SYSTEM osm;

2014-06-05 23:36:27 CEST DEBUG:  ProcessUtility
2014-06-05 23:36:27 CEST DEBUG:  CommitTransaction
2014-06-05 23:36:27 CEST DEBUG:  name: unnamed; blockState:       STARTED;
state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children:
2014-06-05 23:36:27 CEST DEBUG:  StartTransaction
2014-06-05 23:36:27 CEST DEBUG:  name: unnamed; blockState:       DEFAULT;
state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children:
2014-06-05 23:36:27 CEST DEBUG:  building index "pg_class_oid_index" on
table "pg_class"
2014-06-05 23:36:27 CEST DEBUG:  building index "pg_class_relname_nsp_index"
on table "pg_class"
2014-06-05 23:36:27 CEST NOTICE:  table "pg_catalog.pg_class" was reindexed
2014-06-05 23:36:27 CEST DEBUG:  CommitTransaction
2014-06-05 23:36:27 CEST DEBUG:  name: unnamed; blockState:       STARTED;
state: INPROGR, xid/subid/cid: 196462163/1/4, nestlvl: 1, children:
2014-06-05 23:36:27 CEST DEBUG:  StartTransaction
2014-06-05 23:36:27 CEST DEBUG:  name: unnamed; blockState:       DEFAULT;
state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children:
2014-06-05 23:36:27 CEST DEBUG:  building index
"pg_statistic_relid_att_inh_index" on table "pg_statistic"
Hi,

On 2014-06-05 23:00:56 +0000, hannes.janetzek@gmail.com wrote:
> While trying to get our database working again after a forced shutdown the
> reindexing of the system tables in single user mode went into an infinite
> loop.

what happened in that infinite loop? The log excerpt below doesn't show
one? Is it constantly echoing a message?

could you explain how you got into the bad state? Are you using
fsync=off?

> I can just roughly guess that between the lines:
> https://github.com/postgres/postgres/blob/REL9_3_STABLE/src/backend/catalog/index.c#L2260-
> L2385
> the function is assuming that another process tries to delete a tuple that
> is about to be indexed (even though in single user mode this should probably
> not be possible)

We've recently fixed a bug around that - but I don't immediately see how
it could be applicable in that scenario.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #10542: infinite loop in index.c when trying to reindex system tables (probably corrupted db state)

От
"hannes.janetzek@gmail.com"
Дата:
Hi,


On Fri, Jun 6, 2014 at 6:34 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

> Hi,
>
> On 2014-06-05 23:00:56 +0000, hannes.janetzek@gmail.com wrote:
> > While trying to get our database working again after a forced shutdown
> the
> > reindexing of the system tables in single user mode went into an infinite
> > loop.
>
> what happened in that infinite loop? The log excerpt below doesn't show
> one? Is it constantly echoing a message?
>

there were no messages from postgres while looping. I attached gdb and
stepped through the lines and found the range around L2260-L2385 repeating.
perf also showed only activity below IndexBuildHeapScan while tracing for a
few minutes. From looking at the source my guess was that a tuple that is
being indexed has a stale 'about-to-be-deleted-state'.  The *very* well
documented source states that 'we wait for the deleting transaction to
finish and check again' I wonder if a deleting transaction can be in
progress in single-user-mode while reindex in running - Though I really
don't have any clue about pg internals :)


> could you explain how you got into the bad state? Are you using
> fsync=off?
>


Yes, the instance was running without fsync. We use it for rendering
openstreetmap map tiles so content is not that critical.


Regards,
Hannes


> > I can just roughly guess that between the lines:
> >
> https://github.com/postgres/postgres/blob/REL9_3_STABLE/src/backend/catalog/index.c#L2260-
> > L2385
> > the function is assuming that another process tries to delete a tuple
> that
> > is about to be indexed (even though in single user mode this should
> probably
> > not be possible)
>
> We've recently fixed a bug around that - but I don't immediately see how
> it could be applicable in that scenario.
>
>



> Greetings,
>
> Andres Freund
>
> --
>  Andres Freund                     http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>



--
Hannes
Hi,

On 2014-06-07 18:11:01 +0200, hannes.janetzek@gmail.com wrote:
> On Fri, Jun 6, 2014 at 6:34 PM, Andres Freund <andres@2ndquadrant.com>
> wrote:
> > On 2014-06-05 23:00:56 +0000, hannes.janetzek@gmail.com wrote:
> > > While trying to get our database working again after a forced shutdown
> > the
> > > reindexing of the system tables in single user mode went into an infinite
> > > loop.
> >
> > what happened in that infinite loop? The log excerpt below doesn't show
> > one? Is it constantly echoing a message?

> there were no messages from postgres while looping. I attached gdb and
> stepped through the lines and found the range around L2260-L2385 repeating.
> perf also showed only activity below IndexBuildHeapScan while tracing for a
> few minutes. From looking at the source my guess was that a tuple that is
> being indexed has a stale 'about-to-be-deleted-state'.  The *very* well
> documented source states that 'we wait for the deleting transaction to
> finish and check again' I wonder if a deleting transaction can be in
> progress in single-user-mode while reindex in running - Though I really
> don't have any clue about pg internals :)

It probably wasn't actually in progress - but when you're using
fsync=off and experience a OS level crash the data directory can get
into an inconsistent state. Since transactions ids that are thought to
not yet be assigned are treated as being in progress an issue like what
you describe certainly is possible.
There's only a limited amount of defense one can build in against
basically arbitrary corruption.

> > could you explain how you got into the bad state? Are you using
> > fsync=off?

> Yes, the instance was running without fsync. We use it for rendering
> openstreetmap map tiles so content is not that critical.

I'd suggest writing off a database whose machine has crashed while being
written to when using fsync=off. The chance of hard to
diagnose/undetected corruption is just too high.

Greetings,

Andres Freund