Обсуждение: BUG #10542: infinite loop in index.c when trying to reindex system tables (probably corrupted db state)
BUG #10542: infinite loop in index.c when trying to reindex system tables (probably corrupted db state)
От
hannes.janetzek@gmail.com
Дата:
The following bug has been logged on the website: Bug reference: 10542 Logged by: Hannes Janetzek Email address: hannes.janetzek@gmail.com PostgreSQL version: 9.3.4 Operating system: Linux, Ubuntu 12.04 Description: While trying to get our database working again after a forced shutdown the reindexing of the system tables in single user mode went into an infinite loop. I can just roughly guess that between the lines: https://github.com/postgres/postgres/blob/REL9_3_STABLE/src/backend/catalog/index.c#L2260- L2385 the function is assuming that another process tries to delete a tuple that is about to be indexed (even though in single user mode this should probably not be possible) I found these lines by attaching gdb and stepping through. The command to reindex was: sudo -u postgres /usr/lib/postgresql/9.3/bin/postgres --single -D /etc/postgresql/9.3/main -d 5 osm backend> REINDEX SYSTEM osm; 2014-06-05 23:36:27 CEST STATEMENT: REINDEX SYSTEM osm; 2014-06-05 23:36:27 CEST DEBUG: ProcessUtility 2014-06-05 23:36:27 CEST DEBUG: CommitTransaction 2014-06-05 23:36:27 CEST DEBUG: name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children: 2014-06-05 23:36:27 CEST DEBUG: StartTransaction 2014-06-05 23:36:27 CEST DEBUG: name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children: 2014-06-05 23:36:27 CEST DEBUG: building index "pg_class_oid_index" on table "pg_class" 2014-06-05 23:36:27 CEST DEBUG: building index "pg_class_relname_nsp_index" on table "pg_class" 2014-06-05 23:36:27 CEST NOTICE: table "pg_catalog.pg_class" was reindexed 2014-06-05 23:36:27 CEST DEBUG: CommitTransaction 2014-06-05 23:36:27 CEST DEBUG: name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 196462163/1/4, nestlvl: 1, children: 2014-06-05 23:36:27 CEST DEBUG: StartTransaction 2014-06-05 23:36:27 CEST DEBUG: name: unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children: 2014-06-05 23:36:27 CEST DEBUG: building index "pg_statistic_relid_att_inh_index" on table "pg_statistic"
Hi, On 2014-06-05 23:00:56 +0000, hannes.janetzek@gmail.com wrote: > While trying to get our database working again after a forced shutdown the > reindexing of the system tables in single user mode went into an infinite > loop. what happened in that infinite loop? The log excerpt below doesn't show one? Is it constantly echoing a message? could you explain how you got into the bad state? Are you using fsync=off? > I can just roughly guess that between the lines: > https://github.com/postgres/postgres/blob/REL9_3_STABLE/src/backend/catalog/index.c#L2260- > L2385 > the function is assuming that another process tries to delete a tuple that > is about to be indexed (even though in single user mode this should probably > not be possible) We've recently fixed a bug around that - but I don't immediately see how it could be applicable in that scenario. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: BUG #10542: infinite loop in index.c when trying to reindex system tables (probably corrupted db state)
От
"hannes.janetzek@gmail.com"
Дата:
Hi, On Fri, Jun 6, 2014 at 6:34 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Hi, > > On 2014-06-05 23:00:56 +0000, hannes.janetzek@gmail.com wrote: > > While trying to get our database working again after a forced shutdown > the > > reindexing of the system tables in single user mode went into an infinite > > loop. > > what happened in that infinite loop? The log excerpt below doesn't show > one? Is it constantly echoing a message? > there were no messages from postgres while looping. I attached gdb and stepped through the lines and found the range around L2260-L2385 repeating. perf also showed only activity below IndexBuildHeapScan while tracing for a few minutes. From looking at the source my guess was that a tuple that is being indexed has a stale 'about-to-be-deleted-state'. The *very* well documented source states that 'we wait for the deleting transaction to finish and check again' I wonder if a deleting transaction can be in progress in single-user-mode while reindex in running - Though I really don't have any clue about pg internals :) > could you explain how you got into the bad state? Are you using > fsync=off? > Yes, the instance was running without fsync. We use it for rendering openstreetmap map tiles so content is not that critical. Regards, Hannes > > I can just roughly guess that between the lines: > > > https://github.com/postgres/postgres/blob/REL9_3_STABLE/src/backend/catalog/index.c#L2260- > > L2385 > > the function is assuming that another process tries to delete a tuple > that > > is about to be indexed (even though in single user mode this should > probably > > not be possible) > > We've recently fixed a bug around that - but I don't immediately see how > it could be applicable in that scenario. > > > Greetings, > > Andres Freund > > -- > Andres Freund http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services > -- Hannes
Hi, On 2014-06-07 18:11:01 +0200, hannes.janetzek@gmail.com wrote: > On Fri, Jun 6, 2014 at 6:34 PM, Andres Freund <andres@2ndquadrant.com> > wrote: > > On 2014-06-05 23:00:56 +0000, hannes.janetzek@gmail.com wrote: > > > While trying to get our database working again after a forced shutdown > > the > > > reindexing of the system tables in single user mode went into an infinite > > > loop. > > > > what happened in that infinite loop? The log excerpt below doesn't show > > one? Is it constantly echoing a message? > there were no messages from postgres while looping. I attached gdb and > stepped through the lines and found the range around L2260-L2385 repeating. > perf also showed only activity below IndexBuildHeapScan while tracing for a > few minutes. From looking at the source my guess was that a tuple that is > being indexed has a stale 'about-to-be-deleted-state'. The *very* well > documented source states that 'we wait for the deleting transaction to > finish and check again' I wonder if a deleting transaction can be in > progress in single-user-mode while reindex in running - Though I really > don't have any clue about pg internals :) It probably wasn't actually in progress - but when you're using fsync=off and experience a OS level crash the data directory can get into an inconsistent state. Since transactions ids that are thought to not yet be assigned are treated as being in progress an issue like what you describe certainly is possible. There's only a limited amount of defense one can build in against basically arbitrary corruption. > > could you explain how you got into the bad state? Are you using > > fsync=off? > Yes, the instance was running without fsync. We use it for rendering > openstreetmap map tiles so content is not that critical. I'd suggest writing off a database whose machine has crashed while being written to when using fsync=off. The chance of hard to diagnose/undetected corruption is just too high. Greetings, Andres Freund