Discussion: Can someone help explain what's going on from the attached logs?

Can someone help explain what's going on from the attached logs?

From:
Samuel Hwang
Date:
The log is from PostgreSQL 9.0.4.
Basically, we set up a streaming-replication hot-standby slave while the master is under heavy load.
The slave started but is not accepting read-only queries;
every request triggers the "FATAL:  the database system is starting up" error.

The slave will eventually be able to accept read-only queries once the load on the master is removed.
We don't see the same problem if we set up the hot-standby slave while the master is not under heavy load.
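
(For context, the slave configuration would look roughly like the following. The
actual 9.0.4 settings aren't shown in this thread, so the hostname and values
below are placeholders only:)

    # postgresql.conf on the master
    wal_level = hot_standby
    max_wal_senders = 3

    # postgresql.conf on the slave
    hot_standby = on

    # recovery.conf on the slave
    standby_mode = 'on'
    primary_conninfo = 'host=master.example.com port=5432 user=postgres'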

The logs follow:

2011-10-18 10:34:06 MDT [17570]: [1-1] LOG:  database system was interrupted; last known up at 2011-10-18 10:33:57 MDT
2011-10-18 10:34:06 MDT [17570]: [2-1] LOG:  creating missing WAL directory "pg_xlog/archive_status"
2011-10-18 10:34:06 MDT [17570]: [3-1] LOG:  entering standby mode
2011-10-18 10:34:06 MDT [17573]: [1-1] LOG:  streaming replication successfully connected to primary
2011-10-18 10:34:06 MDT [17570]: [4-1] DEBUG:  checkpoint record is at 3/6F1CBD68
2011-10-18 10:34:06 MDT [17570]: [5-1] DEBUG:  redo record is at 3/6F0000E8; shutdown FALSE
2011-10-18 10:34:06 MDT [17570]: [6-1] DEBUG:  next transaction ID: 0/5245499; next OID: 1470356
2011-10-18 10:34:06 MDT [17570]: [7-1] DEBUG:  next MultiXactId: 703; next MultiXactOffset: 1413
2011-10-18 10:34:06 MDT [17570]: [8-1] DEBUG:  oldest unfrozen transaction ID: 654, in database 1
2011-10-18 10:34:06 MDT [17570]: [9-1] DEBUG:  transaction ID wrap limit is 2147484301, limited by database with OID 1
2011-10-18 10:34:06 MDT [17570]: [10-1] DEBUG:  initializing for hot standby
2011-10-18 10:34:06 MDT [17570]: [11-1] LOG:  consistent recovery state reached at 3/6F1CBDC0
2011-10-18 10:34:06 MDT [17570]: [12-1] LOG:  redo starts at 3/6F0000E8
2011-10-18 10:34:06 MDT [17570]: [13-1] LOG:  consistent state delayed because recovery snapshot incomplete
2011-10-18 10:34:06 MDT [17570]: [14-1] CONTEXT:  xlog redo  running xacts: nextXid 5245499 latestCompletedXid 5245497 oldestRunningXid 5244950; 300 xacts: 5245124 5245125 5245126 5245127 5245128 5245129 5245130 5245131 5245132 5245133 5245134 5245135 5245136 5245137 5245138 5245139 5245140 5245141 5245142 5245143 5245144 5245145 5245146 5245147 5245148 5245149 5245150 5245151 5245152 5245153 5245154 5245155 5245156 5245157 5245426 5245427 5245428 5245429 5245430 5245431 5245432 5245433 5245434 5245435 5245436 5245437 5245438 5245439 5245440 5245441 5245442 5245443 5245444 5245445 5245446 5245447 5245448 5245449 5245450 5245451 5245452 5245453 5245454 5245455 5245456 5245498 5244950 5244951 5244952 5244953 5244954 5244955 5244956 5244957 5244958 5244959 5244960 5244961 5244962 5244963 5244964 5244965 5244966 5244967 5244968 5244969 5244970 5244971 5244972 5244973 5244974 5244975 5244976 5244977 5244978 5244979 5244980 5244981 5244982 5244983 5245159 5245160 5245161 5245162 5245163 5245164 5245165 5245166 5245167 5245168 5245169 5245170 5245171 5245172 5245173 5245174 5245175 5245176 5245177 5245178 5245179 5245180 5245181 5245182 5245183 5245184 5245185 5245186 5245187 5245188 5245189 5245190 5245191 5245192 5245193 5245194 5245195 5245196 5245197 5245198 5245199 5245200 5245201 5245202 5245203 5245204 5245205 5245206 5245207 5245208 5245209 5245210 5245211 5245212 5245213 5245214 5245215 5245216 5245217 5245218 5245219 5245222 5245223 5245224 5245225 5244987 5244988 5244989 5244990 5244991 5244992 5244993 5244994 5244995 5244996 5244997 5244998 5244999 5245000 5245001 5245002 5245003 5245004 5245005 5245006 5245007 5245008 5245009 5245010 5245011 5245012 5245013 5245014 5245015 5245016 5245017 5245018 5245019 5245020 5245307 5245308 5245309 5245310 5245311 5245312 5245313 5245314 5245315 5245316 5245317 5245318 5245319 5245320 5245321 5245322 5245323 5245324 5245325 5245326 5245327 5245328 5245329 5245330 5245331 5245332 5245333 5245334 5245335 5245336 5245337 5245242 5245243 5245244 5245245 5245246 5245247 5245248 5245249 5245250 5245251 5245252 5245253 5245254 5245255 5245256 5245257 5245258 5245259 5245260 5245261 5245262 5245263 5245264 5245265 5245266 5245267 5245268 5245269 5245270 5245271 5245272 5245273 5245274 5245275 5245276 5245277 5245062 5245063 5245064 5245065 5245066 5245067 5245068 5245069 5245070 5245071 5245072 5245073 5245074 5245075 5245076 5245077 5245078 5245079 5245080 5245081 5245082 5245083 5245084 5245085 5245086 5245087 5245088 5245089 5245090 5245091 5245092 5245093 5245094 5245095; subxid ovf
2011-10-18 10:34:07 MDT [17593]: [1-1] FATAL:  the database system is starting up
2011-10-18 10:34:08 MDT [17633]: [1-1] FATAL:  the database system is starting up
2011-10-18 10:34:09 MDT [17669]: [1-1] FATAL:  the database system is starting up

Samuel

Re: Can someone help explain what's going on from the attached logs?

From:
Craig Ringer
Date:
On 19/10/11 01:39, Samuel Hwang wrote:
> The log is getting from PostgreSQL 9.0.4
> Basically we set up streaming replication hot-standby slave while master
> is under heavy load
> The slave started but not accepting read-only queries,
> every request will trigger the "FATAL:  the database system is starting
> up" error.
>
> The slave will eventually be able to accept read-only queries after the
> load on master is removed.
> We don't see the same problem if we set up hot-standby slave while
> master is not under heavy load.

It sounds like the standby isn't becoming ready to accept connections
until it first catches up with the master, consuming all available WAL
segments. If this never happens, it never starts accepting connections.
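
(On 9.1, which comes up later in this thread, that catch-up can at least be
watched from the primary side, since the standby itself refuses queries while
it is "starting up". A minimal sketch, not specific to this setup;
pg_stat_replication was only added in 9.1:)

    -- Run on the primary (9.1+): compare the primary's current WAL write
    -- position with what has been sent to and replayed on the standby.
    SELECT application_name,
           pg_current_xlog_location() AS primary_location,
           sent_location,
           replay_location
    FROM pg_stat_replication;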

I'm not sure whether that's by design, or whether there's a particular
reason it works that way. Perhaps someone more familiar with recovery
can explain?

--
Craig Ringer

Re: Can someone help explain what's going on from the attached logs?

From:
Simon Riggs
Date:
On Tue, Oct 18, 2011 at 6:39 PM, Samuel Hwang <samuel@replicon.com> wrote:

> The log is getting from PostgreSQL 9.0.4
> Basically we set up streaming replication hot-standby slave while master is
> under heavy load
> The slave started but not accepting read-only queries,
> every request will trigger the "FATAL:  the database system is starting up"
> error.
>
> The slave will eventually be able to accept read-only queries after the load
> on master is removed.
> We don't see the same problem if we set up hot-standby slave while master is
> not under heavy load.
>
> the logs follow:

> 2011-10-18 10:34:06 MDT [17570]: [13-1] LOG:  consistent state delayed
> because recovery snapshot incomplete

Caveat #2 applies here
http://developer.postgresql.org/pgdocs/postgres/hot-standby.html#HOT-STANDBY-CAVEATS

The consistent state is delayed until your long running transactions
end, which is workload dependent but transient.

It's possible we will find another way of doing this in a future release.

Until then, I suggest starting the base backup that creates the standby
at a time when you are not running both long transactions and
transactions with many subtransactions.
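
(For example, a quick check for long-open transactions on the master before
taking the base backup; a sketch using the 9.0/9.1 pg_stat_activity column
names, which changed in 9.2. It lists open transactions but cannot show how
many subtransactions each one has used:)

    -- Run on the master: open transactions, oldest first.
    SELECT procpid,
           now() - xact_start AS open_for,
           current_query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_start;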

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Can someone help explain what's going on from the attached logs?

From:
Chris Redekop
Date:
> Caveat #2 applies here
> http://developer.postgresql.org/pgdocs/postgres/hot-standby.html#HOT-STANDBY-CAVEATS
>
> The consistent state is delayed until your long running transactions
> end, which is workload dependent but transient.

I'm not quite sure how this correlates to what I'm seeing in 9.1.1. When attempting to start a hot standby while the primary is under load, it seems to get stuck in that 'starting up' mode even when there are no open transactions. If I get it into this state and then remove all load from the primary, it still will not finish starting up. If I select from pg_stat_activity on the primary it shows a couple of connections, but they all have null 'xact_start's and <IDLE> 'current_query's. Even if I then kill all the connections to the primary (via pg_terminate_backend), the hot standby still will not finish starting up. I would assume there can't be any transactions in progress if there are no connections to the primary. Attempting to restart the hot standby while in this state produces the same result. If there is a transaction in progress for the duration of the backup (or something like that), can that cause it to get into this state?
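
(For reference, the checks described above amount to roughly the following,
run on the primary with the 9.1 column names:)

    -- Anything still holding a transaction open?
    SELECT procpid, usename, xact_start, current_query
    FROM pg_stat_activity
    ORDER BY xact_start NULLS LAST;

    -- Terminating the remaining client backends, as described above:
    SELECT pg_terminate_backend(procpid)
    FROM pg_stat_activity
    WHERE procpid <> pg_backend_pid();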


Here is a debug3 log of this startup from 9.1.1:

2011-10-26 11:25:32.931 MDT [22640]: [1-1] LOG:  database system was shut down in recovery at 2011-10-26 11:25:02 MDT
2011-10-26 11:25:32.931 MDT [22640]: [2-1] DEBUG:  standby_mode = 'on'
2011-10-26 11:25:32.931 MDT [22640]: [3-1] DEBUG:  primary_conninfo = 'host=othernode port=5432 user=postgres application_name=sync_rep_test'
2011-10-26 11:25:32.931 MDT [22640]: [4-1] DEBUG:  trigger_file = '/db/data/trigger_file'
2011-10-26 11:25:32.931 MDT [22640]: [5-1] LOG:  entering standby mode
2011-10-26 11:25:32.931 MDT [22640]: [6-1] DEBUG:  checkpoint record is at 7/96667628
2011-10-26 11:25:32.931 MDT [22640]: [7-1] DEBUG:  redo record is at 7/960000A8; shutdown FALSE
2011-10-26 11:25:32.931 MDT [22640]: [8-1] DEBUG:  next transaction ID: 0/11855637; next OID: 3219844
2011-10-26 11:25:32.931 MDT [22640]: [9-1] DEBUG:  next MultiXactId: 1895; next MultiXactOffset: 3810
2011-10-26 11:25:32.931 MDT [22640]: [10-1] DEBUG:  oldest unfrozen transaction ID: 1669, in database 1
2011-10-26 11:25:32.931 MDT [22640]: [11-1] DEBUG:  transaction ID wrap limit is 2147485316, limited by database with OID 1
2011-10-26 11:25:32.932 MDT [22640]: [12-1] DEBUG:  resetting unlogged relations: cleanup 1 init 0
2011-10-26 11:25:32.939 MDT [22640]: [13-1] DEBUG:  initializing for hot standby
2011-10-26 11:25:32.939 MDT [22640]: [14-1] LOG:  redo starts at 7/960000A8
2011-10-26 11:25:33.027 MDT [22640]: [15-1] DEBUG:  running transaction data initialized
2011-10-26 11:25:33.027 MDT [22640]: [16-1] CONTEXT:  xlog redo  running xacts: nextXid 11855637 latestCompletedXid 11855636 oldestRunningXid 11855250; 195 xacts: 11855359 11855360 ... 11855444 11855445; subxid ovf
2011-10-26 11:25:33.028 MDT [22640]: [17-1] DEBUG:  0 KnownAssignedXids (num=0 tail=0 head=0)
2011-10-26 11:25:33.028 MDT [22640]: [18-1] CONTEXT:  xlog redo  running xacts: nextXid 11855637 latestCompletedXid 11855636 oldestRunningXid 11855250; 195 xacts: 11855359 11855360 ... 11855444 11855445; subxid ovf
2011-10-26 11:25:33.028 MDT [22640]: [19-1] LOG:  consistent state delayed because recovery snapshot incomplete
2011-10-26 11:25:33.028 MDT [22640]: [20-1] CONTEXT:  xlog redo  running xacts: nextXid 11855637 latestCompletedXid 11855636 oldestRunningXid 11855250; 195 xacts: 11855359 11855360 ... 11855444 11855445; subxid ovf
2011-10-26 11:25:33.514 MDT [22640]: [21-1] LOG:  consistent recovery state reached at 7/98480E28
2011-10-26 11:25:33.515 MDT [22640]: [22-1] LOG:  record with zero length at 7/98480E28
2011-10-26 11:25:33.515 MDT [22640]: [23-1] DEBUG:  could not open file "pg_xlog/000000010000000700000098" (log file 7, segment 152): No such file or directory
2011-10-26 11:25:33.515 MDT [22642]: [1-1] DEBUG:  find_in_dynamic_libpath: trying "/usr/local/postgresql-9.1.1/lib/libpqwalreceiver"
2011-10-26 11:25:33.515 MDT [22642]: [2-1] DEBUG:  find_in_dynamic_libpath: trying "/usr/local/postgresql-9.1.1/lib/libpqwalreceiver.so"
DEBUG:  received replication command: IDENTIFY_SYSTEM
DEBUG:  received replication command: START_REPLICATION 7/98000000
2011-10-26 11:25:33.519 MDT [22642]: [3-1] LOG:  streaming replication successfully connected to primary
2011-10-26 11:25:33.519 MDT [22642]: [4-1] DEBUG:  sending write 0/0 flush 0/0 apply 7/98480E28
2011-10-26 11:25:33.522 MDT [22642]: [5-1] DEBUG:  sending write 7/98020000 flush 0/0 apply 7/98480E28
2011-10-26 11:25:33.568 MDT [22642]: [6-1] DEBUG:  sending write 7/98020000 flush 7/98020000 apply 7/98480E28
2011-10-26 11:25:33.569 MDT [22642]: [7-1] DEBUG:  sending write 7/98040000 flush 7/98020000 apply 7/98480E28
...
2011-10-26 11:25:33.924 MDT [22642]: [49-1] DEBUG:  sending write 7/98460000 flush 7/98400000 apply 7/98480E28
2011-10-26 11:25:33.943 MDT [22642]: [50-1] DEBUG:  sending write 7/98460000 flush 7/98460000 apply 7/98480E28
DEBUG:  standby "sync_rep_test" has now caught up with primary
2011-10-26 11:25:33.943 MDT [22642]: [51-1] DEBUG:  sending write 7/98480E28 flush 7/98460000 apply 7/98480E28
2011-10-26 11:25:33.959 MDT [22642]: [52-1] DEBUG:  sending write 7/98480E28 flush 7/98480E28 apply 7/98480E28
2011-10-26 11:25:34.192 MDT [22634]: [33-1] DEBUG:  forked new backend, pid=22644 socket=7
2011-10-26 11:25:34.192 MDT [22644]: [1-1] FATAL:  the database system is starting up
2011-10-26 11:25:34.192 MDT [22644]: [2-1] DEBUG:  shmem_exit(1): 0 callbacks to make
2011-10-26 11:25:34.192 MDT [22644]: [3-1] DEBUG:  proc_exit(1): 1 callbacks to make
2011-10-26 11:25:34.192 MDT [22644]: [4-1] DEBUG:  exit(1)
2011-10-26 11:25:34.192 MDT [22644]: [5-1] DEBUG:  shmem_exit(-1): 0 callbacks to make
2011-10-26 11:25:34.192 MDT [22644]: [6-1] DEBUG:  proc_exit(-1): 0 callbacks to make
2011-10-26 11:25:34.192 MDT [22634]: [34-1] DEBUG:  server process (PID 22644) exited with exit code 1
2011-10-26 11:25:35.171 MDT [22634]: [35-1] DEBUG:  forked new backend, pid=22655 socket=7
2011-10-26 11:25:35.171 MDT [22655]: [1-1] FATAL:  the database system is starting up
2011-10-26 11:25:35.171 MDT [22655]: [2-1] DEBUG:  shmem_exit(1): 0 callbacks to make
2011-10-26 11:25:35.171 MDT [22655]: [3-1] DEBUG:  proc_exit(1): 1 callbacks to make
2011-10-26 11:25:35.171 MDT [22655]: [4-1] DEBUG:  exit(1)
2011-10-26 11:25:35.171 MDT [22655]: [5-1] DEBUG:  shmem_exit(-1): 0 callbacks to make
2011-10-26 11:25:35.171 MDT [22655]: [6-1] DEBUG:  proc_exit(-1): 0 callbacks to make
2011-10-26 11:25:35.172 MDT [22634]: [36-1] DEBUG:  server process (PID 22655) exited with exit code 1
and so on...

Re: Can someone help explain what's going on from the attached logs?

From:
Simon Riggs
Date:
On Wed, Oct 26, 2011 at 6:41 PM, Chris Redekop <chris@replicon.com> wrote:
>> Caveat #2 applies here
>>
>> http://developer.postgresql.org/pgdocs/postgres/hot-standby.html#HOT-STANDBY-CAVEATS
>>
>> The consistent state is delayed until your long running transactions
>> end, which is workload dependent but transient.
>
> I'm not quite sure how this correlates to what I'm seeing in 9.1.1.  When
> attempting to start a hotstandby while the primary is under load it seems to
> get stuck in that 'starting up' mode even when there are no open
> transactions.  If I get it into this state, and then remove all the load
> from the primary it still will not finish starting up.  If I select from
> pg_stat_activity on the primary it shows a couple connections, but they all
> have null 'xact_start's and <IDLE> 'current_query's.  Even if I then kill
> all the connections to the primary (via pg_terminate_backend) the hotstandby
> still will not finish starting up.  I would assume there can't be any
> transactions in progress if there are no connections to the primary.
>  Attempting to restart the hotstandby when in this state produces the same
> result.  If there is a transaction in progress for the duration of the
> backup (or something like that) can it cause it to get into this state?

There's nothing in the log you've shown to indicate any problems.

Yes, when that caveat applies we may wait for some time to find a good
starting point. That could be anywhere from seconds to hours,
depending upon the exact load on the master, but shouldn't be any
longer than your longest running write transaction executing at that
time.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services