Problem with FSM & streaming replication

Поиск
Список
Период
Сортировка
От Philip Scott
Тема Problem with FSM & streaming replication
Дата
Msg-id CAJakt-8=aXa-F7uFeLAeSYhQ4wFuaX3+ytDuDj9c8Gx6S_ou=w@mail.gmail.com
обсуждение исходный текст
Ответы Re: Problem with FSM & streaming replication  (Amit Kapila <amit.kapila16@gmail.com>)
Список pgsql-bugs
Hi Folks,

Earlier this week we simulated failing over to our secondary data center.
We have been using streaming replication, with the slaves available for
read-only queries.

Shortly after failing over, we began getting the following message when
inserting into one particular table:

could not read block 11 in file "base/16401/16748": read only 0 of 8192
bytes

This particular relation is only 10 blocks long.

I got us running again by COPY-ing the table, truncating it, and COPY-ing
the data back in. However I bought up a backup to investigate the problem.

You can replicate it by downloading the following database, and running
'stress.py' (though you will need python & psycopg2 to do so)

https://www.dropbox.com/s/vid55gahlu07tm0/2016-07-04-basic3.tar.bz2?dl=0

After a deep dive into the postgres source code, finding things I never
knew I could do with GDB (record and rewind, wow!), and with help from the
guys at 2ndquadrant, we have determined that the cause is a free space map
that contains an entry for the non-existent block 11.
RecordAndGetPageWithFreeSpace is looking at the FSM and returning block 11.
We then try and load that block and then it's game over. I have stepped
through the code to verify this, and even built a graphviz graph of a dump
of the FSM structure :)

After discovering the wonders of pageinspect, I was able to look at the
entire database and found 23 tables where there are non-zero FSM entries
for pages numbers which do not exist. Running vacuum on them fixes it. This
is the query I used to check:

WITH stuff
     AS (SELECT nspname,
                relname,
                relpages,
                *pg_freespace*(nspname
                             || '."'
                             || relname
                             || '"', relpages) AS freespace
         FROM   pg_class
                JOIN pg_catalog.pg_namespace AS n
                  ON n.oid = pg_class.relnamespace)
SELECT *
FROM   stuff
WHERE  freespace != 0

So I guess I have a couple of questions:

   - Is it valid for the FSM to contain an non-zero leaf entry for a block
   that does not exist? If so, should I submit a patch which checks before
   loading that the block we looked up in the FSM is < relpages
   - If it is not valid, is there possibly a problem with streaming
   replication with regards to the FSM? I am at a bit of a loss as to where to
   continue investigating that.
   - Is it perhaps that in some point in history, possibly with an older
   version of postgres, these FSMs got messed up but are unlikely to in the
   future. I should just vacuum them (or pg_dump/restore?) and carry on with
   business as usual.
   - As an aside, is VACUUM the only process that truncates a filenode and
   reduces the number of pages in a relation?

I note with interest that the FSM is not WAL logged, and wonder if
something like the following could occur:

   - On the master, we have 11 blocks.
   - We rsync / pg_basebackup the whole pg_data directory to the slave
   - On the master, we delete some stuff, vacuum the table, now we have
   only 10 blocks
   - Table is WAL logged, so the main relation data file is truncated on
   the master and the slave
   - However, since the FSM is not WAL logged, the FSM on the slave is not
   updated to know this and still talks about block 11
   - We go looking for a space to put a lovely new tuple, and by chance
   start somewhere in the tree that leads us eventually to block 11
   - Try and read block 11
   - Sad times

Some possibly useful facts:

   - We are on postgres 9.5.3, running RHEL 6. The database has never to my
   knowledge crashed; they are 'proper' server boxes, battery backed cache,
   UPS, in a data center etc..
   - The tables that are effected are mostly ones where there are many
   INSERTs and DELETEs, but the table does not change much in size over time.

Thank you very much for your time.

Kind Regards,

Philip

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: BUG #14230: Wrong timeline returned by pg_stop_backup on a standby
Следующее
От: Marco Nenciarini
Дата:
Сообщение: Re: BUG #14230: Wrong timeline returned by pg_stop_backup on a standby