WAL archiving file name collision

Поиск
Список
Период
Сортировка
От Berge Schwebs Bjørlo
Тема WAL archiving file name collision
Дата
Msg-id 20090909101732.GD13172@trivini.no
обсуждение исходный текст
Ответы Re: WAL archiving file name collision  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-general
Greetings!

We've got a moderately busy DB host running PostgreSQL 8.3.7 from Debian Lenny
with WAL archiving. It's been rsyncing its WAL files to a backup host for the
last year or so, with the following archive_command:

ssh pgbackup@backup test ! -f pgbackup-cirkus/%f && rsync -az %p pgbackup@backup:pgbackup-cirkus/%f

Recently, WAL archiving begain failing on the test which checks wether the
file exists. This first occured two hours after an incident where someone
edited pg_hba.conf and left it with permissions denying Postgres read access
to it.  Upon SIGHUP the cluster naturally shut down. It was discovered
promptly, and according to this person, there were "some processes named
postgres still running". He ran "/etc/init.d/postgresql-8.3 start" anyway,
which brought up the cluster:

2009-09-07 20:39:55 CEST 5782   LOG:  received SIGHUP, reloading configuration files
2009-09-07 20:39:55 CEST 5782   FATAL:  could not open configuration file "/etc/postgresql/8.3/main/pg_hba.conf":
Permissiondenied 
2009-09-07 20:40:07 CEST 14398 mdb2 mdb2web LOG:  could not receive data from client: Connection reset by peer
2009-09-07 20:40:07 CEST 14398 mdb2 mdb2web LOG:  unexpected EOF on client connection
2009-09-07 20:40:15 CEST 14485 mdb2 billig_web LOG:  could not receive data from client: Connection reset by peer
2009-09-07 20:40:15 CEST 14485 mdb2 billig_web LOG:  unexpected EOF on client connection
2009-09-07 20:41:29 CEST 16197   LOG:  could not load root certificate file "root.crt": no SSL error reported
2009-09-07 20:41:29 CEST 16197   DETAIL:  Will not verify client certificates.
2009-09-07 20:41:29 CEST 16197   FATAL:  could not open configuration file "/etc/postgresql/8.3/main/pg_hba.conf":
Permissiondenied 
2009-09-07 20:42:04 CEST 16748   LOG:  could not load root certificate file "root.crt": no SSL error reported
2009-09-07 20:42:04 CEST 16748   DETAIL:  Will not verify client certificates.
2009-09-07 20:42:04 CEST 16749   LOG:  database system was interrupted; last known up at 2009-09-07 20:37:38 CEST
2009-09-07 20:42:04 CEST 16749   LOG:  database system was not properly shut down; automatic recovery in progress
2009-09-07 20:42:04 CEST 16749   LOG:  redo starts at 65/F00718D0
2009-09-07 20:42:04 CEST 16749   LOG:  record with zero length at 65/F1039488
2009-09-07 20:42:04 CEST 16749   LOG:  redo done at 65/F1039458
2009-09-07 20:42:04 CEST 16749   LOG:  last completed transaction was at log time 2009-09-07 20:39:52.010594+02
2009-09-07 20:42:04 CEST 16749   LOG:  checkpoint starting: shutdown immediate
2009-09-07 20:42:04 CEST 16750 [unknown] [unknown] LOG:  connection received: host=[local]
2009-09-07 20:42:04 CEST 16750 [unknown] [unknown] LOG:  incomplete startup packet
2009-09-07 20:42:04 CEST 16749   LOG:  checkpoint complete: wrote 43 buffers (0.2%); 0 transaction log file(s) added, 0
removed,6 recycled; write=0.001 s, sync=0.429 s, total=0.464 s 
2009-09-07 20:42:04 CEST 16756 [unknown] [unknown] LOG:  connection received: host=2001:700:300:1800::1917 port=50140
2009-09-07 20:42:04 CEST 16749   LOG:  recovering prepared transaction 809084
2009-09-07 20:42:04 CEST 16756 mdb2 uka_web FATAL:  the database system is starting up
2009-09-07 20:42:04 CEST 16757 [unknown] [unknown] LOG:  connection received: host=2001:700:300:1800::1917 port=50141
2009-09-07 20:42:04 CEST 16757 mdb2 uka_web FATAL:  the database system is starting up
2009-09-07 20:42:05 CEST 16762 [unknown] [unknown] LOG:  connection received: host=2001:700:300:1800::1919 port=36523
2009-09-07 20:42:05 CEST 16762 greylisting exim FATAL:  the database system is starting up
2009-09-07 20:42:05 CEST 16763 [unknown] [unknown] LOG:  connection received: host=2001:700:300:1800::1919 port=36524
2009-09-07 20:42:05 CEST 16763 greylisting exim FATAL:  the database system is starting up
2009-09-07 20:42:05 CEST 16764 [unknown] [unknown] LOG:  connection received: host=2001:700:300:1800::1919 port=36525
2009-09-07 20:42:05 CEST 16764 greylisting exim FATAL:  the database system is starting up
2009-09-07 20:42:05 CEST 16765 [unknown] [unknown] LOG:  connection received: host=2001:700:300:1800::1919 port=36526
2009-09-07 20:42:05 CEST 16765 greylisting exim FATAL:  the database system is starting up
2009-09-07 20:42:05 CEST 16768 [unknown] [unknown] LOG:  connection received: host=[local]
2009-09-07 20:42:05 CEST 16768 postgres postgres FATAL:  the database system is starting up
2009-09-07 20:42:05 CEST 16776   LOG:  autovacuum launcher started
2009-09-07 20:42:05 CEST 16748   LOG:  database system is ready to accept connections
2009-09-07 20:42:05 CEST 16786 [unknown] [unknown] LOG:  connection received: host=[local]
2009-09-07 20:42:05 CEST 16786 postgres postgres LOG:  connection authorized: user=postgres database=postgres

The cluster recovered, which strikes me as odd - shouldn't Postgres at least
rollback uncommited transactions and shut down cleanly upon an unreadable
configuration file?

Anyway, the cluster ran fine, handling several transactions per second, until
one and a half hour later, where WAL archiving suddenly stopped. (The cluster
runs on happily anyway, though.)

2009-09-07 22:12:53 CEST 2691 mdb2_ukadev uka_mdb LOG:  statement: ABORT
2009-09-07 22:12:54 CEST 16774   LOG:  checkpoint complete: wrote 279 buffers (1.4%); 0 transaction log file(s) added,
0removed, 6 recycled; write=48.665 s, sync=0.631 s, total=49.339 s 
2009-09-07 22:12:54 CEST 16777   LOG:  archive command failed with exit code 1
2009-09-07 22:12:54 CEST 16777   DETAIL:  The failed archive command was: ssh pgbackup@backup test ! -f
pgbackup-cirkus/0000000100000065000000F1&& rsync -az pg_xlog/0000000100000065000000F1
pgbackup@backup:pgbackup-cirkus/0000000100000065000000F1

[..]

2009-09-07 22:12:55 CEST 16777   LOG:  archive command failed with exit code 1
2009-09-07 22:12:55 CEST 16777   DETAIL:  The failed archive command was: ssh pgbackup@backup test ! -f
pgbackup-cirkus/0000000100000065000000F1&& rsync -az pg_xlog/0000000100000065000000F1
pgbackup@backup:pgbackup-cirkus/0000000100000065000000F1

[..]

2009-09-07 22:12:56 CEST 16777   LOG:  archive command failed with exit code 1
2009-09-07 22:12:56 CEST 16777   DETAIL:  The failed archive command was: ssh pgbackup@backup test ! -f
pgbackup-cirkus/0000000100000065000000F1&& rsync -az pg_xlog/0000000100000065000000F1
pgbackup@backup:pgbackup-cirkus/0000000100000065000000F1
2009-09-07 22:12:56 CEST 16777   WARNING:  transaction log file "0000000100000065000000F1" could not be archived: too
manyfailures 

On the backup host (receiving the WAL files), named kolje:

pgbackup@kolje:~$ ls -la pgbackup-cirkus/0000000100000065000000EE pgbackup-cirkus/0000000100000065000000EF
pgbackup-cirkus/0000000100000065000000F0pgbackup-cirkus/0000000100000065000000F1
pgbackup-cirkus/0000000100000065000000F2pgbackup-cirkus/0000000100000065000000F4 
-rw------- 1 pgbackup pgbackup 16777216 2009-09-07 20:27 pgbackup-cirkus/0000000100000065000000EE
-rw------- 1 pgbackup pgbackup 16777216 2009-09-07 20:32 pgbackup-cirkus/0000000100000065000000EF
-rw------- 1 pgbackup pgbackup 16777216 2009-09-07 20:37 pgbackup-cirkus/0000000100000065000000F0
-rw------- 1 pgbackup pgbackup 16777216 2009-09-07 20:47 pgbackup-cirkus/0000000100000065000000F1
-rw------- 1 pgbackup pgbackup 16777216 2009-09-07 20:52 pgbackup-cirkus/0000000100000065000000F2
-rw------- 1 pgbackup pgbackup 16777216 2009-09-07 21:02 pgbackup-cirkus/0000000100000065000000F4

On the DB host, called cirkus:

cirkus:~# ls -la /var/lib/postgresql/8.3/main/pg_xlog/0000000100000065000000F1
/var/lib/postgresql/8.3/main/pg_xlog/archive_status/0000000100000065000000F1.ready
-rw------- 1 postgres postgres 16777216 2009-09-07 22:03 /var/lib/postgresql/8.3/main/pg_xlog/0000000100000065000000F1
-rw------- 1 postgres postgres        0 2009-09-07 22:12
/var/lib/postgresql/8.3/main/pg_xlog/archive_status/0000000100000065000000F1.ready

There's no 0000000100000065000000F0 or 0000000100000065000000F2:

cirkus:~# ls -lart /var/lib/postgresql/8.3/main/pg_xlog/ | head
total 7365516
-rw-------  1 postgres postgres 16777216 2009-09-07 22:03 0000000100000065000000F1
-rw-------  1 postgres postgres 16777216 2009-09-07 22:17 000000010000006600000004
-rw-------  1 postgres postgres 16777216 2009-09-07 22:22 000000010000006600000005
-rw-------  1 postgres postgres 16777216 2009-09-07 22:27 000000010000006600000006
-rw-------  1 postgres postgres 16777216 2009-09-07 22:32 000000010000006600000007
-rw-------  1 postgres postgres 16777216 2009-09-07 22:37 000000010000006600000008
-rw-------  1 postgres postgres 16777216 2009-09-07 22:42 000000010000006600000009
-rw-------  1 postgres postgres 16777216 2009-09-07 22:47 00000001000000660000000A
-rw-------  1 postgres postgres 16777216 2009-09-07 22:52 00000001000000660000000B

The files in question do not share MD5 sum:

pgbackup@kolje:~$ md5sum pgbackup-cirkus/0000000100000065000000F1
341f308f3269176873f93b19496b2923  pgbackup-cirkus/0000000100000065000000F1
cirkus:~# md5sum /var/lib/postgresql/8.3/main/pg_xlog/0000000100000065000000F1
8010976a8a95480916ff2bae9203a2cf  /var/lib/postgresql/8.3/main/pg_xlog/0000000100000065000000F1

How do I continue debugging this? How did a file name collision occur?

-B


Server configuration:

              name               |                                               setting
               

---------------------------------+-----------------------------------------------------------------------------------------------------
 add_missing_from                | off
 allow_system_table_mods         | off
 archive_command                 | ssh pgbackup@backup test ! -f pgbackup-cirkus/%f && rsync -az %p
pgbackup@backup:pgbackup-cirkus/%f
 archive_mode                    | on
 archive_timeout                 | 300
 array_nulls                     | on
 authentication_timeout          | 60
 autovacuum                      | on
 autovacuum_analyze_scale_factor | 0.3
 autovacuum_analyze_threshold    | 100
 autovacuum_freeze_max_age       | 200000000
 autovacuum_max_workers          | 3
 autovacuum_naptime              | 60
 autovacuum_vacuum_cost_delay    | -1
 autovacuum_vacuum_cost_limit    | -1
 autovacuum_vacuum_scale_factor  | 0.4
 autovacuum_vacuum_threshold     | 400
 backslash_quote                 | safe_encoding
 bgwriter_delay                  | 200
 bgwriter_lru_maxpages           | 100
 bgwriter_lru_multiplier         | 2
 block_size                      | 8192
 bonjour_name                    |
 check_function_bodies           | on
 checkpoint_completion_target    | 0.5
 checkpoint_segments             | 15
 checkpoint_timeout              | 1800
 checkpoint_warning              | 30
 client_encoding                 | UTF8
 client_min_messages             | notice
 commit_delay                    | 0
 commit_siblings                 | 5
 config_file                     | /etc/postgresql/8.3/main/postgresql.conf
 constraint_exclusion            | off
 cpu_index_tuple_cost            | 0.005
 cpu_operator_cost               | 0.0025
 cpu_tuple_cost                  | 0.01
 custom_variable_classes         |
 data_directory                  | /var/lib/postgresql/8.3/main
 DateStyle                       | ISO, MDY
 db_user_namespace               | off
 deadlock_timeout                | 1000
 debug_assertions                | off
 debug_pretty_print              | on
 debug_print_parse               | off
 debug_print_plan                | on
 debug_print_rewritten           | off
 default_statistics_target       | 10
 default_tablespace              |
 default_text_search_config      | pg_catalog.simple
 default_transaction_isolation   | read committed
 default_transaction_read_only   | off
 default_with_oids               | off
 dynamic_library_path            | $libdir
 effective_cache_size            | 393216
 enable_bitmapscan               | on
 enable_hashagg                  | on
 enable_hashjoin                 | on
 enable_indexscan                | on
 enable_mergejoin                | on
 enable_nestloop                 | on
 enable_seqscan                  | on
 enable_sort                     | on
 enable_tidscan                  | on
 escape_string_warning           | on
 explain_pretty_print            | on
 external_pid_file               | /var/run/postgresql/8.3-main.pid
 extra_float_digits              | 0
 from_collapse_limit             | 8
 fsync                           | on
 full_page_writes                | on
 geqo                            | on
 geqo_effort                     | 5
 geqo_generations                | 0
 geqo_pool_size                  | 0
 geqo_selection_bias             | 2
 geqo_threshold                  | 12
 gin_fuzzy_search_limit          | 0
 hba_file                        | /etc/postgresql/8.3/main/pg_hba.conf
 ident_file                      | /etc/postgresql/8.3/main/pg_ident.conf
 ignore_system_indexes           | off
 integer_datetimes               | on
 join_collapse_limit             | 8
 krb_caseins_users               | off
 krb_realm                       |
 krb_server_hostname             |
 krb_server_keyfile              | /etc/sql.keytab
 krb_srvname                     | postgres
 lc_collate                      | en_US.UTF-8
 lc_ctype                        | en_US.UTF-8
 lc_messages                     |
 lc_monetary                     | C
 lc_numeric                      | C
 lc_time                         | C
 listen_addresses                | *
 local_preload_libraries         |
 log_autovacuum_min_duration     | -1
 log_checkpoints                 | on
 log_connections                 | on
 log_destination                 | stderr
 log_directory                   | pg_log
 log_disconnections              | off
 log_duration                    | off
 log_error_verbosity             | default
 log_executor_stats              | off
 log_filename                    | postgresql-%Y-%m-%d_%H%M%S.log
 log_hostname                    | off
 log_line_prefix                 | %t %p %d %u
 log_lock_waits                  | on
 log_min_duration_statement      | 100
 log_min_error_statement         | FATAL
 log_min_messages                | INFO
 log_parser_stats                | off
 log_planner_stats               | off
 log_rotation_age                | 1440
 log_rotation_size               | 10240
 log_statement                   | ddl
 log_statement_stats             | off
 log_temp_files                  | 1
 log_timezone                    | localtime
 log_truncate_on_rotation        | off
 logging_collector               | off
 maintenance_work_mem            | 16384
 max_connections                 | 256
 max_files_per_process           | 1000
 max_fsm_pages                   | 2000000
 max_fsm_relations               | 35000
 max_function_args               | 100
 max_identifier_length           | 63
 max_index_keys                  | 32
 max_locks_per_transaction       | 64
 max_prepared_transactions       | 100
 max_stack_depth                 | 2048
 password_encryption             | on
 port                            | 5432
 post_auth_delay                 | 0
 pre_auth_delay                  | 0
 random_page_cost                | 4
 regex_flavor                    | advanced
 search_path                     | "$user",public
 seq_page_cost                   | 1
 server_encoding                 | UTF8
 server_version                  | 8.3.7
 server_version_num              | 80307
 session_replication_role        | origin
 shared_buffers                  | 20000
 shared_preload_libraries        |
 silent_mode                     | off
 sql_inheritance                 | on
 ssl                             | on
 ssl_ciphers                     | ALL:!ADH:!LOW:!EXP:!MD5:@STRENGTH
 standard_conforming_strings     | off
 statement_timeout               | 0
 superuser_reserved_connections  | 3
 synchronize_seqscans            | on
 synchronous_commit              | on
 syslog_facility                 | LOCAL0
 syslog_ident                    | postgres
 tcp_keepalives_count            | 0
 tcp_keepalives_idle             | 0
 tcp_keepalives_interval         | 0
 temp_buffers                    | 1024
 temp_tablespaces                |
 TimeZone                        | localtime
 timezone_abbreviations          | Default
 trace_notify                    | off
 trace_sort                      | off
 track_activities                | on
 track_counts                    | on
 transaction_isolation           | read committed
 transaction_read_only           | off
 transform_null_equals           | off
 unix_socket_directory           | /var/run/postgresql
 unix_socket_group               |
 unix_socket_permissions         | 511
 update_process_title            | on
 vacuum_cost_delay               | 0
 vacuum_cost_limit               | 200
 vacuum_cost_page_dirty          | 20
 vacuum_cost_page_hit            | 1
 vacuum_cost_page_miss           | 10
 vacuum_freeze_min_age           | 100000000
 wal_buffers                     | 8
 wal_sync_method                 | open_sync
 wal_writer_delay                | 200
 work_mem                        | 81920
 xmlbinary                       | base64
 xmloption                       | content
 zero_damaged_pages              | off
(189 rows)

--
Berge Schwebs Bjørlo
Alegría!

В списке pgsql-general по дате отправления:

Предыдущее
От: Korry
Дата:
Сообщение: Re: Error log fillingup
Следующее
От: David W Noon
Дата:
Сообщение: Re: where clauses and multiple tables