Обсуждение: Unable to Connect to DB Instance
Forwarded from pgadmin-support – wrong list
From: Ashesh Vashi [mailto:ashesh.vashi@enterprisedb.com]
Sent: Mittwoch, 4. Juli 2018 12:14
To: Boblitz John <john.boblitz@bertschi.com>
Cc: pgadmin-support <pgadmin-support@postgresql.org>
Subject: Re: Unable to Connect to DB Instance
On Wed, Jul 4, 2018, 15:19 Boblitz John <john.boblitz@bertschi.com> wrote:
Good Morning,
Beginning yesterday morning, users have been unable to fully connect to our DB Instance.
1. At the time of the initial report – I was connected to the DB via pgAdmin and could perform queries without problem.
2. Users reported messages similar to “could not open file "global/11801": No such file or directory”
3. At that time, connection logging was turned off and there were no messages in the log files.
4. As this is a development environment, I turned logging on in the config and restarted the DB
5. After restart, neither I, nor the Users could fully reconnect.
6. I have performed a SYS Level backup (tar of the whole postgres directory tree)
7. I cannot perform a DB level Backup (same errors occur)
System Details
Linux Debian 7.11
Postgres 9.1 (9.1.24lts-0+d)
Please send your queries to pgsql-general@postgresql.org for database server issues.
This is a pgAdmin support list.
-- Thanks, Ashesh
It appears that we can connect to the DB Server itself as I get “connection received” and “connection authorized” – but when trying to access the DB itself, several errors are raised (see below).
I am assuming that some internals are no longer consistent – the file “global/11801” for instance really does not exist on the system.
Questions:
1. Is there any way to recover from this (backup is unfortunately rather old)
2. What are possible causes? I’d like to prevent this from happening on my production servers.
** I am aware that we are on older releases, and yes, we plan to migrate to more current releases “soon” ™ …
Thanks in advance.
John Boblitz
Exceprt from Log:
2018-07-04 09:15:13 CEST 192.168.250.50(28559) [unknown]LOG: connection received: host=192.168.250.50 port=28559
2018-07-04 09:15:14 CEST 192.168.250.50(28559) postgresLOG: connection authorized: user=dbadmin database=postgres
2018-07-04 09:15:14 CEST 192.168.250.50(28559) postgresERROR: could not open file "global/11801": No such file or directory
2018-07-04 09:15:14 CEST 192.168.250.50(28559) postgresSTATEMENT: SELECT usecreatedb, usesuper, CASE WHEN usesuper THEN pg_postmaster_start_time() ELSE NULL END as upsince, CASE WHEN usesuper THEN pg_conf_load_time() ELSE NULL END as confloadedsince, CASE WHEN usesuper THEN pg_is_in_recovery() ELSE NULL END as inrecovery, CASE WHEN usesuper THEN pg_last_xlog_receive_location() ELSE NULL END as receiveloc, CASE WHEN usesuper THEN pg_last_xlog_replay_location() ELSE NULL END as replayloc, CASE WHEN usesuper THEN pg_last_xact_replay_timestamp() ELSE NULL END as replay_timestamp, CASE WHEN usesuper AND pg_is_in_recovery() THEN pg_is_xlog_replay_paused() ELSE NULL END as isreplaypaused
FROM pg_user WHERE usename=current_user
2018-07-04 09:15:19 CEST 192.168.250.50(28559) postgresERROR: could not open file "global/11801": No such file or directory
2018-07-04 09:15:19 CEST 192.168.250.50(28559) postgresSTATEMENT: SELECT rolcreaterole, rolcreatedb FROM pg_roles WHERE rolname = current_user;
2018-07-04 09:15:22 CEST 192.168.250.50(28561) [unknown]LOG: connection received: host=192.168.250.50 port=28561
2018-07-04 09:15:22 CEST 192.168.250.50(28561) g11BaseLOG: connection authorized: user=dbadmin database=g11Base
2018-07-04 09:15:23 CEST 192.168.250.50(28561) g11BaseERROR: could not open file "pg_tblspc/24579/PG_9.1_201105231/24580/11866": No such file or directory
2018-07-04 09:15:23 CEST 192.168.250.50(28561) g11BaseSTATEMENT: SELECT CASE WHEN nspname LIKE E'pg\\_temp\\_%' THEN 1
WHEN (nspname LIKE E'pg\\_%') THEN 0
ELSE 3 END AS nsptyp,
nsp.nspname, nsp.oid, pg_get_userbyid(nspowner) AS namespaceowner, nspacl, description, has_schema_privilege(nsp.oid, 'CREATE') as cancreate,
(SELECT array_agg(label) FROM pg_seclabels sl1 WHERE sl1.objoid=nsp.oid) AS labels,
(SELECT array_agg(provider) FROM pg_seclabels sl2 WHERE sl2.objoid=nsp.oid) AS providers
FROM pg_namespace nsp
LEFT OUTER JOIN pg_description des ON (des.objoid=nsp.oid AND des.classoid='pg_namespace'::regclass)
WHERE NOT ((nspname = 'pg_catalog' AND EXISTS (SELECT 1 FROM pg_class WHERE relname = 'pg_class' AND relnamespace = nsp.oid LIMIT 1)) OR
(nspname = 'pgagent' AND EXISTS (SELECT 1 FROM pg_class WHERE relname = 'pga_job' AND relnamespace = nsp.oid LIMIT 1)) OR
(nspname = 'information_schema' AND EXISTS (SELECT 1 FROM pg_class WHERE relname = 'tables' AND relnamespace = nsp.oid LIMIT 1)) OR
(nspname LIKE '_%' AND EXISTS (SELECT 1 FROM pg_proc WHERE proname='slonyversion' AND pronamespace = nsp.oid LIMIT 1))
) AND nspname NOT LIKE E'pg\\_temp\\_%'AND nspname NOT LIKE E'pg\\_toast_temp\\_%' ORDER BY 1, nspname
Boblitz John <john.boblitz@bertschi.com> writes: > 2. Users reported messages similar to "could not open file "global/11801": No such file or directory" I'd try "select relname from pg_class where pg_relation_filenode(oid) = 11801" to see if you can identify the problematic relation that way. If you're lucky, this is just loss of some system catalog index in which case reindexing will fix it. The fact that you're able to get through connecting, and the errors only show up with queries, is somewhat promising given that the problem looks like it's related to pg_authid or pg_db_role_setting. regards, tom lane
On 07/04/2018 04:08 AM, Boblitz John wrote: > > Good Morning, > > Beginning yesterday morning, users have been unable to fully connect > to our DB Instance. > > 1.At the time of the initial report – I was connected to the DB via > pgAdmin and could perform queries without problem. > > 2.Users reported messages similar to “could not open file > "global/11801": No such file or directory” > > 3.At that time, connection logging was turned off and there were no > messages in the log files. > > 4.As this is a development environment, I turned logging on in the > config and restarted the DB > > 5.After restart, neither I, nor the Users could fully reconnect. > > 6.I have performed a SYS Level backup (tar of the whole postgres > directory tree) > > 7.I cannot perform a DB level Backup (same errors occur) > > System Details > > Linux Debian 7.11 > > Postgres 9.1 (9.1.24lts-0+d) > > It appears that we can connect to the DB Server itself as I get > “connection received” and “connection authorized” – but when trying > to access the DB itself, several errors are raised (see below). > > I am assuming that some internals are no longer consistent – the > file “global/11801” for instance really does not exist on the system. > > Questions: > > 1.Is there any way to recover from this (backup is unfortunately > rather old) > > 2.What are possible causes? I’d like to prevent this from happening > on my production servers. Looks like something/someone deleted files from portions of the $DATA directory. In particular from ~/global and ~/pg_tblspc. Without those files you can't really proceed. Can you recover by creating a new instance and restoring from a dump of the production server? As to exact cause the only thing I can think of is to to look at the system logs at the time of the initial failure and see if there is anything there that would shed light. > > ** I am aware that we are on older releases, and yes, we plan to > migrate to more current releases “soon” ™ … > > Thanks in advance. > > John Boblitz > > Exceprt from Log: > > 2018-07-04 09:15:13 CEST 192.168.250.50(28559) [unknown]LOG: > connection received: host=192.168.250.50 port=28559 > > 2018-07-04 09:15:14 CEST 192.168.250.50(28559) postgresLOG: > connection authorized: user=dbadmin database=postgres > > 2018-07-04 09:15:14 CEST 192.168.250.50(28559) postgresERROR: could > not open file "global/11801": No such file or directory > > 2018-07-04 09:15:14 CEST 192.168.250.50(28559) postgresSTATEMENT: > SELECT usecreatedb, usesuper, CASE WHEN usesuper THEN > pg_postmaster_start_time() ELSE NULL END as upsince, CASE WHEN > usesuper THEN pg_conf_load_time() ELSE NULL END as confloadedsince, > CASE WHEN usesuper THEN pg_is_in_recovery() ELSE NULL END as > inrecovery, CASE WHEN usesuper THEN pg_last_xlog_receive_location() > ELSE NULL END as receiveloc, CASE WHEN usesuper THEN > pg_last_xlog_replay_location() ELSE NULL END as replayloc, CASE WHEN > usesuper THEN pg_last_xact_replay_timestamp() ELSE NULL END as > replay_timestamp, CASE WHEN usesuper AND pg_is_in_recovery() THEN > pg_is_xlog_replay_paused() ELSE NULL END as isreplaypaused > > FROM pg_user WHERE usename=current_user > > 2018-07-04 09:15:19 CEST 192.168.250.50(28559) postgresERROR: could > not open file "global/11801": No such file or directory > > 2018-07-04 09:15:19 CEST 192.168.250.50(28559) postgresSTATEMENT: > SELECT rolcreaterole, rolcreatedb FROM pg_roles WHERE rolname = > current_user; > > 2018-07-04 09:15:22 CEST 192.168.250.50(28561) [unknown]LOG: > connection received: host=192.168.250.50 port=28561 > > 2018-07-04 09:15:22 CEST 192.168.250.50(28561) g11BaseLOG: > connection authorized: user=dbadmin database=g11Base > > 2018-07-04 09:15:23 CEST 192.168.250.50(28561) g11BaseERROR: could > not open file "pg_tblspc/24579/PG_9.1_201105231/24580/11866": No > such file or directory > > 2018-07-04 09:15:23 CEST 192.168.250.50(28561) g11BaseSTATEMENT: > SELECT CASE WHEN nspname LIKE E'pg\\_temp\\_%' THEN 1 > > WHEN (nspname LIKE E'pg\\_%') THEN 0 > > ELSE 3 END AS nsptyp, > > nsp.nspname, nsp.oid, > pg_get_userbyid(nspowner) AS namespaceowner, nspacl, > description, has_schema_privilege(nsp.oid, 'CREATE') as cancreate, > > (SELECT array_agg(label) FROM pg_seclabels sl1 > WHERE sl1.objoid=nsp.oid) AS labels, > > (SELECT array_agg(provider) FROM pg_seclabels sl2 > WHERE sl2.objoid=nsp.oid) AS providers > > FROM pg_namespace nsp > > LEFT OUTER JOIN pg_description des ON > (des.objoid=nsp.oid AND des.classoid='pg_namespace'::regclass) > > WHERE NOT ((nspname = 'pg_catalog' AND EXISTS > (SELECT 1 FROM pg_class WHERE relname = 'pg_class' AND relnamespace > = nsp.oid LIMIT 1)) OR > > (nspname = 'pgagent' AND EXISTS (SELECT 1 FROM > pg_class WHERE relname = 'pga_job' AND relnamespace = nsp.oid LIMIT > 1)) OR > > (nspname = 'information_schema' AND EXISTS (SELECT > 1 FROM pg_class WHERE relname = 'tables' AND relnamespace = nsp.oid > LIMIT 1)) OR > > (nspname LIKE '_%' AND EXISTS (SELECT 1 FROM > pg_proc WHERE proname='slonyversion' AND pronamespace = nsp.oid > LIMIT 1)) > > ) AND nspname NOT LIKE E'pg\\_temp\\_%'AND nspname > NOT LIKE E'pg\\_toast_temp\\_%' ORDER BY 1, nspname > -- Adrian Klaver adrian.klaver@aklaver.com
Hello Tom, Thanks - I get "pg_db_role_setting" as a response. I have already attempted to reindex system but get: NOTICE: table "pg_catalog.pg_class" was reindexed NOTICE: table "pg_catalog.pg_statistic" was reindexed NOTICE: table "pg_catalog.pg_type" was reindexed NOTICE: table "pg_catalog.pg_attribute" was reindexed NOTICE: table "pg_catalog.pg_authid" was reindexed ERROR: could not open file "base/11919/11680": No such file or directory John > -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: Mittwoch, 4. Juli 2018 17:50 > To: Boblitz John <john.boblitz@bertschi.com> > Cc: pgsql-general@postgresql.org > Subject: Re: Unable to Connect to DB Instance > > Boblitz John <john.boblitz@bertschi.com> writes: > > 2. Users reported messages similar to "could not open file > "global/11801": No such file or directory" > > I'd try "select relname from pg_class where pg_relation_filenode(oid) = > 11801" to see if you can identify the problematic relation that way. > > If you're lucky, this is just loss of some system catalog index in which case > reindexing will fix it. The fact that you're able to get through connecting, > and the errors only show up with queries, is somewhat promising given that > the problem looks like it's related to pg_authid or pg_db_role_setting. > > regards, tom lane
Boblitz John <john.boblitz@bertschi.com> writes: > Thanks - I get "pg_db_role_setting" as a response. Hm ... not its index? If the table itself is gone, it's surprising that you can get through session startup. > I have already attempted to reindex system but get: I had in mind just reindexing the specific table you're having trouble with ... but this: > ERROR: could not open file "base/11919/11680": No such file or directory shows that there's another table that also has a problem, and there may be more :-(. I don't know what the odds are that you can get out of this completely. I would NOT recommend "reindex system" as a blunderbuss solution. You do not know how much is corrupted and there's a significant chance of making things worse by tromping over the whole database using catalogs of uncertain reliability. Did you identify which table 11680 is? In the case of pg_db_role_setting, a possible solution is to "touch" the missing file so it exists; it'll be empty, which means that you'll have lost any ALTER DATABASE/ROLE SET settings, but that's better than not being able to dump at all. (You might then need to REINDEX pg_db_role_setting to get its indexes in sync with it being empty.) Whether an equally drastic answer is tolerable for your other missing table(s) depends on what they are... regards, tom lane
On 07/04/2018 12:36 PM, Tom Lane wrote: > Boblitz John <john.boblitz@bertschi.com> writes: >> Thanks - I get "pg_db_role_setting" as a response. > > Hm ... not its index? If the table itself is gone, it's surprising > that you can get through session startup. > >> I have already attempted to reindex system but get: > > I had in mind just reindexing the specific table you're having trouble > with ... but this: > >> ERROR: could not open file "base/11919/11680": No such file or directory > > shows that there's another table that also has a problem, and there > may be more :-(. I don't know what the odds are that you can get out > of this completely. I would NOT recommend "reindex system" as a > blunderbuss solution. You do not know how much is corrupted and there's > a significant chance of making things worse by tromping over the whole > database using catalogs of uncertain reliability. > > Did you identify which table 11680 is? There is also this from the OP: 2018-07-04 09:15:23 CEST 192.168.250.50(28561) g11BaseERROR: could not open file "pg_tblspc/24579/PG_9.1_201105231/24580/11866": No such file or directory Which failed on a query that references pgAgent and Slony. So are you using either or both of those programs? > > In the case of pg_db_role_setting, a possible solution is to "touch" the > missing file so it exists; it'll be empty, which means that you'll have > lost any ALTER DATABASE/ROLE SET settings, but that's better than not > being able to dump at all. (You might then need to REINDEX > pg_db_role_setting to get its indexes in sync with it being empty.) > > Whether an equally drastic answer is tolerable for your other missing > table(s) depends on what they are... > > regards, tom lane > > -- Adrian Klaver adrian.klaver@aklaver.com