Обсуждение: Unable to Connect to DB Instance

Поиск
Список
Период
Сортировка

Unable to Connect to DB Instance

От
Boblitz John
Дата:

Forwarded from pgadmin-support – wrong list

 

From: Ashesh Vashi [mailto:ashesh.vashi@enterprisedb.com]
Sent: Mittwoch, 4. Juli 2018 12:14
To: Boblitz John <john.boblitz@bertschi.com>
Cc: pgadmin-support <pgadmin-support@postgresql.org>
Subject: Re: Unable to Connect to DB Instance

 

 

On Wed, Jul 4, 2018, 15:19 Boblitz John <john.boblitz@bertschi.com> wrote:

Good Morning,

 

Beginning yesterday morning, users have been unable to fully connect to our DB Instance.

 

1.       At the time of the initial report – I was connected to the DB via pgAdmin and could perform queries without problem.

2.       Users reported messages similar to “could not open file "global/11801": No such file or directory”

3.       At that time, connection logging was turned off and there were no messages in the log files.

4.       As this is a development environment, I turned logging on in the config and restarted the DB

5.       After restart, neither I, nor the Users could fully reconnect.

6.       I have performed a SYS Level backup (tar of the whole postgres directory tree)

7.       I cannot perform a DB level Backup (same errors occur)

 

System Details

                Linux Debian      7.11

                Postgres              9.1 (9.1.24lts-0+d)

 

Please send your queries to pgsql-general@postgresql.org for database server issues.

This is a pgAdmin support list.

 

-- Thanks, Ashesh

It appears that we can connect to the DB Server itself as I get “connection received” and “connection authorized” – but when trying to access the DB itself, several errors are raised (see below).

I am assuming that some internals are no longer consistent – the file “global/11801” for instance really does not exist on the system.

 

Questions:

 

1.       Is there any way to recover from this (backup is unfortunately rather old)

2.       What are possible causes?  I’d like to prevent this from happening on my production servers.

 

** I am aware that we are on older releases, and yes, we plan to migrate to more current releases “soon” ™ …

 

 

Thanks in advance.

 

John Boblitz

 

 

Exceprt from Log:

2018-07-04 09:15:13 CEST 192.168.250.50(28559) [unknown]LOG:  connection received: host=192.168.250.50 port=28559

2018-07-04 09:15:14 CEST 192.168.250.50(28559) postgresLOG:  connection authorized: user=dbadmin database=postgres

2018-07-04 09:15:14 CEST 192.168.250.50(28559) postgresERROR:  could not open file "global/11801": No such file or directory

2018-07-04 09:15:14 CEST 192.168.250.50(28559) postgresSTATEMENT:  SELECT usecreatedb, usesuper, CASE WHEN usesuper THEN pg_postmaster_start_time() ELSE NULL END as upsince, CASE WHEN usesuper THEN pg_conf_load_time() ELSE NULL END as confloadedsince, CASE WHEN usesuper THEN pg_is_in_recovery() ELSE NULL END as inrecovery, CASE WHEN usesuper THEN pg_last_xlog_receive_location() ELSE NULL END as receiveloc, CASE WHEN usesuper THEN pg_last_xlog_replay_location() ELSE NULL END as replayloc, CASE WHEN usesuper THEN pg_last_xact_replay_timestamp() ELSE NULL END as replay_timestamp, CASE WHEN usesuper AND pg_is_in_recovery() THEN pg_is_xlog_replay_paused() ELSE NULL END as isreplaypaused

                  FROM pg_user WHERE usename=current_user

2018-07-04 09:15:19 CEST 192.168.250.50(28559) postgresERROR:  could not open file "global/11801": No such file or directory

2018-07-04 09:15:19 CEST 192.168.250.50(28559) postgresSTATEMENT:  SELECT rolcreaterole, rolcreatedb FROM pg_roles WHERE rolname = current_user;

2018-07-04 09:15:22 CEST 192.168.250.50(28561) [unknown]LOG:  connection received: host=192.168.250.50 port=28561

2018-07-04 09:15:22 CEST 192.168.250.50(28561) g11BaseLOG:  connection authorized: user=dbadmin database=g11Base

2018-07-04 09:15:23 CEST 192.168.250.50(28561) g11BaseERROR:  could not open file "pg_tblspc/24579/PG_9.1_201105231/24580/11866": No such file or directory

2018-07-04 09:15:23 CEST 192.168.250.50(28561) g11BaseSTATEMENT:  SELECT CASE WHEN nspname LIKE E'pg\\_temp\\_%' THEN 1

                            WHEN (nspname LIKE E'pg\\_%') THEN 0

                            ELSE 3 END AS nsptyp,

                       nsp.nspname, nsp.oid, pg_get_userbyid(nspowner) AS namespaceowner, nspacl, description,       has_schema_privilege(nsp.oid, 'CREATE') as cancreate,

                (SELECT array_agg(label) FROM pg_seclabels sl1 WHERE sl1.objoid=nsp.oid) AS labels,

                (SELECT array_agg(provider) FROM pg_seclabels sl2 WHERE sl2.objoid=nsp.oid) AS providers

                  FROM pg_namespace nsp

                  LEFT OUTER JOIN pg_description des ON (des.objoid=nsp.oid AND des.classoid='pg_namespace'::regclass)

                WHERE NOT ((nspname = 'pg_catalog' AND EXISTS (SELECT 1 FROM pg_class WHERE relname = 'pg_class' AND relnamespace = nsp.oid LIMIT 1)) OR

                (nspname = 'pgagent' AND EXISTS (SELECT 1 FROM pg_class WHERE relname = 'pga_job' AND relnamespace = nsp.oid LIMIT 1)) OR

                (nspname = 'information_schema' AND EXISTS (SELECT 1 FROM pg_class WHERE relname = 'tables' AND relnamespace = nsp.oid LIMIT 1)) OR

                (nspname LIKE '_%' AND EXISTS (SELECT 1 FROM pg_proc WHERE proname='slonyversion' AND pronamespace = nsp.oid LIMIT 1))

                )  AND nspname NOT LIKE E'pg\\_temp\\_%'AND nspname NOT LIKE E'pg\\_toast_temp\\_%' ORDER BY 1, nspname

Re: Unable to Connect to DB Instance

От
Tom Lane
Дата:
Boblitz John <john.boblitz@bertschi.com> writes:
> 2.       Users reported messages similar to "could not open file "global/11801": No such file or directory"

I'd try "select relname from pg_class where pg_relation_filenode(oid) =
11801" to see if you can identify the problematic relation that way.

If you're lucky, this is just loss of some system catalog index in which
case reindexing will fix it.  The fact that you're able to get through
connecting, and the errors only show up with queries, is somewhat
promising given that the problem looks like it's related to pg_authid
or pg_db_role_setting.

            regards, tom lane


Re: Unable to Connect to DB Instance

От
Adrian Klaver
Дата:
On 07/04/2018 04:08 AM, Boblitz John wrote:

> 
>     Good Morning,
> 
>     Beginning yesterday morning, users have been unable to fully connect
>     to our DB Instance.
> 
>     1.At the time of the initial report – I was connected to the DB via
>     pgAdmin and could perform queries without problem.
> 
>     2.Users reported messages similar to “could not open file
>     "global/11801": No such file or directory”
> 
>     3.At that time, connection logging was turned off and there were no
>     messages in the log files.
> 
>     4.As this is a development environment, I turned logging on in the
>     config and restarted the DB
> 
>     5.After restart, neither I, nor the Users could fully reconnect.
> 
>     6.I have performed a SYS Level backup (tar of the whole postgres
>     directory tree)
> 
>     7.I cannot perform a DB level Backup (same errors occur)
> 
>     System Details
> 
>                      Linux Debian      7.11
> 
>                      Postgres              9.1 (9.1.24lts-0+d)
> 

>     It appears that we can connect to the DB Server itself as I get
>     “connection received” and “connection authorized” – but when trying
>     to access the DB itself, several errors are raised (see below).
> 
>     I am assuming that some internals are no longer consistent – the
>     file “global/11801” for instance really does not exist on the system.
> 
>     Questions:
> 
>     1.Is there any way to recover from this (backup is unfortunately
>     rather old) >
>     2.What are possible causes?  I’d like to prevent this from happening
>     on my production servers.

Looks like something/someone deleted files from portions of  the $DATA 
directory. In particular from ~/global and ~/pg_tblspc. Without those 
files you can't really proceed.

Can you recover by creating a new instance and restoring from a dump of 
the production server?

As to exact cause the only thing I can think of is to to look at the 
system logs at the time of the initial failure and see if there is 
anything there that would shed light.

> 
>     ** I am aware that we are on older releases, and yes, we plan to
>     migrate to more current releases “soon” ™ …

> 
>     Thanks in advance.
> 
>     John Boblitz
> 
>     Exceprt from Log:
> 
>     2018-07-04 09:15:13 CEST 192.168.250.50(28559) [unknown]LOG: 
>     connection received: host=192.168.250.50 port=28559
> 
>     2018-07-04 09:15:14 CEST 192.168.250.50(28559) postgresLOG: 
>     connection authorized: user=dbadmin database=postgres
> 
>     2018-07-04 09:15:14 CEST 192.168.250.50(28559) postgresERROR:  could
>     not open file "global/11801": No such file or directory
> 
>     2018-07-04 09:15:14 CEST 192.168.250.50(28559) postgresSTATEMENT: 
>     SELECT usecreatedb, usesuper, CASE WHEN usesuper THEN
>     pg_postmaster_start_time() ELSE NULL END as upsince, CASE WHEN
>     usesuper THEN pg_conf_load_time() ELSE NULL END as confloadedsince,
>     CASE WHEN usesuper THEN pg_is_in_recovery() ELSE NULL END as
>     inrecovery, CASE WHEN usesuper THEN pg_last_xlog_receive_location()
>     ELSE NULL END as receiveloc, CASE WHEN usesuper THEN
>     pg_last_xlog_replay_location() ELSE NULL END as replayloc, CASE WHEN
>     usesuper THEN pg_last_xact_replay_timestamp() ELSE NULL END as
>     replay_timestamp, CASE WHEN usesuper AND pg_is_in_recovery() THEN
>     pg_is_xlog_replay_paused() ELSE NULL END as isreplaypaused
> 
>                        FROM pg_user WHERE usename=current_user
> 
>     2018-07-04 09:15:19 CEST 192.168.250.50(28559) postgresERROR:  could
>     not open file "global/11801": No such file or directory
> 
>     2018-07-04 09:15:19 CEST 192.168.250.50(28559) postgresSTATEMENT: 
>     SELECT rolcreaterole, rolcreatedb FROM pg_roles WHERE rolname =
>     current_user;
> 
>     2018-07-04 09:15:22 CEST 192.168.250.50(28561) [unknown]LOG: 
>     connection received: host=192.168.250.50 port=28561
> 
>     2018-07-04 09:15:22 CEST 192.168.250.50(28561) g11BaseLOG: 
>     connection authorized: user=dbadmin database=g11Base
> 
>     2018-07-04 09:15:23 CEST 192.168.250.50(28561) g11BaseERROR:  could
>     not open file "pg_tblspc/24579/PG_9.1_201105231/24580/11866": No
>     such file or directory
> 
>     2018-07-04 09:15:23 CEST 192.168.250.50(28561) g11BaseSTATEMENT: 
>     SELECT CASE WHEN nspname LIKE E'pg\\_temp\\_%' THEN 1
> 
>                                  WHEN (nspname LIKE E'pg\\_%') THEN 0
> 
>                                  ELSE 3 END AS nsptyp,
> 
>                             nsp.nspname, nsp.oid,
>     pg_get_userbyid(nspowner) AS namespaceowner, nspacl,
>     description,       has_schema_privilege(nsp.oid, 'CREATE') as cancreate,
> 
>                      (SELECT array_agg(label) FROM pg_seclabels sl1
>     WHERE sl1.objoid=nsp.oid) AS labels,
> 
>                      (SELECT array_agg(provider) FROM pg_seclabels sl2
>     WHERE sl2.objoid=nsp.oid) AS providers
> 
>                        FROM pg_namespace nsp
> 
>                        LEFT OUTER JOIN pg_description des ON
>     (des.objoid=nsp.oid AND des.classoid='pg_namespace'::regclass)
> 
>                      WHERE NOT ((nspname = 'pg_catalog' AND EXISTS
>     (SELECT 1 FROM pg_class WHERE relname = 'pg_class' AND relnamespace
>     = nsp.oid LIMIT 1)) OR
> 
>                      (nspname = 'pgagent' AND EXISTS (SELECT 1 FROM
>     pg_class WHERE relname = 'pga_job' AND relnamespace = nsp.oid LIMIT
>     1)) OR
> 
>                      (nspname = 'information_schema' AND EXISTS (SELECT
>     1 FROM pg_class WHERE relname = 'tables' AND relnamespace = nsp.oid
>     LIMIT 1)) OR
> 
>                      (nspname LIKE '_%' AND EXISTS (SELECT 1 FROM
>     pg_proc WHERE proname='slonyversion' AND pronamespace = nsp.oid
>     LIMIT 1))
> 
>                      )  AND nspname NOT LIKE E'pg\\_temp\\_%'AND nspname
>     NOT LIKE E'pg\\_toast_temp\\_%' ORDER BY 1, nspname
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


RE: Unable to Connect to DB Instance

От
Boblitz John
Дата:
Hello Tom,

Thanks - I get "pg_db_role_setting" as a response.

I have already attempted to reindex system but get:

NOTICE:  table "pg_catalog.pg_class" was reindexed
NOTICE:  table "pg_catalog.pg_statistic" was reindexed
NOTICE:  table "pg_catalog.pg_type" was reindexed
NOTICE:  table "pg_catalog.pg_attribute" was reindexed
NOTICE:  table "pg_catalog.pg_authid" was reindexed
ERROR:  could not open file "base/11919/11680": No such file or directory


John


> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Mittwoch, 4. Juli 2018 17:50
> To: Boblitz John <john.boblitz@bertschi.com>
> Cc: pgsql-general@postgresql.org
> Subject: Re: Unable to Connect to DB Instance
>
> Boblitz John <john.boblitz@bertschi.com> writes:
> > 2.       Users reported messages similar to "could not open file
> "global/11801": No such file or directory"
>
> I'd try "select relname from pg_class where pg_relation_filenode(oid) =
> 11801" to see if you can identify the problematic relation that way.
>
> If you're lucky, this is just loss of some system catalog index in which case
> reindexing will fix it.  The fact that you're able to get through connecting,
> and the errors only show up with queries, is somewhat promising given that
> the problem looks like it's related to pg_authid or pg_db_role_setting.
>
>             regards, tom lane


Re: Unable to Connect to DB Instance

От
Tom Lane
Дата:
Boblitz John <john.boblitz@bertschi.com> writes:
> Thanks - I get "pg_db_role_setting" as a response.

Hm ... not its index?  If the table itself is gone, it's surprising
that you can get through session startup.

> I have already attempted to reindex system but get:

I had in mind just reindexing the specific table you're having trouble
with ... but this:

> ERROR:  could not open file "base/11919/11680": No such file or directory

shows that there's another table that also has a problem, and there
may be more :-(.  I don't know what the odds are that you can get out
of this completely.  I would NOT recommend "reindex system" as a
blunderbuss solution.  You do not know how much is corrupted and there's
a significant chance of making things worse by tromping over the whole
database using catalogs of uncertain reliability.

Did you identify which table 11680 is?

In the case of pg_db_role_setting, a possible solution is to "touch" the
missing file so it exists; it'll be empty, which means that you'll have
lost any ALTER DATABASE/ROLE SET settings, but that's better than not
being able to dump at all.  (You might then need to REINDEX
pg_db_role_setting to get its indexes in sync with it being empty.)

Whether an equally drastic answer is tolerable for your other missing
table(s) depends on what they are...

            regards, tom lane


Re: Unable to Connect to DB Instance

От
Adrian Klaver
Дата:
On 07/04/2018 12:36 PM, Tom Lane wrote:
> Boblitz John <john.boblitz@bertschi.com> writes:
>> Thanks - I get "pg_db_role_setting" as a response.
> 
> Hm ... not its index?  If the table itself is gone, it's surprising
> that you can get through session startup.
> 
>> I have already attempted to reindex system but get:
> 
> I had in mind just reindexing the specific table you're having trouble
> with ... but this:
> 
>> ERROR:  could not open file "base/11919/11680": No such file or directory
> 
> shows that there's another table that also has a problem, and there
> may be more :-(.  I don't know what the odds are that you can get out
> of this completely.  I would NOT recommend "reindex system" as a
> blunderbuss solution.  You do not know how much is corrupted and there's
> a significant chance of making things worse by tromping over the whole
> database using catalogs of uncertain reliability.
> 
> Did you identify which table 11680 is?

There is also this from the OP:

2018-07-04 09:15:23 CEST 192.168.250.50(28561) g11BaseERROR:  could not 
open file "pg_tblspc/24579/PG_9.1_201105231/24580/11866": No such file 
or directory

Which failed on a query that references pgAgent and Slony.

So are you using either or both of those programs?

> 
> In the case of pg_db_role_setting, a possible solution is to "touch" the
> missing file so it exists; it'll be empty, which means that you'll have
> lost any ALTER DATABASE/ROLE SET settings, but that's better than not
> being able to dump at all.  (You might then need to REINDEX
> pg_db_role_setting to get its indexes in sync with it being empty.)
> 
> Whether an equally drastic answer is tolerable for your other missing
> table(s) depends on what they are...
> 
>             regards, tom lane
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com