Обсуждение: Cannot login for short period of time

Поиск
Список
Период
Сортировка

Cannot login for short period of time

От
Henry
Дата:
Greets,

Pg:  8.3.7

I'm trying to diagnose why I cannot login to Pg on occasion.  The psql
command will just hang (so I cannot get in to see what it's doing) and
a telnet into 5432 will give the usual:

Connected to localhost (127.0.0.1).
Escape character is '^]'.

indicating the backend accepting the connection, but not proceeding.

The number of connections at the time are small, file descriptor usage
will be small, etc, server load low.

It will then spontaneously recover as if it's just completed doing
something IO intensive.

My config:

listen_addresses = '1.2.1.1,127.0.0.1'
port = 5432
max_connections = 2048
shared_buffers = 520MB
temp_buffers = 128MB
work_mem = 1512MB
max_fsm_pages = 512000
fsync = off
checkpoint_segments = 32
effective_cache_size = 512MB
constraint_exclusion = on
logging_collector = on
track_counts = on
autovacuum = on
autovacuum_freeze_max_age = 2000000000

I'm rerunning things with statement logging turned on to get a clue.
Incidentally, I did have to reindex the system tables in single-user
mode (got a "Cannot find namespace 0" error at some point).

Anyone have an idea what this strange connectivity-delay could be about?

Cheers
Henry

Вложения

Re: Cannot login for short period of time

От
Thomas Markus
Дата:
Hi,

check your hardware (especially harddrive) for errors.

regards
Thomas

Henry schrieb:
>
> Greets,
>
> Pg:  8.3.7
>
> I'm trying to diagnose why I cannot login to Pg on occasion.  The psql
> command will just hang (so I cannot get in to see what it's doing) and
> a telnet into 5432 will give the usual:
>
> Connected to localhost (127.0.0.1).
> Escape character is '^]'.
>
> indicating the backend accepting the connection, but not proceeding.
>
> The number of connections at the time are small, file descriptor usage
> will be small, etc, server load low.
>
> It will then spontaneously recover as if it's just completed doing
> something IO intensive.
>
> My config:
>
> listen_addresses = '1.2.1.1,127.0.0.1'
> port = 5432
> max_connections = 2048
> shared_buffers = 520MB
> temp_buffers = 128MB
> work_mem = 1512MB
> max_fsm_pages = 512000
> fsync = off
> checkpoint_segments = 32
> effective_cache_size = 512MB
> constraint_exclusion = on
> logging_collector = on
> track_counts = on
> autovacuum = on
> autovacuum_freeze_max_age = 2000000000
>
> I'm rerunning things with statement logging turned on to get a clue.
> Incidentally, I did have to reindex the system tables in single-user
> mode (got a "Cannot find namespace 0" error at some point).
>
> Anyone have an idea what this strange connectivity-delay could be about?
>
> Cheers
> Henry


Re: Cannot login for short period of time

От
Henry
Дата:
Quoting "Thomas Markus" <t.markus@proventis.net>:
> check your hardware (especially harddrive) for errors.

Ja, that was my first suspicion as well, but no.  Using a RAID5 setup,
with smart monitoring, etc ... no errors.

I've also just bumped up max_fsm_relations to 10000 - it was using the
default of 1000 or something.  My database reindex in single-user mode
kindly made the suggestion (we have many, many table partitions with
hordes of indexes - relations approaching 9000+).  reindexing due to
"Cannot find namespace X" error on insert.

Henry


Вложения

Re: Cannot login for short period of time

От
Scott Marlowe
Дата:
On Wed, May 13, 2009 at 1:42 AM, Henry <henry@zen.co.za> wrote:
> Quoting "Thomas Markus" <t.markus@proventis.net>:
>>
>> check your hardware (especially harddrive) for errors.
>
> Ja, that was my first suspicion as well, but no.  Using a RAID5 setup, with
> smart monitoring, etc ... no errors.
>
> I've also just bumped up max_fsm_relations to 10000 - it was using the
> default of 1000 or something.  My database reindex in single-user mode
> kindly made the suggestion (we have many, many table partitions with hordes
> of indexes - relations approaching 9000+).  reindexing due to "Cannot find
> namespace X" error on insert.

Whether or not max relations is the root of the login hang problem,
you likely have gotten a fair bit of bloat in your database if your
setting was too low by a factor of 10 for so long.    You may need to
look at recovering lost space in bloated tables and / or indexes.  If
the bloat is real bad, look at dumping / restoring the database for a
fresh start.  It's a pain because it requires downtime, but it's often
faster than anything else for a badly bloated database.

Re: Cannot login for short period of time

От
Henry
Дата:
Quoting "Scott Marlowe" <scott.marlowe@gmail.com>:
> Whether or not max relations is the root of the login hang problem,
> you likely have gotten a fair bit of bloat in your database if your
> setting was too low by a factor of 10 for so long.    You may need to
> look at recovering lost space in bloated tables and / or indexes.  If
> the bloat is real bad, look at dumping / restoring the database for a
> fresh start.  It's a pain because it requires downtime, but it's often
> faster than anything else for a badly bloated database.

This occurred just after a dump/restore, so things were nice and lean.

I have a suspicion this is related to bgwriter_* and checkpoint I/O.
I've tweaked bgwriter_* and am awaiting an opportunity to test this
again (busy reindexing all) and will report back.  Interestingly, this
behaviour didn't occur pre-partitioning.  It seems the sheer number of
additional relations tipped things over -- suddenly many defaults are
just too low and I'm having to dig into arcane settings.

This thread seems to be related:

http://archives.postgresql.org//pgsql-admin/2008-10/msg00041.php

Cheers
Henry





Вложения