Обсуждение: Cannot login for short period of time
Greets, Pg: 8.3.7 I'm trying to diagnose why I cannot login to Pg on occasion. The psql command will just hang (so I cannot get in to see what it's doing) and a telnet into 5432 will give the usual: Connected to localhost (127.0.0.1). Escape character is '^]'. indicating the backend accepting the connection, but not proceeding. The number of connections at the time are small, file descriptor usage will be small, etc, server load low. It will then spontaneously recover as if it's just completed doing something IO intensive. My config: listen_addresses = '1.2.1.1,127.0.0.1' port = 5432 max_connections = 2048 shared_buffers = 520MB temp_buffers = 128MB work_mem = 1512MB max_fsm_pages = 512000 fsync = off checkpoint_segments = 32 effective_cache_size = 512MB constraint_exclusion = on logging_collector = on track_counts = on autovacuum = on autovacuum_freeze_max_age = 2000000000 I'm rerunning things with statement logging turned on to get a clue. Incidentally, I did have to reindex the system tables in single-user mode (got a "Cannot find namespace 0" error at some point). Anyone have an idea what this strange connectivity-delay could be about? Cheers Henry
Вложения
Hi, check your hardware (especially harddrive) for errors. regards Thomas Henry schrieb: > > Greets, > > Pg: 8.3.7 > > I'm trying to diagnose why I cannot login to Pg on occasion. The psql > command will just hang (so I cannot get in to see what it's doing) and > a telnet into 5432 will give the usual: > > Connected to localhost (127.0.0.1). > Escape character is '^]'. > > indicating the backend accepting the connection, but not proceeding. > > The number of connections at the time are small, file descriptor usage > will be small, etc, server load low. > > It will then spontaneously recover as if it's just completed doing > something IO intensive. > > My config: > > listen_addresses = '1.2.1.1,127.0.0.1' > port = 5432 > max_connections = 2048 > shared_buffers = 520MB > temp_buffers = 128MB > work_mem = 1512MB > max_fsm_pages = 512000 > fsync = off > checkpoint_segments = 32 > effective_cache_size = 512MB > constraint_exclusion = on > logging_collector = on > track_counts = on > autovacuum = on > autovacuum_freeze_max_age = 2000000000 > > I'm rerunning things with statement logging turned on to get a clue. > Incidentally, I did have to reindex the system tables in single-user > mode (got a "Cannot find namespace 0" error at some point). > > Anyone have an idea what this strange connectivity-delay could be about? > > Cheers > Henry
Quoting "Thomas Markus" <t.markus@proventis.net>: > check your hardware (especially harddrive) for errors. Ja, that was my first suspicion as well, but no. Using a RAID5 setup, with smart monitoring, etc ... no errors. I've also just bumped up max_fsm_relations to 10000 - it was using the default of 1000 or something. My database reindex in single-user mode kindly made the suggestion (we have many, many table partitions with hordes of indexes - relations approaching 9000+). reindexing due to "Cannot find namespace X" error on insert. Henry
Вложения
On Wed, May 13, 2009 at 1:42 AM, Henry <henry@zen.co.za> wrote: > Quoting "Thomas Markus" <t.markus@proventis.net>: >> >> check your hardware (especially harddrive) for errors. > > Ja, that was my first suspicion as well, but no. Using a RAID5 setup, with > smart monitoring, etc ... no errors. > > I've also just bumped up max_fsm_relations to 10000 - it was using the > default of 1000 or something. My database reindex in single-user mode > kindly made the suggestion (we have many, many table partitions with hordes > of indexes - relations approaching 9000+). reindexing due to "Cannot find > namespace X" error on insert. Whether or not max relations is the root of the login hang problem, you likely have gotten a fair bit of bloat in your database if your setting was too low by a factor of 10 for so long. You may need to look at recovering lost space in bloated tables and / or indexes. If the bloat is real bad, look at dumping / restoring the database for a fresh start. It's a pain because it requires downtime, but it's often faster than anything else for a badly bloated database.
Quoting "Scott Marlowe" <scott.marlowe@gmail.com>: > Whether or not max relations is the root of the login hang problem, > you likely have gotten a fair bit of bloat in your database if your > setting was too low by a factor of 10 for so long. You may need to > look at recovering lost space in bloated tables and / or indexes. If > the bloat is real bad, look at dumping / restoring the database for a > fresh start. It's a pain because it requires downtime, but it's often > faster than anything else for a badly bloated database. This occurred just after a dump/restore, so things were nice and lean. I have a suspicion this is related to bgwriter_* and checkpoint I/O. I've tweaked bgwriter_* and am awaiting an opportunity to test this again (busy reindexing all) and will report back. Interestingly, this behaviour didn't occur pre-partitioning. It seems the sheer number of additional relations tipped things over -- suddenly many defaults are just too low and I'm having to dig into arcane settings. This thread seems to be related: http://archives.postgresql.org//pgsql-admin/2008-10/msg00041.php Cheers Henry