Thread: Server hangs on multiple connections

Server hangs on multiple connections

From
David Christian
Date:
Hi, I posted this through the web page but it didn't come over the
list, so I am sending it directly.  Hope that's okay.

I am able to build and install PostgreSQL 7.2.2, and the postmaster
starts up fine, but when hit with multiple simultaneous connections (2
or more), the server freezes up and can only be stopped with an
immediate-mode shutdown.  The existing postgres processes appear hung
(see ps output below), and running psql to connect also just hangs.

The problem I'm describing is only happening when I install on the
following platform:  Yellow Dog Linux 2.3 on PowerPC (PowerMac G4
QuickSilver, dual processor) with 1.5GB RAM.


$ cat /proc/cpuinfo
processor       : 0
cpu             : 7450, altivec supported
clock           : 799MHz
revision        : 2.1 (pvr 8000 0201)
bogomips        : 797.90

processor       : 1
cpu             : 7450, altivec supported
clock           : 799MHz
revision        : 2.1 (pvr 8000 0201)
bogomips        : 797.90

total bogomips  : 1595.80
machine         : PowerMac3,5
motherboard     : PowerMac3,5 MacRISC2 MacRISC Power Macintosh
detected as     : 69 (PowerMac G4 Silver)
pmac flags      : 00000000
L2 cache        : 256K unified
memory          : 1536MB
pmac-generation : NewWorld


$ uname -a
Linux chef.rdss.com 2.4.19-4asmp #1 SMP Wed Jun 5 00:59:38 EDT 2002 ppc unknown


I have run PostgreSQL since 7.1 successfully on Red Hat Linux i386 and
Mac OS X 10.2 ppc (the very box I am currently having problems with)
without the lockup problem.  I am currently running PostgreSQL 7.2.2 on
a Red Hat i386 machine, installed from source, and it's working fine.

This problem can be replicated by building PostgreSQL from source and
running the 'make check' sequence.  It also happens when I 'make
install' and then initiate more than one simultaneous connection to the
PostgreSQL server.  The PostgreSQL server log does not show anything
unusual, until I kill the postmaster and then it reports on all the
backend connections that were terminated.
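
The multi-connection trigger described above can be approximated without
the regression suite.  A minimal sketch, assuming a POSIX shell;
run_concurrent is a hypothetical helper name, and the psql invocation in
the trailing comment assumes a reachable server and the stock template1
database:

```shell
#!/bin/sh
# run_concurrent N CMD...: start N copies of CMD in parallel, wait for all
# of them, and print how many exited with a non-zero status.
run_concurrent() {
    n=$1; shift
    pids=""
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null 2>&1 &
        pids="$pids $!"
        i=$((i + 1))
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed"
}

# Example (assumed database name and query; adjust to the local setup):
# run_concurrent 2 psql -c 'SELECT 1' template1
```

If the server hangs as described, the helper never returns, because wait
blocks until every client exits; that by itself reproduces the symptom.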


Here is the sequence of steps I use which results in this condition:

$ tar xfz postgresql-7.2.2.tar.gz
$ cd postgresql-7.2.2
$ ./configure
$ make
$ make check

It hangs at this step.  ps output shows:

  5836 pts/0    S      0:00 make check
  5919 pts/0    S      0:00 make -C src/test check
  5920 pts/0    S      0:00 make -C regress check
  5968 pts/0    S      0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte=
  7827 pts/0    S      0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./tmp_check/install//usr/local/pgsql/bin/po
  7830 pts/0    S      0:00 postgres: stats buffer process
  7832 pts/0    S      0:00 postgres: stats collector process
  7891 pts/0    S      0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte=
  7892 pts/0    S      0:00 tee ./regression.out
  7897 pts/0    S      0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte=
  7898 pts/0    S      0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte=
  7899 pts/0    S      0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./tmp_check/install//usr/local/pgsql/bin/ps
  7900 pts/0    S      0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte=
  7901 pts/0    S      0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./tmp_check/install//usr/local/pgsql/bin/ps
  7902 pts/0    S      0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./tmp_check/install//usr/local/pgsql/bin/ps
  7903 pts/0    S      0:00 postgres: davidc regression [local] DROP
  7904 pts/0    S      0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte=
  7905 pts/0    S      0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte=
  7906 pts/0    S      0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte=
  7907 pts/0    S      0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./tmp_check/install//usr/local/pgsql/bin/ps
  7908 pts/0    S      0:00 postgres: davidc regression [local] SELECT
  7909 pts/0    S      0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./tmp_check/install//usr/local/pgsql/bin/ps
  7910 pts/0    S      0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./tmp_check/install//usr/local/pgsql/bin/ps
  7911 pts/0    S      0:00 postgres: davidc regression [local] SELECT
  7912 pts/0    S      0:00 postgres: davidc regression [local] SELECT
  7913 pts/0    S      0:00 postgres: davidc regression [local] SELECT
  7914 pts/0    S      0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte=
  7915 pts/0    S      0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte=
  7916 pts/0    S      0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte=
  7917 pts/0    S      0:00 postgres: davidc regression [local] SELECT
  7918 pts/0    S      0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./tmp_check/install//usr/local/pgsql/bin/ps
  7919 pts/0    S      0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./tmp_check/install//usr/local/pgsql/bin/ps
  7920 pts/0    S      0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./tmp_check/install//usr/local/pgsql/bin/ps
  7921 pts/0    S      0:00 postgres: davidc regression [local] startup
  7922 pts/0    S      0:00 postgres: davidc regression [local] startup
  7923 pts/0    S      0:00 postgres: davidc regression [local] startup
  7924 pts/0    S      0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte=
  7925 pts/0    S      0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte=
  7926 pts/0    S      0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte=
  7927 pts/0    S      0:00 /bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte=
  7928 pts/0    S      0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./tmp_check/install//usr/local/pgsql/bin/ps
  7929 pts/0    S      0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./tmp_check/install//usr/local/pgsql/bin/ps
  7930 pts/0    S      0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./tmp_check/install//usr/local/pgsql/bin/ps
  7931 pts/0    S      0:00 /home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/./tmp_check/install//usr/local/pgsql/bin/ps
  7932 pts/0    S      0:00 postgres: davidc regression [local] startup
  7933 pts/0    S      0:00 postgres: davidc regression [local] startup
  7934 pts/0    S      0:00 postgres: davidc regression [local] startup
  7935 pts/0    S      0:00 postgres: davidc regression [local] startup


Some other possibly useful details:

$ gcc --version
2.95.4

$ make --version
GNU Make version 3.79.1, by Richard Stallman and Roland McGrath.
Built for powerpc-yellowdog-linux-gnu

$ autoconf --version
Autoconf version 2.13

$ rpm -qa | grep glibc
glibc-2.2.5-1.2.3a
glibc-devel-2.2.5-1.2.3a
glibc-common-2.2.5-1.2.3a


I tried to include the output from the installation commands, but was
told the message was too long to post to the list.  Please let me know
if it would help to send any of that separately.

PostgreSQL is a fantastic product, and I thoroughly enjoy using it on
other platforms; I would love to get it working on this one as well,
and I am at a loss as to why it appears to be hanging here.

Thank you in advance for your time in considering this submission.

David

Re: Server hangs on multiple connections

From
Tom Lane
Date:
David Christian <davidc@comtechmobile.com> writes:
> The problem I'm describing is only happening when I install on the
> following platform:  Yellow Dog Linux 2.3 on PowerPC (PowerMac G4
> QuickSilver, dual processor) with 1.5GB RAM.

Hmm ... I'm not sure whether anyone's tried it with a dual-processor PPC
system before.  I wonder if there's some problem with the PPC spinlock
code given multiple CPUs?

Could you build with --enable-debug and --enable-cassert (if you didn't
already), repeat the 'make check' scenario, and then attach to a few of
the stuck backend processes with gdb and get stack traces from them?
That would give us a little more info to work with.

> I have run PostgreSQL since 7.1 successfully on Red Hat Linux i386 and
> Mac OS X 10.2 ppc (the very box I am currently having problems with)
> without the lockup problem.

Have you run 7.2.* on this same box under OS X?  (Ie, could the problem
be specific to YDL?)

            regards, tom lane

Re: Server hangs on multiple connections

From
David Christian
Date:
On Thursday, Sep 19, 2002, at 17:10 US/Eastern, Tom Lane wrote:

> Could you build with --enable-debug and --enable-cassert (if you didn't
> already), repeat the 'make check' scenario, and then attach to a few of
> the stuck backend processes with gdb and get stack traces from them?
> That would give us a little more info to work with.

Happy to.  Interestingly, when I build with --enable-debug and
--enable-cassert, the server doesn't lock up during 'make check', it
just (very quickly) fails all of the tests and exits.  I tried several
times.

$ ./configure --enable-debug --enable-cassert
$ make
$ make check

Here is the tail end of the 'make check' output in that case:

/bin/sh ./pg_regress --temp-install --top-builddir=../../.. --schedule=./parallel_schedule --multibyte=
============== creating temporary installation        ==============
============== initializing database system           ==============
============== starting postmaster                    ==============
running on port 65432 with pid 7893
============== creating database "regression"         ==============
CREATE DATABASE
============== dropping regression test user accounts ==============
============== installing PL/pgSQL                    ==============
============== running regression test queries        ==============
parallel group (13 tests):  boolean int4 varchar char name text int2
int8 oid float4 bit numeric float8
      boolean              ... FAILED
      char                 ... FAILED
      name                 ... FAILED
      varchar              ... FAILED
      text                 ... FAILED
      int2                 ... FAILED
      int4                 ... FAILED
      int8                 ... FAILED
      oid                  ... FAILED
      float4               ... FAILED
      float8               ... FAILED
      bit                  ... FAILED
      numeric              ... FAILED
test strings              ... FAILED
test numerology           ... FAILED
parallel group (20 tests):  point box lseg path circle polygon time
date timetz timestamp timestamptz interval abstime tinterval reltime
inet comments oidjoins type_sanity opr_sanity
      point                ... FAILED
      lseg                 ... FAILED
      box                  ... FAILED
      path                 ... FAILED
      polygon              ... FAILED
      circle               ... FAILED
      date                 ... FAILED
      time                 ... FAILED
      timetz               ... FAILED
      timestamp            ... FAILED
      timestamptz          ... FAILED
      interval             ... FAILED
      abstime              ... FAILED
      reltime              ... FAILED
      tinterval            ... FAILED
      inet                 ... FAILED
      comments             ... FAILED
      oidjoins             ... FAILED
      type_sanity          ... FAILED
      opr_sanity           ... FAILED
test geometry             ... FAILED
test horology             ... FAILED
test create_function_1    ... FAILED
test create_type          ... FAILED
test create_table         ... FAILED
test create_function_2    ... FAILED
test copy                 ... FAILED
parallel group (7 tests):  constraints triggers create_misc
create_operator create_aggregate create_index inherit
      constraints          ... FAILED
      triggers             ... FAILED
      create_misc          ... FAILED
      create_aggregate     ... FAILED
      create_operator      ... FAILED
      create_index         ... FAILED
      inherit              ... FAILED
test create_view          ... FAILED
test sanity_check         ... FAILED
test errors               ... FAILED
test select               ... FAILED
parallel group (16 tests):  select_distinct select_into
select_distinct_on select_implicit select_having subselect case union
join aggregates transactions portals arrays random btree_index
hash_index
      select_into          ... FAILED
      select_distinct      ... FAILED
      select_distinct_on   ... FAILED
      select_implicit      ... FAILED
      select_having        ... FAILED
      subselect            ... FAILED
      union                ... FAILED
      case                 ... FAILED
      join                 ... FAILED
      aggregates           ... FAILED
      transactions         ... FAILED
      random               ... failed (ignored)
      portals              ... FAILED
      arrays               ... FAILED
      btree_index          ... FAILED
      hash_index           ... FAILED
test privileges           ... ok
test misc                 ... FAILED
parallel group (5 tests):  alter_table select_views portals_p2 rules
foreign_key
      select_views         ... FAILED
      alter_table          ... FAILED
      portals_p2           ... FAILED
      rules                ... FAILED
      foreign_key          ... FAILED
parallel group (3 tests):  plpgsql limit temp
      limit                ... FAILED
      plpgsql              ... FAILED
      temp                 ... FAILED
============== shutting down postmaster               ==============

=====================================================
  78 of 79 tests failed, 1 of these failures ignored.
=====================================================

The differences that caused some tests to fail can be viewed in the
file `./regression.diffs'.  A copy of the test summary that you see
above is saved in the file `./regression.out'.

make[2]: *** [check] Error 1
rm regress.o
make[2]: Leaving directory
`/home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress'
make[1]: *** [check] Error 2
make[1]: Leaving directory
`/home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test'
make: *** [check] Error 2


I then tried with just ./configure --enable-debug alone.  And it did
hang in the place I described in my first message.  (Between builds, I
rm -rf'd the installation postgresql-7.2.2 directory to be sure I was
using fully clean source each time.)

$ ps auxw | grep 'postgres:'
davidc   15639  0.0  0.0  7212 1380 pts/0    S    21:44   0:00 postgres: stats buffer process
davidc   15641  0.0  0.0  6268 1428 pts/0    S    21:44   0:00 postgres: stats collector process
davidc   15712  0.0  0.2  6664 3176 pts/0    S    21:44   0:00 postgres: davidc regression [local] idle
davidc   15715  0.0  0.1  6660 3040 pts/0    S    21:44   0:00 postgres: davidc regression [local] SELECT
davidc   15716  0.0  0.1  6660 3044 pts/0    S    21:44   0:00 postgres: davidc regression [local] SELECT
davidc   15717  0.0  0.1  6660 2944 pts/0    S    21:44   0:00 postgres: davidc regression [local] idle
davidc   15722  0.0  0.1  6660 2864 pts/0    S    21:44   0:00 postgres: davidc regression [local] SELECT
davidc   15731  0.0  0.1  6572 2140 pts/0    S    21:44   0:00 postgres: davidc regression [local] startup
davidc   15732  0.0  0.1  6568 1944 pts/0    S    21:44   0:00 postgres: davidc regression [local] startup
davidc   15733  0.0  0.1  6620 2524 pts/0    S    21:44   0:00 postgres: davidc regression [local] SELECT
davidc   15737  0.0  0.1  6568 1980 pts/0    S    21:44   0:00 postgres: davidc regression [local] startup
davidc   15738  0.0  0.1  6660 2844 pts/0    S    21:44   0:00 postgres: davidc regression [local] CREATE
davidc   15742  0.0  0.1  6568 1940 pts/0    S    21:44   0:00 postgres: davidc regression [local] startup
davidc   15743  0.0  0.1  6568 1876 pts/0    S    21:44   0:00 postgres: davidc regression [local] startup
davidc   15744  0.0  0.1  6548 1696 pts/0    S    21:44   0:00 postgres: davidc regression [local] startup


I don't really know what I'm doing with gdb, but I scanned the man
page, and here's what I typed:


$ gdb src/test/regress/tmp_check/install/usr/local/pgsql/bin/postgres 15715
GNU gdb Yellow Dog Linux (5.1.1-1b)
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and
you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for
details.
This GDB was configured as "ppc-yellowdog-linux"...
/home/davidc/src/PostgreSQL/postgresql-7.2.2/15715: No such file or
directory.
Attaching to program:
/home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/
tmp_check/install/usr/local/pgsql/bin/postgres, process 15715
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libm.so.6...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /usr/lib/libhistory.so.4...done.
Loaded symbols for /usr/lib/libhistory.so.4
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld.so.1...done.
Loaded symbols for /lib/ld.so.1
0x0fdc297c in __syscall_ipc () at soinit.c:76
76      soinit.c: No such file or directory.
         in soinit.c
(gdb) bt
#0  0x0fdc297c in __syscall_ipc () at soinit.c:76
#1  0x0fdc38c0 in semop (semid=4, sops=0x7fffea18, nsops=1) at
../sysdeps/unix/sysv/linux/semop.c:36
#2  0x100e4424 in IpcSemaphoreLock ()
#3  0x100eb018 in LWLockAcquire ()
#4  0x100e7f3c in LockAcquire ()
#5  0x100e7434 in LockRelation ()
#6  0x1002cc5c in relation_openr ()
#7  0x1002cdac in heap_openr ()
#8  0x100dc100 in fireRIRrules ()
#9  0x100dc878 in QueryRewrite ()
#10 0x100eeee4 in pg_analyze_and_rewrite ()
#11 0x100ef244 in pg_exec_query_string ()
#12 0x100f0688 in PostgresMain ()
#13 0x100cf3dc in DoBackend ()
#14 0x100cec54 in BackendStartup ()
#15 0x100cdaac in ServerLoop ()
#16 0x100cd564 in PostmasterMain ()
#17 0x100a26b8 in main ()
#18 0x0fd07f70 in __libc_start_main (argc=4, ubp_av=0x7ffff814,
ubp_ev=0x1, auxvec=0x7ffff8a8, rtld_fini=0x4, stinfo=0x10154c20,
stack_on_entry=0x1)
     at ../sysdeps/powerpc/elf/libc-start.c:119
(gdb) q
The program is running.  Quit anyway (and detach it)? (y or n) y
Detaching from program:
/home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/
tmp_check/install/usr/local/pgsql/bin/postgres, process 15715


$ gdb src/test/regress/tmp_check/install/usr/local/pgsql/bin/postgres 15738
GNU gdb Yellow Dog Linux (5.1.1-1b)
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and
you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for
details.
This GDB was configured as "ppc-yellowdog-linux"...
/home/davidc/src/PostgreSQL/postgresql-7.2.2/15738: No such file or
directory.
Attaching to program:
/home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/
tmp_check/install/usr/local/pgsql/bin/postgres, process 15738
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libm.so.6...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /usr/lib/libhistory.so.4...done.
Loaded symbols for /usr/lib/libhistory.so.4
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld.so.1...done.
Loaded symbols for /lib/ld.so.1
0x0fdc297c in __syscall_ipc () at soinit.c:76
76      soinit.c: No such file or directory.
         in soinit.c
(gdb) bt
#0  0x0fdc297c in __syscall_ipc () at soinit.c:76
#1  0x0fdc38c0 in semop (semid=4, sops=0x7fffe7a8, nsops=1) at
../sysdeps/unix/sysv/linux/semop.c:36
#2  0x100e4424 in IpcSemaphoreLock ()
#3  0x100eb018 in LWLockAcquire ()
#4  0x100e7f3c in LockAcquire ()
#5  0x100e7434 in LockRelation ()
#6  0x1003206c in index_beginscan ()
#7  0x1013d280 in SearchCatCache ()
#8  0x101420c8 in SearchSysCache ()
#9  0x100502c0 in CatalogIndexInsert ()
#10 0x1004baf0 in AddNewRelationTuple ()
#11 0x1004bd10 in heap_create_with_catalog ()
#12 0x1006c86c in DefineRelation ()
#13 0x100f1828 in ProcessUtility ()
#14 0x100ef2f8 in pg_exec_query_string ()
#15 0x100f0688 in PostgresMain ()
#16 0x100cf3dc in DoBackend ()
#17 0x100cec54 in BackendStartup ()
#18 0x100cdaac in ServerLoop ()
#19 0x100cd564 in PostmasterMain ()
#20 0x100a26b8 in main ()
#21 0x0fd07f70 in __libc_start_main (argc=4, ubp_av=0x7ffff814,
ubp_ev=0x1, auxvec=0x7ffff8a8, rtld_fini=0x4, stinfo=0x10154c20,
stack_on_entry=0x1)
     at ../sysdeps/powerpc/elf/libc-start.c:119
(gdb) q
The program is running.  Quit anyway (and detach it)? (y or n) y
Detaching from program:
/home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/
tmp_check/install/usr/local/pgsql/bin/postgres, process 15738


$ gdb src/test/regress/tmp_check/install/usr/local/pgsql/bin/postgres 15744
GNU gdb Yellow Dog Linux (5.1.1-1b)
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and
you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for
details.
This GDB was configured as "ppc-yellowdog-linux"...
/home/davidc/src/PostgreSQL/postgresql-7.2.2/15744: No such file or
directory.
Attaching to program:
/home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/
tmp_check/install/usr/local/pgsql/bin/postgres, process 15744
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libm.so.6...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /usr/lib/libhistory.so.4...done.
Loaded symbols for /usr/lib/libhistory.so.4
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld.so.1...done.
Loaded symbols for /lib/ld.so.1
0x0fdc297c in __syscall_ipc () at soinit.c:76
76      soinit.c: No such file or directory.
         in soinit.c
(gdb) bt
#0  0x0fdc297c in __syscall_ipc () at soinit.c:76
#1  0x0fdc38c0 in semop (semid=4, sops=0x7fffe738, nsops=1) at
../sysdeps/unix/sysv/linux/semop.c:36
#2  0x100e4424 in IpcSemaphoreLock ()
#3  0x100eb018 in LWLockAcquire ()
#4  0x100e7f3c in LockAcquire ()
#5  0x100e79b4 in XactLockTableInsert ()
#6  0x10040828 in StartTransaction ()
#7  0x10040bb0 in StartTransactionCommand ()
#8  0x1014a610 in InitPostgres ()
#9  0x100f036c in PostgresMain ()
#10 0x100cf3dc in DoBackend ()
#11 0x100cec54 in BackendStartup ()
#12 0x100cdaac in ServerLoop ()
#13 0x100cd564 in PostmasterMain ()
#14 0x100a26b8 in main ()
#15 0x0fd07f70 in __libc_start_main (argc=4, ubp_av=0x7ffff814,
ubp_ev=0x1, auxvec=0x7ffff8a8, rtld_fini=0x4, stinfo=0x10154c20,
stack_on_entry=0x1)
     at ../sysdeps/powerpc/elf/libc-start.c:119
(gdb) q
The program is running.  Quit anyway (and detach it)? (y or n) y
Detaching from program:
/home/davidc/src/PostgreSQL/postgresql-7.2.2/src/test/regress/
tmp_check/install/usr/local/pgsql/bin/postgres, process 15744


If I goofed on this, I'm afraid I will need to ask for some
hand-holding with using gdb properly.  I'm happy to go through the
steps to get you what you need to see.
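
The attach-and-backtrace steps above could also be scripted so that every
stuck backend is sampled in one pass.  This is only a sketch: backend_pids
and dump_backtraces are hypothetical helper names, it assumes unwrapped
`ps auxww` output with the PID in the second column, and it relies on
gdb's -batch and -x options together with the same "gdb PROGRAM PID"
attach form used in the session above:

```shell
#!/bin/sh
# backend_pids: read unwrapped `ps auxww` output on stdin and print the PID
# (second column) of every line that looks like a client backend.
backend_pids() {
    awk '/postgres: .*\[local\]/ { print $2 }'
}

# dump_backtraces PROGRAM: attach gdb to each backend in turn, run "bt"
# from a command file, and let batch mode detach and exit afterwards.
dump_backtraces() {
    program=$1
    cmds=$(mktemp) || return 1
    echo bt > "$cmds"
    ps auxww | backend_pids | while read -r pid; do
        echo "=== backend $pid ==="
        gdb -batch -x "$cmds" "$program" "$pid" 2>&1
    done
    rm -f "$cmds"
}

# Example (path taken from the session above):
# dump_backtraces src/test/regress/tmp_check/install/usr/local/pgsql/bin/postgres
```

The stats buffer and stats collector processes are skipped on purpose,
since their ps lines carry no "[local]" marker.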


>> I have run PostgreSQL since 7.1 successfully on Red Hat Linux i386 and
>> Mac OS X 10.2 ppc (the very box I am currently having problems with)
>> without the lockup problem.
>
> Have you run 7.2.* on this same box under OS X?  (Ie, could the problem
> be specific to YDL?)

Yes, I have, and I can hammer it all I want without it hanging.
Interestingly, I tried Yellow Dog's RPM also (7.2) and it exhibits the
same behavior (i.e., locking up on multiple connections).

Thanks,
David

Re: Server hangs on multiple connections

From
Tom Lane
Date:
David Christian <davidc@comtechmobile.com> writes:
> Happy to.  Interestingly, when I build with --enable-debug and
> --enable-cassert, the server doesn't lock up during 'make check', it
> just (very quickly) fails all of the tests and exits.  I tried several
> times.

Oh, that's interesting; that says that an Assert() check is failing.
We should investigate that first.

There should be a core file left in the database subdirectory after
the assert failure --- would you gdb it and get a stack trace from it?
Also, you will probably find some useful messages in the postmaster
log (which should be left in the log/ subdirectory of the regress tests).


> (gdb) bt
> #0  0x0fdc297c in __syscall_ipc () at soinit.c:76
> #1  0x0fdc38c0 in semop (semid=4, sops=0x7fffea18, nsops=1) at
> ../sysdeps/unix/sysv/linux/semop.c:36
> #2  0x100e4424 in IpcSemaphoreLock ()
> #3  0x100eb018 in LWLockAcquire ()
> #4  0x100e7f3c in LockAcquire ()
> #5  0x100e7434 in LockRelation ()

Sure enough, it would seem that everyone's stuck waiting for a lock.
But let's chase the Assert first; that might identify the problem.

            regards, tom lane
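
The backtraces posted earlier in the thread can be reduced to bare
function names to make that comparison easier.  A small sketch, assuming
gdb's usual "#N  0xADDR in func (...)" frame layout; bt_functions is a
hypothetical helper name:

```shell
#!/bin/sh
# bt_functions: read gdb "bt" output on stdin and print one function name
# per frame.  Continuation lines (wrapped arguments, "at file:line") do not
# start with "#N" and are ignored.
bt_functions() {
    awk '/^#[0-9]+/ { if ($3 == "in") print $4; else print $2 }'
}

# Example: save each backend's trace to a file, then compare the leading
# frames side by side.
# bt_functions < trace-15715.txt | head -5
```

Applied to the three traces above, the first five frames come out the
same each time: __syscall_ipc, semop, IpcSemaphoreLock, LWLockAcquire,
LockAcquire.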

Re: Server hangs on multiple connections

From
David Christian
Date:
On Thursday, Sep 19, 2002, at 18:33 US/Eastern, Tom Lane wrote:

> David Christian <davidc@comtechmobile.com> writes:
>> Happy to.  Interestingly, when I build with --enable-debug and
>> --enable-cassert, the server doesn't lock up during 'make check', it
>> just (very quickly) fails all of the tests and exits.  I tried several
>> times.
>
> Oh, that's interesting; that says that an Assert() check is failing.
> We should investigate that first.
>
> There should be a core file left in the database subdirectory after
> the assert failure --- would you gdb it and get a stack trace from it?
> Also, you will probably find some useful messages in the postmaster
> log (which should be left in the log/ subdirectory of the regress
> tests)

Unfortunately, I see no core file under the source tree after the
assert failure.
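
One possible explanation, not confirmed anywhere in this thread, is that
the shell that launched the postmaster had a core-size limit of zero, in
which case the kernel writes no core file when a backend aborts.  A hedged
sketch for checking that, assuming bash or another shell whose ulimit
supports -c; describe_core_limit is a hypothetical helper name:

```shell
#!/bin/sh
# describe_core_limit VALUE: turn the output of `ulimit -c` into a
# readable statement about whether core files can be written.
describe_core_limit() {
    case "$1" in
        0)         echo "core dumps disabled" ;;
        unlimited) echo "core dumps enabled (no size limit)" ;;
        *)         echo "core dumps limited to $1 blocks" ;;
    esac
}

describe_core_limit "$(ulimit -c)"

# To lift the limit for processes started from this shell
# (this fails if the hard limit is lower):
#   ulimit -c unlimited
#   make check
# Then search for regular files only, so directories named "core"
# do not match:
#   find src/test/regress/tmp_check -type f -name 'core*'
```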

The postmaster.log does show entries for failed assertions.  It is 246
lines long, and I am pasting it to the bottom of this message.
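
In a log of that length the assertion lines can be pulled out directly.
A minimal sketch, assuming the "TRAP: Failed Assertion" prefix seen in the
log below; assert_failures is a hypothetical helper name:

```shell
#!/bin/sh
# assert_failures: read a postmaster log on stdin and print each assertion
# trap prefixed with its line number, so the first failure is easy to find.
assert_failures() {
    grep -n 'TRAP: Failed Assertion'
}

# Example (log location as described in this thread):
# assert_failures < src/test/regress/log/postmaster.log
```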

>> (gdb) bt
>> #0  0x0fdc297c in __syscall_ipc () at soinit.c:76
>> #1  0x0fdc38c0 in semop (semid=4, sops=0x7fffea18, nsops=1) at
>> ../sysdeps/unix/sysv/linux/semop.c:36
>> #2  0x100e4424 in IpcSemaphoreLock ()
>> #3  0x100eb018 in LWLockAcquire ()
>> #4  0x100e7f3c in LockAcquire ()
>> #5  0x100e7434 in LockRelation ()
>
> Sure enough, it would seem that everyone's stuck waiting for a lock.
> But let's chase the Assert first; that might identify the problem.

Okay, hope this helps.  I really appreciate the time you are taking to
look at this.

David


[davidc@chef ~/src/PostgreSQL/postgresql-7.2.2]$ find . -name '*core*'
./contrib/retep/uk/org/retep/xml/core
./src/interfaces/jdbc/org/postgresql/core


[davidc@chef ~/src/PostgreSQL/postgresql-7.2.2/src/test/regress/log]$ cat postmaster.log
DEBUG:  database system was shut down at 2002-09-20 02:46:51 GMT
DEBUG:  checkpoint record is at 0/113640
DEBUG:  redo record is at 0/113640; undo record is at 0/0; shutdown TRUE
DEBUG:  next transaction id: 89; next oid: 16556
DEBUG:  database system is ready
ERROR:  DROP GROUP: group "regressgroup1" does not exist
TRAP: Failed Assertion("!(lock->shared > 0):", File: "lwlock.c", Line: 434)
!(lock->shared > 0) (0) [Success]
DEBUG:  server process (pid 22628) was terminated by signal 6
DEBUG:  terminating any other active server processes
NOTICE:  Message from PostgreSQL backend:
         The Postmaster has informed me that some other backend
         died abnormally and possibly corrupted shared memory.
         I have rolled back the current transaction and am
         going to terminate your database system connection and exit.
         Please reconnect to the database system and repeat your query.
NOTICE:  Message from PostgreSQL backend:
         The Postmaster has informed me that some other backend
         died abnormally and possibly corrupted shared memory.
         I have rolled back the current transaction and am
         going to terminate your database system connection and exit.
         Please reconnect to the database system and repeat your query.
NOTICE:  Message from PostgreSQL backend:
         The Postmaster has informed me that some other backend
         died abnormally and possibly corrupted shared memory.
         I have rolled back the current transaction and am
         going to terminate your database system connection and exit.
         Please reconnect to the database system and repeat your query.
FATAL 1:  The database system is in recovery mode
NOTICE:  Message from PostgreSQL backend:
         The Postmaster has informed me that some other backend
         died abnormally and possibly corrupted shared memory.
         I have rolled back the current transaction and am
         going to terminate your database system connection and exit.
         Please reconnect to the database system and repeat your query.
NOTICE:  Message from PostgreSQL backend:
         The Postmaster has informed me that some other backend
         died abnormally and possibly corrupted shared memory.
         I have rolled back the current transaction and am
         going to terminate your database system connection and exit.
         Please reconnect to the database system and repeat your query.
NOTICE:  Message from PostgreSQL backend:
         The Postmaster has informed me that some other backend
         died abnormally and possibly corrupted shared memory.
         I have rolled back the current transaction and am
         going to terminate your database system connection and exit.
         Please reconnect to the database system and repeat your query.
DEBUG:  all server processes terminated; reinitializing shared memory and semaphores
DEBUG:  database system was interrupted at 2002-09-20 02:46:51 GMT
DEBUG:  checkpoint record is at 0/113640
DEBUG:  redo record is at 0/113640; undo record is at 0/0; shutdown TRUE
DEBUG:  next transaction id: 89; next oid: 16556
DEBUG:  database system was not properly shut down; automatic recovery in progress
DEBUG:  redo starts at 0/113680
DEBUG:  ReadRecord: record with zero length at 0/138818
DEBUG:  redo done at 0/1387F0
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
DEBUG:  database system is ready
ERROR:  CREATE USER: user name "regressuser4" already exists
NOTICE:  ALTER GROUP: user "regressuser2" is already in group "regressgroup2"
ERROR:  atest2: Permission denied.
ERROR:  atest2: Permission denied.
ERROR:  atest2: Permission denied.
ERROR:  atest2: Permission denied.
ERROR:  LOCK TABLE: permission denied
ERROR:  atest2: Permission denied.
ERROR:  permission denied
ERROR:  atest2: Permission denied.
ERROR:  atest1: Permission denied.
ERROR:  atest2: Permission denied.
ERROR:  atest1: Permission denied.
ERROR:  atest1: Permission denied.
ERROR:  atest2: Permission denied.
ERROR:  atest1: Permission denied.
ERROR:  atest2: Permission denied.
ERROR:  atest2: Permission denied.
ERROR:  atest2: Permission denied.
ERROR:  atest2: Permission denied.
ERROR:  atest2: Permission denied.
ERROR:  atest3: Permission denied.
ERROR:  has_table_privilege: relation "pg_shad" does not exist
ERROR:  user "nosuchuser" does not exist
ERROR:  has_table_privilege: invalid privilege type sel
ERROR:  pg_aclcheck: invalid user id 4293967297
ERROR:  has_table_privilege: invalid relation oid 1
ERROR:  Relation "onek" does not exist
ERROR:  Relation "onek" does not exist
ERROR:  Relation "tmp" does not exist
ERROR:  Relation "tmp" does not exist
ERROR:  table "tmp" does not exist
ERROR:  Relation "onek" does not exist
ERROR:  Relation "onek" does not exist
ERROR:  Relation "onek" does not exist
ERROR:  Relation "onek" does not exist
ERROR:  Relation "onek2" does not exist
ERROR:  Relation "onek2" does not exist
ERROR:  Relation "onek2" does not exist
ERROR:  Relation "stud_emp" does not exist
ERROR:  Relation "stud_emp" does not exist
ERROR:  Relation "stud_emp" does not exist
ERROR:  Relation "stud_emp" does not exist
ERROR:  Relation "a_star" does not exist
ERROR:  Relation "b_star" does not exist
ERROR:  Relation "c_star" does not exist
ERROR:  Relation "d_star" does not exist
ERROR:  Relation "e_star" does not exist
ERROR:  Relation "f_star" does not exist
ERROR:  Relation "a_star" does not exist
ERROR:  Relation "a_star" does not exist
ERROR:  Relation "f_star" does not exist
ERROR:  Relation "e_star" does not exist
ERROR:  Relation "d_star" does not exist
ERROR:  Relation "c_star" does not exist
ERROR:  Relation "b_star" does not exist
ERROR:  Relation "a_star" does not exist
ERROR:  Relation "a_star" does not exist
ERROR:  Relation "a_star" does not exist
ERROR:  Relation "a_star" does not exist
ERROR:  Relation "a_star" does not exist
ERROR:  Relation "a_star" does not exist
ERROR:  Relation "f_star" does not exist
ERROR:  Relation "f_star" does not exist
ERROR:  Relation "e_star" does not exist
ERROR:  Relation "e_star" does not exist
ERROR:  Relation "a_star" does not exist
ERROR:  Relation "a_star" does not exist
ERROR:  Relation "person" does not exist
ERROR:  Relation "person" does not exist
ERROR:  Relation "hobbies_r" does not exist
ERROR:  Relation "hobbies_r" does not exist
ERROR:  Relation "person" does not exist
ERROR:  Relation "person" does not exist
ERROR:  Relation "person" does not exist
ERROR:  Relation "person" does not exist
ERROR:  Relation "person" does not exist
ERROR:  Relation "person" does not exist
ERROR:  Function 'user_relns()' does not exist
         Unable to identify a function that satisfies the given argument types
         You may need to add explicit typecasts
ERROR:  Function 'hobbies_by_name(unknown)' does not exist
         Unable to identify a function that satisfies the given argument types
         You may need to add explicit typecasts
ERROR:  Function 'oldstyle_length(int4, text)' does not exist
         Unable to identify a function that satisfies the given argument types
         You may need to add explicit typecasts
ERROR:  Relation "street" does not exist
ERROR:  Relation "iexit" does not exist
ERROR:  Relation "toyemp" does not exist
TRAP: Failed Assertion("!(lock->shared > 0):", File: "lwlock.c", Line: 434)
!(lock->shared > 0) (0) [Interrupted system call]
DEBUG:  server process (pid 23536) was terminated by signal 6
DEBUG:  terminating any other active server processes
NOTICE:  Message from PostgreSQL backend:
         The Postmaster has informed me that some other backend
         died abnormally and possibly corrupted shared memory.
         I have rolled back the current transaction and am
         going to terminate your database system connection and exit.
         Please reconnect to the database system and repeat your query.
NOTICE:  Message from PostgreSQL backend:
         The Postmaster has informed me that some other backend
         died abnormally and possibly corrupted shared memory.
         I have rolled back the current transaction and am
         going to terminate your database system connection and exit.
         Please reconnect to the database system and repeat your query.
NOTICE:  Message from PostgreSQL backend:
         The Postmaster has informed me that some other backend
         died abnormally and possibly corrupted shared memory.
         I have rolled back the current transaction and am
         going to terminate your database system connection and exit.
         Please reconnect to the database system and repeat your query.
DEBUG:  all server processes terminated; reinitializing shared memory and semaphores
DEBUG:  database system was interrupted at 2002-09-20 02:46:54 GMT
DEBUG:  checkpoint record is at 0/138818
DEBUG:  redo record is at 0/138818; undo record is at 0/0; shutdown TRUE
DEBUG:  next transaction id: 140; next oid: 24748
DEBUG:  database system was not properly shut down; automatic recovery in progress
DEBUG:  redo starts at 0/138858
DEBUG:  ReadRecord: record with zero length at 0/16D820
DEBUG:  redo done at 0/16D7F8
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
FATAL 1:  The database system is starting up
DEBUG:  smart shutdown request
DEBUG:  database system is ready
DEBUG:  shutting down
DEBUG:  database system is shut down

Re: Server hangs on multiple connections

From
Tom Lane
Date:
David Christian <davidc@comtechmobile.com> writes:
> On Thursday, Sep 19, 2002, at 18:33 US/Eastern, Tom Lane wrote:
>> There should be a core file left in the database subdirectory after
>> the assert failure --- would you gdb it and get a stack trace from it?

> Unfortunately, I see no core file under the source tree after the
> assert failure.

If you are using "make check" then look for
src/test/regress/tmp_check/data/base/*/core

If you don't see one then you must be running with a ulimit setting that
forbids core dumps --- try "ulimit -c unlimited" before starting the
postmaster.

> TRAP: Failed Assertion("!(lock->shared > 0):", File: "lwlock.c", Line: 434)

This confirms my suspicion that something is busted in lock handling on
your machine, but there's not enough info here to tell just what.  We
still need a stack trace.

Another interesting line of attack would be to try compiling
src/backend/storage/lmgr/lwlock.c at different optimization levels,
to see if the problem goes away with less optimization.  We saw a
problem on AIX (if memory serves) before 7.2 release that turned out
to be due to overaggressive optimization by the compiler.  We thought
we'd added enough "volatile" keywords to lwlock.c to discourage any
code rearrangement, but maybe we still need more.

            regards, tom lane
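
The `ulimit -c unlimited` advice above changes the core-file size limit of the shell that launches the postmaster. As a minimal sketch of the same idea from inside a process, the following raises the soft `RLIMIT_CORE` limit to the hard limit using the POSIX `getrlimit`/`setrlimit` calls. The function name `allow_core_dumps` is hypothetical and is not part of PostgreSQL.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper (not PostgreSQL code): raise the soft core-file
 * size limit to the hard limit, which is the most an unprivileged
 * process may request.  Returns 0 on success, -1 on failure.
 */
int
allow_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    rl.rlim_cur = rl.rlim_max;      /* soft limit := hard limit */
    return setrlimit(RLIMIT_CORE, &rl);
}
```

If the hard limit is itself 0, this cannot enable core dumps; the limit then has to be raised by whoever starts the process.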

Re: Server hangs on multiple connections

From
David Christian
Date:
On Friday, Sep 20, 2002, at 11:30 US/Eastern, Tom Lane wrote:

> If you are using "make check" then look for
> src/test/regress/tmp_check/data/base/*/core

Thanks.  Here it is:

$ gdb src/test/regress/tmp_check/install/usr/local/pgsql/bin/postmaster \
    src/test/regress/tmp_check/data/base/16556/core
GNU gdb Yellow Dog Linux (5.1.1-1b)
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "ppc-yellowdog-linux"...
Core was generated by `postgres: davidc regression [local] startup
                              '.
Program terminated with signal 6, Aborted.
Reading symbols from /usr/lib/libz.so.1...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libm.so.6...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /usr/lib/libhistory.so.4...done.
Loaded symbols for /usr/lib/libhistory.so.4
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld.so.1...done.
Loaded symbols for /lib/ld.so.1
#0  0x0fd1be44 in kill () at soinit.c:76
76      soinit.c: No such file or directory.
         in soinit.c
(gdb) bt
#0  0x0fd1be44 in kill () at soinit.c:76
#1  0x0fd1bcc0 in raise (sig=6) at ../sysdeps/posix/raise.c:27
#2  0x0fd1d374 in abort () at ../sysdeps/generic/abort.c:88
#3  0x1016882c in vararg_format (fmt=0x0) at excabort.c:27
#4  0x10168744 in ExcUnCaught (excP=0x101f0968, detail=0, data=0x0, message=0x101c16b4 "!(lock->shared > 0)") at exc.c:168
#5  0x101687d4 in ExcRaise (excP=0x101f0968, detail=0, data=0x0, message=0x101c16b4 "!(lock->shared > 0)") at exc.c:185
#6  0x10167a4c in ExceptionalCondition (conditionName=0x101c16b4 "!(lock->shared > 0)", exceptionP=0x101f0968, detail=0x0, fileName=0x8 <Address 0x8 out of bounds>, lineNumber=37) at assert.c:70
#7  0x1010ca84 in LWLockRelease (lockid=BufMgrLock) at lwlock.c:434
#8  0x10108a84 in LockAcquire (lockmethod=1, locktag=0x300eb548, xid=806273048, lockmode=1, dontWait=0 '\000') at lock.c:723
#9  0x10107640 in LockRelation (relation=0x102507b8, lockmode=1) at lmgr.c:153
#10 0x100333b0 in relation_openr (relationName=0x101d47a0 "pg_class", lockmode=1) at heapam.c:524
#11 0x1003355c in heap_openr (relationName=0x0, lockmode=6) at heapam.c:595
#12 0x101613a0 in scan_pg_rel_ind (buildinfo={infotype = 2, i = {info_id = 270195424, info_name = 0x101adae0 "pg_trigger_tgrelid_index"}}) at relcache.c:356
#13 0x10161254 in ScanPgRelation (buildinfo={infotype = 2, i = {info_id = 270195424, info_name = 0x101adae0 "pg_trigger_tgrelid_index"}}) at relcache.c:284
#14 0x10162854 in RelationBuildDesc (buildinfo={infotype = 2, i = {info_id = 270195424, info_name = 0x101adae0 "pg_trigger_tgrelid_index"}}, oldrelation=0x0) at relcache.c:968
#15 0x1016335c in RelationNameGetRelation (relationName=0x101adae0 "pg_trigger_tgrelid_index") at relcache.c:1493
#16 0x10033380 in relation_openr (relationName=0x101adae0 "pg_trigger_tgrelid_index", lockmode=0) at heapam.c:518
#17 0x1003ae20 in index_openr (relationName=0x0) at indexam.c:149
#18 0x1009a1a0 in RelationBuildTriggers (relation=0x301897d8) at trigger.c:551
#19 0x1016290c in RelationBuildDesc (buildinfo={infotype = 2, i = {info_id = 270357192, info_name = 0x101d52c8 "pg_shadow"}}, oldrelation=0x301897d8) at relcache.c:1033
#20 0x1016335c in RelationNameGetRelation (relationName=0x101d52c8 "pg_shadow") at relcache.c:1493
#21 0x10033380 in relation_openr (relationName=0x101d52c8 "pg_shadow", lockmode=0) at heapam.c:518
#22 0x1003355c in heap_openr (relationName=0x0, lockmode=6) at heapam.c:595
#23 0x1015ec20 in CatalogCacheInitializeCache (cache=0x102718b8) at catcache.c:216
#24 0x10160194 in SearchCatCache (cache=0x102718b8, v1=270840281, v2=0, v3=0, v4=0) at catcache.c:862
#25 0x10165c50 in SearchSysCache (cacheId=22, key1=270840281, key2=0, key3=0, key4=0) at syscache.c:461
#26 0x1016dba4 in InitializeSessionUserId (username=0x1024b1d9 "davidc") at miscinit.c:450
#27 0x1016ea70 in InitPostgres (dbname=0x10230f40 "regression", username=0x1024b1d9 "davidc") at postinit.c:337
#28 0x10112190 in PostgresMain (argc=4, argv=0x7fffecd8, username=0x1024b1d9 "davidc") at postgres.c:1684
#29 0x100ecb54 in DoBackend (port=0x1024b0a8) at postmaster.c:2243
#30 0x100ec3cc in BackendStartup (port=0x1024b0a8) at postmaster.c:1874
#31 0x100eb224 in ServerLoop () at postmaster.c:977
#32 0x100eaccc in PostmasterMain (argc=4, argv=0x1022ab60) at postmaster.c:771
#33 0x100bdf68 in main (argc=4, argv=0x7ffff814) at main.c:206
#34 0x0fd07f70 in __libc_start_main (argc=4, ubp_av=0x7ffff814, ubp_ev=0x0, auxvec=0x7ffff8a8, rtld_fini=Cannot access memory at address 0x0
) at ../sysdeps/powerpc/elf/libc-start.c:119


> Another interesting line of attack would be to try compiling
> src/backend/storage/lmgr/lwlock.c at different optimization levels,
> to see if the problem goes away with less optimization.  We saw a
> problem on AIX (if memory serves) before 7.2 release that turned out
> to be due to overaggressive optimization by the compiler.  We thought
> we'd added enough "volatile" keywords to lwlock.c to discourage any
> code rearrangement, but maybe we still need more.

Okay, I will try to figure out how to do what you just said :-) and
meanwhile hope the stack trace above is helpful.

Thanks!
David

Re: Server hangs on multiple connections

From
David Christian
Date:
On Friday, Sep 20, 2002, at 11:30 US/Eastern, Tom Lane wrote:

> Another interesting line of attack would be to try compiling
> src/backend/storage/lmgr/lwlock.c at different optimization levels,
> to see if the problem goes away with less optimization.  We saw a
> problem on AIX (if memory serves) before 7.2 release that turned out
> to be due to overaggressive optimization by the compiler.  We thought
> we'd added enough "volatile" keywords to lwlock.c to discourage any
> code rearrangement, but maybe we still need more.

This seems to work:

$ ./configure
$ make
$ cd src/backend/storage/lmgr
$ rm lwlock.o
$ gcc -O0 -g -Wall -Wmissing-prototypes -Wmissing-declarations \
    -I../../../../src/include -c -o lwlock.o lwlock.c
$ cd -
$ make check

All tests pass except 'geometry'.
I also tried the above with -O1, and it still failed on 'make check'.

So, is it safe to proceed this way?  If this turns out to be the
solution, is there anything I should be aware of with regard to
stability and performance vs. a normal install?

Thanks,
David

Re: Server hangs on multiple connections

From
Tom Lane
Date:
David Christian <davidc@comtechmobile.com> writes:
> On Friday, Sep 20, 2002, at 11:30 US/Eastern, Tom Lane wrote:
>> Another interesting line of attack would be to try compiling
>> src/backend/storage/lmgr/lwlock.c at different optimization levels,

[ and indeed the problem goes away at -O0 ]

> So, is it safe to proceed this way?  If this turns out to be the
> solution, is there anything I should be aware of with regard to
> stability and performance vs. a normal install?

This should be stable; whether there's a measurable performance hit
from de-optimizing just that one file is hard to say.

At this point I would say that the problem is that the compiler's
optimizer is rearranging the order of operations inside lwlock.c
in a way that breaks the code for parallel operations.  This could
be a compiler bug, or it could be that the compiler is doing something
it's allowed to do under the C specification --- in which case we need
to add some more "volatile"s to fix it.

Could you send me (off-list, since it's likely to be large) the lwlock.s
file produced by

    gcc -O0 -I../../../../src/include -S lwlock.c

as well as the one produced by

    gcc -O1 -I../../../../src/include -S lwlock.c

Groveling through the assembly code should at least tell me what's being
changed ...

            regards, tom lane
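
To make the reordering concern concrete, here is a simplified model of a lock that keeps a count of shared holders, written with C11 atomics rather than PostgreSQL's own spinlock macros. All names (`toy_lwlock`, `toy_acquire_shared`, `toy_release_shared`) are hypothetical, and the real lwlock.c is more involved. The sketch only illustrates that updates to the counter must stay inside the spinlock-protected region. If the compiler or the CPU lets them escape, two processes can modify `shared` at once, and a later release may find it at zero, which is the condition the failed assertion checks.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified lock; not PostgreSQL's LWLock structure. */
typedef struct
{
    atomic_flag mutex;      /* spinlock guarding the fields below */
    bool        exclusive;  /* true while an exclusive holder exists */
    int         shared;     /* number of current shared holders */
} toy_lwlock;

#define TOY_LWLOCK_INIT { ATOMIC_FLAG_INIT, false, 0 }

/* Spin until the flag is ours; acquire ordering keeps the critical
 * section's loads and stores from moving above this point. */
static void
toy_spin_acquire(atomic_flag *m)
{
    while (atomic_flag_test_and_set_explicit(m, memory_order_acquire))
        ;                   /* busy-wait */
}

/* Release ordering keeps the critical section's loads and stores
 * from moving below the store that frees the spinlock. */
static void
toy_spin_release(atomic_flag *m)
{
    atomic_flag_clear_explicit(m, memory_order_release);
}

/* Take the lock in shared mode; returns false if held exclusively. */
bool
toy_acquire_shared(toy_lwlock *l)
{
    bool ok;

    toy_spin_acquire(&l->mutex);
    ok = !l->exclusive;
    if (ok)
        l->shared++;
    toy_spin_release(&l->mutex);
    return ok;
}

/* Drop one shared hold; returns false where an assertion such as
 * "shared > 0" would have failed. */
bool
toy_release_shared(toy_lwlock *l)
{
    bool ok;

    toy_spin_acquire(&l->mutex);
    ok = l->shared > 0;
    if (ok)
        l->shared--;
    toy_spin_release(&l->mutex);
    return ok;
}
```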

Re: Server hangs on multiple connections

From
Tom Lane
Date:
Well, the long and the short of it seems to be that no one before you
ever tried to run Postgres on a multi-CPU PowerPC machine :-(

Some digging around on the net made it clear that we were missing
synchronization instructions that are critical for access to shared
memory in a multi-CPU system.  I have applied the attached patch to
CVS tip (7.3beta2-almost).  It looks like it will apply cleanly to
7.2.*, so please try it out (with optimization re-enabled) and let
us know what you see!

(I have confirmed that this patch causes no trouble on LinuxPPC and
OS X 10.1, but I do not have a multi-CPU machine to see if it really
solves the problem...)

            regards, tom lane


*** src/backend/storage/lmgr/s_lock.c.orig    Thu Jun 20 16:29:35 2002
--- src/backend/storage/lmgr/s_lock.c    Fri Sep 20 20:11:53 2002
***************
*** 115,120 ****
--- 115,123 ----
  /* used in darwin. */
  /* We key off __APPLE__ here because this function differs from
   * the LinuxPPC implementation only in compiler syntax.
+  *
+  * NOTE: per the Enhanced PowerPC Architecture manual, v1.0 dated 7-May-2002,
+  * an isync is a sufficient synchronization barrier after a lwarx/stwcx loop.
   */
  static void
  tas_dummy()
***************
*** 134,139 ****
--- 137,143 ----
  fail:        li         r3,1        \n\
              blr                 \n\
  success:                        \n\
+             isync                \n\
              li         r3,0        \n\
              blr                    \n\
  ");
***************
*** 158,163 ****
--- 162,168 ----
  fail:        li        3,1         \n\
              blr                 \n\
  success:                        \n\
+             isync                \n\
              li         3,0            \n\
              blr                    \n\
  ");
*** src/include/storage/s_lock.h.orig    Mon Sep  2 09:50:09 2002
--- src/include/storage/s_lock.h    Fri Sep 20 20:11:46 2002
***************
*** 217,222 ****
--- 217,237 ----
  #endif     /* defined(__mc68000__) && defined(__linux__) */


+ #if defined(__ppc__) || defined(__powerpc__)
+ /*
+  * We currently use out-of-line assembler for TAS on PowerPC; see s_lock.c.
+  * S_UNLOCK is almost standard but requires a "sync" instruction.
+  */
+ #define S_UNLOCK(lock)    \
+ do \
+ {\
+     __asm__ __volatile__ ("    sync \n"); \
+     *((volatile slock_t *) (lock)) = 0; \
+ } while (0)
+
+ #endif /* defined(__ppc__) || defined(__powerpc__) */
+
+
  #if defined(NEED_VAX_TAS_ASM)
  /*
   * VAXen -- even multiprocessor ones
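
The patch adds an `isync` after a successful lwarx/stwcx loop and a `sync` before the store that clears the lock. In portable C11 terms this roughly corresponds to acquire ordering on the test-and-set and release ordering on the unlock. The sketch below assumes a C11 compiler with `<stdatomic.h>`; `toy_slock_t`, `toy_tas` and `toy_unlock` are hypothetical names, and PostgreSQL 7.2 does not use C11 atomics. Which instructions a compiler emits for these orderings on PowerPC is left to the compiler.

```c
#include <stdatomic.h>

/* Hypothetical spinlock type for illustration only. */
typedef atomic_flag toy_slock_t;

/*
 * Test-and-set with acquire ordering.  Follows the return convention
 * visible in the patch's assembler: 0 on success, 1 if the lock was
 * already held.
 */
int
toy_tas(toy_slock_t *lock)
{
    return atomic_flag_test_and_set_explicit(lock, memory_order_acquire)
        ? 1 : 0;
}

/*
 * Unlock with release ordering: everything written while holding the
 * lock becomes visible before the lock is seen as free.
 */
void
toy_unlock(toy_slock_t *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

A caller would spin on `toy_tas` until it returns 0, touch the shared data, then call `toy_unlock`. On a single CPU the ordering arguments make little observable difference, which is consistent with the problem showing up only on the dual-processor machine.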

Re: Server hangs on multiple connections

From
David Christian
Date:
I think you've fixed it.  With your patch, and a simple

$ ./configure
$ make
$ make check
# make install

the check works and all tests (except geometry on floating point stuff)
pass; and after installing, I can really hammer the server and it
doesn't hang.  Looks like users on Yellow Dog Linux multi-CPU PowerPC
platforms won't have this problem anymore ... that is, when the next
person besides me decides to try it. :-)

This whole exercise was well worth my time, and I get the added bonus
of not having to switch machines.  I know you spent a lot of time on it
and I greatly appreciate your care and responsiveness.

Many thanks - feel free to ask me to check anything else on this
platform you would like to see.

David


On Friday, Sep 20, 2002, at 20:40 US/Eastern, Tom Lane wrote:

> Well, the long and the short of it seems to be that no one before you
> ever tried to run Postgres on a multi-CPU PowerPC machine :-(
>
> Some digging around on the net made it clear that we were missing
> synchronization instructions that are critical for access to shared
> memory in a multi-CPU system.  I have applied the attached patch to
> CVS tip (7.3beta2-almost).  It looks like it will apply cleanly to
> 7.2.*, so please try it out (with optimization re-enabled) and let
> us know what you see!

[snip]