Discussion: 7.0RC1: possible query and backend problem
I've just upgraded to 7.0RC1 on a Solaris 2.6+patches system (U5 with 256MB). I'm having some trouble with a brand-new query, so I don't know if this is a new bug or an old one. It has two parts: query execution and backend robustness.

First, I'm getting a crash with this query (via Apache mod_perl and DBI, or via psql):

Table U: id varchar(128) primary key, stop datetime, start datetime
Table S1: id varchar(128) primary key, url varchar(512);
Table S2: id varchar(128) primary key, url varchar(512);

	SELECT id,date_part('epoch',sum(stop-start)),count(*),S1.url,S2.url
	FROM U WHERE 'now'::datetime-start<'1 month'::interval
	AND (id=S1.id OR id=S2.id) GROUP BY id,S1.url,S2.url;

The idea is to select whichever of S1.url and S2.url is non-null. From psql, I get:

pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.
connection to server was lost

And now there are no postgres processes running.

I suspect that the problem is that, in my data set, either S1.url or S2.url will be null for any given row, and this is causing problems for GROUP BY (I can also believe that my SQL is bad - but that shouldn't crash postgres, should it?). If I stick to just S1, the query runs OK.

But... now there are no postgres processes running at all.

# ps -fu postgres
     UID   PID  PPID  C    STIME TTY      TIME CMD

In the postgres log:

>Server process (pid 11189) exited with status 139 at Wed Apr 19 16:37:25 2000
>Terminating any active server processes...
>NOTICE:  Message from PostgreSQL backend:
>        The Postmaster has informed me that some other backend died
>abnormally and possibly corrupted shared memory.
>        I have rolled back the current transaction and am going to
>terminate your database system connection and exit.
>        Please reconnect to the database system and repeat your query.
>Server processes were terminated at Wed Apr 19 16:37:25 2000
>Reinitializing shared memory and semaphores
>IpcMemoryCreate: shmget failed (Invalid argument) key=5432110,
>size=144, permission=700
>This type of error is usually caused by an improper
>shared memory or System V IPC semaphore configuration.
>For more information, see the FAQ and platform-specific
>FAQ's in the source directory pgsql/doc or on our
>web site at http://www.postgresql.org.
>IpcMemoryIdGet: shmget failed (Invalid argument) key=5432110,
>size=144, permission=0
>IpcMemoryAttach: shmat failed (Invalid argument) id=-2
>FATAL 1:  AttachSLockMemory: could not attach segment

Now, the real problem here is that postgres seems to be trying to restart, and fails. It complains about shmget - my /etc/system says:

set shmsys:shminfo_shmmax = 0x40000000
set shmsys:shminfo_shmmni = 100
set shmsys:shminfo_shmseg = 32
set semsys:seminfo_semmns = 1024
set semsys:seminfo_semmni = 1024
set semsys:seminfo_semmnu = 1024
set semsys:seminfo_semmsl = 1024
set semsys:seminfo_semmap = 1024

So I don't think this is really a config problem (if it were, the db wouldn't start up to begin with). Rather, I'd guess that postgres isn't freeing a memory block during the "Terminating any active server processes..." phase.
Running ipcs -a immediately after the crash shows several segments (although oddly, I have no trouble starting the db again as root - maybe there's a permissions issue at work here?):

# ipcs -a
IPC status from <running system> as of Wed Apr 19 16:45:42 2000
T         ID      KEY        MODE        OWNER    GROUP  CREATOR   CGROUP NATTCH      SEGSZ  CPID  LPID   ATIME    DTIME    CTIME
Shared Memory:
m        800   0x0052e32e --rw-------  postgres postgres postgres postgres      0        120 12737 12737 13:01:36 13:01:36 13:01:36
m        801   0x0052e325 --rw-------  postgres postgres postgres postgres      0   15373312 12737 12737 13:01:36 13:01:36 13:01:36
m        802   0x0052e32b --rw-------  postgres postgres postgres postgres      0     102184 12737 12737 13:01:36 13:01:36 13:01:36
m          3   0x500005a3 --rw-r--r--      root     root     root     root      1         68   232   232 15:29:35 15:29:35 15:29:35

Anyway, I hope this report is helpful. Thanks to the developers for their hard work on PG7.

-- Mike
At 5:16 PM -0700 4/19/2000, Michael Blakeley wrote:
>I've just upgraded to 7.0RC1 on a Solaris 2.6+patches system (U5
>with 256MB). I'm having some trouble with a brand-new query, so I
>don't know if this is a new bug or an old one. It has two parts:
>query execution and backend robustness.
>
>First, I'm getting a crash with this query (via Apache mod_perl and
>DBI, or via psql):
>
>Table U: id varchar(128) primary key, stop datetime, start datetime
>Table S1: id varchar(128) primary key, url varchar(512);
>Table S2: id varchar(128) primary key, url varchar(512);
>
>	SELECT id,date_part('epoch',sum(stop-start)),count(*),S1.url,S2.url
>	FROM U WHERE 'now'::datetime-start<'1 month'::interval
>	AND (id=S1.id OR id=S2.id) GROUP BY id,S1.url,S2.url;

For the benefit of anyone trying to work around crashes like this, the query works fine for me as a UNION instead:

	SELECT id,date_part('epoch',sum(stop-start)),count(*),S1.url
	FROM U WHERE 'now'::datetime-start<'1 month'::interval
	AND id=S1.id GROUP BY id,S1.url
	UNION
	SELECT id,date_part('epoch',sum(stop-start)),count(*),S2.url
	FROM U WHERE 'now'::datetime-start<'1 month'::interval
	AND id=S2.id GROUP BY id,S2.url;

Here's a backtrace, too:

# gdb ./bin/postgres ./data/base/sonicbox/core
GNU gdb 4.17
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "sparc-sun-solaris2.6"...
Core was generated by `/usr/local/pgsql/bin/postmaster -i -N 512 -B 1024 -o -F'.
Program terminated with signal 11, Segmentation Fault.
Reading symbols from /usr/lib/libnsl.so.1...done.
Reading symbols from /usr/lib/libsocket.so.1...done.
Reading symbols from /usr/lib/libdl.so.1...done.
Reading symbols from /usr/lib/libm.so.1...done.
Reading symbols from /usr/lib/libc.so.1...done.
Reading symbols from /usr/lib/libmp.so.2...done.
Reading symbols from /usr/platform/SUNW,Ultra-5_10/lib/libc_psr.so.1...done.
Reading symbols from /usr/lib/nss_files.so.1...done.
#0  0x110f6c in make_one_rel_by_joins ()
(gdb) bt
#0  0x110f6c in make_one_rel_by_joins ()
#1  0x110d40 in make_one_rel ()
#2  0x11be74 in subplanner ()
#3  0x11bd98 in query_planner ()
#4  0x11ca40 in union_planner ()
#5  0x11c3fc in subquery_planner ()
#6  0x11c13c in planner ()
#7  0x16314c in pg_plan_query ()
#8  0x163428 in pg_exec_query_dest ()
#9  0x163270 in pg_exec_query ()
#10 0x164f60 in PostgresMain ()
#11 0x131738 in DoBackend ()
#12 0x130f08 in BackendStartup ()
#13 0x12f41c in ServerLoop ()
#14 0x12e978 in PostmasterMain ()
#15 0xd89f8 in main ()

-- Mike
Michael Blakeley <mike@blakeley.com> writes:
> I've just upgraded to 7.0RC1 on a Solaris 2.6+patches system (U5 with
> 256MB). I'm having some trouble with a brand-new query, so I don't
> know if this is a new bug or an old one. It has two parts: query
> execution and backend robustness.

>	SELECT id,date_part('epoch',sum(stop-start)),count(*),S1.url,S2.url
>	FROM U WHERE 'now'::datetime-start<'1 month'::interval
>	AND (id=S1.id OR id=S2.id) GROUP BY id,S1.url,S2.url;

> pqReadData() -- backend closed the channel unexpectedly.
>         This probably means the backend terminated abnormally
>         before or while processing the request.

I've fixed this problem, I think --- the three-way join clause was confusing the planner :-(.

> I suspect that the problem is that, in my data set, either S1.url or
> S2.url will be null for any given row, and this is causing problems
> for GROUP BY

That shouldn't be a problem, although you do need to be careful when dealing with NULLs --- it's easy to write the query so that the WHERE condition will produce NULL, which is interpreted as FALSE. But the above should work OK, because (NULL OR TRUE) will produce TRUE.

>> IpcMemoryCreate: shmget failed (Invalid argument) key=5432110,
>> size=144, permission=700

Hmm, that is odd. The thing that looks peculiar to me is that it seems to be calculating a different size for the segment than it did the first time through:

> # ipcs -a
> IPC status from <running system> as of Wed Apr 19 16:45:42 2000
> T         ID      KEY        MODE        OWNER    GROUP  CREATOR
> CGROUP NATTCH      SEGSZ  CPID  LPID   ATIME    DTIME    CTIME
> Shared Memory:
> m        800   0x0052e32e --rw-------  postgres postgres postgres
> postgres      0        120 12737 12737 13:01:36 13:01:36 13:01:36

See the difference? 120 vs 144? What's causing that I wonder... and would it explain the failure to reattach?

			regards, tom lane
I wrote:
>>> IpcMemoryCreate: shmget failed (Invalid argument) key=5432110,
>>> size=144, permission=700

> Hmm, that is odd. The thing that looks peculiar to me is that
> it seems to be calculating a different size for the segment than
> it did the first time through:

>> # ipcs -a
>> IPC status from <running system> as of Wed Apr 19 16:45:42 2000
>> T         ID      KEY        MODE        OWNER    GROUP  CREATOR
>> CGROUP NATTCH      SEGSZ  CPID  LPID   ATIME    DTIME    CTIME
>> Shared Memory:
>> m        800   0x0052e32e --rw-------  postgres postgres postgres
>> postgres      0        120 12737 12737 13:01:36 13:01:36 13:01:36

> See the difference? 120 vs 144? What's causing that I wonder...
> and would it explain the failure to reattach?

After further investigation, "Invalid argument" is the typical kernel error code from shmget() if one tries to attach to an existing shared memory segment that is smaller than one asked for. So that's consistent.

The size requested for the spinlock segment (which is the only one of Postgres' three shmem segments that could be as small as 144 bytes) is computed by "sizeof(struct foo)"; there is no way that that is going to change from one invocation to the next. But the numbers 120 and 144 are consistent with the theory that the shmem segment was originally created by Postgres 6.5 and you are now trying to attach to it with Postgres 7.0 --- 7.0 has more spinlocks than 6.5 did.

Your trace appeared to show a working 7.0 postmaster getting this error while trying to reinitialize. That doesn't make any sense to me; if the 7.0 postmaster had managed to start up originally, then it must have found or created a suitably-sized shmem segment. So I'm confused about the details, but I've got to think that we are looking at some sort of interference between 6.5 and 7.0 installations.
One possibility is that after you started the 7.0 postmaster, you accidentally tried to start a 6.5 postmaster on the same port number, and the 6.5 code managed to resize the shared mem segment before failing because of the port number conflict. Not sure if that could happen --- my shmget() man page doesn't say anything about changing the size of an already-existing shmem segment, but maybe your Unix works differently. Comments anyone?

If this actually is what happened, we should reorder the startup sequence to check for port-number conflicts before any shared memory segments are touched. But I'm not sure about it.

			regards, tom lane

PS: if you don't see the connection between port number and shmem, it's this: the key numbers used for shmem segments are computed from the port number. So different postmasters can coexist on one machine if they have different port numbers; they'll get different shmem segments. But starting two postmasters on the same port is bad news. I thought we had adequate interlocks against that, but now I'm wondering.