Discussion: Cygwin PostgreSQL Regression Test Problems
Over the last few days, I ran the regression tests for 7.1 Beta 3 much more than I have in the past for 7.0.2 and 7.0.3. Unfortunately, I experienced the following problems:

1. Until I did a cvs update last night (1/14/2001), the regression tests were failing on 1/12 and 1/13. Did anyone do a cvs commit that would keep backend children from stackdump-ing on Cygwin? I hope so. Here are some interesting snippets:

--- pg_regress output ---
..
parallel group (7 tests): create_aggregate create_operator inherit triggers constraints create_misc create_index
     constraints ... FAILED
     triggers ... FAILED
     create_misc ... FAILED
     create_aggregate ... ok
..
--- pg_regress output ---

--- postmaster output ---
NOTICE: Message from PostgreSQL backend:
    The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory.
    I have rolled back the current transaction and am going to terminate your database system connection and exit.
    Please reconnect to the database system and repeat your query.
..
ERROR: Relation 'temptest' does not exist
0 [main] postmaster 2640 handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
479 [main] postmaster 2640 stackdump: Dumping stack trace to postmaster.exe.stackdump
Server process (pid 2640) exited with status 139 at Sat Jan 13 21:28:36 2001
Terminating any active server processes...
Server processes were terminated at Sat Jan 13 21:28:36 2001
Reinitializing shared memory and semaphores
IpcMemoryDetach: shmdt(0x120b0000) failed: Invalid argument
..
--- postmaster output ---

2. I am unable to successfully run the regression tests on an NT 4.0 SP5 machine with only 64 MB of physical memory and about 175 MB of swap space. Other than lacking RAM and swap space, this machine is the "same" as other NT/2000 machines which can successfully run the regression tests. The tests usually hang during the "parallel group (18 tests)" test right after numerology. By "hang," I mean that the original postmaster is still running, but there are no postmaster children, and there are some number of psql processes hanging around. Using NT's Task Manager, I can see that the machine is running out of memory. I have even seen the "Windows is running low on virtual memory" dialog a few times. Should I expect this behavior from such a lame machine?

3. Once (or twice), I noticed that the plpgsql test failed. Unfortunately, I didn't capture the precise output, but I think that postmaster was complaining about being unable to

    mv <somepath>/pg_internal.init.<somepid> <somepath>/pg_internal.init

due to a permissions problem. Sorry for being vague...

Thanks,
Jason

--
Jason Tishler
Director, Software Engineering       Phone: +1 (732) 264-8770 x235
Dot Hill Systems Corp.               Fax:   +1 (732) 264-8798
82 Bethany Road, Suite 7             Email: Jason.Tishler@dothill.com
Hazlet, NJ 07730 USA                 WWW:   http://www.dothill.com
Jason Tishler <Jason.Tishler@dothill.com> writes:
> parallel group (7 tests): create_aggregate create_operator inherit triggers constraints create_misc create_index
>      constraints ... FAILED
>      triggers ... FAILED
>      create_misc ... FAILED
>      create_aggregate ... ok

Can't tell much from this. What are the detail diffs (regression.diffs file?)

> 2. I am unable to successfully run the regression tests on a NT 4.0 SP5
> machine with only 64 MB of physical memory and about 175 MB of swap space.
> Other than lacking RAM and swap space, this machine is the "same" as other
> NT/2000 machines which can successfully run the regression tests.
> The tests usually hang during the "parallel group (18 tests)" test
> right after numerology. By "hang," I mean that the original postmaster
> is still running, but there are no postmaster children, and there are
> some number of psql processes hanging around.

Hm. You will have 18 backends firing up there, plus 18 psqls to drive 'em, and probably 18 shell subprocesses parenting the psqls. I wouldn't be too surprised at running out of memory --- but one would like to expect a more graceful failure than just hanging. What if anything shows up in the postmaster log?

> 3. Once (or twice), I noticed that the plpgsql test failed.
> Unfortunately, I didn't capture the precise output but I think that
> postmaster was complaining about being unable to
> mv <somepath>/pg_internal.init.<somepid> <somepath>/pg_internal.init
> due to a permissions problem. Sorry, for being vague...

Hm. The first backend to fire up after a vacuum will try to rebuild pg_internal.init, and then move it into place with

    /*
     * And rename the temp file to its final name, deleting any
     * previously-existing init file.
     */
    if (rename(tempfilename, finalfilename) < 0)
    {
        elog(NOTICE, "Cannot rename init file %s to %s: %m\n\tContinuing anyway, but there's something wrong.", tempfilename, finalfilename);
    }

In a parallel test it's possible that several backends would try to do this at about the same time, but that should be OK; we should end up with just one file from the last-to-finish backend. I think you have found another Cygwin bug :-(

regards, tom lane
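The write-then-rename pattern described above can be sketched in isolation. This is only an illustration under stated assumptions, not PostgreSQL's code: the function name install_file_via_rename and the "<finalname>.<pid>" temp naming are hypothetical, modeled on the pg_internal.init.<somepid> name mentioned in the thread. It relies on POSIX rename() replacing an existing destination, and, unlike the NOTICE-only path quoted above, it removes the temp file when the rename fails.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL's code): write `data` to a
 * per-process temp file next to `finalname`, then rename it into place.
 * Returns 0 on success, -1 if the file could not be written or renamed.
 */
int
install_file_via_rename(const char *finalname, const char *data)
{
    char        tempname[1024];
    FILE       *fp;
    int         n;

    /* Per-process temp name, so concurrent writers never share a temp file. */
    n = snprintf(tempname, sizeof(tempname), "%s.%ld",
                 finalname, (long) getpid());
    if (n < 0 || (size_t) n >= sizeof(tempname))
        return -1;

    fp = fopen(tempname, "w");
    if (fp == NULL)
        return -1;
    if (fputs(data, fp) == EOF)
    {
        fclose(fp);
        unlink(tempname);
        return -1;
    }
    if (fclose(fp) != 0)
    {
        unlink(tempname);
        return -1;
    }

    /* POSIX rename() replaces an existing destination in one step. */
    if (rename(tempname, finalname) < 0)
    {
        fprintf(stderr, "cannot rename %s to %s: %s\n",
                tempname, finalname, strerror(errno));
        unlink(tempname);       /* do not leave the temp file behind */
        return -1;
    }
    return 0;
}
```

If several processes call this at about the same time on a POSIX system, each renames its own temp file and the last rename to finish determines the final contents, which matches the "just one file from the last-to-finish backend" expectation above.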
Tom,

I'm finally back in front of the machine where I ran these tests...

On Tue, Jan 16, 2001 at 01:45:21AM -0500, Tom Lane wrote:
> Jason Tishler <Jason.Tishler@dothill.com> writes:
> > parallel group (7 tests): create_aggregate create_operator inherit triggers constraints create_misc create_index
> >      constraints ... FAILED
> >      triggers ... FAILED
> >      create_misc ... FAILED
> >      create_aggregate ... ok
>
> Can't tell much from this. What are the detail diffs (regression.diffs file?)

Unfortunately, I ran more (successful) tests after these failures, so the detail diffs are no longer available.

> > 2. I am unable to successfully run the regression tests on a NT 4.0 SP5
> > machine with only 64 MB of physical memory and about 175 MB of swap space.
> > Other than lacking RAM and swap space, this machine is the "same" as other
> > NT/2000 machines which can successfully run the regression tests.
>
> What if anything shows up in the postmaster log?

Sorry, the postmaster log is gone too.

> > 3. Once (or twice), I noticed that the plpgsql test failed.
> > Unfortunately, I didn't capture the precise output but I think that
> > postmaster was complaining about being unable to
> > mv <somepath>/pg_internal.init.<somepid> <somepath>/pg_internal.init
> > due to a permissions problem. Sorry, for being vague...
>
> Hm. The first backend to fire up after a vacuum will try to rebuild
> pg_internal.init, and then move it into place with
>
>     /*
>      * And rename the temp file to its final name, deleting any
>      * previously-existing init file.
>      */
>     if (rename(tempfilename, finalfilename) < 0)
>     {
>         elog(NOTICE, "Cannot rename init file %s to %s: %m\n\tContinuing anyway, but there's something wrong.", tempfilename,finalfilename);
>     }
>
> In a parallel test it's possible that several backends would try to do
> this at about the same time, but that should be OK; we should end up
> with just one file from the last-to-finish backend. I think you have
> found another Cygwin bug :-(

Windows has issues with open files. So, if a backend tries to rename a file while another process has it open, the rename will fail. Will this cause database integrity problems? Or will there just be some spurious warning?

Thanks,
Jason
Tom,

On Thu, Jan 18, 2001 at 12:39:59PM -0500, Tom Lane wrote:
> However --- I suppose Windows can't cope with deleting a file someone
> else is holding open, either?

Yes.

> That would cause significantly bigger problems :-(

That sounds ominous, please elaborate.

Thanks,
Jason
Jason Tishler <Jason.Tishler@dothill.com> writes:
>> In a parallel test it's possible that several backends would try to do
>> this at about the same time, but that should be OK; we should end up
>> with just one file from the last-to-finish backend. I think you have
>> found another Cygwin bug :-(

> Windows has issues with open files. So, if a backend is trying to
> rename a file when it is open (by another), then the rename will fail.
> Will this cause database integrity problems? Or, will there just be
> some spurious warning?

In this context the only bad side-effect is that a useless temporary file gets left around. It's small, so I wouldn't worry too much.

However --- I suppose Windows can't cope with deleting a file someone else is holding open, either? That would cause significantly bigger problems :-(

regards, tom lane
Jason Tishler <Jason.Tishler@dothill.com> writes:
> On Thu, Jan 18, 2001 at 12:39:59PM -0500, Tom Lane wrote:
>> However --- I suppose Windows can't cope with deleting a file someone
>> else is holding open, either?

> Yes.

>> That would cause significantly bigger problems :-(

> That sounds ominous, please elaborate.

If you drop a table that someone else has recently used, the someone else's backend is probably still holding the file open. We generally don't close open file descriptors until we have to. In current sources I think that you'd get a "cannot unlink" NOTICE, but the table would get logically dropped anyway, and the sole side-effect would be failure to recover the disk space. But in this case we could be talking about large amounts of disk space.

regards, tom lane
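For contrast, here is a small sketch of the POSIX behavior that the drop-table case above depends on: unlink() removes the name at once, while a descriptor that was already open keeps working until it is closed. The function name readable_after_unlink and the scratch path passed to it are assumptions made up for this illustration. On a platform where deleting an open file fails, as the thread says of Windows, the unlink() call here would be expected to return an error and the function would return -1.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical demo: returns 1 if data written before unlink() can still
 * be read through the already-open descriptor once the name is gone,
 * 0 if not, and -1 if the file could not be created, written or unlinked.
 */
int
readable_after_unlink(const char *path)
{
    const char  msg[] = "still here";
    char        buf[sizeof(msg)];
    struct stat st;
    int         fd;
    int         ok;

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, msg, sizeof(msg)) != (ssize_t) sizeof(msg))
    {
        close(fd);
        unlink(path);
        return -1;
    }

    /* Remove the name while the descriptor is still open. */
    if (unlink(path) < 0)
    {
        close(fd);
        return -1;
    }

    /* The name must be gone ... */
    ok = (stat(path, &st) < 0);

    /* ... yet the contents must still be reachable through fd. */
    ok = ok
        && lseek(fd, 0, SEEK_SET) == 0
        && read(fd, buf, sizeof(buf)) == (ssize_t) sizeof(buf)
        && memcmp(buf, msg, sizeof(msg)) == 0;

    close(fd);                  /* last reference dropped here */
    return ok ? 1 : 0;
}
```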
Tom,

On Thu, Jan 18, 2001 at 12:59:00PM -0500, Tom Lane wrote:
> In current sources I think that you'd get a "cannot unlink" NOTICE,
> but the table would get logically dropped anyway, and the sole
> side-effect would be failure to recover the disk space. But in this
> case we could be talking about large amounts of disk space.

Cygwin does attempt to overcome the Windows open file issue. If a sharing violation is detected (i.e., the file is open) during an unlink operation (really DeleteFile), Cygwin will queue it for deletion later. However, reading the Cygwin code, I found the following:

    /* FIXME: this delqueue module is very flawed and should be rewritten.
       First, having an array of a fixed size for keeping track of the
       unlinked but not yet deleted files is bad. Second, some programs
       will unlink files and then create a new one in the same location
       and this behavior is not supported in the current code. Probably
       we should find a move/rename function that will work on open files,
       and move delqueue files to some special location or some such
       hack... */

With the above caveats, is the current functionality sufficient for PostgreSQL's needs?

Thanks,
Jason
> Tom,
>
> On Thu, Jan 18, 2001 at 12:59:00PM -0500, Tom Lane wrote:
> > In current sources I think that you'd get a "cannot unlink" NOTICE,
> > but the table would get logically dropped anyway, and the sole
> > side-effect would be failure to recover the disk space. But in this
> > case we could be talking about large amounts of disk space.
>
> Cygwin does attempt to overcome the Windows open file issue. If a sharing
> violation is detected (i.e., the file is open) during an unlink operation
> (really DeleteFile), Cygwin will queue it for deletion later. However,
> reading the Cygwin code, I found the following:
>
> /* FIXME: this delqueue module is very flawed and should be rewritten.
> First, having an array of a fixed size for keeping track of the
> unlinked but not yet deleted files is bad. Second, some programs
> will unlink files and then create a new one in the same location
> and this behavior is not supported in the current code. Probably
> we should find a move/rename function that will work on open files,
> and move delqueue files to some special location or some such
> hack... */
>
> With the above caveats, is the current functionality sufficient for
> PostgreSQL's needs?

No, it doesn't seem sufficient, though 7.1 will be a little better because of oid file names.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Jason Tishler <Jason.Tishler@dothill.com> writes:
> /* FIXME: this delqueue module is very flawed and should be rewritten.
> First, having an array of a fixed size for keeping track of the
> unlinked but not yet deleted files is bad. Second, some programs
> will unlink files and then create a new one in the same location
> and this behavior is not supported in the current code. Probably
> we should find a move/rename function that will work on open files,
> and move delqueue files to some special location or some such
> hack... */

> With the above caveats, is the current functionality sufficient for
> PostgreSQL's needs?

The fixed-size-array thing sounds like a gotcha waiting to bite someone. How big is the array, anyway?

The unlink/recreate issue is not a problem for us anymore, since we use OIDs as filenames --- we won't try to reuse the same filename.

regards, tom lane
Tom,

On Thu, Jan 18, 2001 at 01:53:36PM -0500, Tom Lane wrote:
> Jason Tishler <Jason.Tishler@dothill.com> writes:
> > With the above caveats, is the current functionality sufficient for
> > PostgreSQL's needs?
>
> The fixed-size-array thing sounds like a gotcha waiting to bite someone.

Agreed.

> How big is the array, anyway?

The current size is 100 deep. Is that sufficient for PostgreSQL or is this dependent on usage?

Jason
Jason Tishler <Jason.Tishler@dothill.com> writes:
>> The fixed-size-array thing sounds like a gotcha waiting to bite someone.

> Agreed.

>> How big is the array, anyway?

> The current size is 100 deep. Is that sufficient for PostgreSQL or is
> this dependent on usage?

Mumble. I'm sure you could gin up a scenario where it fails, but deleting 100 recently-used tables in one transaction doesn't seem like a very likely situation.

Probably a more interesting question to ask is how graceful is the behavior when that array fills up?

regards, tom lane
Tom,

On Thu, Jan 18, 2001 at 03:53:44PM -0500, Tom Lane wrote:
> Probably a more interesting question to ask is how graceful is the
> behavior when that array fills up?

If no slots are available, then the file is never queued. Hence, it is never deleted.

Jason
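The failure mode described above can be illustrated with a toy fixed-size queue. This is a sketch under stated assumptions, not Cygwin's delqueue code: the type DelQueue, the functions delqueue_init, delqueue_add and delqueue_process, and the 260-byte path limit are all hypothetical. Only the slot count of 100 and the "no free slot means never queued" behavior come from the thread.

```c
#include <stddef.h>
#include <string.h>

#define DELQUEUE_SLOTS    100   /* queue depth reported in the thread */
#define DELQUEUE_PATH_MAX 260   /* assumption made for this sketch */

typedef struct
{
    char        paths[DELQUEUE_SLOTS][DELQUEUE_PATH_MAX];
    int         used[DELQUEUE_SLOTS];
} DelQueue;

/* Mark every slot as free. */
void
delqueue_init(DelQueue *q)
{
    memset(q, 0, sizeof(*q));
}

/*
 * Queue a path for later deletion.  Returns the slot index, or -1 when
 * the path is too long or no slot is free.  In the -1 case nothing
 * remembers the file, so it would never be deleted.
 */
int
delqueue_add(DelQueue *q, const char *path)
{
    size_t      len = strlen(path);

    if (len >= DELQUEUE_PATH_MAX)
        return -1;
    for (int i = 0; i < DELQUEUE_SLOTS; i++)
    {
        if (!q->used[i])
        {
            memcpy(q->paths[i], path, len + 1);
            q->used[i] = 1;
            return i;
        }
    }
    return -1;
}

/*
 * Retry every queued deletion with the supplied function, which must
 * return 0 on success (like unlink() or remove()).  Slots are freed on
 * success; the return value is the number of entries still pending.
 */
int
delqueue_process(DelQueue *q, int (*try_delete) (const char *))
{
    int         pending = 0;

    for (int i = 0; i < DELQUEUE_SLOTS; i++)
    {
        if (!q->used[i])
            continue;
        if (try_delete(q->paths[i]) == 0)
            q->used[i] = 0;
        else
            pending++;
    }
    return pending;
}
```

A caller that wants a more graceful failure than the one described above would check the return value of delqueue_add and report the path that could not be queued, rather than dropping it silently.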