Thread: curious regression failures (was Re: [PATCHES] PL/TCL Patch to prevent postgres from becoming multithreaded)
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>> ! ERROR: could not read block 2 of relation 1663/16384/2606: read only 0 of 8192 bytes
>>
>> Is that repeatable? What sort of filesystem are you testing on?
>> (soft-mounted NFS by any chance?)

> doesn't seem to be repeatable :-(

Hmm ...
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=luna_moth&dt=2007-09-19%2013:10:01

Exact same error --- is it at the same place in the tests where you saw it?

Now that I think about it, there have been similar transient failures
("read only 0 of 8192 bytes") in the buildfarm before.  It would be
helpful to collect a list of exactly which build reports contain
that string, but AFAIK there's no very easy way to do that; Andrew,
any suggestions?

			regards, tom lane
Re: curious regression failures (was Re: [PATCHES] PL/TCL Patch to prevent postgres from becoming multithreaded)
From:
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> Tom Lane wrote:
>>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>>> ! ERROR: could not read block 2 of relation 1663/16384/2606: read only 0 of 8192 bytes
>>> Is that repeatable? What sort of filesystem are you testing on?
>>> (soft-mounted NFS by any chance?)
>
>> doesn't seem to be repeatable :-(
>
> Hmm ...
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=luna_moth&dt=2007-09-19%2013:10:01
>
> Exact same error --- is it at the same place in the tests where you saw it?

looks like it is in a similar place:
http://www.kaltenbrunner.cc/files/regression.diffs
(I don't have more than this on that failure any more)

Stefan
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> Tom Lane wrote:
>>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>>> ! ERROR: could not read block 2 of relation 1663/16384/2606: read only 0 of 8192 bytes
>>> Is that repeatable? What sort of filesystem are you testing on?
>>> (soft-mounted NFS by any chance?)
>
>> doesn't seem to be repeatable :-(
>
> Hmm ...
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=luna_moth&dt=2007-09-19%2013:10:01
>
> Exact same error --- is it at the same place in the tests where you saw it?
>
> Now that I think about it, there have been similar transient failures
> ("read only 0 of 8192 bytes") in the buildfarm before.  It would be
> helpful to collect a list of exactly which build reports contain
> that string, but AFAIK there's no very easy way to do that; Andrew,
> any suggestions?

pgbfprod=# select sysname, stage, snapshot from build_status where log ~ $$read only \d+ of \d+ bytes$$;
  sysname   |    stage     |      snapshot
------------+--------------+---------------------
 zebra      | InstallCheck | 2007-09-11 10:25:03
 wildebeest | InstallCheck | 2007-09-11 22:00:11
 baiji      | InstallCheck | 2007-09-12 22:39:24
 luna_moth  | InstallCheck | 2007-09-19 13:10:01
(4 rows)

cheers

andrew
Re: curious regression failures (was Re: [PATCHES] PL/TCL Patch to prevent postgres from becoming multithreaded)
From:
Stefan Kaltenbrunner
Date:
Andrew Dunstan wrote:
> Tom Lane wrote:
>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>> Tom Lane wrote:
>>>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>>>> ! ERROR: could not read block 2 of relation 1663/16384/2606: read
>>>>> only 0 of 8192 bytes
>>>> Is that repeatable? What sort of filesystem are you testing on?
>>>> (soft-mounted NFS by any chance?)
>>
>>> doesn't seem to be repeatable :-(
>>
>> Hmm ...
>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=luna_moth&dt=2007-09-19%2013:10:01
>>
>> Exact same error --- is it at the same place in the tests where you
>> saw it?
>>
>> Now that I think about it, there have been similar transient failures
>> ("read only 0 of 8192 bytes") in the buildfarm before.  It would be
>> helpful to collect a list of exactly which build reports contain
>> that string, but AFAIK there's no very easy way to do that; Andrew,
>> any suggestions?
>
> pgbfprod=# select sysname, stage, snapshot from build_status where log ~
> $$read only \d+ of \d+ bytes$$;
>   sysname   |    stage     |      snapshot
> ------------+--------------+---------------------
>  zebra      | InstallCheck | 2007-09-11 10:25:03
>  wildebeest | InstallCheck | 2007-09-11 22:00:11
>  baiji      | InstallCheck | 2007-09-12 22:39:24
>  luna_moth  | InstallCheck | 2007-09-19 13:10:01

hmm, all of those seem to fail the foreign key checks in a very similar
way, and those are vastly different platforms (Windows, Solaris, OpenBSD
and Linux).

Stefan
Andrew Dunstan <andrew@dunslane.net> writes:
> pgbfprod=# select sysname, stage, snapshot from build_status where log ~
> $$read only \d+ of \d+ bytes$$;
>   sysname   |    stage     |      snapshot
> ------------+--------------+---------------------
>  zebra      | InstallCheck | 2007-09-11 10:25:03
>  wildebeest | InstallCheck | 2007-09-11 22:00:11
>  baiji      | InstallCheck | 2007-09-12 22:39:24
>  luna_moth  | InstallCheck | 2007-09-19 13:10:01
> (4 rows)

Fascinating.  So I would venture that (1) it's definitely our bug, not
something we could blame on NFS or whatever, and (2) we introduced it
fairly recently.  That specific error message wording exists only in
HEAD, but it's been there since 2007-01-03, so if there were a
pre-existing problem you'd think there would be some more matches.

The patterns I notice here are (1) they're all InstallCheck not Check
failures; (2) though not all at the same place in the tests, it's a
fairly short range; (3) it's all references to system catalogs, though
not all the same one.

My gut feeling is that we're seeing autovacuum truncate off an empty
end block and then a backend tries to reference that block again.
But there should be enough interlocks in place to prevent such
references.  Any ideas out there?

			regards, tom lane
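[Editor's note: Tom's gut feeling above can be sketched as a toy simulation. The following is illustrative Python only, not PostgreSQL code; `Relation`, `Backend`, and `target_block` are made-up stand-ins for the real storage-manager and relcache machinery (`target_block` loosely mirrors the relcache's `rd_targblock`). It shows how a cached insertion target plus a concurrent truncation produces exactly a "read only 0 of 8192 bytes"-style failure.]

```python
class Relation:
    """Toy relation: just a block count, blocks are 8192-byte pages."""
    def __init__(self, nblocks):
        self.nblocks = nblocks

    def read_block(self, blkno):
        if blkno >= self.nblocks:
            # analogue of "could not read block N ...: read only 0 of 8192 bytes"
            raise IOError("could not read block %d: read only 0 of 8192 bytes" % blkno)
        return b"\x00" * 8192


class Backend:
    """Toy backend that caches the block it last inserted into."""
    def __init__(self, rel):
        self.rel = rel
        self.target_block = None   # loose analogue of relcache rd_targblock

    def insert_tuple(self):
        # First choice is the cached target block; fall back to the last block.
        blkno = self.target_block if self.target_block is not None \
                else self.rel.nblocks - 1
        self.rel.read_block(blkno)   # fails if the cache points past EOF
        self.target_block = blkno
        return blkno


rel = Relation(nblocks=3)
backend = Backend(rel)
first = backend.insert_tuple()   # caches block 2 as the insertion target

rel.nblocks = 2                  # "autovacuum" truncates the empty last block

try:
    backend.insert_tuple()       # stale cache -> read past end of file
    stale_read_failed = False
except IOError:
    stale_read_failed = True
```

In this toy model the failure is deterministic; in the real system it only happens if the truncation lands in the narrow window before the backend's cached target is reset, which is why the buildfarm failures are transient.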
"Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> writes:
> Andrew Dunstan wrote:
>> pgbfprod=# select sysname, stage, snapshot from build_status where log ~
>> $$read only \d+ of \d+ bytes$$;
>>   sysname   |    stage     |      snapshot
>> ------------+--------------+---------------------
>>  zebra      | InstallCheck | 2007-09-11 10:25:03
>>  wildebeest | InstallCheck | 2007-09-11 22:00:11
>>  baiji      | InstallCheck | 2007-09-12 22:39:24
>>  luna_moth  | InstallCheck | 2007-09-19 13:10:01
>
> hmm, all of those seem to fail the foreign key checks in a very similar
> way, and those are vastly different platforms (Windows, Solaris, OpenBSD
> and Linux).

Is this exhaustive?  That is, are we sure this never happened before
Sept 11th?

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Gregory Stark wrote:
> "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> writes:
>> Andrew Dunstan wrote:
>>> pgbfprod=# select sysname, stage, snapshot from build_status where log ~
>>> $$read only \d+ of \d+ bytes$$;
>>>   sysname   |    stage     |      snapshot
>>> ------------+--------------+---------------------
>>>  zebra      | InstallCheck | 2007-09-11 10:25:03
>>>  wildebeest | InstallCheck | 2007-09-11 22:00:11
>>>  baiji      | InstallCheck | 2007-09-12 22:39:24
>>>  luna_moth  | InstallCheck | 2007-09-19 13:10:01
>>
>> hmm, all of those seem to fail the foreign key checks in a very similar
>> way, and those are vastly different platforms (Windows, Solaris, OpenBSD
>> and Linux).
>
> Is this exhaustive?  That is, are we sure this never happened before
> Sept 11th?

Yes, we have never thrown away any buildfarm history, and we have build
logs going back several years now.  Being able to run queries like this
makes it all worthwhile :-)

(Thanks Joshua for the disk space - I know it annoys you.)

cheers

andrew
Looking back, by far the largest change in the period Sep 1 - Sep 11 was
the lazy xid calculation and read-only transactions.  That seems like
the most likely culprit.

But given Tom's comments this commit stands out too:

From: "Alvaro Herrera" <alvherre@postgresql.org>

Log Message:
-----------
Release the exclusive lock on the table early after truncating it in lazy
vacuum, instead of waiting till commit.

Modified Files:
--------------
pgsql/src/backend/commands:
    vacuumlazy.c (r1.92 -> r1.93)
    (http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/commands/vacuumlazy.c?r1=1.92&r2=1.93)

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
> But given Tom's comments this commit stands out too:

> From: "Alvaro Herrera" <alvherre@postgresql.org>
> Log Message:
> -----------
> Release the exclusive lock on the table early after truncating it in lazy
> vacuum, instead of waiting till commit.

I had thought about that one and not seen a problem with it --- but
sometimes when the light goes on, it's just blinding :-(.  This change
is undoubtedly what's breaking it.

The failures in question are coming from commands that try to insert new
entries into various system tables.  Now normally, the first place a
backend will try to insert a brand-new tuple in a table is the
rd_targblock block that is remembered in relcache as being where we last
successfully inserted.  The failures must be happening because
autovacuum has just truncated away where rd_targblock points.

There is a mechanism to reset everyone's rd_targblock after a
truncation: it's done by broadcasting a shared-invalidation relcache
inval message for that relation.  Which happens at commit, before
releasing locks, which is the correct time for the typical application
of this mechanism, namely to make sure people see system-catalog updates
on time.  Releasing the exclusive lock early allows backends to try to
access the relation again before they've heard about the truncation.

There might be another way to manage this, but we're not inventing
a new invalidation mechanism for 8.3.  This patch will have to be
reverted for the time being :-(

			regards, tom lane
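[Editor's note: the ordering Tom describes, with the relcache inval broadcast at commit and locks normally released only after that, can be sketched with a toy timeline. Again this is illustrative Python only, not actual PostgreSQL mechanics; the event strings are made up for the sketch.]

```python
def stale_cache_window(release_lock_early):
    """Return True if another backend can use its stale cached target
    block before the relcache inval message goes out at commit."""
    events = ["vacuum: truncate relation"]
    if release_lock_early:
        # the r1.93 vacuumlazy.c behavior: drop the lock right after truncating
        events += [
            "vacuum: release exclusive lock",
            "backend: insert via stale rd_targblock",  # races into the window
            "vacuum: commit, send relcache inval",
        ]
    else:
        # pre-patch behavior: inval is broadcast at commit, before lock release
        events += [
            "vacuum: commit, send relcache inval",
            "vacuum: release exclusive lock",
            "backend: insert via stale rd_targblock",  # cache already reset
        ]
    inval = events.index("vacuum: commit, send relcache inval")
    insert = events.index("backend: insert via stale rd_targblock")
    return insert < inval   # stale access happens before the inval arrives
```

With the early release the stale access can precede the inval (the transient "read only 0 of 8192 bytes" failure); with commit-then-release, any backend that waited on the lock has already processed the inval by the time it can touch the table.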
Tom Lane wrote:
> There might be another way to manage this, but we're not inventing
> a new invalidation mechanism for 8.3.  This patch will have to be
> reverted for the time being :-(

Thanks.  Seems it was a good judgement call to apply it only to HEAD,
after all.

In any case, at that point we are mostly done with the expensive steps
of vacuuming, so the transaction finishes not long after this.  I don't
think this issue is worth inventing a new invalidation mechanism.

-- 
Alvaro Herrera                 http://www.amazon.com/gp/registry/5ZYLFMCVHXC
"La victoria es para quien se atreve a estar solo"
Alvaro Herrera <alvherre@commandprompt.com> writes:
> In any case, at that point we are mostly done with the expensive steps
> of vacuuming, so the transaction finishes not long after this.  I don't
> think this issue is worth inventing a new invalidation mechanism.

Yeah, I agree --- there are only a few catalog updates left to do after
we truncate.  If we held the main-table exclusive lock while vacuuming
the TOAST table, we'd have a problem, but it looks to me like we don't.

Idle thought here: did anything get done with the idea of decoupling
main-table vacuum decisions from toast-table vacuum decisions?  vacuum.c
comments

 * Get a session-level lock too.  This will protect our access to the
 * relation across multiple transactions, so that we can vacuum the
 * relation's TOAST table (if any) secure in the knowledge that no one
 * is deleting the parent relation.

and it suddenly occurs to me that we'd need some other way to deal with
that scenario if autovac tries to vacuum toast tables independently.

Also, did you see the thread complaining that autovacuums block CREATE
INDEX?  This seems true given the current locking definitions, and it's
a bit annoying.  Is it worth inventing a new table lock type just for
vacuum?

			regards, tom lane
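[Editor's note: the CREATE INDEX complaint follows from the table-level lock conflict table (see the manual's "Explicit Locking" chapter). Plain VACUUM takes ShareUpdateExclusiveLock and CREATE INDEX takes ShareLock, and those two modes conflict. A small sketch of just the relevant slice of that matrix, as a Python lookup:]

```python
# Abridged slice of PostgreSQL's table-level lock conflict matrix.
# Keys are the mode a backend holds; values are the requested modes
# that conflict with it.  Only the two modes at issue are spelled out.
CONFLICTS = {
    "ShareUpdateExclusive": {"ShareUpdateExclusive", "Share",
                             "ShareRowExclusive", "Exclusive",
                             "AccessExclusive"},
    "Share": {"RowExclusive", "ShareUpdateExclusive",
              "ShareRowExclusive", "Exclusive", "AccessExclusive"},
}

def conflicts(held, requested):
    """True if a request for `requested` must wait behind a holder of `held`."""
    return requested in CONFLICTS.get(held, set())

# VACUUM holds ShareUpdateExclusiveLock; CREATE INDEX wants ShareLock.
vacuum_blocks_create_index = conflicts("ShareUpdateExclusive", "Share")
```

Note that Share does not conflict with Share itself, so two CREATE INDEX commands can run concurrently on one table; it is specifically the ShareUpdateExclusive/Share conflict that makes an in-progress autovacuum block CREATE INDEX, which is what motivates Tom's question about a new lock type.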
Tom Lane wrote:
> Idle thought here: did anything get done with the idea of decoupling
> main-table vacuum decisions from toast-table vacuum decisions?  vacuum.c
> comments
>
>  * Get a session-level lock too.  This will protect our access to the
>  * relation across multiple transactions, so that we can vacuum the
>  * relation's TOAST table (if any) secure in the knowledge that no one
>  * is deleting the parent relation.
>
> and it suddenly occurs to me that we'd need some other way to deal with
> that scenario if autovac tries to vacuum toast tables independently.

Hmm, right.  We didn't change this in 8.3, but it looks like somebody
will need to have a great idea before long.

Of course, the easy answer is to grab a session-level lock for the main
table while vacuuming the toast table, but it doesn't seem very
friendly.

> Also, did you see the thread complaining that autovacuums block CREATE
> INDEX?  This seems true given the current locking definitions, and it's
> a bit annoying.  Is it worth inventing a new table lock type just for
> vacuum?

Hmm.  I think Jim is right in that what we need is to make some forms of
ALTER TABLE take a lighter lock, one that doesn't conflict with analyze.
Guillaume's complaint is about restore times, which can only be affected
by analyze, not vacuum.

-- 
Alvaro Herrera                               http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
"Alvaro Herrera" <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> Idle thought here: did anything get done with the idea of decoupling
>> main-table vacuum decisions from toast-table vacuum decisions?  vacuum.c
>> comments
>>
>>  * Get a session-level lock too.  This will protect our access to the
>>  * relation across multiple transactions, so that we can vacuum the
>>  * relation's TOAST table (if any) secure in the knowledge that no one
>>  * is deleting the parent relation.
>>
>> and it suddenly occurs to me that we'd need some other way to deal with
>> that scenario if autovac tries to vacuum toast tables independently.
>
> Hmm, right.  We didn't change this in 8.3, but it looks like somebody
> will need to have a great idea before long.
>
> Of course, the easy answer is to grab a session-level lock for the main
> table while vacuuming the toast table, but it doesn't seem very
> friendly.

Just a normal lock would do, no?  At least for normal (non-full) vacuum.

I'm not clear why this has to be dealt with at all though.  What happens
if we don't do anything?  Doesn't it just mean a user trying to drop the
table will block until the vacuum is done?  Or does dropping not take a
lock on the toast table?

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com