Discussion: curious regression failures (was Re: [PATCHES] PL/TCL Patch to prevent postgres from becoming multithreaded)

Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>> ! ERROR:  could not read block 2 of relation 1663/16384/2606: read only 0 of 8192 bytes
>> 
>> Is that repeatable?  What sort of filesystem are you testing on?
>> (soft-mounted NFS by any chance?)

> doesn't seem to be repeatable :-(

Hmm ... 
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=luna_moth&dt=2007-09-19%2013:10:01

Exact same error --- is it at the same place in the tests where you saw it?

Now that I think about it, there have been similar transient failures
("read only 0 of 8192 bytes") in the buildfarm before.  It would be
helpful to collect a list of exactly which build reports contain
that string, but AFAIK there's no very easy way to do that; Andrew,
any suggestions?
        regards, tom lane


Re: curious regression failures (was Re: [PATCHES] PL/TCL Patch to prevent postgres from becoming multithreaded)

From:
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> Tom Lane wrote:
>>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>>> ! ERROR:  could not read block 2 of relation 1663/16384/2606: read only 0 of 8192 bytes
>>> Is that repeatable?  What sort of filesystem are you testing on?
>>> (soft-mounted NFS by any chance?)
> 
>> doesn't seem to be repeatable :-(
> 
> Hmm ... 
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=luna_moth&dt=2007-09-19%2013:10:01
> 
> Exact same error --- is it at the same place in the tests where you saw it?

looks like it is in a similar place:

http://www.kaltenbrunner.cc/files/regression.diffs (I don't have more 
than this on that failure any more)


Stefan



Re: curious regression failures (was Re: [PATCHES] PL/TCL Patch to prevent postgres from becoming multithreaded)

From:
Andrew Dunstan
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> Tom Lane wrote:
>>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>>> ! ERROR:  could not read block 2 of relation 1663/16384/2606: read only 0 of 8192 bytes
>>> Is that repeatable?  What sort of filesystem are you testing on?
>>> (soft-mounted NFS by any chance?)
>> doesn't seem to be repeatable :-(
>
> Hmm ...
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=luna_moth&dt=2007-09-19%2013:10:01
>
> Exact same error --- is it at the same place in the tests where you saw it?
>
> Now that I think about it, there have been similar transient failures
> ("read only 0 of 8192 bytes") in the buildfarm before.  It would be
> helpful to collect a list of exactly which build reports contain
> that string, but AFAIK there's no very easy way to do that; Andrew,
> any suggestions?

pgbfprod=# select sysname, stage, snapshot from build_status where log ~
$$read only \d+ of \d+ bytes$$;
  sysname   |    stage     |      snapshot
 -----------+--------------+---------------------
 zebra      | InstallCheck | 2007-09-11 10:25:03
 wildebeest | InstallCheck | 2007-09-11 22:00:11
 baiji      | InstallCheck | 2007-09-12 22:39:24
 luna_moth  | InstallCheck | 2007-09-19 13:10:01
(4 rows)


cheers

andrew



Re: curious regression failures (was Re: [PATCHES] PL/TCL Patch to prevent postgres from becoming multithreaded)

From:
Stefan Kaltenbrunner
Date:
Andrew Dunstan wrote:
> Tom Lane wrote:
>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>> Tom Lane wrote:
>>>> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>>>> ! ERROR:  could not read block 2 of relation 1663/16384/2606: read
>>>>> only 0 of 8192 bytes
>>>> Is that repeatable?  What sort of filesystem are you testing on?
>>>> (soft-mounted NFS by any chance?)
>>> doesn't seem to be repeatable :-(
>>
>> Hmm ...
>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=luna_moth&dt=2007-09-19%2013:10:01
>>
>> Exact same error --- is it at the same place in the tests where you
>> saw it?
>>
>> Now that I think about it, there have been similar transient failures
>> ("read only 0 of 8192 bytes") in the buildfarm before.  It would be
>> helpful to collect a list of exactly which build reports contain
>> that string, but AFAIK there's no very easy way to do that; Andrew,
>> any suggestions?
> 
> pgbfprod=# select sysname, stage, snapshot from build_status where log ~
> $$read only \d+ of \d+ bytes$$;
>  sysname   |    stage     |      snapshot     
> -----------+--------------+---------------------
> zebra      | InstallCheck | 2007-09-11 10:25:03
> wildebeest | InstallCheck | 2007-09-11 22:00:11
> baiji      | InstallCheck | 2007-09-12 22:39:24
> luna_moth  | InstallCheck | 2007-09-19 13:10:01

hmm, all of those seem to fail the foreign key checks in a very similar
way, and those are vastly different platforms (Windows, Solaris, OpenBSD and
Linux).


Stefan


Andrew Dunstan <andrew@dunslane.net> writes:
> pgbfprod=# select sysname, stage, snapshot from build_status where log ~ 
> $$read only \d+ of \d+ bytes$$;
>   sysname   |    stage     |      snapshot      
>  -----------+--------------+---------------------
>  zebra      | InstallCheck | 2007-09-11 10:25:03
>  wildebeest | InstallCheck | 2007-09-11 22:00:11
>  baiji      | InstallCheck | 2007-09-12 22:39:24
>  luna_moth  | InstallCheck | 2007-09-19 13:10:01
> (4 rows)

Fascinating.  So I would venture that (1) it's definitely our bug,
not something we could blame on NFS or whatever, and (2) we introduced
it fairly recently.  That specific error message wording exists only
in HEAD, but it's been there since 2007-01-03, so if there were a
pre-existing problem you'd think there would be some more matches.

The patterns I notice here are (1) they're all InstallCheck not Check
failures; (2) though not all at the same place in the tests, it's
a fairly short range; (3) it's all references to system catalogs,
though not all the same one.

My gut feeling is that we're seeing autovacuum truncate off an empty end
block and then a backend tries to reference that block again.  But there
should be enough interlocks in place to prevent such references.  Any
ideas out there?
        regards, tom lane


Re: curious regression failures

From:
Gregory Stark
Date:
"Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> writes:

> Andrew Dunstan wrote:
>
>> pgbfprod=# select sysname, stage, snapshot from build_status where log ~
>> $$read only \d+ of \d+ bytes$$;
>>  sysname   |    stage     |      snapshot     
>> -----------+--------------+---------------------
>> zebra      | InstallCheck | 2007-09-11 10:25:03
>> wildebeest | InstallCheck | 2007-09-11 22:00:11
>> baiji      | InstallCheck | 2007-09-12 22:39:24
>> luna_moth  | InstallCheck | 2007-09-19 13:10:01
>
> hmm, all of those seem to fail the foreign key checks in a very similar
> way, and those are vastly different platforms (Windows, Solaris, OpenBSD and
> Linux).

Is this exhaustive? That is, are we sure this never happened before Sept 11th?


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: curious regression failures

From:
Andrew Dunstan
Date:

Gregory Stark wrote:
> "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> writes:
>
>> Andrew Dunstan wrote:
>>
>>> pgbfprod=# select sysname, stage, snapshot from build_status where log ~
>>> $$read only \d+ of \d+ bytes$$;
>>>  sysname   |    stage     |      snapshot
>>> -----------+--------------+---------------------
>>> zebra      | InstallCheck | 2007-09-11 10:25:03
>>> wildebeest | InstallCheck | 2007-09-11 22:00:11
>>> baiji      | InstallCheck | 2007-09-12 22:39:24
>>> luna_moth  | InstallCheck | 2007-09-19 13:10:01
>>
>> hmm, all of those seem to fail the foreign key checks in a very similar
>> way, and those are vastly different platforms (Windows, Solaris, OpenBSD and
>> Linux).
>
> Is this exhaustive? That is, are we sure this never happened before Sept 11th?

Yes, we have never thrown away any buildfarm history, and we have build
logs going back several years now. Being able to run queries like this
makes it all worthwhile :-) (Thanks Joshua for the disk space - I know
it annoys you.)

cheers

andrew


Re: curious regression failures

From:
Gregory Stark
Date:

Looking back, by far the largest change in the period Sep 1 - Sep 11 was the
lazy xid calculation and read-only transactions. That seems like the most
likely culprit.

But given Tom's comments this commit stands out too:


Log Message:
-----------
Release the exclusive lock on the table early after truncating it in lazy
vacuum, instead of waiting till commit.

Modified Files:
--------------
    pgsql/src/backend/commands:
        vacuumlazy.c (r1.92 -> r1.93)
        (http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/commands/vacuumlazy.c?r1=1.92&r2=1.93)



--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: curious regression failures

From:
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> But given Tom's comments this commit stands out too:

> From: "Alvaro Herrera" <alvherre@postgresql.org>
> Log Message:
> -----------
> Release the exclusive lock on the table early after truncating it in lazy
> vacuum, instead of waiting till commit.

I had thought about that one and not seen a problem with it --- but
sometimes when the light goes on, it's just blinding :-(.  This change
is undoubtedly what's breaking it.  The failures in question are coming
from commands that try to insert new entries into various system tables.
Now normally, the first place a backend will try to insert a brand-new
tuple in a table is the rd_targblock block that is remembered in
relcache as being where we last successfully inserted.  The failures
must be happening because autovacuum has just truncated away where
rd_targblock points.  There is a mechanism to reset everyone's
rd_targblock after a truncation: it's done by broadcasting a
shared-invalidation relcache inval message for that relation.  Which
happens at commit, before releasing locks, which is the correct time for
the typical application of this mechanism, namely to make sure people
see system-catalog updates on time.  Releasing the exclusive lock early
allows backends to try to access the relation again before they've heard
about the truncation.

There might be another way to manage this, but we're not inventing
a new invalidation mechanism for 8.3.  This patch will have to be
reverted for the time being :-(
        regards, tom lane
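The race Tom describes above can be sketched as a toy simulation. This is illustrative Python only, not PostgreSQL source: the `Relation`/`Backend` classes and their methods are invented for the sketch, and only the field name `rd_targblock` and the inval-at-commit timing are taken from the discussion.

```python
# Toy model of the race: lazy vacuum truncates the relation and releases
# its exclusive lock *before* the relcache-invalidation message (sent only
# at commit) resets other backends' cached rd_targblock.

class Relation:
    def __init__(self, nblocks):
        self.nblocks = nblocks          # current physical length in blocks


class Backend:
    """Caches a preferred insertion block, like relcache's rd_targblock."""

    def __init__(self, rel):
        self.rel = rel
        self.rd_targblock = None        # no cached insertion target yet

    def insert_tuple(self):
        # Aim at the cached block first, else at the current last block.
        blk = (self.rd_targblock if self.rd_targblock is not None
               else self.rel.nblocks - 1)
        if blk >= self.rel.nblocks:
            # The real-world symptom: "could not read block N of relation
            # ...: read only 0 of 8192 bytes"
            raise IOError("could not read block %d" % blk)
        self.rd_targblock = blk         # remember where we last inserted

    def accept_inval(self):
        self.rd_targblock = None        # relcache inval clears the cache


rel = Relation(nblocks=3)
backend = Backend(rel)
backend.insert_tuple()                  # caches rd_targblock = 2

rel.nblocks = 2                         # vacuum truncates the empty tail...
                                        # ...and releases its lock early,
                                        # before commit broadcasts the inval
try:
    backend.insert_tuple()              # still aims at block 2 -> fails
except IOError as err:
    print(err)                          # prints: could not read block 2

backend.accept_inval()                  # inval finally arrives at commit
backend.insert_tuple()                  # retargets block 1, succeeds
```

Holding the lock until commit closes the window: no other backend can touch the relation between the truncation and the inval that clears its stale cached block.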


Re: curious regression failures

From:
Alvaro Herrera
Date:
Tom Lane wrote:

> There might be another way to manage this, but we're not inventing
> a new invalidation mechanism for 8.3.  This patch will have to be
> reverted for the time being :-(

Thanks.  Seems it was a good judgement call to apply it only to HEAD,
after all.

In any case, at that point we are mostly done with the expensive steps
of vacuuming, so the transaction finishes not long after this.  I don't
think this issue is worth inventing a new invalidation mechanism.

-- 
Alvaro Herrera                  http://www.amazon.com/gp/registry/5ZYLFMCVHXC
"La victoria es para quien se atreve a estar solo"


Re: curious regression failures

From:
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> In any case, at that point we are mostly done with the expensive steps
> of vacuuming, so the transaction finishes not long after this.  I don't
> think this issue is worth inventing a new invalidation mechanism.

Yeah, I agree --- there are only a few catalog updates left to do after
we truncate.  If we held the main-table exclusive lock while vacuuming
the TOAST table, we'd have a problem, but it looks to me like we don't.

Idle thought here: did anything get done with the idea of decoupling
main-table vacuum decisions from toast-table vacuum decisions?  vacuum.c
comments
     * Get a session-level lock too. This will protect our access to the
     * relation across multiple transactions, so that we can vacuum the
     * relation's TOAST table (if any) secure in the knowledge that no one is
     * deleting the parent relation.

and it suddenly occurs to me that we'd need some other way to deal with
that scenario if autovac tries to vacuum toast tables independently.

Also, did you see the thread complaining that autovacuums block CREATE
INDEX?  This seems true given the current locking definitions, and it's
a bit annoying.  Is it worth inventing a new table lock type just for
vacuum?
        regards, tom lane


Re: curious regression failures

From:
Alvaro Herrera
Date:
Tom Lane wrote:

> Idle thought here: did anything get done with the idea of decoupling
> main-table vacuum decisions from toast-table vacuum decisions?  vacuum.c
> comments
> 
>      * Get a session-level lock too. This will protect our access to the
>      * relation across multiple transactions, so that we can vacuum the
>      * relation's TOAST table (if any) secure in the knowledge that no one is
>      * deleting the parent relation.
> 
> and it suddenly occurs to me that we'd need some other way to deal with
> that scenario if autovac tries to vacuum toast tables independently.

Hmm, right.  We didn't change this in 8.3 but it looks like somebody
will need to have a great idea before long.

Of course, the easy answer is to grab a session-level lock for the main
table while vacuuming the toast table, but it doesn't seem very
friendly.

> Also, did you see the thread complaining that autovacuums block CREATE
> INDEX?  This seems true given the current locking definitions, and it's
> a bit annoying.  Is it worth inventing a new table lock type just for
> vacuum?

Hmm.  I think Jim is right in that what we need is to make some forms of
ALTER TABLE take a lighter lock, one that doesn't conflict with analyze.
Guillaume's complaints are about restore times, which can only be
affected by analyze, not vacuum.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: curious regression failures

From:
Gregory Stark
Date:
"Alvaro Herrera" <alvherre@commandprompt.com> writes:

> Tom Lane wrote:
>
>> Idle thought here: did anything get done with the idea of decoupling
>> main-table vacuum decisions from toast-table vacuum decisions?  vacuum.c
>> comments
>> 
>>      * Get a session-level lock too. This will protect our access to the
>>      * relation across multiple transactions, so that we can vacuum the
>>      * relation's TOAST table (if any) secure in the knowledge that no one is
>>      * deleting the parent relation.
>> 
>> and it suddenly occurs to me that we'd need some other way to deal with
>> that scenario if autovac tries to vacuum toast tables independently.
>
> Hmm, right.  We didn't change this in 8.3 but it looks like somebody
> will need to have a great idea before long.
>
> Of course, the easy answer is to grab a session-level lock for the main
> table while vacuuming the toast table, but it doesn't seem very
> friendly.

Just a normal lock would do, no? At least for normal (non-full) vacuum.

I'm not clear why this has to be dealt with at all though. What happens if we
don't do anything? Doesn't it just mean a user trying to drop the table will
block until the vacuum is done? Or does dropping not take a lock on the toast
table?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com