Discussion: Another nasty cache problem

Another nasty cache problem

From: Tom Lane
I'm down to the point where the parallel tests mostly work with a small
SI buffer --- but they do still sometimes fail.  I've realized that
there is a whole class of bugs along the following lines:

There are plenty of routines that do two or more SearchSysCacheTuple
calls to get the information they need.  As the code stands, it is
unsafe to continue accessing the tuple returned by SearchSysCacheTuple
after making a second such call, because the second call could possibly
cause an SI cache reset message to be processed, thereby flushing the
contents of the caches.

heap_open and CommandCounterIncrement are other routines that could
cause cache entries to be dropped.
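
To make the hazard concrete, a minimal sketch (the cache calls are real,
but the variables around them are only illustrative):

    HeapTuple   tup1,
                tup2;
    char        relkind;

    /* UNSAFE: the second lookup can trigger an SI cache reset that pfree's tup1 */
    tup1 = SearchSysCacheTuple(RELNAME, PointerGetDatum(rel1name), 0, 0, 0);
    tup2 = SearchSysCacheTuple(RELNAME, PointerGetDatum(rel2name), 0, 0, 0);
    /* tup1 may now point at pfree'd (and possibly reused) memory */
    relkind = ((Form_pg_class) GETSTRUCT(tup1))->relkind;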

This is a very insidious kind of bug because the probability of
occurrence is very low (at normal SI buffer size a reset is unlikely,
and even if it happens, you won't observe a failure unless the
pfree'd tuple is actually overwritten before you're done with it).
So we cannot hope to catch these things by testing.

I am not sure what to do about it.  One solution path is to make
all the potential trouble spots do SearchSysCacheTupleCopy and then
pfree the copied tuple when done.  However, that adds a nontrivial
amount of overhead, and it'd be awfully easy to miss some trouble
spots or to introduce new ones in the future.

Another possibility is to introduce some sort of notion of a reference
count, and to make the standard usage pattern be
    tuple = SearchSysCacheTuple(...);
    ... use tuple ...
    ReleaseSysCacheTuple(tuple);
The idea here is that a tuple with positive refcount would not be
deleted during a cache reset, but would simply be removed from its
cache, and then finally deleted when released (or during elog
recovery).

This might allow us to get rid of SearchSysCacheTupleCopy, too,
since the refcount should be just as good as palloc'ing one's own
copy for most purposes.
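
On the cache side, the idea might look something like this (the refcount
and dead fields do not exist today; the names here are hypothetical):

    typedef struct CatCTup
    {
        HeapTuple   ct_tup;     /* the cached tuple itself */
        int         refcount;   /* bumped by Search, dropped by Release */
        bool        dead;       /* unhooked from its cache but still referenced */
    } CatCTup;

    /* during an SI cache reset: */
    if (ct->refcount > 0)
        ct->dead = true;        /* a caller still holds a pointer; free at Release time */
    else
        CatCacheRemoveCTup(cache, ct);  /* nobody is using it; free it now */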

I haven't looked at the callers of SearchSysCacheTuple to see whether
this would be a practical change to make.  I was wondering if anyone
had any comments or better ideas...
        regards, tom lane


Re: [HACKERS] Another nasty cache problem

From: Bruce Momjian
> I'm down to the point where the parallel tests mostly work with a small
> SI buffer --- but they do still sometimes fail.  I've realized that
> there is a whole class of bugs along the following lines:
> 
> There are plenty of routines that do two or more SearchSysCacheTuple
> calls to get the information they need.  As the code stands, it is
> unsafe to continue accessing the tuple returned by SearchSysCacheTuple
> after making a second such call, because the second call could possibly
> cause an SI cache reset message to be processed, thereby flushing the
> contents of the caches.

Yes, I have always been aware of this problem.  The issue is that since
cache entries are removed on an oldest-removed-first basis, I never
thought that several cache lookups would be a problem.  If you do many
cache lookups and expect very old ones to still exist, that could be a
problem.

However, a full reset of the cache could cause major problems.  Is there
a way to re-load the cache after the reset with the most recently cached
entries?  Seems that would be easier.  However, your issue is probably
that the new cache entries would have different locations from the old
entries.  Is it possible to delay the cache reset of the five most
recent cache entries, and do them later?  I don't see many places where
more than 2-3 cache entries are kept.  Maybe we need to keep them around
somehow during cache reset.

> I am not sure what to do about it.  One solution path is to make
> all the potential trouble spots do SearchSysCacheTupleCopy and then
> pfree the copied tuple when done.  However, that adds a nontrivial
> amount of overhead, and it'd be awfully easy to miss some trouble
> spots or to introduce new ones in the future.

Sounds like a lot of overhead to do the copy.

> 
> Another possibility is to introduce some sort of notion of a reference
> count, and to make the standard usage pattern be
>     tuple = SearchSysCacheTuple(...);
>     ... use tuple ...
>     ReleaseSysCacheTuple(tuple);
> The idea here is that a tuple with positive refcount would not be
> deleted during a cache reset, but would simply be removed from its
> cache, and then finally deleted when released (or during elog
> recovery).

If you can do that, can't you just keep the few most recent ones by
default?  Seems that would be very clean.

> This might allow us to get rid of SearchSysCacheTupleCopy, too,
> since the refcount should be just as good as palloc'ing one's own
> copy for most purposes.

Yes, that would be nice.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Another nasty cache problem

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Yes, I have always been aware of this problem.  The issue is that since
> cache entries are removed on an oldest-removed-first basis, I never
> thought that several cache lookups would be a problem.

They're not, under normal circumstances...

> However, a full reset of the cache could cause major problems.  Is there
> a way to re-load the cache after the reset with the most recently cached
> entries?  Seems that would be easier.  However, your issue is probably
> that the new cache entries would have different locations from the old
> entries.  Is it possible to delay the cache reset of the five most
> recent cache entries, and do them later?

I don't think that's a good answer; what if one of those entries is the
one that the SI messages wanted us to update?  With a scheme like that,
you might be protecting a cache entry that actually isn't being used
anymore.  With a refcount you'd at least know whether it was safe to
throw the entry away.

Of course this just begs the question of what to do when an SI update
message arrives for a tuple that is locked down by refcount.  Maybe
we have to kick out an elog(ERROR) then.  Could be messy.
        regards, tom lane


Re: [HACKERS] Another nasty cache problem

From: Bruce Momjian
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Yes, I have always been aware of this problem.  The issue is that since
> > cache entries are removed on an oldest-removed-first basis, I never
> > thought that several cache lookups would be a problem.
> 
> They're not, under normal circumstances...
> 
> > However, a full reset of the cache could cause major problems.  Is there
> > a way to re-load the cache after the reset with the most recently cached
> > entries?  Seems that would be easier.  However, your issue is probably
> > that the new cache entries would have different locations from the old
> > entries.  Is it possible to delay the cache reset of the five most
> > recent cache entries, and do them later?
> 
> I don't think that's a good answer; what if one of those entries is the
> one that the SI messages wanted us to update?  With a scheme like that,
> you might be protecting a cache entry that actually isn't being used
> anymore.  With a refcount you'd at least know whether it was safe to
> throw the entry away.
> 
> Of course this just begs the question of what to do when an SI update
> message arrives for a tuple that is locked down by refcount.  Maybe
> we have to kick out an elog(ERROR) then.  Could be messy.

Yep, that was my question.  You can re-load it, but if it is the one
that just got invalidated, what do you reload?  My guess is that you
keep using the same cache entry until the current transaction finishes,
at which point you can throw it away.

Now, if we did proper locking, no SI message could arrive for such an
entry.

My assumption is that these are mostly system cache entries, and they
rarely change, right?  If someone is operating on a table that gets an
SI entry, odds are that later on the system will fail because the table
is changed in some way, right?

Actually, don't we have a transaction id for the transaction that loaded
that cache entry?  We can add a transaction id to the cache record that
shows the transaction that last accessed that cache entry.  Then we can
say if any SI message comes in for a cache entry that was accessed by
the current transaction, we throw an elog.
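
A rough sketch of that idea (the field and the check are hypothetical,
not existing catcache code):

    /* remember, per cache entry, which transaction touched it last */
    ct->ct_lastuse_xid = GetCurrentTransactionId();

    /* later, when an SI invalidation arrives for that entry: */
    if (TransactionIdEquals(ct->ct_lastuse_xid, GetCurrentTransactionId()))
        elog(ERROR, "syscache entry used by this transaction was invalidated");
    else
        CatCacheRemoveCTup(cache, ct);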

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Another nasty cache problem

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Now, if we did proper locking, no SI message could arrive for such an
> entry.

> My assumption is that these are mostly system cache entries, and they
> rarely change, right?  If someone is operating on a table that gets an
> SI entry, odds are that later on the system will fail because the table
> is changed in some way, right?

If the tuple is actually *changed* then that's true (and locking should
have prevented it anyway).  But we also issue cache flushes against
whole system tables in order to handle VACUUM of a system table.  There,
the only thing that's actually been modified is the tuple's physical
location (ctid).  We don't want to blow away transactions that are just
looking at cache entries when a VACUUM happens.

Perhaps the caches shouldn't store ctid?  Not sure.
        regards, tom lane


Re: [HACKERS] Another nasty cache problem

From: Bruce Momjian
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Now, if we did proper locking, no SI message could arrive for such an
> > entry.
> 
> > My assumption is that these are mostly system cache entries, and they
> > rarely change, right?  If someone is operating on a table that gets an
> > SI entry, odds are that later on the system will fail because the table
> > is changed in some way, right?
> 
> If the tuple is actually *changed* then that's true (and locking should
> have prevented it anyway).  But we also issue cache flushes against
> whole system tables in order to handle VACUUM of a system table.  There,
> the only thing that's actually been modified is the tuple's physical
> location (ctid).  We don't want to blow away transactions that are just
> looking at cache entries when a VACUUM happens.
> 
> Perhaps the caches shouldn't store ctid?  Not sure.

I am guilty of that.  There are a few places where I grab the tuple from
the cache, then use that to update the heap.  I thought it was a nifty
solution at the time.  I thought I used the CacheCopy calls for that,
but I am not positive.  Even if I did, that doesn't help because the
copy probably has an invalid tid at that point, though I have opened
the table.  Maybe I have to make sure I open the table before getting the
tid from the cache.

Is it only the tid that is of concern?  If so, that can probably be
fixed somehow.


--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Another nasty cache problem

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Perhaps the caches shouldn't store ctid?  Not sure.

> I am guilty of that.  There are a few places where I grab the tuple from
> the cache, then use that to update the heap.  I thought it was a nifty
> solution at the time.  I thought I used the CacheCopy calls for that,
> but I am not positive.  Even if I did, that doesn't help because the
> copy probably has an invalid tid at that point, though I have opened
> the table.  Maybe I have to make sure I open the table before getting the
> tid from the cache.

I believe we worked that out and fixed it a few months ago: it's safe
to use the cache to find a tuple you want to update, if you open and
lock the containing table *before* doing the cache lookup.  Then you
know VACUUM's not running on that table (since you have it locked)
and you have an up-to-date TID for the tuple (since the open+lock
would have processed any pending shared-inval messages).  I went
around and made sure that's true everywhere.

What I was thinking about was adding code to the caches that would
(a) maintain refcounts on cached tuples, (b) reread rather than
discard a tuple if it is invalidated while refcount > 0, and (c)
kick out an error if the reread shows that the tuple has in fact
changed.  It seems that we would need to ignore the TID when deciding
if a tuple has changed, however.
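
Spelled out, the safe update pattern is roughly this (the catalog, field,
and lock level are only illustrative):

    Relation    rel;
    HeapTuple   tup;

    /* lock the catalog first: this keeps VACUUM off the table and forces
     * any pending shared-inval messages to be processed before the lookup */
    rel = heap_openr(RelationRelationName, RowExclusiveLock);

    tup = SearchSysCacheTupleCopy(RELNAME, PointerGetDatum(relname), 0, 0, 0);
    if (!HeapTupleIsValid(tup))
        elog(ERROR, "relation \"%s\" not found", relname);

    /* the copy's t_self (TID) can now be trusted */
    ((Form_pg_class) GETSTRUCT(tup))->relhasindex = true;
    heap_update(rel, &tup->t_self, tup, NULL);

    pfree(tup);
    heap_close(rel, RowExclusiveLock);
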
        regards, tom lane


Re: [HACKERS] Another nasty cache problem

From: Bruce Momjian
> I believe we worked that out and fixed it a few months ago: it's safe
> to use the cache to find a tuple you want to update, if you open and
> lock the containing table *before* doing the cache lookup.  Then you
> know VACUUM's not running on that table (since you have it locked)
> and you have an up-to-date TID for the tuple (since the open+lock
> would have processed any pending shared-inval messages).  I went
> around and made sure that's true everywhere.

Good.

> What I was thinking about was adding code to the caches that would
> (a) maintain refcounts on cached tuples, (b) reread rather than
> discard a tuple if it is invalidated while refcount > 0, and (c)
> kick out an error if the reread shows that the tuple has in fact
> changed.  It seems that we would need to ignore the TID when deciding
> if a tuple has changed, however.

Yes, that is one solution.  We can do it the same way heap_fetch works:
it takes a Buffer pointer, through which it returns a buffer that the
caller must release with ReleaseBuffer() when done.

However, would just throwing an elog on any cache invalidation of a cache
row looked up in the current transaction/command counter make more
sense?  Sometimes you are using that cached oid in some later actions
that really can't be properly unlocked at the end.  It would be less code.


--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Another nasty cache problem

From: Peter Eisentraut
On Sun, 30 Jan 2000, Tom Lane wrote:

> There are plenty of routines that do two or more SearchSysCacheTuple
> calls to get the information they need.  As the code stands, it is
> unsafe to continue accessing the tuple returned by SearchSysCacheTuple
> after making a second such call, because the second call could possibly
> cause an SI cache reset message to be processed, thereby flushing the
> contents of the caches.
> 
> heap_open and CommandCounterIncrement are other routines that could
> cause cache entries to be dropped.

This sort of thing should be documented, at least in the comment on top of
the function.  From the developer's FAQ I gathered only something like
"these tuples can be used for a short while", which is of course very exact.

Anyway, I just counted 254 uses of SearchSysCacheTuple in the backend tree
and a majority of these are probably obviously innocent. Since I don't
have any more development planned, I would volunteer to take a look at all
of those and look for violations involving a second cache lookup, heap_open,
or CommandCounterIncrement, fixing them where possible, or at least pointing
them out to more experienced people. That might save you from going out of
your way and instituting some reference count or whatever, and it would be
an opportunity for me to read some code.


-- 
Peter Eisentraut                  Sernanders vaeg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden



Re: [HACKERS] Another nasty cache problem

From: Peter Eisentraut
On Sun, 30 Jan 2000, Bruce Momjian wrote:

> > Perhaps the caches shouldn't store ctid?  Not sure.
> 
> I am guilty of that.  There are a few places where I grab the tuple from
> the cache, then use that to update the heap.  I thought it was a nifty
> solution at the time.  I thought I used the CacheCopy calls for that,
> but I am not positive.  Even if I did, that doesn't help because the
> copy probably has an invalid tid at that point, though I have opened
> the table.  Maybe I have to make sure I open the table before getting the
> tid from the cache.

Urgh, I better check my code for that as well ... :(

> 
> Is it only the tid that is of concern?  If so, that can probably be
> fixed somehow.
> 
> 
> 

-- 
Peter Eisentraut                  Sernanders vaeg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden



Re: [HACKERS] Another nasty cache problem

From: Tom Lane
Peter Eisentraut <e99re41@DoCS.UU.SE> writes:
> This sort of thing should be documented,

... or changed ...

> Anyway, I just counted 254 uses of SearchSysCacheTuple in the backend tree
> and a majority of these are probably obviously innocent. Since I don't
> have any more development planned, I would volunteer to take a look at all
> of those and look for violations involving a second cache lookup, heap_open,
> or CommandCounterIncrement, fixing them where possible, or at least pointing
> them out to more experienced people. That might save you from going out of
> your way and instituting some reference count or whatever, and it would be
> an opportunity for me to read some code.

I appreciate the offer, but I don't really want to fix it that way.
If that's how things have to work, then the code will be *extremely*
fragile --- any routine that opens a relation or looks up a cache tuple
will potentially break its callers as well as itself.  And since the
probability of failure is so low, we'll never find it; we'll just keep
getting the occasional irreproducible failure report from the field.
I think we need a designed-in solution rather than a restrictive coding
rule.

Also, I am not sure that the existing uses are readily fixable.  For
example, I saw a number of crashes in the parser last night, most of
which traced to uses of Operator or Type pointers --- which are really
SearchSysCacheTuple results, but the parser passes them around with wild
abandon.  I don't see any easy way of restructuring that code to avoid
this.

I am starting to think that Bruce's idea might be the way to go: lock
down any cache entry that's been referenced since the last transaction
start or CommandCounterIncrement, and elog() if it's changed by
invalidation.  Then the only coding rule needed is "cached tuples don't
stay valid across CommandCounterIncrement", which is relatively
simple to check for.
        regards, tom lane


Re: [HACKERS] Another nasty cache problem

From: Patrick Welche
Tom Lane wrote:
> I'm down to the point where the parallel tests mostly work with a small
> SI buffer --- but they do still sometimes fail.

Have you committed your changes? I tried the parallel tests with cvs of around
5pm GMT 31 Jan, and they were all fine (I just ran out of procs at one point).
This is much better than last week! Thanks! I also tried that nonsensical
join from the other day, and it failed in the same way again:

newnham=# select * from crsids,"tblPerson" where
newnham-# crsids.crsid != "tblPerson"."CRSID";
Backend sent B message without prior T
D21Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.

After \. :

Unknown protocol character 'M' read from backend.  (The protocol character
is the first character the backend sends in response to a query it
receives).
 
PQendcopy: resetting connection
Asynchronous NOTIFY 'ndropoulou' from backend with pid '1818589281' received.
Asynchronous NOTIFY 'ndropoulou' from backend with pid '1818589281' received.


pq_flush: send() failed: Broken pipe
FATAL: pq_endmessage failed: errno=32

but no NOTICEs about SI anywhere any more, in fact no messages at all until
the "Unknown protocol character" bit above. The psql frontend process grows to
about 120Mb in size before this if that matters (200Mb swap free).

(Looking at why pg_dumpall creates unique indices for each different type
of index at the moment...)

Cheers,

Patrick


Re: [HACKERS] Another nasty cache problem

From: Bruce Momjian
> I am starting to think that Bruce's idea might be the way to go: lock
> down any cache entry that's been referenced since the last transaction
> start or CommandCounterIncrement, and elog() if it's changed by
> invalidation.  Then the only coding rule needed is "cached tuples don't
> stay valid across CommandCounterIncrement", which is relatively
> simple to check for.

Yea, I had a good idea ...

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Another nasty cache problem

From: Tom Lane
Patrick Welche <prlw1@newn.cam.ac.uk> writes:
> Tom Lane wrote:
>> I'm down to the point where the parallel tests mostly work with a small
>> SI buffer --- but they do still sometimes fail.

> Have you committed your changes? I tried the parallel tests with cvs
> of around 5pm GMT 31 Jan, and they were all fine (I just ran out of
> procs at one point).  This is much better than last week! Thanks!

Yes, I committed what I had last night (about 04:35 GMT 1/31).

There are cache-flush-related bugs still left to deal with, but they
seem to be far lower in probability than the ones squashed so far.
I'm finding that even with MAXNUMMESSAGES set to 8, the parallel tests
usually pass; so it seems we need some other way of testing to nail down
the remaining problems.

> I also tried that nonsensical join from the other day, and it failed in
> the same way again:
> newnham=# select * from crsids,"tblPerson" where
> newnham-# crsids.crsid != "tblPerson"."CRSID";
> Backend sent B message without prior T

Hmm.  Can you provide a self-contained test case (a script to build the
failing tables, preferably)?
        regards, tom lane


Re: [HACKERS] Another nasty cache problem

From: Patrick Welche
On Mon, Jan 31, 2000 at 09:02:30PM -0500, Tom Lane wrote:
> Patrick Welche <prlw1@newn.cam.ac.uk> writes:
> > Tom Lane wrote:
> 
> There are cache-flush-related bugs still left to deal with, but they
> seem to be far lower in probability than the ones squashed so far.
> I'm finding that even with MAXNUMMESSAGES set to 8, the parallel tests
> usually pass; so it seems we need some other way of testing to nail down
> the remaining problems.
> 
> > I also tried that nonsensical join from the other day, and it failed in
> > the same way again:
> > newnham=# select * from crsids,"tblPerson" where
> > newnham-# crsids.crsid != "tblPerson"."CRSID";
> > Backend sent B message without prior T
> 
> Hmm.  Can you provide a self-contained test case (a script to build the
> failing tables, preferably)?

It seems this is a memory exhaustion thing: I have 128Mb real memory.
Attached below is the C program used to create some random data in
tables test and test2 of database test (which needs to exist). Executing
the non-sensical query
 select * from test,test2 where test.i!=test2.i;

should result in 2600*599=1557400 (ie lots of) rows to be returned.
The process's memory consumption during this select grows to 128Mb, and after
a moment or two:

Backend sent D message without prior T
Backend sent D message without prior T
...

Which isn't quite the same message as before, but is of the same type.
59 processes:  2 running, 57 sleeping
CPU states:  2.3% user, 86.4% nice,  9.3% system,  0.0% interrupt,  1.9% idle
Memory: 74M Act, 37M Inact, 184K Wired, 364K Free, 95M Swap, 262M Swap free
 PID USERNAME PRI NICE   SIZE   RES STATE   TIME   WCPU    CPU COMMAND
1547 prlw1     50    0   128M  516K run     1:28 59.28% 59.28% psql
1552 postgres  50    0  1920K  632K run     1:37 24.32% 24.32% postgres

later, while the "Backend sent..." messages appear

1547 prlw1     -5    0   128M   68M sleep   1:41 23.00% 23.00% psql
1552 postgres   2    0  1920K    4K sleep   1:41 141.00% 6.88% <postgres>
 

Note that there is still plenty of swap space. The 128Mb number seems to be
more than a coincidence (how to prove?)

So, is this only happening to me? How can lack of real memory affect timing
of interprocess communication?

Cheers,

Patrick

==========================================================================

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

#include "libpq-fe.h"

const char *progname;

PGresult *send_query(PGconn *db, const char *query)
{
  PGresult *res;

  res = PQexec(db, query);
  switch (PQresultStatus(res))
    {
    case PGRES_EMPTY_QUERY:
      printf("PGRES_EMPTY_QUERY: %s\n", query); break;
    case PGRES_COMMAND_OK:
      printf("PGRES_COMMAND_OK: %s\n", query); break;
    case PGRES_TUPLES_OK:
      printf("PGRES_TUPLES_OK: %s\n", query); break;
    case PGRES_COPY_OUT:
      printf("PGRES_COPY_OUT: %s\n", query); break;
    case PGRES_COPY_IN:
      printf("PGRES_COPY_IN: %s\n", query); break;
    case PGRES_BAD_RESPONSE:
      printf("PGRES_BAD_RESPONSE: %s\n", query);
      exit(1); break;
    case PGRES_NONFATAL_ERROR:
      printf("PGRES_NONFATAL_ERROR: %s\n", query); break;
    case PGRES_FATAL_ERROR:
      printf("PGRES_FATAL_ERROR: %s\n", query);
      exit(1); break;
    default:
      fprintf(stderr, "Error from %s: Unknown response from "
              "PQresultStatus()\n", progname);
      exit(1); break;
    }

  return res;
}

char get_letter(void)
{
  int c;

  do c = (int) random() % 128; while (!(isascii(c) && isalpha(c)));

  return (char) tolower(c);
}

unsigned int get_num(void)
{
  return random() % 100;
}

int main(int argc, char *argv[])
{
  char id[7], query[2048];
  int i;
  PGconn *db;
  PGresult *res;

  progname = argv[0];

  srandom(42); /* same data each time hopefully */

  db = PQconnectdb("dbname=test");
  if (PQstatus(db) == CONNECTION_BAD)
    {
      fprintf(stderr, "Error from %s: Unable to connect to database \"test\".\n",
              progname);
      exit(1);
    }

  res = send_query(db, "create table test (txt text,var varchar(7),i integer)");
  PQclear(res);
  res = send_query(db, "create table test2(txt text,var varchar(7),i integer)");
  PQclear(res);

  for (i = 1; i <= 2600; ++i)
    {
      sprintf(id, "%c%c%c%c%03u", get_letter(), get_letter(), get_letter(),
              get_letter(), get_num());
      sprintf(query, "insert into test values ('%s','%s','%i')", id, id, i);
      res = send_query(db, query);
      PQclear(res);
    }

  for (i = 1; i <= 600; ++i)
    {
      sprintf(id, "%c%c%c%c%03u", get_letter(), get_letter(), get_letter(),
              get_letter(), get_num());
      sprintf(query, "insert into test2 values ('%s','%s','%i')", id, id, i);
      res = send_query(db, query);
      PQclear(res);
    }

  PQfinish(db);
  return 0;
}


Re: [HACKERS] Another nasty cache problem

From: Patrick Welche
On Thu, Feb 03, 2000 at 11:24:34AM +0000, Patrick Welche wrote:
>   char id[7],query[2048];
            ^ should be 8 to be safe..


Re: [HACKERS] Another nasty cache problem

From: Tom Lane
Patrick Welche <prlw1@newn.cam.ac.uk> writes:
> Note that there is still plenty of swap space. The 128Mb number seems to be
> more than a coincidence (how to prove?)

The majority of Unix systems have a process size limit kernel parameter,
which is normally set to less than the amount of available swap space
(you don't want a single process running wild to chew up all your swap
and make other stuff start failing for lack of swap...)  Check your
kernel parameters.

It sounds to me like the backend has hit the size limit and is not
reacting gracefully to failure of malloc() to allocate more space.
It ought to exit with an elog(FATAL), probably.  Sigh, time to take
another pass through the code to cast a suspicious eye on everyplace
that calls malloc() directly.
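
The kind of defensive wrapper implied here might look like this
(illustrative only, not the actual code in the tree):

    void *
    malloc_or_die(size_t size)
    {
        void       *p = malloc(size);

        if (p == NULL)
            elog(FATAL, "out of memory in malloc(%lu)", (unsigned long) size);
        return p;
    }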

There's a separate question about *why* such a simple query is chewing
up so much memory.  What query plan does EXPLAIN show for your test
query?

You said this was with current sources, right?
        regards, tom lane


Re: [HACKERS] Another nasty cache problem

From: "Oliver Elphick"
Tom Lane wrote:
 >There's a separate question about *why* such a simple query is chewing
 >up so much memory.  What query plan does EXPLAIN show for your test
 >query?

I can show a similar problem.

 >You said this was with current sources, right?

This is with current sources: I managed to kill the backend before
it had used up all swap.  If left to run on 6.5.3 or CVS as of 2
weeks back it would kill the whole machine; I haven't let it get that
far today.

bray=# explain select * from pg_operator as a, pg_operator as b;
NOTICE:  QUERY PLAN:

Nested Loop  (cost=12604.88 rows=258064 width=162)
  ->  Seq Scan on pg_operator b  (cost=24.76 rows=508 width=81)
  ->  Seq Scan on pg_operator a  (cost=24.76 rows=508 width=81)
 

EXPLAIN


-- 
Oliver Elphick                                Oliver.Elphick@lfix.co.uk
Isle of Wight                              http://www.lfix.co.uk/oliver
               PGP key from public servers; key ID 32B8FAA1
                 ========================================
     "O come, let us worship and bow down; let us kneel
      before the LORD our maker."            Psalms 95:6




Re: [HACKERS] Another nasty cache problem

From: Patrick Welche
On Thu, Feb 03, 2000 at 12:00:21PM -0500, Tom Lane wrote:
>
> The majority of Unix systems have a process size limit kernel parameter,
> which is normally set to less than the amount of available swap space
> (you don't want a single process running wild to chew up all your swap
> and make other stuff start failing for lack of swap...)  Check your
> kernel parameters.

Probably to do with the shell limit:

memoryuse       125460 kbytes

> There's a separate question about *why* such a simple query is chewing
> up so much memory.  What query plan does EXPLAIN show for your test
> query?

test=# explain select * from test,test2 where test.i!=test2.i;
NOTICE:  QUERY PLAN:

Nested Loop  (cost=64104.80 rows=1559400 width=56)
  ->  Seq Scan on test2  (cost=24.80 rows=600 width=28)
  ->  Seq Scan on test  (cost=106.80 rows=2600 width=28)
 

EXPLAIN

> You said this was with current sources, right?

They're about 2 days old now. (Well, after your SI buffer overrun fixes)

Cheers,

Patrick


Re: [HACKERS] Another nasty cache problem

From: Tom Lane
"Oliver Elphick" <olly@lfix.co.uk> writes:
> Tom Lane wrote:
>> There's a separate question about *why* such a simple query is chewing
>> up so much memory.  What query plan does EXPLAIN show for your test
>> query? 
> I can show a similar problem.

> bray=# explain select * from pg_operator as a, pg_operator as b;
> NOTICE:  QUERY PLAN:

> Nested Loop  (cost=12604.88 rows=258064 width=162)
>  -> Seq Scan on pg_operator b  (cost=24.76 rows=508 width=81)
>  -> Seq Scan on pg_operator a  (cost=24.76 rows=508 width=81)

OK, I sussed this one --- there's a (longstanding) memory leak in
catcache.c.  When entering a system-table tuple into the cache,
it forgot to free the copy of the tuple that had been created in
transaction-local memory context.  Cause enough cache entries to
be created within one transaction, and you'd start to notice the
leak.  The above query exhibits the problem because it produces
about 250K tuples each with six regproc columns, and each regprocout
call does a cache lookup to convert regproc OID to procedure name.
Since you're cycling through 500-plus different procedure names,
and the cache only keeps ~ 300 entries, there's going to be a
fresh cache entry made every time :-(
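
In outline, the fix amounts to freeing the transaction-local copy once the
tuple has been copied into the cache's own memory context (a paraphrase of
the change, not the literal diff):

    MemoryContext oldcxt;

    /* in the cache-load path, roughly: */
    oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
    ct->ct_tup = heap_copytuple(ntp);   /* permanent copy owned by the cache */
    MemoryContextSwitchTo(oldcxt);
    pfree(ntp);         /* the missing step: free the transaction-local copy */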

With the fix I just committed, current sources execute the above query
in constant backend memory space.  psql's space usage still goes to the
moon, of course, since it's trying to buffer the whole query result :-(
... but there's no way around that short of a major redesign of libpq's
API.  When and if we switch over to CORBA, we really need to rethink
the client access API so that buffering the query result in the client-
side library is an option not a requirement.

I do not think this is the same problem that Patrick Welche is
complaining of, unfortunately.
        regards, tom lane


Re: [HACKERS] Another nasty cache problem

From: Chris Bitmead
Tom Lane wrote:

> With the fix I just committed, current sources execute the above query
> in constant backend memory space.  psql's space usage still goes to the
> moon, of course, since it's trying to buffer the whole query result :-(
> ... but there's no way around that short of a major redesign of libpq's
> API.  When and if we switch over to CORBA, we really need to rethink
> the client access API so that buffering the query result in the client-
> side library is an option not a requirement.

What about portals? Doesn't psql use portals?


Re: [HACKERS] Another nasty cache problem

From: Tom Lane
Patrick Welche <prlw1@newn.cam.ac.uk> writes:
>   PID USERNAME PRI NICE   SIZE   RES STATE   TIME   WCPU    CPU COMMAND
>  1547 prlw1     50    0   128M  516K run     1:28 59.28% 59.28% psql
>  1552 postgres  50    0  1920K  632K run     1:37 24.32% 24.32% postgres

Sigh, I shoulda read this closely enough to notice that you were
complaining of psql memory overrun, not backend memory overrun :-(

The major problem here is that libpq's API is designed on the assumption
that libpq will buffer the whole query result in application memory
before letting the app see any of it.  I see no way around that without
a fundamental redesign of the API.  Which will happen someday, but not
today.

The minor problem is that libpq doesn't react very gracefully to running
out of memory.  It detects it OK, but then aborts query processing,
which means it gets out of step with the backend.  It needs to be fixed
so that it continues to absorb tuples (but drops them on the floor)
until the backend is done.  I've known of this problem for some time,
but have had many higher-priority problems to worry about.  Perhaps
someone else would like to take it on...
        regards, tom lane


Re: [HACKERS] Another nasty cache problem

От
Tom Lane
Дата:
Chris Bitmead <chrisb@nimrod.itg.telstra.com.au> writes:
> What about portals? Doesn't psql use portals?

No ... portals are a backend concept ...
        regards, tom lane


Re: [HACKERS] Another nasty cache problem

From: Chris Bitmead
Tom Lane wrote:
> 
> Chris Bitmead <chrisb@nimrod.itg.telstra.com.au> writes:
> > What about portals? Doesn't psql use portals?
> 
> No ... portals are a backend concept ...

Since when?

According to the old doco you do...

select portal XX * from table_name where ...;

fetch 20 into XX.

If the PQexec() is called with "fetch 20" at a time
wouldn't this mean that you wouldn't exhaust front-end
memory with a big query?


Re: [HACKERS] Another nasty cache problem

От
Tom Lane
Дата:
Chris Bitmead <chrisb@nimrod.itg.telstra.com.au> writes:
>> No ... portals are a backend concept ...

> Since when?

> According to the old doco you do...

> select portal XX * from table_name where ...;

> fetch 20 into XX.

That still works if you spell it in the SQL-approved way,
DECLARE CURSOR followed by FETCH.

> If the PQexec() is called with "fetch 20" at a time
> wouldn't this mean that you wouldn't exhaust front-end
> memory with a big query?

Sure, and that's how you work around the problem.  Nonetheless
this requires the user to structure his queries to avoid sucking
up a lot of data in a single query.  If the user doesn't have any
particular reason to need random access into a query result, it'd
be nicer to be able to read the result in a streaming fashion
without buffering it anywhere *or* making arbitrary divisions in it.
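
For a libpq client the chunked approach looks roughly like this (conn is
an already-open PGconn; the cursor name, fetch size, and error checking
are kept minimal):

    PGresult   *res;

    res = PQexec(conn, "BEGIN");
    PQclear(res);
    res = PQexec(conn, "DECLARE big CURSOR FOR "
                 "SELECT * FROM test, test2 WHERE test.i != test2.i");
    PQclear(res);

    for (;;)
    {
        res = PQexec(conn, "FETCH 100 IN big");
        if (PQresultStatus(res) != PGRES_TUPLES_OK || PQntuples(res) == 0)
        {
            PQclear(res);
            break;
        }
        /* look at the (at most) 100 rows in res here, then discard them */
        PQclear(res);
    }

    res = PQexec(conn, "CLOSE big");
    PQclear(res);
    res = PQexec(conn, "COMMIT");
    PQclear(res);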

In any case, psql doesn't (and IMHO shouldn't) convert a SELECT
into a series of FETCHes for you.
        regards, tom lane


Re: [HACKERS] Another nasty cache problem

From: Patrick Welche
It seems that I am still tracking problems, but each time they turn out to
have a different cause: A slight variant on the select that caused memory
to run out gives


newnham=# select crsids.surname, "tblPerson"."Surname" from crsids,"tblPerson" where
crsids.usn="tblPerson"."USN"::int4;
pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Nested Loop  (cost=66496.62 rows=34359 width=40)
  ->  Seq Scan on tblPerson  (cost=157.62 rows=2625 width=24)
  ->  Seq Scan on crsids  (cost=25.27 rows=584 width=16)
 

This is the table I based the memory hog on (2600*600). The backend closes
instantly, i.e., no memory usage! And, as before, it is hard to find a test
case that fails as repeatably (i.e., my test case never crashes, while the
above case crashes every single time). "tblPerson", as its strange
capitalisation suggests, was imported from M$ Access via ODBC.

select test.txt,test2.var from test,test2 where test2.i=test.var::int4;

Nested Loop  (cost=63504.80 rows=2600 width=40)
  ->  Seq Scan on test2  (cost=24.80 rows=600 width=16)
  ->  Seq Scan on test  (cost=105.80 rows=2600 width=24)
 

works fine.

Any thoughts on where to look?

Cheers,

Patrick


Re: [HACKERS] Another nasty cache problem

From: Tom Lane
Patrick Welche <prlw1@newn.cam.ac.uk> writes:
> newnham=# select crsids.surname, "tblPerson"."Surname" from crsids,"tblPerson" where
crsids.usn="tblPerson"."USN"::int4;
> pqReadData() -- backend closed the channel unexpectedly.
>         This probably means the backend terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.

> Any thoughts on where to look?

Is there anything in the postmaster log?  Is there a core file (look
in the database subdirectory, ie .../data/base/yourdatabase/core)?
If so, compiling the backend with -g and extracting a backtrace from
the resulting corefile with gdb would be very useful info.
        regards, tom lane


Re: [HACKERS] Another nasty cache problem

From: Patrick Welche
On Fri, Feb 04, 2000 at 03:58:57PM -0500, Tom Lane wrote:
> 
> > Any thoughts on where to look?
> 
> Is there anything in the postmaster log?

DEBUG:  Data Base System is in production state at Fri Feb  4 17:11:05 2000
Server process (pid 3588) exited with status 11 at Fri Feb  4 17:14:57 2000
Terminating any active server processes...
Server processes were terminated at Fri Feb  4 17:14:57 2000
Reinitializing shared memory and semaphores

>  Is there a core file (look
> in the database subdirectory, ie .../data/base/yourdatabase/core)?

But no core file ... so who knows where the sigsegv comes from. (Don't worry,
coredumpsize is set to unlimited.)


> If so, compiling the backend with -g and extracting a backtrace from
> the resulting corefile with gdb would be very useful info.

(already have the -g..)

Still looking...

Cheers,

Patrick


Re: [HACKERS] Another nasty cache problem

From: Tom Lane
Patrick Welche <prlw1@newn.cam.ac.uk> writes:
>> Is there anything in the postmaster log?

> DEBUG:  Data Base System is in production state at Fri Feb  4 17:11:05 2000
> Server process (pid 3588) exited with status 11 at Fri Feb  4 17:14:57 2000

> But no core file ... so who knows what the sigsegv comes from. (don't worry
> coredumpsize    unlimited)

There sure oughta be a corefile after a SIGSEGV.  Hmm.  How are you
starting the postmaster --- is it from a system startup script?
It might work better to start it from an ordinary user process.
I discovered the other day on a Linux box that the system just plain
would not dump a core file from a process started by root, even though
the process definitely had nonzero "ulimit -c" and had set its euid
to a nonprivileged userid.  But start the same process by hand from an
unprivileged login, and it would dump a core file.  Weird.  Dunno if
your platform behaves the same way, but it's worth trying.
        regards, tom lane


Re: [HACKERS] Another nasty cache problem

From: Patrick Welche
The killer query was:

select crsids.surname,"tblPerson"."Surname" from crsids,"tblPerson" where crsids.usn="tblPerson"."USN"::int4;

and the reason for the SIGSEGV is that somehow text_int4(text *string) in
src/backend/utils/adt/int.c is called with string=(text *)0x0, so obviously
this is a problem!

crsids.usn is integer, "tblPerson"."USN" is varchar(9).

Oddly enough, text_int4 is called from fmgr.c:136 which is in the case
statement for n_arguments=2, yet that should be 1

(gdb) print {FmgrInfo}0x8221a30
$4 = {fn_addr = 0x80f9dbc <text_int4>, fn_plhandler = 0, fn_oid = 1620,  fn_nargs = 1}

unless gdb is reporting the wrong line number. values->data[0]=0=string.

I have a backtrace and a pretty printed copy of the query tree if useful...

Still trying to make a small test case...

Any suggestions appreciated!

Cheers,

Patrick

(source of 31st Jan)

On Sat, Feb 05, 2000 at 12:18:29PM -0500, Tom Lane wrote:
> Patrick Welche <prlw1@newn.cam.ac.uk> writes:
> >> Is there anything in the postmaster log?
> 
> > DEBUG:  Data Base System is in production state at Fri Feb  4 17:11:05 2000
> > Server process (pid 3588) exited with status 11 at Fri Feb  4 17:14:57 2000
> 
> > But no core file ... so who knows what the sigsegv comes from. (don't worry
> > coredumpsize    unlimited)
> 
> There sure oughta be a corefile after a SIGSEGV.  Hmm.  How are you
> starting the postmaster --- is it from a system startup script?
> It might work better to start it from an ordinary user process.
> I discovered the other day on a Linux box that the system just plain
> would not dump a core file from a process started by root, even though
> the process definitely had nonzero "ulimit -c" and had set its euid
> to a nonprivileged userid.  But start the same process by hand from an
> unprivileged login, and it would dump a core file.  Weird.  Dunno if
> your platform behaves the same way, but it's worth trying.
> 
>             regards, tom lane


Re: [HACKERS] Another nasty cache problem

From: Tom Lane
Patrick Welche <prlw1@newn.cam.ac.uk> writes:
> and the reason for the SIGSEGV, is that somehow, text_int4(text *string) in
> src/backend/utils/adt/int.c is called with string=(text *)0x0, so obviously
> this is a problem!

Um.  Probably you have a NULL value in "tblPerson"."USN" somewhere?

There are a lot of functions without adequate defenses against NULL
inputs :-( --- we've been cleaning them up slowly, but evidently you
found another one.
        regards, tom lane


Re: [HACKERS] Another nasty cache problem

From: Patrick Welche
On Fri, Feb 11, 2000 at 06:18:32PM -0500, Tom Lane wrote:
> Patrick Welche <prlw1@newn.cam.ac.uk> writes:
> > and the reason for the SIGSEGV, is that somehow, text_int4(text *string) in
> > src/backend/utils/adt/int.c is called with string=(text *)0x0, so obviously
> > this is a problem!
> 
> Um.  Probably you have a NULL value in "tblPerson"."USN" somewhere?

Yes of course! Naturally I was looking for something far too complicated and
the trees got in the way.. And that's why my test case didn't work.

> There are a lot of functions without adequate defenses against NULL
> inputs :-( --- we've been cleaning them up slowly, but evidently you
> found another one.

So the trouble is, if the function returns an int and you want to return
null, there really isn't a value that can be stuck into the int that
represents null?

In the meantime, I think this might help, so I would have seen:

newnham=# select crsids.surname,"tblPerson"."Surname" from crsids,"tblPerson" where
crsids.usn="tblPerson"."USN"::int4;
ERROR:  Trying to convert NULL text to integer (int4)

Cheers,

Patrick



Index: int.c
===================================================================
RCS file: /usr/local/cvsroot/pgsql/src/backend/utils/adt/int.c,v
retrieving revision 1.32
diff -c -r1.32 int.c
*** int.c       2000/01/26 05:57:14     1.32
--- int.c       2000/02/14 11:22:32
***************
*** 277,282 ****
--- 277,285 ----
        int                     len;
        char       *str;
 
+       if (!string)
+               elog(ERROR, "Trying to convert NULL text to integer (int2)");
+ 
        len = (VARSIZE(string) - VARHDRSZ);
        str = palloc(len + 1);
***************
*** 317,322 ****
--- 320,328 ----
        int                     len;
        char       *str;
+ 
+       if (!string)
+               elog(ERROR, "Trying to convert NULL text to integer (int4)");
        len = (VARSIZE(string) - VARHDRSZ);



Re: [HACKERS] Another nasty cache problem

From: Tom Lane
Patrick Welche <prlw1@newn.cam.ac.uk> writes: 
> +       if (!string)
> +               elog(ERROR, "Trying to convert NULL text to integer (int2)");

This is unreasonable behavior.  The correct patch is just

    if (!string)
        return 0;

which will allow the function manager to plow ahead with returning the
NULL that it's going to return anyway.  See the past pghackers threads
about redesigning the function manager interface if you don't understand
what's going on here.
        regards, tom lane


Re: [HACKERS] Another nasty cache problem

From: Patrick Welche
On Mon, Feb 14, 2000 at 11:00:16AM -0500, Tom Lane wrote:
> Patrick Welche <prlw1@newn.cam.ac.uk> writes:
>   
> > +       if (!string)
> > +               elog(ERROR, "Trying to convert NULL text to integer (int2)");
> 
> This is unreasonable behavior.  The correct patch is just
> 
>     if (!string)
>         return 0;
> 
> which will allow the function manager to plow ahead with returning the
> NULL that it's going to return anyway.  See the past pghackers threads
> about redesigning the function manager interface if you don't understand
> what's going on here.

Off the top of my head, that means that null and the string "0" both return 0..
OK - I'll look for the mail thread.

Cheers,

Patrick