Thread: PostgreSQL server terminated by signal 11

PostgreSQL server terminated by signal 11

From: "Daniel Caune"
Date:
Hi,

My PostgreSQL server running on a Linux machine is terminated by signal 11 whenever I try to create some indexes on a table, which contains quite a lot of data.  However I succeeded in creating some other indexes without having the PostgreSQL server terminated:

agora=> CREATE INDEX IDX_GSLOG_EVENTTIME
agora->   ON GSLOG_EVENT (EVENT_DATE_CREATED);
CREATE INDEX
Time: 152908.797 ms
agora=> explain analyze select max(event_date_created) from gslog_event;
                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=3.80..3.81 rows=1 width=0) (actual time=0.218..0.221 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..3.80 rows=1 width=8) (actual time=0.197..0.200 rows=1 loops=1)
           ->  Index Scan Backward using idx_gslog_eventtime on gslog_event  (cost=0.00..39338251.59 rows=10348246 width=8) (actual time=0.188..0.188 rows=1 loops=1)
                 Filter: (event_date_created IS NOT NULL)
 Total runtime: 0.324 ms
(6 rows)

Time: 41.085 ms
agora=> CREATE INDEX IDX_GSLOG_EVENT_SPREAD_PROTOCOL_NAME
agora->   ON GSLOG_EVENT (EVENT_DATE_CREATED)
agora->   WHERE EVENT_NAME::text <> 'player-login'::text
agora->     AND PLAYER_USERNAME IS NOT NULL
agora->     AND GAME_CLIENT_VERSION IS NULL;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

The PostgreSQL log file doesn’t give more information about what went wrong, except that the server process has been terminated:

LOG:  server process (PID 22270) was terminated by signal 11
LOG:  terminating any other active server processes
LOG:  all server processes terminated; reinitializing
FATAL:  the database system is starting up
LOG:  database system was interrupted at 2006-07-27 15:29:27 GMT
LOG:  checkpoint record is at 249/179D44A8
LOG:  redo record is at 249/179D44A8; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 543712876; next OID: 344858
LOG:  next MultiXactId: 2; next MultiXactOffset: 3
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 249/179D44EC
LOG:  record with zero length at 249/179E4888
LOG:  redo done at 249/179E2DFC
LOG:  database system is ready
LOG:  transaction ID wrap limit is 2147484146, limited by database "postgres"

I checked the memory installed on the machine, running memtest86 during more than one day; no error found.  I checked bad blocks on every hard drive installed in this machine, using e2fsck -c /dev/hdxx; no bad block found.

I’ve already dropped the table, inserted data, and tried to create all the indexes.  The server systematically crashed when creating some specific indexes.  The only idea I have for the moment would be to setup another machine with the same database environment.  Other idea(s)?

Thanks

--
Daniel CAUNE
Ubisoft Online Technology
(514) 490 2040 ext. 3613

Re: PostgreSQL server terminated by signal 11

From: Jeff Frost
Date:
On Thu, 27 Jul 2006, Daniel Caune wrote:

> My PostgreSQL server running on a Linux machine is terminated by signal
> 11 whenever I try to create some indexes on a table, which contains
> quite a lot of data.  However I succeeded in creating some other indexes
> without having the PostgreSQL server terminated:

Daniel,

I would guess this is more appropriate for the -admin list so I cc'd it.

I think you are most likely running out of memory or running up against a
ulimit on memory.  I would first check my ulimit settings on the postgres user
and see if they are a bit small.
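
The check suggested above can be scripted. The following is a minimal sketch, assuming it is run under the postgres account on the database host (for example after "su postgres"); the choice of limits to print is an illustration, not something taken from this thread, and a postmaster started from an init script may run with different limits than an interactive shell. Values come back in bytes, whereas "ulimit" prints some of them in kilobytes.

```python
import resource

# Limits most relevant to a backend that dies while allocating memory.
# The selection below is an assumption made for illustration.
LIMITS = {
    "address space (ulimit -v)": resource.RLIMIT_AS,
    "data segment (ulimit -d)": resource.RLIMIT_DATA,
    "stack (ulimit -s)": resource.RLIMIT_STACK,
    "core file size (ulimit -c)": resource.RLIMIT_CORE,
}


def fmt(value):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def describe_limits():
    """Return one line per limit with its soft and hard value in bytes."""
    lines = []
    for label, which in LIMITS.items():
        soft, hard = resource.getrlimit(which)
        lines.append(f"{label}: soft={fmt(soft)} hard={fmt(hard)}")
    return lines


if __name__ == "__main__":
    print("\n".join(describe_limits()))
```

A non-zero core file limit is also useful here, since it lets a crashing backend leave a core dump behind.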

--
Jeff Frost, Owner     <jeff@frostconsultingllc.com>
Frost Consulting, LLC     http://www.frostconsultingllc.com/
Phone: 650-780-7908    FAX: 650-649-1954

Re: PostgreSQL server terminated by signal 11

From: Tom Lane
Date:
"Daniel Caune" <daniel.caune@ubisoft.com> writes:
> My PostgreSQL server running on a Linux machine is terminated by signal
> 11 whenever I try to create some indexes on a table, which contains
> quite a lot of data.

Judging from your examples it's got something to do with the partial
index WHERE clause.  What PG version is this exactly?  If you leave out
different parts of the WHERE, does it still crash?  Does the crash
happen immediately after you give the command, or does it run for
awhile?  It might be worth getting a stack trace from the failure
(best way is to attach to the running backend with gdb, provoke the
crash, and do "bt" --- search for "gdb" in the archives if you need
details).
        regards, tom lane


Re: PostgreSQL server terminated by signal 11

From: "Daniel Caune"
Date:
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Thursday, July 27, 2006 16:06
> To: Daniel Caune
> Cc: pgsql-sql@postgresql.org
> Subject: Re: [SQL] PostgreSQL server terminated by signal 11
>
> "Daniel Caune" <daniel.caune@ubisoft.com> writes:
> > My PostgreSQL server running on a Linux machine is terminated by signal
> > 11 whenever I try to create some indexes on a table, which contains
> > quite a lot of data.
>
> Judging from your examples it's got something to do with the partial
> index WHERE clause.  What PG version is this exactly?  If you leave out
> different parts of the WHERE, does it still crash?  Does the crash
> happen immediately after you give the command, or does it run for
> awhile?  It might be worth getting a stack trace from the failure
> (best way is to attach to the running backend with gdb, provoke the
> crash, and do "bt" --- search for "gdb" in the archives if you need
> details).
>
>             regards, tom lane

The postgres server version is 8.1.4.

Yes, if I leave out the WHERE clause and create a simple index, I don't encounter any problem:
 CREATE INDEX IDX_GSLOG_EVENTTIME   ON GSLOG_EVENT (EVENT_DATE_CREATED);


Anyway, I'm not sure, Tom, that it is only related to the WHERE clause, as the crash occurs with a composite index too, such as:
 CREATE INDEX IDX_GSLOG_EVENT_PLAYER_EVENT   ON GSLOG_EVENT (PLAYER_USERNAME, EVENT_NAME);


The crash may happen a while after sending the command.  For example, supposing I reboot the Linux machine and I immediately run the command (i.e. most of the memory is unused), it takes more than five minutes before the crash occurs.  At that time the memory usage is the following (top every second):

Mem:   2075860k total,  1787600k used,   288260k free,     6300k buffers
Swap:   369452k total,        0k used,   369452k free,  1748032k cached

When reconnecting to the respawned postgres server, it takes approximately the same time before it crashes, however many times I repeat this.


I did some other tests trying to detect any common denominator that may make the postgres server crash.  Here are some results:

select max(length(game_client_version)) from gslog_event;
=> [CRASH]

select max(length(game_client_version))  from gslog_event  where game_client_version is not null;
=> [OK, max = 28]

select count(*) from gslog_event where length(game_client_version) >= 0;
=> [OK, count = 4463726]

select count(*) from gslog_event where upper(game_client_version) = 'FARCRYPC1.33';
=> [OK, count = 576318]

select count(*) from gslog_event where lower(player_username) = 'lythanhphu';
=> [CRASH]

I was thinking about nullable value, but finally, you know what?  I have strictly no idea! :-)

I'll look at the archive for running postgres with gdb and provide more accurate information.

Thanks,

--
Daniel


Re: PostgreSQL server terminated by signal 11

From: "Daniel Caune"
Date:

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Thursday, July 27, 2006 16:06
> To: Daniel Caune
> Cc: pgsql-sql@postgresql.org
> Subject: Re: [SQL] PostgreSQL server terminated by signal 11
>
> "Daniel Caune" <daniel.caune@ubisoft.com> writes:
> > My PostgreSQL server running on a Linux machine is terminated by signal
> > 11 whenever I try to create some indexes on a table, which contains
> > quite a lot of data.
>
> Judging from your examples it's got something to do with the partial
> index WHERE clause.  What PG version is this exactly?  If you leave out
> different parts of the WHERE, does it still crash?  Does the crash
> happen immediately after you give the command, or does it run for
> awhile?  It might be worth getting a stack trace from the failure
> (best way is to attach to the running backend with gdb, provoke the
> crash, and do "bt" --- search for "gdb" in the archives if you need
> details).
>
>             regards, tom lane

It's been quite a long time since I last used gdb! :-)  Anyway, I proceeded as described below; correct me if I was wrong.

> ps -eaf | grep postgres

postgres  2792  2789  0 21:50 pts/2    00:00:00 su postgres
postgres  2793  2792  0 21:50 pts/2    00:00:00 bash
postgres  2902     1  7 22:17 ?        00:01:10 postgres: dbo agora [local] idle
postgres  2952     1  2 22:32 ?        00:00:00 /usr/lib/postgresql/8.1/bin/postmaster -D /var/lib/postgresql/8.1/main -c unix_socket_directory=/var/run/postgresql -c config_file=/etc/postgresql/8.1/main/postgresql.conf -c hba_file=/etc/postgresql/8.1/main/pg_hba.conf -c ident_file=/etc/postgresql/8.1/main/pg_ident.conf
postgres  2954  2952  0 22:32 ?        00:00:00 postgres: writer process
postgres  2955  2952  0 22:32 ?        00:00:00 postgres: stats buffer process
postgres  2956  2955  0 22:32 ?        00:00:00 postgres: stats collector process

I connected to the postgres server using psql and I retrieved the backend pid by executing the statement "SELECT
pg_backend_pid();"

I started gdb under the UNIX account postgres and attached it to the backend process, providing the pid returned by the statement.

I ran the command responsible for creating the index and entered "continue" in gdb to let the command execute.  After a while, the server crashed:

  Program received signal SIGSEGV, Segmentation fault.
  0x08079e2a in slot_attisnull ()
  (gdb)
  Continuing.

  Program terminated with signal SIGSEGV, Segmentation fault.
  The program no longer exists.

I can't do "bt" since the program no longer exists.  How can I provide more information, stack trace, and so on?

--
Daniel

Re: PostgreSQL server terminated by signal 11

From: Tom Lane
Date:
"Daniel Caune" <daniel.caune@ubisoft.com> writes:
> I run the command responsible for creating the index and I entered "continue" in gdb for executing the command.  After a while, the server crashes:

>   Program received signal SIGSEGV, Segmentation fault.
>   0x08079e2a in slot_attisnull ()
>   (gdb)
>   Continuing.

>   Program terminated with signal SIGSEGV, Segmentation fault.
>   The program no longer exists.

> I can't do "bt" since the program no longer exists.

I think you typed one carriage return too many and the thing re-executed
the last command, ie, the continue.  Try it again.
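
One way to avoid the stray keystroke altogether is to let gdb run non-interactively. The sketch below only assembles the command line; it relies on gdb's "-p", "-batch" and "-ex" options, and the helper names build_gdb_command and capture_backtrace are hypothetical. It assumes gdb is started under the postgres account and that the failing statement is then issued from the psql session that owns the backend.

```python
import subprocess


def build_gdb_command(backend_pid):
    """Return an argv list that attaches gdb to a backend, resumes it,
    and prints a backtrace once the process stops (e.g. on SIGSEGV)."""
    return [
        "gdb",
        "-p", str(backend_pid),  # attach to the running backend
        "-batch",                # exit after the commands below have run
        "-ex", "continue",       # resume the backend until it stops again
        "-ex", "bt",             # then print the stack
    ]


def capture_backtrace(backend_pid):
    """Run gdb and return its output. Blocks until the backend stops,
    so the failing statement has to be issued from psql in the meantime."""
    result = subprocess.run(
        build_gdb_command(backend_pid),
        capture_output=True,
        text=True,
    )
    return result.stdout
```

With the pid returned by "SELECT pg_backend_pid();", capture_backtrace(pid) would hand back whatever gdb printed, including the "bt" output if the backend stopped on a signal.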

The lack of arguments shown for slot_attisnull suggests that all we're
going to get is a list of function names, without line numbers or
argument values.  If that's not enough to figure out the problem, can
you rebuild with --enable-debug to get a more useful stack trace?

            regards, tom lane

Re: PostgreSQL server terminated by signal 11

From: "D'Arcy J.M. Cain"
Date:
On Thu, 27 Jul 2006 19:00:27 -0400
"Daniel Caune" <daniel.caune@ubisoft.com> wrote:
> I run the command responsible for creating the index and I entered "continue" in gdb for executing the command.  After a while, the server crashes:
>
>   Program received signal SIGSEGV, Segmentation fault.
>   0x08079e2a in slot_attisnull ()

That's a pretty small function.  I don't see much room for error.  This
diff in src/backend/access/common/heaptuple.c seems like the most
likely place to catch it.

RCS file: /cvsroot/pgsql/src/backend/access/common/heaptuple.c,v
retrieving revision 1.110
diff -u -p -u -r1.110 heaptuple.c
--- heaptuple.c 14 Jul 2006 14:52:16 -0000      1.110
+++ heaptuple.c 27 Jul 2006 23:37:54 -0000
@@ -1470,8 +1470,13 @@ slot_getsomeattrs(TupleTableSlot *slot,
 bool
 slot_attisnull(TupleTableSlot *slot, int attnum)
 {
-       HeapTuple       tuple = slot->tts_tuple;
-       TupleDesc       tupleDesc = slot->tts_tupleDescriptor;
+       HeapTuple       tuple;
+       TupleDesc       tupleDesc;
+
+       assert(slot != NULL);
+
+       tuple =  slot->tts_tuple;
+       tupleDesc = slot->tts_tupleDescriptor;

        /*
         * system attributes are handled by heap_attisnull

Of course, you still have to find out what's calling it with slot set
to NULL if that turns out to be the problem.  It may also be that slot
is not NULL but set to garbage.  You could also add a notice there.
Two, in fact.  One to display the address of slot and one to display
the value of slot->tts_tuple or slot->tts_tupleDescriptor.  If the
first shows a non NULL value and the second causes your crash that
tells you that the value of slot is probably trashed before
calling the function.

Do this in conjunction with Tom Lane's suggestion of "--enable-debug" for
more information.

--
D'Arcy J.M. Cain <darcy@druid.net>         |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP)   |  what's for dinner.

Re: PostgreSQL server terminated by signal 11

From: Daniel CAUNE
Date:
> -----Original Message-----
> From: pgsql-sql-owner@postgresql.org [mailto:pgsql-sql-owner@postgresql.org]
> On behalf of Tom Lane
> Sent: Thursday 27 July 2006 19:26
> To: Daniel Caune
> Cc: pgsql-admin@postgresql.org; pgsql-sql@postgresql.org
> Subject: Re: [SQL] PostgreSQL server terminated by signal 11
>
> "Daniel Caune" <daniel.caune@ubisoft.com> writes:
> > I run the command responsible for creating the index and I entered
> "continue" in gdb for executing the command.  After a while, the server
> crashes:
>
> >   Program received signal SIGSEGV, Segmentation fault.
> >   0x08079e2a in slot_attisnull ()
> >   (gdb)
> >   Continuing.
>
> >   Program terminated with signal SIGSEGV, Segmentation fault.
> >   The program no longer exists.
>
> > I can't do "bt" since the program no longer exists.
>
> I think you typed one carriage return too many and the thing re-executed
> the last command, ie, the continue.  Try it again.
>

OK, I'll try that tomorrow morning.  Perhaps I can set a conditional breakpoint on function slot_attisnull for when parameter slot is null (or slot->tts_tupleDescriptor is null).

> The lack of arguments shown for slot_attisnull suggests that all we're
> going to get is a list of function names, without line numbers or
> argument values.  If that's not enough to figure out the problem, can
> you rebuild with --enable-debug to get a more useful stack trace?
>

Well, I installed PostgreSQL using apt-get, but it won't be a problem to get the source from the CVS repository and to build a postgres binary using the option you gave me.  Just give me the time to do that. :-)

Thanks,


--
Daniel


Re: PostgreSQL server terminated by signal 11

From: Daniel CAUNE
Date:
> -----Original Message-----
> From: pgsql-sql-owner@postgresql.org [mailto:pgsql-sql-owner@postgresql.org]
> On behalf of D'Arcy J.M. Cain
> Sent: Thursday 27 July 2006 19:49
> To: Daniel Caune
> Cc: tgl@sss.pgh.pa.us; pgsql-admin@postgresql.org; pgsql-sql@postgresql.org
> Subject: Re: [SQL] PostgreSQL server terminated by signal 11
>
> On Thu, 27 Jul 2006 19:00:27 -0400
> "Daniel Caune" <daniel.caune@ubisoft.com> wrote:
> > I run the command responsible for creating the index and I entered
> "continue" in gdb for executing the command.  After a while, the server
> crashes:
> >
> >   Program received signal SIGSEGV, Segmentation fault.
> >   0x08079e2a in slot_attisnull ()
>
> That's a pretty small function.  I don't see much room for error.  This
> diff in src/backend/access/common/heaptuple.c seems like the most
> likely place to catch it.
>
> RCS file: /cvsroot/pgsql/src/backend/access/common/heaptuple.c,v
> retrieving revision 1.110
> diff -u -p -u -r1.110 heaptuple.c
> --- heaptuple.c 14 Jul 2006 14:52:16 -0000      1.110
> +++ heaptuple.c 27 Jul 2006 23:37:54 -0000
> @@ -1470,8 +1470,13 @@ slot_getsomeattrs(TupleTableSlot *slot,
>  bool
>  slot_attisnull(TupleTableSlot *slot, int attnum)
>  {
> -       HeapTuple       tuple = slot->tts_tuple;
> -       TupleDesc       tupleDesc = slot->tts_tupleDescriptor;
> +       HeapTuple       tuple;
> +       TupleDesc       tupleDesc;
> +
> +       assert(slot != NULL);
> +
> +       tuple =  slot->tts_tuple;
> +       tupleDesc = slot->tts_tupleDescriptor;
>
>         /*
>          * system attributes are handled by heap_attisnull
>
> Of course, you still have to find out what's calling it with slot set
> to NULL if that turns out to be the problem.  It may also be that slot
> is not NULL but set to garbage.  You could also add a notice there.
> Two, in fact.  One to display the address of slot and one to display
> the value of slot->tts_tuple or slot->tts_tupleDescriptor.  If the
> first shows a non NULL value and the second causes your crash that
> tells you that the value of slot is probably trashed before
> calling the function.
>

Yes, I was afraid to go that deep, but it's time! :-))

Actually it seems, from the source code, that a null slot->tts_tuple won't lead to a segmentation fault in function slot_attisnull, while a null slot or slot->tts_tupleDescriptor will.  I will trace the function to try to discover what goes wrong behind the scenes.

> Do this in conjunction with Tom Lane's suggestion of "--enable-debug" for
> more information.
>
OK

--
Daniel


Re: PostgreSQL server terminated by signal 11

From: Tom Lane
Date:
Daniel CAUNE <d.caune@free.fr> writes:
> Actually it seems, from the source code, that a null slot->tts_tuple
> won't lead to a segmentation fault in function slot_attisnull, while
> slot and slot->tts_tupleDescriptor will.

I'll bet on D'Arcy's theory that slot is being passed in as NULL.
Exactly why remains to be seen ... we need that stack trace!

            regards, tom lane

Re: PostgreSQL server terminated by signal 11

From: "Daniel Caune"
Date:

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Thursday, July 27, 2006 19:26
> To: Daniel Caune
> Cc: pgsql-admin@postgresql.org; pgsql-sql@postgresql.org
> Subject: Re: [SQL] PostgreSQL server terminated by signal 11
>
> "Daniel Caune" <daniel.caune@ubisoft.com> writes:
> > I run the command responsible for creating the index and I entered
> "continue" in gdb for executing the command.  After a while, the server
> crashes:
>
> >   Program received signal SIGSEGV, Segmentation fault.
> >   0x08079e2a in slot_attisnull ()
> >   (gdb)
> >   Continuing.
>
> >   Program terminated with signal SIGSEGV, Segmentation fault.
> >   The program no longer exists.
>
> > I can't do "bt" since the program no longer exists.
>
> I think you typed one carriage return too many and the thing re-executed
> the last command, ie, the continue.  Try it again.
>

You were right.

Program received signal SIGSEGV, Segmentation fault.
0x08079e2a in slot_attisnull ()
(gdb) bt
#0  0x08079e2a in slot_attisnull ()
#1  0x0807a1d0 in slot_getattr ()
#2  0x080c6c73 in FormIndexDatum ()
#3  0x080c6ef1 in IndexBuildHeapScan ()
#4  0x0809b44d in btbuild ()
#5  0x0825dfdd in OidFunctionCall3 ()
#6  0x080c4f95 in index_build ()
#7  0x080c68eb in index_create ()
#8  0x08117e36 in DefineIndex ()
#9  0x081db4ee in ProcessUtility ()
#10 0x081d8449 in PostgresMain ()
#11 0x081d99d5 in PortalRun ()
#12 0x081d509e in pg_parse_query ()
#13 0x081d6c33 in PostgresMain ()
#14 0x081aae91 in ClosePostmasterPorts ()
#15 0x081ac14c in PostmasterMain ()
#16 0x08168f22 in main ()

--
Daniel

Re: PostgreSQL server terminated by signal 11

From: Tom Lane
Date:
"Daniel Caune" <daniel.caune@ubisoft.com> writes:
> Program received signal SIGSEGV, Segmentation fault.
> 0x08079e2a in slot_attisnull ()
> (gdb) bt
> #0  0x08079e2a in slot_attisnull ()
> #1  0x0807a1d0 in slot_getattr ()
> #2  0x080c6c73 in FormIndexDatum ()
> #3  0x080c6ef1 in IndexBuildHeapScan ()
> #4  0x0809b44d in btbuild ()
> #5  0x0825dfdd in OidFunctionCall3 ()
> #6  0x080c4f95 in index_build ()
> #7  0x080c68eb in index_create ()
> #8  0x08117e36 in DefineIndex ()

Hmph.  gdb is lying to you, because slot_getattr doesn't call slot_attisnull.
This isn't too unusual in a non-debug build, because the symbol table is
incomplete (no mention of non-global functions).

Given that this doesn't happen right away, but only after it's been
processing for awhile, we can assume that FormIndexDatum has been
successfully iterated many times already, which seems to eliminate
theories like the slot or the keycol value being bogus.  I'm pretty well
convinced now that we're looking at a problem with corrupted data.  Can
you do a SELECT * FROM (or COPY FROM) the table without error?

            regards, tom lane

Re: PostgreSQL server terminated by signal 11

From: "Daniel Caune"
Date:
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Friday, July 28, 2006 09:38
> To: Daniel Caune
> Cc: pgsql-admin@postgresql.org; pgsql-sql@postgresql.org
> Subject: Re: [SQL] PostgreSQL server terminated by signal 11
>
> "Daniel Caune" <daniel.caune@ubisoft.com> writes:
> > Program received signal SIGSEGV, Segmentation fault.
> > 0x08079e2a in slot_attisnull ()
> > (gdb) bt
> > #0  0x08079e2a in slot_attisnull ()
> > #1  0x0807a1d0 in slot_getattr ()
> > #2  0x080c6c73 in FormIndexDatum ()
> > #3  0x080c6ef1 in IndexBuildHeapScan ()
> > #4  0x0809b44d in btbuild ()
> > #5  0x0825dfdd in OidFunctionCall3 ()
> > #6  0x080c4f95 in index_build ()
> > #7  0x080c68eb in index_create ()
> > #8  0x08117e36 in DefineIndex ()
>
> Hmph.  gdb is lying to you, because slot_getattr doesn't call
> slot_attisnull.
> This isn't too unusual in a non-debug build, because the symbol table is
> incomplete (no mention of non-global functions).
>
> Given that this doesn't happen right away, but only after it's been
> processing for awhile, we can assume that FormIndexDatum has been
> successfully iterated many times already, which seems to eliminate
> theories like the slot or the keycol value being bogus.  I'm pretty well
> convinced now that we're looking at a problem with corrupted data.  Can
> you do a SELECT * FROM (or COPY FROM) the table without error?
>
>             regards, tom lane

The statement "copy gslog_event to stdout;" leads to "ERROR:  invalid memory alloc request size 4294967293" after a while.

  (...)
  354964834       2006-07-19 10:53:42.813+00      (...)
  354964835       2006-07-19 10:53:44.003+00      (...)
  ERROR:  invalid memory alloc request size 4294967293


I tried then "select * from gslog_event where gslog_event_id >= 354964834 and gslog_event_id <= 354964900;":

  354964834 | 2006-07-19 10:53:42.813+00 | (...)
  354964835 | 2006-07-19 10:53:44.003+00 | (...)
  354964837 | 2006-07-19 10:53:44.113+00 | (...)
  354964838 | 2006-07-19 10:53:44.223+00 | (...)
  (...)
  (66 rows)


The statement "select * from gslog_event;" leads to "Killed"...  Ouch! The psql client just exits (the postgres server crashes too)!

The statement "select * from gslog_event where gslog_event_id <= 354964834;" passed.


I did other tests on some other tables that contain less data but that also seem corrupted:

  copy player to stdout
  ERROR:  invalid memory alloc request size 1918988375

  select * from player where id >=771042 and id<=771043;
  ERROR:  invalid memory alloc request size 1918988375

  select max(length(username)) from player;
  ERROR:  invalid memory alloc request size 1918988375

  select max(length(username)) from player where id <= 771042;
   max
  -----
    15

  select max(length(username)) from player where id >= 771050;
   max
  -----
    15

  select max(length(username)) from player where id >= 771044 and id <= 771050;
   max
  -----
    13

Finally:

  select * from player where id=771043;
  ERROR:  invalid memory alloc request size 1918988375

  select id from player where id=771043;
     id
  --------
   771043
  (1 row)

  agora=> select username from player where id=771043;
  ERROR:  invalid memory alloc request size 1918988375


I'm also pretty much convinced that there is some corrupted data, especially in varchar fields.  Before dropping the corrupted rows, is there a way to read part of the corrupted data?
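
The two request sizes in the errors above are easier to read as raw 32-bit words. The following is a minimal sketch, assuming a little-endian machine (the code addresses in the gdb output suggest 32-bit x86, but that is an assumption); the helper name describe_size is hypothetical.

```python
import struct


def describe_size(n):
    """Show an alloc request size as hex, as a signed 32-bit integer,
    and as the four bytes it occupies in little-endian memory order."""
    raw = struct.pack("<I", n)            # unsigned 32-bit, little-endian
    signed = struct.unpack("<i", raw)[0]  # same bits read as signed
    text = "".join(chr(b) if 32 <= b < 127 else "." for b in raw)
    return f"{n} = 0x{n:08X}, signed {signed}, bytes {text!r}"


if __name__ == "__main__":
    for size in (4294967293, 1918988375):
        print(describe_size(size))
```

Under that assumption, 4294967293 is 0xFFFFFFFD (that is, -3 as a signed value) and 1918988375 is 0x72617057, whose bytes in memory order spell the ASCII text "Wpar". That would be consistent with a varchar length word having been overwritten with other data, though it does not prove it.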

Thanks Tom for your great support.  I'm just afraid that I wasted your time...  Anyway I'll write a FAQ that provides some information about the kind of problem we have faced.

Regards,


--
Daniel

Re: PostgreSQL server terminated by signal 11

From: Tom Lane
Date:
"Daniel Caune" <daniel.caune@ubisoft.com> writes:
> The statement "copy gslog_event to stdout;" leads to "ERROR:  invalid memory alloc request size 4294967293" after a while.
> ...
> I did other tests on some other tables that contain less data but that seem also corrupted:

This is a bit scary as it suggests a systemic problem.  You should
definitely try to find out exactly what the corruption looks like.
It's usually not hard to home in on where the first corrupted row is
--- you do
    SELECT ctid, * FROM tab LIMIT n;
and determine the largest value of n that won't trigger a failure.
The corrupted region is then just after the last ctid you see.
You can look at those blocks with "pg_filedump -i -f" and see if
anything pops out.  Check the PG archives for previous discussions
of dealing with corrupted data.
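
The search for the largest n can be done by bisection rather than by hand. The following is a minimal sketch, in which probe is a hypothetical callable that runs "SELECT ctid, * FROM tab LIMIT n" through some client and returns True when the query completes without error; it assumes the scan returns rows in the same order on every run, so that once a given n fails every larger n fails too. Since a failing probe may take the backend down, a real probe would have to reconnect before each attempt.

```python
def largest_safe_limit(probe, upper):
    """Return the largest n in [0, upper] for which probe(n) succeeds.

    Assumes probe(0) succeeds and that failures are monotonic:
    if probe(n) fails, probe(m) fails for every m > n.
    """
    lo, hi = 0, upper  # invariant: probe(lo) is known (or assumed) to succeed
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            lo = mid              # mid rows can be read, look further out
        else:
            hi = mid - 1          # the failure is at or before row mid
    return lo


if __name__ == "__main__":
    # Simulated table in which row 1235 is the first unreadable one.
    print(largest_safe_limit(lambda n: n <= 1234, 10_000_000))
```

In the simulated case the function returns 1234 after about two dozen probes; the ctid of the last row returned at that limit is then the starting point for pg_filedump.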

            regards, tom lane