Обсуждение: Join query crashes 7.3.1 backend...

Поиск
Список
Период
Сортировка

Join query crashes 7.3.1 backend...

От
"Mark Cave-Ayland"
Дата:
Hi everyone,

I'm having a problem executing a join query which causes the postgresql
backend to crash. The query I'm trying to execute is:

select t2.resid from res t2, line t1 where t2.resid=t1.resid and
t1.layer=355::bigint;

When this runs then it causes the backend to die and disconnect from
psql :(. I did some initial tests with a limit clause to ensure to see
where things were going wrong and I managed to locate roughly where the
problem was since I could find where the query failed.

select t2.resid from res t2, line t1 where t2.resid=t1.resid and
t1.layer=355::bigint limit 710050;

...worked fine whereas....

select t2.resid from res t2, line t1 where t2.resid=t1.resid and
t1.layer=355::bigint limit 710075;

...caused the backend to die. So I was beginning to suspect that somehow
the data in the table had been corrupted as I've been executing the same
query with many different values of layer before without any problems.
So firstly I tried the line table:

select resid from line where layer=355::bigint;

...which worked without any problems whatsoever. So I was guessing that
the corrupted rows were in the res table. Using the \o command, I dumped
all the resids from the above query into a file and wrote a libpq
program to read in all the resids from the file and query them one by
one in the form 'select resid,* from res where resid=x::bigint' where x
was each line from the above query.

This is where things start getting strange because the program happily
returned all rows produced by the 'select resid from line where
layer=355::bigint' on line far beyond the problem area... up until the
2M+ mark when I decided to stop it. So I was now thinking that the data
in res was not corrupted either. My next step was to verify that the
rows in both the joined and unjoined queries on the line table were
being returned in the same order.

I did an explain which returned the following:

                                                       QUERY PLAN

------------------------------------------------------------------------
------------------------------------------------
 Hash Join  (cost=113489207.76..119550368.76 rows=28567936 width=16)
   Hash Cond: ("outer".resid = "inner".resid)
   ->  Seq Scan on res t2  (cost=100000000.00..103939122.92
rows=72475392 width=8)
   ->  Hash  (cost=13306193.92..13306193.92 rows=28567936 width=8)
         ->  Index Scan using line_resid_idx on line t1
(cost=0.00..13306193.92 rows=28567936 width=8)
               Filter: (layer = 355::bigint)
(6 rows)


So my assumption from this was that the data was being returned in the
order of the line table. To check this, I executed the last query that
worked again:

select t2.resid from res t2, line t1 where t2.resid=t1.resid and
t1.layer=355::bigint limit 710050;

...but this time recorded the output into a file. And then the
following:

select resid from line where layer=355::bigint;

I then did a diff against the output of both queries to make sure that
the resid rows were being returned in the same order.

As suspected, the diff showed that the resids were being returned in the
same order in both cases up to the 710050th result (which is within 50
rows of whatever was causing it to crash).

So now I'm stuck as it seems pulling out the data around offset 710050
separately from both tables does not cause a problem whereas executing
the join directly causes the backend to crash. I have verified (to
within 25 rows) that the rows are being accessed in the same order in
both cases. Can anyone suggest why this is happening? I can supply
additional debug information if people can direct me as to what to do.


Many thanks,

Mark.


---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.


Re: Join query crashes 7.3.1 backend...

От
Tom Lane
Дата:
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> I'm having a problem executing a join query which causes the postgresql
> backend to crash.

Please try 7.3.2.   There's a (longstanding) overflow problem in the
hash join code that only triggers on sufficiently large input sets.
It was detected and fixed a couple months ago...

            regards, tom lane

Re: Join query crashes 7.3.1 backend...

От
"Mark Cave-Ayland"
Дата:
Hi Tom,

Thanks for the hint, I'll give that a go later today or tomorrow and see
if that fixes the problem. Incidentally, I did try looking for a
changelog for 7.3.2 to see if there were any known issues that had been
fixed since 7.3.1, but it appears that the release notes pages in the
documentation have not been updated since the release of 7.3 (see
http://www.postgresql.org/docs/view.php?version=7.3&idoc=0&file=release.
html)


Cheers,

Mark.


---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 25 February 2003 15:17
> To: Mark Cave-Ayland
> Cc: PostgreSQL General
> Subject: Re: [GENERAL] Join query crashes 7.3.1 backend...
>
> "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> > I'm having a problem executing a join query which causes the
postgresql
> > backend to crash.
>
> Please try 7.3.2.   There's a (longstanding) overflow problem in the
> hash join code that only triggers on sufficiently large input sets.
> It was detected and fixed a couple months ago...
>
>             regards, tom lane

Re: Join query crashes 7.3.1 backend...

От
"Mark Cave-Ayland"
Дата:
Hi Tom,

I've upgraded to 7.3.2 and the problem still exists :(. I'm not that
familiar with the workings of GDB but I managed to get a backtrace for
you:

(gdb) continue
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x0811ed8c in BufFileWrite ()
(gdb) bt
#0  0x0811ed8c in BufFileWrite ()
#1  0x080dc237 in ExecHashJoinSaveTuple ()
#2  0x080db201 in ExecHash ()
#3  0x080d5f6d in ExecProcNode ()
#4  0x080dbb06 in ExecHashJoin ()
#5  0x080d5f7d in ExecProcNode ()
#6  0x080d4d44 in ExecutePlan ()
#7  0x080d4389 in ExecutorRun ()
#8  0x081294fb in ProcessQuery ()
#9  0x081279bb in pg_exec_query_string ()
#10 0x08128ad3 in PostgresMain ()
#11 0x0810f044 in DoBackend ()
#12 0x0810e92d in BackendStartup ()
#13 0x0810d9c0 in ServerLoop ()
#14 0x0810d3f9 in PostmasterMain ()
#15 0x080e8cdd in main ()
#16 0x42017499 in __libc_start_main () from /lib/i686/libc.so.6
(gdb)


You may be on the right track with your overflow theory - we estimate
that the query should return somewhere in the region of 29M records as
this is one of the largest layers we have in the DB. Please let me know
if you need any more information and I will do my best to help.


Cheers,

Mark.


---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.


Re: Join query crashes 7.3.1 backend...

От
Tom Lane
Дата:
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> writes:
> I've upgraded to 7.3.2 and the problem still exists :(. I'm not that
> familiar with the workings of GDB but I managed to get a backtrace for
> you:

Boy, that backtrace certainly looks exactly like the hashjoin problem
that I thought I fixed ... and Rae Stiening, the original complainant,
confirmed that the patch fixed the problem for him.  You're certain
this is 7.3.2?

I'd recommend rebuilding with --enable-debug so you can get a more
detailed stack trace.  The thing I'd want to look at is whether
ExecHashJoinGetBatch() is returning an out-of-range batch number to
ExecHashJoin() just before the crash occurs.  It would also be useful
to look at the nbatch, nbuckets, and totalbuckets fields of the
hashtable structure in ExecHashJoin.

            regards, tom lane

Re: Join query crashes 7.3.1 backend...

От
"Mark Cave-Ayland"
Дата:
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: 26 February 2003 15:17
> To: Mark Cave-Ayland
> Cc: PostgreSQL General
> Subject: Re: Join query crashes 7.3.1 backend...
>
> Boy, that backtrace certainly looks exactly like the hashjoin
> problem that I thought I fixed ... and Rae Stiening, the
> original complainant, confirmed that the patch fixed the
> problem for him.  You're certain this is 7.3.2?

Hi Tom,

Hope you got my last email (we've been having mail server problems).
Just in case you didn't, I found that the SRPM I got yesterday was 7.3-2
and *NOT* 7.3.2-1. I've now recompiled the right RPMs and installed
them... and the problem has gone away :) So yes, the problem was fixed
between 7.3.1 and 7.3.2. I'm now off to PEBCAK school to study basic
file identification....


Thanks again,

Mark.


---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.