Discussion: BUG #2737: hash indexing large table fails, while btree of same index works

BUG #2737: hash indexing large table fails, while btree of same index works

From: "Balazs Nagy"
Date:
The following bug has been logged online:

Bug reference:      2737
Logged by:          Balazs Nagy
Email address:      bnagy@thenewpush.com
PostgreSQL version: 8.1.5
Operating system:   RHEL4
Description:        hash indexing large table fails, while btree of same
index works
Details:

Postgres: 8.1.5
Database table size: ~60 million rows
Field to index: varchar 127

CREATE INDEX ... USING hash ...

fails with a file not found error (psql in verbose mode):

ERROR:  58P01: could not open segment 3 of relation 1663/16439/16509 (target
block 528283): No such file or directory
LOCATION:  _mdfd_getseg, md.c:954

VACUUM and VACUUM FULL don't help; a full dump and reload doesn't help either.

CREATE INDEX ... USING btree ...

works fine.  Could there be a bug in the hash algorithm's implementation?

System is x86_64 SMP 8 CPU core, 16GB RAM, Fiber channel SAN, kernel
2.6.9-42.0.3.ELsmp

I haven't tried the 8.2beta2 yet, but would be happy to try, as the hash
method is better suited for the kind of index I need...
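
In sketch form, the failing scenario is roughly the following (table and column names are illustrative, not the actual schema):

CREATE TABLE big_table (val varchar(127));
-- load roughly 60 million rows, then:
CREATE INDEX big_table_val_hash ON big_table USING hash (val);   -- fails with the error above
CREATE INDEX big_table_val_btree ON big_table USING btree (val); -- completes normally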

Re: BUG #2737: hash indexing large table fails, while btree of same index works

From: Tom Lane
Date:
[ cc'ing to pgsql-performance because of performance issue for hash indexes ]

"Balazs Nagy" <bnagy@thenewpush.com> writes:
> Database table size: ~60 million rows
> Field to index: varchar 127

> CREATE INDEX ... USING hash ...

> fails with a file not found error (psql in verbose mode):

> ERROR:  58P01: could not open segment 3 of relation 1663/16439/16509 (target
> block 528283): No such file or directory
> LOCATION:  _mdfd_getseg, md.c:954

Wow, you're trying to build an 8GB hash index?  Considering that hash
indexes still don't have WAL support, it hardly seems like a good idea
to be using one that large.

The immediate problem here seems to be that the hash code is trying to
touch a page in segment 4 when it hasn't yet touched anything in segment
3.  The low-level md.c code assumes (not unreasonably) that this
probably represents a bug in the calling code, and errors out instead of
allowing the segment to be created.
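
(As a rough sanity check, assuming the default 8 kB block size and 1 GB
segments, each segment holds 131072 blocks, so target block 528283 lies in
segment 528283 / 131072 = 4, at offset 3995; md.c fails on the
still-nonexistent segment 3 while walking out to it.)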

We ought to think about rejiggering the smgr.c interface to support
hash's behavior more reasonably.  There's already one really bad kluge
in mdread() for hash support :-(

One thought that comes to mind is to require hash to do an smgrextend()
addressing the last block it intends to use whenever it allocates a new
batch of blocks, whereupon md.c could adopt a saner API: allow
smgrextend but not other calls to address blocks beyond the current EOF.
I had once wanted to require hash to explicitly fill all the blocks in
sequence, but that's probably too radical compared to what it does now
--- especially seeing that it seems the extension has to be done while
holding the page-zero lock (see _hash_expandtable).  Writing just the
logically last block in a batch would have the effect that hash indexes
could contain holes (unallocated extents) on filesystems that support
that.  Normally I would think that probably a Bad Thing, but since hash
indexes are never scanned sequentially, it might not matter whether they
end up badly fragmented because of after-the-fact filling in of a hole.
Thoughts?

            regards, tom lane

Re: [PERFORM] BUG #2737: hash indexing large table fails, while btree of same index works

From: "Simon Riggs"
Date:
On Fri, 2006-11-10 at 18:55 -0500, Tom Lane wrote:
> [ cc'ing to pgsql-performance because of performance issue for hash indexes ]
>
> "Balazs Nagy" <bnagy@thenewpush.com> writes:
> > Database table size: ~60 million rows
> > Field to index: varchar 127
>
> > CREATE INDEX ... USING hash ...

I'd be interested in a performance test that shows this is the best way
to index a table though, especially for such a large column. No wonder
there is an 8GB index.
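
Something along these lines would do for comparing build time and on-disk
size (sketch only; table and column names are placeholders, and
pg_relation_size()/pg_size_pretty() are the 8.1-era size functions):

\timing
CREATE INDEX t_val_btree ON t USING btree (val);
CREATE INDEX t_val_hash  ON t USING hash (val);
SELECT pg_size_pretty(pg_relation_size('t_val_btree')) AS btree_size,
       pg_size_pretty(pg_relation_size('t_val_hash'))  AS hash_size;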

> One thought that comes to mind is to require hash to do an smgrextend()
> addressing the last block it intends to use whenever it allocates a new
> batch of blocks, whereupon md.c could adopt a saner API: allow
> smgrextend but not other calls to address blocks beyond the current EOF.

> Thoughts?

Yes, do it.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PERFORM] BUG #2737: hash indexing large table fails, while btree of same index works

From: Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> On Fri, 2006-11-10 at 18:55 -0500, Tom Lane wrote:
>> One thought that comes to mind is to require hash to do an smgrextend()
>> addressing the last block it intends to use whenever it allocates a new
>> batch of blocks, whereupon md.c could adopt a saner API: allow
>> smgrextend but not other calls to address blocks beyond the current EOF.

> Yes, do it.

I found out that it's easy to reproduce this failure in the regression
tests, just by building with RELSEG_SIZE set to 128K instead of 1G:

*** ./expected/create_index.out Sun Sep 10 13:44:25 2006
--- ./results/create_index.out  Thu Nov 16 17:33:29 2006
***************
*** 323,328 ****
--- 323,329 ----
  --
  CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
  CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
+ ERROR:  could not open segment 7 of relation 1663/16384/26989 (target block 145): No such file or directory
  CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
  CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops);
  -- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);

AFAICS, any hash index exceeding a single segment is at serious risk.
The fact that we've not heard gripes before suggests that no one is
using gigabyte-sized hash indexes.

But it seems mighty late in the beta cycle to be making subtle changes
in the smgr API.  What I'm inclined to do for now is to hack
_hash_expandtable() to write a page of zeroes at the end of each file
segment when an increment in hashm_ovflpoint causes the logical EOF to
cross segment boundary(s).  This is pretty ugly and nonmodular, but it
will fix the bug without risking breakage of any non-hash code.
I'll revisit the cleaner solution once 8.3 devel begins.  Comments?

            regards, tom lane

Re: [PERFORM] BUG #2737: hash indexing large table fails, while btree of same index works

From: "Simon Riggs"
Date:
On Thu, 2006-11-16 at 17:48 -0500, Tom Lane wrote:

> AFAICS, any hash index exceeding a single segment is at serious risk.
> The fact that we've not heard gripes before suggests that no one is
> using gigabyte-sized hash indexes.

Seems so.

> But it seems mighty late in the beta cycle to be making subtle changes
> in the smgr API.  What I'm inclined to do for now is to hack
> _hash_expandtable() to write a page of zeroes at the end of each file
> segment when an increment in hashm_ovflpoint causes the logical EOF to
> cross segment boundary(s).  This is pretty ugly and nonmodular, but it
> will fix the bug without risking breakage of any non-hash code.
> I'll revisit the cleaner solution once 8.3 devel begins.  Comments?

Do we think there is hope of improving hash indexes? If not, I'm
inclined to remove them rather than to spend time bolstering them. We
can remap the keyword as was done with RTREE. It's somewhat embarrassing
having an index without clear benefit that can't cope across crashes. We
wouldn't accept that in other parts of the software...

If there is hope, is there a specific place to look? It would be good to
brain dump some starting places for an investigation.

Does anybody have a perf test that shows hash indexes beating btrees by
any significant margin? (Not saying there isn't one...)
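
A minimal sketch of such a test (placeholder names; hash indexes only help
equality lookups, so that's what to probe) would be to time the same lookup
against each index in turn:

EXPLAIN ANALYZE SELECT * FROM t WHERE val = 'some-key';  -- with only the btree index in place
-- drop it, build the hash index, then repeat:
EXPLAIN ANALYZE SELECT * FROM t WHERE val = 'some-key';  -- with only the hash index in place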

I can see the argument that fixed hash indexes would be faster, but
there are obviously major downsides to that approach.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PERFORM] BUG #2737: hash indexing large table fails, while btree of same index works

From: "Simon Riggs"
Date:
On Fri, 2006-11-17 at 08:26 -0600, Kenneth Marshall wrote:

> I certainly hold out some hope that they can improved. I would like to see
> them still included. Once they are gone, it will be much harder to ever
> add them back.

OK, you got it - keep hash indexes then.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [PERFORM] BUG #2737: hash indexing large table fails, while btree of same index works

From: Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> Do we think there is hope of improving hash indexes?

Sure.  They lack WAL support, which is just a small matter of
programming.  And no one has ever spent any time on performance
optimization for them, but it certainly seems like there ought to be
scope for improvement.  I don't think we should toss them unless it's been
proven that their theoretical performance advantages can't be realized
for some reason.  (This is unlike the situation for rtree, because with
rtree there was no reason to expect that rtree could dominate gist along
any axis.)

> If there is hope, is there a specific place to look?

I recall some speculation that using bucket size == page size might
be a bad choice, because it leads to mostly-empty buckets for typical
key sizes and fill factors.  Using a configurable bucket size could
help a lot.

            regards, tom lane

Re: [PERFORM] BUG #2737: hash indexing large table fails, while btree of same index works

From: "Julius.Stroffek"
Date:
Simon Riggs wrote:
> Do we think there is hope of improving hash indexes?
I thought about this a bit. One idea is that the hash index could have a
fixed number of buckets, specified in the CREATE INDEX statement, with the
tuples in each bucket stored in a b-tree. This should give a constant-factor
performance improvement (dependent on the number of buckets) for each fetch
of a tuple from the index compared to a fetch from a b-tree index.

cheers

Julo



Re: [PERFORM] BUG #2737: hash indexing large table fails, while btree of same index works

From: Kenneth Marshall
Date:
On Fri, Nov 17, 2006 at 10:59:11AM +0000, Simon Riggs wrote:
> On Thu, 2006-11-16 at 17:48 -0500, Tom Lane wrote:
>
> > AFAICS, any hash index exceeding a single segment is at serious risk.
> > The fact that we've not heard gripes before suggests that no one is
> > using gigabyte-sized hash indexes.
>
> Seems so.
>
> > But it seems mighty late in the beta cycle to be making subtle changes
> > in the smgr API.  What I'm inclined to do for now is to hack
> > _hash_expandtable() to write a page of zeroes at the end of each file
> > segment when an increment in hashm_ovflpoint causes the logical EOF to
> > cross segment boundary(s).  This is pretty ugly and nonmodular, but it
> > will fix the bug without risking breakage of any non-hash code.
> > I'll revisit the cleaner solution once 8.3 devel begins.  Comments?
>
> Do we think there is hope of improving hash indexes? If not, I'm
> inclined to remove them rather than to spend time bolstering them. We
> can remap the keyword as was done with RTREE. It's somewhat embarrassing
> having an index without clear benefit that can't cope across crashes. We
> wouldn't accept that in other parts of the software...
>

While I understand that there are currently serious problems with recovery
for hash indexes as they stand, it is theoretically possible to get a result
back with a single I/O probe from a hash, and
possible to get a result back with a single I/O probe from a hash and
that is not the case with btree indexes. At the top end of the performance
curve, every little bit that can be done to minimize actual I/Os is needed.
I certainly hold out some hope that they can be improved. I would like to see
them still included. Once they are gone, it will be much harder to ever
add them back.
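
(Back-of-the-envelope, assuming 8 kB pages and roughly 140 bytes per leaf
entry for a 127-byte key: about 55-58 entries fit per page, so 60 million
rows need on the order of a million btree leaf pages and a tree about four
levels deep, i.e. roughly four page fetches per lookup versus a single
bucket page for an ideal hash probe.)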

Ken Marshall