Thread: BUG #2737: hash indexing large table fails, while btree of same index works
The following bug has been logged online:

Bug reference:      2737
Logged by:          Balazs Nagy
Email address:      bnagy@thenewpush.com
PostgreSQL version: 8.1.5
Operating system:   RHEL4
Description:        hash indexing large table fails, while btree of same index works
Details:

Postgres: 8.1.5
Database table size: ~60 million rows
Field to index: varchar 127

CREATE INDEX ... USING hash ...

fails with a file not found error (psql in verbose mode):

ERROR:  58P01: could not open segment 3 of relation 1663/16439/16509 (target block 528283): No such file or directory
LOCATION:  _mdfd_getseg, md.c:954

VACUUM and VACUUM FULL don't help; a full dump and reload doesn't help either.

CREATE INDEX ... USING btree ... works fine.

Could there be a bug in the hash algorithm's implementation?

System: x86_64 SMP, 8 CPU cores, 16 GB RAM, Fibre Channel SAN, kernel 2.6.9-42.0.3.ELsmp

I haven't tried the 8.2beta2 yet, but would be happy to, as the hash method is better suited to the kind of index I need...
[ cc'ing to pgsql-performance because of performance issue for hash indexes ]

"Balazs Nagy" <bnagy@thenewpush.com> writes:
> Database table size: ~60 million rows
> Field to index: varchar 127

> CREATE INDEX ... USING hash ...
> fails with a file not found error (psql in verbose mode):
> ERROR:  58P01: could not open segment 3 of relation 1663/16439/16509 (target
> block 528283): No such file or directory
> LOCATION:  _mdfd_getseg, md.c:954

Wow, you're trying to build an 8GB hash index?  Considering that hash
indexes still don't have WAL support, it hardly seems like a good idea
to be using one that large.

The immediate problem here seems to be that the hash code is trying to
touch a page in segment 4 when it hasn't yet touched anything in
segment 3.  The low-level md.c code assumes (not unreasonably) that
this probably represents a bug in the calling code, and errors out
instead of allowing the segment to be created.

We ought to think about rejiggering the smgr.c interface to support
hash's behavior more reasonably.  There's already one really bad kluge
in mdread() for hash support :-(

One thought that comes to mind is to require hash to do an smgrextend()
addressing the last block it intends to use whenever it allocates a new
batch of blocks, whereupon md.c could adopt a saner API: allow
smgrextend but not other calls to address blocks beyond the current EOF.

I had once wanted to require hash to explicitly fill all the blocks in
sequence, but that's probably too radical compared to what it does now
--- especially seeing that it seems the extension has to be done while
holding the page-zero lock (see _hash_expandtable).

Writing just the logically last block in a batch would have the effect
that hash indexes could contain holes (unallocated extents) on
filesystems that support that.  Normally I would think that probably a
Bad Thing, but since hash indexes are never scanned sequentially, it
might not matter whether they end up badly fragmented because of
after-the-fact filling in of a hole.

Thoughts?

			regards, tom lane
Re: [PERFORM] BUG #2737: hash indexing large table fails, while btree of same index works
From: "Simon Riggs"
Date:
On Fri, 2006-11-10 at 18:55 -0500, Tom Lane wrote:
> [ cc'ing to pgsql-performance because of performance issue for hash indexes ]
>
> "Balazs Nagy" <bnagy@thenewpush.com> writes:
> > Database table size: ~60 million rows
> > Field to index: varchar 127
>
> > CREATE INDEX ... USING hash ...

I'd be interested in a performance test that shows this is the best way
to index a table though, especially for such a large column. No wonder
there is an 8GB index.

> One thought that comes to mind is to require hash to do an smgrextend()
> addressing the last block it intends to use whenever it allocates a new
> batch of blocks, whereupon md.c could adopt a saner API: allow
> smgrextend but not other calls to address blocks beyond the current EOF.

> Thoughts?

Yes, do it.

--
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes:
> On Fri, 2006-11-10 at 18:55 -0500, Tom Lane wrote:
>> One thought that comes to mind is to require hash to do an smgrextend()
>> addressing the last block it intends to use whenever it allocates a new
>> batch of blocks, whereupon md.c could adopt a saner API: allow
>> smgrextend but not other calls to address blocks beyond the current EOF.

> Yes, do it.

I found out that it's easy to reproduce this failure in the regression
tests, just by building with RELSEG_SIZE set to 128K instead of 1G:

*** ./expected/create_index.out	Sun Sep 10 13:44:25 2006
--- ./results/create_index.out	Thu Nov 16 17:33:29 2006
***************
*** 323,328 ****
--- 323,329 ----
  --
  CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
  CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
+ ERROR:  could not open segment 7 of relation 1663/16384/26989 (target block 145): No such file or directory
  CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
  CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops);
  --
  CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);

AFAICS, any hash index exceeding a single segment is at serious risk.
The fact that we've not heard gripes before suggests that no one is
using gigabyte-sized hash indexes.

But it seems mighty late in the beta cycle to be making subtle changes
in the smgr API.  What I'm inclined to do for now is to hack
_hash_expandtable() to write a page of zeroes at the end of each file
segment when an increment in hashm_ovflpoint causes the logical EOF to
cross segment boundary(s).  This is pretty ugly and nonmodular, but it
will fix the bug without risking breakage of any non-hash code.
I'll revisit the cleaner solution once 8.3 devel begins.

Comments?

			regards, tom lane
Re: [PERFORM] BUG #2737: hash indexing large table fails, while btree of same index works
From: "Simon Riggs"
Date:
On Thu, 2006-11-16 at 17:48 -0500, Tom Lane wrote:

> AFAICS, any hash index exceeding a single segment is at serious risk.
> The fact that we've not heard gripes before suggests that no one is
> using gigabyte-sized hash indexes.

Seems so.

> But it seems mighty late in the beta cycle to be making subtle changes
> in the smgr API.  What I'm inclined to do for now is to hack
> _hash_expandtable() to write a page of zeroes at the end of each file
> segment when an increment in hashm_ovflpoint causes the logical EOF to
> cross segment boundary(s).  This is pretty ugly and nonmodular, but it
> will fix the bug without risking breakage of any non-hash code.
> I'll revisit the cleaner solution once 8.3 devel begins.  Comments?

Do we think there is hope of improving hash indexes? If not, I'm
inclined to remove them rather than to spend time bolstering them. We
can remap the keyword as was done with RTREE. It's somewhat
embarrassing having an index without clear benefit that can't cope
across crashes. We wouldn't accept that in other parts of the
software...

If there is hope, is there a specific place to look? It would be good
to brain-dump some starting places for an investigation.

Does anybody have a perf test that shows hash indexes beating btrees by
any significant margin? (Not saying there isn't one...) I can see the
argument that fixed hash indexes would be faster, but there are
obviously major downsides to that approach.

--
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com
Re: [PERFORM] BUG #2737: hash indexing large table fails, while btree of same index works
From: "Simon Riggs"
Date:
On Fri, 2006-11-17 at 08:26 -0600, Kenneth Marshall wrote:
> I certainly hold out some hope that they can be improved. I would like
> to see them still included. Once they are gone, it will be much harder
> to ever add them back.

OK, you got it - keep hash indexes then.

--
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes:
> Do we think there is hope of improving hash indexes?

Sure.  They lack WAL support, which is just a small matter of
programming.  And no one has ever spent any time on performance
optimization for them, but it certainly seems like there ought to be
scope for improvement.  I don't think we should toss them unless it's
been proven that their theoretical performance advantages can't be
realized for some reason.  (This is unlike the situation for rtree,
because with rtree there was no reason to expect that rtree could
dominate gist along any axis.)

> If there is hope, is there a specific place to look?

I recall some speculation that using bucket size == page size might be
a bad choice, because it leads to mostly-empty buckets for typical key
sizes and fill factors.  Using a configurable bucket size could help a
lot.

			regards, tom lane
Simon Riggs wrote:
> Do we think there is hope of improving hash indexes?

I thought about this a bit. My idea is that a hash index could have a
fixed number of buckets, specified in the CREATE INDEX statement, with
the tuples in each bucket stored in a b-tree. Compared to a b-tree
index, this should give a constant performance improvement (depending
on the number of buckets) for each fetch of a tuple from the index.

cheers

Julo
Re: [PERFORM] BUG #2737: hash indexing large table fails, while btree of same index works
From: Kenneth Marshall
Date:
On Fri, Nov 17, 2006 at 10:59:11AM +0000, Simon Riggs wrote:
> On Thu, 2006-11-16 at 17:48 -0500, Tom Lane wrote:
>
> > AFAICS, any hash index exceeding a single segment is at serious risk.
> > The fact that we've not heard gripes before suggests that no one is
> > using gigabyte-sized hash indexes.
>
> Seems so.
>
> > But it seems mighty late in the beta cycle to be making subtle changes
> > in the smgr API.  What I'm inclined to do for now is to hack
> > _hash_expandtable() to write a page of zeroes at the end of each file
> > segment when an increment in hashm_ovflpoint causes the logical EOF to
> > cross segment boundary(s).  This is pretty ugly and nonmodular, but it
> > will fix the bug without risking breakage of any non-hash code.
> > I'll revisit the cleaner solution once 8.3 devel begins.  Comments?
>
> Do we think there is hope of improving hash indexes? If not, I'm
> inclined to remove them rather than to spend time bolstering them. We
> can remap the keyword as was done with RTREE. It's somewhat embarrassing
> having an index without clear benefit that can't cope across crashes. We
> wouldn't accept that in other parts of the software...

While I understand that there are currently serious problems in terms
of recovery with the hash index as it stands, it is theoretically
possible to get a result back with a single I/O probe from a hash
index, and that is not the case with btree indexes. At the top end of
the performance curve, every little bit that can be done to minimize
actual I/Os is needed. I certainly hold out some hope that they can be
improved. I would like to see them still included. Once they are gone,
it will be much harder to ever add them back.

Ken Marshall