HASH: Out of overflow pages. Out of luck

Hi Everybody!

I'm sorry I dropped out for so long -- was switching jobs and was on
the verge of deportation for a while. Still not entirely back to
normal, but can raise my head and look around.

The first thing I discovered in the current version (7.2.1) -- as well
as in 7.1.3 -- seems to be an old problem with the hash access method:
it clusters the hash values too much.

The data I'm trying to index is of type text, so it uses
hashvarlena(). The values are something like this:

REC00014
REC00015
....
REC02463
RBS00001
RBS00002
....
RBS11021
....

It's like several very long sequences with multiple gaps. 

With the existing hashvarlena(), I can index about 2M rows, but not
much more -- it comes back with the "out of overflow pages" message.

hashvarlena() responds to this input in a piecewise linear fashion. That is,
the succession 'REC00010' .. 'REC00019' results in a linear order
1346868191 .. 1346868200. The next ten hash values will also be sequential, 
but in a different range. 

The quality of the hash function can be a factor here, but probably
not the major one. I was able to jack my limit up to over 3.7M rows by
reversing the order of bytes in hashvarlena() -- I made the pointer go
down instead of up. That spread the hash values more sparsely, but it
failed with the same message when I fed it with more than 4M rows.

I saw Tom answer a similar question a year ago, saying that the hash
access method is poorly supported and that there is no advantage to
using it. I am not sure about the former, but the latter is not
entirely true: we saw at least a 20% gain in performance when we
switched from btree to hash, and my boss considers 20% a big enough
improvement. Besides, he knows database theory and is a long-time
BerkeleyDB user, and in his world hash is greatly superior to btree,
so he is wondering why the Postgres implementations are so close in
performance. Besides, it's a tough challenge to explain to a
Libertarian that he'd better not do something.

I guess we can make such people happy by either fixing hash, or by
making btree very much worse -- whichever is easier :)

Seriously, though, if anybody wants to look at the problem, I saved
the set of keys that caused it as

http://home.xnet.com/~selkovjr/tmp/prot.gz

Also, I heard that other database systems have special access methods
for sequences that address this issue, and I heard as well that
without the option to use an arbitrary hash function instead of a
single built-in one, such problems are bound to happen with particular
data sets. How true is that? If a better access method exists, what is
it?

Thank you,

Gene




In the pgsql-hackers list, by date sent:
