HASH: Out of overflow pages. Out of luck
From | Gene Selkov, Jr.
Subject | HASH: Out of overflow pages. Out of luck
Date |
Msg-id | 20020805022621.06AA1A13D@selkovjr.xnet.com
Responses | Re: HASH: Out of overflow pages. Out of luck; Re: HASH: Out of overflow pages. Out of luck; Re: HASH: Out of overflow pages. Out of luck
List | pgsql-hackers
Hi Everybody! I'm sorry I dropped out for so long -- was switching jobs and was on the verge of deportation for a while. Still not entirely back to normal, but I can raise my head and look around.

The first thing I discovered in the current version (7.2.1) -- as well as in 7.1.3 -- seems to be an old problem with the hash am. It's clustering too much. The data I'm trying to index is of type text, so it uses hashvarlena(). The values are something like this:

REC00014
REC00015
....
REC02463
RBS00001
RBS00002
....
RBS11021
....

It's like several very long sequences with multiple gaps. With the existing hashvarlena(), I can index about 2M rows, but not much more -- it comes back with the message about the overflow pages.

hashvarlena() responds to this input in a piecewise linear fashion. That is, the succession 'REC00010' .. 'REC00019' results in the linear order 1346868191 .. 1346868200. The next ten hash values will also be sequential, but in a different range. The quality of the hash function can be a factor here, but probably not the major one. I was able to jack my limit up to over 3.7M rows by reversing the order of bytes in hashvarlena() -- I made the pointer go down instead of up. That spread the hash values more sparsely, but it failed with the same message when I fed it more than 4M rows.

I saw Tom answer a similar question a year ago by saying that the hash access method is poorly supported and that there is no advantage to using it. I am not sure about the former, but the latter is not entirely true: we saw at least a 20% gain in performance when we switched from btree to hash, and my boss considers 20% a big enough improvement. Besides, he knows database theory and he is a long-time BerkeleyDB user, and in his world, hash is greatly superior to btree, so he is wondering why the postgres implementations are so close. Besides, it's a tough challenge to explain to a Libertarian that he'd better not do something. I guess we can make such people happy either by fixing hash, or by making btree very much worse -- whichever is easier :)

Seriously, though, if anybody wants to look at the problem, I saved the set of keys that caused it at http://home.xnet.com/~selkovjr/tmp/prot.gz

Also, I heard that other database systems have special access methods for sequences that address this issue, and I heard as well that without the option to use an arbitrary hash function instead of a single built-in, such problems are bound to happen with particular data sets. How true is that? If a better access method exists, what is it?

Thank you,
Gene
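For anyone who wants to see the clustering effect concretely, here is a minimal C sketch. It is not the actual hashvarlena() source: it uses an sdbm-style rolling hash (h = h*65599 + byte) as an assumed stand-in for the 7.2-era string hash. With the forward scan, keys such as 'REC00010' .. 'REC00019' differ only in their last character, which is added without being scaled, so the hashes come out as consecutive integers; reversing the scan direction, as in the experiment described above, makes that varying character get multiplied repeatedly and scatters the values.

/*
 * Minimal sketch, NOT the actual hashvarlena() source.  The sdbm-style
 * multiplier 65599 and the mixing scheme are assumptions used only to
 * illustrate the piecewise-linear behaviour and the byte-reversal fix.
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Forward scan: the last byte is added unscaled, so keys differing only
 * in their final character hash to consecutive values. */
static uint32_t
hash_forward(const char *key, size_t len)
{
	uint32_t	h = 0;
	size_t		i;

	for (i = 0; i < len; i++)
		h = (unsigned char) key[i] + 65599 * h;
	return h;
}

/* Reverse scan ("the pointer goes down instead of up"): the varying final
 * character is processed first, gets multiplied many times, and the
 * resulting values spread across the hash range. */
static uint32_t
hash_reverse(const char *key, size_t len)
{
	uint32_t	h = 0;
	size_t		i;

	for (i = len; i > 0; i--)
		h = (unsigned char) key[i - 1] + 65599 * h;
	return h;
}

int
main(void)
{
	char		key[16];
	int			i;

	/* Ten consecutive keys, as in the 'REC00010' .. 'REC00019' example */
	for (i = 10; i < 20; i++)
	{
		snprintf(key, sizeof(key), "REC%05d", i);
		printf("%s  forward=%u  reverse=%u\n", key,
			   (unsigned) hash_forward(key, strlen(key)),
			   (unsigned) hash_reverse(key, strlen(key)));
	}
	return 0;
}

Note that either scan direction only spreads such keys more thinly; it does not remove the linear structure. A hash that mixes every input byte into every output bit would avoid this kind of clustering for sequential key sets.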