Обсуждение: WTF with hash index?

Поиск
Список
Период
Сортировка

WTF with hash index?

От
Олег Самойлов
Дата:
CentOS 7

$ rpm -q postgresql10
postgresql10-10.6-1PGDG.rhel7.x86_64

SQL script for psql:

\set table_size 1000000
begin;
create table gender (gender varchar);

insert into gender (gender) select case when random<0.50 then 'female' when random<0.99 then 'male' else 'other' end from (select random() as random, generate_series(1,:table_size)) as subselect;

create index gender_btree on gender using btree (gender);
create index gender_hash on gender using hash (gender);
commit;
vacuum full analyze;

Vacuum full is not necessary here, just a little vodoo programming. I expected that the hash index will be much smaller and quicker than the btree index, because it doesn’t keep values inside itself, only hashes. But:

=> \d+
                   List of relations
 Schema |  Name  | Type  | Owner | Size  | Description
--------+--------+-------+-------+-------+-------------
 public | gender | table | olleg | 35 MB |
(1 row)

=> \di+
                          List of relations
 Schema |     Name     | Type  | Owner | Table  | Size  | Description
--------+--------------+-------+-------+--------+-------+-------------
 public | gender_btree | index | olleg | gender | 21 MB |
 public | gender_hash  | index | olleg | gender | 47 MB |
(2 rows)

The hash index not only is more than the btree index, but also is bigger than the table itself. What is wrong with the hash index?

Re: WTF with hash index?

От
Laurenz Albe
Дата:
Олег Самойлов wrote:
> \set table_size 1000000
> begin;
> create table gender (gender varchar);
> 
> insert into gender (gender) select case when random<0.50 then 'female' when random<0.99 then 'male' else 'other' end
from(select random() as random, generate_series(1,:table_size)) as subselect;
 
> 
> create index gender_btree on gender using btree (gender);
> create index gender_hash on gender using hash (gender);
> commit;
> vacuum full analyze;
> 
> Vacuum full is not necessary here, just a little vodoo programming. I expected that the hash index will be much
smallerand quicker than the btree index, because it doesn’t keep values inside itself, only hashes. But:
 
> 
> => \d+
>                    List of relations
>  Schema |  Name  | Type  | Owner | Size  | Description
> --------+--------+-------+-------+-------+-------------
>  public | gender | table | olleg | 35 MB |
> (1 row)
> 
> => \di+
>                           List of relations
>  Schema |     Name     | Type  | Owner | Table  | Size  | Description
> --------+--------------+-------+-------+--------+-------+-------------
>  public | gender_btree | index | olleg | gender | 21 MB |
>  public | gender_hash  | index | olleg | gender | 47 MB |
> (2 rows)
> 
> The hash index not only is more than the btree index, but also is bigger than the table itself. What is wrong with
thehash index?
 

I guess the problem here is that there are so few distinct values, so
all the index items end up in only three hash buckets, forming large
linked lists.

I can't tell off-hand why that would make the index so large though.

Anyway, indexes are pretty useless in such a case.

Is the behavior the same if you have many distinct values?

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com



Re: WTF with hash index?

От
Andreas Kretschmer
Дата:

Am 13.11.2018 um 17:42 schrieb Олег Самойлов:
> insert into gender (gender) select case when random<0.50 then 'female' 
> when random<0.99 then 'male' else 'other' end from (select random() as 
> random, generate_series(1,:table_size)) as subselect;

is that really your intended data distibution? 99% male?

Regards, Andreas
-- 

2ndQuadrant - The PostgreSQL Support Company.
www.2ndQuadrant.com



Re: WTF with hash index?

От
Ron
Дата:
On 11/13/2018 12:07 PM, Andreas Kretschmer wrote:
>
>
> Am 13.11.2018 um 17:42 schrieb Олег Самойлов:
>> insert into gender (gender) select case when random<0.50 then 'female' 
>> when random<0.99 then 'male' else 'other' end from (select random() as 
>> random, generate_series(1,:table_size)) as subselect;
>
> is that really your intended data distibution? 99% male?

select case when random<0.50 then 'female'
when random<0.99 then 'male'
             else 'other' end
from (select random() as random, generate_series(1,:table_size)) as subselect;

Shouldn't that make 49% male?

-- 
Angular momentum makes the world go 'round.


Re: WTF with hash index?

От
Andreas Kretschmer
Дата:

Am 13.11.2018 um 19:12 schrieb Ron:
> On 11/13/2018 12:07 PM, Andreas Kretschmer wrote:
>>
>>
>> Am 13.11.2018 um 17:42 schrieb Олег Самойлов:
>>> insert into gender (gender) select case when random<0.50 then 
>>> 'female' when random<0.99 then 'male' else 'other' end from (select 
>>> random() as random, generate_series(1,:table_size)) as subselect;
>>
>> is that really your intended data distibution? 99% male?
>
> select case when random<0.50 then 'female'
> when random<0.99 then 'male'
>             else 'other' end
> from (select random() as random, generate_series(1,:table_size)) as 
> subselect;
>
> Shouldn't that make 49% male?


you are right, my fault :-(.
the case - statement will be left if the first condition is true.

Regards, Andreas

-- 
2ndQuadrant - The PostgreSQL Support Company.
www.2ndQuadrant.com



Re: WTF with hash index?

От
Ron
Дата:
On 11/13/2018 12:39 PM, Andreas Kretschmer wrote:
>
>
> Am 13.11.2018 um 19:12 schrieb Ron:
>> On 11/13/2018 12:07 PM, Andreas Kretschmer wrote:
>>>
>>>
>>> Am 13.11.2018 um 17:42 schrieb Олег Самойлов:
>>>> insert into gender (gender) select case when random<0.50 then 'female' 
>>>> when random<0.99 then 'male' else 'other' end from (select random() as 
>>>> random, generate_series(1,:table_size)) as subselect;
>>>
>>> is that really your intended data distibution? 99% male?
>>
>> select case when random<0.50 then 'female'
>> when random<0.99 then 'male'
>>             else 'other' end
>> from (select random() as random, generate_series(1,:table_size)) as 
>> subselect;
>>
>> Shouldn't that make 49% male?
>
>
> you are right, my fault :-(.
> the case - statement will be left if the first condition is true.

I had to read it thrice. :)  It's an example of why I like white space and 
indenting...


-- 
Angular momentum makes the world go 'round.


Re: WTF with hash index?

От
Олег Самойлов
Дата:
I am just doing experiment what a type a most suitable for enumeration in PostgreSQL. And what index. And this effect looked for me very strange. There is in the PostgreSQL one another hash index. This is gin(jsonb_path_ops) for the jsob type. It is also use hash internally, but it is much better.
Example based on the previous example.

create table jender (jdoc jsonb);

insert into jender (jdoc) select ('{"gender": "'||gender||'"}')::jsonb from gender;

create index jender_hash on jender using gin (jdoc jsonb_path_ops);

=> \d+
                   List of relations
 Schema |  Name  | Type  | Owner | Size  | Description
--------+--------+-------+-------+-------+-------------
 public | gender | table | olleg | 35 MB |
 public | jender | table | olleg | 54 MB |
(2 rows)

=> \di+
                           List of relations
 Schema |     Name     | Type  | Owner | Table  |  Size   | Description
--------+--------------+-------+-------+--------+---------+-------------
 public | gender_btree | index | olleg | gender | 21 MB   |
 public | gender_hash  | index | olleg | gender | 47 MB   |
 public | jender_hash  | index | olleg | jender | 1104 kB |
(3 rows)

Very much better. What about to copy paste algorithm from gin(jsonb_path_ops) to the hash index?

Re: WTF with hash index?

От
Alvaro Herrera
Дата:
On 2018-Nov-13, Олег Самойлов wrote:

> Very much better. What about to copy paste algorithm from
> gin(jsonb_path_ops) to the hash index?

You're welcome to submit patches.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WTF with hash index?

От
Олег Самойлов
Дата:
Ah, thanks. I am not a developer of PostgreSQL. I am a developer in PostgreSQL. :) And I see two hash indexes on the
samedata and one of them 43 times bigger then other, this looked like something terribly wrong. Just free idea how to
considerablyimprove your product. 

> 13 нояб. 2018 г., в 22:37, Alvaro Herrera <alvherre@2ndquadrant.com> написал(а):
>
> On 2018-Nov-13, Олег Самойлов wrote:
>
>> Very much better. What about to copy paste algorithm from
>> gin(jsonb_path_ops) to the hash index?
>
> You're welcome to submit patches.
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WTF with hash index?

От
Stephen Frost
Дата:
Greetings,

* Олег Самойлов (splarv@ya.ru) wrote:
> Ah, thanks. I am not a developer of PostgreSQL. I am a developer in PostgreSQL. :) And I see two hash indexes on the
samedata and one of them 43 times bigger then other, this looked like something terribly wrong. Just free idea how to
considerablyimprove your product. 

Ultimately, it might be less improvement overall and more of a
regression though, in the common cases.

This use-case isn't common for a hash index.  If we changed all hash
indexes to instead be gin indexes of jsonb values, we'd end up with the
common hash index use-cases being much, much worse.

In the end, this comes down to choosing the right index for your data
and your queries, which is actually a rather difficult problem to handle
in some kind of automated way as there are lots of different trade-offs
to consider.

Thanks!

Stephen

Вложения