Re: Questions about btree_gin vs btree_gist for low cardinality columns

From: Morris de Oryx
Subject: Re: Questions about btree_gin vs btree_gist for low cardinality columns
Date:
Msg-id: CAKqncciUyxY04GagOWYALE7+zOeF9a+AD=gYWV8PvJROw_YN7Q@mail.gmail.com
In reply to: Re: Questions about btree_gin vs btree_gist for low cardinality columns  (Morris de Oryx <morrisdeoryx@gmail.com>)
Responses: Re: Questions about btree_gin vs btree_gist for low cardinality columns  ("Peter J. Holzer" <hjp-pgsql@hjp.at>)
List: pgsql-general
Since I've been wondering about this subject, I figured I'd take a bit of time and try to do some tests. I'm not new to databases or coding, but have been using Postgres for less than two years. I haven't tried to generate large blocks of test data directly in Postgres before, so I'm sure that there are better ways to do what I've done here. No worries, this gave me a chance to work through at least some of the questions/problems in setting up and running tests.

Anyway, I populated a table with 1M rows of data with very little in them: just a two-character state abbreviation. There are only 59 values, and the distribution is fairly even, as I used random() without any tricks to shape the distribution. So, each value is roughly 1/59th of the total row count. Not realistic, but it's what I've got.

For this table, I built four different kinds of index and tried each one out with a count(*) query on a single exact match. I also checked the size of each index.

Headline results:

Partial index: Smaller (as expected), fast.
B-tree index: Big, fast.
GIN: Small, slow.
Hash: Large, slow. ("Large" may be exaggerated in comparison with a B-tree because of my test data.)

I'm wondering how much impact it had that I used such short strings, and how much difference it made that the data was so evenly distributed.

If anyone has any insights, I'd be grateful to hear them. I'm posting the various bits of code involved below for anyone following along at home.

First, my version string as that can make a difference (we deploy on RDS, I develop on macOS):

PostgreSQL 11.3 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit

There's a simple lookup table with the 59 abbreviations (AL,AR,AS,AZ,CA,CO,CT,DC,DE,FL,FM,GA,GU,HI,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MH,MI,MN,MO,MP,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,PR,PW,RI,SC,SD,TN,TX,UT,VA,VI,VT,WA,WI,WV,WY)

BEGIN;
CREATE TABLE IF NOT EXISTS api.state (
    abbr text
);
COMMIT;
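
The population step for the lookup table isn't shown here; a minimal version, splitting the comma-separated list above, might look like this:

BEGIN;
INSERT INTO api.state (abbr)
SELECT regexp_split_to_table('AL,AR,AS,AZ,CA,CO,CT,DC,DE,FL,FM,GA,GU,HI,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MH,MI,MN,MO,MP,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,PR,PW,RI,SC,SD,TN,TX,UT,VA,VI,VT,WA,WI,WV,WY', ',');
COMMIT;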

And here's the test data table definition:

BEGIN;
CREATE TABLE IF NOT EXISTS ascendco.state_test (
    abbr text,
    num integer -- I didn't end up using this.
);
COMMIT;

I wanted to create 1M rows and bashed around with generate_series, recursive CTEs...and didn't get it working. So I wrote a tiny function instead:

/* random() produces a pretty consistent distribution of numbers.
   For ideas on generating other distributions, this piece looks good:
*/

CREATE OR REPLACE FUNCTION api.generate_state_test_rows (loop_max int)
   RETURNS int AS $$

DECLARE
   counter integer := 0;

BEGIN
   IF loop_max < 1 THEN
      RETURN 0;
   END IF;

   -- Runs exactly loop_max times.
   WHILE counter < loop_max LOOP
      counter := counter + 1;

      -- Assumes state_test and state are reachable via the search_path
      -- (they were created in the ascendco and api schemas above).
      INSERT INTO state_test (num, abbr)
      VALUES (
         random() * 1000000,  -- A random number between 0 and 1,000,000.
         (SELECT abbr         -- A random state abbreviation from the tiny lookup table.
            FROM state
        ORDER BY random()
           LIMIT 1)
      );

   END LOOP;

   RETURN 1;
END;

$$ LANGUAGE plpgsql;

The horror. The horror. But it works:

select * from generate_state_test_rows(1000000);
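
For the record, a set-based version with generate_series can do the same job without a loop. Here's a sketch (not from the original post; table names as above). One gotcha: an uncorrelated scalar subquery like (SELECT abbr FROM state ORDER BY random() LIMIT 1) in the SELECT list is evaluated only once, tagging every row with the same state, so this version picks a random array element per row instead:

INSERT INTO state_test (num, abbr)
SELECT (random() * 1000000)::int,
       s.abbrs[1 + floor(random() * s.n)::int]  -- random pick, re-evaluated per row
  FROM generate_series(1, 1000000) AS g,
       (SELECT array_agg(abbr) AS abbrs, count(*) AS n FROM state) AS s;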

Okay, so that's the data set up. Next, the indexes and their sizes:

DROP INDEX IF EXISTS abbr_partial_ma;
DROP INDEX IF EXISTS abbr_btree;
DROP INDEX IF EXISTS abbr_gin;
DROP INDEX IF EXISTS abbr_hash;
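
One note before the GIN index: on a plain scalar text column, GIN has no default operator class, so the btree_gin extension (which the post mentions creating further down) has to be in place first:

CREATE EXTENSION IF NOT EXISTS btree_gin;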

CREATE INDEX abbr_partial_ma  ON state_test(abbr) WHERE abbr = 'MA';
CREATE INDEX abbr_btree ON state_test USING btree (abbr);
CREATE INDEX abbr_gin   ON state_test USING gin (abbr);
CREATE INDEX abbr_hash  ON state_test USING hash (abbr);

select 'Partial' as method, pg_table_size('abbr_partial_ma'), pg_table_size('abbr_partial_ma') / 1024 || ' Kb' as "kb" union all
select 'B tree' as  method, pg_table_size('abbr_btree'),      pg_table_size('abbr_btree') / 1024  || ' Kb' as "kb" union all
select 'GIN'    as  method, pg_table_size('abbr_gin'),        pg_table_size('abbr_gin')   / 1024  || ' Kb' as "kb" union all
select 'Hash'   as  method, pg_table_size('abbr_hash'),       pg_table_size('abbr_hash')  / 1024  || ' Kb' as "kb";

method     pg_table_size        kb
Partial           401408    392 Kb
B tree          22487040  21960 Kb
GIN              1916928   1872 Kb
Hash            49250304  48096 Kb
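
(As an aside, pg_size_pretty() will do the unit conversion for you, e.g. select pg_size_pretty(pg_table_size('abbr_btree'));)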

Okay, so the partial index is smaller, basically proportional to the fraction of the table it's indexing. So that makes sense, and is good to know. The hash index size is...harder to explain...very big. Maybe my tiny strings? As I understand it, Postgres hash indexes store a 4-byte hash code per entry rather than the value itself, and they don't cope well with heavily duplicated keys (long overflow-page chains), so a two-character column with 59 distinct values is likely about worst-case. The B-tree is pretty big.

Okay, timing tests. I was hoping that the GIN would do well, but it didn't. Here are some explain dumps.

B-tree (partial, just on this one value; note that the planner actually chose the full abbr_btree index here rather than abbr_partial_ma)
Aggregate  (cost=389.43..389.44 rows=1 width=8)
  ->  Index Only Scan using abbr_btree on state_test  (cost=0.42..346.68 rows=17100 width=0)
        Index Cond: (abbr = 'MA'::text)

B-tree on whole table
Aggregate  (cost=389.43..389.44 rows=1 width=8)
  ->  Index Only Scan using abbr_btree on state_test  (cost=0.42..346.68 rows=17100 width=0)
        Index Cond: (abbr = 'MA'::text)        
        
GIN (btree_gin, hopefully - I created the extension)
Aggregate  (cost=4867.03..4867.04 rows=1 width=8)
  ->  Bitmap Heap Scan on state_test  (cost=140.53..4824.28 rows=17100 width=0)
        Recheck Cond: (abbr = 'MA'::text)
        ->  Bitmap Index Scan on abbr_gin  (cost=0.00..136.25 rows=17100 width=0)
              Index Cond: (abbr = 'MA'::text)

Hash index              
Aggregate  (cost=4915.00..4915.01 rows=1 width=8)
  ->  Index Scan using abbr_hash on state_test  (cost=0.00..4872.25 rows=17100 width=0)
        Index Cond: (abbr = 'MA'::text)              
        
No index
Finalize Aggregate  (cost=10696.37..10696.38 rows=1 width=8)
  ->  Gather  (cost=10696.15..10696.36 rows=2 width=8)
        Workers Planned: 2
        ->  Partial Aggregate  (cost=9696.15..9696.16 rows=1 width=8)
              ->  Parallel Seq Scan on state_test  (cost=0.00..9678.34 rows=7125 width=0)
                    Filter: (abbr = 'MA'::text)
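
A caveat on reading the plans above: these are planner cost estimates from plain EXPLAIN, not measured run times. Actual timings can be captured like this:

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM state_test WHERE abbr = 'MA';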


On Sat, Jun 1, 2019 at 12:52 PM Morris de Oryx <morrisdeoryx@gmail.com> wrote:
Jeremy's question is great, and really well presented. I can't answer his questions, but I am keenly interested in this subject as well. The links he provides lead to some really interesting and well-thought-out pieces, well worth reading.

I'm going to try restating things in my own way in hopes of getting some good feedback and a basic question:

What are the best ways to index low cardinality values in Postgres?

For an example, imagine an address table with 100M US street addresses with two character state abbreviations. So, say there are around 60 values in there (the USPS is the mail system for a variety of US territories, possessions and friends in the Pacific.) Okay, so what's the best index type for state abbreviation? For the sake of argument, assume a normal distribution so something like FM (Federated States of Micronesia) is on a tail end and CA or NY are a whole lot more common.

A B-tree is obviously a pretty bad match for this sort of situation. It works, but B-trees are ideal for unique values and get super large for repeated values. I'm not getting into the details or the Postgres specifics of the various kinds of traditional B-trees (B*? I think). It doesn't matter: you have a huge index because the index size is closely related to the number of rows, not the number of distinct values.

Alternatively, you could set up partial indexes for the distinct values, an approach described in this piece:

Running 10 Million PostgreSQL Indexes In Production (And Counting)
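
A sketch of that approach for the state example (table and column names here are just illustrative):

CREATE INDEX address_ca ON address (state) WHERE state = 'CA';
CREATE INDEX address_ny ON address (state) WHERE state = 'NY';
-- ...one small partial index per distinct value; the planner considers a
-- given index only when the query's WHERE clause pins state to that value.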

Like Jeremy, I've wondered about GIN indexes for low-cardinality columns. Has anyone tried this out in PG 10 or 11? It sounds like a good idea. As I understand it, GIN indexes are something like a B-tree of unique values that link to another data structure, like a tree, bitmap, etc. So, in my imaginary example, there are 60 nodes for the state codes [internally there would be more for padding free nodes, evenly sized pages, etc....just pretend there are 60] and then linked to that, 60 data structures with the actual row references. Somehow.

I can imagine things going quite well with a GIN or btree_gin. I can also imagine that the secondary data structure could get bloaty, slow, and horrible. I've worked with a system which uses bitmaps as the secondary structure (at least some of the time), and it can work quite well...not sure how it's implemented in Postgres.

So, does anyone have any links or results about efficiently (space and/or time) indexing Boolean and other low-cardinality columns in Postgres? I'm on PG 11, but figure many are on 9.x or 10.x. 

References and results much appreciated. 

Thanks!
