Re: Questions about btree_gin vs btree_gist for low cardinality columns

From: Morris de Oryx
Subject: Re: Questions about btree_gin vs btree_gist for low cardinality columns
Date:
Msg-id: CAKqncciUyxY04GagOWYALE7+zOeF9a+AD=gYWV8PvJROw_YN7Q@mail.gmail.com
In reply to: Re: Questions about btree_gin vs btree_gist for low cardinality columns  (Morris de Oryx <morrisdeoryx@gmail.com>)
Responses: Re: Questions about btree_gin vs btree_gist for low cardinality columns  ("Peter J. Holzer" <hjp-pgsql@hjp.at>)
List: pgsql-general
Since I've been wondering about this subject, I figured I'd take a bit of time and try to do some tests. I'm not new to databases or coding, but have been using Postgres for less than two years. I haven't tried to generate large blocks of test data directly in Postgres before, so I'm sure that there are better ways to do what I've done here. No worries, this gave me a chance to work through at least some of the questions/problems in setting up and running tests.

Anyway, I populated a table with 1M rows of data with very little in them: just a two-character state abbreviation. There are only 59 values, and the distribution is fairly even, as I used random() without any tricks to shape the distribution. So, each value is roughly 1/59th of the total row count. Not realistic, but it's what I've got.

For this table, I built four different kinds of index and tried each one out with a count(*) query on a single exact match. I also checked the size of each index.

Headline results:

Partial index: Smaller (as expected), fast.
B-tree index: Big, fast.
GIN: Small, slow.
Hash: Large, slow. ("Large" may be exaggerated in comparison with a B-tree because of my test data.)

I'm wondering how much impact it had that I used such short strings, and how much difference it made that the data was so evenly distributed.

If anyone has any insights, I'd be grateful to hear them. I'm posting the various bits of code involved below for anyone following along at home.

First, my version string as that can make a difference (we deploy on RDS, I develop on macOS):

PostgreSQL 11.3 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit

There's a simple lookup table with the 59 abbreviations (AL,AR,AS,AZ,CA,CO,CT,DC,DE,FL,FM,GA,GU,HI,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MH,MI,MN,MO,MP,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,PR,PW,RI,SC,SD,TN,TX,UT,VA,VI,VT,WA,WI,WV,WY)

BEGIN;
CREATE TABLE IF NOT EXISTS api.state (
    abbr text
);
COMMIT;
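
The population step for the lookup table isn't shown here; a minimal version, splitting the comma-separated list above, might look like this:

BEGIN;
INSERT INTO api.state (abbr)
SELECT regexp_split_to_table('AL,AR,AS,AZ,CA,CO,CT,DC,DE,FL,FM,GA,GU,HI,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MH,MI,MN,MO,MP,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,PR,PW,RI,SC,SD,TN,TX,UT,VA,VI,VT,WA,WI,WV,WY', ',');
COMMIT;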

And here's the test data table definition:

BEGIN;
CREATE TABLE IF NOT EXISTS ascendco.state_test (
    abbr text,
    num integer -- I didn't end up using this.
);
COMMIT;

I wanted to create 1M rows and bashed around with generate_series, recursive CTEs...and didn't get it working. So I wrote a tiny function instead:

/* random() produces a pretty consistent distribution of numbers.
   For ideas on generating other distributions, this piece looks good:
*/

CREATE OR REPLACE FUNCTION api.generate_state_test_rows (loop_max int)
   RETURNS int AS $$

DECLARE
   counter integer := 0;

BEGIN
   IF loop_max < 1 THEN
      RETURN 0;
   END IF;

   -- Runs exactly loop_max times.
   WHILE counter < loop_max LOOP
      counter := counter + 1;

      -- Assumes state_test and state are reachable via the search_path
      -- (they were created in the ascendco and api schemas above).
      INSERT INTO state_test (num, abbr)
      VALUES (
         random() * 1000000,  -- A random number between 0 and 1,000,000.
         (SELECT abbr         -- A random state abbreviation from the tiny lookup table.
            FROM state
        ORDER BY random()
           LIMIT 1)
      );

   END LOOP;

   RETURN 1;
END;

$$ LANGUAGE plpgsql;

The horror. The horror. But it works:

select * from generate_state_test_rows(1000000);
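
For the record, a set-based version with generate_series can do the same job without a loop. Here's a sketch (not from the original post; table names as above). One gotcha: an uncorrelated scalar subquery like (SELECT abbr FROM state ORDER BY random() LIMIT 1) in the SELECT list is evaluated only once, tagging every row with the same state, so this version picks a random array element per row instead:

INSERT INTO state_test (num, abbr)
SELECT (random() * 1000000)::int,
       s.abbrs[1 + floor(random() * s.n)::int]  -- random pick, re-evaluated per row
  FROM generate_series(1, 1000000) AS g,
       (SELECT array_agg(abbr) AS abbrs, count(*) AS n FROM state) AS s;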

Okay, so that's the data set up. Next, the indexes and their sizes:

DROP INDEX IF EXISTS abbr_partial_ma;
DROP INDEX IF EXISTS abbr_btree;
DROP INDEX IF EXISTS abbr_gin;
DROP INDEX IF EXISTS abbr_hash;
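
One note before the GIN index: on a plain scalar text column, GIN has no default operator class, so the btree_gin extension (which the post mentions creating further down) has to be in place first:

CREATE EXTENSION IF NOT EXISTS btree_gin;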

CREATE INDEX abbr_partial_ma  ON state_test(abbr) WHERE abbr = 'MA';
CREATE INDEX abbr_btree ON state_test USING btree (abbr);
CREATE INDEX abbr_gin   ON state_test USING gin (abbr);
CREATE INDEX abbr_hash  ON state_test USING hash (abbr);

select 'Partial' as method, pg_table_size('abbr_partial_ma'), pg_table_size('abbr_partial_ma') / 1024 || ' Kb' as "kb" union all
select 'B tree' as  method, pg_table_size('abbr_btree'),      pg_table_size('abbr_btree') / 1024  || ' Kb' as "kb" union all
select 'GIN'    as  method, pg_table_size('abbr_gin'),        pg_table_size('abbr_gin')   / 1024  || ' Kb' as "kb" union all
select 'Hash'   as  method, pg_table_size('abbr_hash'),       pg_table_size('abbr_hash')  / 1024  || ' Kb' as "kb";

method     pg_table_size        kb
Partial           401408    392 Kb
B tree          22487040  21960 Kb
GIN              1916928   1872 Kb
Hash            49250304  48096 Kb
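
(As an aside, pg_size_pretty() will do the unit conversion for you, e.g. select pg_size_pretty(pg_table_size('abbr_btree'));)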

Okay, so the partial index is smaller, basically proportional to the fraction of the table it's indexing. So that makes sense, and is good to know. The hash index size is...harder to explain...very big. Maybe my tiny strings? As I understand it, Postgres hash indexes store a 4-byte hash code per entry rather than the value itself, and they don't cope well with heavily duplicated keys (long overflow-page chains), so a two-character column with 59 distinct values is likely about worst-case. The B-tree is pretty big.

Okay, timing tests. I was hoping that the GIN would do well, but it didn't. Here are some explain dumps.

B-tree (partial, just on this one value; note that the planner actually chose the full abbr_btree index here rather than abbr_partial_ma)
Aggregate  (cost=389.43..389.44 rows=1 width=8)
  ->  Index Only Scan using abbr_btree on state_test  (cost=0.42..346.68 rows=17100 width=0)
        Index Cond: (abbr = 'MA'::text)

B-tree on whole table
Aggregate  (cost=389.43..389.44 rows=1 width=8)
  ->  Index Only Scan using abbr_btree on state_test  (cost=0.42..346.68 rows=17100 width=0)
        Index Cond: (abbr = 'MA'::text)        
        
GIN (btree_gin, hopefully - I created the extension)
Aggregate  (cost=4867.03..4867.04 rows=1 width=8)
  ->  Bitmap Heap Scan on state_test  (cost=140.53..4824.28 rows=17100 width=0)
        Recheck Cond: (abbr = 'MA'::text)
        ->  Bitmap Index Scan on abbr_gin  (cost=0.00..136.25 rows=17100 width=0)
              Index Cond: (abbr = 'MA'::text)

Hash index              
Aggregate  (cost=4915.00..4915.01 rows=1 width=8)
  ->  Index Scan using abbr_hash on state_test  (cost=0.00..4872.25 rows=17100 width=0)
        Index Cond: (abbr = 'MA'::text)              
        
No index
Finalize Aggregate  (cost=10696.37..10696.38 rows=1 width=8)
  ->  Gather  (cost=10696.15..10696.36 rows=2 width=8)
        Workers Planned: 2
        ->  Partial Aggregate  (cost=9696.15..9696.16 rows=1 width=8)
              ->  Parallel Seq Scan on state_test  (cost=0.00..9678.34 rows=7125 width=0)
                    Filter: (abbr = 'MA'::text)
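
A caveat on reading the plans above: these are planner cost estimates from plain EXPLAIN, not measured run times. Actual timings can be captured like this:

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM state_test WHERE abbr = 'MA';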


On Sat, Jun 1, 2019 at 12:52 PM Morris de Oryx <morrisdeoryx@gmail.com> wrote:
Jeremy's question is great, and really well presented. I can't answer his questions, but I am keenly interested in this subject as well. The links he provides lead to some really interesting and well-thought-out pieces, well worth reading.

I'm going to try restating things in my own way in hopes of getting some good feedback and a basic question:

What are the best ways to index low cardinality values in Postgres?

For an example, imagine an address table with 100M US street addresses with two character state abbreviations. So, say there are around 60 values in there (the USPS is the mail system for a variety of US territories, possessions and friends in the Pacific.) Okay, so what's the best index type for state abbreviation? For the sake of argument, assume a normal distribution so something like FM (Federated States of Micronesia) is on a tail end and CA or NY are a whole lot more common.

A B-tree is obviously a pretty bad match for this sort of situation. It works, but B-trees are ideal for unique values and get super large for repeated values. I'm not getting into the details or the Postgres specifics of the various kinds of traditional B-trees (B*? I think). It doesn't matter: you have a huge index because the index size is closely related to the number of rows, not the number of distinct values.

Alternatively, you could set up partial indexes for the distinct values, an approach described in this piece:

Running 10 Million PostgreSQL Indexes In Production (And Counting)
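
A sketch of that approach for the state example (table and column names here are just illustrative):

CREATE INDEX address_ca ON address (state) WHERE state = 'CA';
CREATE INDEX address_ny ON address (state) WHERE state = 'NY';
-- ...one small partial index per distinct value; the planner considers a
-- given index only when the query's WHERE clause pins state to that value.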

Like Jeremy, I've wondered about GIN indexes for low-cardinality columns. Has anyone tried this out in PG 10 or 11? It sounds like a good idea. As I understand it, GIN indexes are something like a B-tree of unique values that link to another data structure, like a tree, bitmap, etc. So, in my imaginary example, there are 60 nodes for the state codes [internally there would be more for padding free nodes, evenly sized pages, etc....just pretend there are 60] and then linked to that, 60 data structures with the actual row references. Somehow.

I can imagine things going quite well with a GIN or btree_gin. I can also imagine that the secondary data structure could get bloaty, slow, and horrible. I've worked with a system which uses bitmaps as the secondary structure (at least some of the time), and it can work quite well...not sure how it's implemented in Postgres.

So, does anyone have any links or results about efficiently (space and/or time) indexing Boolean and other low-cardinality columns in Postgres? I'm on PG 11, but figure many are on 9.x or 10.x. 

References and results much appreciated. 

Thanks!
