Discussion: How to speed up pg_trgm / gin index scan


How to speed up pg_trgm / gin index scan

From:
Christian Ramseyer
Date:
Hi

I have a pretty large table with syslog messages.

It is already partitioned by month, and for a single month I have e.g.


DM=# \d+ logs_01

    Column    |            Type             |
--------------+-----------------------------+
 host         | character varying(255)      |
 facility     | character varying(10)       |
 priority     | character varying(10)       |
 tag          | character varying(255)      |
 log_date     | timestamp without time zone |
 program      | character varying(255)      |
 msg          | text                        |
 seq          | bigint                      |

Indexes:
    "logs_01_pkey" PRIMARY KEY, btree (seq)
    "idx_logs_01_lower_host" btree (lower(host::text) varchar_pattern_ops)
    "logs_01_date_index" btree (log_date)
    "tridx_logs_01_msg" gin (msg gin_trgm_ops)


DM=# select count(*) from logs_01;
  count
----------
 83052864


I'd like to provide a fast "like %x%" search on the msg column, hence I added a trigram-based gin index on it. It is
around 60 GB on the 35 GB table:


DM=# \dt+ logs_01
                     List of relations
 Schema |  Name   | Type  |  Owner   | Size  | Description
--------+---------+-------+----------+-------+-------------
 public | logs_01 | table | postgres | 35 GB |

DM=# \di+ tridx_logs_01_msg
                               List of relations
 Schema |       Name        | Type  |  Owner   |  Table  | Size  | Description
--------+-------------------+-------+----------+---------+-------+-------------
 public | tridx_logs_01_msg | index | postgres | logs_01 | 58 GB |


A typical query on this table looks like this:

explain analyze
select log_date, host, msg
from logs_01 as log   where  log.msg like '%192.23.33.177%'
    and log.log_date >= '2015-1-18 1:45:24'
    and log.log_date <= '2015-1-19 1:45:24'
    order by log_date asc offset 200 limit 50;


It yields a promising explain that shows that the index is used, but unfortunately the bitmap index scan on the GIN
index takes quite long (40 seconds):

                                                                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=34510.06..34510.06 rows=1 width=195) (actual time=42971.002..42971.015 rows=50 loops=1)
   ->  Sort  (cost=34509.75..34510.06 rows=124 width=195) (actual time=42970.960..42970.990 rows=250 loops=1)
         Sort Key: log_date
         Sort Method: top-N heapsort  Memory: 152kB
         ->  Bitmap Heap Scan on logs_01 log  (cost=34009.21..34505.44 rows=124 width=195) (actual time=42963.969..42969.725 rows=2472 loops=1)
               Recheck Cond: ((msg ~~ '%192.23.33.177%'::text) AND (log_date >= '2015-01-18 01:45:24'::timestamp without time zone) AND (log_date <= '2015-01-19 01:45:24'::timestamp without time zone))
               ->  BitmapAnd  (cost=34009.21..34009.21 rows=124 width=0) (actual time=42962.562..42962.562 rows=0 loops=1)
                     ->  Bitmap Index Scan on tridx_logs_01_msg  (cost=0.00..6992.15 rows=8020 width=0) (actual time=42731.145..42731.145 rows=168489 loops=1)
                           Index Cond: (msg ~~ '%192.23.33.177%'::text)
                     ->  Bitmap Index Scan on logs_01_date_index  (cost=0.00..27016.75 rows=1287939 width=0) (actual time=180.055..180.055 rows=1173048 loops=1)
                           Index Cond: ((log_date >= '2015-01-18 01:45:24'::timestamp without time zone) AND (log_date <= '2015-01-19 01:45:24'::timestamp without time zone))
 Total runtime: 42971.137 ms

(also on http://explain.depesz.com/s/KpaB)


Any good ideas on how I could speed this up a bit?

I have already tried to throw quite a bunch of memory at the problem:

shared_buffers = 64GB
work_mem = 16GB

but it didn't improve between these settings and the 32 GB shared_buffers / 2 GB work_mem I had before.

This is on Postgres 9.1.15 on Linux.

Thanks
Christian


Re: How to speed up pg_trgm / gin index scan

From:
Jeff Janes
Date:
On Mon, Jun 22, 2015 at 4:51 AM, Christian Ramseyer <rc@networkz.ch> wrote:
Hi

I have a pretty large table with syslog messages.

It is already partitioned by month, and for a single month I have e.g.


[...]

A typical query on this table looks like this:

explain analyze
select log_date, host, msg
from logs_01 as log   where  log.msg like '%192.23.33.177%'
    and log.log_date >= '2015-1-18 1:45:24'
    and log.log_date <= '2015-1-19 1:45:24'
    order by log_date asc offset 200 limit 50;

I think that trigram indexes are not well-suited to searching IP addresses.

If the typical query is always an IP address for the LIKE, I think you would want to build an index specifically tailored to that.  You could make a function to parse the IP address out of the msg, and then make a functional index, for example. It would require you to write the query differently.  Whether it would be a btree index or a gin index would depend on whether you can have more than one IP address in a msg.
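To make that concrete, here is a minimal sketch of the functional-index idea (the function name, the regexp, and the choice of a text[] GIN index are my own illustration, not something from this thread; it only handles IPv4 and must be IMMUTABLE to be usable in an index):

CREATE OR REPLACE FUNCTION extract_ips(text) RETURNS text[]
LANGUAGE sql IMMUTABLE STRICT AS $$
    -- collect every IPv4-looking substring in $1 ('{}' if none)
    SELECT array(
        SELECT (regexp_matches($1, '\d{1,3}(?:\.\d{1,3}){3}', 'g'))[1]
    );
$$;

CREATE INDEX idx_logs_01_ips ON logs_01 USING gin (extract_ips(msg));

-- the query then has to match the indexed expression:
SELECT log_date, host, msg
FROM logs_01
WHERE extract_ips(msg) @> ARRAY['192.23.33.177']
  AND log_date >= '2015-01-18 01:45:24'
  AND log_date <= '2015-01-19 01:45:24'
ORDER BY log_date ASC OFFSET 200 LIMIT 50;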
 
Cheers,

Jeff

Re: How to speed up pg_trgm / gin index scan

From:
Christian Ramseyer
Date:
On 22/06/15 19:00, Jeff Janes wrote:

>
>
>     A typical query on this table looks like this:
>
>     explain analyze
>     select log_date, host, msg
>     from logs_01 as log   where  log.msg like '%192.23.33.177%'
>         and log.log_date >= '2015-1-18 1:45:24'
>         and log.log_date <= '2015-1-19 1:45:24'
>         order by log_date asc offset 200 limit 50;
>
>
> I think that trigram indexes are not well-suited to searching IP addresses.
>
> If the typical query is always an IP address for the LIKE, I think you
> would want to build an index specifically tailored to that.  You could
> make a function to parse the IP address out of the msg, and then make a
> functional index, for example. It would require you to write the query
> differently.  Whether it would be a btree index or a gin index would
> depend on whether you can have more than one IP address in a msg.
>

Thanks Jeff, but the IP address was mostly an example... I should have
written this more clearly. Generally the searched string will be a
random short fragment from msg (ip, hostname, some part of an error
message etc.).

It must be matched exactly, including all punctuation etc., so trigrams
look very suitable.

Cheers
Christian





Re: How to speed up pg_trgm / gin index scan

From:
Jaime Casanova
Date:
On Mon, Jun 22, 2015 at 6:51 AM, Christian Ramseyer <rc@networkz.ch> wrote:
> [...]
>
> I'd like to provide a fast "like %x%" search on the msg column, hence
> I added a trigram-based gin index on it. It is around 60 GB on the
> 35 GB table:
>
> DM=# \di+ tridx_logs_01_msg
>                                List of relations
>  Schema |       Name        | Type  |  Owner   |  Table  | Size  | Description
> --------+-------------------+-------+----------+---------+-------+-------------
>  public | tridx_logs_01_msg | index | postgres | logs_01 | 58 GB |
>
>

What version of postgres is this? GIN indexes improved a lot in 9.4,
they use less than half the space and have doubled the speed (on
average).

Now, whatever version you have: a GIN index has two data structures,
the main one in which the index entries are stored as key-value pairs
(please someone correct my description of the situation), and a
pending list, a temporary unsorted list into which all newly inserted
tuples go until a VACUUM (or the pending list growing up to work_mem)
moves them into the main structure.

This happens to avoid the penalty of inserting new rows into the main
structure, which could be expensive. But while the pending list grows,
the speed of the index decreases. And because you have work_mem at
16GB, your pending list is possibly growing without control.

If you have 9.3 or later, you can find out how big that pending list
is by installing pgstattuple:

CREATE EXTENSION pgstattuple;
SELECT * FROM pgstatginindex('tridx_logs_01_msg');

NOTE: remember that pending_pages is expressed in 8 kB pages

If that is the problem, or if you are on 9.2 or earlier, try VACUUMing the table.
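
For example, a quick check plus the knob to consider if the pending
list is in fact the culprit (a sketch; fastupdate is the GIN storage
parameter that makes new entries bypass the pending list, trading
slower writes for more predictable reads):

SELECT pending_pages,
       pg_size_pretty(pending_pages::bigint * 8192) AS pending_size,
       pending_tuples
FROM pgstatginindex('tridx_logs_01_msg');

ALTER INDEX tridx_logs_01_msg SET (fastupdate = off);
VACUUM logs_01;  -- flushes whatever is still pending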

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: 24x7 support and training


Re: How to speed up pg_trgm / gin index scan

From:
Jeff Janes
Date:
On Mon, Jun 22, 2015 at 10:39 AM, Christian Ramseyer <rc@networkz.ch> wrote:
On 22/06/15 19:00, Jeff Janes wrote:

> [...]

Thanks Jeff, but the IP address was mostly an example... I should have
written this more clearly. Generally the searched string will be a
random short fragment from msg (ip, hostname, some part of an error
message etc.).

What kind of timings do you get if you search on a hostname or part of an error message?  Is it slow in general, or just when the thing you search on happens to be an IP address?  

Certainly in my hands, trigram index searches with embedded IP addresses are much worse than searches on embedded natural-language fragments.
 

It must be matched exactly including all punctuation etc, so trigrams
look very suitable.

I believe the default compilation of pg_trgm ignores all punctuation (converting it to whitespace) in the index.  For a LIKE query, it catches this when it rechecks the actual tuple in the heap, so you still get the right answer.  But if the query is mostly punctuation and short numbers, it takes much longer to get that correct answer.
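
A quick way to see what the index actually stores is show_trgm(), which ships with pg_trgm (the exact output depends on the compile-time options that come up later in this thread):

SELECT show_trgm('192.23.33.177');
-- in a default build the dots act as separators, so this yields only
-- trigrams of the short digit groups, none of them very selective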

Since the time range you query over is narrow and the rows are probably well-clustered on it, maybe just using the logs_01_date_index and then filtering the table with the LIKE clause would be faster; the msg||'' in the query below keeps the expression from matching the trigram index:

explain (analyze, buffers)
select log_date, host, msg
from logs_01 as log   where  (log.msg||'') like '%192.23.33.177%'
    and log.log_date >= '2015-1-18 1:45:24'
    and log.log_date <= '2015-1-19 1:45:24'
    order by log_date asc offset 200 limit 50;

Other options would be making your partitions smaller, or using btree_gist to make an index on (log_date, msg gist_trgm_ops).  Unfortunately those indexes can be awfully slow to build.
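
A sketch of that composite index (untested here, the index name is illustrative; btree_gist supplies GiST operator classes for plain scalar types such as timestamp, which lets log_date share one index with the trigram opclass on msg):

CREATE EXTENSION btree_gist;
CREATE INDEX tridx_logs_01_date_msg ON logs_01
    USING gist (log_date, msg gist_trgm_ops);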

Cheers,

Jeff

Re: How to speed up pg_trgm / gin index scan

From:
Oleg Bartunov
Date:
Try 9.4 and you'll be surprised.

1. GIN has compression
2. GIN has fast scan feature.

Oleg

On Mon, Jun 22, 2015 at 7:51 AM, Christian Ramseyer <rc@networkz.ch> wrote:
Hi

I have a pretty large table with syslog messages.

[...]

This is on Postgres 9.1.15 on Linux.

Thanks
Christian

Re: How to speed up pg_trgm / gin index scan

From:
Christian Ramseyer
Date:
On 22/06/15 20:32, Jaime Casanova wrote:

> What version of postgres is this? GIN indexes improved a lot in 9.4,
> they use less than half the space and have doubled the speed (on
> average).
>
> [...]
>

Thanks, these are some interesting details I wasn't aware of. I think
the pending list shouldn't be an issue since I restored this table into
a new installation from backup and it receives no new data at all.

But it is on 9.1.15, so I'll give 9.4 a shot, as apparently I'm
missing out on a lot of stuff.

Christian


Re: How to speed up pg_trgm / gin index scan

From:
Christian Ramseyer
Date:
On 23/06/15 01:30, Oleg Bartunov wrote:
> Try 9.4 and you'll be surprised.
>
> 1. GIN has compression
> 2. GIN has fast scan feature.
>

Dang, and I was so happy that the Enterprise Linux we have to use here
finally had 9.1 with pg_trgm :)

But this sounds too good not to try; I'll get 9.4 installed
and report back with the results.

Thanks
Christian



Re: How to speed up pg_trgm / gin index scan

From:
Christian Ramseyer
Date:

On 22/06/15 13:51, Christian Ramseyer wrote:
> Hi
>
> I have a pretty large table with syslog messages.
>
> [...]
>
> This is on Postgres 9.1.15 on Linux.
>



> Try 9.4 and you'll be surprised.
>
> 1. GIN has compression
> 2. GIN has fast scan feature.
>
> Oleg


Hi Oleg and List

I finally got around to trying 9.4, and it is quite fantastic.

Index size went from 58 GB down to 14 GB:

DM=# \di+ tridx_logs_01_msg
                               List of relations
 Schema |       Name        | Type  |  Owner   |  Table  | Size  | Description
--------+-------------------+-------+----------+---------+-------+-------------
 public | tridx_logs_01_msg | index | postgres | logs_01 | 14 GB |

And the time for the above query went down to about 20 seconds:


DM=# explain analyze
DM-# select log_date, host, msg
DM-# from logs_01 as log   where  log.msg like '%192.23.33.177%'
DM-#     and log.log_date >= '2015-1-18 1:45:24'
DM-#     and log.log_date <= '2015-1-19 1:45:24'
DM-#     order by log_date asc offset 200 limit 50;

                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=28815.06..28815.06 rows=1 width=194) (actual time=19032.099..19032.099 rows=0 loops=1)
   ->  Sort  (cost=28814.74..28815.06 rows=128 width=194) (actual time=19032.093..19032.093 rows=0 loops=1)
         Sort Key: log_date
         Sort Method: quicksort  Memory: 25kB
         ->  Bitmap Heap Scan on logs_01 log  (cost=28298.06..28810.26 rows=128 width=194) (actual time=19031.992..19031.992 rows=0 loops=1)
               Recheck Cond: ((msg ~~ '%192.23.33.177%'::text) AND (log_date >= '2015-01-18 01:45:24'::timestamp without time zone) AND (log_date <= '2015-01-19 01:45:24'::timestamp without time zone))
               ->  BitmapAnd  (cost=28298.06..28298.06 rows=128 width=0) (actual time=19031.983..19031.983 rows=0 loops=1)
                     ->  Bitmap Index Scan on tridx_logs_01_msg  (cost=0.00..508.15 rows=8020 width=0) (actual time=18408.121..18408.121 rows=99 loops=1)
                           Index Cond: (msg ~~ '%192.23.33.177%'::text)
                     ->  Bitmap Index Scan on logs_01_date_index  (cost=0.00..27789.60 rows=1325303 width=0) (actual time=623.084..623.084 rows=1173048 loops=1)
                           Index Cond: ((log_date >= '2015-01-18 01:45:24'::timestamp without time zone) AND (log_date <= '2015-01-19 01:45:24'::timestamp without time zone))
 Planning time: 0.945 ms
 Execution time: 19032.409 ms
(13 rows)

Great stuff! Sorry Oleg, I don't have your original message anymore and
can't reply in the right place in the thread, so I took the liberty of
CCing you.

Christian





Re: How to speed up pg_trgm / gin index scan

From:
Merlin Moncure
Date:
On Tue, Jul 28, 2015 at 8:34 AM, Christian Ramseyer <rc@networkz.ch> wrote:
> On 22/06/15 13:51, Christian Ramseyer wrote:
>> [...]
>
> Hi Oleg and List
>
> I finally got around to trying 9.4, and it is quite fantastic.
>
> Index size went from 58 GB down to 14 GB:
>
> [...]
>
> And the time for the above query went down to about 20 seconds:
>
> [...]
>
> Great stuff! Sorry Oleg, I don't have your original message anymore and
> can't reply in the right place in the thread, so I took the liberty of
> CCing you.

There are some more big optimizations (via Jeff Janes) coming down the
pike for trigram searching.  See thread:
http://www.postgresql.org/message-id/CAMkU=1woR_Pdmie6d-zj6sDOPiHd_iUe3vZSXFGe_i4-AQYsJQ@mail.gmail.com

I think it should be possible to patch the 9.4 pg_trgm module with
Jeff's stuff -- it might be worthwhile to do that and run some tests
and report back.  I don't know if they address your particular case
but in some situations the speedups are really dramatic.

merlin


Re: How to speed up pg_trgm / gin index scan

From:
Arthur Silva
Date:
On Tue, Jul 28, 2015 at 10:34 AM, Christian Ramseyer <rc@networkz.ch> wrote:


On 22/06/15 13:51, Christian Ramseyer wrote:
> [...]

Hi Oleg and List

I finally got around to trying 9.4, and it is quite fantastic.

Index size went from 58 GB down to 14 GB:

[...]

And the time for the above query went down to about 20 seconds:

[...]

Christian

You could experiment with recompiling pg_trgm with the KEEPONLYALNUM and/or IGNORECASE definitions commented out if you are looking for exact matches; this will increase the index size but will make it more selective.
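
For example, against a 9.4 source tree the two defines live in contrib/pg_trgm/trgm.h; a rough sketch of the procedure (paths are illustrative, the index name is the one used earlier in this thread):

-- 1. in contrib/pg_trgm/trgm.h, comment out the behaviour to drop:
--        /* #define IGNORECASE */
--        /* #define KEEPONLYALNUM */
-- 2. run "make && make install" in contrib/pg_trgm
-- 3. rebuild the index so it picks up the new trigram extraction:
REINDEX INDEX tridx_logs_01_msg;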

Also, there's a thread around for pg_trgm 1.2 which will get you an even bigger boost.

--
Arthur Silva

Re: How to speed up pg_trgm / gin index scan

From:
Christian Ramseyer
Date:
On 28/07/15 16:42, Merlin Moncure wrote:
>>
>> Great stuff! Sorry Oleg I don't have your original message anymore and
>> can't reply into the right place in the thread, so I took the liberty to
>> CC: you.
>
> There are some more big optimizations (via Jeff Janes) coming down the
> pike for trigram searching.  See thread:
> http://www.postgresql.org/message-id/CAMkU=1woR_Pdmie6d-zj6sDOPiHd_iUe3vZSXFGe_i4-AQYsJQ@mail.gmail.com
>
> I think it should be possible to patch the 9.4 pg_trgm module with
> Jeff's stuff -- it might be worthwhile to do that and run some tests
> and report back.  I don't know if they address your particular case
> but in some situations the speedups are really dramatic.
>
> merlin
>

On 28/07/15 16:45, Arthur Silva wrote:
> You could experiment with recompiling pg_trgm with the KEEPONLYALNUM
> and/or IGNORECASE definitions commented out if you are looking for
> exact matches; this will increase the index size but will make it
> more selective.
>
> Also, there's a thread around for pg_trgm 1.2 which will get you an
> even bigger boost.


Thanks for the hints and all the hard work you guys are putting into
this. I'll follow the further development closely and report back if we
get any new breakthroughs with this rather big data set.

Christian