Discussion: How to speed up pg_trgm / gin index scan
Hi

I have a pretty large table with syslog messages.

It is already partitioned by month, and for a single month I have e.g.

DM=# \d+ logs_01

Column | Type |
--------------+-----------------------------+
host | character varying(255) |
facility | character varying(10) |
priority | character varying(10) |
tag | character varying(255) |
log_date | timestamp without time zone |
program | character varying(255) |
msg | text |
seq | bigint |

Indexes:
"logs_01_pkey" PRIMARY KEY, btree (seq)
"idx_logs_01_lower_host" btree (lower(host::text) varchar_pattern_ops)
"logs_01_date_index" btree (log_date)
"tridx_logs_01_msg" gin (msg gin_trgm_ops)

DM=# select count(*) from logs_01;
count
----------
83052864

I'd like to provide a fast "like %x%" search on the msg column, hence I added a trigram-based GIN index on it. It is around 60 GB on the 35 GB table:

DM=# \dt+ logs_01
List of relations
Schema | Name | Type | Owner | Size | Description
--------+---------+-------+----------+-------+-------------
public | logs_01 | table | postgres | 35 GB |

DM=# \di+ tridx_logs_01_msg
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+-------------------+-------+----------+---------+-------+-------------
public | tridx_logs_01_msg | index | postgres | logs_01 | 58 GB |

A typical query on this table looks like this:

explain analyze
select log_date, host, msg
from logs_01 as log where log.msg like '%192.23.33.177%'
and log.log_date >= '2015-1-18 1:45:24'
and log.log_date <= '2015-1-19 1:45:24'
order by log_date asc offset 200 limit 50;

It yields a promising explain that shows that the index is used, but unfortunately the bitmap index scan on the GIN index takes quite long (40 seconds):

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=34510.06..34510.06 rows=1 width=195) (actual time=42971.002..42971.015 rows=50 loops=1)
-> Sort (cost=34509.75..34510.06 rows=124 width=195) (actual time=42970.960..42970.990 rows=250 loops=1)
Sort Key: log_date
Sort Method: top-N heapsort Memory: 152kB
-> Bitmap Heap Scan on logs_01 log (cost=34009.21..34505.44 rows=124 width=195) (actual time=42963.969..42969.725 rows=2472 loops=1)
Recheck Cond: ((msg ~~ '%192.23.33.177%'::text) AND (log_date >= '2015-01-18 01:45:24'::timestamp without time zone) AND (log_date <= '2015-01-19 01:45:24'::timestamp without time zone))
-> BitmapAnd (cost=34009.21..34009.21 rows=124 width=0) (actual time=42962.562..42962.562 rows=0 loops=1)
-> Bitmap Index Scan on tridx_logs_01_msg (cost=0.00..6992.15 rows=8020 width=0) (actual time=42731.145..42731.145 rows=168489 loops=1)
Index Cond: (msg ~~ '%192.23.33.177%'::text)
-> Bitmap Index Scan on logs_01_date_index (cost=0.00..27016.75 rows=1287939 width=0) (actual time=180.055..180.055 rows=1173048 loops=1)
Index Cond: ((log_date >= '2015-01-18 01:45:24'::timestamp without time zone) AND (log_date <= '2015-01-19 01:45:24'::timestamp without time zone))
Total runtime: 42971.137 ms

(also on http://explain.depesz.com/s/KpaB)

Any good ideas on how I could speed this up a bit?

I have already tried to throw quite a bunch of memory at the problem:

shared_buffers = 64GB
work_mem = 16GB

but it made no difference compared to the 32 GB shared_buffers / 2 GB work_mem I had before.

This is on Postgres 9.1.15 on Linux.

Thanks
Christian
On Mon, Jun 22, 2015 at 4:51 AM, Christian Ramseyer <rc@networkz.ch> wrote:
> Hi
>
> I have a pretty large table with syslog messages.
>
> It is already partitioned by month, and for a single month I have e.g.
>
> DM=# \d+ logs_01
>
> Column | Type |
> --------------+-----------------------------+
> host | character varying(255) |
> facility | character varying(10) |
> priority | character varying(10) |
> tag | character varying(255) |
> log_date | timestamp without time zone |
> program | character varying(255) |
> msg | text |
> seq | bigint |
>
> Indexes:
> "logs_01_pkey" PRIMARY KEY, btree (seq)
> "idx_logs_01_lower_host" btree (lower(host::text) varchar_pattern_ops)
> "logs_01_date_index" btree (log_date)
> "tridx_logs_01_msg" gin (msg gin_trgm_ops)
>
> DM=# select count(*) from logs_01;
> count
> ----------
> 83052864
>
> ...
>
> A typical query on this table looks like this:
>
> explain analyze
> select log_date, host, msg
> from logs_01 as log where log.msg like '%192.23.33.177%'
> and log.log_date >= '2015-1-18 1:45:24'
> and log.log_date <= '2015-1-19 1:45:24'
> order by log_date asc offset 200 limit 50;
I think that trigram indexes are not well-suited to searching IP addresses.
If the typical query is always an IP address for the LIKE, I think you would want to build an index specifically tailored to that. You could make a function to parse the IP address out of the msg, and then make a functional index, for example. It would require you to write the query differently. Whether it would be a btree index or a gin index would depend on whether you can have more than one IP address in a msg.
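A sketch of what that suggestion could look like, not from the thread itself: the function name first_ip, the regexp, and the assumption of at most one interesting IP per msg are all invented here for illustration.

```sql
-- Hypothetical helper to pull the first IPv4-looking token out of msg;
-- it must be IMMUTABLE to be usable in an index expression.
CREATE FUNCTION first_ip(msg text) RETURNS text AS $$
  SELECT (regexp_matches(msg, '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'))[1]
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX idx_logs_01_first_ip ON logs_01 (first_ip(msg));

-- The query has to use the same expression to hit the index:
SELECT log_date, host, msg
  FROM logs_01
 WHERE first_ip(msg) = '192.23.33.177'
   AND log_date >= '2015-01-18 01:45:24'
   AND log_date <= '2015-01-19 01:45:24'
 ORDER BY log_date;
```

With more than one address per msg, a text[]-returning extractor with a GIN index on it would be the analogous shape.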
Cheers,
Jeff
On 22/06/15 19:00, Jeff Janes wrote:
> I think that trigram indexes are not well-suited to searching IP addresses.
>
> If the typical query is always an IP address for the LIKE, I think you
> would want to build an index specifically tailored to that. You could
> make a function to parse the IP address out of the msg, and then make a
> functional index, for example. It would require you to write the query
> differently. Whether it would be a btree index or a gin index would
> depend on whether you can have more than one IP address in a msg.

Thanks Jeff, but the IP address was mostly an example... I should have
written this more clearly. Generally the searched string will be a
random short fragment from msg (ip, hostname, some part of an error
message etc.). It must be matched exactly including all punctuation etc,
so trigrams look very suitable.

Cheers
Christian
On Mon, Jun 22, 2015 at 6:51 AM, Christian Ramseyer <rc@networkz.ch> wrote:
> I'd like to provide a fast "like %x%" search on the msg column, hence I
> added a trigram based gin index on it. It is around 60 GB on the 35 GB table:
>
> DM=# \di+ tridx_logs_01_msg
> List of relations
> Schema | Name | Type | Owner | Table | Size | Description
> --------+-------------------+-------+----------+---------+-------+-------------
> public | tridx_logs_01_msg | index | postgres | logs_01 | 58 GB |

What version of postgres is this? GIN indexes improved a lot in 9.4: they use less than half the space and are roughly twice as fast (on average).

Now, whatever version you have: a GIN index has two data structures, the main one, in which the index entries are stored as key-value pairs (please someone correct my description of the situation), and a pending list, a temporary unsorted list of pending entries into which all newly inserted tuples go until a VACUUM (or the pending list growing up to work_mem) moves that list into the main structure.

This is done to avoid the penalty of inserting new rows into the main structure, which can be expensive. But while the pending list grows, scans of the index get slower. And because you have work_mem at 16 GB, your pending list is possibly growing without control.

If you are on 9.3 or later you can see how big that pending list is by installing pgstattuple:

CREATE EXTENSION pgstattuple;
SELECT * FROM pgstatginindex('tridx_logs_01_msg');

NOTE: remember that pending_pages is expressed in 8 kB pages.

If that is the problem, or if you are on <= 9.2, then try VACUUMing the table.

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
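Since pending_pages is counted in 8 kB pages, the pending-list size in bytes can be read off the same pgstatginindex() call, for example:

```sql
-- pending_pages is in 8 kB pages; multiply out for bytes
SELECT pending_pages,
       pending_pages * 8192 AS pending_bytes,
       pending_tuples
  FROM pgstatginindex('tridx_logs_01_msg');
```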
On Mon, Jun 22, 2015 at 10:39 AM, Christian Ramseyer <rc@networkz.ch> wrote:
> On 22/06/15 19:00, Jeff Janes wrote:
>>
>> A typical query on this table looks like this:
>>
>> explain analyze
>> select log_date, host, msg
>> from logs_01 as log where log.msg like '%192.23.33.177%'
>> and log.log_date >= '2015-1-18 1:45:24'
>> and log.log_date <= '2015-1-19 1:45:24'
>> order by log_date asc offset 200 limit 50;
>>
>> I think that trigram indexes are not well-suited to searching IP addresses.
>>
>> If the typical query is always an IP address for the LIKE, I think you
>> would want to build an index specifically tailored to that. You could
>> make a function to parse the IP address out of the msg, and then make a
>> functional index, for example. It would require you to write the query
>> differently. Whether it would be a btree index or a gin index would
>> depend on whether you can have more than one IP address in a msg.
>
> Thanks Jeff, but the IP address was mostly an example... I should have
> written this more clearly. Generally the searched string will be a
> random short fragment from msg (ip, hostname, some part of an error
> message etc.).

What kind of timings do you get if you search on a hostname or part of an error message? Is it slow in general, or just when the thing you search on happens to be an IP address?

Certainly in my hands, trigram index searching with embedded IP addresses is much worse than on embedded natural-language fragments.

> It must be matched exactly including all punctuation etc, so trigrams
> look very suitable.

I believe the default compilation of pg_trgm ignores all punctuation (converting it to whitespace) in the index. For a LIKE query, it catches the punctuation when it rechecks the actual tuple in the heap, so you still get the right answer. But if the query is mostly punctuation and short numbers, it takes much longer to get that correct answer.
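pg_trgm's show_trgm() function makes this visible. The transcript below is sketched from how a default build behaves, not run against this server; the dots vanish before indexing, leaving only short digit groups that are shared with countless other log lines:

```sql
DM=# select show_trgm('192.23.33.177');
                                    show_trgm
---------------------------------------------------------------------------------
 {"  1","  2","  3"," 17"," 19"," 23"," 33","177","192","23 ","33 ","77 ","92 "}
```

Note that no trigram contains a '.', and most are just one- or two-digit fragments, so the bitmap index scan has to union very large posting lists before the heap recheck can discard the false matches.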
Since the time range you query over is narrow and the rows are probably well-clustered on it, it might be faster to use just the logs_01_date_index and filter the fetched rows with the LIKE clause. Appending ||'' to msg keeps the planner from matching the expression to the trigram index:
select log_date, host, msg
from logs_01 as log where (log.msg||'') like '%192.23.33.177%'
and log.log_date >= '2015-1-18 1:45:24'
and log.log_date <= '2015-1-19 1:45:24'
order by log_date asc offset 200 limit 50;
Other options would be making your partitions smaller, or using btree_gist and making an index on (log_date, msg gist_trgm_ops). Unfortunately those indexes can be awfully slow to build.
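A sketch of that second idea, assuming the pg_trgm and btree_gist extensions are both available (untested against this table, and on 83M rows the build will take a while):

```sql
CREATE EXTENSION btree_gist;  -- supplies GiST opclasses for scalar types like timestamp

-- One composite GiST index that can serve both predicates in a single scan:
CREATE INDEX tridx_logs_01_date_msg
    ON logs_01 USING gist (log_date, msg gist_trgm_ops);
```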
Cheers,
Jeff
Try 9.4 and you'll be surprised.
1. GIN has compression
2. GIN has fast scan feature.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 22/06/15 20:32, Jaime Casanova wrote:
> Now, whatever version you have; a GIN index has two data structures,
> the main one in which the index entries are stored as key-value pairs
> and a pending list, which is a temporary unsorted list of pending
> entries in which all the newly inserted tuples arrive until a VACUUM
> (or until the pending list grows up to work_mem) moves that list into
> the main structure.
>
> if that is the problem or if you are in <= 9.2 then try VACUUM the table

Thanks, these are some interesting details I wasn't aware of. I think the pending list shouldn't be an issue, since I restored this table into a new installation from backup and it receives no new data at all. But it is on 9.1.15, so I'll give 9.4 a shot, as apparently I'm missing out on a lot of stuff.

Christian
On 23/06/15 01:30, Oleg Bartunov wrote:
> Try 9.4 and you'll be surprised.
>
> 1. GIN has compression
> 2. GIN has fast scan feature.

Dang, and I was so happy that the Enterprise Linux we have to use here finally had 9.1 with pg_trgm :) But this sounds too good not to try it; I'll get 9.4 installed and report back with the results.

Thanks
Christian
On 22/06/15 13:51, Christian Ramseyer wrote:
> I have a pretty large table with syslog messages.
>
> I'd like to provide a fast "like %x%" search on the msg column, hence I
> added a trigram based gin index on it. It is around 60 GB on the 35 GB table:
>
> This is on Postgres 9.1.15 on Linux.

> Try 9.4 and you'll be surprised.
>
> 1. GIN has compression
> 2. GIN has fast scan feature.
>
> Oleg

Hi Oleg and List

I finally got around to trying 9.4, and it is quite fantastic.

Index size went from 58 GB down to 14 GB:

DM=# \di+ tridx_logs_01_msg
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+-------------------+-------+----------+---------+-------+-------------
public | tridx_logs_01_msg | index | postgres | logs_01 | 14 GB |

And the time for the above query went down to about 20 seconds:

DM=# explain analyze
DM-# select log_date, host, msg
DM-# from logs_01 as log where log.msg like '%192.23.33.177%'
DM-# and log.log_date >= '2015-1-18 1:45:24'
DM-# and log.log_date <= '2015-1-19 1:45:24'
DM-# order by log_date asc offset 200 limit 50;

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=28815.06..28815.06 rows=1 width=194) (actual time=19032.099..19032.099 rows=0 loops=1)
-> Sort (cost=28814.74..28815.06 rows=128 width=194) (actual time=19032.093..19032.093 rows=0 loops=1)
Sort Key: log_date
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on logs_01 log (cost=28298.06..28810.26 rows=128 width=194) (actual time=19031.992..19031.992 rows=0 loops=1)
Recheck Cond: ((msg ~~ '%192.23.33.177%'::text) AND (log_date >= '2015-01-18 01:45:24'::timestamp without time zone) AND (log_date <= '2015-01-19 01:45:24'::timestamp without time zone))
-> BitmapAnd (cost=28298.06..28298.06 rows=128 width=0) (actual time=19031.983..19031.983 rows=0 loops=1)
-> Bitmap Index Scan on tridx_logs_01_msg (cost=0.00..508.15 rows=8020 width=0) (actual time=18408.121..18408.121 rows=99 loops=1)
Index Cond: (msg ~~ '%192.23.33.177%'::text)
-> Bitmap Index Scan on logs_01_date_index (cost=0.00..27789.60 rows=1325303 width=0) (actual time=623.084..623.084 rows=1173048 loops=1)
Index Cond: ((log_date >= '2015-01-18 01:45:24'::timestamp without time zone) AND (log_date <= '2015-01-19 01:45:24'::timestamp without time zone))
Planning time: 0.945 ms
Execution time: 19032.409 ms
(13 rows)

Great stuff! Sorry Oleg, I don't have your original message anymore and can't reply in the right place in the thread, so I took the liberty to CC: you.

Christian

On Tue, Jul 28, 2015 at 8:34 AM, Christian Ramseyer <rc@networkz.ch> wrote:
> I finally got around to trying 9.4, and it is quite fantastic.
>
> Index size went from 58 GB down to 14 GB, and the time for the above
> query went down to about 20 seconds.

There are some more big optimizations (via Jeff Janes) coming down the pike for trigram searching. See thread: http://www.postgresql.org/message-id/CAMkU=1woR_Pdmie6d-zj6sDOPiHd_iUe3vZSXFGe_i4-AQYsJQ@mail.gmail.com. I think it should be possible to patch the 9.4 pg_trgm module with Jeff's stuff -- it might be worthwhile to do that and run some tests and report back. I don't know if they address your particular case but in some situations the speedups are really dramatic.

merlin
>
> shared_buffers = 64GB
> work_mem = 16GB
>
> but it didn't improve compared to the 32 GB shared_buffers / 2 GB work_mem I had before.
>
> This is on Postgres 9.1.15 on Linux.
>
>
> 1. GIN has compression
> 2. GIN has fast scan feature.
>
> Oleg
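For readers reproducing the setup described above: the trigram index from the original post would be created roughly as follows. This is a sketch using the table and column names quoted in this thread; it assumes the pg_trgm extension is available, and building the index on 83M rows will take considerable time.

```sql
-- Enable trigram support (contrib module, available since 9.1 via
-- CREATE EXTENSION) and build the GIN trigram index on the msg column.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX tridx_logs_01_msg ON logs_01 USING gin (msg gin_trgm_ops);
```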
Hi Oleg and List
I finally got around to try 9.4, and it is quite fantastic.
Index size went from 58 GB down to 14 GB:
DM=# \di+ tridx_logs_01_msg
List of relations
Schema | Name | Type | Owner | Table | Size |
Description
--------+-------------------+-------+----------+---------+-------+-------------
public | tridx_logs_01_msg | index | postgres | logs_01 | 14 GB |
And the time for the above query went down to about 20 seconds:
DM=# explain analyze
DM-# select log_date, host, msg
DM-# from logs_01 as log where log.msg like '%192.23.33.177%'
DM-# and log.log_date >= '2015-1-18 1:45:24'
DM-# and log.log_date <= '2015-1-19 1:45:24'
DM-# order by log_date asc offset 200 limit 50;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=28815.06..28815.06 rows=1 width=194) (actual time=19032.099..19032.099 rows=0 loops=1)
   ->  Sort  (cost=28814.74..28815.06 rows=128 width=194) (actual time=19032.093..19032.093 rows=0 loops=1)
         Sort Key: log_date
         Sort Method: quicksort  Memory: 25kB
         ->  Bitmap Heap Scan on logs_01 log  (cost=28298.06..28810.26 rows=128 width=194) (actual time=19031.992..19031.992 rows=0 loops=1)
               Recheck Cond: ((msg ~~ '%192.23.33.177%'::text) AND (log_date >= '2015-01-18 01:45:24'::timestamp without time zone) AND (log_date <= '2015-01-19 01:45:24'::timestamp without time zone))
               ->  BitmapAnd  (cost=28298.06..28298.06 rows=128 width=0) (actual time=19031.983..19031.983 rows=0 loops=1)
                     ->  Bitmap Index Scan on tridx_logs_01_msg  (cost=0.00..508.15 rows=8020 width=0) (actual time=18408.121..18408.121 rows=99 loops=1)
                           Index Cond: (msg ~~ '%192.23.33.177%'::text)
                     ->  Bitmap Index Scan on logs_01_date_index  (cost=0.00..27789.60 rows=1325303 width=0) (actual time=623.084..623.084 rows=1173048 loops=1)
                           Index Cond: ((log_date >= '2015-01-18 01:45:24'::timestamp without time zone) AND (log_date <= '2015-01-19 01:45:24'::timestamp without time zone))
Planning time: 0.945 ms
Execution time: 19032.409 ms
(13 rows)
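A note for anyone following along: the size reduction reported above comes from 9.4's GIN posting-list compression, which is only applied when an index is built (or rebuilt) under 9.4. An index carried over via pg_upgrade stays in the old, larger format until it is recreated, along these lines (a sketch, using the index name from this thread):

```sql
-- Rebuild the trigram index so it is rewritten in the compressed
-- 9.4 GIN format. REINDEX locks the table against writes; on a live
-- system one would instead CREATE INDEX CONCURRENTLY a replacement
-- and drop the old index afterwards.
REINDEX INDEX tridx_logs_01_msg;
```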
Great stuff! Sorry Oleg I don't have your original message anymore and
can't reply into the right place in the thread, so I took the liberty to
CC: you.
Christian
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
You could experiment with recompiling pg_trgm with the KEEPONLYALNUM and/or IGNORECASE definitions commented out if you are looking for exact matches; this will increase the index size but make it more selective.
Also, there's a thread around for pg_trgm 1.2 which will get you an even bigger boost.
--
Arthur Silva
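To see why those two compile-time definitions matter, it helps to look at how pg_trgm decomposes a string. With the default KEEPONLYALNUM and IGNORECASE behaviour, non-alphanumeric characters act only as word separators and matching is case-insensitive: each word is lowercased, padded with two leading blanks and one trailing blank, and split into 3-grams. The built-in show_trgm() function makes this visible (output shown is roughly what the default build produces; dots in the IP address do not appear in any trigram):

```sql
-- Inspect the trigrams pg_trgm would index for the search term used
-- earlier in this thread.
SELECT show_trgm('192.23.33.177');
-- roughly: {"  1","  2","  3"," 17"," 19"," 23"," 33",
--           "177","192","23 ","33 ","77 ","92 "}
```

Commenting out KEEPONLYALNUM keeps punctuation such as the dots in the trigrams, so a LIKE '%192.23.33.177%' search recchecks far fewer false candidates, at the cost of a bigger index.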
On 28/07/15 16:42, Merlin Moncure wrote:
>> Great stuff! Sorry Oleg I don't have your original message anymore and
>> can't reply into the right place in the thread, so I took the liberty to
>> CC: you.
>
> There are some more big optimizations (via Jeff Janes) coming down the
> pike for trigram searching. See thread:
> http://www.postgresql.org/message-id/CAMkU=1woR_Pdmie6d-zj6sDOPiHd_iUe3vZSXFGe_i4-AQYsJQ@mail.gmail.com
>
> I think it should be possible to patch the 9.4 pg_trgm module with
> Jeff's stuff -- it might be worthwhile to do that and run some tests
> and report back. I don't know if they address your particular case
> but in some situations the speedups are really dramatic.
>
> merlin

On 28/07/15 16:45, Arthur Silva wrote:
> You could experiment recompiling pg_trgm commenting out the
> KEEPONLYALNUM and/or IGNORECASE definitions if you are looking for exact
> matches, this will increase the index size but will make it more selective.
>
> Also, there's a thread around for pg_trgm 1.2 which will get you even
> more boost.

Thanks for the hints and all the hard work you guys are putting into
this. I'll follow the further development closely and report back if we
get any new breakthroughs with this rather big data set.

Christian