Thread: poor performance with regexp searches on large tables

poor performance with regexp searches on large tables

From: Grzegorz Blinowski

Dear All,

I have some problems with regexp query performance - common sense tells me that my queries should run faster than they do.

The table in question has 590 K records and its size is 3.5 GB. I am effectively querying a single attribute, "subject", which has an average size of 2 KB, so we are doing a query on ~1 GB of data. The query looks more or less like this:

SELECT T.tender_id FROM archive_tender T WHERE
(( T.subject !~* '\\mpattern1.*\\M' ) AND ( T.subject ~* '\\mpattern2\\M' OR [4-5 more similar terms]  ) AND T.erased = 0 AND T.rejected = 0
ORDER BY
   tender_id DESC
LIMIT
    10000;

The planner shows a seq scan on subject, which is expected with a regexp match.

Now, the query above takes about 60 s to execute; exactly: 70 s for the first run and 60 s for subsequent runs. In my opinion this is too long: it should take about 35 s to read the whole table into RAM (assuming 100 MB/s transfers - half the benchmarked HDD speed). With 12 GB of RAM the whole table should easily be buffered at the operating system level. The regexp match on 1 GB of data takes 1-2 s (I benchmarked it with a simple pcre test). The system is not in production, so there is no additional database activity (no reads, no updates; effectively the db is read-only).

To summarize: any idea how to speed up this query? (Please don't suggest regexp indexing - in this application it would be too time-consuming to implement, and besides, as above, I think that Postgres should do better here even with a seq scan.)

Server parameters:
RAM: 12 GB
Cores: 8
HDD: SATA; shows 200 MB/s transfer speed
OS: Linux 64bit; Postgres 8.4


Some performance params from postgresql.conf:
max_connections = 16
shared_buffers = 24MB                 
temp_buffers = 128MB                  
max_prepared_transactions = 50        
work_mem = 128MB                         
maintenance_work_mem = 1GB 
effective_cache_size = 8GB

Database is vacuumed.


Regards,

Greg

Re: poor performance with regexp searches on large tables

From: "Tomas Vondra"

On 10 August 2011, 16:26, Grzegorz Blinowski wrote:
> Now, the query above takes about 60 s to execute; exactly: 70 s for the
> first run and 60 s for subsequent runs. In my opinion this is too long: it
> should take about 35 s to read the whole table into RAM (assuming 100 MB/s
> transfers - half the benchmarked HDD speed). With 12 GB of RAM the whole
> table should easily be buffered at the operating system level. The regexp

And is it really in the page cache? I'm not an expert in this field, but
I'd guess not. Check whether it really gets the data from the cache using
iostat or something like that. Use fincore to see what's really in the
cache; it's available here:
available here:

http://code.google.com/p/linux-ftools/
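
At the PostgreSQL level (shared_buffers only, not the OS page cache) the
statistics views can also give a rough picture - a sketch, using the table
name from the original message:

-- counters are cumulative, so compare snapshots taken before and after the query;
-- a "read" here may still have been served from the OS page cache, which is
-- exactly what fincore would tell you
SELECT heap_blks_read, heap_blks_hit, toast_blks_read, toast_blks_hit
FROM pg_statio_user_tables
WHERE relname = 'archive_tender';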

> Some performance params from postgresql.conf:
> max_connections = 16
> shared_buffers = 24MB

Why just 24 MB? Have you tried more memory here, e.g. 256 MB or 512 MB?
I'm not suggesting the whole table should fit there (a seq scan uses a small
ring buffer anyway), but 24 MB is just the bare minimum to start the DB.

> Database is vacuumed.

Just vacuumed or compacted? A plain vacuum just marks the dead tuples
as reusable; it does not compact the database. So if you've made a lot of
changes and then just run vacuum, the table may still occupy a lot of space
on disk. How did you determine that the table size is 3.5GB? Is that the
size of the raw data, or have you used pg_relation_size or something else?
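
For example (a sketch - pg_relation_size gives the main heap only, while
pg_total_relation_size also counts TOAST and indexes):

SELECT pg_size_pretty(pg_relation_size('archive_tender'))       AS heap_only,
       pg_size_pretty(pg_total_relation_size('archive_tender')) AS total_incl_toast_and_indexes;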

Tomas




Re: poor performance with regexp searches on large tables

From: pasman pasmański

Try to use a single regular expression.
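
Something along these lines (a sketch - pattern1, pattern2, pattern3, ...
stand in for the actual terms from the original query):

-- all the ORed word-boundary matches collapsed into one alternation
SELECT T.tender_id
FROM archive_tender T
WHERE T.subject !~* '\\mpattern1.*\\M'
  AND T.subject ~* '\\m(pattern2|pattern3|pattern4)\\M'
  AND T.erased = 0
  AND T.rejected = 0
ORDER BY tender_id DESC
LIMIT 10000;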



--
------------
pasman

Re: poor performance with regexp searches on large tables

From: "Kevin Grittner"

Grzegorz Blinowski <g.blinowski@gmail.com> wrote:

> Some performance params from postgresql.conf:

Please paste the result of running the query on this page:

http://wiki.postgresql.org/wiki/Server_Configuration
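
The query there is, roughly:

SELECT name, current_setting(name), source
FROM pg_settings
WHERE source NOT IN ('default', 'override');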

The general advice is usually to start with shared_buffers at the lesser
of 25% of system RAM or 8 GB, and adjust from there based on benchmarks.
So you might want to try 4 GB for that one.

Just to confirm: are you actually using two-phase commit?  (People sometimes
mistake the max_prepared_transactions setting for something related
to prepared statements.)

I concur with previous advice that using one regular expression
which matches all of the terms is going to be a lot faster than
matching each small regular expression separately and then combining
them.

-Kevin

Re: poor performance with regexp searches on large tables

From: Grzegorz Blinowski

Thanks for all the help so far. I increased the shared_buffers config parameter (Postgres didn't accept values higher than the default until I increased the system-wide shared memory limit). The current config (as suggested by Kevin Grittner) is as follows:

 version                   | PostgreSQL 8.4.7 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-50), 64-bit
 autovacuum                | off
 client_encoding           | LATIN2
 effective_cache_size      | 8GB
 lc_collate                | en_US.UTF-8
 lc_ctype                  | en_US.UTF-8
 listen_addresses          | *
 log_rotation_age          | 1d
 log_rotation_size         | 0
 log_truncate_on_rotation  | on
 logging_collector         | on
 maintenance_work_mem      | 1GB
 max_connections           | 16
 max_prepared_transactions | 50
 max_stack_depth           | 8MB
 port                      | 5432
 server_encoding           | UTF8
 shared_buffers            | 1GB
 statement_timeout         | 25min
 temp_buffers              | 16384
 TimeZone                  | Europe/Berlin
 work_mem                  | 128MB


However, changing shared_buffers didn't help. We also checked system I/O stats during the query - and in fact there is almost no I/O (even with the suboptimal shared_buffers). So the problem is not disk transfer/access but rather the way Postgres handles regexp queries... As I have written, it is difficult to rewrite the query syntax (the SQL generation in this app is quite complex), but it should be relatively easy to at least join all the OR clauses into one regexp; I can try this from the psql CLI. I will post an update if anything interesting happens...

Cheers,

Greg



Re: poor performance with regexp searches on large tables

From: Tomas Vondra

On 10.8.2011 19:01, Grzegorz Blinowski wrote:
> However, changing shared_buffers didn't help. We also checked system I/O
> stats during the query - and in fact there is almost no I/O (even with the
> suboptimal shared_buffers). So the problem is not disk transfer/access
> but rather the way Postgres handles regexp queries... As I have written,
> it is difficult to rewrite the query syntax (the SQL generation in this
> app is quite complex), but it should be relatively easy to at least join
> all the OR clauses into one regexp; I can try this from the psql CLI. I
> will post an update if anything interesting happens...

Can you post EXPLAIN ANALYZE output, preferably via explain.depesz.com?
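
Something like this (a sketch of the original query; on 8.4 it's plain
EXPLAIN ANALYZE, there is no BUFFERS option yet):

EXPLAIN ANALYZE
SELECT T.tender_id
FROM archive_tender T
WHERE T.subject !~* '\\mpattern1.*\\M'
  AND T.subject ~* '\\mpattern2\\M'
  AND T.erased = 0 AND T.rejected = 0
ORDER BY tender_id DESC
LIMIT 10000;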

Tomas

Re: poor performance with regexp searches on large tables

From: "Kevin Grittner"

Grzegorz Blinowski <g.blinowski@gmail.com> wrote:

> the problem is not disk transfer/access but rather the way
> Postgres handles regexp queries.

As a diagnostic step, could you figure out some non-regexp way to
select about the same percentage of rows with about the same
distribution across the table, and compare times?  So far I haven't
seen any real indication that the time is spent in evaluating the
regular expressions, versus just loading pages from the OS into
shared buffers and picking out individual tuples and columns from
the table.  For all we know, the time is mostly spent decompressing
the 2K values.  Perhaps you need to save them without compression.
If they are big enough after compression to be stored out-of-line by
default, you might want to experiment with having them in-line in
the tuple.

http://www.postgresql.org/docs/8.4/interactive/storage-toast.html
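
A sketch of the two directions mentioned above (column and table names taken
from the original query):

-- store the value out of line but without compression
ALTER TABLE archive_tender ALTER COLUMN subject SET STORAGE EXTERNAL;

-- or prefer compressed, in-line storage (moved out of line only as a last resort)
ALTER TABLE archive_tender ALTER COLUMN subject SET STORAGE MAIN;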

-Kevin

Re: poor performance with regexp searches on large tables

From: "Kevin Grittner"

Grzegorz Blinowski <g.blinowski@gmail.com> wrote:

> autovacuum                | off

BTW, that's generally not a good idea -- it leaves you much more
vulnerable to bloat which could cause performance problems to
manifest in any number of ways.  You might want to calculate your
heap bloat on this table.
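
A rough first check - a sketch using the statistics views (the numbers are
only estimates, and only as fresh as the last ANALYZE / stats collector
update):

SELECT relname,
       n_live_tup,
       n_dead_tup,
       pg_size_pretty(pg_relation_size(relid)) AS heap_size
FROM pg_stat_user_tables
WHERE relname = 'archive_tender';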

-Kevin

Re: poor performance with regexp searches on large tables

From: "Kevin Grittner"

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> So far I haven't seen any real indication that the time is spent
> in evaluating the regular expressions

Just as a reality check here, I ran some counts against a
moderately-sized table (half a million rows).  Just counting the
rows unconditionally was about five times as fast as having to pick
out even a small column for a compare.  Taking a substring of a
bigger (but normally non-TOASTed) value and doing a compare was only
a little slower.  Using a regular expression anchored to the front
of the string to do the equivalent of the compare to the substring
took about twice as long as the substring approach.  For a
non-anchored regular expression where it would normally need to scan
in a bit, it took twice as long as the anchored regular expression.
These times seem like they might leave some room for improvement,
but it doesn't seem too outrageous.

Each test was run three times.

select count(*) from "Case";
 count
--------
 527769
(1 row)

Time: 47.696 ms
Time: 47.858 ms
Time: 47.687 ms

select count(*) from "Case" where "filingCtofcNo" = '0878';
 count
--------
 198645
(1 row)

Time: 219.233 ms
Time: 225.410 ms
Time: 226.723 ms

select count(*) from "Case"
where substring("caption" from 1 for 5) = 'State';
 count
--------
 178142
(1 row)

Time: 238.160 ms
Time: 237.114 ms
Time: 240.388 ms

select count(*) from "Case" where "caption" ~ '^State';
 count
--------
 178142
(1 row)

Time: 532.821 ms
Time: 535.341 ms
Time: 529.121 ms

select count(*) from "Case" where "caption" ~ 'Wisconsin';
 count
--------
 157483
(1 row)

Time: 1167.433 ms
Time: 1172.282 ms
Time: 1170.562 ms

-Kevin

Re: poor performance with regexp searches on large tables

From: Grzegorz Blinowski

A small followup regarding the suggestion to turn off compression - I used:

ALTER TABLE archive_tender ALTER COLUMN subject SET STORAGE EXTERNAL

to turn off compression; however, I get the impression that "nothing happened". When exactly does this alteration take effect? Perhaps I should reload the entire db from backup to change the storage method?

Regards,

greg



Re: poor performance with regexp searches on large tables

From: "Kevin Grittner"

Grzegorz Blinowski <g.blinowski@gmail.com> wrote:

> A small followup regarding the suggestion to turn off compression
> - I used:
>
> ALTER TABLE archive_tender ALTER COLUMN subject SET STORAGE
> EXTERNAL
>
> to turn off compression; however, I get the impression that "nothing
> happened". When exactly does this alteration take effect? Perhaps I
> should reload the entire db from backup to change the storage
> method?

Yeah, the storage option just affects future storage of values; it
does not perform a conversion automatically.  There are various ways
you could cause the rows to be re-written so that they use the new
TOAST policy for the column.  One of the simplest would be to do a
data-only dump of the table, truncate the table, and restore the
data.  If that table is a big enough portion of the database, a
pg_dump of the whole database might be about as simple.
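
A rough sketch of that route done entirely from psql (server-side COPY, so it
needs superuser access and enough free space on the server's disk; a data-only
pg_dump plus restore would work just as well):

COPY archive_tender TO '/tmp/archive_tender.copy';
TRUNCATE archive_tender;
COPY archive_tender FROM '/tmp/archive_tender.copy';
VACUUM ANALYZE archive_tender;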

-Kevin

Re: poor performance with regexp searches on large tables

From: Grzegorz Blinowski

To summarize this thread:

We have tried most of the suggestions and found two of them effective:

1) collapsing OR expressions in the WHERE clause into one '(...)|(...)' regexp resulted in about 60% better search time
2) changing long attribute storage to EXTERNAL gave 30% better search time (but only on the first search - i.e. before data is cached)

Surprisingly, changing shared_buffers from 24 MB to 1 GB had no apparent effect.

Thanks once again for all your help!!!

Regards,

Greg



Re: poor performance with regexp searches on large tables

From: "Kevin Grittner"

Grzegorz Blinowski <g.blinowski@gmail.com> wrote:

> 2) changing long attribute storage to EXTERNAL gave 30% better
> search time (but only on the first search - i.e. before data is
> cached)

That suggests that all of the following are true:

(1)  The long value was previously being compressed and stored
in-line.

(2)  It's now being stored uncompressed, out-of-line in the TOAST
table.

(3)  Following the TOAST pointers on cached tuples isn't
significantly more or less expensive than decompressing the data.

(4)  The smaller base tuple caused fewer page reads from disk, even
with the out-of-line storage for the large value.

The first three aren't surprising; that last one is.  Unless there
is significant bloat of the table, I'm having trouble seeing why
that first run is cheaper this way.  Make sure your vacuum policy is
aggressive enough; otherwise you will probably see a slow but steady
deterioration in performance.

-Kevin