Discussion: poor performance with regexp searches on large tables
Dear All,
I have some problems with regexp query performance - common sense tells me that my queries should run faster than they do.
The table in question has 590K records and is 3.5GB in size. I am effectively querying a single attribute, "subject", which has an average size of 2KB, so we are doing a query on ~1GB of data. The query looks more or less like this:
SELECT T.tender_id FROM archive_tender T WHERE
( T.subject !~* '\\mpattern1.*\\M' ) AND ( T.subject ~* '\\mpattern2\\M' OR [4-5 more similar terms] ) AND T.erased = 0 AND T.rejected = 0
ORDER BY
tender_id DESC
LIMIT
10000;
The planner shows a seq scan with the filter on subject, which is OK for a regexp match.
Now, the query above takes about 60 seconds to execute; exactly: 70s for the first run and 60s for the next runs. In my opinion this is too long: it should take about 35s to read the whole table into RAM (assuming 100 MB/s transfers - half the benchmarked HDD speed). With 12GB of RAM the whole table should be easily buffered at the operating system level. A regexp match on 1GB of data takes 1-2s (I benchmarked it with a simple pcre test). The system is not in production, so there is no additional database activity (no reads, no updates; effectively the db is read-only).
To summarize: any idea how to speed up this query? (Please don't suggest regexp indexing - in this application it would be too time-consuming to implement, and besides, as above, I think that Postgres should do better here even with a seq scan.)
Server parameters:
RAM: 12 GB
Cores: 8
HDD: SATA; shows 200 MB/s transfer speed
OS: Linux 64bit; Postgres 8.4
Some performance params from postgresql.conf:
max_connections = 16
shared_buffers = 24MB
temp_buffers = 128MB
max_prepared_transactions = 50
work_mem = 128MB
maintenance_work_mem = 1GB
effective_cache_size = 8GB
Database is vacuumed.
Regards,
Greg
On 10 August 2011, 16:26, Grzegorz Blinowski wrote:
> Now, the query above takes about 60 seconds to execute; exactly: 70s for
> the first run and 60s for the next runs. In my opinion this is too long:
> it should take about 35s to read the whole table into RAM (assuming
> 100 MB/s transfers - half the benchmarked HDD speed). With 12GB of RAM
> the whole table should be easily buffered at the operating system level.

And is it really in the page cache? I'm not an expert in this field, but I'd guess no. Check whether it really gets the data from the cache using iostat or something like that. Use fincore to see what's really in the cache; it's available here: http://code.google.com/p/linux-ftools/

> Some performance params from postgresql.conf:
> max_connections = 16
> shared_buffers = 24MB

Why just 24MB? Have you tried with more memory here, e.g. 256MB or 512MB? I'm not suggesting the whole table should fit here (a seq scan uses a small ring buffer anyway), but 24MB is just the bare minimum to start the DB.

> Database is vacuumed.

Just vacuumed or compacted? A plain vacuum just marks the dead tuples as empty; it does not compact the database. So if you've done a lot of changes and then just run vacuum, the table may still occupy a lot of space on disk.

How did you get that the table size is 3.5GB? Is that the size of the raw data, or have you used pg_relation_size or something else?

Tomas
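For reference, the on-disk sizes can be checked from psql roughly like this (just a sketch; pg_relation_size counts only the main heap, pg_total_relation_size also includes TOAST and indexes):

SELECT pg_size_pretty(pg_relation_size('archive_tender'))       AS heap_only,
       pg_size_pretty(pg_total_relation_size('archive_tender')) AS with_toast_and_indexes;

If the two numbers differ a lot, much of the 2KB subject data is probably stored compressed or out-of-line in the TOAST table.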
Try to use a single regular expression.

--
------------
pasman
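For illustration, collapsing the OR-ed terms into a single alternation might look roughly like this (pattern1..pattern4 are placeholders for the real terms):

SELECT T.tender_id
FROM archive_tender T
WHERE T.subject !~* '\\mpattern1.*\\M'
  AND T.subject ~* '\\m(pattern2|pattern3|pattern4)\\M'   -- one regexp instead of several OR-ed ~* terms
  AND T.erased = 0
  AND T.rejected = 0
ORDER BY tender_id DESC
LIMIT 10000;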
Grzegorz Blinowski <g.blinowski@gmail.com> wrote:
> Some performance params from postgresql.conf:

Please paste the result of running the query on this page:
http://wiki.postgresql.org/wiki/Server_Configuration

For a start, the general advice is usually to start with shared_buffers at the lesser of 25% of system RAM or 8 GB, and adjust from there based on benchmarks. So you might want to try 4GB for that one.

Just to confirm, you are using two-phase commit? (People sometimes mistake the max_prepared_transactions setting for something related to prepared statements.)

I concur with previous advice that using one regular expression which matches all of the terms is going to be a lot faster than matching each small regular expression separately and then combining them.

-Kevin
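Roughly, the query on that wiki page just lists the settings that have been changed from their defaults, something like:

SELECT name, current_setting(name), source
FROM pg_settings
WHERE source NOT IN ('default', 'override');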
Thanks for all the help so far. I increased the shared_buffers config parameter (Postgres didn't accept values higher than the default until I also raised the system-wide shared memory limit). The current config (queried as suggested by Kevin Grittner) is as follows:
version | PostgreSQL 8.4.7 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-50), 64-bit
autovacuum | off
client_encoding | LATIN2
effective_cache_size | 8GB
lc_collate | en_US.UTF-8
lc_ctype | en_US.UTF-8
listen_addresses | *
log_rotation_age | 1d
log_rotation_size | 0
log_truncate_on_rotation | on
logging_collector | on
maintenance_work_mem | 1GB
max_connections | 16
max_prepared_transactions | 50
max_stack_depth | 8MB
port | 5432
server_encoding | UTF8
shared_buffers | 1GB
statement_timeout | 25min
temp_buffers | 16384
TimeZone | Europe/Berlin
work_mem | 128MB
However, changing shared_buffers didn't help. We also checked system I/O stats during the query - and in fact there is almost no I/O (even with the suboptimal shared_buffers setting). So the problem is not disk transfer/access but rather the way Postgres handles regexp queries... As I have written, it is difficult to rewrite the query syntax (the SQL generation in this app is quite complex), but it should be relatively easy to at least join all the OR clauses into one regexp; I can try this from the psql CLI. I will post an update if anything interesting happens...
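We could also check what actually sits in shared buffers with the pg_buffercache contrib module (if it is installed) - roughly something like:

-- top relations by number of 8kB buffers currently held in shared_buffers
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = c.relfilenode
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;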
Cheers,
Greg
On 10.8.2011 19:01, Grzegorz Blinowski wrote:
> However, changing shared_buffers didn't help. We also checked system I/O
> stats during the query - and in fact there is almost no I/O (even with
> the suboptimal shared_buffers setting). So the problem is not disk
> transfer/access but rather the way Postgres handles regexp queries...
> As I have written, it is difficult to rewrite the query syntax (the SQL
> generation in this app is quite complex), but it should be relatively
> easy to at least join all the OR clauses into one regexp; I can try this
> from the psql CLI. I will post an update if anything interesting
> happens...

Can you post EXPLAIN ANALYZE, preferably using explain.depesz.com?

Tomas
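For example, roughly (with the real patterns in place of the placeholders), so the per-node timings and row counts become visible:

EXPLAIN ANALYZE
SELECT T.tender_id
FROM archive_tender T
WHERE T.subject ~* '\\mpattern2\\M'   -- plus the other conditions from the original query
  AND T.erased = 0 AND T.rejected = 0
ORDER BY tender_id DESC
LIMIT 10000;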
Grzegorz Blinowski <g.blinowski@gmail.com> wrote:
> the problem is not disk transfer/access but rather the way
> Postgres handles regexp queries.

As a diagnostic step, could you figure out some non-regexp way to select about the same percentage of rows with about the same distribution across the table, and compare times? So far I haven't seen any real indication that the time is spent in evaluating the regular expressions, versus just loading pages from the OS into shared buffers and picking out individual tuples and columns from the table.

For all we know, the time is mostly spent decompressing the 2K values. Perhaps you need to save them without compression. If they are big enough after compression to be stored out-of-line by default, you might want to experiment with having them in-line in the tuple.

http://www.postgresql.org/docs/8.4/interactive/storage-toast.html

-Kevin
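For example, something along these lines touches and detoasts the same column but does no regexp work at all, which should show how much of the 60 seconds is regexp evaluation versus just reading and decompressing the tuples (a sketch; adjust the predicate so it selects a similar fraction of rows):

-- no regexp: just force each subject value to be fetched and decompressed
SELECT count(*)
FROM archive_tender T
WHERE length(T.subject) > 100
  AND T.erased = 0
  AND T.rejected = 0;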
Grzegorz Blinowski <g.blinowski@gmail.com> wrote:
> autovacuum | off

BTW, that's generally not a good idea -- it leaves you much more vulnerable to bloat which could cause performance problems to manifest in any number of ways. You might want to calculate your heap bloat on this table.

-Kevin
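As a first approximation (the wiki has more elaborate bloat queries), the statistics collector's live/dead tuple counts give a rough idea - a sketch:

SELECT relname, n_live_tup, n_dead_tup,
       pg_size_pretty(pg_relation_size(relid)) AS heap_size
FROM pg_stat_user_tables
WHERE relname = 'archive_tender';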
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > So far I haven't seen any real indication that the time is spent > in evaluating the regular expressions Just as a reality check here, I ran some counts against a moderately-sized table (half a million rows). Just counting the rows unconditionally was about five times as fast as having to pick out even a small column for a compare. Taking a substring of a bigger (but normally non-TOASTed) value and doing a compare was only a little slower. Using a regular expression anchored to the front of the string to do the equivalent of the compare to the substring took about twice as long as the substring approach. For a non-anchored regular expression where it would normally need to scan in a bit, it took twice as long as the anchored regular expression. These times seem like they might leave some room for improvement, but it doesn't seem too outrageous. Each test run three times. select count(*) from "Case"; count -------- 527769 (1 row) Time: 47.696 ms Time: 47.858 ms Time: 47.687 ms select count(*) from "Case" where "filingCtofcNo" = '0878'; count -------- 198645 (1 row) Time: 219.233 ms Time: 225.410 ms Time: 226.723 ms select count(*) from "Case" where substring("caption" from 1 for 5) = 'State'; count -------- 178142 (1 row) Time: 238.160 ms Time: 237.114 ms Time: 240.388 ms select count(*) from "Case" where "caption" ~ '^State'; count -------- 178142 (1 row) Time: 532.821 ms Time: 535.341 ms Time: 529.121 ms select count(*) from "Case" where "caption" ~ 'Wisconsin'; count -------- 157483 (1 row) Time: 1167.433 ms Time: 1172.282 ms Time: 1170.562 ms -Kevin
A small followup regarding the suggestion to turn off compression - I used:
ALTER TABLE archive_tender ALTER COLUMN subject SET STORAGE EXTERNAL
to turn off compression; however, I get the impression that nothing happened. When exactly does this alteration take effect? Perhaps I should reload the entire db from backup to change the storage method?
Regards,
greg
Grzegorz Blinowski <g.blinowski@gmail.com> wrote:
> A small followup regarding the suggestion to turn off compression
> - I used:
>
> ALTER TABLE archive_tender ALTER COLUMN subject SET STORAGE
> EXTERNAL
>
> to turn off compression; however, I get the impression that nothing
> happened. When exactly does this alteration take effect? Perhaps I
> should reload the entire db from backup to change the storage
> method?

Yeah, the storage option just affects future storage of values; it does not perform a conversion automatically. There are various ways you could cause the rows to be rewritten so that they use the new TOAST policy for the column. One of the simplest would be to do a data-only dump of the table, truncate the table, and restore the data. If that table is a big enough portion of the database, a pg_dump of the whole database might be about as simple.

-Kevin
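For example, a data-only round trip done from psql could look roughly like this (just a sketch - make sure there is a backup first and enough disk space for the dump file):

\copy archive_tender TO '/tmp/archive_tender.dat'
TRUNCATE archive_tender;
\copy archive_tender FROM '/tmp/archive_tender.dat'
VACUUM ANALYZE archive_tender;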
To summarize this thread:
We have tried most of the suggestions and found two of them effective:
1) collapsing OR expressions in the WHERE clause into one '(...)|(...)' regexp resulted in about 60% better search time
2) changing long attribute storage to EXTERNAL gave 30% better search time (but only on the first search - i.e. before data is cached)
Surprisingly, changing shared_buffers from 24MB to 1GB had no apparent effect.
Thanks once again for all your help!!!
Regards,
Greg
Grzegorz Blinowski <g.blinowski@gmail.com> wrote:
> 2) changing long attribute storage to EXTERNAL gave 30% better
> search time (but only on the first search - i.e. before data is
> cached)

That suggests that all of the following are true:

(1) The long value was previously being compressed and stored in-line.
(2) It's now being stored uncompressed, out-of-line in the TOAST table.
(3) Following the TOAST pointers on cached tuples isn't significantly more or less expensive than decompressing the data.
(4) The smaller base tuple caused fewer page reads from disk, even with the out-of-line storage for the large value.

The first three aren't surprising; that last one is. Unless there is significant bloat of the table, I'm having trouble seeing why that first run is cheaper this way. Make sure your vacuum policy is aggressive enough; otherwise you will probably see a slow but steady deterioration in performance.

-Kevin
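With autovacuum off, that means at minimum scheduling a periodic manual vacuum of the big table (or simply re-enabling autovacuum in postgresql.conf), e.g. something like:

-- run periodically, or set autovacuum = on in postgresql.conf instead
VACUUM ANALYZE archive_tender;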