Обсуждение: PostgreSQL - Help Optimizing performance - full text search on Heroku
I have a big table ~15M records called entities. I want to find top 5 entities matching "hockey" in their name. I have a Full text index built for that, which is used : gin_ix_entity_full_text_search_name, which indexes the name. Query SELECT "entities".*, ts_rank(to_tsvector('english', "entities"."name"::text), to_tsquery('english', 'hockey'::text)) AS "rank0.48661998202865475" FROM "entities" WHERE "entities"."place" = 'f' AND (to_tsvector('english', "entities"."name"::text) @@ to_tsquery('english', 'hockey'::text)) ORDER BY "rank0.48661998202865475" DESC LIMIT 5 Duration 25,623 ms QUERY PLAN Limit (cost=4447.28..4447.29 rows=5 width=3116) (actual time=18509.274..18509.282 rows=5 loops=1) -> Sort (cost=4447.28..4448.41 rows=2248 width=3116) (actual time=18509.271..18509.273 rows=5 loops=1) Sort Key: (ts_rank(to_tsvector('english'::regconfig, (name)::text), '''hockey'''::tsquery)) Sort Method: top-N heapsort Memory: 19kB -> Bitmap Heap Scan on entities (cost=43.31..4439.82 rows=2248 width=3116) (actual time=119.003..18491.408 rows=2533 loops=1) Recheck Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''hockey'''::tsquery) Filter: (NOT place) -> Bitmap Index Scan on gin_ix_entity_full_text_search_name (cost=0.00..43.20 rows=2266 width=0) (actual time=74.093..74.093 rows=2593 loops=1) Index Cond: (to_tsvector('english'::regconfig, (name)::text) @@ '''hockey'''::tsquery) Total runtime: 18509.381 ms (10 rows) Is it because of my boolean condition (not Place?) If so, I should add it to my index and I should get a very fast query? Or is it the sorting condition which makes it very long? Thanks helping me understand the Query plan and how to fix my 25 seconds query! Here are my DB parameters . It is an online DB hosted by Heroku, on Amazon services. They describe it as having 1.7GB of ram, 1 processing unit and a DB of max 1TB. name | current_setting version | PostgreSQL 9.0.7 on i486-pc-linux-gnu, compiled by GCC gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 32-bit archive_command | test -f /etc/postgresql/9.0/main/wal-e.d/ARCHIVING_OFF || envdir /etc/postgresql/9.0/resource29857_heroku_com/wal-e.d/env wal-e wal-push %p archive_mode | on archive_timeout | 1min checkpoint_completion_target | 0.7 checkpoint_segments | 40 client_min_messages | notice cpu_index_tuple_cost | 0.001 cpu_operator_cost | 0.0005 cpu_tuple_cost | 0.003 effective_cache_size | 1530000kB hot_standby | on lc_collate | en_US.UTF-8 lc_ctype | en_US.UTF-8 listen_addresses | * log_checkpoints | on log_destination | syslog log_line_prefix | %u [YELLOW] log_min_duration_statement | 50ms log_min_messages | notice logging_collector | on maintenance_work_mem | 64MB max_connections | 500 max_prepared_transactions | 500 max_stack_depth | 2MB max_standby_archive_delay | -1 max_standby_streaming_delay | -1 max_wal_senders | 10 port | random_page_cost | 2 server_encoding | UTF8 shared_buffers | 415MB ssl | on syslog_ident | resource29857_heroku_com TimeZone | UTC wal_buffers | 8MB wal_keep_segments | 127 wal_level | hot_standby work_mem | 100MB (39 rows) I tried playing with the work_mem, setting it as high as 1.5GB, with no success. I believe it is heroku reading speed that is abysmal in this case. But I'd like to confirm that. Or is it the function that I'm calling in my SELECT clause? Thanks for help Also posted on http://dba.stackexchange.com/questions/16437/postgresql-help-optimizing-sql-performance-full-text-search -- View this message in context: http://postgresql.1045698.n5.nabble.com/PostgreSQL-Help-Optimizing-performance-full-text-search-on-Heroku-tp5638777p5638777.html Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
On 13.4.2012 18:14, xlash wrote: > I have a big table ~15M records called entities. I want to find top 5 > entities matching "hockey" in their name. Number of rows in not a very useful metric - if might be 15 MBs or 15 GBs, depending on the structure. We need to know at least this: select relpages, reltuples from pg_class where relname = 'entities'; and similarly for the index (just replace the relation name). > QUERY PLAN > > Limit (cost=4447.28..4447.29 rows=5 width=3116) (actual > time=18509.274..18509.282 rows=5 loops=1) > -> Sort (cost=4447.28..4448.41 rows=2248 width=3116) (actual > time=18509.271..18509.273 rows=5 loops=1) > Sort Key: (ts_rank(to_tsvector('english'::regconfig, (name)::text), > '''hockey'''::tsquery)) > Sort Method: top-N heapsort Memory: 19kB > -> Bitmap Heap Scan on entities (cost=43.31..4439.82 rows=2248 > width=3116) (actual time=119.003..18491.408 rows=2533 loops=1) > Recheck Cond: (to_tsvector('english'::regconfig, (name)::text) @@ > '''hockey'''::tsquery) > Filter: (NOT place) > -> Bitmap Index Scan on gin_ix_entity_full_text_search_name > (cost=0.00..43.20 rows=2266 width=0) (actual time=74.093..74.093 rows=2593 > loops=1) > Index Cond: (to_tsvector('english'::regconfig, > (name)::text) @@ '''hockey'''::tsquery) > Total runtime: 18509.381 ms > (10 rows) I recommend services like explain.depesz.com instead of posting the plans directly (which causes wrapping etc. so the plans are barely readable). I've posted the plan here: http://explain.depesz.com/s/Jr7 > Is it because of my boolean condition (not Place?) If so, I should add it to > my index and I should get a very fast query? Or is it the sorting condition > which makes it very long? No. The step that consumes most of the time is the "bitmap heap scan" (see the "actual time" difference, it's nicely visible from the plan at explain.depesz.com). Let me briefly explain how the bitmap index scan works (and how that applies to this query). First, the index is scanned and a bitmap of pages (tuples) that need to be read from the relation is built. This is the "bitmap index scan" node in the plan. Then, the bitmap is used to read pages from the relation - this is necessary to check visibility of the rows (this is not stored in the index) and get the complete rows if needed. If you check the plan you'll see the first stage takes 74 ms (so it's negligible) but scanning the relation takes 18491 ms (like 99.9% of the runtime). The sorting clearly is not the culprit as it takes ~ 17 ms. And the 'NOT place' condition does not make much difference - the bitmap index scan returns 2593 rows and the recheck produces 2533 rows, so ~2% rows were removed (not necessarily due to the 'NOT place' condition). So it's highly unlikely adding this column to the index will improve the performance. So my guess is that you're I/O bound - reading the table causes so much random I/O the machine can't handle that. You can verify this by watching "iostat -x" when running the query. My bet is the device will be ~100% utilized. > > Thanks helping me understand the Query plan and how to fix my 25 seconds > query! > > > Here are my DB parameters . It is an online DB hosted by Heroku, on Amazon > services. They describe it as having 1.7GB of ram, 1 processing unit and a > DB of max 1TB. Well, AWS instances are known to have I/O issues (which is expected when running database with virtualized devices). Now sure what to recommend here - either use an instance with more RAM (so that the whole 'entities' relation is cached) or split the table somehow so that less data needs to be read from the disk. E.g. vertical partitioning i.e. splitting the table vertically into two parts or something like that. Tomas