Discussion: Query Performance question


Query Performance question

From
"Magers, James"
Date:

I am using a Pentaho process to access the database and select the appropriate information to update the DB tables and records.  I am trying to select the previous subscription key in order to update the fact table for any records that have the previous key to have the current subscription key. This query is intended to use the current subscription key and subscription info to select the previous subscription key, so that the information can be updated.  I would like to optimize the query to execute more efficiently.

The database table has about 60K records in it, and when I run an explain analyze it indicates that the query optimizer chooses to execute a bitmap heap scan, which seems like an inefficient method for this query.

Query:
SELECT subscription_key AS prev_sub_key
FROM member_subscription_d
WHERE subscription_value = 'noname@mailinator.com'
  AND newsletter_nme = 'newsletter_member'
  AND subscription_platform = 'email'
  AND version = (SELECT version - 1 AS mtch_vers
                 FROM member_subscription_d
                 WHERE subscription_key = 4037516);


Current Data in Database for this address:
 subscription_key | version |       date_from        |          date_to           |  newsletter_nme   | subscription_platform | subscription_value | subscription_status | list_status | current_status | unsubscribetoken |    transaction_date    | newsletter_sts
------------------+---------+------------------------+----------------------------+-------------------+-----------------------+--------------------+---------------------+-------------+----------------+------------------+------------------------+----------------
          4001422 |       1 | 2000-02-09 00:00:00-05 | 2014-04-19 09:57:24-04     | newsletter_member | email                 | noname@mailinator.com       | VALID               | pending     | f              |                  | 2000-02-09 00:00:00-05 |              2
          4019339 |       2 | 2014-04-19 09:57:24-04 | 2014-06-04 12:27:34-04     | newsletter_member | email                 | noname@mailinator.com       | VALID               | subscribe   | f              |                  | 2014-04-19 09:57:24-04 |              1
          4037516 |       3 | 2014-06-04 12:27:34-04 | 2199-12-31 23:59:59.999-05 | newsletter_member | email                 | noname@mailinator.com       | VALID               | subscribe   | t              |                  | 2014-06-04 12:27:34-04 |              1
(3 rows)


System information:
Postgres Version: 9.2
OS : Linux cmprodpgsql1 3.2.0-37-virtual #58-Ubuntu SMP Thu Jan 24 15:48:03 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Pentaho: 5.0.1-stable

postgresql.conf
checkpoint_segments = '8'
data_directory = '/var/lib/postgresql/9.2/main'
datestyle = 'iso, mdy'
default_text_search_config = 'pg_catalog.english'
effective_cache_size = '2GB'
external_pid_file = '/var/run/postgresql/9.2-main.pid'
hba_file = '/etc/postgresql/9.2/main/pg_hba.conf'
ident_file = '/etc/postgresql/9.2/main/pg_ident.conf'
listen_addresses = '*'
log_line_prefix = '%t '
max_connections = '200'
max_wal_senders = '3'
port = 5432
shared_buffers = '1024MB'
ssl = off
ssl_cert_file = '/etc/ssl/certs/ssl-cert-snakeoil.pem'
ssl_key_file = '/etc/ssl/certs/ssl-cert-snakeoil.key'
timezone = 'localtime'
unix_socket_directory = '/var/run/postgresql'
wal_keep_segments = '8'
wal_level = 'hot_standby'
work_mem = '100MB'

Re: Query Performance question

From
Tomas Vondra
Date:
On 14.7.2014 00:55, Magers, James wrote:
> I am using a Pentaho process to access the database and select the
> appropriate information to update the DB tables and records.  I am
> trying to select the previous subscription key in order to update the
> fact table for any records that have the previous key to have the current
> subscription key. This query is intended to use the current subscription
> key and subscription info to select the previous subscription key to
> allow for the information to be updated.  I would like to optimize the
> query to execute more efficiently.
>
> The database table has about 60K records in it, and when I run an explain
> analyze it indicates that the query optimizer chooses to execute a
> bitmap heap scan, which seems like an inefficient method for this query.

Why do you think it's inefficient? The planner thinks it's efficient,
for some reason. And it's impossible to say if that's a good decision,
because we don't know (a) the explain plan, and (b) the structure of the
table involved (indexes, ...).

Please post the explain analyze output to explain.depesz.com and post
the link here (it's more readable than posting it here directly).

Also, please do this:

  SELECT relname, relpages, reltuples
    FROM pg_class WHERE relname = 'member_subscription_d'

and this

  \d member_subscription_d

and post the results here.

regards
Tomas


Re: Query Performance question

From
"Magers, James"
Date:
Tomas,

Thank you for your feedback.  I am attaching the requested information.  While I do not think the query is necessarily inefficient, I believe a sequential scan would be more efficient.

\d member_subscription_d

                                        Table "public.member_subscription_d"
        Column         |           Type           |                            Modifiers
-----------------------+--------------------------+-----------------------------------------------------------------
 subscription_key      | bigint                   | not null default nextval('subscription_id_seq'::regclass)
 version               | integer                  | not null
 date_from             | timestamp with time zone |
 date_to               | timestamp with time zone |
 newsletter_nme        | character varying(50)    |
 subscription_platform | character varying(50)    |
 subscription_value    | character varying(255)   |
 subscription_status   | character varying(100)   |
 list_status           | character varying(25)    |
 current_status        | boolean                  |
 unsubscribetoken      | character varying(200)   |
 transaction_date      | timestamp with time zone |
 newsletter_sts        | integer                  |
Indexes:
    "member_subscription_key" PRIMARY KEY, btree (subscription_key)
    "idx_member_subscription_d_list_status" btree (list_status)
    "idx_member_subscription_d_newsletter_nme" btree (newsletter_nme)
    "idx_member_subscription_d_subscription_status" btree (subscription_status)
    "idx_member_subscription_d_subscription_value" btree (subscription_value)
    "idx_member_subscription_d_tk" btree (subscription_key)
Referenced by:
    TABLE "member_recipient_f" CONSTRAINT "member_subscription_d_recipient_f_fk" FOREIGN KEY (subscription_key) REFERENCES member_subscription_d(subscription_key)



SELECT relname, relpages, reltuples
    FROM pg_class WHERE relname = 'member_subscription_d';
        relname        | relpages | reltuples
-----------------------+----------+-----------
 member_subscription_d |     1383 |     63012
(1 row)


Explain output:
http://explain.depesz.com/s/OVK
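All of the indexes above are single-column, which is why the planner resorts to combining them with bitmap index scans. A composite index covering the query's equality predicates plus version would let the planner satisfy the whole WHERE clause with a single plain index scan; a hypothetical sketch (the index name and column order are assumptions, not something from this setup):

```sql
-- Hypothetical composite index: the three equality predicates first,
-- then version, so the planner can resolve the WHERE clause with one
-- index scan instead of BitmapAnd-ing several single-column indexes.
CREATE INDEX idx_member_subscription_d_lookup
    ON member_subscription_d (subscription_value,
                              newsletter_nme,
                              subscription_platform,
                              version);
```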

Re: Query Performance question

From
Thomas Kellerer
Date:
Magers, James, 14.07.2014 04:20:

> Thank you for your feedback.  I am attaching the requested information.
> While I do not think the query is necessarily inefficient, I believe a sequential scan would be more efficient.

You can try

set enable_indexscan = off;
set enable_bitmapscan = off;

and then run your query.

But I would be very surprised if a seq scan (which reads through the whole table) were faster than the 4ms you have now.
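A complete test session might look like this (a sketch; the RESETs restore the defaults so that later queries in the session are planned normally):

```sql
-- Disable index and bitmap scans for this session only, compare the
-- resulting plan and timing, then restore the default planner settings.
SET enable_indexscan = off;
SET enable_bitmapscan = off;

EXPLAIN ANALYZE
SELECT subscription_key AS prev_sub_key
  FROM member_subscription_d
 WHERE subscription_value = 'noname@mailinator.com'
   AND newsletter_nme = 'newsletter_member'
   AND subscription_platform = 'email'
   AND version = (SELECT version - 1
                    FROM member_subscription_d
                   WHERE subscription_key = 4037516);

RESET enable_indexscan;
RESET enable_bitmapscan;
```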


Re: Query Performance question

From
"Magers, James"
Date:
Thomas,

Thank you.  I executed the query this morning after disabling the scan types.  I am including links to explain.depesz.com output for each of the three variations that I executed.

indexscan and bitmapscan off: http://explain.depesz.com/s/sIx
seqscan and bitmapscan off: http://explain.depesz.com/s/GfM
bitmapscan off: http://explain.depesz.com/s/3wna


Thank you,
James


Re: Query Performance question

From
Thomas Kellerer
Date:
Magers, James, 14.07.2014 15:18:
> Thank you.  I executed the query this morning after disabling the scan types.
> I am including links to explain.depesz output for each of the three variations that I executed.
>
> indexscan and bitmapscan off: http://explain.depesz.com/s/sIx
> seqscan and bitmapscan off: http://explain.depesz.com/s/GfM
> bitmapscan off: http://explain.depesz.com/s/3wna
>

So the original query (using an "Index Scan" + "Bitmap Index Scan") is indeed the most efficient one: 4ms vs. 39ms vs. 64ms.






Re: Query Performance question

From
"Magers, James"
Date:
Thomas,

I would have to agree that the current results do indicate that.  However, I have run this explain analyze multiple times and the timing varies from about 4ms to 35ms using the Bitmap Heap Scan.  Here is an explain plan from Thursday of last week that shows about 21ms.  Part of the issue in trying to isolate whether the query can be faster is that once the data is cached, any way that the query is executed appears to be quicker.

http://explain.depesz.com/s/SIX1

Thank you,
James

Re: Query Performance question

From
"Tomas Vondra"
Date:
On 14 July 2014, 16:00, Magers, James wrote:
> Thomas,
>
> I would have to agree that the current results do indicate that.  However,
> I have run this explain analyze multiple times and the timing varies from
> about 4ms to 35ms using the Bitmap Heap Scan.  Here is an explain plan
> from Thursday of last week that shows about 21ms.  Part of the issue in
> trying to isolate if the query can be faster is that once the data is
> cached any way that the query is executed appears to be quicker.
>
> http://explain.depesz.com/s/SIX1

I think that judging the performance based on this limited number of
samples is futile, especially when the plans are this fast. The
measurements are easily influenced by other tasks running on the system,
OS process scheduling, etc. Or it might be because of memory pressure on
the system, evicting the important data from the page cache (and thus
causing I/O for queries accessing it).

This might be the reason why you saw higher timings, and it's impossible
to say based solely on an explain plan from a single execution. To get
meaningful numbers it's necessary to execute the query repeatedly, to
eliminate caching effects. But the question is whether these caching
effects will happen in production or not. (Because what if you tweak the
configuration to get the best plan based on the assumption that everything
is cached, when it won't be in practice?)
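One way to tell a cached execution from an uncached one is the BUFFERS option of EXPLAIN (available in 9.2): "shared hit" counts pages served from shared_buffers, while "read" counts pages that had to come from the OS cache or disk. A sketch:

```sql
-- BUFFERS adds "shared hit" (pages found in shared_buffers) and "read"
-- (pages fetched from the OS or disk) counts to each plan node, which
-- separates cached runs from uncached ones.
EXPLAIN (ANALYZE, BUFFERS)
SELECT subscription_key AS prev_sub_key
  FROM member_subscription_d
 WHERE subscription_value = 'noname@mailinator.com'
   AND newsletter_nme = 'newsletter_member'
   AND subscription_platform = 'email';
```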

That being said, the only plan that's actually faster than the bitmap
index scan (which you believe is inefficient) is this one

   http://explain.depesz.com/s/3wna

The reason why it's not selected by the optimizer is that the cost is
estimated to be 20.60, while the bitmap index scan cost is estimated as
20.38. So the optimizer decides that 20.38 is lower than 20.60, and thus
chooses the bitmap index scan.

What you may do is tweak cost constants, described here

www.postgresql.org/docs/9.4/static/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS

You need to increase the bitmap index scan cost estimate, so that it's more
expensive than the index scan. I'd guess that increasing
cpu_tuple_cost and/or cpu_index_tuple_cost a bit should do the trick.
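For example (the values here are illustrative only; the defaults are cpu_tuple_cost = 0.01 and cpu_index_tuple_cost = 0.005, and any change should be validated against a realistic workload before being made permanent):

```sql
-- Illustrative values only: raising the per-tuple CPU cost estimates
-- makes the bitmap heap scan look relatively more expensive, nudging
-- the planner toward the plain index scan.
SET cpu_tuple_cost = 0.03;        -- default 0.01
SET cpu_index_tuple_cost = 0.01;  -- default 0.005
-- then re-run EXPLAIN ANALYZE on the query to see which plan is chosen
```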

regards
Tomas




Re: Query Performance question

From
"Magers, James"
Date:
Tomas,

Thank you for the recommendation.  In this case, the bitmap scan runs quite quickly; however, in production, where data may or may not be cached and volumes are higher, I am trying to ensure the process will continue to execute efficiently and to reduce the impact of the process on other processes running against the database.

My assessment is based on my experiences with the scans.  Does your experience provide you with a different assessment of the scan types and how efficient they may be?

Thank you,
James




Re: Query Performance question

From
"Tomas Vondra"
Date:
On 14 July 2014, 18:02, Magers, James wrote:
> Tomas,
>
> Thank you for the recommendation.  In this case, the bitmap scan runs
> quite quickly; however, in production, where data may or may not be cached
> and at higher volumes, I am trying to ensure the process will continue to
> execute efficiently and reduce the impact of the process on other
> processes running against the database.

That's why it's important to do the testing with a representative amount of
data. Testing the queries on a significantly reduced dataset is pointless,
because the optimizer will make different decisions.

> My assessment is based on my experiences with the scans.  Does your
> experience provide you with a different assessment of the scan types and
> how efficient they may be?

No. Because I don't have your data. And it seems that your assessment is
based on experience with a dataset that's very different from your expected
production dataset, which means the experience is not directly applicable.
The optimizer considers the size of the dataset when choosing the plan.

regards
Tomas



Re: Query Performance question

From
"Magers, James"
Date:
Thank you Tomas.  I did execute the queries against a dataset that was representative of what we expect the production dataset to have.  By higher volume I meant more transactions happening against the data.  We would expect the data size to increase over time, and when we executed against a dataset that was about 4x larger, the index scan was selected to perform the lookup instead of the bitmap heap scan.  The scans of both the smaller and larger datasets returned in similar times between the two groups of tests.  This is part of the reason I was thinking that the bitmap heap scan may not be as efficient, since 4 times the data returned in just a little more time using the index scan.


Thank you,
James

-----Original Message-----
From: Tomas Vondra [mailto:tv@fuzzy.cz] 
Sent: Monday, July 14, 2014 12:29 PM
To: Magers, James
Cc: Tomas Vondra; Thomas Kellerer; pgsql-performance@postgresql.org
Subject: RE: [PERFORM] Query Performance question
