Discussion: Performance issues when the number of records are around 10 Million


Performance issues when the number of records are around 10 Million

From: venu madhav
Date:
Hi all,
      In my database application, I have a table whose record count can reach 10 million, and insertions can arrive at a fast rate, around 100 per second at peak times. I configured Postgres to run autovacuum hourly. I have a frontend GUI application in CGI which displays the data from the database. When I try to get the last twenty records from the database, it takes around 10-15 minutes to complete. This is the query used:

select e.cid, timestamp, s.sig_class, s.sig_priority, s.sig_name, e.sniff_ip, e.sniff_channel, s.sig_config, e.wifi_addr_1,
e.wifi_addr_2, e.view_status, bssid  FROM event e, signature s WHERE s.sig_id = e.signature   AND e.timestamp >= '1270449180' AND e.timestamp < '1273473180'  ORDER BY e.cid DESC,  e.cid DESC limit 21 offset 10539780;

Can anyone suggest a better way to improve the performance?

Please let me know if you have any further questions.


Thank you,
Venu

Re: Performance issues when the number of records are around 10 Million

From: "Kevin Grittner"
Date:
venu madhav <venutaurus539@gmail.com> wrote:

> When I try to get the last twenty records from the database, it
> takes around 10-15  mins to complete the operation.

Making this a little easier to read (for me, at least) I get this:

select e.cid, timestamp, s.sig_class, s.sig_priority, s.sig_name,
    e.sniff_ip, e.sniff_channel, s.sig_config, e.wifi_addr_1,
    e.wifi_addr_2, e.view_status, bssid
  FROM event e,
       signature s
  WHERE s.sig_id = e.signature
    AND e.timestamp >= '1270449180'
    AND e.timestamp <  '1273473180'
  ORDER BY
    e.cid DESC,
    e.cid DESC
  limit 21
  offset 10539780
;

Why the timestamp range, the order by, the limit, *and* the offset?
On the face of it, that seems a bit confused.  Not to mention that
your ORDER BY has the same column twice.

Perhaps that OFFSET is not needed?  It is telling PostgreSQL that
whatever results are generated based on the rest of the query, read
through and ignore the first ten and a half million.  Since you said
you had about ten million rows, you wanted the last 20, and the
ORDER by is DESCending, you're probably not going to get what you
want.
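
If what you actually want is the *oldest* twenty rows in that range
(the last page when browsing newest-first), a sketch (untested
against your schema) would be to flip the sort and drop the OFFSET,
so only twenty rows ever need to be produced:

select e.cid, timestamp, s.sig_class, s.sig_priority, s.sig_name,
    e.sniff_ip, e.sniff_channel, s.sig_config, e.wifi_addr_1,
    e.wifi_addr_2, e.view_status, bssid
  FROM event e,
       signature s
  WHERE s.sig_id = e.signature
    AND e.timestamp >= '1270449180'
    AND e.timestamp <  '1273473180'
  ORDER BY e.cid ASC   -- ascending: the "last" rows come out first
  limit 20
;

(The rows come back oldest-first, so they would need to be reversed
for display.)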

What, exactly, *is* it you want again?

-Kevin

Re: Performance issues when the number of records are around 10 Million

From: "Jorge Montero"
Date:
First, are you sure you are getting autovacuum to run hourly? Autovacuum will only vacuum when certain configuration thresholds are reached. You can set it to check those thresholds at any interval, but no vacuuming or analyzing will be done unless they are hit, regardless of how often autovacuum checks the tables. Whenever you are dealing with time series, the default thresholds are often insufficient, particularly when you are mostly interested in the last few records of a large table.
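
For a table that size, the scale factors usually matter more than the naptime. A postgresql.conf sketch (values illustrative, to be tuned for your workload):

autovacuum = on
autovacuum_naptime = 60                  # check once a minute (the default)
autovacuum_vacuum_scale_factor = 0.05    # vacuum when ~5% of rows are dead
autovacuum_analyze_scale_factor = 0.02   # analyze when ~2% of rows changed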
 
What are your autovacuum configuration parameters?
When were the two tables last autovacuumed and analyzed, according to pg_stat_user_tables?
Could you post the output of explain analyze of your query?
Which default statistics collection parameters do you use? Have you changed them specifically for the tables you are using?
Which version of Postgres are you running? Which OS?
 
 


Re: Performance issues when the number of records are around 10 Million

From: Shrirang Chitnis
Date:
Venu,

For starters,

1) You have used e.cid twice in the ORDER BY clause.
2) If you want the last twenty records in the table matching the timestamp criteria, why do you need the offset?
3) Do you have indexes on the sig_id, signature and timestamp fields?
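
If not, something along these lines would be a starting point (index names are illustrative, sig_id may already be covered by a primary key, and this is untested against your schema):

CREATE INDEX event_timestamp_idx ON event ("timestamp");
CREATE INDEX event_signature_idx ON event (signature);
CREATE INDEX event_cid_idx       ON event (cid);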

If you do not get a good response after that, please post the EXPLAIN ANALYZE for the query.

Thanks,

Shrirang Chitnis
Sr. Manager, Applications Development
HOV Services
Office: (866) 808-0935 Ext: 39210
shrirang.chitnis@hovservices.com
www.hovservices.com



Re: Performance issues when the number of records are around 10 Million

From: Josh Berkus
Date:
> select e.cid, timestamp, s.sig_class, s.sig_priority, s.sig_name,
> e.sniff_ip, e.sniff_channel, s.sig_config, e.wifi_addr_1,
> e.wifi_addr_2, e.view_status, bssid  FROM event e, signature s WHERE
> s.sig_id = e.signature   AND e.timestamp >= '1270449180' AND e.timestamp
> < '1273473180'  ORDER BY e.cid DESC,  e.cid DESC limit 21 offset 10539780;

Anything with an offset that high is going to result in a sequential
scan of most of the table.
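
A common alternative is keyset pagination: instead of an OFFSET,
remember the smallest cid shown on the previous page and seek past
it. A sketch (untested; :last_seen_cid is a placeholder parameter,
and it assumes an index on event.cid):

SELECT e.cid, timestamp, s.sig_class, s.sig_priority, s.sig_name,
       e.sniff_ip, e.sniff_channel, s.sig_config, e.wifi_addr_1,
       e.wifi_addr_2, e.view_status, bssid
  FROM event e
  JOIN signature s ON s.sig_id = e.signature
 WHERE e.timestamp >= '1270449180'
   AND e.timestamp <  '1273473180'
   AND e.cid < :last_seen_cid   -- seek past the previous page
 ORDER BY e.cid DESC
 LIMIT 20;

This reads only the next 20 rows no matter how deep the user pages,
at the cost of supporting only next/previous rather than
jump-to-page-N.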

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Re: Performance issues when the number of records are around 10 Million

From: venu madhav
Date:


On Wed, May 12, 2010 at 3:17 AM, Jorge Montero <jorge_montero@homedecorators.com> wrote:
First, are you sure you are getting autovacuum to run hourly? Autovacuum will only vacuum when certain configuration thresholds are reached. You can set it to check those thresholds at any interval, but no vacuuming or analyzing will be done unless they are hit, regardless of how often autovacuum checks the tables. Whenever you are dealing with time series, the default thresholds are often insufficient, particularly when you are mostly interested in the last few records of a large table.
 
[Venu] Yes, autovacuum is running every hour; I can see it in the log messages. All the autovacuum configuration settings are disabled except the one that makes it run every hour. This application runs on an embedded box, so I can't change the parameters, as they affect the other applications running on it. Can you please explain what you mean by default parameters?
 
What are your autovacuum configuration parameters?
[Venu] All others are disabled except these:
#---------------------------------------------------------------------------
# AUTOVACUUM PARAMETERS
#---------------------------------------------------------------------------

autovacuum = on                         # enable autovacuum subprocess?
autovacuum_naptime = 3600               # time between autovacuum runs, in secs

When were the two tables last autovacuumed and analyzed, according to pg_stat_user_tables?
[Venu] This is the content of pg_stat_user_tables for the two tables I am using in that query.
 relid | schemaname |     relname      | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del
-------+------------+------------------+----------+--------------+----------+---------------+-----------+-----------+-----------
 41188 | public     | event            |      117 |   1201705723 |      998 |          2824 |        28 |         0 |         0
 41209 | public     | signature        |      153 |         5365 |        2 |            72 |         1 |         0 |         0

Could you post the output of explain analyze of your query?
snort=# EXPLAIN ANALYZE select e.cid, timestamp, s.sig_class, s.sig_priority, s.sig_name, e.sniff_ip, e.sniff_channel, s.sig_config, e.wifi_addr_1,  e.wifi_addr_2, e.view_status, bssid  FROM event e, signature s WHERE s.sig_id = e.signature   AND e.timestamp >= '1270449180' AND e.timestamp < '1273473180'  ORDER BY e.cid DESC,  e.cid DESC limit 21 offset 10539780;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=7885743.98..7885743.98 rows=1 width=287) (actual time=1462193.060..1462193.083 rows=14 loops=1)
   ->  Sort  (cost=7859399.66..7885743.98 rows=10537727 width=287) (actual time=1349648.207..1456496.334 rows=10539794 loops=1)
         Sort Key: e.cid
         ->  Hash Join  (cost=2.44..645448.31 rows=10537727 width=287) (actual time=0.182..139745.001 rows=10539794 loops=1)
               Hash Cond: ("outer".signature = "inner".sig_id)
               ->  Seq Scan on event e  (cost=0.00..487379.97 rows=10537727 width=104) (actual time=0.012..121595.257 rows=10539794 loops=1)
                     Filter: (("timestamp" >= 1270449180::bigint) AND ("timestamp" < 1273473180::bigint))
               ->  Hash  (cost=2.35..2.35 rows=35 width=191) (actual time=0.097..0.097 rows=36 loops=1)
                     ->  Seq Scan on signature s  (cost=0.00..2.35 rows=35 width=191) (actual time=0.005..0.045 rows=36 loops=1)
 Total runtime: 1463829.145 ms
(10 rows)

Which default statistics collection parameters do you use? Have you changed them specifically for the tables you are using?
[Venu] These are the statistics collection parameters:

# - Query/Index Statistics Collector -

stats_start_collector = on
stats_command_string = on
#stats_block_level = off
stats_row_level = on
#stats_reset_on_server_start = off

Please let me know if you are referring to something else.
Which version of Postgres are you running? Which OS?
[Venu] Postgres version 8.1, and the operating system is CentOS 5.1.

Thank you,
Venu
 
 


Re: Performance issues when the number of records are around 10 Million

From: venu madhav
Date:


On Wed, May 12, 2010 at 3:22 AM, Shrirang Chitnis <Shrirang.Chitnis@hovservices.com> wrote:
Venu,

For starters,

1) You have used e.cid twice in the ORDER BY clause.
[Venu] Actually, the second cid acts as a secondary sort key when some other column is used for the primary sort. In this query the primary sort key is also cid, so it appears twice. I can remove it.
2) If you want the last twenty records in the table matching the timestamp criteria, why do you need the offset?
[Venu] It is part of a UI application where a user can ask for data between any two dates, with options to browse through the data retrieved in that interval.
3) Do you have indexes on the sig_id, signature and timestamp fields?
[Venu] Yes, I do have indexes on those three.
 
If you do not get a good response after that, please post the EXPLAIN ANALYZE for the query.
snort=# EXPLAIN ANALYZE select e.cid, timestamp, s.sig_class, s.sig_priority, s.sig_name, e.sniff_ip, e.sniff_channel, s.sig_config, e.wifi_addr_1,  e.wifi_addr_2, e.view_status, bssid  FROM event e, signature s WHERE s.sig_id = e.signature   AND e.timestamp >= '1270449180' AND e.timestamp < '1273473180'  ORDER BY e.cid DESC,  e.cid DESC limit 21 offset 10539780;
                                                                 QUERY PLAN                                                                 
---------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=7885743.98..7885743.98 rows=1 width=287) (actual time=1462193.060..1462193.083 rows=14 loops=1)
   ->  Sort  (cost=7859399.66..7885743.98 rows=10537727 width=287) (actual time=1349648.207..1456496.334 rows=10539794 loops=1)
         Sort Key: e.cid
         ->  Hash Join  (cost=2.44..645448.31 rows=10537727 width=287) (actual time=0.182..139745.001 rows=10539794 loops=1)
               Hash Cond: ("outer".signature = "inner".sig_id)
               ->  Seq Scan on event e  (cost=0.00..487379.97 rows=10537727 width=104) (actual time=0.012..121595.257 rows=10539794 loops=1)
                     Filter: (("timestamp" >= 1270449180::bigint) AND ("timestamp" < 1273473180::bigint))
               ->  Hash  (cost=2.35..2.35 rows=35 width=191) (actual time=0.097..0.097 rows=36 loops=1)
                     ->  Seq Scan on signature s  (cost=0.00..2.35 rows=35 width=191) (actual time=0.005..0.045 rows=36 loops=1)
 Total runtime: 1463829.145 ms
(10 rows)
Thank you,
Venu Madhav.


Re: Performance issues when the number of records are around 10 Million

From: venu madhav
Date:


On Wed, May 12, 2010 at 3:20 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
venu madhav <venutaurus539@gmail.com> wrote:

> When I try to get the last twenty records from the database, it
> takes around 10-15  mins to complete the operation.

Making this a little easier to read (for me, at least) I get this:

select e.cid, timestamp, s.sig_class, s.sig_priority, s.sig_name,
   e.sniff_ip, e.sniff_channel, s.sig_config, e.wifi_addr_1,
   e.wifi_addr_2, e.view_status, bssid
 FROM event e,
      signature s
 WHERE s.sig_id = e.signature
   AND e.timestamp >= '1270449180'
   AND e.timestamp <  '1273473180'
 ORDER BY
   e.cid DESC,
   e.cid DESC
 limit 21
 offset 10539780
;

Why the timestamp range, the order by, the limit, *and* the offset?
On the face of it, that seems a bit confused.  Not to mention that
your ORDER BY has the same column twice.
[Venu] The second column acts as a secondary sort key when the primary sort key is a different column; for this query both happen to be the same. This query is part of an application which allows the user to select time ranges and retrieve the data in that interval, hence the timestamp conditions. To present the data in a particular order we do the ORDER BY. If there are many records in the interval, we display them in sets of 20/30, etc. The user also has the option to browse through any of those records, hence the LIMIT and OFFSET.

Perhaps that OFFSET is not needed?  It is telling PostgreSQL that
whatever results are generated based on the rest of the query, read
through and ignore the first ten and a half million.  Since you said
you had about ten million rows, you wanted the last 20, and the
ORDER by is DESCending, you're probably not going to get what you
want.

What, exactly, *is* it you want again?

[Venu] As explained above, this query is part of the application where the user wishes to see the records from the database between any start and end times. They are rendered as an HTML page with pagination links to traverse the data. The user has the option to go to any set of records. When the user asks for the last set of 20 records, this query gets executed.
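
For the last page specifically, a sketch I could try (untested, and it assumes cid increases monotonically with insertion order) is to flip the sort and restore the display order in an outer query:

SELECT * FROM (
    SELECT e.cid, timestamp, s.sig_class, s.sig_priority, s.sig_name,
           e.sniff_ip, e.sniff_channel, s.sig_config, e.wifi_addr_1,
           e.wifi_addr_2, e.view_status, bssid
      FROM event e, signature s
     WHERE s.sig_id = e.signature
       AND e.timestamp >= '1270449180'
       AND e.timestamp <  '1273473180'
     ORDER BY e.cid ASC      -- oldest rows in the range first
     LIMIT 20                -- a partial last page would need total % 20 here
) AS last_page
ORDER BY cid DESC;           -- restore the newest-first display order
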
Hope it is clear now. Please let me know if you need any further info.

Thank you,
Venu

Re: Performance issues when the number of records are around 10 Million

From: Robert Haas
Date:
On Wed, May 12, 2010 at 1:45 AM, venu madhav <venutaurus539@gmail.com> wrote:
> [Venu] Yes, autovacuum is running every hour; I can see it in the log
> messages. All the autovacuum configuration settings are disabled except
> the one that makes it run every hour. This application runs on an
> embedded box, so I can't change the parameters, as they affect the
> other applications running on it. Can you please explain what you mean
> by default parameters?
> autovacuum = on                         # enable autovacuum
> subprocess?
> autovacuum_naptime = 3600               # time between autovacuum runs, in
> secs

The default value for autovacuum_naptime is a minute.  Why would you
want to increase it by a factor of 60?  That seems likely to result in
I/O spikes, table bloat, and generally poor performance.

There are dramatic performance improvements in PostgreSQL 8.3 and 8.4.
Upgrading would probably help a lot.

The points already made about OFFSET <some huge value> are also right on target.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company