Re: [ADMIN] performance issue using DBI
От | nikolaus@dilger.cc |
---|---|
Тема | Re: [ADMIN] performance issue using DBI |
Дата | |
Msg-id | 20020606204319.4378.h015.c001.wm@mail.dilger.cc.criticalpath.net обсуждение исходный текст |
Список | pgsql-general |
Nicolas, The more data your tables contain the more work PostgreSQL needs to do. Adding indexes on your tables makes the INSERTS slower due to additional overhead. Indexes speed up SELECT but slow down INSERT. Therefore OLTP systems tend to have few indexes and data warehouses many. Instead of adding your data one line at a time I would use a staging area to speed things up as well as set operations instead of processing one item at a time in a loop. Your raw data obviously contains the following columns: remote_ip, phone_type, phone_number url, timestamp So create a table raw_data with the above columns. Truncate it before each data load. As first step of your daily processing load your raw data into the new table. Then find out all unique sessions within your raw data: CREATE TABLE unique_sessions AS SELECT DISTINCT remote_ip, phone_type, phone_number FROM raw_data; Now find all the new sessions: CREATE TABLE new_sessions AS SELECT remote_ip, phone_type, phone_number FROM unique_sessions MINUS SELECT remote_ip, phone_type, phone_number FROM sessions; Finally insert the new sessions into the sessions table: INSERT INTO sessions SELECT nextval('sessin_id'), remote_ip, phone_type, phone_number FROM new_sessions; Now insert into the actions table: INSERT INTO actions SELECT s.session_id, r.url, r.timestamp FROM raw_data r, sessions s WHERE s. remote_ip = r. remote_ip AND s.phone_type = r. phone_type AND s.phone_number =r. phone_number As the last step clean up the staging tables no longer needed: DROP TABLE unique_sessions; DROP TABLE new_sessions; TRUNCATE TABLE raw_data; Regards, Nikolaus On Thu, 06 June 2002, "Nicolas Nolst" wrote > > <html><div style='background-color:'><DIV></DIV> > <DIV></DIV>Hi all,<BR><BR>I have developped a perl script to populate a database with two tables: sessions<BR>and actions.<BR><BR>the table actions contains the following columns: session_id, url, timestamp.<BR>The column session_id references to the table sessions.<BR><BR>the table sessions contains the following columns: session_id, remote_ip,<BR>phone_type, phone_number. The column session_id is serial.<BR><BR>The lines of the table actions which are part of the same session have the same<BR>session_id.<BR><BR>There are then more lines in the table actions than in the table session.<BR><BR><BR>To fill the two tables, I first need to know if the session already exists for a<BR>certain phone_type, a certain remote_ip and a certain phone_number:<BR><BR>SELECT session_id FROM sessions WHERE (phone_number = ?) AND (remote_ip = ?)<BR>AND (phone_type = ?) ORDER BY session_id;<BR><BR>I also need to apply a criteria to know if I have to add a new entry in the<BR>table sessions or not:<BR><BR>SELECT (max(timestamp) + ?)<? FROM actions WHERE (session_id = ?);<BR><BR><BR><BR>If the session already exists I add a line in the table actions with a INSERT<BR><BR>If the session doesn't exist or if the criteria is true, I add a line in the<BR>table sessions with an INSERT and then add a line with a INSERT in the table actions (I use nextval and currval).<BR><BR>I have put indexes on sessions(session_id), sessions(msisdn),<BR>actions(session_id) and actions(timestamp). I process one log file of about 20000 lines every day. All the lines are processed in one transaction (autocommit set to 0).<BR><BR>My problem is that populating my database is slower when the data gets bigger<BR>and the performance falls dramatically. I thought that is would be improve with<BR>my indexes but the problem still persists.<BR><BR>Could you please give me some clues that could solve this issue.<BR><BR>Thanks.<BR><BR><BR><BR>Nicolas Nolst > <DIV></DIV> > <DIV></DIV><IMG src="http://www.3dflags/World/Gif/belgium_gs.gif"> > <DIV></DIV></div><br clear=all><hr>MSN Photos is the easiest way to share and print your photos: <a href='http://g.msn.com/1HM500901/157'>Click Here</a><br></html>
В списке pgsql-general по дате отправления: