Обсуждение: database design question, new to postgresql
Hello group, I need to design and develop a web reporting system to let users query/view syslog files on a unix host. For now, I am concentrating on the authentication file that has user logon (success/failure) and logoff records. The log file is logrotated every week or so. My reporting system parses the log entries and put the result into a postgresql database (I am proposing to use postgresql as the backend). Since this deals with multi-year archive and I believe 'partitioing' is an ideal feature to handle this problem. So here is the design scheme: CREATE TABLE logon_success( name varchar(32) not null, srcip inet not null, date date not null, time time not null, ... ); CREATE TABLE logon_success_yy${year}mm${month}( CHECK (date >= DATE '$year-$month-01' AND date < DATE '$next_year-$next_month-1') ) INHERITS ($tname) ; As you can see from the sample code, I am using perl to dynamically generate children tables as I parse log files in a daily cron job script. Once the log file is analyzed and archived in the database, I have a simple web UI that sysadmin can select and view user logon events. I have built a sample framework and it works so far. Keep in mind, this reporting system is not limited to just user logon, it should also work with system events such as services failures/startup, hardware failures, etc Now here are my questions: 1) Should I use database to implement such a reporting system? Are there any alternatives, architects, designs? 2) Is partitioning a good approach to speed up log query/view? The user comment in partitioning in pgsql manual seems to indicate partitioning may be slower than non-partitioned table under certain circumstances. 3) How to avoid repetitive log entry scanning since my cron job script is run daily but logrotate runs weekly? This means everytime my script will be parsing duplicate entries. 4) When parsing log files, it's quite possible that there are identical entries, for example a user logins really fast, resulting 2 or more identical entries..In this case can I still use primary key/index at all? If I can, how do I design primary key or index to speed up query? 3) What are the most glaring limitations and flaws in my design? Thank you for taking time to review and answer my questions! Let me know if I am not clear on any specific detail.. Fei
Fei Liu wrote: > Hello group, I need to design and develop a web reporting system to > let users query/view syslog files on a unix host. For now, I am > concentrating on the authentication file that has user logon > (success/failure) and logoff records. The log file is logrotated every > week or so. My reporting system parses the log entries and put the > result into a postgresql database (I am proposing to use postgresql as > the backend). Since this deals with multi-year archive and I believe > 'partitioing' is an ideal feature to handle this problem. So here is > the design scheme: > > CREATE TABLE logon_success( > name varchar(32) not null, > srcip inet not null, > date date not null, > time time not null, > ... > ); > > > CREATE TABLE logon_success_yy${year}mm${month}( > CHECK (date >= DATE '$year-$month-01' AND date < DATE > '$next_year-$next_month-1') > ) > INHERITS ($tname) > ; > > As you can see from the sample code, I am using perl to dynamically > generate children tables as I parse log files in a daily cron job > script. Once the log file is analyzed and archived in the database, I > have a simple web UI that sysadmin can select and view user logon > events. I have built a sample framework and it works so far. Keep in > mind, this reporting system is not limited to just user logon, it > should also work with system events such as services failures/startup, > hardware failures, etc > > Now here are my questions: > 1) Should I use database to implement such a reporting system? Are > there any alternatives, architects, designs? > 2) Is partitioning a good approach to speed up log query/view? The > user comment in partitioning in pgsql manual seems to indicate > partitioning may be slower than non-partitioned table under certain > circumstances. > 3) How to avoid repetitive log entry scanning since my cron job script > is run daily but logrotate runs weekly? This means everytime my script > will be parsing duplicate entries. > 4) When parsing log files, it's quite possible that there are > identical entries, for example a user logins really fast, resulting 2 > or more identical entries..In this case can I still use primary > key/index at all? If I can, how do I design primary key or index to > speed up query? > 3) What are the most glaring limitations and flaws in my design? > > Thank you for taking time to review and answer my questions! Let me > know if I am not clear on any specific detail.. > > Fei > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend Let me add one more question, what are the best approaches to analyze postgresql query performance and how to improve postgresql query performance? Fei
Fei Liu wrote: > Fei Liu wrote: >> Hello group, I need to design and develop a web reporting system to >> let users query/view syslog files on a unix host. For now, I am >> concentrating on the authentication file that has user logon >> (success/failure) and logoff records. The log file is logrotated >> every week or so. My reporting system parses the log entries and put >> the result into a postgresql database (I am proposing to use >> postgresql as the backend). Since this deals with multi-year archive >> and I believe 'partitioing' is an ideal feature to handle this >> problem. So here is the design scheme: >> >> CREATE TABLE logon_success( >> name varchar(32) not null, >> srcip inet not null, >> date date not null, >> time time not null, >> ... >> ); >> >> >> CREATE TABLE logon_success_yy${year}mm${month}( >> CHECK (date >= DATE '$year-$month-01' AND date < DATE >> '$next_year-$next_month-1') >> ) >> INHERITS ($tname) >> ; >> >> As you can see from the sample code, I am using perl to dynamically >> generate children tables as I parse log files in a daily cron job >> script. Once the log file is analyzed and archived in the database, I >> have a simple web UI that sysadmin can select and view user logon >> events. I have built a sample framework and it works so far. Keep in >> mind, this reporting system is not limited to just user logon, it >> should also work with system events such as services >> failures/startup, hardware failures, etc >> >> Now here are my questions: >> 1) Should I use database to implement such a reporting system? Are >> there any alternatives, architects, designs? >> 2) Is partitioning a good approach to speed up log query/view? The >> user comment in partitioning in pgsql manual seems to indicate >> partitioning may be slower than non-partitioned table under certain >> circumstances. >> 3) How to avoid repetitive log entry scanning since my cron job >> script is run daily but logrotate runs weekly? This means everytime >> my script will be parsing duplicate entries. >> 4) When parsing log files, it's quite possible that there are >> identical entries, for example a user logins really fast, resulting 2 >> or more identical entries..In this case can I still use primary >> key/index at all? If I can, how do I design primary key or index to >> speed up query? >> 3) What are the most glaring limitations and flaws in my design? >> >> Thank you for taking time to review and answer my questions! Let me >> know if I am not clear on any specific detail.. >> >> Fei >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 6: explain analyze is your friend > Let me add one more question, what are the best approaches to analyze > postgresql query performance and how to improve postgresql query > performance? > Fei My initial testing has not shown any significant difference between a partitioning approach and a plain (all entries in master) database approach... 2005-01-01 | 00:27:55 | firewood | ssh | Login Successful | None | local | user9819 | 192.168.1.31 My test was based on two artificial tables that has 1700 records per day from 2004-02-01 to 2007-04-27, around 2 million entries that are identical in both tables. My test script: echo Testing database $t1 time based time psql -p 5583 netilla postgres << EOF select count(date) from $t1 where date > '2005-03-01' and date < '2006-12-11'; \q EOF echo Testing database $t2 time based time psql -p 5583 netilla postgres << EOF select count(date) from $t2 where date > '2005-03-01' and date < '2006-12-11'; \q EOF Result: ./timing_test.sh Testing database logon_test time based count --------- 1121472 (1 row) 0.00user 0.00system 0:02.92elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+456minor)pagefaults 0swaps Testing database logon_test2 time based count --------- 1121472 (1 row) 0.00user 0.00system 0:02.52elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+456minor)pagefaults 0swaps But the numbers are really not static and logon_test2 (with partitioning) sometimes behave worse than logon_test... Fei