Re: performance for high-volume log insertion

From: Richard Huxton
Subject: Re: performance for high-volume log insertion
Date:
Msg-id: 49ED8A37.4030509@archonet.com
In reply to: Re: performance for high-volume log insertion (david@lang.hm)
Responses: Re: performance for high-volume log insertion (Kenneth Marshall <ktm@rice.edu>)
List: pgsql-performance
david@lang.hm wrote:
> On Tue, 21 Apr 2009, Stephen Frost wrote:
>
>> * david@lang.hm (david@lang.hm) wrote:
>>> while I fully understand the 'benchmark your situation' need, this isn't
>>> that simple.
>>
>> It really is.  You know your application, you know its primary use
>> cases, and probably have some data to play with.  You're certainly in a
>> much better situation to at least *try* and benchmark it than we are.
>
> rsyslog is a syslog server. it replaces (or for debian and fedora, has
> replaced) your standard syslog daemon. it receives log messages from
> every app on your system (and possibly others), filters, manipulates
> them, and then stores them somewhere. among the places that it can store
> the logs are database servers (native support for MySQL, PostgreSQL, and
> Oracle, plus libdbi for others)

Well, from a performance standpoint the obvious things to do are:
1. Keep a connection open, do NOT reconnect for each log-statement
2. Batch log statements together where possible
3. Use prepared statements
4. Partition the tables by day/week/month/year (configurable I suppose)

The first two are vital, the third takes you a step further. The fourth
is a long-term admin thing.
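To make points 2 and 3 concrete, here is a sketch of a prepared insert
driven inside one transaction. The table and column names (logs, ts,
host, etc.) are illustrative, not anything rsyslog defines:

```sql
-- Prepare once per connection, then reuse for every log line.
PREPARE log_insert (timestamptz, text, text, text, text) AS
  INSERT INTO logs (ts, host, facility, priority, message)
  VALUES ($1, $2, $3, $4, $5);

-- Batching: wrap many EXECUTEs in one transaction so they share a
-- single WAL flush, instead of paying one fsync per log line.
BEGIN;
EXECUTE log_insert (now(), 'web1', 'postfix', 'info', 'connect from client');
EXECUTE log_insert (now(), 'web1', 'sshd',    'err',  'authentication failure');
COMMIT;
```

A client driver's parameterised-query API achieves the same thing as
PREPARE/EXECUTE without hand-written SQL, but the principle is identical.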

And possibly
5. Have two connections, one for fatal/error etc and one for info/debug
level log statements (configurable split?). Then you can use the
synchronous_commit setting on the less important ones. Might buy you
some performance on a busy system.

http://www.postgresql.org/docs/8.3/interactive/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS
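For point 5, the split would look something like this (run on the
low-priority connection only):

```sql
-- On the connection carrying info/debug traffic:
SET synchronous_commit = off;  -- COMMIT returns before the WAL reaches disk

-- The fatal/error connection keeps the default (on), so the important
-- entries survive a crash; the worst case for the other connection is
-- losing the last few hundred milliseconds of low-priority messages.
```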

> other apps then search and report on the data after it is stored. what
> apps?, I don't know either. pick your favorite reporting tool and you'll
> be a step ahead of me (I don't know a really good reporting tool)
>
> as for sample data, you have syslog messages, just like I do. so you
> have the same access to data that I have.
>
> how would you want to query them? how would people far less experienced
> than you want to query them?
>
> I can speculate that some people would do two columns (time, everything
> else), others will do three (time, server, everything else), and others
> will go further (I know some who would like to extract IP addresses
> embedded in a message into their own column). some people will index on
> the time and host, others will want to do full-text searches of everything.

Well, assuming it looks much like traditional syslog, I would do
something like: (timestamp, host, facility, priority, message). It's
easy enough to stitch back together if people want that.
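As a sketch, that core table might be (column names are my guesses, not
a fixed schema):

```sql
CREATE TABLE logs (
    ts       timestamptz NOT NULL,  -- time the message was received
    host     text        NOT NULL,  -- originating host
    facility text,                  -- syslog facility (mail, auth, ...)
    priority text,                  -- syslog severity (info, err, ...)
    message  text                   -- everything else, unparsed
);
```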

PostgreSQL's full-text indexing is quite well suited to logfiles I'd
have thought, since it knows about filenames, urls etc already.
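A full-text setup on the hypothetical message column could be as simple
as an expression index with the built-in english configuration:

```sql
-- GIN expression index; the default parser already tokenises
-- filenames, URLs and email addresses sensibly.
CREATE INDEX logs_message_fts ON logs
  USING gin (to_tsvector('english', message));

-- Query must use the same to_tsvector() expression to hit the index.
SELECT ts, host, message
  FROM logs
 WHERE to_tsvector('english', message)
       @@ to_tsquery('english', 'connection & refused');
```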

If you want to get fancy, add a msg_type column and one subsidiary table
for each msg_type. So - you might have smtp_connect_from (hostname,
ip_addr). A set of perl regexps can match and extract the fields for
these extra tables, or you could do it with triggers inside the
database. I think it makes sense to do it in the application. Easier for
users to contribute new patterns/extractions. Meanwhile, the core table
is untouched so you don't *need* to know about these extra tables.
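A subsidiary table for the smtp_connect_from example might look like
this (the log_id link column is an assumption; the thread doesn't
specify how rows are keyed back to the core table):

```sql
-- One narrow table per msg_type, populated by the application's
-- regexp matchers after the core row is inserted.
CREATE TABLE smtp_connect_from (
    log_id   bigint NOT NULL,  -- identifies the core logs row
    hostname text,
    ip_addr  inet
);
```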

If you have subsidiary tables, you'll want to partition those too and
perhaps stick them in their own schema (logs200901, logs200902 etc).
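On 8.3, partitioning means table inheritance plus CHECK constraints, so
a monthly partition in its own schema would be roughly (names
illustrative, assuming the core table sketched above):

```sql
CREATE SCHEMA logs200901;

-- Child table holding exactly one month; the CHECK constraint lets
-- constraint_exclusion skip it for queries outside that range.
CREATE TABLE logs200901.logs (
    CHECK (ts >= '2009-01-01' AND ts < '2009-02-01')
) INHERITS (public.logs);
```

You would also need a trigger or application logic to route inserts to
the right child, and constraint_exclusion = on in postgresql.conf.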

--
   Richard Huxton
   Archonet Ltd
