Re: performance for high-volume log insertion

Поиск
Список
Период
Сортировка
От Kenneth Marshall
Тема Re: performance for high-volume log insertion
Дата
Msg-id 20090421133330.GZ18845@it.is.rice.edu
обсуждение исходный текст
Ответ на Re: performance for high-volume log insertion  (Richard Huxton <dev@archonet.com>)
Ответы Re: performance for high-volume log insertion  (david@lang.hm)
Список pgsql-performance
Hi,

I just finished reading this thread. We are currently working on
setting up a central log system using rsyslog and PostgreSQL. It
works well once we patched the memory leak. We also looked at what
could be done to improve the efficiency of the DB interface. On the
rsyslog side, moving to prepared queries allows you to remove the
escaping that needs to be done currently before attempting to
insert the data into the SQL backend as well as removing the parsing
and planning time from the insert. This is a big win for high insert
rates, which is what we are talking about. The escaping process is
also a big CPU user in rsyslog which then hands the escaped string
to the backend which then has to undo everything that had been done
and parse/plan the resulting query. This can use a surprising amount
of additional CPU. Even if you cannot support a general prepared
query interface, by specifying what the query should look like you
can handle much of the low-hanging fruit query-wise.

We are currently using a date based trigger to use a new partition
each day and keep 2 months of logs currently. This can be usefully
managed on the backend database, but if rsyslog supported changing
the insert to the new table on a time basis, the CPU used by the
trigger to support this on the backend could be reclaimed. This
would be a win for any DB backend. As you move to the new partition,
issuing a truncate to clear the table would simplify the DB interfaces.

Another performance enhancement already mentioned, would be to
allow certain extra fields in the DB to be automatically populated
as a function of the log messages. For example, logging the mail queue
id for messages from mail systems would make it much easier to locate
particular mail transactions in large amounts of data.

To sum up, eliminating the escaping in rsyslog through the use of
prepared queries would reduce the CPU load on the DB backend. Batching
the inserts will also net you a big performance increase. Some DB-based
applications allow for the specification of several types of queries,
one for single inserts and then a second to support multiple inserts
(copy). Rsyslog already supports the queuing pieces to allow you to
batch inserts. Just some ideas.

Regards,
Ken


On Tue, Apr 21, 2009 at 09:56:23AM +0100, Richard Huxton wrote:
> david@lang.hm wrote:
>> On Tue, 21 Apr 2009, Stephen Frost wrote:
>>> * david@lang.hm (david@lang.hm) wrote:
>>>> while I fully understand the 'benchmark your situation' need, this isn't
>>>> that simple.
>>>
>>> It really is.  You know your application, you know it's primary use
>>> cases, and probably have some data to play with.  You're certainly in a
>>> much better situation to at least *try* and benchmark it than we are.
>> rsyslog is a syslog server. it replaces (or for debian and fedora, has
>> replaced) your standard syslog daemon. it recieves log messages from every
>> app on your system (and possibly others), filters, maniulates them, and
>> then stores them somewhere. among the places that it can store the logs
>> are database servers (native support for MySQL, PostgreSQL, and Oracle.
>> plus libdbi for others)
>
> Well, from a performance standpoint the obvious things to do are:
> 1. Keep a connection open, do NOT reconnect for each log-statement
> 2. Batch log statements together where possible
> 3. Use prepared statements
> 4. Partition the tables by day/week/month/year (configurable I suppose)
>
> The first two are vital, the third takes you a step further. The fourth is
> a long-term admin thing.
>
> And possibly
> 5. Have two connections, one for fatal/error etc and one for info/debug
> level log statements (configurable split?). Then you can use the
> synchronous_commit setting on the less important ones. Might buy you some
> performance on a busy system.
>
> http://www.postgresql.org/docs/8.3/interactive/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS
>
>> other apps then search and report on the data after it is stored. what
>> apps?, I don't know either. pick your favorite reporting tool and you'll
>> be a step ahead of me (I don't know a really good reporting tool)
>> as for sample data, you have syslog messages, just like I do. so you have
>> the same access to data that I have.
>> how would you want to query them? how would people far less experianced
>> that you want to query them?
>> I can speculate that some people would do two columns (time, everything
>> else), others will do three (time, server, everything else), and others
>> will go further (I know some who would like to extract IP addresses
>> embedded in a message into their own column). some people will index on
>> the time and host, others will want to do full-text searches of
>> everything.
>
> Well, assuming it looks much like traditional syslog, I would do something
> like: (timestamp, host, facility, priority, message). It's easy enough to
> stitch back together if people want that.
>
> PostgreSQL's full-text indexing is quite well suited to logfiles I'd have
> thought, since it knows about filenames, urls etc already.
>
> If you want to get fancy, add a msg_type column and one subsidiary table
> for each msg_type. So - you might have smtp_connect_from (hostname,
> ip_addr). A set of perl regexps can match and extract the fields for these
> extra tables, or you could do it with triggers inside the database. I think
> it makes sense to do it in the application. Easier for users to contribute
> new patterns/extractions. Meanwhile, the core table is untouched so you
> don't *need* to know about these extra tables.
>
> If you have subsidiary tables, you'll want to partition those too and
> perhaps stick them in their own schema (logs200901, logs200902 etc).
>
> --
>   Richard Huxton
>   Archonet Ltd
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

В списке pgsql-performance по дате отправления:

Предыдущее
От: Matthew Wakeling
Дата:
Сообщение: Re: GiST index performance
Следующее
От: "John L. Clark"
Дата:
Сообщение: WHERE condition not being pushed down to union parts