Re: PG vs ElasticSearch for Logs

From        | Chris Mair
Subject     | Re: PG vs ElasticSearch for Logs
Date        |
Msg-id      | 0b59e1fe90091337e7cda62c5dd9d112@smtp.hushmail.com
In reply to | Re: PG vs ElasticSearch for Logs (Thomas Güttler <guettliml@thomas-guettler.de>)
Responses   | Re: PG vs ElasticSearch for Logs (Thomas Güttler <guettliml@thomas-guettler.de>)
List        | pgsql-general
On 19/08/16 10:57, Thomas Güttler wrote:
>>> What do you think?
>>
>> I store most of my logs in flat textfiles syslog style, and use grep for adhoc querying.
>>
>> 200K rows/day, that's 1.4 million/week, 6 million/month, pretty soon you're talking big tables.
>>
>> in fact that's several rows/second on a 24/7 basis
>
> There is no need to store them more than 6 weeks in my current use case.

Hi,

to me this kind of data looks like something Postgres can handle with ease.
We're talking about 8.4M rows here.

Coincidentally, I was trying out the new parallel query feature in the 9.6
beta just now and decided to use your numbers as a test case :)

I can create 8.4M records having a timestamp and a random ~ 250 character
string in ~ 31 seconds:

pg96=# select now() + (sec / 200000.0 * 86400.0 || ' seconds')::interval as ts,
pg96-#        repeat(random()::text, 15) as msg
pg96-# into t1
pg96-# from generate_series(1, 6 * 7 * 200000) as sec;
SELECT 8400000
Time: 30858.274 ms

Table size is 2.4 GB. This gives about 6 weeks.

A query to scan the whole thing on the narrow column takes ~ 400 msec, like this:

pg96=# select min(ts), max(ts) from t1;
              min              |              max
-------------------------------+-------------------------------
 2016-08-19 20:17:24.921333+00 | 2016-09-30 20:17:24.489333+00
(1 row)

Time: 409.468 ms

Even running an unanchored regular expression (!) on the wider column is doable:

pg96=# select count(*) from t1 where msg ~ '12345';
 count
-------
   955
(1 row)

Time: 3146.838 ms

If you have some filter, not everything needs to be regexped and this gets
pretty fast:

pg96=# select count(*) from t1
pg96-# where ts between '2016-08-25' and '2016-08-26' and msg ~ '12345';
 count
-------
    24
(1 row)

Time: 391.577 ms

All this is without indices. Your data is more structured than my test, so
undoubtedly you will get some gain from indices...
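To make the point about indices concrete: for an append-only log table whose timestamps arrive in order, a BRIN index (available since 9.5) is a natural fit for the time-range filter above. This is a sketch against the test table t1 from this thread; the index name is my own invention:

```sql
-- BRIN indexes store per-block min/max summaries. For a table whose ts
-- values are inserted in roughly increasing order, the index is orders
-- of magnitude smaller than a btree and still lets range scans skip
-- most of the table's blocks.
create index t1_ts_brin on t1 using brin (ts);

-- The time-filtered query from above can now prune blocks outside the range:
select count(*)
from t1
where ts between '2016-08-25' and '2016-08-26'
  and msg ~ '12345';

-- Retention for the stated 6-week window could be a periodic delete
-- (a partitioned table with dropped partitions would avoid the bloat):
delete from t1 where ts < now() - interval '6 weeks';
```
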
Here is something more analytical - basically the same as the count(*) above:

pg96=# select ts::date, count(*)
pg96-# from t1
pg96-# where msg ~ '12345'
pg96-# group by ts::date
pg96-# order by ts::date;
     ts     | count
------------+-------
 2016-08-19 |    26
 2016-08-20 |    28
 [...]
 2016-09-28 |    21
 2016-09-29 |    33
(42 rows)

Time: 3157.010 ms

Note, however, that I'm using the 9.6 beta with the parallel query feature:
the sequential scan with the regexp is run in parallel on 6 workers... this
gives me a speed-up of a factor of 4-5 (the machine has 8 logical CPUs) and
the whole table fits in cache.

For a use case such as this, the parallel query feature in 9.6 is so good
it's almost like cheating ;)

Bye,
Chris.
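For reference, the 9.6 parallelism described here is governed by a per-session setting, and EXPLAIN ANALYZE shows whether the planner actually parallelized the scan. A sketch (the worker count of 6 matches the message above; whether the planner chooses a parallel plan depends on table size and costs):

```sql
-- Allow up to 6 workers per Gather node (the 9.6 default is lower).
set max_parallel_workers_per_gather = 6;

-- When the planner picks a parallel plan, the output shows a Gather node
-- with a "Workers Launched" line above a Parallel Seq Scan on t1.
explain analyze
select count(*) from t1 where msg ~ '12345';
```
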