Discussion: DB design advice


DB design advice

From: Toni Alfirević
Date:
Hi,

I'm in need of some advice, and here is my situation.
I've started putting things on paper, sketching my application, and everything
seems simple enough. My application will store a large amount of string-type
entries (let's call them web server logs for the sake of this mail), and by
"large amount" I mean a couple of gigabytes a day (it could be more, but at
first it will be several hundred megabytes a day).

Relevant data structure looks something like this:

Logs
-----------------
Id - bigserial
Value - text
Timestamp - timestamp
UserId - FK_Users_Id

Tags
-----------------
Id - bigserial
Value - text
Timestamp - timestamp
LogId - FK_Logs_Id

So, pretty simple stuff there. A log entry can have multiple related tags, and those logs are related to a user; nothing fancy or even complex.
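Spelled out as DDL, the structure above might look roughly like this (an
untested sketch; the users table and the not null choices are my assumptions):

create table logs (
    id bigserial primary key,
    value text not null,
    "timestamp" timestamp not null,
    user_id bigint not null references users (id)
);

create table tags (
    id bigserial primary key,
    value text not null,
    "timestamp" timestamp not null,
    log_id bigint not null references logs (id)
);

("timestamp" has to be quoted because it is also a type name, so a name like
ts or created_at would be more convenient.)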

The app should allow a user to search through his logs, filtered by one or more
tags and a time interval. Here is one more constraint: since those messages are
not really web server logs, all messages have the same relevance, and by "same
relevance" I mean that messages created 2 years ago are not less likely to be
searched for. This translates to the following situation: if this system is to
survive for, let's say, 3 years and the DB size were to reach 1 TB, the system
should still be able to process a search request within a couple of seconds (an
arbitrary number that should be as low as possible, and not high enough for
users to perceive the application as slow or sluggish).

The app will be used by, let's say, hundreds of users, possibly more.
Read/search speeds are rather important, and writes should be reliable; it would not be acceptable to lose a log or two every now and then.

Now that you have a general idea of what I plan to do, here is the actual advice I need.

I know it will seem like a pretty vague question, but what advice would you
give me at this point, before I start coding and prototyping my app? Let me be
more specific: I'm interested in things like index types on columns, or maybe
modifying the table relations. Maybe even experiences using search servers like
Solr or Sphinx, or something like PGStrom (http://wiki.postgresql.org/wiki/PGStrom).
I'm open to all suggestions, even to something like "Toni, you're stupid, you
should be using Cassandra for this, not PostgreSQL". And keep in mind that my
general concern is to design the DB and app layer so that it can handle TBs of
data at some point, so please try to give me advice that would result in a
performance increase.

Thank you,

--
Toni Alfirevic

Re: DB design advice

From: Sergey Konoplev
Date:
On Thu, Jun 6, 2013 at 2:29 AM, Toni Alfirević <toni.alfirevic@gmail.com> wrote:
> The app should allow a user to search through his logs, filtered by one or more
> tags and a time interval. Here is one more constraint: since those messages

Is it supposed to return all the data from this interval at once, or
to be implemented as a kind of pagination/slider that gets a limited
number of entries from the beginning of the interval, and then, after
the user requests more, gets the entries that come after the last one
from the previous batch, and so on?

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

Profile: http://www.linkedin.com/in/grayhemp
Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979
Skype: gray-hemp
Jabber: gray.ru@gmail.com


Re: DB design advice

From: Toni Alfirević
Date:
On 7. 6. 2013., at 07:39, Sergey Konoplev <gray.ru@gmail.com> wrote:

> On Thu, Jun 6, 2013 at 2:29 AM, Toni Alfirević <toni.alfirevic@gmail.com> wrote:
>> The app should allow a user to search through his logs, filtered by one or more
>> tags and a time interval. Here is one more constraint: since those messages
>
> Is it supposed to return all the data from this interval at once, or
> to be implemented as a kind of pagination/slider that gets a limited
> number of entries from the beginning of the interval, and then, after
> the user requests more, gets the entries that come after the last one
> from the previous batch, and so on?

Hi Sergey,

It will most likely return subsets of data, just like you said (pagination).

But there will be situations where all the data will need to be analyzed for,
let's call it, a report. On the other hand, those reports are not that
sensitive when it comes to performance. It won't be an issue if it takes a
minute, two or three to generate one.

Re: DB design advice

From: Sergey Konoplev
Date:
On Thu, Jun 6, 2013 at 11:43 PM, Toni Alfirević
<toni.alfirevic@gmail.com> wrote:
> It will most likely return subsets of data just like you said (pagination).

Okay. What do we have?

Table log (id bigserial primary key, ts timestamptz, value text,
user_id bigint) with a (user_id) foreign key.

Table tag (id bigserial primary key, value text).

Table log_tag (log_id bigint, log_ts timestamptz, log_user_id bigint,
tag_id bigint) with foreign keys (log_id, log_ts, log_user_id) and
(tag_id).

Index log_tag_pk (log_user_id, tag_id, log_ts, log_id) on log_tag.
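
In DDL this might look like the following (an untested sketch; I use
single-column foreign keys and assume a users table here, while a composite
(log_id, log_ts, log_user_id) key would additionally require a matching
unique constraint on log):

create table log (
    id bigserial primary key,
    ts timestamptz not null,
    value text,
    user_id bigint not null references users (id)
);

create table tag (
    id bigserial primary key,
    value text
);

create table log_tag (
    log_id bigint not null references log (id),
    log_ts timestamptz not null,
    log_user_id bigint not null references users (id),
    tag_id bigint not null references tag (id)
);

create index log_tag_pk on log_tag (log_user_id, tag_id, log_ts, log_id);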

If it is not so critical, you can make it without foreign keys to gain
some more performance, of course, but I wouldn't recommend that in the
early stages of your software, until it is mature and covered well
enough by tests that you can trust it won't break your data integrity.
You can always do this later.

We also have a plpgsql stored function get_log_by_tags(i_log_user_id
bigint, i_tag_ids bigint[], i_ts timestamptz, i_limit) returning a
set of log.

This function initializes an empty result array _result_logs log[].

Then it takes each _tag_id from i_tag_ids in a for loop.

For each tag it selects all log entries for i_log_user_id and _tag_id
inserted after or before i_ts, depending on the order you want to
output your log records in, limiting them by i_limit. Then it
concatenates these entries with the result array, orders the result
by ts and limits it by i_limit.

This query is supposed to be inside the for loop and represents the
paragraph above. I will use descending order as an example.

select into _result_logs array(
    select ll from (
        (
            select log.* from log_tag join log on log.id = log_id
            where log_user_id = i_log_user_id
            and tag_id = _tag_id and log_ts > i_ts
            order by log_ts desc limit i_limit
        ) union (
            select (l::log).* from unnest(_result_logs) as l
        )
    ) as ll
    order by ts desc limit i_limit
)

I did not check this query, so it might contain some typos, etc.

Finally, it returns the result of unnesting _result_logs.

select (l::log).* from unnest(_result_logs) as l

If you need to be able to go back and forth you can, for example,
add an i_back boolean parameter to the function and, according to its
value, choose either the log_ts > i_ts query with desc or the
log_ts < i_ts query with asc, using if/else.

That is it.
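
Putting it all together, the whole function might look like this
(untested as well; the integer type of i_limit and the composite-array
handling are my assumptions):

create or replace function get_log_by_tags(
    i_log_user_id bigint,
    i_tag_ids bigint[],
    i_ts timestamptz,
    i_limit integer
) returns setof log language plpgsql as $function$
declare
    -- the accumulated result, merged tag by tag
    _result_logs log[] := '{}';
    _tag_id bigint;
begin
    foreach _tag_id in array i_tag_ids loop
        select array(
            select ll from (
                (
                    -- the newest entries for this user and tag after i_ts
                    select log as ll
                    from log_tag join log on log.id = log_tag.log_id
                    where log_tag.log_user_id = i_log_user_id
                        and log_tag.tag_id = _tag_id
                        and log_tag.log_ts > i_ts
                    order by log_tag.log_ts desc
                    limit i_limit
                ) union (
                    -- what has been collected for the previous tags
                    select l as ll from unnest(_result_logs) as l
                )
            ) as s
            order by (ll).ts desc
            limit i_limit
        ) into _result_logs;
    end loop;

    return query select * from unnest(_result_logs);
end;
$function$;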

> But there will be situations where all the data will need to be analyzed
> for, let's call it, a report. On the other hand, those reports are not
> that sensitive when it comes to performance. It won't be an issue if it
> takes a minute, two or three to generate one.

Then you can use simple queries like this one.

select count(1) from log_tag
where
    log_user_id = 987 and
    tag_id in (123, 456) and
    log_ts between now() - '1 week'::interval and now()

It will be okay for these kinds of reports in most cases.

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

Profile: http://www.linkedin.com/in/grayhemp
Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979
Skype: gray-hemp
Jabber: gray.ru@gmail.com


Re: DB design advice

From: Toni Alfirević
Date:
On 8. 6. 2013., at 09:17, Sergey Konoplev <gray.ru@gmail.com> wrote:

> On Thu, Jun 6, 2013 at 11:43 PM, Toni Alfirević
> <toni.alfirevic@gmail.com> wrote:
> [...]

Sergey, you've been more than helpful. Thank you.

I'll take all of this into consideration.

At some point I'll give you beta access so you can check out what you helped create ;)

Cheers!
T.

Re: DB design advice

From: Sergey Konoplev
Date:
On Tue, Jun 11, 2013 at 2:48 AM, Toni Alfirević
<toni.alfirevic@gmail.com> wrote:
> Sergey, you've been more than helpful. Thank you.
> I'll take all of this into consideration.

Good to know.

If you want, I could check that everything is optimal once you finish
all the DDL and stored function(s) for this part of the schema.

> At one point I'll give you beta access so you can check out what you helped to create ;)

Thank you :)

--
Kind regards,
Sergey Konoplev
PostgreSQL Consultant and DBA

Profile: http://www.linkedin.com/in/grayhemp
Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979
Skype: gray-hemp
Jabber: gray.ru@gmail.com