Discussion: efficient data reduction (and deduping)

efficient data reduction (and deduping)

From:
Alessandro Gagliardi
Date:
Hi folks,

I have a system that racks up about 40M log lines per day. I'm able to COPY the log files into a PostgreSQL table that looks like this:

CREATE TABLE activity_unlogged
(
  user_id character(24) NOT NULL,
  client_ip inet,
  hr_timestamp timestamp without time zone,
  locale character varying,
  log_id character(36),
  method character varying(6),
  server_ip inet,
  uri character varying,
  user_agent character varying
)

Now, I want to reduce that data to get the last activity that was performed by each user in any given hour. It should fit into a table like this:

CREATE TABLE hourly_activity
(
  activity_hour timestamp without time zone NOT NULL,
  user_id character(24) NOT NULL,
  client_ip inet,
  hr_timestamp timestamp without time zone,
  locale character varying,
  log_id character(36),
  method character varying(6),
  server_ip inet,
  uri character varying,
  user_agent character varying,
  CONSTRAINT hourly_activity_pkey PRIMARY KEY (activity_hour , user_id )
)

where activity_hour is date_trunc('hour', hr_timestamp); (N.B. the primary key constraint) 

I am attempting to do that with the following:

INSERT INTO hourly_activity 
    SELECT DISTINCT date_trunc('hour', hr_timestamp) AS activity_hour, activity_unlogged.user_id,
                    client_ip, hr_timestamp, locale, log_id, method, server_ip, uri, user_agent 
        FROM activity_unlogged, 
            (SELECT user_id, MAX(hr_timestamp) AS last_timestamp
                FROM activity_unlogged GROUP BY user_id, date_trunc('hour', hr_timestamp)) AS last_activity 
    WHERE activity_unlogged.user_id = last_activity.user_id AND activity_unlogged.hr_timestamp = last_activity.last_timestamp;

I have two problems:
  1. It's incredibly slow (like: hours). I assume this is because I am scanning through a huge unindexed table twice. I imagine there is a more efficient way to do this, but I can't think of what it is. If I were doing this in a procedural programming language, it might look something like:
    for row in activity_unlogged:
        if (date_trunc('hour', hr_timestamp), user_id) in hourly_activity[(activity_hour, user_id)]:
            if hr_timestamp > hourly_activity[(date_trunc('hour', hr_timestamp), user_id)][hr_timestamp]:
                hourly_activity <- row # UPDATE
        else:
            hourly_activity <- row # INSERT
    I suspect some implementation of this (hopefully my pseudocode is at least somewhat comprehensible) would be very slow as well, but at least it would only go through activity_unlogged once. (Then again, it would have to rescan hourly_activity each time, so it really wouldn't be any faster at all, would it?) I feel like there must be a more efficient way to do this in SQL though I can't put my finger on it.
  2. Turns out (hr_timestamp, user_id) is not unique. So selecting WHERE activity_unlogged.user_id = last_activity.user_id AND activity_unlogged.hr_timestamp = last_activity.last_timestamp yields multiple rows for the same user and hour, which collides with the primary key. In such cases, I don't really care which of the rows is picked, I just want to make sure that no more than one row is inserted per user per hour. In fact, though I would prefer to get the last row for each hour, I could probably get much the same effect if I just limited it to one per hour. Though I don't know if that really helps at all.
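For reference, both problems can be handled in a single pass with PostgreSQL's DISTINCT ON, which keeps exactly one row per group (the first under the ORDER BY, with ties broken arbitrarily). A sketch, not taken from the thread:

INSERT INTO hourly_activity
SELECT DISTINCT ON (date_trunc('hour', hr_timestamp), user_id)
       date_trunc('hour', hr_timestamp) AS activity_hour,
       user_id, client_ip, hr_timestamp, locale, log_id,
       method, server_ip, uri, user_agent
FROM activity_unlogged
ORDER BY date_trunc('hour', hr_timestamp), user_id, hr_timestamp DESC;
-- DISTINCT ON keeps the first row of each (hour, user) group;
-- ordering by hr_timestamp DESC makes that first row the latest one.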

Re: efficient data reduction (and deduping)

From:
Claudio Freire
Date:
On Thu, Mar 1, 2012 at 3:27 PM, Alessandro Gagliardi
<alessandro@path.com> wrote:
> INSERT INTO hourly_activity
>     SELECT DISTINCT date_trunc('hour', hr_timestamp) AS activity_hour,
> activity_unlogged.user_id,
>                     client_ip, hr_timestamp, locale, log_id, method,
> server_ip, uri, user_agent
>         FROM activity_unlogged,
>             (SELECT user_id, MAX(hr_timestamp) AS last_timestamp
>                 FROM activity_unlogged GROUP BY user_id, date_trunc('hour',
> hr_timestamp)) AS last_activity
>     WHERE activity_unlogged.user_id = last_activity.user_id AND
> activity_unlogged.hr_timestamp = last_activity.last_timestamp;

Try

INSERT INTO hourly_activity
SELECT ... everything from au1 ...
FROM activity_unlogged au1
LEFT JOIN activity_unlogged au2
       ON au2.user_id = au1.user_id
      AND date_trunc('hour', au2.hr_timestamp) = date_trunc('hour', au1.hr_timestamp)
      AND au2.hr_timestamp < au1.hr_timestamp
WHERE au2.user_id IS NULL;

Re: efficient data reduction (and deduping)

From:
Craig James
Date:
On Thu, Mar 1, 2012 at 10:27 AM, Alessandro Gagliardi
<alessandro@path.com> wrote:
> Hi folks,
>
> I have a system that racks up about 40M log lines per day. I'm able to COPY
> the log files into a PostgreSQL table that looks like this:

Since you're using a COPY command and the table has a simple column
with exactly the value you want, why not filter it using grep(1) or
something similar and load the filtered result directly into the
hourly table?

Craig


Re: efficient data reduction (and deduping)

From:
Peter van Hardenberg
Date:
On Thu, Mar 1, 2012 at 10:27 AM, Alessandro Gagliardi
<alessandro@path.com> wrote:
> Now, I want to reduce that data to get the last activity that was performed
> by each user in any given hour. It should fit into a table like this:
>

How about:

1) Create an expression based index on date_trunc('hour', hr_timestamp)
2) Create a view on that showing the last value
3) If you want to throw away the data use CREATE TABLE AS on the
results of the view.

You may also want to investigate window functions.
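For reference, a sketch of the window-function route; the row_number() tie-breaking and the DESC ordering are assumptions here, not something spelled out in the thread:

INSERT INTO hourly_activity
SELECT activity_hour, user_id, client_ip, hr_timestamp, locale,
       log_id, method, server_ip, uri, user_agent
FROM (SELECT date_trunc('hour', hr_timestamp) AS activity_hour,
             user_id, client_ip, hr_timestamp, locale, log_id,
             method, server_ip, uri, user_agent,
             row_number() OVER (PARTITION BY user_id, date_trunc('hour', hr_timestamp)
                                ORDER BY hr_timestamp DESC) AS rn
      FROM activity_unlogged) ranked
WHERE rn = 1;
-- row_number() assigns a unique rank even when timestamps tie,
-- so exactly one row per (user, hour) survives the rn = 1 filter.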

-p

--
Peter van Hardenberg
San Francisco, California
"Everything was beautiful, and nothing hurt." -- Kurt Vonnegut

Re: efficient data reduction (and deduping)

From:
"Kevin Grittner"
Date:
Alessandro Gagliardi <alessandro@path.com> wrote:

>   hr_timestamp timestamp without time zone,

In addition to the responses which more directly answer your
question, I feel I should point out that this will not represent a
single moment in time.  At the end of Daylight Saving Time, the
value will jump backward and you will run through a range of time
which will overlap existing entries.  There is almost never a good
reason to use TIMESTAMP WITHOUT TIME ZONE -- TIMESTAMP WITH TIME
ZONE is required if you want the value to represent a moment in
time.
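
To illustrate the hazard (the example assumes the 2012 US DST change; 'PST' and 'PDT' are in PostgreSQL's default abbreviation set):

-- With WITHOUT TIME ZONE, the 1:30 AM that occurs twice when DST ends
-- collapses into a single ambiguous value:
SELECT '2012-11-04 01:30:00'::timestamp;

-- With WITH TIME ZONE, the two moments stay distinct:
SELECT '2012-11-04 01:30:00 PDT'::timestamptz,  -- 08:30 UTC
       '2012-11-04 01:30:00 PST'::timestamptz;  -- 09:30 UTC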

-Kevin

Re: efficient data reduction (and deduping)

From:
Alessandro Gagliardi
Date:
I was thinking of adding an index, but thought it would be pointless since I would only be using the index once before dropping the table (after it's loaded into hourly_activity). I assumed it would take longer to create the index and then use it than to just seq scan once or twice. Am I wrong in that assumption?
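
For reference, the index in question might look like the following (a sketch; the index name and the leading user_id column are assumptions). Whether building it once beats a couple of sequential scans over 40M rows depends on the hardware; an EXPLAIN ANALYZE on a day's sample would settle it:

CREATE INDEX activity_unlogged_user_hour_idx
    ON activity_unlogged (user_id, date_trunc('hour', hr_timestamp), hr_timestamp DESC);
-- date_trunc('hour', ...) is immutable for timestamp without time zone,
-- so it is legal in an index expression.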

On Thu, Mar 1, 2012 at 10:40 AM, Peter van Hardenberg <pvh@pvh.ca> wrote:
On Thu, Mar 1, 2012 at 10:27 AM, Alessandro Gagliardi
<alessandro@path.com> wrote:
> Now, I want to reduce that data to get the last activity that was performed
> by each user in any given hour. It should fit into a table like this:
>

How about:

1) Create an expression based index on date_trunc('hour', hr_timestamp)
2) Create a view on that showing the last value
3) If you want to throw away the data use CREATE TABLE AS on the
results of the view.

You may also want to investigate window functions.

-p

--
Peter van Hardenberg
San Francisco, California
"Everything was beautiful, and nothing hurt." -- Kurt Vonnegut

Re: efficient data reduction (and deduping)

From:
Alessandro Gagliardi
Date:
All of our servers run in UTC specifically to avoid this sort of problem. It's kind of annoying actually, because we're a San Francisco company and so whenever I have to do daily analytics, I have to shift everything to Pacific. But in this case it's handy. Thanks for the keen eye though.
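
As an aside, that shift can be done inline at query time; a sketch, assuming hr_timestamp holds UTC wall-clock values as described above:

-- interpret the naive timestamp as UTC, then render it in Pacific time
SELECT date_trunc('day',
         hr_timestamp AT TIME ZONE 'UTC' AT TIME ZONE 'America/Los_Angeles')
         AS pacific_day,
       count(*)
FROM activity_unlogged
GROUP BY 1
ORDER BY 1;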

On Thu, Mar 1, 2012 at 10:51 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
Alessandro Gagliardi <alessandro@path.com> wrote:

>   hr_timestamp timestamp without time zone,

In addition to the responses which more directly answer your
question, I feel I should point out that this will not represent a
single moment in time.  At the end of Daylight Saving Time, the
value will jump backward and you will run through a range of time
which will overlap existing entries.  There is almost never a good
reason to use TIMESTAMP WITHOUT TIME ZONE -- TIMESTAMP WITH TIME
ZONE is required if you want the value to represent a moment in
time.

-Kevin

Re: efficient data reduction (and deduping)

From:
Alessandro Gagliardi
Date:
Hah! Yeah, that might work. Except that I suck at grep. :(
Perhaps that's a weakness I should remedy.

On Thu, Mar 1, 2012 at 10:35 AM, Craig James <cjames@emolecules.com> wrote:
On Thu, Mar 1, 2012 at 10:27 AM, Alessandro Gagliardi
<alessandro@path.com> wrote:
> Hi folks,
>
> I have a system that racks up about 40M log lines per day. I'm able to COPY
> the log files into a PostgreSQL table that looks like this:

Since you're using a COPY command and the table has a simple column
with exactly the value you want, why not filter it using grep(1) or
something similar and load the filtered result directly into the
hourly table?

Craig


Re: efficient data reduction (and deduping)

From:
Alessandro Gagliardi
Date:
Interesting solution. If I'm not mistaken, this does solve the problem of having two entries for the same user at the exact same time (which violates my pk constraint) but it does so by leaving both of them out (since there is no au1.hr_timestamp > au2.hr_timestamp in that case). Is that right?

On Thu, Mar 1, 2012 at 10:35 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
Try

INSERT INTO hourly_activity
SELECT ... everything from au1 ...
FROM activity_unlogged au1
LEFT JOIN activity_unlogged au2
       ON au2.user_id = au1.user_id
      AND date_trunc('hour', au2.hr_timestamp) = date_trunc('hour', au1.hr_timestamp)
      AND au2.hr_timestamp < au1.hr_timestamp
WHERE au2.user_id IS NULL;

Re: efficient data reduction (and deduping)

From:
Claudio Freire
Date:
On Thu, Mar 1, 2012 at 4:35 PM, Alessandro Gagliardi
<alessandro@path.com> wrote:
> Interesting solution. If I'm not mistaken, this does solve the problem of
> having two entries for the same user at the exact same time (which violates
> my pk constraint) but it does so by leaving both of them out (since there is
> no au1.hr_timestamp > au2.hr_timestamp in that case). Is that right?

Yes, but it would have to be the same *exact* time (not the same hour).

You can use more fields to disambiguate too, i.e.:

au1.hr_timestamp > au2.hr_timestamp OR (au1.hr_timestamp =
au2.hr_timestamp AND au1.some_other_field > au2.some_other_field)

If you have a sequential id to use for disambiguation, that would be best.

Re: efficient data reduction (and deduping)

From:
Claudio Freire
Date:
On Thu, Mar 1, 2012 at 4:39 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Interesting solution. If I'm not mistaken, this does solve the problem of
>> having two entries for the same user at the exact same time (which violates
>> my pk constraint) but it does so by leaving both of them out (since there is
>> no au1.hr_timestamp > au2.hr_timestamp in that case). Is that right?
>
> Yes, but it would have to be the same *exact* time (not the same hour).
>
> You can use more fields to disambiguate too, i.e.:
>
> au1.hr_timestamp > au2.hr_timestamp OR (au1.hr_timestamp =
> au2.hr_timestamp AND au1.some_other_field > au2.some_other_field)
>
> If you have a sequential id to use for disambiguation, that would be best.

Sorry for double posting - but you can also *generate* such an identifier:

create sequence temp_seq;

WITH identified_au AS (
    SELECT nextval('temp_seq') AS id, *
    FROM activity_unlogged
)
INSERT INTO hourly_activity
SELECT ... everything from au1 ...
FROM identified_au au1
LEFT JOIN identified_au au2
       ON au2.user_id = au1.user_id
      AND date_trunc('hour', au2.hr_timestamp) = date_trunc('hour', au1.hr_timestamp)
      AND (au2.hr_timestamp < au1.hr_timestamp
           OR (au2.hr_timestamp = au1.hr_timestamp AND au2.id < au1.id))
WHERE au2.user_id IS NULL;

Should work if you have 9.1 or later, where a WITH clause can be attached to INSERT.
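
A variant that avoids creating the sequence, generating the tie-breaking id inside the CTE with row_number(); a sketch with the column lists filled in from the table definitions above:

WITH identified_au AS (
    SELECT row_number() OVER () AS id,   -- arbitrary but unique per row
           user_id, client_ip, hr_timestamp, locale, log_id,
           method, server_ip, uri, user_agent
    FROM activity_unlogged
)
INSERT INTO hourly_activity
SELECT date_trunc('hour', au1.hr_timestamp), au1.user_id, au1.client_ip,
       au1.hr_timestamp, au1.locale, au1.log_id, au1.method,
       au1.server_ip, au1.uri, au1.user_agent
FROM identified_au au1
LEFT JOIN identified_au au2
       ON au2.user_id = au1.user_id
      AND date_trunc('hour', au2.hr_timestamp) = date_trunc('hour', au1.hr_timestamp)
      AND (au2.hr_timestamp < au1.hr_timestamp
           OR (au2.hr_timestamp = au1.hr_timestamp AND au2.id < au1.id))
WHERE au2.user_id IS NULL;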

Re: efficient data reduction (and deduping)

From:
Alessandro Gagliardi
Date:
Ah, yes, that makes sense. Thank you!

On Thu, Mar 1, 2012 at 11:39 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Thu, Mar 1, 2012 at 4:35 PM, Alessandro Gagliardi
<alessandro@path.com> wrote:
> Interesting solution. If I'm not mistaken, this does solve the problem of
> having two entries for the same user at the exact same time (which violates
> my pk constraint) but it does so by leaving both of them out (since there is
> no au1.hr_timestamp > au2.hr_timestamp in that case). Is that right?

Yes, but it would have to be the same *exact* time (not the same hour).

You can use more fields to disambiguate too, i.e.:

au1.hr_timestamp > au2.hr_timestamp OR (au1.hr_timestamp =
au2.hr_timestamp AND au1.some_other_field > au2.some_other_field)

If you have a sequential id to use for disambiguation, that would be best.

Re: efficient data reduction (and deduping)

From:
"Kevin Grittner"
Date:
Alessandro Gagliardi <alessandro@path.com> wrote:

> All of our servers run in UTC specifically to avoid this sort of
> problem. It's kind of annoying actually, because we're a San
> Francisco company and so whenever I have to do daily analytics, I
> have to shift everything to Pacific. But in this case it's handy.

If that's working for you, you might just want to leave it alone;
but just so you know:

If you declare the column as TIMESTAMP WITH TIME ZONE, it stores
the timestamp as UTC regardless of what your server's definition of
time zone is.  (It doesn't actually store the time zone -- it just
normalizes the time into UTC for storage.)  On retrieval it shows
that UTC moment in the local timezone.  So, all that work you're
doing to switch the time zone info around would be pretty automatic
if you used the other type.
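
A small demonstration of that behavior:

CREATE TEMP TABLE ts_demo (ts timestamptz);
INSERT INTO ts_demo VALUES ('2012-03-01 18:27:00+00');

SET timezone = 'UTC';
SELECT ts FROM ts_demo;   -- 2012-03-01 18:27:00+00

SET timezone = 'America/Los_Angeles';
SELECT ts FROM ts_demo;   -- 2012-03-01 10:27:00-08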

-Kevin