Обсуждение: Large Table - Slow Window Functions (Better Approach?)

Поиск
Список
Период
Сортировка

Large Table - Slow Window Functions (Better Approach?)

От
Jeff Adams - NOAA Affiliate
Дата:

Greetings,

 

I have a large table (~90 million rows) containing vessel positions. In addition to a column that contains the location information (the_geom), the table also contains two columns that are used to uniquely identify the vessel (mmsi and name) and a column containing the Unix time (epoch) at which the position information was logged. I frequently need to assign records to vessel transits. To do this, I currently create a CTE that uses a Window function (partitioning the data by mmsi and name ordered by epoch) to examine the time that has elapsed between successive position reports for individual vessels. For every position record for a vessel (as identified using mmsi and name), if the time elapsed between the current position record and the previous record (using the lag function) is less than or equal to 2 hours, I assign the record a value of 0 to a CTE column named tr_index. If the time elapsed is greater than 2 hours, I assign the record a value of 1 to the tr_index column. I then use the CTE to generate transit numbers by summing the values in the tr_index field across a Window that also partitions the data by mmsi and name and is ordered by epoch. This works, but is very slow (hours). The table is indexed (multi-column index on mmsi, name and index on epoch). Does anyone see a way to get what I am after in a more efficient manner. What I am after is an assignment of transit number to vessels' position records based on whether the records were within two hours of each other. The SQL that I used is provided below. Any advice would be greatly appreciated...

 

WITH

cte_01 AS

(

SELECT

a.id,

a.mmsi,

a.name,

a.epoch,

a.the_geom

CASE

  WHEN ((a.epoch - lag(a.epoch) OVER w) / 60) > 120 THEN 1

  ELSE 0

END AS tr_index 

FROM table a

WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)

)

 

 

SELECT

a.id,

a.mmsi,

a.name,

a.epoch,

a.the_geom,

1 + sum(a.tr_index) OVER w AS transit,

a.active

FROM cte_01 a

WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)

--
Jeff

Re: Large Table - Slow Window Functions (Better Approach?)

От
Pavel Stehule
Дата:
Hello

you can try procedural solution - use a cursor over ordered data in
plpgsql and returns table

Regards

Pavel Stehule

2013/3/11 Jeff Adams - NOAA Affiliate <jeff.adams@noaa.gov>:
> Greetings,
>
>
>
> I have a large table (~90 million rows) containing vessel positions. In
> addition to a column that contains the location information (the_geom), the
> table also contains two columns that are used to uniquely identify the
> vessel (mmsi and name) and a column containing the Unix time (epoch) at
> which the position information was logged. I frequently need to assign
> records to vessel transits. To do this, I currently create a CTE that uses a
> Window function (partitioning the data by mmsi and name ordered by epoch) to
> examine the time that has elapsed between successive position reports for
> individual vessels. For every position record for a vessel (as identified
> using mmsi and name), if the time elapsed between the current position
> record and the previous record (using the lag function) is less than or
> equal to 2 hours, I assign the record a value of 0 to a CTE column named
> tr_index. If the time elapsed is greater than 2 hours, I assign the record a
> value of 1 to the tr_index column. I then use the CTE to generate transit
> numbers by summing the values in the tr_index field across a Window that
> also partitions the data by mmsi and name and is ordered by epoch. This
> works, but is very slow (hours). The table is indexed (multi-column index on
> mmsi, name and index on epoch). Does anyone see a way to get what I am after
> in a more efficient manner. What I am after is an assignment of transit
> number to vessels' position records based on whether the records were within
> two hours of each other. The SQL that I used is provided below. Any advice
> would be greatly appreciated...
>
>
>
> WITH
>
> cte_01 AS
>
> (
>
> SELECT
>
> a.id,
>
> a.mmsi,
>
> a.name,
>
> a.epoch,
>
> a.the_geom
>
> CASE
>
>   WHEN ((a.epoch - lag(a.epoch) OVER w) / 60) > 120 THEN 1
>
>   ELSE 0
>
> END AS tr_index
>
> FROM table a
>
> WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
>
> )
>
>
>
>
>
> SELECT
>
> a.id,
>
> a.mmsi,
>
> a.name,
>
> a.epoch,
>
> a.the_geom,
>
> 1 + sum(a.tr_index) OVER w AS transit,
>
> a.active
>
> FROM cte_01 a
>
> WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
>
>
>
> --
> Jeff


Re: Large Table - Slow Window Functions (Better Approach?)

От
Jeff Adams - NOAA Affiliate
Дата:
Pavel,

Thanks for the response. I have not yet had the opportunity to use cursors, but am now curious. Could you perhaps provide a bit more detail as to what the implementation of your suggested approach would look like?

On Mon, Mar 11, 2013 at 11:03 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
Hello

you can try procedural solution - use a cursor over ordered data in
plpgsql and returns table

Regards

Pavel Stehule

2013/3/11 Jeff Adams - NOAA Affiliate <jeff.adams@noaa.gov>:
> Greetings,
>
>
>
> I have a large table (~90 million rows) containing vessel positions. In
> addition to a column that contains the location information (the_geom), the
> table also contains two columns that are used to uniquely identify the
> vessel (mmsi and name) and a column containing the Unix time (epoch) at
> which the position information was logged. I frequently need to assign
> records to vessel transits. To do this, I currently create a CTE that uses a
> Window function (partitioning the data by mmsi and name ordered by epoch) to
> examine the time that has elapsed between successive position reports for
> individual vessels. For every position record for a vessel (as identified
> using mmsi and name), if the time elapsed between the current position
> record and the previous record (using the lag function) is less than or
> equal to 2 hours, I assign the record a value of 0 to a CTE column named
> tr_index. If the time elapsed is greater than 2 hours, I assign the record a
> value of 1 to the tr_index column. I then use the CTE to generate transit
> numbers by summing the values in the tr_index field across a Window that
> also partitions the data by mmsi and name and is ordered by epoch. This
> works, but is very slow (hours). The table is indexed (multi-column index on
> mmsi, name and index on epoch). Does anyone see a way to get what I am after
> in a more efficient manner. What I am after is an assignment of transit
> number to vessels' position records based on whether the records were within
> two hours of each other. The SQL that I used is provided below. Any advice
> would be greatly appreciated...
>
>
>
> WITH
>
> cte_01 AS
>
> (
>
> SELECT
>
> a.id,
>
> a.mmsi,
>
> a.name,
>
> a.epoch,
>
> a.the_geom
>
> CASE
>
>   WHEN ((a.epoch - lag(a.epoch) OVER w) / 60) > 120 THEN 1
>
>   ELSE 0
>
> END AS tr_index
>
> FROM table a
>
> WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
>
> )
>
>
>
>
>
> SELECT
>
> a.id,
>
> a.mmsi,
>
> a.name,
>
> a.epoch,
>
> a.the_geom,
>
> 1 + sum(a.tr_index) OVER w AS transit,
>
> a.active
>
> FROM cte_01 a
>
> WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
>
>
>
> --
> Jeff



--
Jeffrey D. Adams
Contractor
OAI, Inc.
In support of:
National Marine Fisheries Service
Office of Protected Resources
1315 East West Hwy, Building SSMC3
Silver Spring, MD 20910-3282
phone: (301) 427-8434
fax: (301) 713-0376

Re: Large Table - Slow Window Functions (Better Approach?)

От
Pavel Stehule
Дата:
2013/3/11 Jeff Adams - NOAA Affiliate <jeff.adams@noaa.gov>:
> Pavel,
>
> Thanks for the response. I have not yet had the opportunity to use cursors,
> but am now curious. Could you perhaps provide a bit more detail as to what
> the implementation of your suggested approach would look like?

an example:

$$
DECLARE
  r record;
  prev_r record;

BEGIN
  FOR r IN SELECT * FROM a ORDER BY epoch, mmsi
  LOOP
    IF prev_r IS NOT NULL THEN
      /* do some counting */
      prev_r contains previous row, r contains current row
      do some
      RETURN NEXT .. /* return data in defined order */
    END IF;
    prev_r = r;
  END LOOP;


Probably slow part of your query is sorting - first can be accelerated
by index, but second (as CTE result cannot) - you can try increase
work_mem ??

Regards

Pavel

>
>
> On Mon, Mar 11, 2013 at 11:03 AM, Pavel Stehule <pavel.stehule@gmail.com>
> wrote:
>>
>> Hello
>>
>> you can try procedural solution - use a cursor over ordered data in
>> plpgsql and returns table
>>
>> Regards
>>
>> Pavel Stehule
>>
>> 2013/3/11 Jeff Adams - NOAA Affiliate <jeff.adams@noaa.gov>:
>> > Greetings,
>> >
>> >
>> >
>> > I have a large table (~90 million rows) containing vessel positions. In
>> > addition to a column that contains the location information (the_geom),
>> > the
>> > table also contains two columns that are used to uniquely identify the
>> > vessel (mmsi and name) and a column containing the Unix time (epoch) at
>> > which the position information was logged. I frequently need to assign
>> > records to vessel transits. To do this, I currently create a CTE that
>> > uses a
>> > Window function (partitioning the data by mmsi and name ordered by
>> > epoch) to
>> > examine the time that has elapsed between successive position reports
>> > for
>> > individual vessels. For every position record for a vessel (as
>> > identified
>> > using mmsi and name), if the time elapsed between the current position
>> > record and the previous record (using the lag function) is less than or
>> > equal to 2 hours, I assign the record a value of 0 to a CTE column named
>> > tr_index. If the time elapsed is greater than 2 hours, I assign the
>> > record a
>> > value of 1 to the tr_index column. I then use the CTE to generate
>> > transit
>> > numbers by summing the values in the tr_index field across a Window that
>> > also partitions the data by mmsi and name and is ordered by epoch. This
>> > works, but is very slow (hours). The table is indexed (multi-column
>> > index on
>> > mmsi, name and index on epoch). Does anyone see a way to get what I am
>> > after
>> > in a more efficient manner. What I am after is an assignment of transit
>> > number to vessels' position records based on whether the records were
>> > within
>> > two hours of each other. The SQL that I used is provided below. Any
>> > advice
>> > would be greatly appreciated...
>> >
>> >
>> >
>> > WITH
>> >
>> > cte_01 AS
>> >
>> > (
>> >
>> > SELECT
>> >
>> > a.id,
>> >
>> > a.mmsi,
>> >
>> > a.name,
>> >
>> > a.epoch,
>> >
>> > a.the_geom
>> >
>> > CASE
>> >
>> >   WHEN ((a.epoch - lag(a.epoch) OVER w) / 60) > 120 THEN 1
>> >
>> >   ELSE 0
>> >
>> > END AS tr_index
>> >
>> > FROM table a
>> >
>> > WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
>> >
>> > )
>> >
>> >
>> >
>> >
>> >
>> > SELECT
>> >
>> > a.id,
>> >
>> > a.mmsi,
>> >
>> > a.name,
>> >
>> > a.epoch,
>> >
>> > a.the_geom,
>> >
>> > 1 + sum(a.tr_index) OVER w AS transit,
>> >
>> > a.active
>> >
>> > FROM cte_01 a
>> >
>> > WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
>> >
>> >
>> >
>> > --
>> > Jeff
>
>
>
>
> --
> Jeffrey D. Adams
> Contractor
> OAI, Inc.
> In support of:
> National Marine Fisheries Service
> Office of Protected Resources
> 1315 East West Hwy, Building SSMC3
> Silver Spring, MD 20910-3282
> phone: (301) 427-8434
> fax: (301) 713-0376


Re: Large Table - Slow Window Functions (Better Approach?)

От
Jeff Adams - NOAA Affiliate
Дата:
Thanks again. The sorting does appear to be the issue. I will test out your cursor idea...

On Mon, Mar 11, 2013 at 11:34 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
2013/3/11 Jeff Adams - NOAA Affiliate <jeff.adams@noaa.gov>:
> Pavel,
>
> Thanks for the response. I have not yet had the opportunity to use cursors,
> but am now curious. Could you perhaps provide a bit more detail as to what
> the implementation of your suggested approach would look like?

an example:

$$
DECLARE
  r record;
  prev_r record;

BEGIN
  FOR r IN SELECT * FROM a ORDER BY epoch, mmsi
  LOOP
    IF prev_r IS NOT NULL THEN
      /* do some counting */
      prev_r contains previous row, r contains current row
      do some
      RETURN NEXT .. /* return data in defined order */
    END IF;
    prev_r = r;
  END LOOP;


Probably slow part of your query is sorting - first can be accelerated
by index, but second (as CTE result cannot) - you can try increase
work_mem ??

Regards

Pavel

>
>
> On Mon, Mar 11, 2013 at 11:03 AM, Pavel Stehule <pavel.stehule@gmail.com>
> wrote:
>>
>> Hello
>>
>> you can try procedural solution - use a cursor over ordered data in
>> plpgsql and returns table
>>
>> Regards
>>
>> Pavel Stehule
>>
>> 2013/3/11 Jeff Adams - NOAA Affiliate <jeff.adams@noaa.gov>:
>> > Greetings,
>> >
>> >
>> >
>> > I have a large table (~90 million rows) containing vessel positions. In
>> > addition to a column that contains the location information (the_geom),
>> > the
>> > table also contains two columns that are used to uniquely identify the
>> > vessel (mmsi and name) and a column containing the Unix time (epoch) at
>> > which the position information was logged. I frequently need to assign
>> > records to vessel transits. To do this, I currently create a CTE that
>> > uses a
>> > Window function (partitioning the data by mmsi and name ordered by
>> > epoch) to
>> > examine the time that has elapsed between successive position reports
>> > for
>> > individual vessels. For every position record for a vessel (as
>> > identified
>> > using mmsi and name), if the time elapsed between the current position
>> > record and the previous record (using the lag function) is less than or
>> > equal to 2 hours, I assign the record a value of 0 to a CTE column named
>> > tr_index. If the time elapsed is greater than 2 hours, I assign the
>> > record a
>> > value of 1 to the tr_index column. I then use the CTE to generate
>> > transit
>> > numbers by summing the values in the tr_index field across a Window that
>> > also partitions the data by mmsi and name and is ordered by epoch. This
>> > works, but is very slow (hours). The table is indexed (multi-column
>> > index on
>> > mmsi, name and index on epoch). Does anyone see a way to get what I am
>> > after
>> > in a more efficient manner. What I am after is an assignment of transit
>> > number to vessels' position records based on whether the records were
>> > within
>> > two hours of each other. The SQL that I used is provided below. Any
>> > advice
>> > would be greatly appreciated...
>> >
>> >
>> >
>> > WITH
>> >
>> > cte_01 AS
>> >
>> > (
>> >
>> > SELECT
>> >
>> > a.id,
>> >
>> > a.mmsi,
>> >
>> > a.name,
>> >
>> > a.epoch,
>> >
>> > a.the_geom
>> >
>> > CASE
>> >
>> >   WHEN ((a.epoch - lag(a.epoch) OVER w) / 60) > 120 THEN 1
>> >
>> >   ELSE 0
>> >
>> > END AS tr_index
>> >
>> > FROM table a
>> >
>> > WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
>> >
>> > )
>> >
>> >
>> >
>> >
>> >
>> > SELECT
>> >
>> > a.id,
>> >
>> > a.mmsi,
>> >
>> > a.name,
>> >
>> > a.epoch,
>> >
>> > a.the_geom,
>> >
>> > 1 + sum(a.tr_index) OVER w AS transit,
>> >
>> > a.active
>> >
>> > FROM cte_01 a
>> >
>> > WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
>> >
>> >
>> >
>> > --
>> > Jeff

Re: Large Table - Slow Window Functions (Better Approach?)

От
Jeff Adams - NOAA Affiliate
Дата:
So, I tested out the cursor approach, and it still chugs along for hours. If the result set is large (and the available memory to process small), does it matter what goes on within the cursor. Will it still choke trying assemble and spit out the large result set?

On Mon, Mar 11, 2013 at 11:48 AM, Jeff Adams - NOAA Affiliate <jeff.adams@noaa.gov> wrote:
Thanks again. The sorting does appear to be the issue. I will test out your cursor idea...


On Mon, Mar 11, 2013 at 11:34 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
2013/3/11 Jeff Adams - NOAA Affiliate <jeff.adams@noaa.gov>:
> Pavel,
>
> Thanks for the response. I have not yet had the opportunity to use cursors,
> but am now curious. Could you perhaps provide a bit more detail as to what
> the implementation of your suggested approach would look like?

an example:

$$
DECLARE
  r record;
  prev_r record;

BEGIN
  FOR r IN SELECT * FROM a ORDER BY epoch, mmsi
  LOOP
    IF prev_r IS NOT NULL THEN
      /* do some counting */
      prev_r contains previous row, r contains current row
      do some
      RETURN NEXT .. /* return data in defined order */
    END IF;
    prev_r = r;
  END LOOP;


Probably slow part of your query is sorting - first can be accelerated
by index, but second (as CTE result cannot) - you can try increase
work_mem ??

Regards

Pavel

>
>
> On Mon, Mar 11, 2013 at 11:03 AM, Pavel Stehule <pavel.stehule@gmail.com>
> wrote:
>>
>> Hello
>>
>> you can try procedural solution - use a cursor over ordered data in
>> plpgsql and returns table
>>
>> Regards
>>
>> Pavel Stehule
>>
>> 2013/3/11 Jeff Adams - NOAA Affiliate <jeff.adams@noaa.gov>:
>> > Greetings,
>> >
>> >
>> >
>> > I have a large table (~90 million rows) containing vessel positions. In
>> > addition to a column that contains the location information (the_geom),
>> > the
>> > table also contains two columns that are used to uniquely identify the
>> > vessel (mmsi and name) and a column containing the Unix time (epoch) at
>> > which the position information was logged. I frequently need to assign
>> > records to vessel transits. To do this, I currently create a CTE that
>> > uses a
>> > Window function (partitioning the data by mmsi and name ordered by
>> > epoch) to
>> > examine the time that has elapsed between successive position reports
>> > for
>> > individual vessels. For every position record for a vessel (as
>> > identified
>> > using mmsi and name), if the time elapsed between the current position
>> > record and the previous record (using the lag function) is less than or
>> > equal to 2 hours, I assign the record a value of 0 to a CTE column named
>> > tr_index. If the time elapsed is greater than 2 hours, I assign the
>> > record a
>> > value of 1 to the tr_index column. I then use the CTE to generate
>> > transit
>> > numbers by summing the values in the tr_index field across a Window that
>> > also partitions the data by mmsi and name and is ordered by epoch. This
>> > works, but is very slow (hours). The table is indexed (multi-column
>> > index on
>> > mmsi, name and index on epoch). Does anyone see a way to get what I am
>> > after
>> > in a more efficient manner. What I am after is an assignment of transit
>> > number to vessels' position records based on whether the records were
>> > within
>> > two hours of each other. The SQL that I used is provided below. Any
>> > advice
>> > would be greatly appreciated...
>> >
>> >
>> >
>> > WITH
>> >
>> > cte_01 AS
>> >
>> > (
>> >
>> > SELECT
>> >
>> > a.id,
>> >
>> > a.mmsi,
>> >
>> > a.name,
>> >
>> > a.epoch,
>> >
>> > a.the_geom
>> >
>> > CASE
>> >
>> >   WHEN ((a.epoch - lag(a.epoch) OVER w) / 60) > 120 THEN 1
>> >
>> >   ELSE 0
>> >
>> > END AS tr_index
>> >
>> > FROM table a
>> >
>> > WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
>> >
>> > )
>> >
>> >
>> >
>> >
>> >
>> > SELECT
>> >
>> > a.id,
>> >
>> > a.mmsi,
>> >
>> > a.name,
>> >
>> > a.epoch,
>> >
>> > a.the_geom,
>> >
>> > 1 + sum(a.tr_index) OVER w AS transit,
>> >
>> > a.active
>> >
>> > FROM cte_01 a
>> >
>> > WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
>> >
>> >
>> >
>> > --
>> > Jeff



--
Jeffrey D. Adams
Contractor
OAI, Inc.
In support of:
National Marine Fisheries Service
Office of Protected Resources
1315 East West Hwy, Building SSMC3
Silver Spring, MD 20910-3282
phone: (301) 427-8434
fax: (301) 713-0376

Re: Large Table - Slow Window Functions (Better Approach?)

От
Victor Yegorov
Дата:
2013/3/11 Jeff Adams - NOAA Affiliate <jeff.adams@noaa.gov>

Greetings,

 

I have a large table (~90 million rows) containing vessel positions.

  ...



Could you kindly provide a script to create the table and populate it with several sample
rows, please? Also, provide the desired output for the sample rows.

It would be good to take a look on the “EXPLAIN ANALYZE” output of your query,
please, share http://explain.depesz.com/ link.


--
Victor Y. Yegorov