Discussion: Optimize date query for large child tables: GiST or GIN?


Optimize date query for large child tables: GiST or GIN?

From: David Jarvis
Hi,

I recently switched to PostgreSQL from MySQL so that I can use PL/R for data analysis. The query in MySQL form (against a more complex table structure) takes ~5 seconds to run. I have yet to let the PostgreSQL version finish, as it runs for over a minute. I think I have the correct table structure in place (it is much simpler than the former structure in MySQL); however, the query executes a full table scan against the parent table's 273 million rows.

Questions

What is the proper way to index the dates to avoid full table scans?

Options I have considered:
  • GIN
  • GiST
  • Rewrite the WHERE clause
  • Adding separate year_taken, month_taken, and day_taken columns to the tables
Details

The HashAggregate from the plan shows a cost of 10006220141.11, which is, I suspect, astronomically huge. There is a full table scan being performed on the measurement table (itself having neither data nor indexes). The table aggregates 273 million rows from its child tables. The sluggishness comes from this part of the query:

      m.taken BETWEEN
        /* Start date. */
      (extract( YEAR FROM m.taken )||'-01-01')::date AND
        /* End date. Calculated by checking to see if the end date wraps
          into the next year. If it does, then add 1 to the current year.
        */
        (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
          sign(
            (extract( YEAR FROM m.taken )||'-12-31')::date -
            (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
        ) AS text)||'-12-31')::date


There are 72 child tables, each having a year index and a station index, which are defined as follows:

    CREATE TABLE climate.measurement_12_013 (
    -- Inherited from table climate.measurement_12_013:  id bigint NOT NULL DEFAULT nextval('climate.measurement_id_seq'::regclass),
    -- Inherited from table climate.measurement_12_013:  station_id integer NOT NULL,
    -- Inherited from table climate.measurement_12_013:  taken date NOT NULL,
    -- Inherited from table climate.measurement_12_013:  amount numeric(8,2) NOT NULL,
    -- Inherited from table climate.measurement_12_013:  category_id smallint NOT NULL,
    -- Inherited from table climate.measurement_12_013:  flag character varying(1) NOT NULL DEFAULT ' '::character varying,
      CONSTRAINT measurement_12_013_category_id_check CHECK (category_id = 7),
      CONSTRAINT measurement_12_013_taken_check CHECK (date_part('month'::text, taken)::integer = 12)
    )
    INHERITS (climate.measurement)

    CREATE INDEX measurement_12_013_s_idx
      ON climate.measurement_12_013
      USING btree
      (station_id);
    CREATE INDEX measurement_12_013_y_idx
      ON climate.measurement_12_013
      USING btree
      (date_part('year'::text, taken));

(Foreign key constraints to be added later.)

The following query runs abysmally slow due to a full table scan:

    SELECT
      count(1) AS measurements,
      avg(m.amount) AS amount
    FROM
      climate.measurement m
    WHERE
      m.station_id IN (
        SELECT
          s.id
        FROM
          climate.station s,
          climate.city c
        WHERE
            /* For one city... */
            c.id = 5182 AND

            /* Where stations are within an elevation range... */
            s.elevation BETWEEN 0 AND 3000 AND

            /* and within a specific radius... */
            6371.009 * SQRT(
              POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) +
                (COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) *
                  POW(RADIANS(c.longitude_decimal - s.longitude_decimal), 2))
            ) <= 50
        ) AND

      /* Data before 1900 is shaky; insufficient after 2009. */
      extract( YEAR FROM m.taken ) BETWEEN 1900 AND 2009 AND

      /* Whittled down by category... */
      m.category_id = 1 AND

      /* Between the selected days and years... */
      m.taken BETWEEN
       /* Start date. */
       (extract( YEAR FROM m.taken )||'-01-01')::date AND
        /* End date. Calculated by checking to see if the end date wraps
           into the next year. If it does, then add 1 to the current year.
        */
        (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
          sign(
            (extract( YEAR FROM m.taken )||'-12-31')::date -
            (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
        ) AS text)||'-12-31')::date
    GROUP BY
      extract( YEAR FROM m.taken )

What are your thoughts?

Thank you!

Re: Optimize date query for large child tables: GiST or GIN?

From: Yeb Havinga
Hello David,
> The table aggregates 273 million rows from its child tables. The
> sluggishness comes from this part of the query:
>
>       m.taken BETWEEN
>         /* Start date. */
>       (extract( YEAR FROM m.taken )||'-01-01')::date AND
>         /* End date. Calculated by checking to see if the end date wraps
>           into the next year. If it does, then add 1 to the current year.
>         */
>         (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
>           sign(
>             (extract( YEAR FROM m.taken )||'-12-31')::date -
>             (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
>         ) AS text)||'-12-31')::date
Either I've had too little coffee and completely misunderstand this
expression, or it is always true and can be omitted. Could you explain a
bit what this part tries to do, and maybe also show its original
counterpart in the source database?

regards,
Yeb Havinga


Re: Optimize date query for large child tables: GiST or GIN?

From: Thom Brown
On 20 May 2010 06:06, David Jarvis <thangalin@gmail.com> wrote:
> [original message quoted in full; snipped]

Could you provide the EXPLAIN output for that slow query?

Thom

Re: Optimize date query for large child tables: GiST or GIN?

From: Matthew Wakeling
On Wed, 19 May 2010, David Jarvis wrote:
>      extract( YEAR FROM m.taken ) BETWEEN 1900 AND 2009 AND

That portion of the WHERE clause cannot use an index on m.taken. Postgres
does not look inside functions (like extract) to see if something
indexable is present. To get an index to work, you could create an index
on (extract(YEAR FROM m.taken)).
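
For example (a sketch; the index name is invented, and each child table would need its own copy):

    CREATE INDEX measurement_12_013_y_expr_idx
      ON climate.measurement_12_013 ((extract(YEAR FROM taken)));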

Matthew

--
 Here we go - the Fairy Godmother redundancy proof.
                                        -- Computer Science Lecturer

Re: Optimize date query for large child tables: GiST or GIN?

From: Tom Lane
Matthew Wakeling <matthew@flymine.org> writes:
> On Wed, 19 May 2010, David Jarvis wrote:
>> extract( YEAR FROM m.taken ) BETWEEN 1900 AND 2009 AND

> That portion of the WHERE clause cannot use an index on m.taken. Postgres
> does not look inside functions (like extract) to see if something
> indexable is present. To get an index to work, you could create an index
> on (extract(YEAR FROM m.taken)).

What you really need to do is not do date arithmetic using text-string
operations.  The planner has no intelligence about that whatsoever.
Convert the operations to something natural using real date or timestamp
types, and then look at what indexes you need.
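
For instance, a minimal sketch of the kind of thing I mean (an illustration, not the final query): the start and end of m.taken's year can be computed with date arithmetic instead of string concatenation:

    date_trunc('year', m.taken)::date                                    -- Jan 1
    (date_trunc('year', m.taken) + interval '1 year' - interval '1 day')::date  -- Dec 31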

            regards, tom lane

Re: Optimize date query for large child tables: GiST or GIN?

From: David Jarvis
Hi,

I have posted an image of the user inputs here: http://i.imgur.com/MUkuZ.png

The problem is that I am given a range of days (Dec 22 - Mar 22) over a range of years (1900 - 2009) and the range of days can span from one year to the next. This is not the same as saying Dec 22, 1900 to Mar 22, 2009, for which I do not need date math.

> What you really need to do is not do date arithmetic using text-string
> operations.  The planner has no intelligence about that whatsoever.
> Convert the operations to something natural using real date or timestamp
> types, and then look at what indexes you need.

Any suggestions on how to go about this?

Thanks again!

Dave

Re: Optimize date query for large child tables: GiST or GIN?

From: Thom Brown
On 20 May 2010 17:36, David Jarvis <thangalin@gmail.com> wrote:
> Hi, Thom.
>
> The query is given two items:
>
> Range of years
> Range of days
>
> I need to select all data between the range of days (e.g., Dec 22 - Mar 22)
> over the range of years (e.g., 1950 - 1970), such as shown here:
>
> http://i.imgur.com/MUkuZ.png
>
> For Jun 1 to Jul 1 it would be no problem because they're in the same year. But
> for Dec 22 to Mar 22, it is difficult because Mar 22 is in the next year
> (relative to Dec 22).
>
> How do I do that without strings?
>
> Dave
>
>

Okay, get your app to convert the month-date to a day of year, so we
have year_start, year_end, day_of_year_start, day_of_year_end

and your where clause would be something like this:

WHERE extract(YEAR from m.taken) BETWEEN year1 and year2
AND (
    extract(DOY from m.taken) BETWEEN day_of_year_start AND day_of_year_end
    OR (
        extract(DOY from m.taken) >= day_of_year_start
        OR extract(DOY from m.taken) <= day_of_year_end
    )
)

... substituting the placeholders where they appear.

So if we had:

year1=1941
year2=1952
day_of_year_start=244 (based on input date of 1st September)
day_of_year_end=94 (based on 4th April)

We'd have:

WHERE extract(YEAR from m.taken) BETWEEN 1941 and 1952
AND (
    extract(DOY from m.taken) BETWEEN 244 AND 94
    OR (
        extract(DOY from m.taken) >= 244 OR extract(DOY from m.taken) <= 94
    )
)

Then you could add expression indexes for the YEAR and DOY extract parts, like:

CREATE INDEX idx_taken_doy ON climate.measurement (EXTRACT(DOY from taken));
CREATE INDEX idx_taken_year ON climate.measurement (EXTRACT(YEAR from taken));

Although maybe you don't need those, depending on how the date
datatype matching works in the planner with the EXTRACT function.

Regards

Thom

Re: Optimize date query for large child tables: GiST or GIN?

From: Thom Brown
On 20 May 2010 19:36, Thom Brown <thombrown@gmail.com> wrote:
> [previous message quoted in full; snipped]

Actually, you could change that last bit from:

    OR (
        extract(DOY from m.taken) >= day_of_year_start
        OR extract(DOY from m.taken) <= day_of_year_end
    )

to

OR extract(DOY from m.taken) NOT BETWEEN day_of_year_end AND day_of_year_start

That would be tidier and simpler :)

Thom

Re: Optimize date query for large child tables: GiST or GIN?

From: Tom Lane
Thom Brown <thombrown@gmail.com> writes:
> On 20 May 2010 17:36, David Jarvis <thangalin@gmail.com> wrote:
> Okay, get your app to convert the month-date to a day of year, so we
> have year_start, year_end, day_of_year_start, day_of_year_end

> and your where clause would something like this:

> WHERE extract(YEAR from m.taken) BETWEEN year1 and year2
> AND (
>     extract(DOY from m.taken) BETWEEN day_of_year_start AND day_of_year_end
>     OR (
>         extract(DOY from m.taken) >= day_of_year_start
>         OR extract(DOY from m.taken) <= day_of_year_end
>     )
> )

extract(DOY) seems a bit problematic here, because its day numbering is
going to be different between leap years and non-leap years, and David's
problem statement doesn't allow for off-by-one errors.  You could
certainly invent your own function that worked similarly but always
translated a given month/day to the same number.

The other thing that's messy here is the wraparound requirement.
Rather than trying an OR like the above (which I think doesn't quite
work anyway --- won't it select everything?), it would be better if
you can have the app distinguish wraparound from non-wraparound cases
and issue different queries in the two cases.  In the non-wrap case
(start_day < end_day) it's pretty easy, just
    my_doy(m.taken) BETWEEN start_val AND end_val
The easy way to handle the wrap case is
    my_doy(m.taken) >= start_val OR my_doy(m.taken) <= end_val
although I can't help feeling there should be a smarter way to do
this where you can use an AND range check on some modified expression
derived from the date.
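
One way to write such a function (a sketch; my_doy is just a placeholder name) is to encode month and day into a single number, which is the same in leap and non-leap years:

    CREATE FUNCTION my_doy(d date) RETURNS integer AS $$
      SELECT extract(MONTH FROM $1)::integer * 100
           + extract(DAY FROM $1)::integer
    $$ LANGUAGE sql IMMUTABLE;

Dec 22 always maps to 1222 and Mar 22 to 322, and since the function is
IMMUTABLE it can back an expression index.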

            regards, tom lane

Re: Optimize date query for large child tables: GiST or GIN?

From: Thom Brown
On 20 May 2010 20:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thom Brown <thombrown@gmail.com> writes:
>> [earlier DOY-based WHERE clause suggestion quoted; snipped]
>
> extract(DOY) seems a bit problematic here, because its day numbering is
> going to be different between leap years and non-leap years, and David's
> problem statement doesn't allow for off-by-one errors.  You could
> certainly invent your own function that worked similarly but always
> translated a given month/day to the same number.
>
> The other thing that's messy here is the wraparound requirement.
> Rather than trying an OR like the above (which I think doesn't quite
> work anyway --- won't it select everything?)

No, it only would if using BETWEEN SYMMETRIC.

For example, if m.taken is '2003-02-03' and the day-of-year range runs
from 11th Nov to 17th Feb, it matches the second part of the innermost
OR expression.  If you changed the end day of year to 2nd Feb, it would
yield no result: nothing is between 11th Nov and 2nd Feb, as that is a
negative range, and the taken date matches neither half of the innermost
OR expression, being after 2nd Feb and before 11th Nov.

> , it would be better if
> you can have the app distinguish wraparound from non-wraparound cases
> and issue different queries in the two cases.  In the non-wrap case
> (start_day < end_day) it's pretty easy, just
>        my_doy(m.taken) BETWEEN start_val AND end_val
> The easy way to handle the wrap case is
>        my_doy(m.taken) >= start_val OR my_doy(m.taken) <= end_val
> although I can't help feeling there should be a smarter way to do
> this where you can use an AND range check on some modified expression
> derived from the date.
>
>                        regards, tom lane
>

Yes, I guess I agree that the app can run different queries depending
on which date is higher.  I hadn't factored leap years into the
equation, and can't think of what could be done for those cases off the
top of my head.  What is really needed is a way to match against the day
and month parts instead of day, month, and year... without resorting to
casting to text, of course.

Thom

Re: Optimize date query for large child tables: GiST or GIN?

From: David Jarvis
When using MySQL, the performance was okay (~5 seconds per query) using:

  date( concat_ws( '-', y.year, m.month, d.day ) ) between
    -- Start date.
    date( concat_ws( '-', y.year, $P{Month1}, $P{Day1} ) ) AND
    -- End date. Calculated by checking to see if the end date wraps
    -- into the next year. If it does, then add 1 to the current year.
    --
    date(
      concat_ws( '-',
        y.year + greatest( -1 *
          sign(
            datediff(
              date(
                concat_ws('-', y.year, $P{Month2}, $P{Day2} )
              ),
              date(
                concat_ws('-', y.year, $P{Month1}, $P{Day1} )
              )
            )
          ), 0
        ), $P{Month2}, $P{Day2}
      )
    )

This calculated the correct start days and end days, including leap years.

With MySQL, I "normalized" the date into three different tables: year references, month references, and day references. The days contained only the day (of the month) the measurement was made and the measured value. The month references contained the month number for the measurement. The year references had the years and station. Each table had its own index on the year, month, or day.

When I proposed that solution to the mailing list, I was introduced to a more PostgreSQL-like approach, which was to use indexes on the date field.

In PostgreSQL, I have a single "measurement" table for the data (divided into 72 child tables), which includes the date and station. I like this because it feels clean and it is easier to understand. So far, however, it has not been fast.

I was thinking that I could add three more columns to the measurement table:

year_taken, month_taken, day_taken

Then index those. That should allow me to avoid extracting years, months, and days from the m.taken date column.
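
Something like this, I imagine (a sketch; the column and index names are invented):

    ALTER TABLE climate.measurement
      ADD COLUMN year_taken smallint,
      ADD COLUMN month_taken smallint,
      ADD COLUMN day_taken smallint;

    CREATE INDEX measurement_ymd_idx
      ON climate.measurement (year_taken, month_taken, day_taken);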

What do you think?

Thanks again!
Dave

Re: Optimize date query for large child tables: GiST or GIN?

From: Tom Lane
David Jarvis <thangalin@gmail.com> writes:
> I was thinking that I could add three more columns to the measurement table:
> year_taken, month_taken, day_taken
> Then index those. That should allow me to avoid extracting years, months,
> and days from the *m.taken* date column.

You could, but I don't think there's any advantage to that versus
putting indexes on extract(day from taken) etc.  The extra fields
eat more space in the table proper, and the functional index isn't
really any more expensive than a plain index.  Not to mention that
you can have bugs with changing the date and forgetting to update
the derived columns, etc etc.
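
That is, something along these lines (a sketch; the index name is made up):

    CREATE INDEX measurement_12_013_d_expr_idx
      ON climate.measurement_12_013 ((extract(DAY FROM taken)));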

            regards, tom lane

Re: Optimize date query for large child tables: GiST or GIN?

From: David Jarvis
What if I were to have the application pass in two sets of date ranges?

For the condition of Dec 22 to Mar 22:

Dec 22 would become:
  • Dec 22 - Dec 31
Mar 22 would become:
  • Jan 1 - Mar 22
The first range would always be for the current year; the second range would always be for the year following the current year.

Would that allow PostgreSQL to use the index?

Dave

Re: Optimize date query for large child tables: GiST or GIN?

From: Tom Lane
David Jarvis <thangalin@gmail.com> writes:
> What if I were to have the application pass in two sets of date ranges?
> For the condition of Dec 22 to Mar 22:
> Dec 22 would become:
>    - Dec 22 - Dec 31
> Mar 22 would become:
>    - Jan 1 - Mar 22

I think what you're essentially describing here is removing the OR from
the query in favor of issuing two queries and then combining the results
in the app.  Yeah, you could do that, but one would hope that it isn't
faster ;-)

            regards, tom lane

Re: Optimize date query for large child tables: GiST or GIN?

From: David Jarvis
I was hoping to eliminate this part of the query:

        (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
          sign(
            (extract( YEAR FROM m.taken )||'-12-31')::date -
            (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
        ) AS text)||'-12-31')::date

That uses functions to create the dates, which is definitely the problem. I'd still have the query return all the results for both data sets. If providing the query with two data sets won't work, what will?

Dave

Re: Optimize date query for large child tables: GiST or GIN?

From: Tom Lane
David Jarvis <thangalin@gmail.com> writes:
> I was hoping to eliminate this part of the query:
>         (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
>           sign(
>             (extract( YEAR FROM m.taken )||'-12-31')::date -
>             (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
>         ) AS text)||'-12-31')::date

> That uses functions to create the dates, which is definitely the problem.

Well, it's not the functions per se that's the problem, it's the lack of
a useful index on the expression.  But as somebody remarked upthread,
that expression doesn't look correct at all.  Doesn't the whole
greatest() subexpression reduce to a constant?

            regards, tom lane

Re: Optimize date query for large child tables: GiST or GIN?

From: David Jarvis
Hi,

> I was still referring to the measurement table. You have an index on
> station_id, but still seem to be getting a sequential scan. Maybe the
> planner does not realise that you are selecting a small number of
> stations. Posting an EXPLAIN ANALYSE would really help here.

Here is the result from an EXPLAIN ANALYZE:

"HashAggregate  (cost=5486752.27..5486756.27 rows=200 width=12) (actual time=314328.657..314328.728 rows=110 loops=1)"
"  ->  Hash Semi Join  (cost=1045.52..5451155.11 rows=4746289 width=12) (actual time=197.950..313605.795 rows=463926 loops=1)"
"        Hash Cond: (m.station_id = s.id)"
"        ->  Append  (cost=0.00..5343318.08 rows=4746289 width=16) (actual time=74.411..306533.820 rows=42737997 loops=1)"
"              ->  Seq Scan on measurement m  (cost=0.00..148.00 rows=1 width=20) (actual time=0.001..0.001 rows=0 loops=1)"
"                    Filter: ((category_id = 1) AND (date_part('year'::text, (taken)::timestamp without time zone) >= 1900::double precision) AND (date_part('year'::text, (taken)::timestamp without time zone) <= 2009::double precision) AND (taken >= (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date) AND (taken <= ((((date_part('year'::text, (taken)::timestamp without time zone) + GREATEST(((-1)::double precision * sign((((((date_part('year'::text, (taken)::timestamp without time zone))::text || '-12-31'::text))::date - (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date))::double precision)), 0::double precision)))::text || '-12-31'::text))::date))"
"              ->  Seq Scan on measurement_01_001 m  (cost=0.00..438102.26 rows=389080 width=16) (actual time=74.409..24800.171 rows=3503256 loops=1)"
"                    Filter: ((category_id = 1) AND (date_part('year'::text, (taken)::timestamp without time zone) >= 1900::double precision) AND (date_part('year'::text, (taken)::timestamp without time zone) <= 2009::double precision) AND (taken >= (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date) AND (taken <= ((((date_part('year'::text, (taken)::timestamp without time zone) + GREATEST(((-1)::double precision * sign((((((date_part('year'::text, (taken)::timestamp without time zone))::text || '-12-31'::text))::date - (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date))::double precision)), 0::double precision)))::text || '-12-31'::text))::date))"
"              ->  Seq Scan on measurement_02_001 m  (cost=0.00..399834.28 rows=354646 width=16) (actual time=29.217..22209.877 rows=3196631 loops=1)"
"                    Filter: ((category_id = 1) AND (date_part('year'::text, (taken)::timestamp without time zone) >= 1900::double precision) AND (date_part('year'::text, (taken)::timestamp without time zone) <= 2009::double precision) AND (taken >= (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date) AND (taken <= ((((date_part('year'::text, (taken)::timestamp without time zone) + GREATEST(((-1)::double precision * sign((((((date_part('year'::text, (taken)::timestamp without time zone))::text || '-12-31'::text))::date - (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date))::double precision)), 0::double precision)))::text || '-12-31'::text))::date))"
"              ->  Seq Scan on measurement_03_001 m  (cost=0.00..438380.23 rows=389148 width=16) (actual time=15.915..24366.766 rows=3503937 loops=1)"
"                    Filter: ((category_id = 1) AND (date_part('year'::text, (taken)::timestamp without time zone) >= 1900::double precision) AND (date_part('year'::text, (taken)::timestamp without time zone) <= 2009::double precision) AND (taken >= (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date) AND (taken <= ((((date_part('year'::text, (taken)::timestamp without time zone) + GREATEST(((-1)::double precision * sign((((((date_part('year'::text, (taken)::timestamp without time zone))::text || '-12-31'::text))::date - (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date))::double precision)), 0::double precision)))::text || '-12-31'::text))::date))"
"              ->  Seq Scan on measurement_04_001 m  (cost=0.00..432850.57 rows=384539 width=16) (actual time=15.852..24280.031 rows=3461931 loops=1)"
"                    Filter: ((category_id = 1) AND (date_part('year'::text, (taken)::timestamp without time zone) >= 1900::double precision) AND (date_part('year'::text, (taken)::timestamp without time zone) <= 2009::double precision) AND (taken >= (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date) AND (taken <= ((((date_part('year'::text, (taken)::timestamp without time zone) + GREATEST(((-1)::double precision * sign((((((date_part('year'::text, (taken)::timestamp without time zone))::text || '-12-31'::text))::date - (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date))::double precision)), 0::double precision)))::text || '-12-31'::text))::date))"
"              ->  Seq Scan on measurement_05_001 m  (cost=0.00..466852.96 rows=415704 width=16) (actual time=19.495..26158.828 rows=3737276 loops=1)"
"                    Filter: ((category_id = 1) AND (date_part('year'::text, (taken)::timestamp without time zone) >= 1900::double precision) AND (date_part('year'::text, (taken)::timestamp without time zone) <= 2009::double precision) AND (taken >= (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date) AND (taken <= ((((date_part('year'::text, (taken)::timestamp without time zone) + GREATEST(((-1)::double precision * sign((((((date_part('year'::text, (taken)::timestamp without time zone))::text || '-12-31'::text))::date - (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date))::double precision)), 0::double precision)))::text || '-12-31'::text))::date))"
"              ->  Seq Scan on measurement_06_001 m  (cost=0.00..458098.05 rows=407244 width=16) (actual time=25.062..26054.019 rows=3668108 loops=1)"
"                    Filter: ((category_id = 1) AND (date_part('year'::text, (taken)::timestamp without time zone) >= 1900::double precision) AND (date_part('year'::text, (taken)::timestamp without time zone) <= 2009::double precision) AND (taken >= (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date) AND (taken <= ((((date_part('year'::text, (taken)::timestamp without time zone) + GREATEST(((-1)::double precision * sign((((((date_part('year'::text, (taken)::timestamp without time zone))::text || '-12-31'::text))::date - (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date))::double precision)), 0::double precision)))::text || '-12-31'::text))::date))"
"              ->  Seq Scan on measurement_07_001 m  (cost=0.00..472679.60 rows=420736 width=16) (actual time=17.852..26829.286 rows=3784626 loops=1)"
"                    Filter: ((category_id = 1) AND (date_part('year'::text, (taken)::timestamp without time zone) >= 1900::double precision) AND (date_part('year'::text, (taken)::timestamp without time zone) <= 2009::double precision) AND (taken >= (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date) AND (taken <= ((((date_part('year'::text, (taken)::timestamp without time zone) + GREATEST(((-1)::double precision * sign((((((date_part('year'::text, (taken)::timestamp without time zone))::text || '-12-31'::text))::date - (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date))::double precision)), 0::double precision)))::text || '-12-31'::text))::date))"
"              ->  Seq Scan on measurement_08_001 m  (cost=0.00..471200.02 rows=418722 width=16) (actual time=20.781..26875.574 rows=3772848 loops=1)"
"                    Filter: ((category_id = 1) AND (date_part('year'::text, (taken)::timestamp without time zone) >= 1900::double precision) AND (date_part('year'::text, (taken)::timestamp without time zone) <= 2009::double precision) AND (taken >= (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date) AND (taken <= ((((date_part('year'::text, (taken)::timestamp without time zone) + GREATEST(((-1)::double precision * sign((((((date_part('year'::text, (taken)::timestamp without time zone))::text || '-12-31'::text))::date - (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date))::double precision)), 0::double precision)))::text || '-12-31'::text))::date))"
"              ->  Seq Scan on measurement_09_001 m  (cost=0.00..447468.05 rows=397415 width=16) (actual time=17.454..25355.688 rows=3580395 loops=1)"
"                    Filter: ((category_id = 1) AND (date_part('year'::text, (taken)::timestamp without time zone) >= 1900::double precision) AND (date_part('year'::text, (taken)::timestamp without time zone) <= 2009::double precision) AND (taken >= (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date) AND (taken <= ((((date_part('year'::text, (taken)::timestamp without time zone) + GREATEST(((-1)::double precision * sign((((((date_part('year'::text, (taken)::timestamp without time zone))::text || '-12-31'::text))::date - (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date))::double precision)), 0::double precision)))::text || '-12-31'::text))::date))"
"              ->  Seq Scan on measurement_10_001 m  (cost=0.00..449691.17 rows=399362 width=16) (actual time=17.911..25144.829 rows=3594957 loops=1)"
"                    Filter: ((category_id = 1) AND (date_part('year'::text, (taken)::timestamp without time zone) >= 1900::double precision) AND (date_part('year'::text, (taken)::timestamp without time zone) <= 2009::double precision) AND (taken >= (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date) AND (taken <= ((((date_part('year'::text, (taken)::timestamp without time zone) + GREATEST(((-1)::double precision * sign((((((date_part('year'::text, (taken)::timestamp without time zone))::text || '-12-31'::text))::date - (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date))::double precision)), 0::double precision)))::text || '-12-31'::text))::date))"
"              ->  Seq Scan on measurement_11_001 m  (cost=0.00..429363.73 rows=380826 width=16) (actual time=18.944..24106.477 rows=3430085 loops=1)"
"                    Filter: ((category_id = 1) AND (date_part('year'::text, (taken)::timestamp without time zone) >= 1900::double precision) AND (date_part('year'::text, (taken)::timestamp without time zone) <= 2009::double precision) AND (taken >= (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date) AND (taken <= ((((date_part('year'::text, (taken)::timestamp without time zone) + GREATEST(((-1)::double precision * sign((((((date_part('year'::text, (taken)::timestamp without time zone))::text || '-12-31'::text))::date - (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date))::double precision)), 0::double precision)))::text || '-12-31'::text))::date))"
"              ->  Seq Scan on measurement_12_001 m  (cost=0.00..438649.19 rows=388866 width=16) (actual time=22.830..24466.324 rows=3503947 loops=1)"
"                    Filter: ((category_id = 1) AND (date_part('year'::text, (taken)::timestamp without time zone) >= 1900::double precision) AND (date_part('year'::text, (taken)::timestamp without time zone) <= 2009::double precision) AND (taken >= (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date) AND (taken <= ((((date_part('year'::text, (taken)::timestamp without time zone) + GREATEST(((-1)::double precision * sign((((((date_part('year'::text, (taken)::timestamp without time zone))::text || '-12-31'::text))::date - (((date_part('year'::text, (taken)::timestamp without time zone))::text || '-01-01'::text))::date))::double precision)), 0::double precision)))::text || '-12-31'::text))::date))"
"        ->  Hash  (cost=994.94..994.94 rows=4046 width=4) (actual time=120.793..120.793 rows=129 loops=1)"
"              ->  Nested Loop  (cost=0.00..994.94 rows=4046 width=4) (actual time=71.112..120.728 rows=129 loops=1)"
"                    Join Filter: ((6371.009::double precision * sqrt((pow(radians(((c.latitude_decimal - s.latitude_decimal))::double precision), 2::double precision) + (cos((radians(((c.latitude_decimal + s.latitude_decimal))::double precision) / 2::double precision)) * pow(radians(((c.longitude_decimal - s.longitude_decimal))::double precision), 2::double precision))))) <= 50::double precision)"
"                    ->  Index Scan using city_pkey1 on city c  (cost=0.00..6.27 rows=1 width=16) (actual time=61.311..61.314 rows=1 loops=1)"
"                          Index Cond: (id = 5182)"
"                    ->  Seq Scan on station s  (cost=0.00..321.08 rows=12138 width=20) (actual time=9.745..19.035 rows=12139 loops=1)"
"                          Filter: ((s.elevation >= 0) AND (s.elevation <= 3000))"
"Total runtime: 314329.201 ms"

Dave

Re: Optimize date query for large child tables: GiST or GIN?

From: David Jarvis
The greatest() expression reduces to either the current year (year + 0) or the next year (year + 1) by taking the sign of the difference in start/end days. This allows me to derive an end date, such as:

Dec 22, 1900 to Mar 22, 1901

Then I check if the measured date falls between those two dates.

The expression might not be correct as I'm still quite new to PostgreSQL's syntax.

Dave

Re: Optimize date query for large child tables: GiST or GIN?

From: Stephen Frost
* David Jarvis (thangalin@gmail.com) wrote:
> I was hoping to eliminate this part of the query:
>         (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
>           sign(
>             (extract( YEAR FROM m.taken )||'-12-31')::date -
>             (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
>         ) AS text)||'-12-31')::date
>
> That uses functions to create the dates, which is definitely the problem.
[...]
> The greatest() expression reduces to either the current year (year + 0) or
> the next year (year + 1) by taking the sign of the difference in start/end
> days. This allows me to derive an end date, such as:
>
> Dec 22, 1900 to Mar 22, 1901

Something in here really smells fishy to me.  Those extracts above are
working on values which come from the table.  Why aren't you using these
functions to construct the actual dates based on the values provided by
the *user*?

Looking at your screenshot, I think you need to take those two date
values that the user provides, make them into actual dates (maybe you
need a CASE statement or something similar, that shouldn't be that hard,
and PG should just run that whole bit once, since to PG's point of view,
it's all constants), and then use those dates to query the tables.

Also, you're trying to do constraint_exclusion, but have you made sure
that it's turned on?  And have you made sure that those constraints are
really the right ones and that they make sense?  You're using a bunch of
extract()s there too; why not just specify a CHECK constraint on the
date ranges which are allowed in the table?

Maybe I've misunderstood the whole point here, but I don't think so.

    Thanks,

        Stephen


Re: Optimize date query for large child tables: GiST or GIN?

From: Yeb Havinga
Tom Lane wrote:
> David Jarvis <thangalin@gmail.com> writes:
>
>> I was hoping to eliminate this part of the query:
>>         (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
>>           sign(
>>             (extract( YEAR FROM m.taken )||'-12-31')::date -
>>             (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
>>         ) AS text)||'-12-31')::date
>>
>> That uses functions to create the dates, which is definitely the problem.
>>
>
> Well, it's not the functions per se that's the problem, it's the lack of
> a useful index on the expression.  But as somebody remarked upthread,
> that expression doesn't look correct at all.  Doesn't the whole
> greatest() subexpression reduce to a constant?
>
That somebody was probably me. I still think the whole BETWEEN
expression is a tautology, and a small test did not provide a
counterexample. In the SELECT below, everything but the SELECT itself
was copied and pasted from the thread.

create table m (taken timestamptz);
insert into m values (now());
insert into m values ('1900-12-31');
insert into m values ('2000-04-06');
select m.taken BETWEEN
        /* Start date. */
      (extract( YEAR FROM m.taken )||'-01-01')::date AND
        /* End date. Calculated by checking to see if the end date wraps
          into the next year. If it does, then add 1 to the current year.
        */
        (cast(extract( YEAR FROM m.taken ) + greatest( -1 *
          sign(
            (extract( YEAR FROM m.taken )||'-12-31')::date -
            (extract( YEAR FROM m.taken )||'-01-01')::date ), 0
        ) AS text)||'-12-31')::date from m;
  ?column?
----------
 t
 t
 t
(3 rows)

Another thing: IF the climate measurements are partitioned on time
(e.g. each year?), then a function-based index on the year part of
m.taken is useless, pardon my French. I'm not sure it is partitioned
that way, but it is an interesting thing to inspect, and perhaps the
query could be rewritten to use constraint exclusion.
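
For example (a sketch; the table name and dates here are invented): with constraint exclusion enabled,

    SHOW constraint_exclusion;  -- should report 'on' or, on 8.4, 'partition'

and a range constraint on each child, such as

    ALTER TABLE climate.measurement_1913
      ADD CHECK (taken >= DATE '1913-01-01' AND taken < DATE '1914-01-01');

a WHERE clause that compares taken against literal dates lets the planner
skip every child table whose range cannot match.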

regards,
Yeb Havinga


Re: Optimize date query for large child tables: GiST or GIN?

From: Stephen Frost
* David Jarvis (thangalin@gmail.com) wrote:
> There are 72 child tables, each having a year index and a station index,
> which are defined as follows:

Soooo, my thoughts:

Partition by something that makes sense...  Typically, I'd say that you
would do it by the category id and when the measurement was taken.  Then
set up the appropriate check constraints on that so that PG can use
constraint_exclusion to identify what table it needs to actually go look
in.  How much data are we talking about, by the way? (# of rows)  If
you're not in the millions, partitioning at all is probably overkill and
might be part of the problem here.

create table climate.measurement_12_013 (
    id bigint not null DEFAULT nextval('climate.measurement_id_seq'::regclass),
    station_id integer not null,
    taken date not null,
    amount numeric(8,2) not null,
    category_id integer not null,
    flag varchar(1) not null default ' ',
    check (category_id = 7),
    check (taken >= '1913-12-01' and taken <= '1913-12-31')
    )
    inherits (climate.measurement);

    CREATE INDEX measurement_12_013_s_idx
      ON climate.measurement_12_013
      USING btree
      (station_id);

    CREATE INDEX measurement_12_013_d_idx
      ON climate.measurement_12_013
      USING btree
      (taken);

    SELECT
      count(1) AS measurements,
      avg(m.amount) AS amount
    FROM
      climate.measurement m
    WHERE
      m.station_id IN (
        SELECT
          s.id
        FROM
          climate.station s,
          climate.city c
        WHERE
            /* For one city... */
            c.id = 5182 AND

            /* Where stations are within an elevation range... */
            s.elevation BETWEEN 0 AND 3000 AND

            /* and within a specific radius... */
            -- Seriously, you should be using PostGIS here, which can
            -- then use a GiST index to do this a lot faster with a
            -- bounding box...
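            -- (A sketch of that idea, assuming PostGIS geography
            -- columns named c.geom and s.geom, which are invented here:
            --   WHERE ST_DWithin(c.geom, s.geom, 50000)  -- 50 km
            -- )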
            6371.009 * SQRT(
              POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) +
                (COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) *
                  POW(RADIANS(c.longitude_decimal - s.longitude_decimal),
2))
            ) <= 50
        ) AND

      /* Data before 1900 is shaky; insufficient after 2009. */
      -- I have no idea why this is here.  Aren't you forcing
      -- this already in your application code that's checking
      -- user input values?  Also, do you actually *have* any
      -- data outside this range?  If so, just pull out the
      -- tables with that data from the inheritance.
      -- m.taken >= '1900-01-01' AND m.taken <= '2009-12-31'
      -- extract( YEAR FROM m.taken ) BETWEEN 1900 AND 2009 AND

      /* Whittled down by category... */
      m.category_id = 1 AND

      /* Between the selected days and years... */
       CASE
         WHEN (user_start_year || user_start_day) <= (user_stop_year || user_stop_day) THEN
           m.taken BETWEEN (user_start_year || user_start_day)::date
                       AND (user_stop_year || user_stop_day)::date
         ELSE
           m.taken BETWEEN (user_start_year || user_start_day)::date
                       AND ((user_stop_year || user_stop_day)::date
                            + '1 year'::interval)::date
       END
    -- I don't think you need/want this..?
    -- GROUP BY
    --  extract( YEAR FROM m.taken )

        Enjoy,

            Stephen


Re: Optimize date query for large child tables: GiST or GIN?

From: Stephen Frost
* David Jarvis (thangalin@gmail.com) wrote:
> > I was still referring to the measurement table. You have an index on
> > station_id, but still seem to be getting a sequential scan. Maybe the planner
> > does not realise that you are selecting a small number of stations. Posting
> > an EXPLAIN ANALYSE would really help here.
>
> Here is the result from an *EXPLAIN ANALYZE*:

Yeah..  this is a horrible, horrible plan.  It does look like you've got
some serious data tho, at least.  Basically, PG is sequentially scanning
through all of the tables in your partitioning setup.  What is
constraint_exclusion set to?  What version of PG is this?  Do the
results of this query look at all correct to you?

Have you considered an index on elevation, btw?  How many records in
that city table are there and how many are actually in that range?

    Thanks,

        Stephen


Re: Optimize date query for large child tables: GiST or GIN?

From: David Jarvis
Hi,

~300 million measurements
~12000 stations (not 70000 as I mentioned before)
~5500 cities

> [...] It does look like you've got some serious data tho, at least.
> Basically, PG is sequentially scanning through all of the tables in
> your partitioning setup.  What is constraint_exclusion set to?  What
> version of PG is this?  Do the results of this query look at all
> correct to you?

PG 8.4

    show constraint_exclusion;
    partition
With so much data, it is really hard to tell if the query looks okay without having it visualized. I can't visualize it until I have the query set up correctly. At the moment it looks like the query is wrong. :-(

> Have you considered an index on elevation, btw?  How many records in
> that city table are there and how many are actually in that range?

I've since added an index on elevation; it'll help a bit:

CREATE INDEX station_elevation_idx
  ON climate.station
  USING btree
  (elevation);

Dave

Re: Optimize date query for large child tables: GiST or GIN?

From: David Jarvis
Hi,

>        check (taken >= '1913-12-01' and taken <= '1913-12-31')

I don't think I want to constrain by year, for a few reasons:

1. There are a lot of years -- over 110.
2. There will be more years added (both in the future for 2010 and in the past as I get data from other sources).

Currently I have it constrained by month and category. Each table then has about 3 million rows (72 tables at 3 million each would be 216 million, but some tables have more, which brings the total to 273 million).
 
>      /* Data before 1900 is shaky; insufficient after 2009. */
>      -- I have no idea why this is here.  Aren't you forcing [...]

Mostly temporary. It is also constrained by the user interface; however that will likely change in the future. It should not be present in the database structure itself.

 
>      /* Between the selected days and years... */
>      CASE
>        WHEN (user_start_year || user_start_day) <= (user_stop_year || user_stop_day) THEN
>          m.taken BETWEEN (user_start_year || user_start_day)::date
>                      AND (user_stop_year || user_stop_day)::date
>        ELSE
>          m.taken BETWEEN (user_start_year || user_start_day)::date
>                      AND ((user_stop_year || user_stop_day)::date
>                           + '1 year'::interval)::date
>      END
>    -- I don't think you need/want this..?

User selects this:

1. Years: 1950 to 1974
2. Days: Dec 22 to Mar 22

This means that the query must average data between Dec 22 1950 and Mar 22 1951 for the year of 1950. For 1951, the range is Dec 22 1951 to Mar 22 1952, and so on. If we switch the calendar (or alter the seasons) so that winter starts Jan 1st (or ends Dec 31), then I could simplify the query. ;-)
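
Perhaps something like this could express it without per-row date math (a sketch with hard-coded values; untested):

    SELECT
      y AS winter_year,
      avg(m.amount) AS amount
    FROM
      generate_series(1950, 1974) AS y,
      climate.measurement m
    WHERE
      m.taken BETWEEN (y || '-12-22')::date
                  AND ((y + 1) || '-03-22')::date
    GROUP BY y;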

Dave

Re: Optimize date query for large child tables: GiST or GIN?

From: David Jarvis
Hi,

> Something in here really smells fishy to me.  Those extracts above are
> working on values which come from the table.  Why aren't you using these
> functions to construct the actual dates based on the values provided by
> the *user*?

Because I've only been using PostgreSQL for one week. For the last several years I've been developing with Oracle on mid-sized systems (40 million books, 5 million reservations per year, etc.). And even then, primarily on the user-facing side of the applications.

> Looking at your screenshot, I think you need to take those two date
> values that the user provides, make them into actual dates (maybe you
> need a CASE statement or something similar, that shouldn't be that
> hard) [...]

So the user selects Dec 22 and Mar 22 for 1900 to 2009 and the system feeds the report a WHERE clause that looks like:

  m.taken BETWEEN '22-12-1900'::date AND '22-03-1901'::date OR
  m.taken BETWEEN '22-12-1901'::date AND '22-03-1902'::date OR
  m.taken BETWEEN '22-12-1902'::date AND '22-03-1903'::date OR ...

That tightly couples the report query to the code that sets the report engine parameters. One of the parameters would be SQL code in the form of a dynamically crafted WHERE clause. I'd rather keep the SQL code that is used to create the report entirely with the report engine if at all possible.
 
> Also, you're trying to do constraint_exclusion, but have you made sure
> that it's turned on?  And have you made sure that those constraints are
> really the right ones and that they make sense?  You're using a bunch of
> extract()s there too; why not just specify a CHECK constraint on the
> date ranges which are allowed in the table?

I don't know what the date ranges will be, so I can't partition them by year, can I?

Right now I have 72 child tables, split by category and month. This may have been a bad choice, but at least all the data is in the system now, so dissecting it or integrating it back in different ways shouldn't take days.

Thanks everyone for all your help, I really appreciate the time you've taken to guide me in the right direction to make the system as fast as it can be.

Dave

Re: Optimize date query for large child tables: GiST or GIN?

From: David Jarvis
I took out the date conditions:

SELECT
  m.*
FROM
  climate.measurement m
WHERE
  m.category_id = 1 and
  m.station_id = 2043

This uses the station indexes:

"Result  (cost=0.00..21781.18 rows=8090 width=28)"
"  ->  Append  (cost=0.00..21781.18 rows=8090 width=28)"
"        ->  Seq Scan on measurement m  (cost=0.00..28.00 rows=1 width=38)"
"              Filter: ((category_id = 1) AND (station_id = 2043))"
"        ->  Bitmap Heap Scan on measurement_01_001 m  (cost=11.79..1815.67 rows=677 width=28)"
"              Recheck Cond: (station_id = 2043)"
"              Filter: (category_id = 1)"
"              ->  Bitmap Index Scan on measurement_01_001_s_idx  (cost=0.00..11.62 rows=677 width=0)"
"                    Index Cond: (station_id = 2043)"
"        ->  Bitmap Heap Scan on measurement_02_001 m  (cost=14.47..1682.18 rows=627 width=28)"
"              Recheck Cond: (station_id = 2043)"
"              Filter: (category_id = 1)"
"              ->  Bitmap Index Scan on measurement_02_001_s_idx  (cost=0.00..14.32 rows=627 width=0)"
"                    Index Cond: (station_id = 2043)"

2500+ rows in 185 milliseconds.

That is pretty good (I'll need it to be better but for now it works).

Then combined the selection of the station:

SELECT
  m.*
FROM
  climate.measurement m,
  (SELECT
     s.id
   FROM
     climate.station s,
     climate.city c
   WHERE
     c.id = 5182 AND
     s.elevation BETWEEN 0 AND 3000 AND
     6371.009 * SQRT(
       POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) +
       (COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) *
        POW(RADIANS(c.longitude_decimal - s.longitude_decimal), 2))
     ) <= 25
   ) t
WHERE
  m.category_id = 1 and
  m.station_id = t.id

The station index is no longer used, resulting in full table scans:

"Hash Join  (cost=1045.52..1341150.09 rows=14556695 width=28)"
"  Hash Cond: (m.station_id = s.id)"
"  ->  Append  (cost=0.00..867011.99 rows=43670085 width=28)"
"        ->  Seq Scan on measurement m  (cost=0.00..25.00 rows=6 width=38)"
"              Filter: (category_id = 1)"
"        ->  Seq Scan on measurement_01_001 m  (cost=0.00..71086.96 rows=3580637 width=28)"
"              Filter: (category_id = 1)"
"        ->  Seq Scan on measurement_02_001 m  (cost=0.00..64877.40 rows=3267872 width=28)"
"              Filter: (category_id = 1)"
"        ->  Seq Scan on measurement_03_001 m  (cost=0.00..71131.44 rows=3582915 width=28)"
"              Filter: (category_id = 1)"

How do I avoid the FTS?

(I know about PostGIS but I can only learn and do so much at once.) ;-)

Here's the station query:

SELECT
  s.id
FROM
  climate.station s,
  climate.city c
WHERE
  c.id = 5182 AND
  s.elevation BETWEEN 0 AND 3000 AND
  6371.009 * SQRT(
    POW(RADIANS(c.latitude_decimal - s.latitude_decimal), 2) +
    (COS(RADIANS(c.latitude_decimal + s.latitude_decimal) / 2) *
    POW(RADIANS(c.longitude_decimal - s.longitude_decimal), 2))
  ) <= 25

And its EXPLAIN:

"Nested Loop  (cost=0.00..994.94 rows=4046 width=4)"
"  Join Filter: ((6371.009::double precision * sqrt((pow(radians(((c.latitude_decimal - s.latitude_decimal))::double precision), 2::double precision) + (cos((radians(((c.latitude_decimal + s.latitude_decimal))::double precision) / 2::double precision)) * pow(radians(((c.longitude_decimal - s.longitude_decimal))::double precision), 2::double precision))))) <= 25::double precision)"
"  ->  Index Scan using city_pkey1 on city c  (cost=0.00..6.27 rows=1 width=16)"
"        Index Cond: (id = 5182)"
"  ->  Seq Scan on station s  (cost=0.00..321.08 rows=12138 width=20)"
"        Filter: ((s.elevation >= 0) AND (s.elevation <= 3000))"

I get a set of 78 rows returned in very little time.

Thanks again!
Dave

Re: Optimize date query for large child tables: GiST or GIN?

From:
Matthew Wakeling
Date:
On Thu, 20 May 2010, David Jarvis wrote:
> I took out the date conditions:
>
> SELECT
>  m.*
> FROM
>  climate.measurement m
> WHERE
>  m.category_id = 1 and
>  m.station_id = 2043
>
> This uses the station indexes:

Yes, because there is only one station_id selected. That's exactly what an
index is for.

> Then combined the selection of the station:
> The station index is no longer used, resulting in full table scans:

> "Nested Loop  (cost=0.00..994.94 rows=4046 width=4)"
> "  Join Filter: ((6371.009::double precision *
> sqrt((pow(radians(((c.latitude_decimal - s.latitude_decimal))::double
> precision), 2::double precision) + (cos((radians(((c.latitude_decimal +
> s.latitude_decimal))::double precision) / 2::double precision)) *
> pow(radians(((c.longitude_decimal - s.longitude_decimal))::double
> precision), 2::double precision))))) <= 25::double precision)"
> "  ->  Index Scan using city_pkey1 on city c  (cost=0.00..6.27 rows=1
> width=16)"
> "        Index Cond: (id = 5182)"
> "  ->  Seq Scan on station s  (cost=0.00..321.08 rows=12138 width=20)"
> "        Filter: ((s.elevation >= 0) AND (s.elevation <= 3000))"
>
> I get a set of 78 rows returned in very little time.

(An EXPLAIN ANALYSE would be better here). Look at the expected number of
stations returned. It expects 4046 which is a large proportion of the
available stations. It therefore expects to have to touch a large
proportion of the measurement table, therefore it thinks that it will be
fastest to do a seq scan. In actual fact, for 78 stations, the index would
be faster, but for 4046 it wouldn't.

If you will be querying by season quite regularly, had you considered
partitioning by season?
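As a sketch (the table and constraint names here are invented), a season partition could reuse the month-based CHECK style from earlier in the thread:

    create table climate.measurement_winter (
      -- constraint exclusion can then skip this table for
      -- queries that exclude the winter months
      check (date_part('month', taken)::integer in (12, 1, 2))
    ) inherits (climate.measurement);
    -- ...and likewise for spring, summer, and autumn.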

Matthew

--
 Geography is going places.

Re: Optimize date query for large child tables: GiST or GIN?

From:
David Jarvis
Date:
Hi,

(An EXPLAIN ANALYSE would be better here). Look at the expected number of stations

"Nested Loop  (cost=0.00..994.94 rows=4046 width=4) (actual time=0.053..41.173 rows=78 loops=1)"
"  Join Filter: ((6371.009::double precision * sqrt((pow(radians(((c.latitude_decimal - s.latitude_decimal))::double precision), 2::double precision) + (cos((radians(((c.latitude_decimal + s.latitude_decimal))::double precision) / 2::double precision)) * pow(radians(((c.longitude_decimal - s.longitude_decimal))::double precision), 2::double precision))))) <= 25::double precision)"
"  ->  Index Scan using city_pkey1 on city c  (cost=0.00..6.27 rows=1 width=16) (actual time=0.014..0.016 rows=1 loops=1)"
"        Index Cond: (id = 5182)"
"  ->  Seq Scan on station s  (cost=0.00..321.08 rows=12138 width=20) (actual time=0.007..5.256 rows=12139 loops=1)"
"        Filter: ((s.elevation >= 0) AND (s.elevation <= 3000))"
"Total runtime: 41.235 ms"

expects to have to touch a large proportion of the measurement table, therefore it thinks that it will be fastest to do a seq scan. In actual fact, for 78 stations, the index would be faster, but for 4046 it wouldn't.

This is rather unexpected. I'd have figured it would use the actual number.
 
If you will be querying by season quite regularly, had you considered partitioning by season?

I have no idea what the "regular" queries will be. The purpose of the system is to open the data up to the public using a simple user interface so that they can generate their own custom reports. That user interface allows people to pick year intervals, day ranges, elevations, categories (temperature, precipitation, snow depth, etc.), and lat/long perimeter coordinates (encompassing any number of stations) or a city and radius.

Dave

Re: Optimize date query for large child tables: GiST or GIN?

From:
Yeb Havinga
Date:
David Jarvis wrote:
>
>     Also, you're trying to do constraint_exclusion, but have you made sure
>     that it's turned on?  And have you made sure that those
>     constraints are
>     really the right ones and that they make sense?  You're using a
>     bunch of
>     extract()'s there too, why not just specify a CHECK constraint on the
>     date ranges which are allowed in the table..?
>
>
> I don't know what the date ranges are? So I can't partition them by year?
>
> Right now I created 72 child tables by using the category and month.
> This may have been a bad choice. But at least all the data is in the
> system now so dissecting or integrating it back in different ways
> shouldn't take days.
>
> Thanks everyone for all your help, I really appreciate the time you've
> taken to guide me in the right direction to make the system as fast as
> it can be.

My $0.02 - it's hard to comment inline due to the number of responses,
but: the partitioning is only useful for speed, if it matches how your
queries select data. For time based data I would for sure go for year
based indexing. If you want a fixed number of partitions, you could
perhaps do something like year % 64. I did a test to see if the
constraint exclusion could work with extract, but that failed:

test=# create table parent(t timestamptz);
test=# create table child1(check ((extract(year from t)::int % 2)=0))
inherits( parent);
test=# create table child2(check ((extract(year from t)::int % 2)=1))
inherits(parent);
test=# explain select * from parent where (extract(year from t)::int %
2) = 0;
                                QUERY PLAN
---------------------------------------------------------------------------
 Result  (cost=0.00..158.40 rows=33 width=8)
   ->  Append  (cost=0.00..158.40 rows=33 width=8)
         ->  Seq Scan on parent  (cost=0.00..52.80 rows=11 width=8)
               Filter: (((date_part('year'::text, t))::integer % 2) = 0)
         ->  Seq Scan on child1 parent  (cost=0.00..52.80 rows=11 width=8)
               Filter: (((date_part('year'::text, t))::integer % 2) = 0)
         ->  Seq Scan on child2 parent  (cost=0.00..52.80 rows=11 width=8)
               Filter: (((date_part('year'::text, t))::integer % 2) = 0)

It hits all partitions even when I requested a single year.
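My guess - an assumption I have not verified - is that this is because
extract()/date_part() on timestamptz depends on the session time zone,
so it is only stable, not immutable, and constraint exclusion can only
reason about immutable expressions. The volatility can be checked:

test=# select provolatile from pg_proc
       where proname = 'date_part'
       and proargtypes[1] = 'timestamptz'::regtype;
 provolatile
-------------
 s
(1 row)

('s' = stable; an indexable, excludable expression would need 'i' =
immutable.)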

So an extra column would be needed; attempt 2 adds a year smallint column.

test=# create table parent(t timestamptz, y smallint);
test=# create table child1(check ((y % 2)=0)) inherits( parent);
test=# create table child2(check ((y % 2)=1)) inherits( parent);
test=# explain select * from parent where (y % 2) between 0 and 0;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Result  (cost=0.00..122.00 rows=20 width=10)
   ->  Append  (cost=0.00..122.00 rows=20 width=10)
         ->  Seq Scan on parent  (cost=0.00..61.00 rows=10 width=10)
               Filter: ((((y)::integer % 2) >= 0) AND (((y)::integer % 2) <= 0))
         ->  Seq Scan on child1 parent  (cost=0.00..61.00 rows=10 width=10)
               Filter: ((((y)::integer % 2) >= 0) AND (((y)::integer % 2) <= 0))

This works: only one child table hit.

That made me think: with year % 64, if you'd scan two consecutive years,
you'd always hit two different partitions. For your use case it'd be
nice if ranges that cross a year boundary would often fall within a
single partition. The following query shows partition numbers for 1900 -
2010, with 4 consecutive years in the same partition. It also shows that
in this case 32 partitions are enough:

test=# select x, (x / 4) % 32 from generate_series(1900,2010) as x(x);
 x   | ?column?
------+----------
 1900 |       27
 1901 |       27
 1902 |       27
 1903 |       27
 1904 |       28
 1905 |       28
etc
 1918 |       31
 1919 |       31
 1920 |        0
 1921 |        0
etc
 2005 |       21
 2006 |       21
 2007 |       21
 2008 |       22
 2009 |       22
 2010 |       22
(111 rows)

This would mean that an extra smallint column is needed, which would
inflate the 300M-row relation by almost a GB, but I still think it'd be
a good idea.

create or replace function yearmod(int) RETURNS int
as 'select (($1 >> 2) %32);'
language sql
immutable
strict;

create table parent(t timestamptz, y smallint);

select 'create table child'||x||'(check (yearmod(y)='||x-1||'))
inherits(parent);' from generate_series(1,32) as x(x);
                           ?column?
---------------------------------------------------------------
 create table child1(check (yearmod(y)=0)) inherits(parent);
 create table child2(check (yearmod(y)=1)) inherits(parent);
 create table child3(check (yearmod(y)=2)) inherits(parent);
etc
 create table child30(check (yearmod(y)=29)) inherits(parent);
 create table child31(check (yearmod(y)=30)) inherits(parent);
 create table child32(check (yearmod(y)=31)) inherits(parent);
(32 rows)

Copy and paste the output of this query into psql to create the child tables.

Example with period 1970 to 1980:

test=# explain select * from parent where yearmod(y) between
yearmod(1970) and yearmod(1980);
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Result  (cost=0.00..305.00 rows=50 width=10)
   ->  Append  (cost=0.00..305.00 rows=50 width=10)
         ->  Seq Scan on parent  (cost=0.00..61.00 rows=10 width=10)
               Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
         ->  Seq Scan on child13 parent  (cost=0.00..61.00 rows=10 width=10)
               Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
         ->  Seq Scan on child14 parent  (cost=0.00..61.00 rows=10 width=10)
               Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
         ->  Seq Scan on child15 parent  (cost=0.00..61.00 rows=10 width=10)
               Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
         ->  Seq Scan on child16 parent  (cost=0.00..61.00 rows=10 width=10)
               Filter: ((((y / 4) % 32) >= 12) AND (((y / 4) % 32) <= 15))
(12 rows)

This works: a query for 11 consecutive years hits only 4 of the 32 partitions.

But the BETWEEN fails for yearmods that wrap past the partition-number
boundary, as happens here between 1910 and 1920:

test=# explain select * from parent where yearmod(y) between
yearmod(1910) and yearmod(1920);
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
   One-Time Filter: false
(2 rows)

So for the wraparound case we need a CASE:

test=# explain select * from parent where case when yearmod(1910) <=
yearmod(1920)
then yearmod(y) between yearmod(1910) and yearmod(1920)
else (yearmod(y) >= yearmod(1910) or yearmod(y) <= yearmod(1920)) end;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Result  (cost=0.00..305.00 rows=5665 width=10)
   ->  Append  (cost=0.00..305.00 rows=5665 width=10)
         ->  Seq Scan on parent  (cost=0.00..61.00 rows=1133 width=10)
               Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
         ->  Seq Scan on child1 parent  (cost=0.00..61.00 rows=1133 width=10)
               Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
         ->  Seq Scan on child30 parent  (cost=0.00..61.00 rows=1133 width=10)
               Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
         ->  Seq Scan on child31 parent  (cost=0.00..61.00 rows=1133 width=10)
               Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))
         ->  Seq Scan on child32 parent  (cost=0.00..61.00 rows=1133 width=10)
               Filter: ((((y / 4) % 32) >= 29) OR (((y / 4) % 32) <= 0))

This should work for all year ranges, and I think it is a good solution
for partitioning on year with a fixed number of partitions.
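To make that concrete, a sketch of the query template, where start_year
and end_year stand for the user-supplied bounds:

select *
from parent
where case when yearmod(start_year) <= yearmod(end_year)
           then yearmod(y) between yearmod(start_year) and yearmod(end_year)
           else yearmod(y) >= yearmod(start_year)
             or yearmod(y) <= yearmod(end_year)
      end
  -- exact filter on top of the partition-pruning clause:
  and y between start_year and end_year;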

From the optimizer perspective I wonder what the best access path for
this kind of query would be (if there were no partitions). Building on
ideas from one of Thom Brown's first replies, with indexes on year and
doy, and Tom Lane's remark about the leap year problem: suppose leap
years did not exist; then, with one index on year and a different index
on doy, a bitmap AND of scans of both the year and doy indexes could
provide an optimal path. Maybe this would still be possible if the leap
year problem could be 'fixed' by an additional condition in the WHERE
clause that filters out the surplus records.

regards,
Yeb Havinga


Re: Optimize date query for large child tables: GiST or GIN?

From:
Yeb Havinga
Date:
There is a thing that might lead to confusion in the previous post:
> create or replace function yearmod(int) RETURNS int
> as 'select (($1 >> 2) %32);'
> language sql
> immutable
> strict;
is equivalent to

create or replace function yearmod(int) RETURNS int
as 'select (($1 / 4) %32);'
language sql
immutable
strict;

and that is the function that was used with all the other output (it can
be seen inlined in the explain output). I did not catch this until after
the post.

regards,
Yeb Havinga



Re: Optimize date query for large child tables: GiST or GIN?

From:
Matthew Wakeling
Date:
On Fri, 21 May 2010, Yeb Havinga wrote:
> For time based data I would for sure go for year based indexing.

On the contrary, most of the queries seem to be over many years, but
rather restricting on the time of year. Therefore, partitioning by month
or some other within-year method would seem sensible.

Regarding the leap year problem, you might consider creating a modified
day of year field, which always assumes that the year contains a leap day.
Then a given number always resolves to a given date, regardless of year.
If you then partition (or index) on that field, then you may get a
benefit.
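A minimal sketch of such a field (the function name is invented): map
every date onto its day-of-year within a fixed leap year, here 2000, so
the 29th of February gets a number and non-leap years merely skip one
value:

create or replace function doy366(date) returns integer as $$
  select extract(doy from to_date('2000-' || to_char($1, 'MM-DD'),
                                  'YYYY-MM-DD'))::integer;
$$ language sql immutable strict;
-- declared immutable (so it can be indexed) on the assumption that the
-- purely numeric MM-DD format is locale-independent

-- then, for example, per child table:
-- create index measurement_01_001_doy_idx
--   on climate.measurement_01_001 (doy366(taken));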

In this case, partitioning is only really useful when you are going to be
forced to do seq scans. If you can get a suitably selective index, in the
case where you are selecting a small proportion of the data, then I would
concentrate on getting the index right, rather than the partition, and
maybe even not do partitioning.

Matthew

--
 Trying to write a program that can't be written is... well, it can be an
 enormous amount of fun!                 -- Computer Science Lecturer

Re: Optimize date query for large child tables: GiST or GIN?

From:
David Jarvis
Date:
Hi, Yeb.

This is starting to go back to the design I used with MySQL:
  • YEAR_REF - Has year and station
  • MONTH_REF - Has month, category, and year reference
  • MEASUREMENT - Has month reference, amount, and day
Normalizing by date parts was fast. Partitioning the tables by year won't do much good -- users will probably choose 1900 to 2009, predominantly.

I thought about splitting the data by station by category, but that's ~73000 tables. My understanding is that PostgreSQL uses a file per table and per index, which would be messy at the OS level (Linux 2.6.31). Even splitting by station alone yields 12,139 tables, which might be tolerable for now, but with an order of magnitude more stations on the distant horizon, it will not scale.

I also thought about splitting the data by station district by category -- there are 79 districts, yielding 474 child tables, which is ~575000 rows per child table. Most of the time I'd imagine only one or two districts would be selected. (Again, hard to know exactly.)

Dave

Re: Optimize date query for large child tables: GiST or GIN?

From:
Yeb Havinga
Date:
Matthew Wakeling wrote:
> On Fri, 21 May 2010, Yeb Havinga wrote:
>> For time based data I would for sure go for year based indexing.
>
> On the contrary, most of the queries seem to be over many years, but
> rather restricting on the time of year. Therefore, partitioning by
> month or some other per-year method would seem sensible.
The fact is that at the time I wrote my mail, I had not read a
specification of the distribution of parameters (or I missed it). That's
why the sentence
of my mail before the one you quoted said: "the partitioning is only
useful for speed, if it matches how your queries select data.". In most
of the databases I've worked with, the recent data was queried most
(accounting, medical) but I can see that for climate analysis this might
be different.
> Regarding the leap year problem, you might consider creating a
> modified day of year field, which always assumes that the year
> contains a leap day. Then a given number always resolves to a given
> date, regardless of year. If you then partition (or index) on that
> field, then you may get a benefit.
Shouldn't it be just the other way around - assume all years are
non-leap years for the doy field to be indexed?

regards,
Yeb Havinga


Re: Optimize date query for large child tables: GiST or GIN?

From:
Yeb Havinga
Date:
David Jarvis wrote:
> Hi, Yeb.
>
> This is starting to go back to the design I used with MySQL:
>
>     * YEAR_REF - Has year and station
>     * MONTH_REF - Has month, category, and year reference
>     * MEASUREMENT - Has month reference, amount, and day
>
> Normalizing by date parts was fast. Partitioning the tables by year
> won't do much good -- users will probably choose 1900 to 2009,
> predominantly.
Ok, in that case it is a bad idea.
> I thought about splitting the data by station by category, but that's
> ~73000 tables. My understanding is that PostgreSQL uses a file per
> table and per index, which would be messy at the OS level (Linux
> 2.6.31). Even splitting by station alone yields 12,139 tables, which
> might be tolerable for now, but with an order of magnitude more
> stations on the distant horizon, it will not scale.
Yes, I've read a few times now that PG's partitioning doesn't scale
beyond a few hundred partitions.
> I also thought about splitting the data by station district by
> category -- there are 79 districts, yielding 474 child tables, which
> is ~575000 rows per child table. Most of the time I'd imagine only one
> or two districts would be selected. (Again, hard to know exactly.)
I agree with Matthew Wakeling in a different post: it's probably wise to
first see how fast things can get by using indexes. Only if that fails
to be fast might partitioning be an option. (Though sequentially
scanning 0.5M rows is not cheap.)

I experimented a bit with a doy and year function.

-- note: leap year fix must still be added
create or replace function doy(timestamptz) RETURNS float8
as 'select extract(doy from $1);'
language sql
immutable
strict;
create or replace function year(timestamptz) RETURNS float8
as 'select extract(year from $1);'
language sql
immutable
strict;

\d parent
             Table "public.parent"
 Column |           Type           | Modifiers
--------+--------------------------+-----------
 t      | timestamp with time zone |
 y      | smallint                 |
Indexes:
    "doy_i" btree (doy(t))
    "year_i" btree (year(t))

A plan like the following is probably what you want

test=# explain select * from parent where doy(t) between 10 and 20 and
year(t) between 1900 and 2009;

                                  QUERY PLAN
-------------------------------------------------------------------------------
 Bitmap Heap Scan on parent  (cost=9.95..14.97 rows=1 width=10)
   Recheck Cond: ((year(t) >= 1900::double precision) AND (year(t) <= 2009::double precision) AND (doy(t) >= 10::double precision) AND (doy(t) <= 20::double precision))
   ->  BitmapAnd  (cost=9.95..9.95 rows=1 width=0)
         ->  Bitmap Index Scan on year_i  (cost=0.00..4.85 rows=10 width=0)
               Index Cond: ((year(t) >= 1900::double precision) AND (year(t) <= 2009::double precision))
         ->  Bitmap Index Scan on doy_i  (cost=0.00..4.85 rows=10 width=0)
               Index Cond: ((doy(t) >= 10::double precision) AND (doy(t) <= 20::double precision))
(7 rows)

regards,
Yeb Havinga




Re: Optimize date query for large child tables: GiST or GIN?

From:
Matthew Wakeling
Date:
>> Regarding the leap year problem, you might consider creating a modified day
>> of year field, which always assumes that the year contains a leap day. Then
>> a given number always resolves to a given date, regardless of year. If you
>> then partition (or index) on that field, then you may get a benefit.
On Fri, 21 May 2010, Yeb Havinga wrote:
> Shouldn't it be just the other way around - assume all years are non leap
> years for the doy part field to be indexed.

The mapping doesn't matter massively, as long as all days of the year can
be mapped uniquely onto a number, and the numbers are sequential. Your
suggestion does not satisfy the first of those two requirements.

If you assume that all years are leap years, then you merely skip a number
in the middle of the year, which isn't a problem when you want to check
for days between two bounds. However, if you assume a non-leap year, then
there is no representation for the 29th of February, so not all data
points will have a representative number to insert into the database.

Matthew

--
 No, C++ isn't equal to D. 'C' is undeclared, so we assume it's an int,
 with a default value of zero.  Hence, C++ should really be called 1.
 -- met24, commenting on the quote "C++ -- shouldn't it be called D?"

Re: Optimize date query for large child tables: GiST or GIN?

From:
Stephen Frost
Date:
* Yeb Havinga (yebhavinga@gmail.com) wrote:
>> Normalizing by date parts was fast. Partitioning the tables by year
>> won't do much good -- users will probably choose 1900 to 2009,
>> predominately.
> Ok, in that case it is a bad idea.

Yeah, now that I understand what the user actually wants, I can
certainly understand that you wouldn't want to partition by year.  It
does strike me that perhaps you could partition by day ranges, but you'd
have to store them as something other than the 'date' type, which is
certainly frustrating, but you're not really operating on these in a
'normal' fashion as you would with a date.

The next question I would have, however, is if you could pre-aggregate
some of this data.  If users are typically going to use 1900-2009 for
years, then could the information about all of those years be aggregated
a priori to make those queries faster?
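Just as a sketch (the table and column names here are invented), a
yearly rollup per station and category might look like:

    create table climate.measurement_yearly as
      select station_id,
             category_id,
             date_part('year', taken)::int as year_taken,
             count(*)    as measurements,
             sum(amount) as amount_sum,
             avg(amount) as amount_avg
      from climate.measurement
      group by station_id, category_id, date_part('year', taken);
    -- keeping count and sum lets a later query combine several years
    -- into an exact average instead of averaging averages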

>> I thought about splitting the data by station by category, but that's
>> ~73000 tables.

Do not get hung up on having to have a separate table for every unique
value in the column- you don't need that.  constraint_exclusion will
work just fine with ranges too- the problem is that you need to have
ranges that make sense with the data type you're using and with the
queries you're running.  That doesn't really work here with the
measurement_date, but it might work just fine with your station_id
field.
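A sketch of what that might look like (the ranges and names are
invented):

    create table climate.measurement_s00 (
      check (station_id >= 0 and station_id < 1000)
    ) inherits (climate.measurement);

    create table climate.measurement_s01 (
      check (station_id >= 1000 and station_id < 2000)
    ) inherits (climate.measurement);
    -- ...a query for station_id = 2043 (or a small IN list) would then
    -- be excluded from all partitions but one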

>> I also thought about splitting the data by station district by
>> category -- there are 79 districts, yielding 474 child tables, which
>> is ~575000 rows per child table. Most of the time I'd imagine only one
>> or two districts would be selected. (Again, hard to know exactly.)

Also realize that PG will use multiple files for a single table once the
size of that table goes beyond 1G.

> I agee with Matthew Wakeling in a different post: its probably wise to
> first see how fast things can get by using indexes. Only if that fails
> to be fast, partitioning might be an option. (Though sequentially
> scanning 0.5M rows is not cheap).

I would agree with this too- get it working first, then look at
partitioning.  Even more so- work on a smaller data set to begin with
while you're figuring out how to get the right answer in a generally
efficient way (not doing seq. scans through everything because you're
operating on every row for something).  It needs to be a couple
hundred-thousand rows, but it doesn't need to be the full data set, imv.

    Thanks,

        Stephen


Re: Optimize date query for large child tables: GiST or GIN?

From:
David Jarvis
Date:
Hi,

CREATE INDEX measurement_01_001_y_idx
 ON climate.measurement_01_001
 USING btree
 (date_part('year'::text, taken));

Is that equivalent to what you suggest?

No. It is not the same function, so Postgres has no way to know it produces the same results (if it does).

This is what I ran:

CREATE INDEX
  measurement_013_taken_year_idx
ON
  climate.measurement_013
  (EXTRACT( YEAR FROM taken ));

This is what pgadmin3 shows me:

CREATE INDEX measurement_013_taken_year_idx
  ON climate.measurement_013
  USING btree
  (date_part('year'::text, taken));

As far as I can tell, it appears they are equivalent?

Either way, the cost for performing a GROUP BY is high (I ran once with extract and once with date_part). The date_part EXPLAIN ANALYSE resulted in:

"Limit  (cost=1748024.65..1748028.65 rows=200 width=12) (actual time=65471.448..65471.542 rows=101 loops=1)"

The EXTRACT EXPLAIN ANALYSE came to:

"Limit  (cost=1748024.65..1748028.65 rows=200 width=12) (actual time=44913.263..44913.330 rows=101 loops=1)"

If PG treats them differently, I'd like to know how so that I can do the right thing. As it is, I cannot see the difference in performance between date_part and EXTRACT.
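One way to check, I think, is to let PostgreSQL print back a parsed
expression (this sketch assumes a scratch view):

create temp view v as
  select extract(year from taken) as y
  from climate.measurement;

select pg_get_viewdef('v'::regclass, true);
-- prints the stored expression as date_part('year'::text, taken),
-- suggesting EXTRACT is rewritten to date_part by the parser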

Dave

Re: Optimize date query for large child tables: GiST or GIN?

From:
David Jarvis
Date:
Hi,

> certainly understand that you wouldn't want to partition by year.

Definitely not.


> does strike me that perhaps you could partition by day ranges

I don't think that will work; users can choose any day range, with the
most common being Jan 1 - Dec 31, followed by seasonal ranges, followed
by arbitrary ranges.


> some of this data.  If users are typically going to use 1900-2009 for
> years, then could the information about all of those years be aggregated
> a priori to make those queries faster?

I'm not sure what you mean. I could create a separate table that lumps the
aggregated averages per year per station per category, but that will only
help in the one case. There are five different reporting pages (Basic
through Guru). On three of those pages the user must select arbitrary day
ranges. On one of those pages, the user can select a season, which then maps
to, for all intents and purposes, an arbitrary day range.

Only the most basic page does not offer the user a day range selection.


> Do not get hung up on having to have a separate table for every unique
> value in the column- you don't need that.  constraint_exclusion will

That's good advice. I have repartitioned the data into seven tables: one per
category.
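A sketch of that layout (the table names are invented), one child table
per category:

    create table climate.measurement_01 (
      check (category_id = 1)
    ) inherits (climate.measurement);
    -- ...and so on through category_id = 7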


> I agree with Matthew Wakeling in a different post: it's probably wise
> to first see how fast things can get by using indexes.

> I would agree with this too- get it working first, then look at
> partitioning.  Even more so- work on a smaller data set to begin with

The query speed has now improved greatly, thanks to everybody's advice.

Re: Optimize date query for large child tables: GiST or GIN?

От
David Jarvis
Дата:
Hi,

The problem is now solved (in theory).

Well, it's not the functions per se that's the problem, it's the lack of
a useful index on the expression.

The measurement table indexes (on date and weather station) were not being used because the given date ranges (e.g., 1900 - 2009) covered nearly all the data, so the planner chose a full table scan, which is correct. What I had to do was find a way to narrow the dates so that the planner would actually use the index rather than doing a full table scan on 43 million records. By passing in 1955 - 1960, the full table scan went away in favour of an index scan, as expected.

Each weather station has a known lifespan (per climate category). That is, not every weather station collected data over the whole 1880 to 2009 period. For example, one weather station monitored the maximum daily temperature between 2006-11-29 and 2009-12-31. Some stations span more than 30 years, but I believe those are in the minority (e.g., 1896-12-01 to 1959-01-31). (I'll be able to verify once the analysis is finished.)

I will add another table that maps each station to its categories and min/max dates. I can then use this reference table, which should (in theory) let the planner use the index.
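A sketch of that reference table (the names are placeholders):

create table climate.station_category (
  station_id  integer  not null,
  category_id smallint not null,
  taken_min   date     not null,
  taken_max   date     not null,
  primary key (station_id, category_id)
);

insert into climate.station_category
  select station_id, category_id, min(taken), max(taken)
  from climate.measurement
  group by station_id, category_id;

The main query can then clamp the user's range per station, e.g. taken between greatest(sc.taken_min, date '1900-01-01') and least(sc.taken_max, date '2009-12-31'), giving the planner a narrow range to work with.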

What is really impressive, though... If my understanding is correct...

PostgreSQL scanned 43 million rows 78 times, returning results in ~90 sec.

Thanks again for all your help, everybody. I sincerely appreciate your patience, comments, and ideas.

Dave

Re: Optimize date query for large child tables: GiST or GIN?

From:
Matthew Wakeling
Date:
On Sun, 23 May 2010, David Jarvis wrote:
> The measurement table indexes (on date and weather station) were not being
> used because the given date ranges (e.g., 1900 - 2009) covered nearly all
> the data, so the planner chose a full table scan, which is correct.

I wonder if you might see some benefit from CLUSTERing the tables on the
index.
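For instance, with the index names from earlier in the thread (a sketch;
CLUSTER rewrites the table in index order):

  cluster climate.measurement_01_001 using measurement_01_001_s_idx;
  analyze climate.measurement_01_001;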

Matthew

--
 And the lexer will say "Oh look, there's a null string. Oooh, there's
 another. And another.", and will fall over spectacularly when it realises
 there are actually rather a lot.
         - Computer Science Lecturer (edited)

Re: Optimize date query for large child tables: GiST or GIN?

From:
Alvaro Herrera
Date:
Excerpts from Matthew Wakeling's message of mar jun 01 05:55:35 -0400 2010:
> On Sun, 23 May 2010, David Jarvis wrote:
> > The measurement table indexes (on date and weather station) were not being
> > used because the given date ranges (e.g., 1900 - 2009) covered nearly all
> > the data, so the planner chose a full table scan, which is correct.
>
> I wonder if you might see some benefit from CLUSTERing the tables on the
> index.

Eh, isn't this a GIN or GiST index?  I don't think you can cluster on
those, can you?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Optimize date query for large child tables: GiST or GIN?

From:
Alvaro Herrera
Date:
Excerpts from David Jarvis's message of mar jun 01 14:01:22 -0400 2010:
> Sorry, Alvaro.
>
> I was contemplating using a GIN or GiST index as a way of optimizing the
> query.

My fault -- I didn't read the whole thread.

> Instead, I found that re-inserting the data in order of station ID (the
> primary look-up column) and then CLUSTER'ing on the station ID, taken date,
> and category index increased the speed by an order of magnitude.

Hmm, that's nice, though I cannot but wonder whether the exclusive lock
required by CLUSTER is going to be a problem in the long run.

> I might be able to drop the station/taken/category index in favour of the
> simple station index and CLUSTER on that, instead (saving plenty of disk
> space). Either way, it's fast right now so I'm not keen to try and make it
> much faster.

Hm, keep in mind that if the station clause alone is not selective
enough, scanning it may be too expensive.  The current three column
index is probably a lot faster to search (though of course it's causing
more work to be kept up to date on insertions).

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Optimize date query for large child tables: GiST or GIN?

From:
David Jarvis
Date:
Sorry, Alvaro.

I was contemplating using a GIN or GiST index as a way of optimizing the query.

Instead, I found that re-inserting the data in order of station ID (the primary look-up column) and then CLUSTER'ing on the station ID, taken date, and category index increased the speed by an order of magnitude.

I might be able to drop the station/taken/category index in favour of the simple station index and CLUSTER on that, instead (saving plenty of disk space). Either way, it's fast right now so I'm not keen to try and make it much faster.

Dave

Re: Optimize date query for large child tables: GiST or GIN?

From:
David Jarvis
Date:
Hi,

Hmm, that's nice, though I cannot but wonder whether the exclusive lock
required by CLUSTER is going to be a problem in the long run.

Not an issue; the inserts are one-time (or very rare: at most once a year).
 
Hm, keep in mind that if the station clause alone is not selective
enough, scanning it may be too expensive.  The current three column

The seven child tables (split on category ID) have the following indexes:
  1. Primary key (unique ID, sequence)
  2. Station ID (table data is physically inserted by station ID order)
  3. Station ID, Date, and Category ID (this index is CLUSTER'ed)
I agree that the last index is probably all that is necessary. 99% of the searches use the station ID, date, and category. I don't think PostgreSQL necessarily uses that last index, though.
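For reference, a sketch with made-up names for one of the seven per-category tables:

  -- the clustered three-column index:
  create index measurement_01_stc_idx
    on climate.measurement_01 (station_id, taken, category_id);
  cluster climate.measurement_01 using measurement_01_stc_idx;

  -- if that index alone proves sufficient, the single-column station
  -- index could then be dropped to save space:
  -- drop index climate.measurement_01_s_idx;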

Dave