Re: SQL - planet redundant data

From: Poul Jensen
Subject: Re: SQL - planet redundant data
Date:
Msg-id: 4325EF4B.2060000@gfy.ku.dk
In response to: Re: SQL - planet redundant data  ("John D. Burger" <john@mitre.org>)
Responses: Re: SQL - planet redundant data  (Brent Wood <b.wood@niwa.co.nz>)
List: pgsql-general
Thank you, John!
I misunderstood you the first time, but I now see we have the same thing
in mind.

> So you'd have most of your data in a main table:
>
>   create table observations (
>     obsID        integer  primary key,  -- Maybe a BIGINT
>     temperature  float,
>     etc.
>   );
>
> and some other "compressed" tables for those features that have long
> runs of repetitive values:
>
>   create table obsYears (
>     startObs  integer  primary key  references observations (obsID),
>     endObs    integer               references observations (obsID),
>     year      integer);
>
>   create table obsStations (
>     startObs   integer  primary key  references observations (obsID),
>     endObs     integer               references observations (obsID),
>     stationID  integer);
>
> (Caution, I haven't checked these for syntax.)  I've introduced an
> observation ID, and then I have "compressed" tables that map =ranges=
> of these IDs to values that are constant for long stretches.  Each
> year occupies only one row, same with each station.

One complication: by assigning the observation IDs you are in effect
ordering the rows. If you order them chronologically, "year" is perfectly
lined up, giving you one row per value in your compressed table, but e.g.
"month" will be split into n_years*12 stretches of obsIDs, and
"station_id" may not have any contiguous stretches of obsIDs at all. I
don't see any way around this, but better compression can be achieved by
ordering the rows optimally when assigning the obsIDs. The reply to Tom
Lane in my previous post suggested one way to do this - it may not
always be optimal, but at least it's simple. A sketch of such a load
follows below.
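
To make this concrete, here is an untested sketch of such a load. The
staging table raw_obs and its columns are made-up names; the point is
that the inner ORDER BY controls the order in which nextval() hands out
the obsIDs (this works in PostgreSQL in practice, though the standard
doesn't guarantee it), so that "year" ends up in contiguous runs:

  -- hypothetical staging table: raw_obs(obs_time, station_id, temperature, ...)
  create sequence obs_seq;

  create table obs_load as
    select nextval('obs_seq') as obsID, obs_time, station_id, temperature
    from (select * from raw_obs order by obs_time) as sorted;

  insert into observations (obsID, temperature)
    select obsID, temperature from obs_load;

  -- one row per year, since each year is one contiguous run of obsIDs
  insert into obsYears (startObs, endObs, year)
    select min(obsID), max(obsID), extract(year from obs_time)::integer
    from obs_load
    group by extract(year from obs_time);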

> Now you can do queries like this, say, for temperature statistics in a
> particular year:
>
>   select avg(temperature), stddev(temperature)
>   from observations, obsYears
>   where obsID between startObs and endObs
>     and year = 2001;

This works! I hadn't yet realized how to make this kind of connection
between two tables, so that was a major help - thank you.

>
> You could join in other compressed tables in the same way.  In fact,
> you could glue them all together with a VIEW, and you'd be able to
> treat the whole thing like one giant table, with much of the
> redundancy removed.

That is exactly what I want, and now I finally see how to do it (I
think!) - see the sketch below. However, it is a considerable amount of
work to set this up manually, and it has been a headache working out how
to get there at all. I'm hoping one or more of the developers will think
it a good idea for PostgreSQL to perform an internal table optimization
of this kind using run-length encoding. Imagine being able to just throw
all your data into one table, run OPTIMIZE TABLE and be done. With SQL
being all about tables, I'm surprised this idea (or something even
better) hasn't been implemented already.
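
For the record, the "glue" view might look like this (untested; table
and column names are from John's sketch above, and the view name
obs_full is made up):

  create view obs_full as
    select o.obsID, o.temperature, y.year, s.stationID
    from observations o
    join obsYears    y on o.obsID between y.startObs and y.endObs
    join obsStations s on o.obsID between s.startObs and s.endObs;

  -- query it as if it were one big, uncompressed table:
  select avg(temperature), stddev(temperature)
  from obs_full
  where year = 2001;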

Poul Jensen
