Обсуждение: SELECT question

Поиск
Список
Период
Сортировка

SELECT question

От
Michelle Konzack
Дата:
*********************************************************************
* Do not Cc: me, because I am on THIS list, if I write here.        *
* Keine Cc: an mich, bin auf DIESER Liste wenn ich hier schreibe.   *
* Ne me mettez pas en Cc:, je suis sur CETTE liste, si j'ecris ici. *
*********************************************************************

Hello,

I have a very huge table of references from Websites (html, pics, ...)
where the elements stored on a fileserver using sha384.

Now the indextable hold all filenames and download dates but now I like
to get a "snapshoot on a paticular day.

How must I create the SELCT statement to get ALL files valid on a
particular day?

Note:  There can be every time a new index.html for example but images
       change only once a month...

       So I need all elements valable on the paticular day which mean,
       I need to select that LAST version of the elements...

Greetings
    Michelle Konzack


--
Linux-User #280138 with the Linux Counter, http://counter.li.org/
##################### Debian GNU/Linux Consultant #####################
Michelle Konzack   Apt. 917                  ICQ #328449886
                   50, rue de Soultz         MSN LinuxMichi
0033/6/61925193    67100 Strasbourg/France   IRC #Debian (irc.icq.com)

Вложения

Re: SELECT question

От
Michael Glaesemann
Дата:
On Aug 17, 2007, at 7:27 , Michelle Konzack wrote:

> *********************************************************************
> * Do not Cc: me, because I am on THIS list, if I write here.        *

You might want to consider changing your mailing list subscription
settings to "eliminatecc", e.g., send email to
majordomo@postgresql.org (not the list address!) with body

set pgsql-general eliminatecc

This should prevent the mailing list from sending you a second copy.

> I have a very huge table of references from Websites (html, pics, ...)
> where the elements stored on a fileserver using sha384.
>
> Now the indextable hold all filenames and download dates but now I
> like
> to get a "snapshoot on a paticular day.
>
> How must I create the SELCT statement to get ALL files valid on a
> particular day?
>
> Note:  There can be every time a new index.html for example but images
>        change only once a month...
>
>        So I need all elements valable on the paticular day which mean,
>        I need to select that LAST version of the elements...

I think what you want is something like:

SELECT DISTINCT ON (website_reference) website_reference,
download_date, file_path
FROM indextable
WHERE download_date <= ? -- whatever date you're interested in
ORDER BY website_reference, download_date DESC;

This should return the most recent website_reference and its
download_date that's earlier than the download_date specified in the
WHERE clause.

DISTINCT ON is a (very helpful) PostgreSQL extension. You can get
similar results using a subquery;

SELECT website_reference, download_date, file_path
FROM indextable
NATURAL JOIN (
     SELECT website_reference, max(download_date) as download_date
     FROM indextable
     WHERE download_date <= ?
     GROUP BY website_reference
     ) most_recent_versions;

This may return more than one row per website_reference if the
website_reference has more than on file_path for a particular
download_date.

Does this help? If not, could you give a bit more of a concrete example?

(Is is just me or have there been a lot of queries that can be solved
using DISTINCT ON recently?)

Michael Glaesemann
grzm seespotcode net


Вложения

Re: SELECT question

От
Michelle Konzack
Дата:
Am 2007-08-17 12:53:41, schrieb Michael Glaesemann:
>
> On Aug 17, 2007, at 7:27 , Michelle Konzack wrote:
>
> >*********************************************************************
> >* Do not Cc: me, because I am on THIS list, if I write here.        *
>
> You might want to consider changing your mailing list subscription
> settings to "eliminatecc", e.g., send email to
> majordomo@postgresql.org (not the list address!) with body
>
> set pgsql-general eliminatecc
>
> This should prevent the mailing list from sending you a second copy.

Which mean, my "INBOX.ML_pgsql.general/" will never receive
messages and break all threads where someone send me CC's...

> I think what you want is something like:
>
> SELECT DISTINCT ON (website_reference) website_reference,
> download_date, file_path
> FROM indextable
> WHERE download_date <= ? -- whatever date you're interested in
> ORDER BY website_reference, download_date DESC;
>
> This should return the most recent website_reference and its
> download_date that's earlier than the download_date specified in the
> WHERE clause.
>
> DISTINCT ON is a (very helpful) PostgreSQL extension. You can get
> similar results using a subquery;

I have never used "DISTINCT ON" (it was not known to me)
and was trying subqueries...  :-/

> SELECT website_reference, download_date, file_path
> FROM indextable
> NATURAL JOIN (
>     SELECT website_reference, max(download_date) as download_date
>     FROM indextable
>     WHERE download_date <= ?
>     GROUP BY website_reference
>     ) most_recent_versions;
>
> This may return more than one row per website_reference if the
> website_reference has more than on file_path for a particular
> download_date.
>
> Does this help? If not, could you give a bit more of a concrete example?

I have an Enterprise which do researches  :-)  and I have a local cache
of more then 150.000.000 URL's and its content (~8 TByte)...  (I have
hit over 2000 md5 collisons and now using sha384)  Also I get per day
nearly 100.000 new files...

OK, HTML pages are downloaded and go into the first table like

    indextable    FULL_URL, SHA384SUM

and the second table

    content        SERNUM (uniq), SHA384SUM (pri), LOCAL_PATH

the saved file get as the name the SHA384 name

If I open an HTML-URL with a specific date, it is parsed and the URL's
inline are adapted to make it work from my database, e.g.

    http://www.postgresql.org/index.html

will become

    http://webcache/show.php?date=123456789&url=http://www.postgresql.org/index.html

inline elements and already downloaded other links will bekome

    http://webcache/show.php?date=123456789&url=<original_url>

Thanks to the PostgreSQL developers that they have created "tablespace"
and "table partitioning" since searching in 150.000.000 ROW's is the
hell.

> (Is is just me or have there been a lot of queries that can be solved
> using DISTINCT ON recently?)

I do not know...  Since when does "DISTINCT ON" exist?

Greetings
    Michelle Konzack
    Systemadministrator
    Tamay Dogan Network
    Debian GNU/Linux Consultant


--
Linux-User #280138 with the Linux Counter, http://counter.li.org/
##################### Debian GNU/Linux Consultant #####################
Michelle Konzack   Apt. 917                  ICQ #328449886
                   50, rue de Soultz         MSN LinuxMichi
0033/6/61925193    67100 Strasbourg/France   IRC #Debian (irc.icq.com)

Вложения

Re: SELECT question

От
"Scott Marlowe"
Дата:
On 8/18/07, Michelle Konzack <linux4michelle@freenet.de> wrote:
> Am 2007-08-17 12:53:41, schrieb Michael Glaesemann:

> > (Is is just me or have there been a lot of queries that can be solved
> > using DISTINCT ON recently?)
>
> I do not know...  Since when does "DISTINCT ON" exist?

I have been lurking on this thread, don't have much to add, but
distinct on has been around in postgresql for as long as I can
remember.  It's not SQL spec, so it's not portable.  but it is
awefully useful.

Re: SELECT question

От
Alvaro Herrera
Дата:
Michelle Konzack wrote:
> Am 2007-08-17 12:53:41, schrieb Michael Glaesemann:
> >
> > On Aug 17, 2007, at 7:27 , Michelle Konzack wrote:
> >
> > >*********************************************************************
> > >* Do not Cc: me, because I am on THIS list, if I write here.        *
> >
> > You might want to consider changing your mailing list subscription
> > settings to "eliminatecc", e.g., send email to
> > majordomo@postgresql.org (not the list address!) with body
> >
> > set pgsql-general eliminatecc
> >
> > This should prevent the mailing list from sending you a second copy.
>
> Which mean, my "INBOX.ML_pgsql.general/" will never receive
> messages and break all threads where someone send me CC's...

So have messages which have the list on CC go to that folder as well,
:0:
* ^TO_pgsql-general@postgresql.org
INBOX.ML_pgsql.general/

Another choice you have is to tell mutt that you're subscribed to the
list, adding to .muttrc

lists pgsql-general@postgresql.org
subscribe pgsql-general@postgresql.org

then it will generate the Mail-Followup-To: header, omitting your
address, so you will only get it via Majordomo.  (It will still fail
sometimes if older MUAs don't honor that field).  I found it annoying
so I disabled it by setting
set followup_to=no

Another thing I did was removing of duplicates with procmail,
:0 Wh: msgid.lock
| formail -D 65536 $HOME/.msgid.cache

so I don't get two copies when I'm on the list and on CC.

--
Alvaro Herrera                 http://www.amazon.com/gp/registry/DXLWNGRJD34J
Management by consensus: I have decided; you concede.
(Leonard Liu)