Обсуждение: SELECT question
********************************************************************* * Do not Cc: me, because I am on THIS list, if I write here. * * Keine Cc: an mich, bin auf DIESER Liste wenn ich hier schreibe. * * Ne me mettez pas en Cc:, je suis sur CETTE liste, si j'ecris ici. * ********************************************************************* Hello, I have a very huge table of references from Websites (html, pics, ...) where the elements stored on a fileserver using sha384. Now the indextable hold all filenames and download dates but now I like to get a "snapshoot on a paticular day. How must I create the SELCT statement to get ALL files valid on a particular day? Note: There can be every time a new index.html for example but images change only once a month... So I need all elements valable on the paticular day which mean, I need to select that LAST version of the elements... Greetings Michelle Konzack -- Linux-User #280138 with the Linux Counter, http://counter.li.org/ ##################### Debian GNU/Linux Consultant ##################### Michelle Konzack Apt. 917 ICQ #328449886 50, rue de Soultz MSN LinuxMichi 0033/6/61925193 67100 Strasbourg/France IRC #Debian (irc.icq.com)
Вложения
On Aug 17, 2007, at 7:27 , Michelle Konzack wrote: > ********************************************************************* > * Do not Cc: me, because I am on THIS list, if I write here. * You might want to consider changing your mailing list subscription settings to "eliminatecc", e.g., send email to majordomo@postgresql.org (not the list address!) with body set pgsql-general eliminatecc This should prevent the mailing list from sending you a second copy. > I have a very huge table of references from Websites (html, pics, ...) > where the elements stored on a fileserver using sha384. > > Now the indextable hold all filenames and download dates but now I > like > to get a "snapshoot on a paticular day. > > How must I create the SELCT statement to get ALL files valid on a > particular day? > > Note: There can be every time a new index.html for example but images > change only once a month... > > So I need all elements valable on the paticular day which mean, > I need to select that LAST version of the elements... I think what you want is something like: SELECT DISTINCT ON (website_reference) website_reference, download_date, file_path FROM indextable WHERE download_date <= ? -- whatever date you're interested in ORDER BY website_reference, download_date DESC; This should return the most recent website_reference and its download_date that's earlier than the download_date specified in the WHERE clause. DISTINCT ON is a (very helpful) PostgreSQL extension. You can get similar results using a subquery; SELECT website_reference, download_date, file_path FROM indextable NATURAL JOIN ( SELECT website_reference, max(download_date) as download_date FROM indextable WHERE download_date <= ? GROUP BY website_reference ) most_recent_versions; This may return more than one row per website_reference if the website_reference has more than on file_path for a particular download_date. Does this help? If not, could you give a bit more of a concrete example? (Is is just me or have there been a lot of queries that can be solved using DISTINCT ON recently?) Michael Glaesemann grzm seespotcode net
Вложения
Am 2007-08-17 12:53:41, schrieb Michael Glaesemann: > > On Aug 17, 2007, at 7:27 , Michelle Konzack wrote: > > >********************************************************************* > >* Do not Cc: me, because I am on THIS list, if I write here. * > > You might want to consider changing your mailing list subscription > settings to "eliminatecc", e.g., send email to > majordomo@postgresql.org (not the list address!) with body > > set pgsql-general eliminatecc > > This should prevent the mailing list from sending you a second copy. Which mean, my "INBOX.ML_pgsql.general/" will never receive messages and break all threads where someone send me CC's... > I think what you want is something like: > > SELECT DISTINCT ON (website_reference) website_reference, > download_date, file_path > FROM indextable > WHERE download_date <= ? -- whatever date you're interested in > ORDER BY website_reference, download_date DESC; > > This should return the most recent website_reference and its > download_date that's earlier than the download_date specified in the > WHERE clause. > > DISTINCT ON is a (very helpful) PostgreSQL extension. You can get > similar results using a subquery; I have never used "DISTINCT ON" (it was not known to me) and was trying subqueries... :-/ > SELECT website_reference, download_date, file_path > FROM indextable > NATURAL JOIN ( > SELECT website_reference, max(download_date) as download_date > FROM indextable > WHERE download_date <= ? > GROUP BY website_reference > ) most_recent_versions; > > This may return more than one row per website_reference if the > website_reference has more than on file_path for a particular > download_date. > > Does this help? If not, could you give a bit more of a concrete example? I have an Enterprise which do researches :-) and I have a local cache of more then 150.000.000 URL's and its content (~8 TByte)... (I have hit over 2000 md5 collisons and now using sha384) Also I get per day nearly 100.000 new files... OK, HTML pages are downloaded and go into the first table like indextable FULL_URL, SHA384SUM and the second table content SERNUM (uniq), SHA384SUM (pri), LOCAL_PATH the saved file get as the name the SHA384 name If I open an HTML-URL with a specific date, it is parsed and the URL's inline are adapted to make it work from my database, e.g. http://www.postgresql.org/index.html will become http://webcache/show.php?date=123456789&url=http://www.postgresql.org/index.html inline elements and already downloaded other links will bekome http://webcache/show.php?date=123456789&url=<original_url> Thanks to the PostgreSQL developers that they have created "tablespace" and "table partitioning" since searching in 150.000.000 ROW's is the hell. > (Is is just me or have there been a lot of queries that can be solved > using DISTINCT ON recently?) I do not know... Since when does "DISTINCT ON" exist? Greetings Michelle Konzack Systemadministrator Tamay Dogan Network Debian GNU/Linux Consultant -- Linux-User #280138 with the Linux Counter, http://counter.li.org/ ##################### Debian GNU/Linux Consultant ##################### Michelle Konzack Apt. 917 ICQ #328449886 50, rue de Soultz MSN LinuxMichi 0033/6/61925193 67100 Strasbourg/France IRC #Debian (irc.icq.com)
Вложения
On 8/18/07, Michelle Konzack <linux4michelle@freenet.de> wrote: > Am 2007-08-17 12:53:41, schrieb Michael Glaesemann: > > (Is is just me or have there been a lot of queries that can be solved > > using DISTINCT ON recently?) > > I do not know... Since when does "DISTINCT ON" exist? I have been lurking on this thread, don't have much to add, but distinct on has been around in postgresql for as long as I can remember. It's not SQL spec, so it's not portable. but it is awefully useful.
Michelle Konzack wrote: > Am 2007-08-17 12:53:41, schrieb Michael Glaesemann: > > > > On Aug 17, 2007, at 7:27 , Michelle Konzack wrote: > > > > >********************************************************************* > > >* Do not Cc: me, because I am on THIS list, if I write here. * > > > > You might want to consider changing your mailing list subscription > > settings to "eliminatecc", e.g., send email to > > majordomo@postgresql.org (not the list address!) with body > > > > set pgsql-general eliminatecc > > > > This should prevent the mailing list from sending you a second copy. > > Which mean, my "INBOX.ML_pgsql.general/" will never receive > messages and break all threads where someone send me CC's... So have messages which have the list on CC go to that folder as well, :0: * ^TO_pgsql-general@postgresql.org INBOX.ML_pgsql.general/ Another choice you have is to tell mutt that you're subscribed to the list, adding to .muttrc lists pgsql-general@postgresql.org subscribe pgsql-general@postgresql.org then it will generate the Mail-Followup-To: header, omitting your address, so you will only get it via Majordomo. (It will still fail sometimes if older MUAs don't honor that field). I found it annoying so I disabled it by setting set followup_to=no Another thing I did was removing of duplicates with procmail, :0 Wh: msgid.lock | formail -D 65536 $HOME/.msgid.cache so I don't get two copies when I'm on the list and on CC. -- Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J Management by consensus: I have decided; you concede. (Leonard Liu)