Picking 25 samples of every domain

Поиск
Список
Период
Сортировка
От Gary Warner
Тема Picking 25 samples of every domain
Дата
Msg-id 5436FD55.2010806@askgar.com
обсуждение исходный текст
Ответы Re: Picking 25 samples of every domain  (Benoit Izac <benoit@izac.org>)
Список pgsql-novice
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I have a set of Postgres tables that are related to URLs found in
email.  The tables are BIG.  40-50 million records per day.  We are
using them for some research into tricks spammers use to confound
blacklists.  When we parse the URLs, we pull out the "domain" portion of
each URL and store it in a field called "top_domain".  The full URL is
available as "link".

Through various forms of randomization, customization, and wild-carding,
a domain may have as many as 1 million URLs per day.  I am needing a
query that would grab a sample number of URLs per domain (let's say 25
for conversation) . . . something that in pseudo-code might look like

for each top_domain in urltable do
    select top_domain, link limit 25;

Thoughts on the fastest way to do a query like that?

Real examples (to sort of show the point . . .)


spam_urls=# select top_domain, link from spam_info_2014_09_20 where
top_domain =  'pacedoctor[.]ru' limit 10;  (To prevent risk, I'm
changing "." to "[.]"


  top_domain   |               link
- ---------------+-----------------------------------
 pacedoctor[.]ru | http://kftdouhixn[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://ozukalaj[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://knlssrmp[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://aaztcobtwztx[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://kqyarrarn[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://zzj[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://piqtj[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://wahednc[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://fjhmin[.]pacedoctor[.]ru
 pacedoctor[.]ru | http://wevtwo[.]pacedoctor[.]ru
(10 rows)

 top_domain |
link
-
------------+-----------------------------------------------------------------------
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex8[.]html
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex7[.]html
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex6[.]html
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex5[.]html
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex4[.]html
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex3[.]html
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex2[.]html
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-837-334251002/receptionist/tindex1[.]html
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-838-106207395/vonmill/u[.]html
 i-cer[.]com  |
http://www[.]i-cer[.]com/6907-372-838-106207395/vonmill/rindex14[.]html
(10 rows)

  top_domain  |
link
-
--------------+---------------------------------------------------------------------
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/u[.]html
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/rindex5[.]html
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/tindex4[.]html
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/tindex3[.]html
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/tindex2[.]html
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-440072897/yur/tindex1[.]html
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-238360728/kthomas/u[.]html
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-238360728/kthomas/rindex5[.]html
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-238360728/kthomas/tindex4[.]html
 ezzesoft[.]com |
http://www[.]ezzesoft[.]com/6903-320-722-238360728/kthomas/tindex3[.]html
(10 rows)






- --

Thanks for any suggestions!

_-_
gar
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (MingW32)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlQ2/VUACgkQg79eYCOO6PtTiwCbBomLVmWST81FEI3eQX0g5HrI
C8EAnjoXM6gb4+sn4DROXd/3IvFRoBoV
=50PP
-----END PGP SIGNATURE-----



В списке pgsql-novice по дате отправления:

Предыдущее
От: Jude DaShiell
Дата:
Сообщение: Re: newbee, about a bulk loading from a cdv file
Следующее
От: Benoit Izac
Дата:
Сообщение: Re: Picking 25 samples of every domain