>> I am trying to understand how to use the full-text search parser for
>> URLS and hostnames to filter results from a text field containing URLS
>> based on domain, and also how to index text columns for fast
>> lookup/matching based on domain.
>> >> I have a PostgreSQL database containing documents and links downloaded
>> by a web crawler, with the following tables:
>> >> pages
>> >> ----------
>> >> id: Integer (primary key)
>> >> url: String (unique)
>> >> title: String
>> >> text: String
>> >> html: String
>> >> last_visit: DateTime
>> >> word_pos: TSVECTOR
>> >> >> >> links
>> >> ----------
>> >> id Integer (primary key)
>> >> source: String
>> >> target: String >> >> link_text: String
>> >> UNIQUE(source,target)
>> >> >> >> crawls
>> >> ---------
>> >> id: Integer (primary key)
>> >> query: String
>> >> >> >> crawl_results
>> >> -------------
>> >> id: Integer (primary key)
>> >> score: Integer (constraint 0<=score<=1)
>> >> crawl_id: Integer (foreign key, crawls.id)
>> >> page_id: Integer (foreign key, pages.id)
>> >> >> The `source` and `target` fields in the `links` table contain URLs. I am
>> running the following query to extract scored links from the top-ranking
>> search results, for pages that haven't been fetched yet:
>> >> WITH top_results AS >> >> (SELECT page_id, score FROM crawl_results >> >> WHERE
crawl_id=$1>> >> ORDER BY score LIMIT 100)
>> >> SELECT top_results.score, l.target
>> >> FROM top_results >> >> JOIN pages p ON top_results.page_id=p.id
>> >> JOIN links l on p.url=l.source >> >> WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE
l.target=pp.url)
>> >> >> However, *I would like to filter these results so that only one row is
>> returned for a given domain (the one with the lowest score)*. So for
>> instance, if I get (0.3, 'http://www.foo.com/bar') and (0.8,
>> 'http://www.foo.com/zor'), I only want the first because it has same
>> domain `foo.com` and has the lower score.
>> >> I was able to find documentation for the builtin full text search
>> parsers <https://www.postgresql.org/docs/11/textsearch-parsers.html>,
>> which can parse URLS and extract the hostname. For instance, I can
>> extract the hostname from a URL as follows:
>>
Hi,
I have no real idea about solving the complete problem, and would probably try
something with a temp table first.
For extracting the hostname from a url you could use
select regex_replace('https?://(.*=)/.*', '\\1', url)
instead of the fulltext parser
Best regards
Wolfgang