Re: How to use full-text search URL parser to filter query results by domain name?

From hamann.w@t-online.de
Subject Re: How to use full-text search URL parser to filter query results by domain name?
Date
Msg-id wolfgang-1190407084206.A0129904@laptop.local
In reply to How to use full-text search URL parser to filter query results by domain name?  (Jess Wren <jess.wren@interference.cc>)
Responses Re: How to use full-text search URL parser to filter query results by domain name?  (Jess Wren <jess.wren@interference.cc>)
List pgsql-general
>> I am trying to understand how to use the full-text search parser for
>> URLS and hostnames to filter results from a text field containing URLS
>> based on domain, and also how to index text columns for fast
>> lookup/matching based on domain.
>>
>> I have a PostgreSQL database containing documents and links downloaded
>> by a web crawler, with the following tables:
>>
>>         pages
>>         ----------
>>         id:          Integer (primary key)
>>         url:         String  (unique)
>>         title:       String
>>         text:        String
>>         html:        String
>>         last_visit:  DateTime
>>         word_pos:    TSVECTOR
>>
>>         links
>>         ----------
>>         id:        Integer (primary key)
>>         source:    String
>>         target:    String
>>         link_text: String
>>         UNIQUE(source,target)
>>
>>         crawls
>>         ---------
>>         id:         Integer (primary key)
>>         query:      String
>>
>>         crawl_results
>>         -------------
>>         id:       Integer (primary key)
>>         score:    Integer (constraint 0<=score<=1)
>>         crawl_id: Integer (foreign key, crawls.id)
>>         page_id:  Integer (foreign key, pages.id)
>>
>> The `source` and `target` fields in the `links` table contain URLs. I am
>> running the following query to extract scored links from the top-ranking
>> search results, for pages that haven't been fetched yet:
>>
>>         WITH top_results AS
>>             (SELECT page_id, score FROM crawl_results
>>             WHERE crawl_id=$1
>>             ORDER BY score LIMIT 100)
>>         SELECT top_results.score, l.target
>>         FROM top_results
>>             JOIN pages p ON top_results.page_id=p.id
>>             JOIN links l ON p.url=l.source
>>         WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target=pp.url)
>>
>> However, *I would like to filter these results so that only one row is
>> returned for a given domain (the one with the lowest score)*. So for
>> instance, if I get (0.3, 'http://www.foo.com/bar') and (0.8,
>> 'http://www.foo.com/zor'), I only want the first because it has the same
>> domain `foo.com` and has the lower score.
>>
>> I was able to find documentation for the builtin full text search
>> parsers <https://www.postgresql.org/docs/11/textsearch-parsers.html>,
>> which can parse URLS and extract the hostname. For instance, I can
>> extract the hostname from a URL as follows:
>>
Hi,

I have no real idea about solving the complete problem, and would probably try
something with a temp table first.
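
Something along these lines, perhaps (an untested sketch, just reusing the table
and column names from your query; the hostname is pulled out with a regexp as
described below, and DISTINCT ON keeps one row per domain):

        -- collect candidate links together with their extracted domain
        CREATE TEMP TABLE candidate_links AS
            WITH top_results AS
                (SELECT page_id, score FROM crawl_results
                 WHERE crawl_id=$1
                 ORDER BY score LIMIT 100)
            SELECT regexp_replace(l.target, '^https?://([^/]+).*$', '\1') AS domain,
                   top_results.score, l.target
            FROM top_results
                JOIN pages p ON top_results.page_id=p.id
                JOIN links l ON p.url=l.source
            WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target=pp.url);

        -- keep only the lowest-scored link per domain
        SELECT DISTINCT ON (domain) domain, score, target
        FROM candidate_links
        ORDER BY domain, score;
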
For extracting the hostname from a URL you could use

select regexp_replace(url, '^https?://([^/]+).*$', '\1')

instead of the full-text parser.
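
For example:

        select regexp_replace('http://www.foo.com/bar', '^https?://([^/]+).*$', '\1');
        -- www.foo.com

If you would rather stay with the full-text machinery, the default parser's
'host' token should give the same thing (again an untested sketch):

        select p.token
        from ts_parse('default', 'http://www.foo.com/bar') p
             join ts_token_type('default') t on p.tokid = t.tokid
        where t.alias = 'host';
        -- www.foo.com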

Best regards
Wolfgang






