Re: Using PostgreSQL to store URLs for a web crawler

From: Fabio Pardi
Subject: Re: Using PostgreSQL to store URLs for a web crawler
Date:
Msg-id: 3f53e28c-c550-ada4-feb8-5c70daa10d8d@portavita.eu
In reply to: Using PostgreSQL to store URLs for a web crawler  (Simon Connah <scopensource@gmail.com>)
List: pgsql-novice
Hi Simon,

no question is a stupid question.

Postgres can handle a great deal of data if it is properly sized and configured.

Additionally, in your case, I would store and index the reverse of each
URL, to make indexed searches faster.
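
To give an idea of what I mean (table and column names below are only
illustrative, not something you have to use): a B-tree index on the
reversed URL lets searches anchored at the end of a URL be served by a
plain index scan, for example:

-- Illustrative schema; adjust names and columns to your crawler.
CREATE TABLE crawl_urls (
    id         bigserial PRIMARY KEY,
    url        text NOT NULL UNIQUE,
    fetched_at timestamptz
);

-- Expression index on the reversed URL. text_pattern_ops lets
-- LIKE 'prefix%' patterns use the index.
CREATE INDEX crawl_urls_rev_idx ON crawl_urls (reverse(url) text_pattern_ops);

-- A suffix search on the original URL becomes an index-friendly prefix
-- search on the reversed value, e.g. all URLs ending in ".pdf":
SELECT url
FROM   crawl_urls
WHERE  reverse(url) LIKE reverse('.pdf') || '%';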

regards,

fabio pardi

On 12/28/18 6:15 PM, Simon Connah wrote:
> First of all, apologies if this is a stupid question, but I've never
> used a database for something like this and I'm not sure if what I
> want is possible with PostgreSQL.
> 
> I'm writing a simple web crawler in Python. The general task is to
> first get a list of domain names and URL paths from the database,
> download the HTML associated with each URL, and save it in an object
> store.
> 
> Then another process goes through the downloaded HTML, extracts all
> the links on the page, and saves them to the database (if the URL does
> not already exist in the database), so the next time the web crawler
> runs it picks up even more URLs to crawl. This process is exponential,
> so the number of URLs saved in the database will grow very quickly.
> 
> I'm just a bit concerned that saving so much data to PostgreSQL quickly
> would cause performance issues.
> 
> Is this something PostgreSQL could handle well? I'm not going to be
> running the PostgreSQL server myself; I'll be using Amazon RDS to host it.
> 
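
For the "save it only if the URL does not already exist" step described
above, INSERT ... ON CONFLICT DO NOTHING keeps the writer simple and
avoids a separate existence check. A sketch against the same
illustrative table as before:

-- Insert newly extracted links, silently skipping URLs already stored.
INSERT INTO crawl_urls (url)
VALUES ('https://example.com/some/page')
ON CONFLICT (url) DO NOTHING;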

