Discussion: Using PostgreSQL to store URLs for a web crawler

Using PostgreSQL to store URLs for a web crawler

From:
Simon Connah
Date:
First of all, apologies if this is a stupid question, but I've never used 
a database for something like this and I'm not sure whether what I want is 
possible with PostgreSQL.

I'm writing a simple web crawler in Python. Its general task is to first 
get a list of domain names and URL paths from the database, download the 
HTML associated with each URL, and save it in an object store.

Then another process goes through the downloaded HTML, extracts all the 
links on each page, and saves them to the database (if the URL does not 
already exist there), so the next time the web crawler runs it picks up 
even more URLs to crawl. This growth is exponential, so the number of 
URLs saved in the database will increase very quickly.

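A deduplicating insert like the one described above maps naturally onto 
PostgreSQL's ON CONFLICT clause, which lets a unique constraint do the 
"does it already exist" check. The sketch below uses psycopg2 and batches 
the extracted links into a single statement; the urls table, its columns, 
and the connection string are assumptions for illustration, not details 
from this thread.

# Minimal sketch of a deduplicating, batched insert, assuming a table like:
#   CREATE TABLE urls (url text PRIMARY KEY, crawled boolean DEFAULT false);
# The table name, columns, and DSN are placeholders.
import psycopg2
from psycopg2.extras import execute_values

def save_new_urls(conn, urls):
    """Insert extracted URLs, silently skipping any that already exist."""
    with conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO urls (url) VALUES %s ON CONFLICT (url) DO NOTHING",
            [(u,) for u in urls],
        )
    conn.commit()

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=crawler")  # placeholder DSN
    save_new_urls(conn, ["https://example.com/", "https://example.com/about"])

Batching many URLs per statement also keeps the write rate manageable as 
the crawl grows.
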
I'm just a bit concerned that writing so much data to PostgreSQL so 
quickly would cause performance issues.

Is this something PostgreSQL could handle well? I'm not going to be 
running the PostgreSQL server myself; I'll be using Amazon RDS to host it.


Re: Using PostgreSQL to store URLs for a web crawler

From:
Fabio Pardi
Date:
Hi Simon,

No question is a stupid question.

Postgres can handle a great deal of data if properly sized and configured.

Additionally, in your case, I would index the reverse of the URL to make 
indexed searches faster.

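One common reading of that advice is to store the hostname with its labels 
reversed (www.example.com becomes com.example.www), so all URLs under a 
domain sort together and prefix lookups can use an ordinary B-tree index. 
A rough sketch of that transformation in Python, with illustrative names 
that are not from this thread:

# Reverse the dot-separated host labels so URLs from the same domain
# cluster together in an indexed column. Names here are hypothetical.
from urllib.parse import urlsplit

def reversed_host_key(url: str) -> str:
    """Return the URL's host with its labels reversed, plus the path.

    'https://www.example.com/about' -> 'com.example.www/about'
    """
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    return reversed_host + parts.path

# Storing this key in an indexed text column means a query such as
#   WHERE url_key LIKE 'com.example.%'
# (backed by a text_pattern_ops index) can find every URL under
# example.com with an index range scan instead of a sequential scan.
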
regards,

fabio pardi
