Hi Simon,
No question is a stupid question.
Postgres can handle a great deal of data if it is properly sized and configured.
Additionally, in your case, I would index the reverse of the URL to make
indexed lookups faster.
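For example (only a sketch: the "urls" table, its columns, and the
connection string below are placeholders, not your schema), an
expression index on reverse(url) turns a suffix search, say every URL
ending in ".pdf", into a left-anchored prefix search that a btree can
serve:

import psycopg2

conn = psycopg2.connect("dbname=crawler")  # replace with your RDS DSN
cur = conn.cursor()

# One-time setup: a table keyed on the URL, plus an expression index on
# the reversed URL. text_pattern_ops lets the btree serve left-anchored
# LIKE patterns regardless of collation.
cur.execute("""
    CREATE TABLE IF NOT EXISTS urls (
        url        text PRIMARY KEY,
        crawled_at timestamptz
    )
""")
cur.execute("""
    CREATE INDEX IF NOT EXISTS urls_url_rev_idx
        ON urls (reverse(url) text_pattern_ops)
""")
conn.commit()

# A suffix match on the URL becomes a prefix match on its reverse, so
# the index can be used. The doubled %% is a literal % under psycopg2.
cur.execute(
    "SELECT url FROM urls WHERE reverse(url) LIKE reverse(%s) || '%%'",
    (".pdf",),
)
print(cur.fetchall())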
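For your "save it only if the URL does not already exist" step, and for
the write volume you are worried about: INSERT ... ON CONFLICT DO
NOTHING does the duplicate check in a single statement, and batching
many rows per round trip keeps up with the rates a crawler generates.
Again just a sketch, against the same made-up table as above:

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=crawler")  # replace with your RDS DSN
cur = conn.cursor()

# Links extracted from one crawled page (placeholder data).
extracted = ["https://example.com/a", "https://example.com/b"]

# ON CONFLICT (url) DO NOTHING silently skips URLs that already exist,
# so no separate SELECT is needed and the batch never fails on dupes.
# execute_values expands the single %s into all rows in one statement.
execute_values(
    cur,
    "INSERT INTO urls (url) VALUES %s ON CONFLICT (url) DO NOTHING",
    [(u,) for u in extracted],
)
conn.commit()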
regards,
fabio pardi
On 12/28/18 6:15 PM, Simon Connah wrote:
> First of all, apologies if this is a stupid question, but I've never used
> a database for something like this and I'm not sure if what I want is
> possible with PostgreSQL.
>
> I'm writing a simple web crawler in Python. The general task is to
> fetch a list of domain names and URL paths from the database, download
> the HTML associated with each URL, and save it in an object store.
>
> Then another process goes through the downloaded HTML, extracts all
> the links on the page, and saves them to the database (if the URL does
> not already exist in the database), so the next time the web crawler
> runs it picks up even more URLs to crawl. This process is exponential,
> so the number of URLs saved in the database will grow very quickly.
>
> I'm just a bit concerned that saving so much data to PostgreSQL quickly
> would cause performance issues.
>
> Is this something PostgreSQL could handle well? I'm not going to be
> running the PostgreSQL server myself. I'll be using Amazon RDS to host it.
>