Discussion: stored proc and inserting hundreds of thousands of rows

stored proc and inserting hundreds of thousands of rows

From: Joel Reymont
Date:
I have a stored proc (below) that inserts hundreds of thousands, potentially millions, of rows.

This stored proc is part of the sequence of creating an ad campaign and links an ad to the documents it should be
displayed with.

A few of these stored procs can run concurrently as users create ad campaigns.

We have 2 million documents now and linking an ad to all of them takes 5 minutes on my top-of-the-line SSD MacBook Pro.

Last but not least, the system has to quickly serve ads while documents are being linked, which is a problem at the
moment.

What can I do to make linking documents to ads faster, or at least have it impact the system less? I would like the
system to remain as responsive when serving ads while the linking itself is allowed to take a few minutes.

One thing I'm concerned with, for example, is the whole multi-million row insert running within the stored proc
transaction. I think inserting rows one by one or in small batches may be an improvement. I don't know how to
accomplish this, though.
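
One hypothetical shape for such batching is sketched below. A plpgsql function runs entirely inside the caller's
transaction and cannot commit partway through, so the chunking has to be driven from outside the function, with each
chunk committed separately. The staging table, placeholder ids and 50,000-row chunk size are all made up; the matching
SELECT is paid once, and only the writes into the shared doc_ads table are split into short transactions.

-- Compute the matches for one document into a private staging table.
-- 123 and the 50.0 threshold are placeholders.
CREATE TEMP TABLE doc_ads_stage AS
SELECT 123 AS doc_id, (t).ad_id, (t).distance
FROM (SELECT ads_within_distance(topics, 50.0) AS t
      FROM docs
      WHERE id = 123) AS x;

-- Then, from the client, copy it into doc_ads in chunks,
-- each chunk in its own short transaction:
BEGIN;
INSERT INTO doc_ads (doc_id, ad_id, distance)
SELECT doc_id, ad_id, distance
FROM doc_ads_stage
WHERE ad_id BETWEEN 1 AND 50000;
COMMIT;
-- ...repeat for 50001-100000, and so on.

Whether this actually helps depends on whether the pain comes from the size of the single transaction or simply from
the total write volume.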

    Thanks, Joel

---

CREATE DOMAIN doc_id AS varchar(64);
CREATE DOMAIN id AS int;

CREATE TABLE doc_ads
(
  doc_id    id NOT NULL REFERENCES docs,
  ad_id     id NOT NULL REFERENCES ads,
  distance  float NOT NULL
);

CREATE INDEX doc_ads_idx ON doc_ads(doc_id);

CREATE OR REPLACE FUNCTION link_doc_to_ads(doc id, threshold float)
RETURNS void AS $$
BEGIN
  INSERT INTO doc_ads (doc_id, ad_id, distance)
  SELECT doc, (t).ad_id, (t).distance
  FROM (SELECT ads_within_distance(topics, threshold) AS t
       FROM docs
       WHERE id = doc) AS x;
  ANALYZE doc_ads;
END;
$$ LANGUAGE plpgsql;

--------------------------------------------------------------------------
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
---------------------+------------+---------------------------------------
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
---------------------+------------+---------------------------------------




Re: stored proc and inserting hundreds of thousands of rows

From: "Kevin Grittner"
Date:
Joel Reymont <joelr1@gmail.com> wrote:

> We have 2 million documents now and linking an ad to all of them
> takes 5 minutes on my top-of-the-line SSD MacBook Pro.

How long does it take to run just the SELECT part of the INSERT by
itself?

-Kevin

Re: stored proc and inserting hundreds of thousands of rows

From: Joel Reymont
Date:
Calculating distance involves giving an array of 150 float8 to a pgsql
function, then calling a C function 2 million times (at the moment),
giving it two arrays of 150 float8.

Just calculating distance for 2 million rows and extracting the
distance takes less than a second. I think that includes sorting by
distance and sending 100 rows to the client.

Are you suggesting eliminating the physical linking and calculating
matching documents on the fly?

Is there a way to speed up my C function by giving it all the float
arrays, calling it once and having it return a set of matches? Would
this be faster than calling it from a select, once for each array?

Sent from my comfortable recliner

On 30/04/2011, at 18:28, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:

> Joel Reymont <joelr1@gmail.com> wrote:
>
>> We have 2 million documents now and linking an ad to all of them
>> takes 5 minutes on my top-of-the-line SSD MacBook Pro.
>
> How long does it take to run just the SELECT part of the INSERT by
> itself?
>
> -Kevin

Re: stored proc and inserting hundreds of thousands of rows

From: "Pierre C"
Date:
If you want to search by geographical coordinates, you could use a gist
index which can optimize that sort of thing (like retrieving all rows
which fit in a box).
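
As a purely generic illustration of the kind of query a GiST index accelerates (a hypothetical table using the
built-in geometric types, not the 150-dimensional topic vectors from this thread, which would need a custom operator
class):

CREATE TABLE regions (id int, area box);
CREATE INDEX regions_area_gist ON regions USING gist (area);

-- All rows whose box overlaps the search box, answered from the index
-- instead of a sequential scan:
SELECT id
FROM regions
WHERE area && box '((0,0),(10,10))';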

Re: stored proc and inserting hundreds of thousands of rows

From: Joel Reymont
Date:
I'm calculating distance between probability vectors, e.g. topics that
a document belongs to and the topics of an ad.

The distance function is already a C function. Topics are float8[150].

Distance is calculated against all documents in the database so it's
a table scan.

Sent from my comfortable recliner

On 30/04/2011, at 19:04, Pierre C <lists@peufeu.com> wrote:

>
> If you want to search by geographical coordinates, you could use a gist
> index which can optimize that sort of thing (like retrieving all rows
> which fit in a box).

Re: stored proc and inserting hundreds of thousands of rows

From: "Kevin Grittner"
Date:
[rearranging to correct for top-posting]

Joel Reymont <joelr1@gmail.com> wrote:
> Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
>> Joel Reymont <joelr1@gmail.com> wrote:
>>
>>> We have 2 million documents now and linking an ad to all of them
>>> takes 5 minutes on my top-of-the-line SSD MacBook Pro.
>>
>> How long does it take to run just the SELECT part of the INSERT
>> by itself?

> Are you suggesting eliminating the physical linking and
> calculating matching documents on the fly?

I'm not suggesting anything other than it being a good idea to
determine where the time is being spent before trying to make it
faster.  You showed this as the apparent source of the five minute
delay:

  INSERT INTO doc_ads (doc_id, ad_id, distance)
  SELECT doc, (t).ad_id, (t).distance
  FROM (SELECT ads_within_distance(topics, threshold) AS t
       FROM docs
       WHERE id = doc) AS x;

What we don't know is how much of that time is due to writing to the
doc_ads table, and how much is due to reading the other tables.  We
can find that out by running this:

  SELECT doc, (t).ad_id, (t).distance
  FROM (SELECT ads_within_distance(topics, threshold) AS t
       FROM docs
       WHERE id = doc) AS x;

If this is where most of the time is, the next thing is to run it
with EXPLAIN ANALYZE, and post the output.  It's a whole different
set of things to try to tune if that part is fast and the INSERT
itself is slow.

Of course, be aware of caching effects when you time this.

-Kevin

Re: stored proc and inserting hundreds of thousands of rows

From: "Kevin Grittner"
Date:
Joel Reymont <joelr1@gmail.com> wrote:

> I'm calculating distance between probability vectors, e.g. topics
> that a document belongs to and the topics of an ad.
>
> The distance function is already a C function. Topics are
> float8[150].
>
> Distance is calculated against all documents in the database

There's probably a way to index that so that you don't need to do a
full calculation against all documents in the database each time.
It may even be amenable to knnGiST indexing (a new feature coming in
9.1), which would let you do your select with an ORDER BY on the
distance.
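
For illustration only, this is roughly what that looks like with the built-in point type (hypothetical table and
data; applying it to 150-dimensional probability vectors would need a custom GiST operator class, which is beyond
this sketch):

CREATE TABLE spots (id int, location point);
CREATE INDEX spots_location_gist ON spots USING gist (location);

-- 9.1 KNN-GiST: the ten nearest rows come back in distance order
-- directly from the index, without scanning everything.
SELECT id
FROM spots
ORDER BY location <-> point '(3.5, 2.1)'
LIMIT 10;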

PostgreSQL has a lot of very cool features you just don't have in
any other product!  :-)

-Kevin

Re: stored proc and inserting hundreds of thousands of rows

From: Joel Reymont
Date:
On Apr 30, 2011, at 7:24 PM, Kevin Grittner wrote:

> If this is where most of the time is, the next thing is to run it
> with EXPLAIN ANALYZE, and post the output.

I was absolutely wrong about the calculation taking < 1s, it actually takes about 30s for 2 million rows.

Still, the difference between 5 minutes and 30s must be the insert.

SELECT (t).doc_id, (t).distance
 FROM (SELECT docs_within_distance('{
   0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,
   0.00166666666667, 0.00166666666667, 0.00166666666667, 0.586099770475, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,
   0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,
   0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,
   0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,
   0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,
   0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,
   0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,
   0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,
   0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,
   0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,
   0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,
   0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,
   0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.167233562858, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667,
   0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667, 0.00166666666667 }', 50.0) as t) AS x;

This takes 27.44 seconds

EXPLAIN ANALYZE VERBOSE

Subquery Scan on x  (cost=0.00..0.27 rows=1 width=32) (actual time=22422.418..23835.468 rows=2000002 loops=1)
  Output: (t).doc_id, (t).distance
  ->  Result  (cost=0.00..0.26 rows=1 width=0) (actual time=22422.410..23184.086 rows=2000002 loops=1)
        Output: docs_within_distance(('{<array above goes here>}'::double precision[])::topics, 50::double precision)
Total runtime: 23948.563 ms

Topics is defined thusly: CREATE DOMAIN topics AS float[150];

Digging deeper into the distance function,

EXPLAIN ANALYZE VERBOSE
SELECT *
FROM (SELECT id, divergence(<array above>, topics) AS distance FROM docs) AS tab
WHERE tab.distance <= 50.0;

Subquery Scan on tab  (cost=0.00..383333.00 rows=666653 width=12) (actual time=0.027..20429.299 rows=2000002 loops=1)
  Output: tab.id, tab.distance
  Filter: (tab.distance <= 50::double precision)
  ->  Seq Scan on public.docs  (cost=0.00..358333.50 rows=1999960 width=36) (actual time=0.025..19908.200 rows=2000002
loops=1)
        Output: docs.id, divergence((<array above>::double precision[])::topics, docs.topics)
Total runtime: 20550.019 ms

I can't dig any deeper because divergence is a C function.

    Thanks, Joel

--------------------------------------------------------------------------
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
---------------------+------------+---------------------------------------
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
---------------------+------------+---------------------------------------




Re: stored proc and inserting hundreds of thousands of rows

From: Joel Reymont
Date:
On Apr 30, 2011, at 7:36 PM, Kevin Grittner wrote:

> It may even be amenable to knnGiST indexing (a new feature coming in
> 9.1), which would let you do your select with an ORDER BY on the
> distance.

I don't think I can wait for 9.1; I need to go live in a month, with PostgreSQL or without.

> PostgreSQL has a lot of very cool features you just don't have in any other product!  :-)


There's a strong NoSQL lobby here and I'm trying my best to show that PG can handle the job!

--------------------------------------------------------------------------
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
---------------------+------------+---------------------------------------
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
---------------------+------------+---------------------------------------




Re: stored proc and inserting hundreds of thousands of rows

From: Jeff Janes
Date:
On Sat, Apr 30, 2011 at 2:15 PM, Joel Reymont <joelr1@gmail.com> wrote:
>
> On Apr 30, 2011, at 7:24 PM, Kevin Grittner wrote:
>
>> If this is where most of the time is, the next thing is to run it
>> with EXPLAIN ANALYZE, and post the output.
>
> I was absolutely wrong about the calculation taking < 1s, it actually takes about 30s for 2 million rows.
>
> Still, the difference between 5 minutes and 30s must be the insert.

But what exactly are you inserting?  The queries you reported below
are not the same as the ones you originally described.

In particular, they do not seem to use the "threshold" parameter that
the original ones did, whose job is presumably to cut the 2 million
down to a much smaller number that meet the threshold.  But how much
smaller is that number?  This will have a large effect on how long the
insert takes.


...

> Digging deeper into the distance function,
>
> EXPLAIN ANALYZE VERBOSE
> SELECT *
> FROM (SELECT id, divergence(<array above>, topics) AS distance FROM docs) AS tab
> WHERE tab.distance <= 50.0;
>
> Subquery Scan on tab  (cost=0.00..383333.00 rows=666653 width=12) (actual time=0.027..20429.299 rows=2000002 loops=1)
>  Output: tab.id, tab.distance
>  Filter: (tab.distance <= 50::double precision)
>  ->  Seq Scan on public.docs  (cost=0.00..358333.50 rows=1999960 width=36) (actual time=0.025..19908.200 rows=2000002
loops=1)
>        Output: docs.id, divergence((<array above>::double precision[])::topics, docs.topics)

It looks like "WHERE tab.distance <= 50.0;" is not accomplishing
anything.  Are you sure the parameter shouldn't be <=0.50 instead?

Also, you previously said you didn't mind if this process took a
couple minutes, as long as it didn't interfere with other things going
on in the database.  So you probably need to describe what those other
things going on in the database are.

Also, you might have a data correctness problem.  If the plan is to
scan new ads against all docs, and new docs against all ads, then if
new rows are added to each table during overlapping transactions, the
comparison of new ads against new docs will not actually happen.  You
will probably need to add manual locking to get around this problem.
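
One possible shape for that manual locking, purely as an illustration (the advisory-lock key is arbitrary, and whether
this fits the real workflow depends on where docs and ads are inserted): have both linking jobs take the same advisory
lock, so concurrent linking transactions serialize and each one's statements see the other's committed rows.

-- Both the doc-side and the ad-side linking jobs take the same lock.
BEGIN;
SELECT pg_advisory_lock(42);            -- arbitrary application-chosen key
-- SELECT link_doc_to_ads(123, 50.0);   -- or the ad-side equivalent
SELECT pg_advisory_unlock(42);
COMMIT;

A session-level advisory lock is not released by ROLLBACK, so the application has to make sure the unlock always runs.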

Cheers

Jeff

Re: stored proc and inserting hundreds of thousands of rows

From: Joel Reymont
Date:
On Apr 30, 2011, at 11:11 PM, Jeff Janes wrote:

> But what exactly are you inserting?  The queries you reported below
> are not the same as the ones you originally described.

I posted the wrong query initially. The only difference is in the table that holds the probability array.

I'm inserting document id and ad id pairs to show that this ad is not linked to this document. The mapping table has a
primary key on the serial document id.

> In particular, they do not seem to use the "threshold" parameter that
> the original ones did, whose job is presumably to cut the 2 million
> down to a much smaller number that meet the threshold.  But how much
> smaller is that number?

The 5 minutes is with a threshold large enough to be irrelevant. I would like to optimize the process before I apply
the threshold to cut down the number of rows.

> It looks like "WHERE tab.distance <= 50.0;" is not accomplishing
> anything.  Are you sure the parameter shouldn't be <=0.50 instead?

No, ignore the threshold for now.

> Also, you previously said you didn't mind of this process took a
> couple minutes, as long as it didn't interfere with other things going
> on in the database.  So you probably need to describe what those other
> things going on in the database are.

Those other things are ad serving which boils down to a lookup of ad ids linked to the document.

This is a lookup from the mapping table using the primary key that goes on at the same time as a large number of
<doc,ad> mappings are being inserted into the same table.
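
For reference, that serving-side lookup presumably boils down to something like the following (column names are taken
from the doc_ads definition earlier in the thread; the literal id and the ORDER BY/LIMIT are assumptions):

-- Ad serving: fetch the ads linked to one document, best matches first.
SELECT ad_id, distance
FROM doc_ads
WHERE doc_id = 123          -- placeholder document id
ORDER BY distance
LIMIT 100;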

Documents are uploaded into the system at a rate of 10k per day, once every couple of seconds. I wish I could get rid
of storing the <doc,ad> mapping as that table is gonna grow absolutely huge when each new ad matches tens or hundreds
of thousands of documents.

I don't think I can do the matching when serving an ad, though, as I will still need to scan millions of probability
vectors (one per doc) to calculate the distance between the current document and existing ads.

Then again, the number of ads in the system will always be a fraction of the number of documents so, perhaps, the
matching of documents to ads can be done at runtime.

> Also, you might have a data correctness problem.  If the plan is to
> scan new ads against all docs, and new docs against all ads;

That's basically it.

As new ads are entered, they need to be matched with existing documents.

As new documents are entered, they need to be matched with existing ads.

Both ads and docs are represented by probability vectors of 150 floats so it's the same distance calculation.

> then if new rows are added to each table during overlapping transaction, the
> new ads against new docs comparison will not actually happen.  You
> will probably need to add manual locking to get around this problem.

I'll ponder this, thanks for pointing it out!


--------------------------------------------------------------------------
- for hire: mac osx device driver ninja, kernel extensions and usb drivers
---------------------+------------+---------------------------------------
http://wagerlabs.com | @wagerlabs | http://www.linkedin.com/in/joelreymont
---------------------+------------+---------------------------------------




Re: stored proc and inserting hundreds of thousands of rows

From: Jeff Janes
Date:
On Sat, Apr 30, 2011 at 3:29 PM, Joel Reymont <joelr1@gmail.com> wrote:
>
> On Apr 30, 2011, at 11:11 PM, Jeff Janes wrote:
>
>> But what exactly are you inserting?  The queries you reported below
>> are not the same as the ones you originally described.
>
> I posted the wrong query initially. The only difference is in the table that holds the probability array.
>
> I'm inserting document id and ad id pairs to show that this ad is not linked to this document. The mapping table has
> a primary key on the serial document id.

Having the (doc_id, ad_id) pair be missing from the table is a far
more efficient way to show that the ad is not linked to the document
(i.e. that it is below the threshold).  Provided that you are careful
that there are no other reasons that the pair could be missing--but if
you are not careful about that, then I don't see how storing the full
matrix will save you anyway.
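
A sketch of what that looks like in practice: store only the pairs that meet the threshold and treat absence as "not
linked" (placeholder ids; the lookup can be served by the existing index on doc_id):

-- Linkage becomes a simple existence test on the pair.
SELECT EXISTS (
    SELECT 1
    FROM doc_ads
    WHERE doc_id = 123      -- placeholder document id
      AND ad_id  = 456      -- placeholder ad id
) AS is_linked;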

>
>> In particular, they do not seem to use the "threshold" parameter that
>> the original ones did, whose job is presumably to cut the 2 million
>> down to a much smaller number that meet the threshold.  But how much
>> smaller is that number?
>
> The 5 minutes is with a threshold large enough to be irrelevant. I would like to optimize the process before I apply
> the threshold to cut down the number of rows.
>
>> It looks like "WHERE tab.distance <= 50.0;" is not accomplishing
>> anything.  Are you sure the parameter shouldn't be <=0.50 instead?
>
> No, ignore the threshold for now.

OK, but it seems to me that you are starting out by ruling out the one
optimization that is most likely to work.

>> Also, you previously said you didn't mind if this process took a
>> couple minutes, as long as it didn't interfere with other things going
>> on in the database.  So you probably need to describe what those other
>> things going on in the database are.
>
> Those other things are ad serving which boils down to a lookup of ad ids linked to the document.
>
> This is a lookup from the mapping table using the primary key that goes on at the same time as a large number of
> <doc,ad> mappings are being inserted into the same table.

What numbers do you get for lookups per second when inserts are also
going on, versus when they are not going on?

The way I would approach this is by making two independent tasks: one
that inserts records at your anticipated rate, "insert into foo select
generate_series from generate_series(1,100000);" in a loop, and
another that generates select load against a separate table (like
pgbench -S), and see how the two interact with each other by competing
for CPU and IO.

You could throttle the insert process by adding pg_sleep(<some
fraction of a second>) as a column in one of your selects, so it
pauses at every row.  But due to the granularity of pg_sleep, you might
have to put it in a CASE expression so it is invoked on only a random
subset of the rows rather than each row.  But once throttled, will
it be able to keep up with the flow of new docs and ads?
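
If the per-row CASE trick proves fiddly, a cruder variant of the same throttling idea is a chunked loop that sleeps
between chunks. A minimal sketch against the synthetic foo table above (chunk size and sleep duration are arbitrary;
note that a DO block still runs as a single transaction, so this spreads the I/O out rather than shortening the
transaction):

-- Insert 100,000 rows in 1000-row chunks with a short pause between chunks.
DO $$
DECLARE
  i int;
BEGIN
  FOR i IN 0..99 LOOP
    INSERT INTO foo
    SELECT g FROM generate_series(i * 1000 + 1, (i + 1) * 1000) AS g;
    PERFORM pg_sleep(0.05);
  END LOOP;
END;
$$;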


>
> I don't think I can do the matching when serving an ad, though, as I will still need to scan millions of probability
> vectors (one per doc) to calculate the distance between the current document and existing ads.

gist indices are designed to make this type of thing fast, by using
techniques to rule out most of those comparisons without actually
performing them.  I don't know enough about the
guts of either your distance function or the gist indexes to know if
you can do it this way, but if you can it would certainly be the way
to go.

Cheers,

Jeff

Re: stored proc and inserting hundreds of thousands of rows

From: Samuel Gendler
Date:


On Sat, Apr 30, 2011 at 5:12 PM, Jeff Janes <jeff.janes@gmail.com> wrote:


gist indices are designed to make this type of thing fast, by using
techniques to rule out most of those comparisons without actually
performing them.  I don't know enough about the
guts of either your distance function or the gist indexes to know if
you can do it this way, but if you can it would certainly be the way
to go.

It is definitely a good idea to consider a gist index for eliminating most of a large dataset, if applicable.  Do a little reading on the topic and, hopefully, its applicability (or not) will become apparent.

However, as someone who has built a number of ad servers over the years, for several of the larger ad networks, the first thing I'd do is separate your ad serving from real-time interaction with your database, no matter what the underlying technology.  If your ad serving is dependent on your database, it means that hiccups in the database will have an impact on ad serving, which is rarely tolerable.  And god forbid you should need to take the db down for a period of maintenance.  The reliability and performance required of most ad servers is often greater than what should reasonably be expected of a relational database, particularly if there are other processes accessing the database, as is the case with your system.  The first rule of ad serving is that no outage of backend systems should ever be able to prevent or impact front end ad serving.

Some kind of in-memory cache of doc/ad mappings which the ad server interacts with will serve you in good stead and will be much easier to scale horizontally than most relational db architectures lend themselves to.  If you have an ever increasing set of documents and ads, you'll inevitably wind up 'sharding' your dataset across multiple db hosts in order to maintain performance - which creates a number of maintenance complexities.  Easier to maintain a single database and do analytics over a single data source, but insulate it from the real-time performance requirements of your ad serving.  Even something as simple as a process that pushes the most recent doc/ad mappings into a memcache instance could be sufficient - and you can scale your memcache across as many hosts as is necessary to deliver the lookup latencies that you require no matter how large the dataset.

Similarly, if you are updating the database from the ad server with each ad served in order to record an impression or click, you'll be far better off logging those and then processing the logs in bulk on a periodic basis.  If subsequent impressions are dependent upon what has already been served historically, then use your memcache instance (or whatever structure you eventually choose to utilize) to handle those lookups.  This gives you the convenience and flexibility of a relational system with SQL for access, but without the constraints of the capabilities of a single host limiting real-time performance of the system as a whole.


Re: stored proc and inserting hundreds of thousands of rows

From: Greg Smith
Date:
On 04/30/2011 09:00 PM, Samuel Gendler wrote:
> Some kind of in-memory cache of doc/ad mappings which the ad server
> interacts with will serve you in good stead and will be much easier to
> scale horizontally than most relational db architectures lend
> themselves to...Even something as simple as a process that pushes the
> most recent doc/ad mappings into a memcache instance could be
> sufficient - and you can scale your memcache across as many hosts as
> is necessary to deliver the lookup latencies that you require no
> matter how large the dataset.

Many of the things I see people switching over to NoSQL key/value store
solutions would be served equally well on the performance side by a
memcache layer between the application and the database.  If you can map
the problem into key/value pairs for NoSQL, you can almost certainly do
that using a layer above PostgreSQL instead.

The main downside of that, what people seem to object to, is that it
makes for two pieces of software that need to be maintained; the NoSQL
solutions can do it with just one.  If you have more complicated queries
to run, too, the benefit to using a more complicated database should
outweigh that extra complexity though.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books