Re: stored proc and inserting hundreds of thousands of rows

Поиск
Список
Период
Сортировка
От Jeff Janes
Тема Re: stored proc and inserting hundreds of thousands of rows
Дата
Msg-id BANLkTikrXYJpJGzsUtpFgS8YiAO+d7y6PA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: stored proc and inserting hundreds of thousands of rows  (Joel Reymont <joelr1@gmail.com>)
Ответы Re: stored proc and inserting hundreds of thousands of rows
Список pgsql-performance
On Sat, Apr 30, 2011 at 2:15 PM, Joel Reymont <joelr1@gmail.com> wrote:
>
> On Apr 30, 2011, at 7:24 PM, Kevin Grittner wrote:
>
>> If this is where most of the time is, the next thing is to run it
>> with EXPLAIN ANALYZE, and post the output.
>
> I was absolutely wrong about the calculation taking < 1s, it actually takes about 30s for 2 million rows.
>
> Still, the difference between 5 minutes and 30s must be the insert.

But what exactly are you inserting?  The queries you reported below
are not the same as the ones you originally described.

In particular, they do not seem to use the "threshold" parameter that
the original ones did, whose job is presumably to cut the 2 million
down to a much smaller number that meet the threshold.  But how much
smaller is that number?  This will have a large effect on how long the
insert takes.


...

> Digging deeper into the distance function,
>
> EXPLAIN ANALYZE VERBOSE
> SELECT *
> FROM (SELECT id, divergence(<array above>, topics) AS distance FROM docs) AS tab
> WHERE tab.distance <= 50.0;
>
> Subquery Scan on tab  (cost=0.00..383333.00 rows=666653 width=12) (actual time=0.027..20429.299 rows=2000002 loops=1)
>  Output: tab.id, tab.distance
>  Filter: (tab.distance <= 50::double precision)
>  ->  Seq Scan on public.docs  (cost=0.00..358333.50 rows=1999960 width=36) (actual time=0.025..19908.200 rows=2000002
loops=1)
>        Output: docs.id, divergence((<array above>::double precision[])::topics, docs.topics)

It looks like "WHERE tab.distance <= 50.0;" is not accomplishing
anything.  Are you sure the parameter shouldn't be <=0.50 instead?

Also, you previously said you didn't mind of this process took a
couple minutes, as long as it didn't interfere with other things going
on in the database.  So you probably need to describe what those other
things going on in the database are.

Also, you might have a data correctness problem.  If the plan is to
scan new ads against all docs, and new docs against all ads; then if
new rows are added to each table during overlapping transaction, the
new ads against new docs comparison will not actually happen.  You
will probably need to add manual locking to get around this problem.

Cheers

Jeff

В списке pgsql-performance по дате отправления:

Предыдущее
От: Joel Reymont
Дата:
Сообщение: Re: stored proc and inserting hundreds of thousands of rows
Следующее
От: Joel Reymont
Дата:
Сообщение: Re: stored proc and inserting hundreds of thousands of rows