Re: Joins and DELETE FROM

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: Joins and DELETE FROM
Дата
Msg-id 47D2D485.9040103@enterprisedb.com
обсуждение исходный текст
Ответ на Joins and DELETE FROM  ("Kynn Jones" <kynnjo@gmail.com>)
Ответы Re: Joins and DELETE FROM  ("Kynn Jones" <kynnjo@gmail.com>)
Список pgsql-performance
Kynn Jones wrote:
> Hi!
>
> As part of a data warehousing project, I need to pre-process data downloaded
> from an external source, in the form of several large flat files.  This
> preprocessing entails checking the validity of various data items, and
> discarding those that fail to pass all the checks.
>
> Currently, the code that performs the checks generates intermediate
> temporary tables of those bits of data that are invalid in some way.  (This
> simplifies the process of generating various quality-control reports about
> the incoming data).
>
> The next step is to weed out the bad data from the main tables, and here's
> where I begin to get lost.
>
> To be concrete, suppose I have a table T consisting of 20 million rows,
> keyed on some column K.  (There are no formal constrains on T at the moment,
> but one could define column K as T's primary key.)  Suppose also that I have
> a second table B (for "bad") consisting of 20 thousand rows, and also keyed
> on some column K.  For each value of B.K there is exactly one row in T such
> that T.K = B.K, and the task is to delete all these rows from T as
> efficiently as possible.
>
> My naive approach would something like
>
> DELETE FROM T WHERE T.K IN ( SELECT K FROM B );
>
> ...which, according to EXPLAIN, is a terrible idea, because it involves
> sequentially scanning all 20 million rows of T just to delete about only
> 0.1% of them.
>
> It seems to me better to sequentially scan B and rely on an index on T to
> zero-in the few rows in T that must be deleted.
>
> Is this strategy something that can be done with plain SQL (even if to do
> this I must produce additional helper tables, indices, etc.), or must I
> write a stored procedure to implement it?

The planner knows how to produce such a plan, so it must thinking that
it's not the fastest plan.

Have you ANALYZEd the tables? You do have an index on T.K, right? What
does EXPLAIN ANALYZE output look like? (you can do BEGIN; EXPLAIN
ANALYZE ...; ROLLBACK; if you don't want to actually delete the rows)

The sequential scan really could be the fastest way to do that. If those
0.1% of the rows are scattered randomly across the table, an index scan
might end up fetching almost every page, but using random I/O which is
much slower than a sequential read. For example, assuming you can fit
100 rows on a page, deleting 0.1% of the rows would have to access ~ 10%
of the pages. At that point, it can easily be cheaper to just seq scan it.

You can try to coerce the planner to choose the indexscan with "set
enable_seqscan=off", to see how fast it actually is.

You could also write the query as DELETE FROM t USING b WHERE t.k = b.k,
but I doubt it makes much difference.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

В списке pgsql-performance по дате отправления:

Предыдущее
От: "Kynn Jones"
Дата:
Сообщение: Joins and DELETE FROM
Следующее
От: "Kynn Jones"
Дата:
Сообщение: Re: Joins and DELETE FROM