Обсуждение: Delete performance again

Поиск
Список
Период
Сортировка

Delete performance again

От
"Віталій Тимчишин"
Дата:
Hello.

I have a database with company table that have a bunch of related (delete=cascade) tables.
Also it has 1<->M relation to company_descr table.
Once we've found that ~half of our companies do not have any description and we would like to remove them.
First this I've tried was
delete from company where id not in (select company_id from company_descr);
I've tried to analyze command, but unlike to other RDBM I've used it did not include cascade deletes/checks into query plan. That is first problem.
It was SLOW. To make it faster I've done next thing:

create temporary table comprm(id) as select id from company;
delete from comprm where id in (select company_id from company_descr);
delete from company where id in (select id from comprm);

That was much better. So the question is why postgresql can't do such a thing.
But it was better only until "removing" dataset was small (~5% of all table).
As soon as I've tried to remove 50% I've got speed problems. I've ensured I have all indexes for both ends of foreign key.
I've tried to remove all cascaded entries by myself, e.g.:

create temporary table comprm(id) as select id from company;
delete from comprm where id in (select company_id from company_descr);
delete from company_alias where company_id in (select id from comprm);
...
delete from company where id in (select id from comprm);

It did not help until I drop all constraints before and recreate all constraints after.
Now I have it work for 15minutes, while previously it could not do in a day.

Is it OK? I'd say, some (if not all) of the optimizations could be done by postgresql optimizer.

Re: Delete performance again

От
Tom Lane
Дата:
"=?ISO-8859-5?B?svbi0Nv22SDC2Nzn2OjY3Q==?=" <tivv00@gmail.com> writes:
> delete from company where id not in (select company_id from company_descr);
> I've tried to analyze command, but unlike to other RDBM I've used it did not
> include cascade deletes/checks into query plan. That is first problem.
> It was SLOW.

Usually the reason for that is having forgotten to make an index on the
referencing column(s) ?

            regards, tom lane

Re: Delete performance again

От
"Віталій Тимчишин"
Дата:


2008/10/2 Tom Lane <tgl@sss.pgh.pa.us>
"Віталій Тимчишин" <tivv00@gmail.com> writes:
> delete from company where id not in (select company_id from company_descr);
> I've tried to analyze command, but unlike to other RDBM I've used it did not
> include cascade deletes/checks into query plan. That is first problem.
> It was SLOW.

Usually the reason for that is having forgotten to make an index on the
referencing column(s) ?

Not at all. As you can see below in original message, simply "extending" the query to what should have been done by optimizer helps. I'd say optimizer always uses fixed plan not taking into account that this is massive update and id doing index lookup of children records for each parent record, while it would be much more effective to perform removal of all children records in single table scan.

It's like trigger "for each record" instead of "for each statement".

Re: Delete performance again

От
"Marc Mamin"
Дата:
Hi,

Maybe you can try this syntax. I'm not sure, but it eventually perform better:


delete from company_alias USING comprm
where company_alias.company_id =comprm.id


Cheers,

Marc

Re: Delete performance again

От
"Віталій Тимчишин"
Дата:
OK, I did try you proposal and correlated subselect.
I have a database ~900000 companies.
First try was to remove randomly selected 1000 companies
Uncorrelated subselect: 65899ms
Correlated subselect: 97467ms
using: 9605ms
my way: 104979ms. (without constraints recreate)
My is the worst because it is oriented on massive delete.
So I thought USING would perform better, so I did try 10000 companies
my way: 190527ms. (without constraints recreate)
using: 694144ms
I was a little shocked, but I did check plans and found out that it did switch from Nested Loop to Hash Join.
I did disable Hash Join, it not show Merge Join. This was also disabled....
and I've got 747253ms.
Then I've tried combinations: Without hash join it was the best result of 402629ms, without merge join it was 1096116ms.

My conclusion: Until optimizer would take into account additional actions needed (like constraints check/cascade deletes/triggers), it can not make good plan.

Re: Delete performance again

От
"Віталій Тимчишин"
Дата:
BTW: Have just tried "clean" (without any foreign keys constraints) peformance of
"delete from tbl where field not in (select)"
vs
"create temporary table tmp(id)  as select distinct field from tbl; delete from tmp where id in (select); delete from tbl where field in (select id from tmp)".
both tbl and select are huge.
tbl cardinality is ~5 million, select is ~1 milliion. Number of records to delete is small.
select is simply "select id from table2".

First (simple) one could not do in a night, second did in few seconds.