Bug in execution of EXISTS and IN clauses for large tables

Поиск

Список

Период

Сортировка

От	Martin Kos
Тема	Bug in execution of EXISTS and IN clauses for large tables
Дата	22 февраля 2022 г. 16:17:06
Msg-id	AM4PR0401MB2241616DE8A709B13019B0E59F3B9@AM4PR0401MB2241.eurprd04.prod.outlook.com обсуждение исходный текст
Ответы	Re: Bug in execution of EXISTS and IN clauses for large tables
Список	pgsql-bugs

Дерево обсуждения

Dear PostgreSQL team,

I would like to report an unexpected behavior of the queries using either the EXISTS or IN clauses.

I am using PostgreSQL version 11.1 (accessed via the DBVisualizer).

Description: I need to subset one large table to rows to only ids present in second table and then join columns from a third table, that has all IDs (unique) with different information. Finally, counts of different columns in different group by conditions are executed. To make the sub-setting efficient and avoid multiplication of the rows when sub-setting to the IDs present only in the second table, I have used the “where exists” or “where ID in…” clauses. However, under my conditions, the “exists” and “in” clauses were processed as inner join leading to multiplication of the rows. Placing the “exists” or “in” clauses on the second smaller table1, retrieved expected smaller number of rows. It seems that PostgreSQL selects different plan for the two clauses depending on the cost.

Minimal settings to reproduce the issue:

I have managed to reproduce the issue with this minimal setting. I have three tables with large number of rows:

table1_25mil_base table: 25 million rows, ID column (simple sequence) and column_x – random number with duplicates
table2_60mil table: 60 million rows – created from the first table, multiple occurrences of the 25 million unique IDs, and column_y (random)

table3_25mil table: 25 million rows - created from the first table, but contains only subset of the unique IDs (~ 2/3 of the unique IDs), and column_z (random).

The presence of the second column the table 3 seems to be needed to reproduce the issue.

The queries to create the test tables are at the end below.

The result of the different queries that should in principle provide the same output is below – please note the number of rows in the tables produced.

Please let me know if this is an expected behavior?

If you need any more information, let me know.

Thank you.

Kind regards,
Martin Kos

/* test 1 using “EXISTS” */

select count(*)

from (select *

from table2_60mil table2

where exists (select id

from table3_30mil table3

where table2.id = table3.id

)

)t1

join (select id, column_x

from table1_25mil_base

)t2

on t1.id = t2.id

-- 72'010'384 (expected less then 60 million rows present in the table2)

;

select count(*)

from (select *

from table2_60mil

--where id in (select id from test_20mil)

)t1

join (select id, column_x

from table1_25mil_base table1

where exists (select id

from table3_30mil table3

where table1.id = table3.id

)

)t2

on t1.id = t2.id

-- 44'273'564 (expected “correct” reduced number of rows)

;

/* test 2 – using “IN clause” */

select count(*)

from (select *

from table2_60mil

where id in (select id from table3_30mil)

)t1

join (select id, column_x

from table1_25mil_base

)t2

on t1.id = t2.id

-- 72'010'384 (expected less then 60 million rows present in the table2. The number of rows corresponds to inner join);

select count(*)

from (select *

from table2_60mil

--where id in (select id from test_20mil)

)t1

join (select id, column_x

from table1_25mil_base

where id in (select id from table3_30mil)

)t2

on t1.id = t2.id

-- 44'273'564 (expected “correct” reduced number of rows)

;

/* test compare with "inner join" */

select count(*)

from (select table2.id

from table2_60mil table2

join table3_30mil table3

on table2.id = table3.id

)t1

join (select id, column_x

from table1_25mil_base

)t2

on t1.id = t2.id

-- 72'010'384

;

/* test 3 – sub-setting to the same table -> should result in the same number of rows as in the original table! */

select count(*)

from (select *

from table2_60mil

where id in (select id from table2_60mil)

)t1

join (select id, column_x

from table1_25mil_base

)t2

on t1.id = t2.id

-- 179'989'276 – clearly multiplied rows, behaves as inner join

;

select count(*)

from (select *

from table2_60mil

--where id in (select id from test_20mil)

)t1

join (select id, column_x

from table1_25mil_base

where id in (select id from table2_60mil)

)t2

on t1.id = t2.id

-- 60'000'000 (correct number of rows)

;

/* make base table1 with 25 million distinct ids */

select generate_series(1,25000000) AS id, floor(random() * 100000)::int as column_x

into table1_25mil_base

;

/* make table2 with 60 million rows with close to all ids. Offset is used to prevent planner to reuse of the tables, perhaps not needed */

select id, floor(random() * 10000)::int as column_y

into table2_60mil

from ((select id

from table1_25mil_base

order by random()

limit 10000000)

union all

(select id

from table1_25mil_base

order by random()

limit 10000000 offset 0)

union all

(select id

from table1_25mil_base

order by random()

limit 10000000 offset 2345656)

union all

(select id

from table1_25mil_base

order by random()

limit 10000000 offset 500000)

union all

(select id

from table1_25mil_base

order by random()

limit 10000000 offset 1452652)

union all

(select id

from table1_25mil_base

order by random()

limit 10000000 offset 777)

;

/* create 25 million "indication" table */

select id, floor(random() * 10000)::int as column_z

into table3_30mil

from ((select id

from table1_25mil_base

order by random()

limit 5000000 offset 7)

union all

(select id

from table1_25mil_base

order by random()

limit 5000000 offset 345786)

union all

(select id

from table1_25mil_base

order by random()

limit 5000000 offset 0)

union all

(select id

from table1_25mil_base

order by random()

limit 5000000 offset 20008)

union all

(select id

from table1_25mil_base

order by random()

limit 5000000 offset 369)

union all

(select id

from table1_25mil_base

order by random()

limit 5000000 offset 77877)

;

/* tables info */

select count(id), count(distinct id)

from table1_25mil_base

-- 25000000 25000000

;

select count(id), count(distinct id)

from table2_60mil

-- 60000000 23833944

;

select count(id), count(distinct id)

from table3_30mil

-- 25000000 23833944

;

Martin Kos

Biomedical Data Scientist Analytics




E:	martin.kos@molecularhealth.com

Molecular Health GmbH

Kurfürsten‑Anlage 21

69115

Heidelberg

www.molecularhealth.com

Vorsitzender des Aufsichtsrats: Prof. Dr. Christof Hettich
Geschäftsführer: Dr. Friedrich von Bohlen und Halbach (Sprecher der Geschäftsleitung), Dr. Stephan Brock
Sitz der Gesellschaft: Heidelberg
Handelsregister: Amtsgericht Mannheim - HRB 338037

Confidentiality note: The information in this email and any attachment may contain confidential and proprietary information of Molecular Health GmbH, Molecular Health, Inc. and/or its affiliates and may be privileged or otherwise protected from disclosure. If you are not the intended recipient, you are hereby notified that any review, reliance or distribution by others or forwarding without express permission is strictly prohibited and may cause liability. In case you have received this message due to an error in transmission, we kindly ask you to notify the sender immediately and to delete this email and any attachment from your system.

Disclaimer: The information in this transmission contains privileged and confidential information, including patient information protected by federal and state privacy laws. It is intended only for the use of the person(s) named above. If you are not the intended recipient, you are hereby notified, that any review, dissemination, distribution or duplication of this communication is strictly prohibited. If you are not the intended recipient, please contact the sender by reply mail and destroy all copies of the original message.

Вложения

В списке pgsql-bugs по дате отправления:

Предыдущее

От: "egashira.yusuke@fujitsu.com"
Дата: 22 февраля 2022 г., 15:06:23
Сообщение: RE: Reconnect a single connection used by multiple threads in embedded SQL in C application causes error.

Следующее

От: "David G. Johnston"
Дата: 22 февраля 2022 г., 17:17:53
Сообщение: Re: Bug in execution of EXISTS and IN clauses for large tables

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Bug in execution of EXISTS and IN clauses for large tables

Вложения

Предыдущее

Следующее