Bug in execution of EXISTS and IN clauses for large tables

Поиск
Список
Период
Сортировка
От Martin Kos
Тема Bug in execution of EXISTS and IN clauses for large tables
Дата
Msg-id AM4PR0401MB2241616DE8A709B13019B0E59F3B9@AM4PR0401MB2241.eurprd04.prod.outlook.com
обсуждение исходный текст
Ответы Re: Bug in execution of EXISTS and IN clauses for large tables  ("David G. Johnston" <david.g.johnston@gmail.com>)
Список pgsql-bugs

Dear PostgreSQL team,

 

I would like to report an unexpected behavior of the queries using either the EXISTS or IN clauses.


I am using PostgreSQL version 11.1 (accessed via the DBVisualizer).

 

Description: I need to subset one large table to rows to only  ids present in second table and then join columns from a third table, that has all IDs (unique) with different information. Finally, counts of different columns in different group by conditions are executed. To make the sub-setting efficient and avoid multiplication of the rows when sub-setting to the IDs present only in the second table, I have used the “where exists” or “where ID in…” clauses. However, under my conditions, the “exists” and “in” clauses were processed as inner join leading to multiplication of the rows. Placing the “exists” or “in” clauses on the second smaller table1, retrieved expected smaller number of rows. It seems that PostgreSQL selects different plan for the two clauses depending on the cost.

 

Minimal settings to reproduce the issue:

I have managed to reproduce the issue with this minimal setting.  I have three tables with large number of rows:

table1_25mil_base table: 25 million rows, ID column (simple sequence) and column_x – random number with duplicates
table2_60mil table:  60 million rows – created from the first table, multiple occurrences of the 25 million unique IDs, and column_y (random)

table3_25mil table: 25 million rows  - created from the first table, but contains only subset of the unique IDs (~ 2/3 of the unique IDs), and column_z (random).

The presence of the second column the table 3 seems to be needed to reproduce the issue.

The queries to create the test tables are at the end below.

 

The result of the different queries that should in principle provide the same output is below – please note the number of rows in the tables produced.

 

Please let me know if this is an expected behavior?

If you need any more information, let me know.

Thank you.

 

Kind regards,
Martin Kos

 

 

/* test 1 using “EXISTS” */

 

select count(*)

from (select *

      from table2_60mil table2

      where exists (select id

                    from table3_30mil table3

                    where table2.id = table3.id

                   )

      )t1

join (select id, column_x

      from table1_25mil_base

      )t2

  on t1.id = t2.id

-- 72'010'384  (expected less then 60 million rows present in the table2)

;

 

select count(*)

from (select *

      from table2_60mil

      --where id in (select id from test_20mil)

      )t1

join (select id, column_x

      from table1_25mil_base table1

      where exists (select id

                    from table3_30mil table3

                    where table1.id = table3.id

                   )

      )t2

  on t1.id = t2.id

-- 44'273'564  (expected “correct” reduced number of rows)

;

 

 

/* test 2 – using “IN clause” */

 

select count(*)

from (select *

      from table2_60mil

      where id in (select id from table3_30mil)

      )t1

join (select id, column_x

      from table1_25mil_base

      )t2

  on t1.id = t2.id

-- 72'010'384  (expected less then 60 million rows present in the table2. The number of rows corresponds to inner join);

 

select count(*)

from (select *

      from table2_60mil

      --where id in (select id from test_20mil)

      )t1

join (select id, column_x

      from table1_25mil_base

      where id in (select id from table3_30mil)

      )t2

  on t1.id = t2.id

-- 44'273'564  (expected “correct” reduced number of rows)

;

 

/* test compare with "inner join" */

 

select count(*)

from (select table2.id

      from table2_60mil table2

      join table3_30mil table3

      on table2.id = table3.id

      )t1

join (select id, column_x

      from table1_25mil_base

      )t2

  on t1.id = t2.id

-- 72'010'384

;

 

 

/* test 3 – sub-setting to the same table -> should result in the same number of rows as in the original table! */

 

select count(*)

from (select *

      from table2_60mil

      where id in (select id from table2_60mil)

      )t1

join (select id, column_x

      from table1_25mil_base

      )t2

  on t1.id = t2.id

-- 179'989'276 – clearly multiplied rows, behaves as inner join

;

 

select count(*)

from (select *

      from table2_60mil

      --where id in (select id from test_20mil)

      )t1

join (select id, column_x

      from table1_25mil_base

      where id in (select id from table2_60mil)

      )t2

  on t1.id = t2.id

-- 60'000'000 (correct number of rows)

;

 

 

/* make base table1 with 25 million distinct ids */

select generate_series(1,25000000) AS id, floor(random() * 100000)::int as column_x

into table1_25mil_base

;

/* make table2 with 60 million rows with close to all ids. Offset is used to prevent planner to reuse of the tables, perhaps not needed */

 

select id, floor(random() * 10000)::int as column_y

into table2_60mil

from ((select id

      from table1_25mil_base

      order by random()

      limit 10000000)

       union all

       (select id

      from table1_25mil_base

      order by random()

      limit 10000000 offset 0)

       union all

      (select id

      from table1_25mil_base

      order by random()

      limit 10000000 offset 2345656)

       union all

      (select id

      from table1_25mil_base

      order by random()

      limit 10000000 offset 500000)

      union all

      (select id

      from table1_25mil_base

      order by random()

      limit 10000000 offset 1452652)

       union all

      (select id

      from table1_25mil_base

      order by random()

      limit 10000000 offset 777)

      )t

;

    

/* create 25 million "indication" table */

 

select id, floor(random() * 10000)::int as column_z

into table3_30mil

from ((select id

      from table1_25mil_base

      order by random()

      limit 5000000 offset 7)

       union all

      (select id

      from table1_25mil_base

      order by random()

      limit 5000000 offset 345786)

      union all

      (select id

      from table1_25mil_base

      order by random()

      limit 5000000 offset 0)

       union all

      (select id

      from table1_25mil_base

      order by random()

      limit 5000000 offset 20008)

      union all

      (select id

      from table1_25mil_base

      order by random()

      limit 5000000 offset 369)

      union all

      (select id

      from table1_25mil_base

      order by random()

      limit 5000000 offset 77877)

      )t

;

    

/* tables info */

select count(id), count(distinct id)

from table1_25mil_base

-- 25000000 25000000

;

 

select count(id), count(distinct id)

from table2_60mil

-- 60000000  23833944

;

 

select count(id), count(distinct id)

from table3_30mil

-- 25000000 23833944

;

 

 

Martin Kos
Follow Us
Biomedical Data Scientist Analytics
Facebook
LinkedIn
Twitter
YouTube
  
E: martin.kos@molecularhealth.com
Molecular Health GmbH
Kurfürsten‑Anlage 21
69115
Heidelberg
www.molecularhealth.com
Molecular Health
Vorsitzender des Aufsichtsrats: Prof. Dr. Christof Hettich
Geschäftsführer: Dr. Friedrich von Bohlen und Halbach (Sprecher der Geschäftsleitung), Dr. Stephan Brock
Sitz der Gesellschaft: Heidelberg
Handelsregister: Amtsgericht Mannheim - HRB 338037

Confidentiality note: ​The information in this email and any attachment may contain confidential and proprietary information of Molecular Health GmbH, Molecular Health, Inc. and/or its affiliates and may be privileged or otherwise protected from disclosure. If you are not the intended recipient, you are hereby notified that any review, reliance or distribution by others or forwarding without express permission is strictly prohibited and may cause liability. In case you have received this message due to an error in transmission, we kindly ask you to notify the sender immediately and to delete this email and any attachment from your system.

​Disclaimer: The information in this transmission contains privileged and confidential information, including patient information protected by federal and state privacy laws. It is intended only for the use of the person(s) named above. If you are not the intended recipient, you are hereby notified, that any review, dissemination, distribution or duplication of this communication is strictly prohibited. If you are not the intended recipient, please contact the sender by reply mail and destroy all copies of the original message.

Вложения

В списке pgsql-bugs по дате отправления:

Предыдущее
От: "egashira.yusuke@fujitsu.com"
Дата:
Сообщение: RE: Reconnect a single connection used by multiple threads in embedded SQL in C application causes error.
Следующее
От: "David G. Johnston"
Дата:
Сообщение: Re: Bug in execution of EXISTS and IN clauses for large tables