Обсуждение: Help with sql
Hi Guys, This isn't a PostgreSQL specific question but just a SQL question. If this is not an appropriate question for this list,please let me know. It is also, perhaps, a really silly question. This query (without the 'explain' keyword) , when executed takes forever and a day: > condor_development=> explain select id from filesets where id not in ( select fileset_id from service_pack_fileset_maps); > QUERY PLAN > ---------------------------------------------------------------------------------------------- > Seq Scan on filesets (cost=0.00..71937742.00 rows=26088 width=4) > Filter: (NOT (SubPlan 1)) > SubPlan 1 > -> Materialize (cost=0.00..2517.78 rows=95852 width=4) > -> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4) > (5 rows) This query returns within a second: > condor_development=> explain select id from filesets where id not in ( select distinct fileset_id from service_pack_fileset_maps); > QUERY PLAN > ---------------------------------------------------------------------------------------------- > Seq Scan on filesets (cost=2102.31..3153.53 rows=26088 width=4) > Filter: (NOT (hashed SubPlan 1)) > SubPlan 1 > -> HashAggregate (cost=1903.15..2062.48 rows=15933 width=4) > -> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4) > (5 rows) The difference is the "distinct" keyword in the inner select. What I'm confused about is why isn't the "distinct" implicit? I thought the construct "blah in ( select ... )" was using"sets" and an item (I thought) can not be in a set more than once. Perhaps my question is the opposite really? Why would you not always use "distinct" in the inner select when the operatoris "in" or "not in" ? And if I can throw in another question on top: is there a different method other than "not in" that would work better? Thank you guys for the help and a really awesome database. pedz (this is *suppose* to be sent as plain text... I hope my mailer does what it is told)
On 07/06/2012 03:34 PM, Perry Smith wrote: > Hi Guys, > > This isn't a PostgreSQL specific question but just a SQL question. If this is not an appropriate question for this list,please let me know. > > It is also, perhaps, a really silly question. > > This query (without the 'explain' keyword) , when executed takes forever and a day: > >> condor_development=> explain select id from filesets where id not in ( select fileset_id from service_pack_fileset_maps); >> QUERY PLAN >> ---------------------------------------------------------------------------------------------- >> Seq Scan on filesets (cost=0.00..71937742.00 rows=26088 width=4) >> Filter: (NOT (SubPlan 1)) >> SubPlan 1 >> -> Materialize (cost=0.00..2517.78 rows=95852 width=4) >> -> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4) >> (5 rows) > > This query returns within a second: > >> condor_development=> explain select id from filesets where id not in ( select distinct fileset_id from service_pack_fileset_maps); >> QUERY PLAN >> ---------------------------------------------------------------------------------------------- >> Seq Scan on filesets (cost=2102.31..3153.53 rows=26088 width=4) >> Filter: (NOT (hashed SubPlan 1)) >> SubPlan 1 >> -> HashAggregate (cost=1903.15..2062.48 rows=15933 width=4) >> -> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4) >> (5 rows) > > The difference is the "distinct" keyword in the inner select. > > What I'm confused about is why isn't the "distinct" implicit? I thought the construct "blah in ( select ... )" was using"sets" and an item (I thought) can not be in a set more than once. > > Perhaps my question is the opposite really? Why would you not always use "distinct" in the inner select when the operatoris "in" or "not in" ? > > And if I can throw in another question on top: is there a different method other than "not in" that would work better? > > Thank you guys for the help and a really awesome database. > pedz > (this is *suppose* to be sent as plain text... I hope my mailer does what it is told) > > Well they are distinct records, they just may have the same values. And I'm not trying to be flippant. We don't see the structure of those table: are all the id fields involved primary keys or with unique index coverage? Does "not exists ( select fileset.id = fileset_id from service_pack_fileset_map) change the behaviour?
On 07/06/2012 02:34 PM, Perry Smith wrote: > Hi Guys, > > This isn't a PostgreSQL specific question but just a SQL question. If this is not an appropriate question for this list,please let me know. > > It is also, perhaps, a really silly question. > > This query (without the 'explain' keyword) , when executed takes forever and a day: > >> condor_development=> explain select id from filesets where id not in ( select fileset_id from service_pack_fileset_maps); >> QUERY PLAN >> ---------------------------------------------------------------------------------------------- >> Seq Scan on filesets (cost=0.00..71937742.00 rows=26088 width=4) >> Filter: (NOT (SubPlan 1)) >> SubPlan 1 >> -> Materialize (cost=0.00..2517.78 rows=95852 width=4) >> -> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4) >> (5 rows) > This query returns within a second: > >> condor_development=> explain select id from filesets where id not in ( select distinct fileset_id from service_pack_fileset_maps); >> QUERY PLAN >> ---------------------------------------------------------------------------------------------- >> Seq Scan on filesets (cost=2102.31..3153.53 rows=26088 width=4) >> Filter: (NOT (hashed SubPlan 1)) >> SubPlan 1 >> -> HashAggregate (cost=1903.15..2062.48 rows=15933 width=4) >> -> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4) >> (5 rows) > The difference is the "distinct" keyword in the inner select. > > What I'm confused about is why isn't the "distinct" implicit? I thought the construct "blah in ( select ... )" was using"sets" and an item (I thought) can not be in a set more than once. > > Perhaps my question is the opposite really? Why would you not always use "distinct" in the inner select when the operatoris "in" or "not in" ? > > And if I can throw in another question on top: is there a different method other than "not in" that would work better? Actually it is *very* PostgreSQL specific. In fact, it may even be PostgreSQL *version* specific as you are delving into how the planner decides how to handle a query. It appears that the planner is assuming, based on collected stats and available indexes, that there will be roughly 1/6 the records returned by the "distinct" query and thus chose a different method to join the records. One useful piece of information would be the indexes on the two tables. As to other methods, you can use: ... where not exists (select 1 from service_pack_fileset_maps where fileset_id = filesets.id)... (Note: as alluded to above, ...not in... works better in some releases and ...not exists... better in others due to improvements over time.) Still another method: select id from filesets except select fileset_id from service_pack_fileset_maps; Cheers, Steve
On Jul 6, 2012, at 7:56 PM, Steve Crawford wrote: > On 07/06/2012 02:34 PM, Perry Smith wrote: >> Hi Guys, >> >> This isn't a PostgreSQL specific question but just a SQL question. If this is not an appropriate question for this list,please let me know. >> >> It is also, perhaps, a really silly question. >> >> This query (without the 'explain' keyword) , when executed takes forever and a day: >> >>> condor_development=> explain select id from filesets where id not in ( select fileset_id from service_pack_fileset_maps); >>> QUERY PLAN >>> ---------------------------------------------------------------------------------------------- >>> Seq Scan on filesets (cost=0.00..71937742.00 rows=26088 width=4) >>> Filter: (NOT (SubPlan 1)) >>> SubPlan 1 >>> -> Materialize (cost=0.00..2517.78 rows=95852 width=4) >>> -> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4) >>> (5 rows) >> This query returns within a second: >> >>> condor_development=> explain select id from filesets where id not in ( select distinct fileset_id from service_pack_fileset_maps); >>> QUERY PLAN >>> ---------------------------------------------------------------------------------------------- >>> Seq Scan on filesets (cost=2102.31..3153.53 rows=26088 width=4) >>> Filter: (NOT (hashed SubPlan 1)) >>> SubPlan 1 >>> -> HashAggregate (cost=1903.15..2062.48 rows=15933 width=4) >>> -> Seq Scan on service_pack_fileset_maps (cost=0.00..1663.52 rows=95852 width=4) >>> (5 rows) >> The difference is the "distinct" keyword in the inner select. >> >> What I'm confused about is why isn't the "distinct" implicit? I thought the construct "blah in ( select ... )" was using"sets" and an item (I thought) can not be in a set more than once. >> >> Perhaps my question is the opposite really? Why would you not always use "distinct" in the inner select when the operatoris "in" or "not in" ? >> >> And if I can throw in another question on top: is there a different method other than "not in" that would work better? > Actually it is *very* PostgreSQL specific. In fact, it may even be PostgreSQL *version* specific as you are delving intohow the planner decides how to handle a query. > > It appears that the planner is assuming, based on collected stats and available indexes, that there will be roughly 1/6the records returned by the "distinct" query and thus chose a different method to join the records. One useful piece ofinformation would be the indexes on the two tables. > > As to other methods, you can use: > ... where not exists (select 1 from service_pack_fileset_maps where fileset_id = filesets.id)... > (Note: as alluded to above, ...not in... works better in some releases and ...not exists... better in others due to improvementsover time.) > > Still another method: > select id from filesets except select fileset_id from service_pack_fileset_maps; Thanks guys. Small side note: I thought I saw "set difference" in the documentation but I couldn't find it. It appears "EXCEPT" is setdifference. Thank you for that tidbit. The database is mostly static. I run through a very lengthy process to populate the database maybe once a month and thenit is 99% read-only. By far, most of the accesses are via a view that I have that is rather long and ugly so I won'tpaste it in. I've tried to make this particular view as fast as possible so the indexes, etc are what I think willhelp that out. The version is psql (PostgreSQL) 9.0.4 Each table has a key of "id" -- this database was created by / used by Ruby on Rails and that is how it likes to do things. > condor_development=> \d service_pack_fileset_maps > Table "public.service_pack_fileset_maps" > Column | Type | Modifiers > -----------------+-----------------------------+------------------------------------------------------------------------ > id | integer | not null default nextval('service_pack_fileset_maps_id_seq'::regclass) > service_pack_id | integer | not null > fileset_id | integer | not null > created_at | timestamp without time zone | > updated_at | timestamp without time zone | > Indexes: > "service_pack_fileset_maps_pkey" PRIMARY KEY, btree (id) > "service_pack_fileset_maps_service_pack_id_key" UNIQUE, btree (service_pack_id, fileset_id) > "index_service_pack_fileset_maps_on_fileset_id" btree (fileset_id) > Foreign-key constraints: > "service_pack_fileset_maps_fileset_id_fkey" FOREIGN KEY (fileset_id) REFERENCES filesets(id) ON DELETE CASCADE DEFERRABLE > "service_pack_fileset_maps_service_pack_id_fkey" FOREIGN KEY (service_pack_id) REFERENCES service_packs(id) ON DELETECASCADE DEFERRABLE > condor_development=> \d filesets > Table "public.filesets" > Column | Type | Modifiers > ------------+-----------------------------+------------------------------------------------------- > id | integer | not null default nextval('filesets_id_seq'::regclass) > lpp_id | integer | not null > vrmf | character varying(255) | not null > created_at | timestamp without time zone | > updated_at | timestamp without time zone | > Indexes: > "filesets_pkey" PRIMARY KEY, btree (id) > "filesets_lpp_id_key" UNIQUE, btree (lpp_id, vrmf) > Foreign-key constraints: > "filesets_lpp_id_fkey" FOREIGN KEY (lpp_id) REFERENCES lpps(id) ON DELETE CASCADE DEFERRABLE > Referenced by: > TABLE "fileset_aix_file_maps" CONSTRAINT "fileset_aix_file_maps_fileset_id_fkey" FOREIGN KEY (fileset_id) REFERENCESfilesets(id) ON DELETE CASCADE DEFERRABLE > TABLE "fileset_ptf_maps" CONSTRAINT "fileset_ptf_maps_fileset_id_fkey" FOREIGN KEY (fileset_id) REFERENCES filesets(id)ON DELETE CASCADE DEFERRABLE > TABLE "package_fileset_maps" CONSTRAINT "package_fileset_maps_fileset_id_fkey" FOREIGN KEY (fileset_id) REFERENCESfilesets(id) ON DELETE CASCADE DEFERRABLE > TABLE "service_pack_fileset_maps" CONSTRAINT "service_pack_fileset_maps_fileset_id_fkey" FOREIGN KEY (fileset_id) REFERENCESfilesets(id) ON DELETE CASCADE DEFERRABLE > TABLE "upd_pc_views" CONSTRAINT "upd_pc_views_fileset_id_fkey" FOREIGN KEY (fileset_id) REFERENCES filesets(id) ONDELETE CASCADE DEFERRABLE Thank you again for your help, pedz
On Sat, Jul 7, 2012 at 11:48 PM, Perry Smith <pedzsan@gmail.com> wrote: > The database is mostly static. I run through a very lengthy process to populate the database maybe once a month and thenit is 99% read-only. Do you run an ANALYZE on the table after populating it? Postgres needs up-to-date statistics for best results. I'd recommend doing an explicit 'VACUUM ANALYZE' once your data's loaded, and then try your queries after that - it might not do much, but it also might give a massive improvement. ChrisA