Discussion: Strange query plan
Dear all,
I am new to PG, but I have a solid background in tuning Oracle and MSSQL.
I have a query coming out of a piece of software in our SW stack (I can't change it), and of course it takes a large amount of time.
The tables I am querying are inherited (partitioned), and the query is the following (names are changed per policy):
[SELECT]LOG: duration: 427514.480 ms execute <unnamed>: select
"f_table"."fk_column" as "c0",
"f_table"."fk_column" as "c1"
from
"s_schema"."f_table" as "f_table"
where
"f_table"."fk_column" = 'somevalue'
group by
"f_table"."fk_column"
order by
"f_table"."fk_column" ASC
The fk_column/somevalue is the "partition key", and the planner correctly prunes the inherited tables accordingly.
Records in partitions vary from a min of 30M to max of 160M rows.
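For context, a minimal sketch of the inheritance-based partitioning scheme being described, using the obfuscated names from the post (the real table and partition names are withheld, so everything here is hypothetical):

```sql
-- Each child table carries a CHECK constraint pinning fk_column to a single
-- value; this is what lets constraint_exclusion = partition prune siblings.
CREATE TABLE s_schema.f_table (
    fk_column text NOT NULL
    -- ... other fact columns ...
);

CREATE TABLE s_schema.f_table_somevalue (
    CHECK (fk_column = 'somevalue')
) INHERITS (s_schema.f_table);

-- With the constraint in place, a query like the one above
-- only has to visit one child table plus the (empty) parent:
-- SELECT ... FROM s_schema.f_table WHERE fk_column = 'somevalue';
```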
Group  (cost=0.00..4674965.80 rows=200 width=17)
  ->  Append  (cost=0.00..4360975.94 rows=125595945 width=17)
        ->  Index Scan using f_table_pkey on f_table  (cost=0.00..5.64 rows=1 width=58)
              Index Cond: ((fk_column)::text = 'somevalue'::text)
        ->  Seq Scan on f_table_scxc f_table  (cost=0.00..4360970.30 rows=125595944 width=17)
              Filter: ((fk_column)::text = 'somevalue'::text)
Disabling seq_scan does not help: it forces the index, but it takes ages.
In each partition there is just one value of fk_column (it being the partition key), and I expect the planner to check this against the constraint.
Furthermore, I have put an index on fk_column (I tried both btree and hash); however, the plan is always a seq_scan on the partition, even though the index has only one value.
Regardless of the constraint (which I think should be taken into consideration here), I would expect an index scan to easily figure out the value.
In theory there should be no need to access the table here at all; everything could be performed on the index object (and of course on the "father" table).
Furthermore, I don't understand why the main table is using an index scan (on 0 rows).
Yes: Analyzed.
I fear I am missing something on Index usage in Postgres.
cheers,
g
4 CPU (on VMWare) + 8G of RAM
seq_page_cost = 1.0 # measured on an arbitrary scale
random_page_cost = 2.5 # same scale as above
cpu_tuple_cost = 0.01 # same scale as above
cpu_index_tuple_cost = 0.005 # same scale as above
cpu_operator_cost = 0.0025 # same scale as above
default_statistics_target = 100 # range 1-10000
constraint_exclusion = partition # on, off, or partition
shared_buffers = 960MB # min 128kB
temp_buffers = 256MB # min 800kB
max_prepared_transactions = 100 # zero disables the feature
work_mem = 192MB # min 64kB
maintenance_work_mem = 480MB # min 1MB
Hi,

On 28 October 2011, 19:27, Sorbara, Giorgio (CIOK) wrote:
> Dear all,
>
> I am new to PG but I have a solid background on tuning in Oracle and
> MSSQL.
> I have a query coming out from a piece of software from our SW-Stack (I
> can't change it) and of course it takes a large amount of time.
>
> The table I am query are inherited (partitioned) and the query is the
> following (names are changed for policy):

That's a bit ridiculous policy, especially as you've used the same fake
column name (fk_column) for all columns. Does that mean you're reading
just one column, or that there are actually more columns? I'd guess the
first option, as fk_column is referenced twice in the select list ...

> [SELECT]LOG: duration: 427514.480 ms execute <unnamed>: select
> "f_table"."fk_column" as "c0",
> "f_table"."fk_column" as "c1"
> from
> "s_schema"."f_table" as "f_table"
> where
> "f_table"."fk_column" = 'somevalue'
> group by
> "f_table"."fk_column"
> order by
> "f_table"."fk_column" ASC
>
> the fk_column/somevalue is the "partition key" and the planner correctly
> purge the inherited table accordingly.
> Records in partitions vary from a min of 30M to max of 160M rows.
>
> Group  (cost=0.00..4674965.80 rows=200 width=17)
>   ->  Append  (cost=0.00..4360975.94 rows=125595945 width=17)
>         ->  Index Scan using f_table_pkey on f_table  (cost=0.00..5.64 rows=1 width=58)
>               Index Cond: ((fk_column)::text = 'somevalue'::text)
>         ->  Seq Scan on f_table_scxc f_table  (cost=0.00..4360970.30 rows=125595944 width=17)
>               Filter: ((fk_column)::text = 'somevalue'::text)
>
> disabling the seq_scan do not help it forces the index but it takes ages.
>
> In each partition the value of fk_column is just one (being the partition
> key) and I am expecting that this is checked on the constraint by the
> planner.
> Furthermore I have put an index on fk_column (tried both btree and hash)
> however the plan is always a seq_scan on the partition, even if the index
> has only one value?
I'm a bit confused right now. The fk_column is used for partitioning, so
"fk_column = somevalue" actually means "give me all data from exactly one
partition", right?

In that case the above behaviour is expected, because an index scan would
mean a lot of random I/O. MVCC in PostgreSQL works very differently
compared to Oracle, for example - the indexes do not contain the necessary
visibility info (i.e. which transactions can see which records), so
whenever you read a tuple from an index, you have to check the data in the
actual table. So an index scan of the whole table means "read the whole
index and the whole table", and the table is accessed randomly (which
kinda defeats the db cache). So the sequential scan is expected and
perfectly sane.

BTW this should change in 9.2, as there is an index-only scan
implementation.

> Regardless the constraint (which I think it should be taken into
> consideration here) I am expecting that through Index Scan would easily
> figure out that the value.
> In theory there should be no need to access the table here but perform
> everything on the index object (and of course in the "father" table).
> Furthemore I don't understand why on the main table is using an index scan
> (on 0 rows).

Not true. PostgreSQL MVCC does not work that way - see the explanation
above.

> I fear I am missing something on Index usage in Postgres.

Yup, seems like that. Anyway, a few recommendations / questions:

1) Don't post EXPLAIN output, post EXPLAIN ANALYZE if possible.

2) How many live tuples are there? Are there really 125,595,945 live rows,
or is the table bloated? When you do a "select count(*)" from the table,
what number do you get?

3) What amount of data are we talking about here? 125 million rows could
be 10MB or 10GB, hard to guess.

Tomas
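To illustrate the 9.2 point above: an index-only scan can answer a query from the index alone when all referenced columns are indexed and the visibility map shows the pages as all-visible. A sketch, reusing the obfuscated names from the original post (hypothetical, and not something 9.1 or earlier can do):

```sql
-- On 9.2+, a query touching only indexed columns may be answered from the
-- index alone, subject to the visibility map:
CREATE INDEX f_table_fk_idx ON s_schema.f_table_somevalue (fk_column);

EXPLAIN SELECT fk_column
FROM s_schema.f_table_somevalue
WHERE fk_column = 'somevalue'
GROUP BY fk_column;

-- On 9.1 and earlier, every index hit still has to visit the heap to check
-- tuple visibility, which is why the planner prefers a seq scan here.
```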
Hi Tomas, and thank you for your reply. My comments are inline.

> -----Original Message-----
> From: Tomas Vondra [mailto:tv@fuzzy.cz]
> Sent: 28 October 2011 8:10 PM
> To: Sorbara, Giorgio (CIOK)
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Strange query plan
>
> That's a bit ridiculous policy, especially as you've used the same fake
> column name (fk_column) for all columns. Does that mean you're reading
> just one column, or that there are actually more columns? I'd guess the
> first option, as fk_column is referenced twice in the select list ...

Sorry, but that is the exact query (I'll ignore the policy and post the
exact columns from now on). Just to be clear, it is a query generated by
Mondrian (a ROLAP engine) for a degenerate dimension, and it looks like
this:

select "f_suipy"."fk_theme" as "c0",
       "f_suipy"."fk_theme" as "c1"
from "gaez"."f_suipy" as "f_suipy"
where "f_suipy"."fk_theme" = 'main_py_six_scxc'
group by "f_suipy"."fk_theme"
order by "f_suipy"."fk_theme" ASC;

We have a total of 18 partitions.

> I'm a bit confused right now. The fk_column is used for partitioning, so
> "fk_column = somevalue" actually means "give me all data from exactly one
> partition", right?

Yes, but there is an enforced constraint telling me that the column can
host only one value.

> In that case the above behaviour is expected, because index scan would
> mean a lot of random I/O. MVCC in PostgreSQL works very differently,
> compared to Oracle for example - the indexes do not contain necessary
> visibility info (which transactions can see those records), so whenever
> you read a tuple from index, you have to check the data in the actual
> table.
>
> So an index scan of the whole table means "read the whole index and the
> whole table" and the table is accessed randomly (which kinda defeats the
> db cache). So the sequential scan is the expected and perfectly sane.
>
> BTW this should change in 9.2, as there is an index-only scan
> implementation.
>
> > I fear I am missing something on Index usage in Postgres.
>
> Yup, seems like that.

OK... so since the index is not version aware, I have to check the version
in the "data" segment to be sure I am pointing at the right value. I can
see now there is no point in using this partitioning scheme... it was sort
of perfect to me, as I could drive the partitions easily with a degenerate
dimension. Except for this small issue (waiting more than 10 min is not an
option). Furthermore, I am afraid that even partial indexes won't work.

> Anyway, a few recommendations / questions:
>
> 1) Don't post EXPLAIN output, post EXPLAIN ANALYZE if possible.

Group  (cost=0.00..4674965.80 rows=200 width=17) (actual time=13.375..550943.592 rows=1 loops=1)
  ->  Append  (cost=0.00..4360975.94 rows=125595945 width=17) (actual time=13.373..524324.817 rows=125595932 loops=1)
        ->  Index Scan using f_suipy_pkey on f_suipy  (cost=0.00..5.64 rows=1 width=58) (actual time=0.019..0.019 rows=0 loops=1)
              Index Cond: ((fk_theme)::text = 'main_py_six_scxc'::text)
        ->  Seq Scan on f_suipy_main_py_six_scxc f_suipy  (cost=0.00..4360970.30 rows=125595944 width=17) (actual time=13.352..495259.117 rows=125595932 loops=1)
              Filter: ((fk_theme)::text = 'main_py_six_scxc'::text)
Total runtime: 550943.699 ms

> 2) How many live tuples are there? Are there really 125,595,945 live
> rows, or is the table bloated? When you do a "select count(*)" from the
> table, what number do you get?

SELECT count(0) FROM gaez.f_suipy_main_py_six_scxc;
125595932

> 3) What amount of data are we talking about here? 125 million rows could
> be 10MB or 10GB, hard to guess.

pg_size_pretty tells me: 'f_suipy_main_py_six_scxc'; '21 GB'

> Tomas

g

ps: excellent benchmarking work on your site :)
On Mon, Oct 31, 2011 at 9:52 AM, Sorbara, Giorgio (CIOK)
<Giorgio.Sorbara@fao.org> wrote:
> Group  (cost=0.00..4674965.80 rows=200 width=17) (actual time=13.375..550943.592 rows=1 loops=1)
>   ->  Append  (cost=0.00..4360975.94 rows=125595945 width=17) (actual time=13.373..524324.817 rows=125595932 loops=1)
>         ->  Index Scan using f_suipy_pkey on f_suipy  (cost=0.00..5.64 rows=1 width=58) (actual time=0.019..0.019 rows=0 loops=1)
>               Index Cond: ((fk_theme)::text = 'main_py_six_scxc'::text)
>         ->  Seq Scan on f_suipy_main_py_six_scxc f_suipy  (cost=0.00..4360970.30 rows=125595944 width=17) (actual time=13.352..495259.117 rows=125595932 loops=1)
>               Filter: ((fk_theme)::text = 'main_py_six_scxc'::text)
> Total runtime: 550943.699 ms

How fast do you expect this to run? It's aggregating 125 million rows, so
that's going to take some time no matter how you slice it. Unless I'm
misreading this, it's actually taking only about 4 microseconds per row,
which does not obviously suck.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
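Robert's per-row figure can be checked directly from the EXPLAIN ANALYZE numbers above: 550943.699 ms total runtime over 125595932 appended rows.

```sql
-- Back-of-envelope check of the "about 4 microseconds per row" claim:
-- total runtime (ms) * 1000 / rows scanned = microseconds per row.
SELECT round((550943.699 * 1000 / 125595932)::numeric, 2) AS us_per_row;
-- roughly 4.39 µs per row
```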
> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> Sent: 04 November 2011 5:07 PM
> To: Sorbara, Giorgio (CIOK)
> Cc: Tomas Vondra; pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Strange query plan
>
> How fast do you expect this to run? It's aggregating 125 million rows,
> so that's going to take some time no matter how you slice it. Unless I'm
> misreading this, it's actually taking only about 4 microseconds per row,
> which does not obviously suck.

Well, the problem is not how long it takes to process one row, rather the
best query plan I am supposed to get. I don't mean the planner is wrong,
but I was expecting a feature that is not there (yet): we don't have pure
index scans. Fair enough. So I have approached the problem in a different
way: getting rid of the degenerate dimensions and exploiting a "useless"
dimension table.
It's a workaround, but it actually seems to work :) Now I have a ~350
million row fact table and no partitions, but I am happy to get the data I
want in 1 sec or less.
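A sketch of the workaround being described, with hypothetical names (the actual dimension table is not shown in the thread): move the theme values out of the fact table into a small dimension table, so the ROLAP engine never has to scan 125M fact rows just to enumerate one theme value.

```sql
-- Hypothetical star-schema rework: a tiny dimension table holds the theme
-- values, and the fact table references it by surrogate key.
CREATE TABLE gaez.d_theme (
    theme_key  serial PRIMARY KEY,
    theme_name text NOT NULL UNIQUE
);

-- The Mondrian-style member query can then run against the dimension alone
-- (18 rows instead of 125 million):
SELECT "d_theme"."theme_name" AS "c0"
FROM "gaez"."d_theme" AS "d_theme"
WHERE "d_theme"."theme_name" = 'main_py_six_scxc'
GROUP BY "d_theme"."theme_name"
ORDER BY "d_theme"."theme_name" ASC;
```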
On Fri, Nov 4, 2011 at 12:14 PM, Sorbara, Giorgio (CIOK)
<Giorgio.Sorbara@fao.org> wrote:
>> How fast do you expect this to run? It's aggregating 125 million
>> rows, so that's going to take some time no matter how you slice it.
>> Unless I'm misreading this, it's actually taking only about 4
>> microseconds per row, which does not obviously suck.
>
> Well, the problem is not how long it takes to process one row, rather the
> best query plan I am supposed to get. I don't mean the planner is wrong,
> but I was expecting a feature that is not there (yet): we don't have pure
> index scans. Fair enough. So I have approached the problem in a different
> way: getting rid of the degenerate dimensions and exploiting a "useless"
> dimension table.
> It's a workaround, but it actually seems to work :) Now I have a ~350
> million row fact table and no partitions, but I am happy to get the data
> I want in 1 sec or less.

Am I misreading the EXPLAIN ANALYZE output? I'm reading it to say that
there were 125 million rows in the table that matched your filter
condition. If that's correct, I don't think index-only scans (which will
be in 9.2) are going to help you much - it might be faster, but it's
definitely not going to be anything like instantaneous. On the flip side,
if I *am* misreading the output and the number of rows needed to compute
the aggregate is actually some very small number, then you ought to be
getting an index scan even in older versions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company