Thread: Postgres picks suboptimal index after building of an extended statistics


Postgres picks suboptimal index after building of an extended statistics

From:
"Andrey V. Lepikhov"
Date:
Hi,

Ivan Frolkov reported a problem with the planner choosing a non-optimal
index during query optimization. The problem appeared after extended
statistics were built.

I prepared a test case (see t.sql in the attachment).
To reproduce this case, we need a composite primary key index and one
other index. Before the extended statistics are created, the SELECT
from the table chooses the PK index and returns only one row.
Afterwards, the same SELECT picks the alternative index, fetching and
filtering many tuples.
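
Roughly, the setup looks like this (the attached t.sql has the exact
script; table and index names below are illustrative):

CREATE TABLE t AS
    SELECT gs % 10 AS x,
           (gs % 10 + (gs / 10) % 10) % 10 AS y,
           gs / 100 AS z
    FROM generate_series(1, 1000) AS gs;
INSERT INTO t SELECT gs, gs, gs FROM generate_series(1000, 2000) AS gs;

ALTER TABLE t ADD PRIMARY KEY (x, y, z);  -- the composite PK index
CREATE INDEX t_x_idx ON t (x);            -- one other index
ANALYZE t;

-- chooses the PK index and estimates a single row
EXPLAIN SELECT * FROM t WHERE x = 1 AND y = 1 AND z = 1;

CREATE STATISTICS t_stat (dependencies, ndistinct) ON x, y, z FROM t;
ANALYZE t;

-- can now switch to t_x_idx, fetching and filtering many tuples
EXPLAIN SELECT * FROM t WHERE x = 1 AND y = 1 AND z = 1;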

The problem is related to a corner case in the btree cost estimation
procedure: if Postgres detects a unique one-row index scan, it sets
numIndexTuples to 1.0.

But the selectivity is still calculated as usual, by the
clauselist_selectivity() routine, and can come out much larger than the
value corresponding to a single tuple. This selectivity value is used
later in the code to calculate the number of fetched tuples and can
lead to choosing a suboptimal index.

The attached patch is my suggestion to fix this problem.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachments

Re: Postgres picks suboptimal index after building of an extended statistics

From:
Peter Geoghegan
Date:
On Wed, Jun 23, 2021 at 7:19 AM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> Ivan Frolkov reported a problem with the planner choosing a non-optimal
> index during query optimization. The problem appeared after extended
> statistics were built.

Any thoughts on this, Tomas?

-- 
Peter Geoghegan



Re: Postgres picks suboptimal index after building of an extended statistics

From:
Tomas Vondra
Date:
On 8/11/21 2:48 AM, Peter Geoghegan wrote:
> On Wed, Jun 23, 2021 at 7:19 AM Andrey V. Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>> Ivan Frolkov reported a problem with the planner choosing a non-optimal
>> index during query optimization. The problem appeared after extended
>> statistics were built.
> 
> Any thoughts on this, Tomas?
> 

Thanks for reminding me, I missed / forgot about this thread.

I agree the current behavior is unfortunate, but I'm not convinced the 
proposed patch is fixing the right place - doesn't this mean the index 
costing won't match the row estimates displayed by EXPLAIN?

I wonder if we should teach clauselist_selectivity about UNIQUE indexes, 
and improve the cardinality estimates directly, not just costing for 
index scans.
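
The idea, as a sketch (with a hypothetical table): once equality
clauses cover all columns of a unique index, at most one row can match,
no matter what the per-column or extended statistics say.

CREATE TABLE u (x int, y int, z int, UNIQUE (x, y, z));
-- equality quals on all three key columns of the unique index can
-- match at most one row, so the combined selectivity could be clamped
-- at estimation time:
SELECT * FROM u WHERE x = 1 AND y = 1 AND z = 1;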

Also, is it correct that the patch calculates num_sa_scans only when 
(numIndexTuples >= 0.0)?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Postgres picks suboptimal index after building of an extended statistics

From:
"Andrey V. Lepikhov"
Date:
On 8/12/21 4:26 AM, Tomas Vondra wrote:
> On 8/11/21 2:48 AM, Peter Geoghegan wrote:
>> On Wed, Jun 23, 2021 at 7:19 AM Andrey V. Lepikhov
>> <a.lepikhov@postgrespro.ru> wrote:
>>> Ivan Frolkov reported a problem with the planner choosing a non-optimal
>>> index during query optimization. The problem appeared after extended
>>> statistics were built.
>>
>> Any thoughts on this, Tomas?
>>
> 
> Thanks for reminding me, I missed / forgot about this thread.
> 
> I agree the current behavior is unfortunate, but I'm not convinced the 
> proposed patch is fixing the right place - doesn't this mean the index 
> costing won't match the row estimates displayed by EXPLAIN?
I think it is not a problem: in EXPLAIN you will see only 1 row both
with and without this patch.
> 
> I wonder if we should teach clauselist_selectivity about UNIQUE indexes, 
> and improve the cardinality estimates directly, not just costing for 
> index scans.
This idea looks better. I will try to implement it.
> 
> Also, is it correct that the patch calculates num_sa_scans only when 
> (numIndexTuples >= 0.0)?
Thanks, fixed.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachments

Re: Postgres picks suboptimal index after building of an extended statistics

From:
Andrey Lepikhov
Date:
On 12/8/21 04:26, Tomas Vondra wrote:
> On 8/11/21 2:48 AM, Peter Geoghegan wrote:
>> On Wed, Jun 23, 2021 at 7:19 AM Andrey V. Lepikhov
> I agree the current behavior is unfortunate, but I'm not convinced the 
> proposed patch is fixing the right place - doesn't this mean the index 
> costing won't match the row estimates displayed by EXPLAIN?
I rewrote the patch. It's now simpler and shorter, and maybe more
convenient. It also includes a regression test to detect the problem in
the future.
> 
> I wonder if we should teach clauselist_selectivity about UNIQUE indexes, 
> and improve the cardinality estimates directly, not just costing for 
> index scans.
I tried implementing this in different ways, but it causes additional
overhead and code complexity: analyzing the list of indexes and matching
the clauses of each index against the input clauses in every selectivity
estimation. I don't like that approach and propose a new patch (see the
attachment).

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachments

Re: Postgres picks suboptimal index after building of an extended statistics

From:
Tom Lane
Date:
Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:
> On 12/8/21 04:26, Tomas Vondra wrote:
>> I wonder if we should teach clauselist_selectivity about UNIQUE indexes, 
>> and improve the cardinality estimates directly, not just costing for 
>> index scans.

> I tried implementing this in different ways, but it causes additional
> overhead and code complexity: analyzing the list of indexes and matching
> the clauses of each index against the input clauses in every selectivity
> estimation. I don't like that approach and propose a new patch (see the
> attachment).

I looked at this briefly.  I do not think that messing with
btcostestimate/genericcostestimate is the right response at all.
The problem can be demonstrated with no index whatever, as in the
attached shortened version of the original example.  I get

                    QUERY PLAN                     
---------------------------------------------------
 Seq Scan on a  (cost=0.00..46.02 rows=1 width=12)
   Filter: ((x = 1) AND (y = 1) AND (z = 1))
(2 rows)

before adding the extended stats, and

                     QUERY PLAN                     
----------------------------------------------------
 Seq Scan on a  (cost=0.00..46.02 rows=28 width=12)
   Filter: ((x = 1) AND (y = 1) AND (z = 1))
(2 rows)

afterwards.  So the extended stats have made the rowcount
estimate significantly worse, which seems like an indicator of a
bug somewhere in extended stats.  The more so because I can crank
default_statistics_target all the way to 10000 without these
estimates changing.  If we can't get a dead-on estimate for a
2001-row table at that stats level, we're doing something wrong,
surely?

Also, I found that if I ask only for ndistinct stats,
I still get rows=1.  The fishiness seems to be directly
a problem with dependencies stats.

            regards, tom lane

DROP TABLE IF EXISTS a CASCADE;
DROP STATISTICS IF EXISTS aestat;

-- 1000 rows with fairly random combinations of x, y, z
-- (no functional dependencies hold here)
CREATE TABLE a AS (
    SELECT
        gs % 10 AS x,
        (gs % 10 + (gs/10::int4) % 10) % 10 AS y,
        (gs / 100)::int4 AS z
    FROM generate_series(1,1000) AS gs
);
-- plus 1001 rows where x = y = z (perfectly dependent columns)
INSERT INTO a (SELECT gs,gs,gs FROM generate_series(1000,2000) AS gs);

-- ALTER TABLE a ADD PRIMARY KEY (x,y,z);
-- CREATE INDEX ON a(x);
ANALYZE a;

EXPLAIN SELECT * FROM a WHERE x=1 AND y=1 AND z=1;

CREATE STATISTICS aestat(dependencies,ndistinct) ON x,y,z FROM a;
ANALYZE a;

EXPLAIN SELECT * FROM a WHERE x=1 AND y=1 AND z=1;

Re: Postgres picks suboptimal index after building of an extended statistics

From:
Andrey Lepikhov
Date:
On 7/8/22 03:07, Tom Lane wrote:
> Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:
>> On 12/8/21 04:26, Tomas Vondra wrote:
>>> I wonder if we should teach clauselist_selectivity about UNIQUE indexes,
>>> and improve the cardinality estimates directly, not just costing for
>>> index scans.
> 
>> I tried implementing this in different ways, but it causes additional
>> overhead and code complexity: analyzing the list of indexes and matching
>> the clauses of each index against the input clauses in every selectivity
>> estimation. I don't like that approach and propose a new patch (see the
>> attachment).
> 
> I looked at this briefly.  I do not think that messing with
> btcostestimate/genericcostestimate is the right response at all.
> The problem can be demonstrated with no index whatever, as in the
> attached shortened version of the original example.  I get

I partly agree with you. Yes, I see the problem too. But we also have
the problem I described above: the optimizer doesn't choose the path
with minimal selectivity from a set of selectivities that all correspond
to a cardinality of less than one row (see badestimate2.sql).
The new patch (see the attachment) fixes this problem.

-- 
Regards
Andrey Lepikhov
Postgres Professional
Attachments

Re: Postgres picks suboptimal index after building of an extended statistics

From:
Andres Freund
Date:
Hi,

On 2022-07-11 12:57:36 +0500, Andrey Lepikhov wrote:
> On 7/8/22 03:07, Tom Lane wrote:
> > Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:
> > > On 12/8/21 04:26, Tomas Vondra wrote:
> > > > I wonder if we should teach clauselist_selectivity about UNIQUE indexes,
> > > > and improve the cardinality estimates directly, not just costing for
> > > > index scans.
> > 
> > > I tried implementing this in different ways, but it causes additional
> > > overhead and code complexity: analyzing the list of indexes and matching
> > > the clauses of each index against the input clauses in every selectivity
> > > estimation. I don't like that approach and propose a new patch (see the
> > > attachment).
> > 
> > I looked at this briefly.  I do not think that messing with
> > btcostestimate/genericcostestimate is the right response at all.
> > The problem can be demonstrated with no index whatever, as in the
> > attached shortened version of the original example.  I get
> 
> I partly agree with you. Yes, I see the problem too. But we also have
> the problem I described above: the optimizer doesn't choose the path
> with minimal selectivity from a set of selectivities that all correspond
> to a cardinality of less than one row (see badestimate2.sql).
> The new patch (see the attachment) fixes this problem.

This causes the main regression tests to fail due to a planner change:

https://cirrus-ci.com/build/6680222884429824

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/join.out
/tmp/cirrus-ci-build/build/testrun/regress/regress/results/join.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/join.out    2022-11-22 12:27:18.852087140 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress/regress/results/join.out    2022-11-22 12:28:47.934938882 +0000
@@ -6671,10 +6671,9 @@
    Merge Cond: (j1.id1 = j2.id1)
    Join Filter: (j2.id2 = j1.id2)
    ->  Index Scan using j1_id1_idx on j1
-   ->  Index Only Scan using j2_pkey on j2
+   ->  Index Scan using j2_id1_idx on j2
          Index Cond: (id1 >= ANY ('{1,5}'::integer[]))
-         Filter: ((id1 % 1000) = 1)
-(7 rows)
+(6 rows)
 
 select * from j1
 inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
 
Greetings,

Andres Freund



Re: Postgres picks suboptimal index after building of an extended statistics

From:
Andrey Lepikhov
Date:
On 12/8/2021 06:26, Tomas Vondra wrote:
> On 8/11/21 2:48 AM, Peter Geoghegan wrote:
>> On Wed, Jun 23, 2021 at 7:19 AM Andrey V. Lepikhov
>> <a.lepikhov@postgrespro.ru> wrote:
>>> Ivan Frolkov reported a problem with the planner choosing a non-optimal
>>> index during query optimization. The problem appeared after extended
>>> statistics were built.
>>
>> Any thoughts on this, Tomas?
>>
> 
> Thanks for reminding me, I missed / forgot about this thread.
> 
> I agree the current behavior is unfortunate, but I'm not convinced the 
> proposed patch is fixing the right place - doesn't this mean the index 
> costing won't match the row estimates displayed by EXPLAIN?
> 
> I wonder if we should teach clauselist_selectivity about UNIQUE indexes, 
> and improve the cardinality estimates directly, not just costing for 
> index scans.
> 
> Also, is it correct that the patch calculates num_sa_scans only when 
> (numIndexTuples >= 0.0)?
I can't stop thinking about this issue. It is bizarre that Postgres
chooses a non-unique index when a unique index gives us proof of minimum
scan.
I don't see a reason to teach the clauselist_selectivity() routine to
estimate UNIQUE indexes: we would add some cycles, and it would work
with btree indexes only.
Maybe we could change compare_path_costs_fuzzily() and add some
heuristic, for example:
"If the selectivity of both paths gives us no more than 1 row, prefer to
use a unique index or an index with the least selectivity."

-- 
regards,
Andrey Lepikhov
Postgres Professional




Re: Postgres picks suboptimal index after building of an extended statistics

From:
Tomas Vondra
Date:
On 9/25/23 06:30, Andrey Lepikhov wrote:
> ...
> I can't stop thinking about this issue. It is bizarre that Postgres
> chooses a non-unique index when a unique index gives us proof of minimum
> scan.

That's true, but no one has implemented this heuristic. So the "proof of
minimum scan" is merely hypothetical - the optimizer is unaware of it.

> I don't see a reason to teach the clauselist_selectivity() routine to
> estimate UNIQUE indexes: we would add some cycles, and it would work
> with btree indexes only.

I'm not sure I understand what this is meant to say. Can you elaborate?
We only allow UNIQUE for btree indexes anyway, so what exactly is the
problem here?

> Maybe we could change compare_path_costs_fuzzily() and add some
> heuristic, for example:
> "If the selectivity of both paths gives us no more than 1 row, prefer to
> use a unique index or an index with the least selectivity."
> 

I don't understand how this would work. What do you mean by "selectivity
of a path"? AFAICS the problem here is that we estimate a path to return
more rows (while we know there can only be 1, thanks to the UNIQUE index).

So how would you know the path does not give us more than 1 row? Surely
you're not proposing compare_path_costs_fuzzily() to do something
expensive like analyzing the paths, or something.

Also, how would it work in upper levels? If you just change which path
we keep, but leave the inaccurate row estimate in place, that may fix
that level, but it's certainly going to confuse later planning steps.

IMHO the problem here is that we produce a wrong estimate, so we had
better fix that, instead of adding band-aids to other places.

This happens because functional dependencies are a very simple type of
statistics - they have some expectations about the input data and also
the queries executed on it. For example, they assume the data is
reasonably homogeneous, so that we can calculate a global "degree".

But the data in the example directly violates that - it has 1000 rows
that are very random (so we'd find no dependencies), and 1000 rows with
perfect dependencies. Hence we end up with degree=0.5, which applies the
dependencies to all the data. Not great, true, but that's the price for
the simplicity of this kind of statistics.
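
You can inspect the degrees this statistics kind computed, e.g. for the
statistics object from Tom's script:

SELECT statistics_name, dependencies
FROM pg_stats_ext
WHERE statistics_name = 'aestat';
-- on this data set, each combination ends up with a degree of ~0.5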

So the simplest solution is to disable dependencies on such data sets.
It's a bit inconvenient/unfortunate that we build dependencies by
default, and people may not be aware of their assumptions.
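
For example, with Tom's test case one could rebuild the statistics
object without the "dependencies" kind (per Tom's experiment upthread,
ndistinct alone still yields the rows=1 estimate):

DROP STATISTICS aestat;
CREATE STATISTICS aestat (ndistinct) ON x, y, z FROM a;
ANALYZE a;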

Perhaps we could/should make dependency_degree() more pessimistic when
we find some "violations" of the rule (we intentionally are not strict
about it, because data are imperfect). I don't have a good idea how to
change the formulas, but I'm open to the idea in principle.

The other thing we could do is check for unique indexes in
clauselist_selectivity, and if we find a match we can just skip the
extended statistics altogether. Not sure how expensive this would be,
but for typical cases (with a modest number of indexes) it's perhaps OK.
It wouldn't work without a unique index, but I don't have a better idea.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Postgres picks suboptimal index after building of an extended statistics

From:
Andrei Lepikhov
Date:
Thanks for the detailed answer,

On 3/11/2023 00:37, Tomas Vondra wrote:
> On 9/25/23 06:30, Andrey Lepikhov wrote:
>> ...
>> I can't stop thinking about this issue. It is bizarre that Postgres
>> chooses a non-unique index when a unique index gives us proof of minimum
>> scan.
> That's true, but no one has implemented this heuristic. So the "proof of
> minimum scan" is merely hypothetical - the optimizer is unaware of it.

See the simple patch in the attachment. There, I have attempted to 
resolve situations of uncertainty to avoid making decisions based solely 
on the order of indexes in the list.

>> I don't see a reason to teach the clauselist_selectivity() routine to
>> estimate UNIQUE indexes: we would add some cycles, and it would work
>> with btree indexes only.
> I'm not sure I understand what this is meant to say. Can you elaborate?
> We only allow UNIQUE for btree indexes anyway, so what exactly is the
> problem here?

You partly answered this yourself below: we would have unique-index
estimation in a few estimation calls, but would go through the list of
indexes each time.
Also, for this sake, we would have to add some input parameters, usually
NULL, because many estimations don't involve indexes at all.

>> Maybe we could change compare_path_costs_fuzzily() and add some
>> heuristic, for example:
>> "If the selectivity of both paths gives us no more than 1 row, prefer to
>> use a unique index or an index with the least selectivity."
> I don't understand how this would work. What do you mean by "selectivity
> of a path"? AFAICS the problem here is that we estimate a path to return
> more rows (while we know there can only be 1, thanks to the UNIQUE index).

Oops, I meant cardinality. See the patch in the attachment.

> So how would you know the path does not give us more than 1 row? Surely
> you're not proposing compare_path_costs_fuzzily() to do something
> expensive like analyzing the paths, or something.

I merely propose to make the optimizer more consistent in its decisions:
if we have one row for both path candidates, use the uniqueness of the
index or the selectivity value as one more parameter.

> Also, how would it work in upper levels? If you just change which path
> we keep, but leave the inaccurate row estimate in place, that may fix
> that level, but it's certainly going to confuse later planning steps.
It is designed for the scan level only.
> IMHO the problem here is that we produce a wrong estimate, so we had
> better fix that, instead of adding band-aids to other places.

Agreed. I am looking for a solution that would help users somehow
resolve such problems. As an alternative solution, I can propose a
selectivity hook or (maybe even better) the pathlist approach: add
indexes into the index list in some predefined order - placing unique
indexes with more columns at the first positions, etc.

> This happens because functional dependencies are a very simple type of
> statistics - they have some expectations about the input data and also
> the queries executed on it. For example, they assume the data is
> reasonably homogeneous, so that we can calculate a global "degree".
> 
> But the data in the example directly violates that - it has 1000 rows
> that are very random (so we'd find no dependencies), and 1000 rows with
> perfect dependencies. Hence we end up with degree=0.5, which applies the
> dependencies to all the data. Not great, true, but that's the price for
> the simplicity of this kind of statistics.
> 
> So the simplest solution is to disable dependencies on such data sets.
> It's a bit inconvenient/unfortunate that we build dependencies by
> default, and people may not be aware of their assumptions.
> 
> Perhaps we could/should make dependency_degree() more pessimistic when
> we find some "violations" of the rule (we intentionally are not strict
> about it, because data are imperfect). I don't have a good idea how to
> change the formulas, but I'm open to the idea in principle.

Thanks for the explanation!

> The other thing we could do is check for unique indexes in
> clauselist_selectivity, and if we find a match we can just skip the
> extended statistics altogether. Not sure how expensive this would be,
> but for typical cases (with a modest number of indexes) it's perhaps OK.
> It wouldn't work without a unique index, but I don't have a better idea.
It looks a bit expensive to me. But I am ready to try it if the current
solution doesn't look applicable.

-- 
regards,
Andrei Lepikhov
Postgres Professional

Attachments

Re: Postgres picks suboptimal index after building of an extended statistics

From:
Andrei Lepikhov
Date:
Here is the second version of the patch; it resolves a non-symmetrical
decision, thanks to Teodor Sigaev's review.


-- 
regards,
Andrei Lepikhov
Postgres Professional

Attachments

Re: Postgres picks suboptimal index after building of an extended statistics

From:
Alexander Korotkov
Date:
Hi!

I'd like to get this subject off the ground.  The problem originally described in [1] obviously comes from wrong selectivity estimation: "Dependencies" extended statistics lead to a significant selectivity miss, 24/1000 instead of 1/1000.  When the estimation is correct, the PostgreSQL optimizer is capable of choosing the appropriate unique index for the query.

Tom pointed out in [2] that this might be a problem of "Dependencies" extended statistics.  I've rechecked this.  The dataset consists of two parts.  The first part, defined in the CREATE TABLE statement, contains independent values.  The second part, defined in the INSERT statement, contains dependent values.  "Dependencies" extended statistics estimate the dependency degree as the fraction of rows containing dependent values.  According to this definition, they correctly estimate the dependency degree as about 0.5 for all the combinations.  So, the error in the estimate comes from the synergy of two factors: MCV estimation of the frequency of individual values, and the spreading of the average dependency degree onto values for which it is not relevant.  (For reference, the docs describe how a functional dependency with degree f is applied: P(a=1,b=1) = P(a=1) * [f + (1-f) * P(b=1)]; that is how a global f of about 0.5 leaks into estimates for the independent part of the data.)  I don't think there is a way to fix "dependencies" extended statistics, because they work exactly as designed.  I have to note that instead of fixing "dependencies" extended statistics, you can just add multi-column MCV statistics, which would fix the problem.

CREATE STATISTICS aestat(dependencies,ndistinct,mcv) ON x,y,z FROM a;
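
After re-analyzing, the estimate on Tom's test case should be back to a
single row, making the unique index attractive again:

ANALYZE a;
EXPLAIN SELECT * FROM a WHERE x=1 AND y=1 AND z=1;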

Independently of how well our statistics work, it looks pitiful that we couldn't fix that using the knowledge of unique constraints on the table.  Unlike statistics, which give us just more or less accurate estimates, a constraint is something we really enforce and thus can rely on.  The patches provided by Andrei in [1], [3], and [4] fix this issue at the index scan path level.  As Tomas pointed out in [5], that could lead to an inconsistency between the number of rows used for unique index scan estimation and the value displayed in EXPLAIN (and used for other paths).  Even though this was debated in [6], I can confirm this inconsistency.  Thus, with the patch published in [4], I can see the 28-row estimation with a unique index scan.

                              QUERY PLAN
-----------------------------------------------------------------------
 Index Only Scan using a_pkey on a  (cost=0.28..8.30 rows=28 width=12)
   Index Cond: ((x = 1) AND (y = 1) AND (z = 1))
(2 rows)

Also, there is a set of patches [7], [8], and [9], which makes the optimizer consider path selectivity as well as path costs during path selection.  I've rechecked that none of these patches could resolve the original problem described in [1].  Also, I think they are quite tricky.  The model of our optimizer assumes that paths in the list should be different ways of getting the same result.  If we choose the paths by their selectivity, that breaks this model.  I don't say there is no way for this.  But if we do this, that would require significant rethinking of our optimizer model and possibly a revision of a significant part of it.  Anyway, I think if there is still interest in this, that should be moved into a separate thread to keep this thread focused on the problem described in [1].

Finally, I'd like to note that the issue described in [1] is mostly the selectivity estimation problem.  It could be solved by adding the multi-column MCV statistics.  The patches published so far look more like hacks for particular use cases rather than appropriate solutions.  It still looks promising to me to use the knowledge of unique constraints during selectivity estimation [10].  Even though it's hard to implement and possibly implies some overhead, it fits the current model.  I also think unique constraints could probably be used in some way to improve estimates even when there is no full match.

Links.
1. https://www.postgresql.org/message-id/0ca4553c-1f34-12ba-9122-44199d1ced41%40postgrespro.ru
2. https://www.postgresql.org/message-id/3119052.1657231656%40sss.pgh.pa.us
3. https://www.postgresql.org/message-id/90a1d6ef-c777-b95d-9f77-0065ad4522df%40postgrespro.ru
4. https://www.postgresql.org/message-id/a5a18d86-c0e5-0ceb-9a18-be1beb2d2944%40postgrespro.ru
5. https://www.postgresql.org/message-id/f8044836-5d61-a4e0-af82-5821a2a1f0a7%40enterprisedb.com
6. https://www.postgresql.org/message-id/90a1d6ef-c777-b95d-9f77-0065ad4522df%40postgrespro.ru
7. https://www.postgresql.org/message-id/2df148b5-0bb8-f80b-ac03-251682fab585%40postgrespro.ru
8. https://www.postgresql.org/message-id/6fb43191-2df3-4791-b307-be754e648276%40postgrespro.ru
9. https://www.postgresql.org/message-id/154f786a-06a0-4fb1-b8a4-16c66149731b%40postgrespro.ru
10. https://www.postgresql.org/message-id/f8044836-5d61-a4e0-af82-5821a2a1f0a7%40enterprisedb.com

------
Regards,
Alexander Korotkov

Re: Postgres picks suboptimal index after building of an extended statistics

From:
Andrei Lepikhov
Date:
On 18/12/2023 15:29, Alexander Korotkov wrote:
> Also, there is a set of patches [7], [8], and [9], which makes the 
> optimizer consider path selectivity as well as path costs during path 
> selection.  I've rechecked that none of these patches could resolve 
> the original problem described in [1].
It is true. We accidentally mixed two different problems in one thread.
>  Also, I think they are quite 
> tricky.  The model of our optimizer assumes that paths in the list 
> should be different ways of getting the same result.  If we choose 
> the paths by their selectivity, that breaks this model.  I don't say 
> there is no way for this.  But if we do this, that would require 
> significant rethinking of our optimizer model and possibly a revision of a 
> significant part of it.
I can't understand that. In [9] we just elaborate the COSTS_EQUAL case
and establish the final decision on a more stable basis than the casual
order of indexes in the list.
>  Anyway, I think if there is still interest in 
> this, that should be moved into a separate thread to keep this thread 
> focused on the problem described in [1].
Agreed. IMO, the problem of the optimizer's dependence on the order of
indexes in the relation's index list is more urgent for now.
> 
> Finally, I'd like to note that the issue described in [1] is mostly the 
> selectivity estimation problem.  It could be solved by adding the 
> multi-column MCV statistics.  The patches published so far look more 
> like hacks for particular use cases rather than appropriate solutions.  
> It still looks promising to me to use the knowledge of unique 
> constraints during selectivity estimation [10].  Even though it's hard 
> to implement and possibly implies some overhead, it fits the current 
> model.  I also think unique constraints could probably be used in some way 
> to improve estimates even when there is no full match.
I have tried to use the knowledge about unique indexes in the 
selectivity estimation routine. But it looks invasive and adds a lot of 
overhead.

-- 
regards,
Andrei Lepikhov
Postgres Professional




Re: Postgres picks suboptimal index after building of an extended statistics

From:
Alexander Korotkov
Date:
On Thu, Dec 21, 2023 at 10:41 AM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
>
> On 18/12/2023 15:29, Alexander Korotkov wrote:
> > Also, there is a set of patches [7], [8], and [9], which makes the
> > optimizer consider path selectivity as well as path costs during path
> > selection.  I've rechecked that none of these patches could resolve
> > the original problem described in [1].
> It is true. We accidentally mixed two different problems in one thread.
> >  Also, I think they are quite
> > tricky.  The model of our optimizer assumes that paths in the list
> > should be different ways of getting the same result.  If we choose
> > the paths by their selectivity, that breaks this model.  I don't say
> > there is no way for this.  But if we do this, that would require
> > significant rethinking of our optimizer model and possibly a revision of a
> > significant part of it.
> I can't understand that. In [9] we just elaborate the COSTS_EQUAL case
> and establish the final decision on a more stable basis than the casual
> order of indexes in the list.

I took a closer look at the patch in [9].  I should drop my argument
about breaking the model, because add_path() already considers other
aspects than just costs.  But I have two more notes about that patch:

1) It seems that you're determining that the index path should return
strictly one row by checking path->rows <= 1.0 and indexinfo->unique.
Is it really guaranteed that in this case the quals match the unique
constraint?  path->rows <= 1.0 could be just an estimation error.  Or
one row could be correctly estimated, but it would be selected by some
quals matching the unique constraint plus other quals applied in
recheck.  So, it seems there is a risk of selecting a suboptimal index
due to this condition.

2) Even for non-unique indexes this patch is putting new logic on top
of the subsequent code.  How can we prove it's going to be a win?
That could lead, for instance, to dropping parallel-safe paths in
cases where we didn't do so before.

Anyway, please start a separate thread if you're willing to put more
work into this.

> >  Anyway, I think if there is still interest in
> > this, that should be moved into a separate thread to keep this thread
> > focused on the problem described in [1].
> Agreed. IMO, the problem of the optimizer's dependence on the order of
> indexes in the relation's index list is more urgent for now.
> >
> > Finally, I'd like to note that the issue described in [1] is mostly the
> > selectivity estimation problem.  It could be solved by adding the
> > multi-column MCV statistics.  The patches published so far look more
> > like hacks for particular use cases rather than appropriate solutions.
> > It still looks promising to me to use the knowledge of unique
> > constraints during selectivity estimation [10].  Even though it's hard
> > to implement and possibly implies some overhead, it fits the current
> > model.  I also think unique constraints could probably be used in some way
> > to improve estimates even when there is no full match.
> I have tried to use the knowledge about unique indexes in the
> selectivity estimation routine. But it looks invasive and adds a lot of
> overhead.

I got it.  But that doesn't look like enough to decide there is no way.
Could you, please, share some of your results?  It might happen that we
just need to rework some of the data structures to make this information
more easily accessible at the selectivity estimation stage.

------
Regards,
Alexander Korotkov