Thread: surprising query optimisation

surprising query optimisation

From: Chris Withers
Date:
Hi All,

We have an app that deals with a lot of queries, and we've been slowly 
seeing performance issues emerge. We take a lot of free form queries 
from users and stumbled upon a very surprising optimisation.

So, we have a 'state' column which is a 3 character string column with 
an index on it. Despite being a string, this column is only used to 
store one of three values: 'NEW', 'ACK', or 'RSV'.

One of our most common queries clauses is "state!='RSV'" and we've found 
that by substituting this clause with "state='ACK' or state='NEW'" 
wherever it was used, we've dropped the postgres server's load average 
from 20 down to 4 and the CPU usage from 60% in user space down to <5%.

This seems counter-intuitive to me, so thought I'd ask here. Why would 
this be likely to make such a difference? We're currently on 9.4, is 
this something that's likely to be different (better? worse?) if we got 
all the way up to 10 or 11?

cheers,

Chris



Re: surprising query optimisation

From: Adrian Klaver
Date:
On 11/28/18 2:26 PM, Chris Withers wrote:
> Hi All,
> 
> We have an app that deals with a lot of queries, and we've been slowly 
> seeing performance issues emerge. We take a lot of free form queries 
> from users and stumbled upon a very surprising optimisation.
> 
> So, we have a 'state' column which is a 3 character string column with 
> an index on it. Despite being a string, this column is only used to 
> store one of three values: 'NEW', 'ACK', or 'RSV'.
> 
> One of our most common queries clauses is "state!='RSV'" and we've found 
> that by substituting this clause with "state='ACK' or state='NEW'" 
> wherever it was used, we've dropped the postgres server's load average 
> from 20 down to 4 and the CPU usage from 60% in user space down to <5%.
> 
> This seems counter-intuitive to me, so thought I'd ask here. Why would 

The way I see it, state = 'something' is a confined question; state != 
'something' is potentially unbounded.

Does EXPLAIN ANALYZE shed any light?
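
For instance, for both forms of the query (table name here is just a 
placeholder, substitute one of your real queries):

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM your_table WHERE state != 'RSV';

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM your_table WHERE state = 'ACK' OR state = 'NEW';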

> this be likely to make such a difference? We're currently on 9.4, is 
> this something that's likely to be different (better? worse?) if we got 
> all the way up to 10 or 11?
> 
> cheers,
> 
> Chris
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: surprising query optimisation

From: Gavin Flower
Date:
On 29/11/2018 11:26, Chris Withers wrote:
> Hi All,
>
> We have an app that deals with a lot of queries, and we've been slowly 
> seeing performance issues emerge. We take a lot of free form queries 
> from users and stumbled upon a very surprising optimisation.
>
> So, we have a 'state' column which is a 3 character string column with 
> an index on it. Despite being a string, this column is only used to 
> store one of three values: 'NEW', 'ACK', or 'RSV'.
>
> One of our most common queries clauses is "state!='RSV'" and we've 
> found that by substituting this clause with "state='ACK' or 
> state='NEW'" wherever it was used, we've dropped the postgres server's 
> load average from 20 down to 4 and the CPU usage from 60% in user 
> space down to <5%.
>
> This seems counter-intuitive to me, so thought I'd ask here. Why would 
> this be likely to make such a difference? We're currently on 9.4, is 
> this something that's likely to be different (better? worse?) if we 
> got all the way up to 10 or 11?
>
> cheers,
>
> Chris
>
>
At a guess...

     "state!='RSV'"  ==> pg only has to check one value

and

     "state='ACK' or state='NEW'"   ==> pg has to check two values

so I would expect the '!=' to be faster.


Cheers,
Gavin



Re: surprising query optimisation

From: Stephen Frost
Date:
Greetings,

* Chris Withers (chris@withers.org) wrote:
> We have an app that deals with a lot of queries, and we've been slowly
> seeing performance issues emerge. We take a lot of free form queries from
> users and stumbled upon a very surprising optimisation.
>
> So, we have a 'state' column which is a 3 character string column with an
> index on it. Despite being a string, this column is only used to store one
> of three values: 'NEW', 'ACK', or 'RSV'.

Sounds like a horrible field to have an index on.

> One of our most common queries clauses is "state!='RSV'" and we've found
> that by substituting this clause with "state='ACK' or state='NEW'" wherever
> it was used, we've dropped the postgres server's load average from 20 down
> to 4 and the CPU usage from 60% in user space down to <5%.

You've changed the question you're asking the database.  PG doesn't
*know* that there's only those three values, but it probably has a
pretty good guess about how many records are ACK and how many are NEW
thanks to those being in the MCV list.
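
You can see what the planner thinks it knows about the column with 
something along these lines (placeholder table name):

    SELECT n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'your_table' AND attname = 'state';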

> This seems counter-intuitive to me, so thought I'd ask here. Why would this
> be likely to make such a difference? We're currently on 9.4, is this
> something that's likely to be different (better? worse?) if we got all the
> way up to 10 or 11?

When you change what you're asking, PG is going to change how it gives
you the answer and sometimes that'll be faster and other times it won't
be.

Really though, if you want something more than wild speculation, posting
the 'explain analyze' of each query along with the actual table
definitions and sizes and such would be the best way to get it.

I'd suggest you check out the wiki article written about this kind of
question:

https://wiki.postgresql.org/wiki/Slow_Query_Questions

Thanks!

Stephen


Re: surprising query optimisation

From: Chris Withers
Date:
On 28/11/2018 22:49, Stephen Frost wrote:
> * Chris Withers (chris@withers.org) wrote:
>> We have an app that deals with a lot of queries, and we've been slowly
>> seeing performance issues emerge. We take a lot of free form queries from
>> users and stumbled upon a very surprising optimisation.
>>
>> So, we have a 'state' column which is a 3 character string column with an
>> index on it. Despite being a string, this column is only used to store one
>> of three values: 'NEW', 'ACK', or 'RSV'.
> 
> Sounds like a horrible field to have an index on.

That's counter-intuitive for me. What leads you to say this and what 
would you do/recommend instead?

> Really though, if you want something more than wild speculation, posting
> the 'explain analyze' of each query along with the actual table
> definitions and sizes and such would be the best way to get it.

table definition:

# \d alerts_alert
               Table "public.alerts_alert"
      Column      |           Type           | Modifiers
-----------------+--------------------------+-----------
  tags            | jsonb                    | not null
  id              | character varying(86)    | not null
  earliest_seen   | timestamp with time zone | not null
  latest_seen     | timestamp with time zone | not null
  created         | timestamp with time zone | not null
  modified        | timestamp with time zone | not null
  type            | character varying(300)   | not null
  state           | character varying(3)     | not null
  until           | timestamp with time zone |
  latest_note     | text                     | not null
  created_by_id   | integer                  | not null
  modified_by_id  | integer                  | not null
  owner_id        | integer                  |
  owning_group_id | integer                  | not null
  latest_new      | timestamp with time zone | not null
Indexes:
     "alerts_alert_pkey" PRIMARY KEY, btree (id)
     "alert_tags_index" gin (tags)
     "alerts_alert_1efacf1d" btree (latest_seen)
     "alerts_alert_3103a7d8" btree (until)
     "alerts_alert_599dcce2" btree (type)
     "alerts_alert_5e7b1936" btree (owner_id)
     "alerts_alert_9ae73c65" btree (modified)
     "alerts_alert_9ed39e2e" btree (state)
     "alerts_alert_b3da0983" btree (modified_by_id)
     "alerts_alert_c5151f5a" btree (earliest_seen)
     "alerts_alert_e2fa5388" btree (created)
     "alerts_alert_e93cb7eb" btree (created_by_id)
     "alerts_alert_efea2d76" btree (owning_group_id)
     "alerts_alert_id_13155e16_like" btree (id varchar_pattern_ops)
     "alerts_alert_latest_new_e8d1fbde_uniq" btree (latest_new)
     "alerts_alert_state_90ab480b_like" btree (state varchar_pattern_ops)
     "alerts_alert_type_3021f46f_like" btree (type varchar_pattern_ops)
Foreign-key constraints:
     "alerts_alert_created_by_id_520608c0_fk_alerts_user_id" FOREIGN KEY 
(created_by_id) REFERENCES alerts_user(id) DEFERRABLE INITIALLY DEFERRED
     "alerts_alert_modified_by_id_6db4b04b_fk_alerts_user_id" FOREIGN 
KEY (modified_by_id) REFERENCES alerts_user(id) DEFERRABLE INITIALLY 
DEFERRED
     "alerts_alert_owner_id_0c00548a_fk_alerts_user_id" FOREIGN KEY 
(owner_id) REFERENCES alerts_user(id) DEFERRABLE INITIALLY DEFERRED
     "alerts_alert_owning_group_id_a4869b66_fk_auth_group_id" FOREIGN 
KEY (owning_group_id) REFERENCES auth_group(id) DEFERRABLE INITIALLY 
DEFERRED
Referenced by:
     TABLE "alerts_alertevent" CONSTRAINT 
"alerts_alertevent_alert_id_edd734b8_fk_alerts_alert_id" FOREIGN KEY 
(alert_id) REFERENCES alerts_alert(id) DEFERRABLE INITIALLY DEFERRED

Row counts by state:

# select state, count(*) from alerts_alert group by 1 order by 1;
  state |  count
-------+---------
  ACK   |    1053
  NEW   |    1958
  RSV   | 1528623
(3 rows)

here's an example of the "bad" query plan:
https://explain.depesz.com/s/cDkp

here's an example with all the "state!='RSV'" clauses rewritten as I 
described:
https://explain.depesz.com/s/B9Xi

> I'd suggest you check out the wiki article written about this kind of
> question:
> 
> https://wiki.postgresql.org/wiki/Slow_Query_Questions

Thanks!

Chris


Re: surprising query optimisation

From: Stephen Frost
Date:
Greetings,

On Fri, Nov 30, 2018 at 07:52 Chris Withers <chris@withers.org> wrote:
> Row counts by state:
>
> # select state, count(*) from alerts_alert group by 1 order by 1;
>   state |  count
> -------+---------
>   ACK   |    1053
>   NEW   |    1958
>   RSV   | 1528623
> (3 rows)
>
> here's an example of the "bad" query plan:
> https://explain.depesz.com/s/cDkp
>
> here's an example with all the "state!='RSV'" clauses rewritten as I
> described:
> https://explain.depesz.com/s/B9Xi

Have you tried a partial index on state!=‘RSV’?

Thanks,

Stephen

Re: surprising query optimisation

From: Chris Withers
Date:
On 30/11/2018 12:55, Stephen Frost wrote:
>      > I'd suggest you check out the wiki article written about this kind of
>      > question:
>      >
>      > https://wiki.postgresql.org/wiki/Slow_Query_Questions
> 
> 
> Have you tried a partial index on state!=‘RSV’?

The solution I originally posted, which we can apply easily enough at our 
query generation layer, is working perfectly, but this is good to know for 
next time.

My post here is mainly to try and understand what's going on so I can 
improve my general feel for how to use postgres at its best.

So, why was the query ending up being a big scan rather than some quick 
lookups based on the index?

cheers,

Chris


Re: surprising query optimisation

From: Stephen Frost
Date:
Greetings,

On Fri, Nov 30, 2018 at 08:00 Chris Withers <chris@withers.org> wrote:
> My post here is mainly to try and understand what's going on so I can
> improve my general feel for how to use postgres at its best.
>
> So, why was the query ending up being a big scan rather than some quick
> lookups based on the index?

I thought that was mentioned already, but at least part of the issue is that PG can't just search the index for "the other values" when the clause is a !=, because it wouldn't know what values to search for...  PG doesn't know, with complete certainty, that there are only 3 values.

The partial index is something you should want anyway: that index won't ever be used to search for RSV, because it's cheaper to just scan the table if we are looking for those, and the index will be much, much smaller without that common value being included.
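
Roughly along these lines (the index name is just illustrative):

    CREATE INDEX CONCURRENTLY alerts_alert_state_not_rsv
        ON alerts_alert (state)
        WHERE state != 'RSV';

With that in place, the existing full btree index on state could most 
likely be dropped too.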

Thanks!

Stephen

Re: surprising query optimisation

From: Stephen Frost
Date:
Greetings,

* Chris Withers (chris@withers.org) wrote:
> On 28/11/2018 22:49, Stephen Frost wrote:
> >* Chris Withers (chris@withers.org) wrote:
> >>We have an app that deals with a lot of queries, and we've been slowly
> >>seeing performance issues emerge. We take a lot of free form queries from
> >>users and stumbled upon a very surprising optimisation.
> >>
> >>So, we have a 'state' column which is a 3 character string column with an
> >>index on it. Despite being a string, this column is only used to store one
> >>of three values: 'NEW', 'ACK', or 'RSV'.
> >
> >Sounds like a horrible field to have an index on.
>
> That's counter-intuitive for me. What leads you to say this and what would
> you do/recommend instead?

For this, specifically, it's because you end up with exactly what you
have: a large index with tons of duplicate values.  Indexes are
particularly good when you have high-cardinality fields.  Now, if you
have a skewed index, where there's one popular value and a few much less
popular ones, then that's where you really want a partial index (as I
suggested earlier) so that queries against the non-popular value(s) are
able to use the index and the index is much smaller.

Of course, for this to work you need to set up the partial index
correctly and make sure that your queries end up using it.
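
A quick sanity check, assuming a partial index like the one sketched 
earlier in the thread:

    EXPLAIN
    SELECT * FROM alerts_alert
    WHERE state = 'ACK' OR state = 'NEW';
    -- should show the partial index being scanned, since the planner can
    -- prove this clause implies the index predicate state != 'RSV'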

Thanks!

Stephen


Re: surprising query optimisation

From: Gavin Flower
Date:
On 01/12/2018 04:33, Stephen Frost wrote:
> Greetings,
>
> * Chris Withers (chris@withers.org) wrote:
>> On 28/11/2018 22:49, Stephen Frost wrote:
>>> * Chris Withers (chris@withers.org) wrote:
>>>> We have an app that deals with a lot of queries, and we've been slowly
>>>> seeing performance issues emerge. We take a lot of free form queries from
>>>> users and stumbled upon a very surprising optimisation.
>>>>
>>>> So, we have a 'state' column which is a 3 character string column with an
>>>> index on it. Despite being a string, this column is only used to store one
>>>> of three values: 'NEW', 'ACK', or 'RSV'.
>>> Sounds like a horrible field to have an index on.
>> That's counter-intuitive for me. What leads you to say this and what would
>> you do/recommend instead?
> For this, specifically, it's because you end up with exactly what you
> have: a large index with tons of duplicate values.  Indexes are
> particularly good when you have high-cardinality fields.  Now, if you
> have a skewed index, where there's one popular value and a few much less
> popular ones, then that's where you really want a partial index (as I
> suggest earlier) so that queries against the non-popular value(s) is
> able to use the index and the index is much smaller.
>
> Of course, for this to work you need to set up the partial index
> correctly and make sure that your queries end up using it.
>
> Thanks!
>
> Stephen

An index simply tells pg which blocks to look at (assuming that the index 
itself is not sufficient to satisfy the query), so if using an index 
would still require that pg look at most blocks, then it is cheaper to 
just scan the whole table, rather than load the index and still look at 
all blocks that contain the table data.  I've oversimplified slightly.

An index is best used when using it results in fewer blocks being read 
from disk.

Also the use of RAM is better when the size of the index is kept small.  
For example, having an index on sex for nurses is a waste of time because 
most nurses are female.  However, a partial index (as suggested by 
Stephen, for your query) that includes only males could be useful if you 
have queries looking for male nurses (this assumes male nurses are a very 
small fraction of nurses, such that most data blocks don't have rows for 
male nurses, and the planner knows this).
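
A sketch, using that made-up nurses table:

    -- only rows for male nurses end up in the index, so it stays tiny
    CREATE INDEX nurses_male_idx ON nurses (id) WHERE sex = 'M';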

I once optimised a very complex set of queries that made extensive use of 
indexes.  However, with the knowledge I have today, I would have most 
likely had fewer and smaller indexes.  As I now realize, some of my 
indexes were probably counterproductive, especially as I'd given no 
thought to how much RAM would be required to host the data plus 
indexes!  Fortunately, while the objective was to run all those queries 
within 24 hours, they actually only took a couple of hours.

Chris, I would strongly suggest you read up on the excellent 
documentation pg has about indexes, but don't expect to understand it 
all in one sitting...


Cheers,
Gavin



Re: surprising query optimisation

From: Chris Withers
Date:
On 30/11/2018 15:33, Stephen Frost wrote:
> Greetings,
> 
> * Chris Withers (chris@withers.org) wrote:
>> On 28/11/2018 22:49, Stephen Frost wrote:
>>> * Chris Withers (chris@withers.org) wrote:
>>>> We have an app that deals with a lot of queries, and we've been slowly
>>>> seeing performance issues emerge. We take a lot of free form queries from
>>>> users and stumbled upon a very surprising optimisation.
>>>>
>>>> So, we have a 'state' column which is a 3 character string column with an
>>>> index on it. Despite being a string, this column is only used to store one
>>>> of three values: 'NEW', 'ACK', or 'RSV'.
>>>
>>> Sounds like a horrible field to have an index on.
>>
>> That's counter-intuitive for me. What leads you to say this and what would
>> you do/recommend instead?
> 
> For this, specifically, it's because you end up with exactly what you
> have: a large index with tons of duplicate values.  Indexes are
> particularly good when you have high-cardinality fields.  Now, if you
> have a skewed index, where there's one popular value and a few much less
> popular ones, then that's where you really want a partial index (as I
> suggest earlier) so that queries against the non-popular value(s) is
> able to use the index and the index is much smaller.

Interesting! In my head, for some reason, I'd always assumed a btree 
index would break down a char field based on the characters within it. 
Does that never happen?

If I changed this to be an enum field, would != still perform poorly or 
can the query optimiser spot that it's an enum and just look for the 
other options?
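
(For concreteness, the change I have in mind would look roughly like 
this; the type name is just for illustration, and I guess the existing 
varchar_pattern_ops index on state would need dropping first:)

    CREATE TYPE alert_state AS ENUM ('NEW', 'ACK', 'RSV');
    ALTER TABLE alerts_alert
        ALTER COLUMN state TYPE alert_state
        USING state::alert_state;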

cheers,

Chris


Re: surprising query optimisation

From: Chris Withers
Date:
On 30/11/2018 22:10, Gavin Flower wrote:
> 
> I once optimised a very complex set queries that made extensive use of 
> indexes.  However, with the knowledge I have today, I would have most 
> likely had fewer and smaller indexes.  As I now realize, that some of my 
> indexes were probably counter productive, especially as I'd given no 
> thought to how much RAM would be required to host the data plus 
> indexes!  Fortunately, while the objective was to run all those queries 
> within 24 hours, they actually only took a couple of hours.

So, interestingly, this box has 250GB memory in it, and even though I've 
set effective_cache_size to 200GB, I only see 9G of memory being used. 
How can I persuade postgres to keep more in memory?

cheers,

Chris


Re: surprising query optimisation

From: Thomas Kellerer
Date:
Stephen Frost wrote on 30.11.2018 at 14:05:
> PG doesn’t know, with complete certainty, that there’s only 3
> values.
 
Would the optimizer consult a check constraint ensuring that?



Re: surprising query optimisation

From: Thomas Kellerer
Date:
Chris Withers wrote on 05.12.2018 at 12:42:
> So, interestingly, this box has 250GB memory in it, and even though
> I've set effective_cache_size to 200GB, I only see 9G of memory being
> used. How can I persuade postgres to keep more in memory?
effective_cache_size is a hint to the optimizer about how much data can be expected to be cached (by the filesystem).

Only shared_buffers (and work_mem) will actually allocate memory from the operating system.
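
For example (values purely illustrative, not tuning advice; 
shared_buffers requires a restart to take effect):

    -- memory postgres itself allocates for its buffer cache
    ALTER SYSTEM SET shared_buffers = '32GB';
    -- per sort/hash operation, per backend
    ALTER SYSTEM SET work_mem = '64MB';
    -- planner hint only, allocates nothing
    ALTER SYSTEM SET effective_cache_size = '200GB';

The rest of the machine's RAM is still put to use, just by the operating 
system's page cache rather than by the postgres processes, and that is 
what effective_cache_size is meant to describe.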

Thomas


Re: surprising query optimisation

From: Stephen Frost
Date:
Greetings,

* Thomas Kellerer (spam_eater@gmx.net) wrote:
> Stephen Frost wrote on 30.11.2018 at 14:05:
> > PG doesn’t know, with complete certainty, that there’s only 3
> > values.
>
> Would the optimizer consult a check constraint ensuring that?

Not today, I don't believe (haven't looked at the code though, to be
fair), and it seems like it'd be an awful lot of work for a rare
use-case that would be better served by just a partial index...

What we will do today is this: if you ask for a specific value and there's
a CHECK constraint which lists out specific values and that value isn't
among the set, then we'll just skip over the table (One-Time Filter:
false), thanks to constraint exclusion.
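
A rough sketch (note that for a plain, non-partitioned table I believe 
you'd need constraint_exclusion = on rather than the default 'partition' 
for this to kick in; constraint name is just illustrative):

    ALTER TABLE alerts_alert
        ADD CONSTRAINT alerts_alert_state_check
        CHECK (state IN ('NEW', 'ACK', 'RSV'));

    SET constraint_exclusion = on;
    EXPLAIN SELECT * FROM alerts_alert WHERE state = 'XXX';
    -- plan collapses to a Result node with One-Time Filter: false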

Thanks!

Stephen


Re: surprising query optimisation

From: Stephen Frost
Date:
Greetings,

* Chris Withers (chris@withers.org) wrote:
> On 30/11/2018 15:33, Stephen Frost wrote:
> >* Chris Withers (chris@withers.org) wrote:
> >>On 28/11/2018 22:49, Stephen Frost wrote:
> >For this, specifically, it's because you end up with exactly what you
> >have: a large index with tons of duplicate values.  Indexes are
> >particularly good when you have high-cardinality fields.  Now, if you
> >have a skewed index, where there's one popular value and a few much less
> >popular ones, then that's where you really want a partial index (as I
> >suggest earlier) so that queries against the non-popular value(s) is
> >able to use the index and the index is much smaller.
>
> Interesting! In my head, for some reason, I'd always assumed a btree index
> would break down a char field based on the characters within it. Does that
> never happen?

Not sure what you mean by 'break down a char field'.

> If I changed this to be an enum field, would != still perform poorly or can
> the query optimiser spot that it's an enum and just look for the other
> options?

I don't believe we've got any kind of optimization like that today for
enums.

Thanks!

Stephen


Re: surprising query optimisation

From: Chris Withers
Date:
On 05/12/2018 14:38, Stephen Frost wrote:
> Greetings,
> 
> * Chris Withers (chris@withers.org) wrote:
>> On 30/11/2018 15:33, Stephen Frost wrote:
>>> * Chris Withers (chris@withers.org) wrote:
>>>> On 28/11/2018 22:49, Stephen Frost wrote:
>>> For this, specifically, it's because you end up with exactly what you
>>> have: a large index with tons of duplicate values.  Indexes are
>>> particularly good when you have high-cardinality fields.  Now, if you
>>> have a skewed index, where there's one popular value and a few much less
>>> popular ones, then that's where you really want a partial index (as I
>>> suggest earlier) so that queries against the non-popular value(s) is
>>> able to use the index and the index is much smaller.
>>
>> Interesting! In my head, for some reason, I'd always assumed a btree index
>> would break down a char field based on the characters within it. Does that
>> never happen?
> 
> Not sure what you mean by 'break down a char field'.

Rather than breaking into three buckets ('NEW', 'ACK', 'RSV'), a more 
complicated hierarchy ('N', 'NE', 'A', 'AC', etc).

>> If I changed this to be an enum field, would != still perform poorly or can
>> the query optimiser spot that it's an enum and just look for the other
>> options?
> 
> I don't believe we've got any kind of optimization like that today for
> enums.

Good to know. I see query optimisers as magic, and postgres often seems 
to achieve magic results ;-)

Chris


Re: surprising query optimisation

From: Ron
Date:
On 12/05/2018 08:42 AM, Chris Withers wrote:
> On 05/12/2018 14:38, Stephen Frost wrote:
>> Greetings,
>>
>> * Chris Withers (chris@withers.org) wrote:
>>> On 30/11/2018 15:33, Stephen Frost wrote:
>>>> * Chris Withers (chris@withers.org) wrote:
>>>>> On 28/11/2018 22:49, Stephen Frost wrote:
>>>> For this, specifically, it's because you end up with exactly what you
>>>> have: a large index with tons of duplicate values.  Indexes are
>>>> particularly good when you have high-cardinality fields. Now, if you
>>>> have a skewed index, where there's one popular value and a few much less
>>>> popular ones, then that's where you really want a partial index (as I
>>>> suggest earlier) so that queries against the non-popular value(s) is
>>>> able to use the index and the index is much smaller.
>>>
>>> Interesting! In my head, for some reason, I'd always assumed a btree index
>>> would break down a char field based on the characters within it. Does that
>>> never happen?
>>
>> Not sure what you mean by 'break down a char field'.
>
> Rather than breaking into three buckets ('NEW', 'ACK', 'RSV'), a more 
> complicated hierarchy ('N', 'NE', 'A', 'AC', etc).

The b-tree indexes on the legacy RDBMS which I still occasionally fiddle 
with perform key prefix compression in a manner similar to what you refer 
to, but otherwise that's not how b-trees work.

-- 
Angular momentum makes the world go 'round.


Re: surprising query optimisation

From: Stephen Frost
Date:
Greetings,

* Ron (ronljohnsonjr@gmail.com) wrote:
> On 12/05/2018 08:42 AM, Chris Withers wrote:
> >On 05/12/2018 14:38, Stephen Frost wrote:
> >>>>* Chris Withers (chris@withers.org) wrote:
> >>>Interesting! In my head, for some reason, I'd always assumed a btree index
> >>>would break down a char field based on the characters within it. Does that
> >>>never happen?
> >>
> >>Not sure what you mean by 'break down a char field'.
> >
> >Rather than breaking into three buckets ('NEW', 'ACK', 'RSV'), a more
> >complicated hierarchy ('N', 'NE', 'A', 'AC', etc).
>
> The b-tree indexes on legacy RDBMS which I still occasionally fiddle with
> performs key prefix compression in a manner similar to what you refer to,
> but otherwise that's not how b-trees work.

There's been some discussion of prefix compression in PostgreSQL.  Even
with that, though, it hardly seems sensible to have an index which has
tons of duplicates comprising most of the index, and a != would still
have to search the index to make sure there aren't any entries which
need to be returned...

Now, once we get skipping scans, where we'd be able to skip over a large
chunk of the index that's just tons of duplicates without having to
visit everything along the way, maybe having this inefficient index
would "just" take up disk space, but why waste that space?

Thanks!

Stephen
