Обсуждение: LIKE op with B-Tree Index?

Поиск
Список
Период
Сортировка

LIKE op with B-Tree Index?

От
"Sam Wong"
Дата:
Hi communities,

I am investigating a performance issue involved with LIKE 'xxxx%' on an
index in a complex query with joins.

The problem boils down into this simple scenario---:
====Scenario====
My database locale is C, using UTF-8 encoding. I tested this on 9.1.6 and 9.
2.1.

Q1.
SELECT * FROM shipments WHERE shipment_id LIKE '12345678%'
  
Q2.
SELECT * FROM shipments WHERE shipment_id >= '12345678' AND shipment_id <
'12345679'

shipments is a table with million rows and 20 columns. Shipment_id is the
primary key with text and non-null field.

CREATE TABLE cod.shipments
(
  shipment_id text NOT NULL,
  -- other columns omitted
  CONSTRAINT shipments_pkey PRIMARY KEY (shipment_id)
)

Analyze Q1 gives this:
Index Scan using shipments_pkey on shipments  (cost=0.00..39.84 rows=1450
width=294) (actual time=0.018..0.018 rows=1 loops=1)
  Index Cond: ((shipment_id >= '12345678'::text) AND (shipment_id <
'12345679'::text))
  Filter: (shipment_id ~~ '12345678%'::text)
  Buffers: shared hit=4

Analyze Q2 gives this:
Index Scan using shipments_pkey on shipments  (cost=0.00..39.83 rows=1
width=294) (actual time=0.027..0.027 rows=1 loops=1)
  Index Cond: ((shipment_id >= '12345678'::text) AND (shipment_id <
'12345679'::text))
  Buffers: shared hit=4

====Problem Description====
In Q1, the planner thought there will be 1450 rows, and Q2 gave a much
better estimate of 1.
The problem is when I combine such condition with a join to other table,
postgres will prefer a merge join (or hash) rather than a nested loop.

====Question====
Is Q1 and Q2 equivalent? From what I see and the result they seems to be the
same, or did I miss something? (Charset: C, Encoding: UTF-8)
If they are equivalent, is that a bug of the planner?

Many Thanks,
Sam



Re: LIKE op with B-Tree Index?

От
Merlin Moncure
Дата:
On Tue, Oct 16, 2012 at 3:15 AM, Sam Wong <sam@hellosam.net> wrote:
> Hi communities,
>
> I am investigating a performance issue involved with LIKE 'xxxx%' on an
> index in a complex query with joins.
>
> The problem boils down into this simple scenario---:
> ====Scenario====
> My database locale is C, using UTF-8 encoding. I tested this on 9.1.6 and 9.
> 2.1.
>
> Q1.
> SELECT * FROM shipments WHERE shipment_id LIKE '12345678%'
>
> Q2.
> SELECT * FROM shipments WHERE shipment_id >= '12345678' AND shipment_id <
> '12345679'
>
> shipments is a table with million rows and 20 columns. Shipment_id is the
> primary key with text and non-null field.
>
> CREATE TABLE cod.shipments
> (
>   shipment_id text NOT NULL,
>   -- other columns omitted
>   CONSTRAINT shipments_pkey PRIMARY KEY (shipment_id)
> )
>
> Analyze Q1 gives this:
> Index Scan using shipments_pkey on shipments  (cost=0.00..39.84 rows=1450
> width=294) (actual time=0.018..0.018 rows=1 loops=1)
>   Index Cond: ((shipment_id >= '12345678'::text) AND (shipment_id <
> '12345679'::text))
>   Filter: (shipment_id ~~ '12345678%'::text)
>   Buffers: shared hit=4
>
> Analyze Q2 gives this:
> Index Scan using shipments_pkey on shipments  (cost=0.00..39.83 rows=1
> width=294) (actual time=0.027..0.027 rows=1 loops=1)
>   Index Cond: ((shipment_id >= '12345678'::text) AND (shipment_id <
> '12345679'::text))
>   Buffers: shared hit=4
>
> ====Problem Description====
> In Q1, the planner thought there will be 1450 rows, and Q2 gave a much
> better estimate of 1.
> The problem is when I combine such condition with a join to other table,
> postgres will prefer a merge join (or hash) rather than a nested loop.
>
> ====Question====
> Is Q1 and Q2 equivalent? From what I see and the result they seems to be the
> same, or did I miss something? (Charset: C, Encoding: UTF-8)
> If they are equivalent, is that a bug of the planner?

They are most certainly not equivalent.  What if the shipping_id is 12345678Z?

merlin


Re: LIKE op with B-Tree Index?

От
"Sam Wong"
Дата:
> On Wednesday, October 17, 2012 4:30, Merlin Moncure wrote,
>
> On Tue, Oct 16, 2012 at 3:15 AM, Sam Wong <sam@hellosam.net> wrote:
> > Hi communities,
> >
> > I am investigating a performance issue involved with LIKE 'xxxx%' on
> > an index in a complex query with joins.
> >
> > The problem boils down into this simple scenario---:
> > ====Scenario====
> > My database locale is C, using UTF-8 encoding. I tested this on 9.1.6
and 9.
> > 2.1.
> >
> > Q1.
> > SELECT * FROM shipments WHERE shipment_id LIKE '12345678%'
> >
> > Q2.
> > SELECT * FROM shipments WHERE shipment_id >= '12345678' AND
> > shipment_id < '12345679'
> >
> > ...snip...
> >
> > ====Question====
> > Is Q1 and Q2 equivalent? From what I see and the result they seems to
> > be the same, or did I miss something? (Charset: C, Encoding: UTF-8) If
> > they are equivalent, is that a bug of the planner?
>
> They are most certainly not equivalent.  What if the shipping_id is
> 12345678Z?
>
> merlin
>
But '12345678Z' is indeed >= '12345678' AND < '12345679'. Just like 'apple'
< 'apples' < 'apply' in a dictionary.

A quick test:
vitalink=# select * from ss;
    id
-----------
 12345678
 12345678Z
 12345679
(3 rows)

vitalink=# select * from ss  WHERE id >= '12345678' AND id < '12345679';
    id
-----------
 12345678
 12345678Z
(2 rows)

Sam



Re: LIKE op with B-Tree Index?

От
Merlin Moncure
Дата:
On Tue, Oct 16, 2012 at 8:01 PM, Sam Wong <sam@hellosam.net> wrote:
>> On Wednesday, October 17, 2012 4:30, Merlin Moncure wrote,
>>
>> On Tue, Oct 16, 2012 at 3:15 AM, Sam Wong <sam@hellosam.net> wrote:
>> > Hi communities,
>> >
>> > I am investigating a performance issue involved with LIKE 'xxxx%' on
>> > an index in a complex query with joins.
>> >
>> > The problem boils down into this simple scenario---:
>> > ====Scenario====
>> > My database locale is C, using UTF-8 encoding. I tested this on 9.1.6
> and 9.
>> > 2.1.
>> >
>> > Q1.
>> > SELECT * FROM shipments WHERE shipment_id LIKE '12345678%'
>> >
>> > Q2.
>> > SELECT * FROM shipments WHERE shipment_id >= '12345678' AND
>> > shipment_id < '12345679'
>> >
>> > ...snip...
>> >
>> > ====Question====
>> > Is Q1 and Q2 equivalent? From what I see and the result they seems to
>> > be the same, or did I miss something? (Charset: C, Encoding: UTF-8) If
>> > they are equivalent, is that a bug of the planner?
>>
>> They are most certainly not equivalent.  What if the shipping_id is
>> 12345678Z?
>>
>> merlin
>>
> But '12345678Z' is indeed >= '12345678' AND < '12345679'. Just like 'apple'
> < 'apples' < 'apply' in a dictionary.

Right -- I didn't visualize it properly.  Still, you're asking the
server to infer that since you're looking between to adjacent textual
characters range bounded [) it convert the 'between' to a partial
string search.   That hold up logically but probably isn't worth
spending cycles to do, particularly in cases of non-ascii mappable
unicode characters.

merlin


Re: LIKE op with B-Tree Index?

От
"Sam Wong"
Дата:
> Moncure wrote on Thursday, October 18, 2012 1:45
> On Tue, Oct 16, 2012 at 8:01 PM, Sam Wong <sam@hellosam.net> wrote:
> >> On Wednesday, October 17, 2012 4:30, Merlin Moncure wrote,
> >>
> >> On Tue, Oct 16, 2012 at 3:15 AM, Sam Wong <sam@hellosam.net> wrote:
> >> > Hi communities,
> >> >
> >> > I am investigating a performance issue involved with LIKE 'xxxx%'
> >> > on an index in a complex query with joins.
> >> >
> >> > The problem boils down into this simple scenario---:
> >> > ====Scenario====
> >> > My database locale is C, using UTF-8 encoding. I tested this on
> >> > 9.1.6
> > and 9.
> >> > 2.1.
> >> >
> >> > Q1.
> >> > SELECT * FROM shipments WHERE shipment_id LIKE '12345678%'
> >> >
> >> > Q2.
> >> > SELECT * FROM shipments WHERE shipment_id >= '12345678' AND
> >> > shipment_id < '12345679'
> >> >
> >> > ...snip...
> >> >
> >> > ====Question====
> >> > Is Q1 and Q2 equivalent? From what I see and the result they seems
> >> > to be the same, or did I miss something? (Charset: C, Encoding:
> >> > UTF-8) If they are equivalent, is that a bug of the planner?
> >>
> >> They are most certainly not equivalent.  What if the shipping_id is
> >> 12345678Z?
> >>
> >> merlin
> >>
> > But '12345678Z' is indeed >= '12345678' AND < '12345679'. Just like
'apple'
> > < 'apples' < 'apply' in a dictionary.
>
> Right -- I didn't visualize it properly.  Still, you're asking the server
to infer that
> since you're looking between to adjacent textual characters range bounded
[) it
> convert the 'between' to a partial
> string search.   That hold up logically but probably isn't worth
> spending cycles to do, particularly in cases of non-ascii mappable unicode
> characters.
> merlin

Postgresql did that already. Refer to the analyze result of Q1 and Q2, it
gives
"Index Cond: ((shipment_id >= '12345678'::text) AND (shipment_id <
'12345679'::text))"
(I also just realized they did it just now)

Yet, with additional Filter (ref Q1 analyze), it's surprisingly that it
estimates Q1 will have more rows that Q2.

FYI, I made a self-contained test case and submitted a bug #7610.



Re: LIKE op with B-Tree Index?

От
"Albe Laurenz"
Дата:
Sam Wong wrote:
>>>>> I am investigating a performance issue involved with LIKE 'xxxx%'
>>>>> on an index in a complex query with joins.

>>>>> Q1.
>>>>> SELECT * FROM shipments WHERE shipment_id LIKE '12345678%'
>>>>>
>>>>> Q2.
>>>>> SELECT * FROM shipments WHERE shipment_id >= '12345678' AND
>>>>> shipment_id < '12345679'

[Q1 and Q2 have different row estimates]

Merlin wrote:
>> Right -- I didn't visualize it properly.  Still, you're asking
>> the server to infer that
>> since you're looking between to adjacent textual characters range
bounded
>> [) it convert the 'between' to a partial
>> string search.   That hold up logically but probably isn't worth
>> spending cycles to do, particularly in cases of non-ascii mappable
unicode
>> characters.

> Postgresql did that already. Refer to the analyze result of Q1 and Q2,
it
> gives
> "Index Cond: ((shipment_id >= '12345678'::text) AND (shipment_id <
> '12345679'::text))"
> (I also just realized they did it just now)
>
> Yet, with additional Filter (ref Q1 analyze), it's surprisingly that
it
> estimates Q1 will have more rows that Q2.
>
> FYI, I made a self-contained test case and submitted a bug #7610.

Did you try to increase the statistics for column "shipment_id"?

This will probably not make the difference go away, but
if the estimate gets better, it might be good enough for
the planner to pick the correct plan.

Yours,
Laurenz Albe