Discussion: LIKE op with B-Tree Index?
Hi communities,

I am investigating a performance issue involving LIKE 'xxxx%' on an index
in a complex query with joins.

The problem boils down to this simple scenario:

====Scenario====
My database locale is C, using UTF-8 encoding. I tested this on 9.1.6 and
9.2.1.

Q1.
SELECT * FROM shipments WHERE shipment_id LIKE '12345678%'

Q2.
SELECT * FROM shipments WHERE shipment_id >= '12345678' AND shipment_id <
'12345679'

shipments is a table with a million rows and 20 columns. shipment_id is
the primary key, a non-null text column.

CREATE TABLE cod.shipments
(
  shipment_id text NOT NULL,
  -- other columns omitted
  CONSTRAINT shipments_pkey PRIMARY KEY (shipment_id)
)

EXPLAIN ANALYZE of Q1 gives this:
Index Scan using shipments_pkey on shipments  (cost=0.00..39.84 rows=1450
width=294) (actual time=0.018..0.018 rows=1 loops=1)
  Index Cond: ((shipment_id >= '12345678'::text) AND (shipment_id <
'12345679'::text))
  Filter: (shipment_id ~~ '12345678%'::text)
  Buffers: shared hit=4

EXPLAIN ANALYZE of Q2 gives this:
Index Scan using shipments_pkey on shipments  (cost=0.00..39.83 rows=1
width=294) (actual time=0.027..0.027 rows=1 loops=1)
  Index Cond: ((shipment_id >= '12345678'::text) AND (shipment_id <
'12345679'::text))
  Buffers: shared hit=4

====Problem Description====
In Q1, the planner thought there would be 1450 rows, while Q2 gave a much
better estimate of 1.
The problem is that when I combine such a condition with a join to another
table, Postgres will prefer a merge join (or hash join) rather than a
nested loop.

====Question====
Are Q1 and Q2 equivalent? From what I see and the results, they seem to be
the same, or did I miss something? (Charset: C, Encoding: UTF-8)
If they are equivalent, is that a bug in the planner?

Many Thanks,
Sam
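[Editor's note: under a C locale, text comparison is plain byte-wise ordering, which is what makes a rewrite of LIKE 'p%' into the half-open range [p, next-prefix) valid. A minimal illustrative sketch in Python, whose default string comparison is code-point-wise and therefore agrees with C-locale ordering for ASCII data like these shipment ids; this is not PostgreSQL code:]

```python
# Illustrative only: for an ASCII prefix whose last character is not 0x7F,
# the upper bound can be formed by incrementing the last character.
prefix = '12345678'
lo = prefix
hi = prefix[:-1] + chr(ord(prefix[-1]) + 1)  # '12345679'

# Under byte-wise ordering, [lo, hi) contains exactly the strings
# that start with the prefix:
for s in ['12345678', '12345678Z', '12345679', '1234567', '99999999']:
    assert (lo <= s < hi) == s.startswith(prefix)
```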
On Tue, Oct 16, 2012 at 3:15 AM, Sam Wong <sam@hellosam.net> wrote:
> Hi communities,
>
> I am investigating a performance issue involving LIKE 'xxxx%' on an
> index in a complex query with joins.
>
> ====Scenario====
> My database locale is C, using UTF-8 encoding. I tested this on 9.1.6
> and 9.2.1.
>
> Q1.
> SELECT * FROM shipments WHERE shipment_id LIKE '12345678%'
>
> Q2.
> SELECT * FROM shipments WHERE shipment_id >= '12345678' AND shipment_id <
> '12345679'
>
> ...snip...
>
> ====Question====
> Are Q1 and Q2 equivalent? From what I see and the results, they seem to
> be the same, or did I miss something? (Charset: C, Encoding: UTF-8)
> If they are equivalent, is that a bug in the planner?

They are most certainly not equivalent.  What if the shipment_id is
12345678Z?

merlin
> On Wednesday, October 17, 2012 4:30, Merlin Moncure wrote:
>
> On Tue, Oct 16, 2012 at 3:15 AM, Sam Wong <sam@hellosam.net> wrote:
> > Hi communities,
> >
> > I am investigating a performance issue involving LIKE 'xxxx%' on
> > an index in a complex query with joins.
> >
> > ====Scenario====
> > My database locale is C, using UTF-8 encoding. I tested this on 9.1.6
> > and 9.2.1.
> >
> > Q1.
> > SELECT * FROM shipments WHERE shipment_id LIKE '12345678%'
> >
> > Q2.
> > SELECT * FROM shipments WHERE shipment_id >= '12345678' AND
> > shipment_id < '12345679'
> >
> > ...snip...
> >
> > ====Question====
> > Are Q1 and Q2 equivalent? From what I see and the results, they seem
> > to be the same, or did I miss something? (Charset: C, Encoding: UTF-8)
> > If they are equivalent, is that a bug in the planner?
>
> They are most certainly not equivalent.  What if the shipment_id is
> 12345678Z?
>
> merlin

But '12345678Z' is indeed >= '12345678' AND < '12345679'. Just like
'apple' < 'apples' < 'apply' in a dictionary.

A quick test:

vitalink=# select * from ss;
    id
-----------
 12345678
 12345678Z
 12345679
(3 rows)

vitalink=# select * from ss WHERE id >= '12345678' AND id < '12345679';
    id
-----------
 12345678
 12345678Z
(2 rows)

Sam
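[Editor's note: Sam's dictionary analogy can be checked in any language with lexicographic string comparison. A quick Python equivalent of the ss-table session above, using the same hypothetical data:]

```python
# Same three ids as the vitalink=# session above.
ids = ['12345678', '12345678Z', '12345679']

assert 'apple' < 'apples' < 'apply'            # dictionary ordering
assert '12345678' <= '12345678Z' < '12345679'  # so 12345678Z is in range

# Reproduces the 2-row result of the range query:
in_range = [s for s in ids if '12345678' <= s < '12345679']
assert in_range == ['12345678', '12345678Z']
```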
On Tue, Oct 16, 2012 at 8:01 PM, Sam Wong <sam@hellosam.net> wrote:
>> On Wednesday, October 17, 2012 4:30, Merlin Moncure wrote:
>>
>> On Tue, Oct 16, 2012 at 3:15 AM, Sam Wong <sam@hellosam.net> wrote:
>> > Hi communities,
>> >
>> > I am investigating a performance issue involving LIKE 'xxxx%' on
>> > an index in a complex query with joins.
>> >
>> > Q1.
>> > SELECT * FROM shipments WHERE shipment_id LIKE '12345678%'
>> >
>> > Q2.
>> > SELECT * FROM shipments WHERE shipment_id >= '12345678' AND
>> > shipment_id < '12345679'
>> >
>> > ...snip...
>>
>> They are most certainly not equivalent.  What if the shipment_id is
>> 12345678Z?
>>
>> merlin
>>
> But '12345678Z' is indeed >= '12345678' AND < '12345679'. Just like
> 'apple' < 'apples' < 'apply' in a dictionary.

Right -- I didn't visualize it properly.  Still, you're asking the server
to infer that, since you're looking between two adjacent textual
characters in a range bounded [), it can convert the 'between' into a
partial string search.  That holds up logically but probably isn't worth
spending cycles on, particularly in cases of non-ASCII-mappable Unicode
characters.

merlin
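[Editor's note: Merlin's caveat about characters that cannot simply be "incremented" is the tricky part of deriving the upper bound. A simplified Python sketch of the idea behind PostgreSQL's make_greater_string, which the planner uses for this rewrite: bump the last byte, and if it is already 0xFF, drop it and retry. The real function is more careful, e.g. about producing encoding-valid strings; this is an illustration, not the actual implementation:]

```python
def next_prefix(prefix):
    """Return the smallest byte string sorting after every string with
    this prefix (byte-wise / C-locale ordering), or None if there is
    no finite upper bound. Simplified sketch of make_greater_string."""
    p = bytearray(prefix)
    while p:
        if p[-1] < 0xFF:
            p[-1] += 1          # bump the last byte...
            return bytes(p)
        p.pop()                 # ...unless it's 0xFF: drop it, retry
    return None                 # prefix was all 0xFF bytes

assert next_prefix(b'12345678') == b'12345679'
assert next_prefix(b'ab\xff') == b'ac'      # can't bump 0xFF; bump 'b'
assert next_prefix(b'\xff\xff') is None     # no usable upper bound
```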
> Merlin Moncure wrote on Thursday, October 18, 2012 1:45:
>
> On Tue, Oct 16, 2012 at 8:01 PM, Sam Wong <sam@hellosam.net> wrote:
> > ...snip...
> >
> > But '12345678Z' is indeed >= '12345678' AND < '12345679'. Just like
> > 'apple' < 'apples' < 'apply' in a dictionary.
>
> Right -- I didn't visualize it properly.  Still, you're asking the server
> to infer that, since you're looking between two adjacent textual
> characters in a range bounded [), it can convert the 'between' into a
> partial string search.  That holds up logically but probably isn't worth
> spending cycles on, particularly in cases of non-ASCII-mappable Unicode
> characters.
>
> merlin

PostgreSQL does that already. Refer to the EXPLAIN ANALYZE results of Q1
and Q2; both give
"Index Cond: ((shipment_id >= '12345678'::text) AND (shipment_id <
'12345679'::text))"
(I also just realized they did it, just now.)

Yet, with the additional Filter (see the Q1 plan), it is surprising that
it estimates Q1 will return more rows than Q2.

FYI, I made a self-contained test case and submitted it as bug #7610.
Sam Wong wrote:
>>>>> I am investigating a performance issue involving LIKE 'xxxx%'
>>>>> on an index in a complex query with joins.
>>>>>
>>>>> Q1.
>>>>> SELECT * FROM shipments WHERE shipment_id LIKE '12345678%'
>>>>>
>>>>> Q2.
>>>>> SELECT * FROM shipments WHERE shipment_id >= '12345678' AND
>>>>> shipment_id < '12345679'

[Q1 and Q2 have different row estimates]

Merlin wrote:
>> Right -- I didn't visualize it properly.  Still, you're asking the
>> server to infer that, since you're looking between two adjacent
>> textual characters in a range bounded [), it can convert the 'between'
>> into a partial string search.  That holds up logically but probably
>> isn't worth spending cycles on, particularly in cases of
>> non-ASCII-mappable Unicode characters.

> PostgreSQL does that already. Refer to the EXPLAIN ANALYZE results of
> Q1 and Q2; both give
> "Index Cond: ((shipment_id >= '12345678'::text) AND (shipment_id <
> '12345679'::text))"
> (I also just realized they did it, just now.)
>
> Yet, with the additional Filter (see the Q1 plan), it is surprising
> that it estimates Q1 will return more rows than Q2.
>
> FYI, I made a self-contained test case and submitted it as bug #7610.

Did you try to increase the statistics for column "shipment_id"?

This will probably not make the difference go away, but if the estimate
gets better, it might be good enough for the planner to pick the correct
plan.

Yours,
Laurenz Albe