Обсуждение: text column indexing in UTF-8 database
Do I really need 4 indexes per column to handle the 4 combinations of
{equality,like/regexp} x {unfolded,casefolded} in a UTF-8 encoded
database?
I have a column that I'd like to be able to search with equality and
regexp (or like), optionally casefolded. The database is UTF-8 encoded.
The table and index defs are below.
Jeff Davis gave me a tip to use text_pattern_ops on indexes to speed up
regexp and like; that worked beautiful. But I discovered a caveat that
t_p_o apparently doesn't handle equality. Thus, I think I need distinct
indexes for the 4 cases above. Right?
Thanks,
Reece
rkh@csb-dev=> \d pannotation
Table "unison.pannotation"
Column | Type | Modifiers
----------------+--------------------------+------------------------
pannotation_id | integer | not null default
origin_id | integer | not null
alias | text | not null
descr | text |
tax_id | integer |
added | timestamp with time zone | not null default timenow()
Indexes:
...
"pannotation_alias" btree (alias)
"pannotation_alias_cf" btree (lower(alias))
"pannotation_alias_cf_tpo" btree (lower(alias) text_pattern_ops)
"pannotation_alias_tpo" btree (alias text_pattern_ops)
...
where those indexes are defined as:
rkh@csb-dev=> \x
rkh@csb-dev=> select indexname,indexdef from pg_indexes
where indexname~'^pannotation_alias';
-[ RECORD 1 ]--------------------------------------------------------
indexname | pannotation_alias_cf_tpo
indexdef | CREATE INDEX pannotation_alias_cf_tpo ON pannotation USING btree (lower(alias) text_pattern_ops)
-[ RECORD 2 ]---------------------------------------------------------
indexname | pannotation_alias_tpo
indexdef | CREATE INDEX pannotation_alias_tpo ON pannotation USING btree (alias text_pattern_ops)
-[ RECORD 3 ]---------------------------------------------------------
indexname | pannotation_alias
indexdef | CREATE INDEX pannotation_alias ON pannotation USING btree (alias)
-[ RECORD 4 ]---------------------------------------------------------
indexname | pannotation_alias_cf
indexdef | CREATE INDEX pannotation_alias_cf ON pannotation USING btree
(lower(alias))
--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0
On Mar 12, 2009, at 5:15 PM, Reece Hart wrote:
> Do I really need 4 indexes per column to handle the 4 combinations of
> {equality,like/regexp} x {unfolded,casefolded} in a UTF-8 encoded
> database?
>
> I have a column that I'd like to be able to search with equality and
> regexp (or like), optionally casefolded. The database is UTF-8
> encoded.
> The table and index defs are below.
>
> Jeff Davis gave me a tip to use text_pattern_ops on indexes to speed
> up
> regexp and like; that worked beautiful. But I discovered a caveat that
> t_p_o apparently doesn't handle equality. Thus, I think I need
> distinct
> indexes for the 4 cases above. Right?
If A=B then lower(A) = lower(B), and if A like B then lower(A) like
lower(B).
So, if nothing else, you could rewrite "where alias = 'Foo'" as
"where lower(alias) = lower('Foo') and alias='Foo'" and take advantage
of the lower() functional index.
I've read that 8.4 will be able to use a text_pattern_ops index for
equality.
Cheers,
Steve
On Thu, 2009-03-12 at 17:15 -0700, Reece Hart wrote:
> Jeff Davis gave me a tip to use text_pattern_ops on indexes to speed up
> regexp and like; that worked beautiful. But I discovered a caveat that
> t_p_o apparently doesn't handle equality. Thus, I think I need distinct
> indexes for the 4 cases above. Right?
It looks like an index using text_pattern_ops can be used for equality
(see my test case below).
This works apparently because texteq() is defined as bitwise-equality.
Is that really correct? I was under the impression that some locales do
not obey that rule, and may consider two slightly different strings to
be equal.
Regards,
Jeff Davis
create table a(t text);
create index a_idx on a (t text_pattern_ops);
insert into a values('foo');
set enable_seqscan='f';
analyze a;
explain analyze select * from a where t = 'foo';
QUERY PLAN
-----------------------------------------------
Index Scan using a_idx on a (cost=0.00..8.27 rows=1 width=4)
(actual time=0.009..0.010 rows=1 loops=1)
Index Cond: (t = 'foo'::text)
Total runtime: 0.036 ms
(3 rows)
On Thu, 2009-03-12 at 18:02 -0700, Jeff Davis wrote:
> It looks like an index using text_pattern_ops can be used for equality
> (see my test case below).
Odd. I can't reproduce your test case. I noticed that I edited out the
version and platform from my OP. (A: 8.3.6, x86_64 linux). You're on
8.3.6, or do you happen to be testing on the 8.4 branch?
I see this:
rkh@rkh=> \i tpo-test.sql
version
--------------------------------------------------------------------------------------------
PostgreSQL 8.3.6 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC)
4.1.0 (SUSE Linux)
select name,setting from pg_settings where name~'locale|encoding';
name | setting
-----------------+---------
client_encoding | UTF8
server_encoding | UTF8
(2 rows)
\!echo $LANG
en_US.UTF-8
create temp table a(t text);
CREATE TABLE
insert into a values('foo');
INSERT 0 1
set enable_seqscan='f';
SET
create index a_t_tpo on a (t text_pattern_ops);
CREATE INDEX
analyze a;
ANALYZE
explain analyze select * from a where t = 'foo';
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------
Seq Scan on a (cost=100000000.00..100000001.01 rows=1 width=4) (actual
time=0.014..0.016 rows=1 loops=1)
Filter: (t = 'foo'::text)
Total runtime: 0.047 ms
(3 rows)
create index a_t on a (t);
CREATE INDEX
analyze a;
ANALYZE
explain analyze select * from a where t = 'foo';
QUERY
PLAN
-------------------------------------------------------------------------------------------------------
Index Scan using a_t on a (cost=0.00..8.27 rows=1 width=4) (actual
time=0.061..0.062 rows=1 loops=1)
Index Cond: (t = 'foo'::text)
Total runtime: 0.099 ms
(3 rows)
script at http://harts.net/reece/tpo-test.sql
--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0
On Thu, 2009-03-12 at 17:32 -0700, Steve Atkins wrote:
Good idea. Thanks. The niggling remaining problem is that the DB is open to a SQL-savvy audience and it'd be nice to avoid telling them to casefold their predicates.
For regexps, lower(alias) ~* lower(regexp) won't work because extended regexps might contain character classes (e.g., \S != \s). And, I guess that alias ~* regexp requires a seqscan because the index isn't ordered over ~* (right?). How about lower(alias) ~* regexp ? Is PG smart enough to know that that ordering is well defined? Is my head on straight there?
Thanks again,
Reece
If A=B then lower(A) = lower(B), and if A like B then lower(A) like
lower(B).
So, if nothing else, you could rewrite "where alias = 'Foo'" as
"where lower(alias) = lower('Foo') and alias='Foo'" and take advantage
of the lower() functional index.
Good idea. Thanks. The niggling remaining problem is that the DB is open to a SQL-savvy audience and it'd be nice to avoid telling them to casefold their predicates.
For regexps, lower(alias) ~* lower(regexp) won't work because extended regexps might contain character classes (e.g., \S != \s). And, I guess that alias ~* regexp requires a seqscan because the index isn't ordered over ~* (right?). How about lower(alias) ~* regexp ? Is PG smart enough to know that that ordering is well defined? Is my head on straight there?
Thanks again,
Reece
-- Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0 |
Jeff Davis <pgsql@j-davis.com> writes:
> It looks like an index using text_pattern_ops can be used for equality
> (see my test case below).
This is true as of 8.4; prior versions make a distinction between =
and ~=~.
> This works apparently because texteq() is defined as bitwise-equality.
> Is that really correct? I was under the impression that some locales do
> not obey that rule, and may consider two slightly different strings to
> be equal.
The locale might, but Postgres doesn't --- look at the implementation
of texteq().
There is actually some history here; the former distinction in the
equality operators arose from exactly your concern. But after we
put in the second-pass check to insist on bitwise equality, we
realized that the equality operators really were equivalent.
regards, tom lane
On Fri, 2009-03-13 at 13:13 -0400, Tom Lane wrote:
> There is actually some history here; the former distinction in the
> equality operators arose from exactly your concern. But after we
> put in the second-pass check to insist on bitwise equality, we
> realized that the equality operators really were equivalent.
>
Interesting. It's certainly convenient when "=" means "values are
exactly the same" ;)
One thing that still doesn't make sense to me is that texteq() is
bitwise-equality even in 8.3.
It sounds like Reece Hart can avoid the extra index by making a new
opclass that's like text_pattern_ops except "~=~" should be named "=".
Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> One thing that still doesn't make sense to me is that texteq() is
> bitwise-equality even in 8.3.
Historical artifact ... we made the semantics change some time ago, but
the ensuing change to remove ~=~ didn't happen until 8.4.
> It sounds like Reece Hart can avoid the extra index by making a new
> opclass that's like text_pattern_ops except "~=~" should be named "=".
Unfortunately not, because the LIKE index optimizations are keyed off
the built-in opclasses.
regards, tom lane