Обсуждение: spgist text_ops and LIKE

Поиск
Список
Период
Сортировка

spgist text_ops and LIKE

От
Robert Haas
Дата:
Is spgist intended to support prefix searches with LIKE?

I ask because, first, it seems like something spgist ought to be good
at (unless I'm confused), and, second, the text_ops opfamily includes
these operators:

~<~(text,text)~<=~(text,text)~>=~(text,text)~>~(text,text)

...which seems to be the same operators that are used for btree
pattern-matching searches:

rhaas=# explain select count(*) from person where last_name like 'WAR%';                                       QUERY
PLAN
------------------------------------------------------------------------------------------Aggregate
(cost=2519.27..2519.28rows=1 width=0)  ->  Bitmap Heap Scan on person  (cost=24.70..2496.75 rows=9005 width=0)
Filter:(last_name ~~ 'WAR%'::text)        ->  Bitmap Index Scan on person_tpo  (cost=0.00..22.45
 
rows=900 width=0)              Index Cond: ((last_name ~>=~ 'WAR'::text) AND
(last_name ~<~ 'WAS'::text))
(5 rows)

...but when I create an index like this:

create index person_spg on person using spgist (last_name text_ops);

...I can't get LIKE to use it, even if I disable seqscans.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: spgist text_ops and LIKE

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> Is spgist intended to support prefix searches with LIKE?

Too lazy to look at the code right now, but I think indxpath.c contains
hardwired assumptions that LIKE prefix optimizations are only possible
with btree indexes.  Possibly it would be worth relaxing that.  (The
whole "special index operator" mechanism is undesirably special-purpose,
but I currently have no ideas about how to make it more flexible.)
        regards, tom lane


Re: spgist text_ops and LIKE

От
Robert Haas
Дата:
On Wed, Feb 1, 2012 at 10:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Is spgist intended to support prefix searches with LIKE?
>
> Too lazy to look at the code right now, but I think indxpath.c contains
> hardwired assumptions that LIKE prefix optimizations are only possible
> with btree indexes.  Possibly it would be worth relaxing that.  (The
> whole "special index operator" mechanism is undesirably special-purpose,
> but I currently have no ideas about how to make it more flexible.)

Is it as simple as the attached?

Given the set of operators that are available for spgist, one would
assume that it's intended to handle both pattern-matching and regular
comparison operators.  On the flip side, spg_text_inner_consistent()
seems to fall back to a full index-scan for the regular comparison
operators, which makes me wonder why anyone would bother having the
operator at all.  Indeed, in my testing, it's slightly slower than a
sequential scan:

select sum(1) from person where last_name > 'WAR';
 - btree: 113-114ms
 - btree with text_pattern_ops: can't do it
 - spgist: 920-935ms
 - sequential scan: 895-910ms

With the attached patch, it seems to work as well as text_pattern_ops
for pattern matching:

select sum(1) from person where last_name LIKE 'WAR%';
 - btree: can't do it
 - btree with text_pattern_ops: 6-7ms
 - spgist: 7-9ms
 - sequential scan: 170-180ms

It's fine for equality, too:

select sum(1) from person where last_name = 'WARNER';
 - btree: 1.7-1.9ms
 - btree with text_pattern_ops: 1.7-1.9ms
 - spgist: 1.7-1.9ms
 - sequential scan: 160-165ms

It seems that there's at least the potential for spgist indexes to be
extremely efficient, given adequate knowledge about the collation
under which comparisons are being done.   If for example we're trying
to find strings where last_name > 'WAR', we certainly need to chase
down the subtrees where the first character is 'W', 'X', 'Y', or 'Z'.
If the collation is case-insensitive, we might also need to chase 'w',
'x', 'y', and 'z'.  And there might be other characters that need to
be chased as well, depending on the collation: some collations might
sort numbers before letters, while others might do the reverse.  But
it's likely that there are few if any collations where we need to
chase down the subtree for 'Q' on this query, because anything
beginning with 'Q' is likely to sort before anything beginning with
'WAR' pretty much anywhere.  I'm not sure there's a good way to get
that information without a lot of grotty hard-coding, but it seems
like the potential is there.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Вложения

Re: spgist text_ops and LIKE

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Feb 1, 2012 at 10:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Too lazy to look at the code right now, but I think indxpath.c contains
>> hardwired assumptions that LIKE prefix optimizations are only possible
>> with btree indexes.  Possibly it would be worth relaxing that.

> Is it as simple as the attached?

Yeah, that looks about right, with the single quibble that the other
places where we have #defines in pg_opfamily.h put them after the
corresponding DATA line not before.  Please fix and commit.

> On the flip side, spg_text_inner_consistent()
> seems to fall back to a full index-scan for the regular comparison
> operators, which makes me wonder why anyone would bother having the
> operator at all.

MHO is that that code path is only useful when you are in C locale;
but I think that's a useful enough case to be worth the rather small
amount of incremental code to handle it.  I'm not sure I believe your
speculations about making spgist handle inequality searches in other
locales.  Non-C sort orders are too bizarre and we have too little
visibility into the precise rules the current locale is following.

... however ... could it be reasonable to apply strxfrm and store the
result of that in the index?  Then searches would just need byte-by-byte
comparisons.
        regards, tom lane


Re: spgist text_ops and LIKE

От
Robert Haas
Дата:
On Thu, Feb 2, 2012 at 12:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Feb 1, 2012 at 10:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Too lazy to look at the code right now, but I think indxpath.c contains
>>> hardwired assumptions that LIKE prefix optimizations are only possible
>>> with btree indexes.  Possibly it would be worth relaxing that.
>
>> Is it as simple as the attached?
>
> Yeah, that looks about right, with the single quibble that the other
> places where we have #defines in pg_opfamily.h put them after the
> corresponding DATA line not before.  Please fix and commit.

Done.

>> On the flip side, spg_text_inner_consistent()
>> seems to fall back to a full index-scan for the regular comparison
>> operators, which makes me wonder why anyone would bother having the
>> operator at all.
>
> MHO is that that code path is only useful when you are in C locale;
> but I think that's a useful enough case to be worth the rather small
> amount of incremental code to handle it.  I'm not sure I believe your
> speculations about making spgist handle inequality searches in other
> locales.  Non-C sort orders are too bizarre and we have too little
> visibility into the precise rules the current locale is following.

Unfortunately, I fear that you are right, unless we were to switch
over to using a locale machinery of our own devising in place of that
provided by the OS.  And no, I'm not proposing that.

> ... however ... could it be reasonable to apply strxfrm and store the
> result of that in the index?  Then searches would just need byte-by-byte
> comparisons.

Yeah, you could, but I don't see the point.  I was hoping that spgist
would give us the ability to have a single index that allows both
inequality comparisons and pattern matching.  If you're going to have
one index where the strings are stored "straight" and another where
strxfrm() is applied, then you could just build two btree indexes, one
the regular way and a second with text_pattern_ops.

I'm having trouble figuring out under what set of circumstances spgist
is expected to be the best available alternative.  It only supports a
small subset of the data types that GiST does, so I suppose the point
is that it should be faster for the cases that it does handle.  And,
considering that this is all brand-new code, the fact that it's almost
keeping pace with btree on both pattern-matching and equality
comparisons is certainly respectable -- but I so far haven't found any
cases where it's a clear win.  There's limited glory in being the
almost-fastest way of indexing for a certain class of queries.
Admittedly, I haven't tried the point-in-box stuff yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: spgist text_ops and LIKE

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> I'm having trouble figuring out under what set of circumstances spgist
> is expected to be the best available alternative.  It only supports a
> small subset of the data types that GiST does, so I suppose the point
> is that it should be faster for the cases that it does handle.  And,
> considering that this is all brand-new code, the fact that it's almost
> keeping pace with btree on both pattern-matching and equality
> comparisons is certainly respectable -- but I so far haven't found any
> cases where it's a clear win.  There's limited glory in being the
> almost-fastest way of indexing for a certain class of queries.
> Admittedly, I haven't tried the point-in-box stuff yet.

I think the text opclass is mostly meant as sample code/proof-of-concept.
What spgist is likely to be good for is queries that btree can't cope
with at all.  But it's cute that we can make it handle LIKE for a few
more lines of code, so I'm fine with adding that.
        regards, tom lane


Re: spgist text_ops and LIKE

От
Alexander Korotkov
Дата:
On Thu, Feb 2, 2012 at 10:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I'm having trouble figuring out under what set of circumstances spgist
is expected to be the best available alternative.  It only supports a
small subset of the data types that GiST does, so I suppose the point
is that it should be faster for the cases that it does handle.  And,
considering that this is all brand-new code, the fact that it's almost
keeping pace with btree on both pattern-matching and equality
comparisons is certainly respectable -- but I so far haven't found any
cases where it's a clear win.  There's limited glory in being the
almost-fastest way of indexing for a certain class of queries.

I think that current implementation of suffix tree (in particular implementation of consistent methods) in spgist is far from reveal full potential of suffix tree. I see at least two additional search types which can be efficiently evaluated using suffix tree:
1) Search by prefix regular expression. For example, search for /^a(bc|d)e[fgh]/ in single index scan.
2) Search by levenshtein distance, i.e. "SELECT * FROM tbl WHERE levenshtein(col, 'some constant') < k;" using index (surely actual syntax for such index scan would be anouther).
This is only two additional applications which first comes to my mind, there could be other. Unfortunately, I don't have enough of time for it now. But I like to put my hands on this in future.

Admittedly, I haven't tried the point-in-box stuff yet.

In short, I believe spgist for geometrical data-types is more CPU-effective and less IO-effective than gist. Gist have to call consistent method for all index tuples of the page it uses in scan, while spgist scans smaller nodes. Sure, comprehensive testing is required for doing final conclusion about that. 

-----
With best regards,
Alexander Korotkov.