Обсуждение: Re: [GENERAL] indexed regex select optimisation missing?
Stuart - I'm forwarding a version of your last message to the hackers list, and to Lamar Owen, who's the keeper of the RPMs. The short verson, for those who haven't followed this thread over on GENERAL, is that Stuart is being bitten by the USE_LOCALE affect on the makeIndexable() function in the parser: anchored regex searches on a large table (a glossary, I believe) take a long time, proportional to sort position of the anchoring text: i.e. searching for '^zoo' is quick, '^apple' is very slow. I seems to recall the packagers here (Lamar and Oliver) asking if defining USE_LOCALE for the general RPM or deb would cause any problems for other users, who don't need locale info. Here's a real world example. The discussion about this was last June, and shifted focus into the multi-byte problem, as far as I can tell. Bruce, some version of this is on the TODO list, right? Ross On Fri, Nov 05, 1999 at 12:09:19PM +1300, Stuart Woolford wrote: > On Fri, 05 Nov 1999, you wrote: > > Ah, your description just tripped a memory for me from the hackers list: > > > > The behavior you describe has to do with the implementation of using an > > index for regex matching, in the presence of the USE_LOCALE configuration > > option. > > > > Internally, the condition: WHERE word~'^alongword' is converted in the > > parser(!) to: > > > > WHERE word >= 'alongword' AND word < 'alongword\377' > > > > since the index needs inequalities to be used, not matches. Now, the > > problem is the hack of tacking an octal \377 on the string to create > > the lexagraphically 'just bigger' value assumes ASCI sort order. If > > USE_LOCALE is defined, this is dropped, since we don't have a good fix > > yet, and slow correct behavior is better than fast, incorrect behavior. > > ah, now this makes sense, I'm using the RPMs, and I bet they have lexical > enabled by default (damb! perhaps another set should be produced without this > option? it makes a BIG difference) > > > > So, you have two options: if you don't need locale support, > recompile > without it. Otherwise, hand code your anchored matches as the pair > of > conditionals above Hmm, is there syntax for adding an arbitrary value to > > a string constant in the SQL? I suppose you could use: word < 'alongwore', > > i.e. hand increment the last character, so it's larger than any match. > > I've tried a test using ">='window' and <'windox'", and it works perfectly, and > very very fast, so I think we have found your culprit. > > > > > Your point is correct, the developers are aware of it as a theoretical > > problem, at least. Always helps to hear a real world case, though. I > > believe it's on the TODO list as is, otherwise, pester Bruce. ;-) > > > > Reviewing my email logs from June, most of the work on this has to do with > > people who needs locales, and potentially multibyte character sets. Tom > > Lane is of the opinion that this particular optimization needs to be moved > > out of the parser, and deeper into the planner or optimizer/rewriter, > > so a good fix may be some ways out. > > Hmm, perhaps a 'good' initial fix would be to produce another set of RPMs, > and/or add it to the FAQ in the 4.x section about the slow queries that say > indexes are used for this type of search. using the >= AND < trick does seem to > work, but is a little non-obvious (and hard to code in some situations, it will > make quite a difference to how I need to implement my searching system) > > > > > Ross > > thank you very very much for your assistance on this, it is greatly appreciated! > > -- > ------------------------------------------------------------ > Stuart Woolford, stuartw@newmail.net > Unix Consultant. > Software Developer. > Supra Club of New Zealand. > ------------------------------------------------------------ -- Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu> NSBRI Research Scientist/Programmer Computer and Information Technology Institute Rice University, 6100 S. Main St., Houston, TX 77005
"Ross J. Reedstrom" wrote: > > Stuart - > I'm forwarding a version of your last message to the hackers list, and > to Lamar Owen, who's the keeper of the RPMs. The short verson, for those > > Hmm, perhaps a 'good' initial fix would be to produce another set of RPMs, That is easy enough. I can build two versions -- with locale, and no-locale. No-locale RPM's would be named differently -- postgresql-6.5.3-1nl.i386.rpm (that's 'one in ell'). I have been helping another user figure out the regression results for locales -- it's not fun. HOWEVER, I also need to follow the RedHat-originated standard, with is with locale support. It'll take a little bit to rebuild, but not too long -- I could release no-locale RPM's as early as tomorrow for RedHat 6.x, and as early as an hour from now for RedHat 5.2 (both releases happening after the official 6.5.3 release, of course). In fact, if a user wants to build the no-locale RPM's themselves, it's not too difficult: 1.) get the postgresql-6.5.2-1.src.rpm source RPM (hereafter abbreviated 'the SRPM') 2.) Install the SRPM with 'rpm -i' 3.) Become root, and cd to /usr/src/redhat/SPECS 4.) Open postgresql.spec with your favorite editor 5.) Remove the configure option '--enable-locale' (if you use vi, and are comfortable with doing so, you can ':%s/--enable-locale//g' to good effect). 6.) Change the string after the line 'Release:' to be '1nl' from 1. 7.) Save and exit your editor. 8.) execute the command 'rpm -ba postgresql.spec' 9.) When it's done, install the new RPM's from the appropriate directory under /usr/src/redhat/RPMS. 10.) Clean up by removing the files under SOURCES and the postgresql-6.5.2 build tree under BUILD. NOTE: You need a fairly complete development environment to do this -- in particular, 'python-devel' must be installed (it's not by default, even under a 'C Development' and 'Development Libraries' enabled installation. You do need the C++ compiler installed as well. Would this help?? -- Lamar Owen WGCR Internet Radio 1 Peter 4:11
> Stuart -
> I'm forwarding a version of your last message to the hackers list, and
> to Lamar Owen, who's the keeper of the RPMs. The short verson, for those
> who haven't followed this thread over on GENERAL, is that Stuart is being
> bitten by the USE_LOCALE affect on the makeIndexable() function in the
> parser: anchored regex searches on a large table (a glossary, I believe)
> take a long time, proportional to sort position of the anchoring text:
> i.e. searching for '^zoo' is quick, '^apple' is very slow.
>
> I seems to recall the packagers here (Lamar and Oliver) asking if defining
> USE_LOCALE for the general RPM or deb would cause any problems for other
> users, who don't need locale info. Here's a real world example.
>
> The discussion about this was last June, and shifted focus into the
> multi-byte problem, as far as I can tell. Bruce, some version of this
> is on the TODO list, right?
I have beefed up the FAQ with a mention that locale disables regex
indexing, and have added to TODO:
    * Allow LOCALE to use indexes in regular expression searches
--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
			
		"Ross J. Reedstrom" <reedstrm@wallace.ece.rice.edu> writes:
> Reviewing my email logs from June, most of the work on this has to do with
> people who needs locales, and potentially multibyte character sets. Tom
> Lane is of the opinion that this particular optimization needs to be moved
> out of the parser, and deeper into the planner or optimizer/rewriter,
> so a good fix may be some ways out.
Actually, that part is already done: addition of the index-enabling
comparisons is gone from the parser and is now done in the optimizer,
which has a whole bunch of benefits (one being that the comparison
clauses don't get added to the query unless they are actually used
with an index!).
But the underlying LOCALE problem still remains: I don't know a good
character-set-independent method for generating a "just a little bit
larger" string to use as the righthand limit.  If anyone out there is
an expert on foreign and multibyte character sets, some help would
be appreciated.  Basically, given that we know the LIKE or regex
pattern can only match values beginning with FOO, we want to generate
string comparisons that select out the range of values that begin with
FOO (or, at worst, a slightly larger range).  In USASCII locale it's not
hard: you can do
    field >= 'FOO' AND field < 'FOP'
but it's not immediately obvious how to make this idea work reliably
in the presence of odd collation orders or multibyte characters...
BTW: the \377 hack is actually wrong for USASCII too, since it'll
exclude a data value like 'FOO\377x' which should be included.
            regards, tom lane
			
		I don't know much about the backend stuff, but wouldn't it reduce the amount of records you go through to do a search for FO. and then do a another check on each returned record to check that the last character matches? More checks, but fewer total records. Anyway, just a thought. At 12:46 PM 11/5/99, Tom Lane wrote: >[snip] > > Basically, given that we know the LIKE or regex >pattern can only match values beginning with FOO, we want to generate >string comparisons that select out the range of values that begin with >FOO (or, at worst, a slightly larger range). In USASCII locale it's not >hard: you can do > field >= 'FOO' AND field < 'FOP' >but it's not immediately obvious how to make this idea work reliably >in the presence of odd collation orders or multibyte characters... > >BTW: the \377 hack is actually wrong for USASCII too, since it'll >exclude a data value like 'FOO\377x' which should be included. > > regards, tom lane > >************
Firstly, damb you guys are good, please accept my strongest complements for the response time on this issue! On Sat, 06 Nov 1999, Tom Lane wrote: > "Ross J. Reedstrom" <reedstrm@wallace.ece.rice.edu> writes: > > Reviewing my email logs from June, most of the work on this has to do with > > people who needs locales, and potentially multibyte character sets. Tom > > Lane is of the opinion that this particular optimization needs to be moved > > out of the parser, and deeper into the planner or optimizer/rewriter, > > so a good fix may be some ways out. > > Actually, that part is already done: addition of the index-enabling > comparisons is gone from the parser and is now done in the optimizer, > which has a whole bunch of benefits (one being that the comparison > clauses don't get added to the query unless they are actually used > with an index!). > > But the underlying LOCALE problem still remains: I don't know a good > character-set-independent method for generating a "just a little bit > larger" string to use as the righthand limit. If anyone out there is > an expert on foreign and multibyte character sets, some help would > be appreciated. Basically, given that we know the LIKE or regex > pattern can only match values beginning with FOO, we want to generate > string comparisons that select out the range of values that begin with > FOO (or, at worst, a slightly larger range). In USASCII locale it's not > hard: you can do > field >= 'FOO' AND field < 'FOP' > but it's not immediately obvious how to make this idea work reliably > in the presence of odd collation orders or multibyte characters... how about something along the lines of: file >='FOO' and field='FOO.*' ie, terminate once the search fails on a match of the static left-hand-side followed by anything (although I have the feeling this does not fit into your execution system..), and a simple regex type check be added to the scan validation code? > > BTW: the \377 hack is actually wrong for USASCII too, since it'll > exclude a data value like 'FOO\377x' which should be included. That's why I pointed out that in my particular case, I only have alpha and numeric data in the database, so it is safe, it's certainly no general solution. -- ------------------------------------------------------------ Stuart Woolford, stuartw@newmail.net Unix Consultant. Software Developer. Supra Club of New Zealand. ------------------------------------------------------------
Well, I've improved my regex text searches to actually use the indexes properly
now for the basic case, but I have found another 'problem' (or feature, call it
what you will ;) - to demonstrate:
with locale turned on (the default RPMS are like this):
the following takes a LONG time to run on 1.6 million records:
-------------------------------------
explain select isbn, count from inv_word_i where
word~'^foo'
order by count
Sort  (cost=35148.70 rows=353 width=16)
  ->  Index Scan using i3 on inv_word_i  (cost=35148.70 rows=353 width=16)
-------------------------------------
the following runs instantly, and does (nearly) the same thing:
-------------------------------------
explain select isbn, count from inv_word_i where
word>='foo' and word<'fop'
order by count
Sort  (cost=11716.57 rows=183852 width=16)
  ->  Index Scan using i3 on inv_word_i  (cost=11716.57 rows=183852 width=16)
-------------------------------------
but what about the following? :
-------------------------------------
explain select isbn , sum(count) from inv_word_i where
(word>='window' and word<'windox')
or
(word>='idiot' and word<'idiou')
group by isbn
order by sum(count) desc
Sort  (cost=70068.84 rows=605525 width=16)
  ->  Aggregate  (cost=70068.84 rows=605525 width=16)
        ->  Group  (cost=70068.84 rows=605525 width=16)
              ->  Sort  (cost=70068.84 rows=605525 width=16)
                    ->  Seq Scan on inv_word_i  (cost=70068.84 rows=605525 width=16)
-------------------------------------
this is the fastest way I've found so far to do a multi-word search (window and
idiot as the root words in this case), you note it does NOT use the indexes,
but falls back to a linear scan?!? it takes well over 30 seconds (much much too
long)
I've tried a LOT of different combinations, and have yet to find a way of
getting the system to use the indexes correctly to do what I want, the closest
I've ffound is using a select intersect select method to find all docs
containing both word (what I really want, although the query above is a ranked
or query), but it gets slow as soon as I select more than one field for the
results (I need to line isbn in this case to another database in the final
application)
I assume there is some reason the system falls back to a linear scan in this
case? it seems two index lookups would be much much more efficient..
am I missing something again?
--
------------------------------------------------------------
Stuart Woolford, stuartw@newmail.net
Unix Consultant.
Software Developer.
Supra Club of New Zealand.
------------------------------------------------------------