Обсуждение: Dumb question involving to_tsvector and a view

Поиск
Список
Период
Сортировка

Dumb question involving to_tsvector and a view

От
"Raymond C. Rodgers"
Дата:
Hi folks,
     I'm building a PHP script for a web site I'm developing. At the
moment, there is absolutely no real data in the database, so obviously
performance is pretty good right now. I'm in the midst of developing an
administration page for the site, which will do a full text search on
several tables separately, and I realized that one of the tables
currently doesn't have a tsvector column. As I went to add a tsvector
column, it occurred to me that it might be possible to add a dynamic
tsvector column through the use of a view, so I created a temporary view
with a command along the lines of:

     CREATE TEMPORARY VIEW ftstest AS SELECT id, field1, field2,
TO_TSVECTOR(COALESCE(field1,'') || ' ' || COALESCE(field2,'')) AS
txtsrch FROM mytable;

     To my surprise, it worked. Now, I'm sitting here thinking about the
performance impact that doing this would have. I can't help but think
that a query to this view when the table is filled with thousands or
tens of thousands of entries would be painfully slow, but would there be
any real advantage to doing it in a view rather than just adding the
column to the table? (That's the dumb question.) If the site only had a
few dozen users, and the amount of data on the site was minimal, this
wouldn't be too big an issue. Still a bad design decision, but are there
any good reasons to do it?

Thanks for your patience with this dumb question. :)
Raymond

Re: Dumb question involving to_tsvector and a view

От
Kevin Grittner
Дата:
Raymond C. Rodgers <sinful622@gmail.com> wrote:=0A=0A> As I went to add a t=
svector column, it occurred to me that it=0A> might be possible to add a dy=
namic tsvector column through the=0A> use of a view, so I created a tempora=
ry view with a command along=0A> the lines of:=0A>=0A>=A0=A0=A0=A0 CREATE T=
EMPORARY VIEW ftstest AS SELECT id, field1, field2,=0A> TO_TSVECTOR(COALESC=
E(field1,'') || ' ' ||=0A> COALESCE(field2,'')) AS txtsrch FROM mytable;=0A=
>=0A> To my surprise, it worked. Now, I'm sitting here thinking about=0A> t=
he performance impact that doing this would have.=0A=0AI had a similar situ=
ation and benchmarked it both ways.=A0 For my=0Asituation I came out ahead =
writing the extra column for inserts and=0Aupdates than generating the tsve=
ctor values on the fly each time it=0Awas queried.=A0 YMMV.=A0 It probably =
depends mostly on the ratio of=0Ainserts and updates to selects.=0A=0A--=0A=
Kevin Grittner=0AEnterpriseDB: http://www.enterprisedb.com=0AThe Enterprise=
 PostgreSQL Company

Re: Dumb question involving to_tsvector and a view

От
Tom Lane
Дата:
Kevin Grittner <kgrittn@ymail.com> writes:
> Raymond C. Rodgers <sinful622@gmail.com> wrote:
>> As I went to add a tsvector column, it occurred to me that it
>> might be possible to add a dynamic tsvector column through the
>> use of a view, so I created a temporary view with a command along
>> the lines of:
>>
>>      CREATE TEMPORARY VIEW ftstest AS SELECT id, field1, field2,
>> TO_TSVECTOR(COALESCE(field1,'') || ' ' ||
>> COALESCE(field2,'')) AS txtsrch FROM mytable;
>>
>> To my surprise, it worked. Now, I'm sitting here thinking about
>> the performance impact that doing this would have.

> I had a similar situation and benchmarked it both ways.  For my
> situation I came out ahead writing the extra column for inserts and
> updates than generating the tsvector values on the fly each time it
> was queried.  YMMV.  It probably depends mostly on the ratio of
> inserts and updates to selects.

A "virtual" tsvector like that is probably going to be useless for
searching as soon as you get a meaningful amount of data, because the
only way the DB can implement a search is to compute the tsvector value
for each table row and then examine it for the target word(s).

What you want is a GIST or GIN index on the contents of the tsvector.
You can either realize the tsvector as a table column and put a regular
index on it, or you can build a functional index on the to_tsvector()
expression.  The latter is kind of like your idea in that the tsvector
as a whole isn't stored anywhere --- but there's an index containing all
the words, which is what you need for searching.

I think there are examples of both ways in the "text search" chapter of
the manual.  (If not, there should be ...)

            regards, tom lane

Re: Dumb question involving to_tsvector and a view

От
"Raymond C. Rodgers"
Дата:
On 02/23/2013 05:26 AM, Tom Lane wrote:
> A "virtual" tsvector like that is probably going to be useless for
> searching as soon as you get a meaningful amount of data, because the
> only way the DB can implement a search is to compute the tsvector
> value for each table row and then examine it for the target word(s).
> What you want is a GIST or GIN index on the contents of the tsvector.
> You can either realize the tsvector as a table column and put a
> regular index on it, or you can build a functional index on the
> to_tsvector() expression. The latter is kind of like your idea in that
> the tsvector as a whole isn't stored anywhere --- but there's an index
> containing all the words, which is what you need for searching. I
> think there are examples of both ways in the "text search" chapter of
> the manual. (If not, there should be ...) regards, tom lane
I think the only real advantage to using something like this would be a
space savings in terms of storing the tsvector data, but I don't see
that being a significant enough reason to go ahead and use this idea in
a production situation. As mentioned [by pretty much all of us], once
the table size is sufficiently large there would be a performance
penalty by to_tsvector being executed on every record in the table. (If
I'm not mistaken, with the way I wrote that "create view", every record
in "mytable" would be subject to the function call, then any narrowing
parameters in the where clause would be applied afterwards.)

Any way, like I said originally, it was a dumb question. It might be ok
to use that in a situation where the table size is known to be small,
but there's little to no reason to do it in a production situation.

Thanks!
Raymond

Re: Dumb question involving to_tsvector and a view

От
Jasen Betts
Дата:
On 2013-02-23, Raymond C. Rodgers <sinful622@gmail.com> wrote:
> On 02/23/2013 05:26 AM, Tom Lane wrote:
>> A "virtual" tsvector like that is probably going to be useless for
>> searching as soon as you get a meaningful amount of data, because the
>> only way the DB can implement a search is to compute the tsvector
>> value for each table row and then examine it for the target word(s).
>> What you want is a GIST or GIN index on the contents of the tsvector.

> I think the only real advantage to using something like this would be a
> space savings in terms of storing the tsvector data, but I don't see
> that being a significant enough reason to go ahead and use this idea in
> a production situation. As mentioned [by pretty much all of us], once
> the table size is sufficiently large there would be a performance
> penalty by to_tsvector being executed on every record in the table.

Unless the plan comes out as a table scan the index will be used
instead ot to_tsvector()

When there is a table scan to_tsvector will be used instead of reading
from disk, I don't know how fast to_tsvector is compared to disk, but
usually computing a result is faster than reading it from disk.

Storing the tsvector in the table is likely to be faster only when a
tablescan is done and the table is fully cached in ram.

--
⚂⚃ 100% natural

Re: Dumb question involving to_tsvector and a view

От
Kevin Grittner
Дата:
Jasen Betts <jasen@xnet.co.nz> wrote:=0A>On 2013-02-23, Raymond C. Rodgers =
<sinful622@gmail.com> wrote:=0A=0A>> On 02/23/2013 05:26 AM, Tom Lane wrote=
:=0A>>> A "virtual" tsvector like that is probably going to be useless for=
=0A>>> searching as soon as you get a meaningful amount of data, because th=
e=0A>>> only way the DB can implement a search is to compute the tsvector=
=0A>>> value for each table row and then examine it for the target word(s).=
=0A>>> What you want is a GIST or GIN index on the contents of the tsvector=
.=0A>=0A>> I think the only real advantage to using something like this wou=
ld be a=0A>> space savings in terms of storing the tsvector data, but I don=
't see=0A>> that being a significant enough reason to go ahead and use this=
 idea in=0A>> a production situation. As mentioned [by pretty much all of u=
s], once=0A>> the table size is sufficiently large there would be a perform=
ance=0A>> penalty by to_tsvector being executed on every record in the tabl=
e.=0A>=0A>Unless the plan comes out as a table scan the index will be used=
=0A>instead ot to_tsvector()=0A>=0A>When there is a table scan to_tsvector =
will be used instead of reading=0A>from disk, I don't know how fast to_tsve=
ctor is compared to disk, but=0A>usually computing a result is faster than =
reading it from disk.=0A>=0A>Storing the tsvector in the table is likely to=
 be faster only when a=0A>tablescan is done and the table is fully cached i=
n ram.=0A=0AI guess I was being dumb in assuming that it was obvious that a=
 GIN=0Aor GiST index would be needed for decent performance at scale. =0AWi=
thout that, a scan of the whole table (or at least all rows=0Amatching othe=
r search criteria) is needed, which is going to hurt. =0AThe benchmarks I m=
entioned were for a GIN index on the results of=0Athe function which genera=
ted the tsvector, versus a GIN index on=0Athe stored tsvector.=A0 In our ca=
se, a typical scan for document text=0Aagainst years of accumulated court d=
ocuments was about 300 ms=0Aversus about 1.5 seconds.=A0 It may matter that=
 we weren't just=0Alooking for matches, but the top K matches based on the =
ranking=0Afunction.=0A=0A-- =0AKevin Grittner=0AEnterpriseDB: http://www.en=
terprisedb.com=0AThe Enterprise PostgreSQL Company