Обсуждение: TSearch2: find a QUERY that does match a single document
Hello.
TSearch2 allows to search a table of tsvectors by a single tsquery.
I need to solve the reverse problem.
I have a large table of tsquery. I need to find all tsqueries in that table that match a single document tsvector:
CREATE TABLE "test"."test_tsq" (
"id" SERIAL,
"q" TSQUERY NOT NULL,
CONSTRAINT "test_tsq_pkey" PRIMARY KEY("id")
);
insert into test.test_tsq(q)
select to_tsquery(g || 'x' || g) from generate_series(100000, 900000) as g;
explain analyze
select * from test.test_tsq
where to_tsvector('400000x400000') @@ q
This gets a strange explain analyze:
QUERY PLAN
Seq Scan on test_tsq (cost=0.00..17477.01 rows=800 width=36) (actual time=68.698..181.458 rows=1 loops=1)
Filter: ('''400000x400000'':1'::tsvector @@ q)
Total runtime: 181.484 ms
No matter if I use GIST index on test_tsq.q or not: the explain analyze result is the same.
So, why "rows=800"? The table contains much more rows...
TSearch2 allows to search a table of tsvectors by a single tsquery.
I need to solve the reverse problem.
I have a large table of tsquery. I need to find all tsqueries in that table that match a single document tsvector:
CREATE TABLE "test"."test_tsq" (
"id" SERIAL,
"q" TSQUERY NOT NULL,
CONSTRAINT "test_tsq_pkey" PRIMARY KEY("id")
);
insert into test.test_tsq(q)
select to_tsquery(g || 'x' || g) from generate_series(100000, 900000) as g;
explain analyze
select * from test.test_tsq
where to_tsvector('400000x400000') @@ q
This gets a strange explain analyze:
QUERY PLAN
Seq Scan on test_tsq (cost=0.00..17477.01 rows=800 width=36) (actual time=68.698..181.458 rows=1 loops=1)
Filter: ('''400000x400000'':1'::tsvector @@ q)
Total runtime: 181.484 ms
No matter if I use GIST index on test_tsq.q or not: the explain analyze result is the same.
So, why "rows=800"? The table contains much more rows...
On Fri, 12 Sep 2008, Dmitry Koterov wrote: > Hello. > > TSearch2 allows to search a table of tsvectors by a single tsquery. > I need to solve the reverse problem. > > *I have a large table of tsquery. I need to find all tsqueries in that table > that match a single document tsvector: > * > CREATE TABLE "test"."test_tsq" ( > "id" SERIAL, > "q" TSQUERY NOT NULL, > CONSTRAINT "test_tsq_pkey" PRIMARY KEY("id") > ); > > insert into test.test_tsq(q) > select to_tsquery(g || 'x' || g) from generate_series(100000, 900000) as g; > > explain analyze > select * from test.test_tsq > where to_tsvector('400000x400000') @@ q why do you need tsvector @@ q ? Much better to use tsquery = tsquery test=# explain analyze select * from test_tsq where q = '400000x400000'::tsque> QUERY PLAN ----------------------------------------------------------------------------------------------------------- Seq Scan on test_tsq (cost=0.00..16667.01 rows=1 width=38) (actual time=129.208..341.111 rows=1 loops=1) Filter: (q = '''400000x400000'''::tsquery) Total runtime: 341.134 ms (3 rows) Time: 341.478 ms > > This gets a strange explain analyze: > > QUERY PLAN > Seq Scan on test_tsq (cost=0.00..17477.01 rows=800 width=36) (actual > time=68.698..181.458 rows=1 loops=1) > Filter: ('''400000x400000'':1'::tsvector @@ q) > Total runtime: 181.484 ms > > No matter if I use GIST index on test_tsq.q or not: the explain analyze > result is the same. > So, why "rows=800"? The table contains much more rows... '800' is the number of estimated rows, which is not good, since you got only 1 row. Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
why do you need tsvector @@ q ? Much better to use tsquery = tsqueryexplain analyze
select * from test.test_tsq
where to_tsvector('400000x400000') @@ q
test=# explain analyze select * from test_tsq where q = '400000x400000'::tsque>
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Seq Scan on test_tsq (cost=0.00..16667.01 rows=1 width=38) (actual time=129.208..341.111 rows=1 loops=1)
Filter: (q = '''400000x400000'''::tsquery)
Total runtime: 341.134 ms
(3 rows)
M-mmm... Seems your understood me incorrectly.
I have to find NOT queries which are exactly equal to another query, BUT queries which MATCH the GIVEN document. '400000x400000' was a sample only, in real cases it will be 1-2K document.
Here is a more realistic sample:
explain analyze
select * from test.test_tsq
where to_tsvector('
Here is a real document text. It may be long, 1-2K.
In this sample it contains a lexem "400000x400000", so there is a tsquery
in test_tsq.q which matches this document. I need to find all such queries fast.
Of course, in real cases the document text is unpredictable.
') @@ q
I have to find NOT queries which are exactly equal to another query, BUT queries which MATCH the GIVEN document. '400000x400000' was a sample only, in real cases it will be 1-2K document.
Here is a more realistic sample:
explain analyze
select * from test.test_tsq
where to_tsvector('
Here is a real document text. It may be long, 1-2K.
In this sample it contains a lexem "400000x400000", so there is a tsquery
in test_tsq.q which matches this document. I need to find all such queries fast.
Of course, in real cases the document text is unpredictable.
') @@ q
'800' is the number of estimated rows, which is not good, since you got only 1 row.QUERY PLAN
Seq Scan on test_tsq (cost=0.00..17477.01 rows=800 width=36) (actual
time=68.698..181.458 rows=1 loops=1)
Filter: ('''400000x400000'':1'::tsvector @@ q)
Total runtime: 181.484 ms