Discussion: tsvector limitations


tsvector limitations

From: Tim
Date:
Dear list,

How big of a file would one need to fill the 1MB limit of a tsvector?
Reading http://www.postgresql.org/docs/9.0/static/textsearch-limitations.html seems to hint that filling a tsvector is improbable.

Is there an easy way to query the size of a tsvector in bytes?
Something like length(tsvector), but bytes(tsvector).

If there is no easy method to query the bytes of a tsvector,
I realize the answer is highly dependent on the contents of the file, so here are 2 random examples:
How many bytes of a tsvector would a 32MB ascii english unique word list make?
How many bytes of a tsvector would something like "The Lord of the Rings.txt" make?

If this limitation is ever hit, is there a common practice for using more than one tsvector?
Using a separate "one to many" table seems like an obvious piece of the solution (a rough sketch follows),
but I would not know how to detect or calculate how much text to give each tsvector.
Assuming tsvectors can't be linked, maybe they would need some overlap.
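
To make the question concrete, the kind of layout I have in mind is roughly this (table and column names are just placeholders):

CREATE TABLE documents (
    doc_id serial PRIMARY KEY,
    title  varchar(256)
);

CREATE TABLE document_chunks (
    doc_id   integer REFERENCES documents(doc_id),
    chunk_no integer,                 -- which slice of the original text
    words    tsvector,                -- tsvector for that slice only
    PRIMARY KEY (doc_id, chunk_no)
);

-- a search would then collapse the chunks back to documents:
-- SELECT DISTINCT d.doc_id
--   FROM documents d JOIN document_chunks c USING (doc_id)
--  WHERE c.words @@ to_tsquery('english', 'lord & rings');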


Thanks in advance.

Re: tsvector limitations

From: "Mark Johnson"
Date:
 

I found another post where you asked the same questions 5 months ago.  Have you tested in that time?  http://www.spinics.net/lists/pgsql-admin/msg19438.html


A text search vector is an array of distinct lexemes (less any stopwords) and their positions.  Taking your example we get ...

select to_tsvector('the lord of the rings.txt') "answer";
         answer
------------------------
 'lord':2 'rings.txt':5

You can put the length() function around it to just get the number of lexemes.  This is the size in terms of number of distinct lexemes, not size in terms of space utilization.

select length(to_tsvector('the lord of the rings.txt')) "answer";
  answer
--------
        2

You might find the tsvector data consumes 2x the space required by the input text.  It will depend on your configuration and your input data.  Test it and let us know what you find.

-Mark


Re: tsvector limitations

From: Tim
Date:
Mark,

That link is a mirror of this mailing list; it's not from 5 months ago.
If you are in the year 2012, please respond with lottery numbers and the like.




Re: tsvector limitations

From: "Kevin Grittner"
Date:
Tim <elatllat@gmail.com> wrote:

> How many bytes of a tsvector would a 32MB ascii english unique
> word list make?
> How many bytes of a tsvector would something like "The Lord of the
> Rings.txt" make?

It would appear that nobody has run into this as a limit, nor done
those specific tests.  Storing a series of novels as a single value
in a database seems to be bad design, of course, which might explain
the lack of research.

If you get some numbers through testing, please post them for future
reference.

-Kevin

Re: tsvector limitations

From: Tim
Date:
Hi Kevin,

Thanks for the reply.
I suspect there must have been some testing when the tsvector was created,
and I would be surprised if there is no general "how big is this object" method in PostgreSQL.
That said, perhaps this is the wrong mailing list for this question.

If it's "bad design" to store large text documents (pdf,docx,etc) as a BLOBs or on a filesystem
and make them searchable with tsvectors can you suggest a good design?
If making your own search implementation is "better" what is the point of tsvectors?
Maybe I'm missing something here?


Re: tsvector limitations

From: "Kevin Grittner"
Date:
Tim <elatllat@gmail.com> wrote:

> I would be surprised if there is no general "how big is this
> object" method in PostgreSQL.

You could cast to text and use octet_length().
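
For example (note that this measures the length of the textual form of the value, not its actual storage size):

select octet_length(to_tsvector('the lord of the rings.txt')::text);
-- returns 22 for the example above, i.e. the length of the string 'lord':2 'rings.txt':5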

> If it's "bad design" to store large text documents (pdf,docx,etc)
> as a BLOBs or on a filesystem and make them searchable with
> tsvectors can you suggest a good design?

Well, I suggested that storing a series of novels as a single entry
seemed bad design to me.  Perhaps one entry per novel or even finer
granularity would make more sense in most applications, but there
could be exceptions.  Likewise, a list of distinct words is of
dubious value in most applications' text searches.  We extract text
from court documents and store a tsvector for each document; we
don't aggregate all court documents for a year and create a
tsvector for that -- that would not be useful for us.

> If making your own search implementation is "better" what is the
> point of tsvectors?

I remember you asking about doing that, but I don't think anyone
else has advocated it.

> Maybe I'm missing something here?

If you were to ask for real-world numbers you'd probably get farther
than demanding that people volunteer their time to perform tests
that you define but don't seem willing to run.  Or if you describe
your use case in more detail, with questions about alternative
approaches, you're likely to get useful advice.

-Kevin

Re: tsvector limitations

From: "Kevin Grittner"
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> You could cast to text and use octet_length().

Or perhaps you're looking for pg_column_size().

http://www.postgresql.org/docs/9.0/interactive/functions-admin.html#FUNCTIONS-ADMIN-DBSIZE

-Kevin

Re: tsvector limitations

From: Tim
Date:
Hi Kevin,

Thanks again for the reply.
I suspect casting to text and using octet_length() would not be accurate.
Using extracted keywords or summaries would indeed be quick, but that is not what I'm looking for.
I am inquiring about real-world numbers for full-text search of large documents; I'm not sure what more detail you could want.
I'm not demanding anything, just using examples to clarify my inquiry.
I am indeed open to alternatives.


Thank you Kevin, pg_column_size looks like it's exactly what I'm looking for.

pg_column_size(any) returns int: "Number of bytes used to store a particular value (possibly compressed)"
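
e.g. against a table like test(title varchar, words tsvector), I expect the check to look something like this:

SELECT title, pg_column_size(words) AS tsvector_bytes
  FROM test
 ORDER BY tsvector_bytes DESC;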







Re: tsvector limitations

From: Tim
Date:
So I ran this test:
unzip -p text.docx word/document.xml | perl -p -e 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;'|grep ".." > text.txt
ls -hal ./text.*
#-rwxrwxrwx 1 postgres postgres 15M 2011-06-14 15:12 ./text.docx
#-rwxrwxrwx 1 postgres postgres 29M 2011-06-14 15:17 ./text.txt
mv /tmp/text.* /var/lib/postgresql/9.0/main/
cd ~/;psql -d postgres
#psql (9.0.4)
CREATE DATABASE test;
\q
cd ~/;psql -d test
CREATE TABLE test(title VARCHAR(256), data OID, words TSVECTOR);
INSERT INTO test VALUES (  'text.docx',  LO_IMPORT('text.docx'),  TO_TSVECTOR(pg_read_file('text.txt' ,0, 100000000))  );

and I got this:
#ERROR:  string is too long for tsvector (30990860 bytes, max 1048575 bytes)

doing the math
echo "scale=3;29/(30990860/1048575)"|bc #==0.981

This indicates a (worst-case) limit of searching only about the first 1MB of a text file before you need to start building your own search, maybe on top of tsvector.
The year is 2011; I don't think searching a 2MB text file is too much to expect.
So how am I to use PostgreSQL FTS as a "full text search" when the above example can only handle a "small or partial text search"?
If I'm not missing anything, maybe the documentation should be adjusted accordingly.

Any thoughts or alternatives are most welcome.

Re: tsvector limitations

From: Craig James
Date:
On 6/14/11 1:42 PM, Tim wrote:
> So I ran this test:
> unzip -p text.docx word/document.xml | perl -p -e 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;'|grep ".." > text.txt
> ls -hal ./text.*
> #-rwxrwxrwx 1 postgres postgres 15M 2011-06-14 15:12 ./text.docx
> #-rwxrwxrwx 1 postgres postgres 29M 2011-06-14 15:17 ./text.txt
> mv /tmp/text.* /var/lib/postgresql/9.0/main/
> cd ~/;psql -d postgres
> #psql (9.0.4)
> CREATE DATABASE test;
> \q
> cd ~/;psql -d test
> CREATE TABLE test(title VARCHAR(256), data OID, words TSVECTOR);
> INSERT INTO test VALUES (  'text.docx',  LO_IMPORT('text.docx'),  TO_TSVECTOR(pg_read_file('text.txt' ,0, 100000000))  );
>
> and I got this:
> #ERROR:  string is too long for tsvector (30990860 bytes, max 1048575 bytes)
>
> doing the math
> echo "scale=3;29/(30990860/1048575)"|bc #==0.981
>
> This indicates a (worst-case) limit of searching only about the first 1MB of a text file before you need to start building your own search, maybe on top of tsvector.
> The year is 2011; I don't think searching a 2MB text file is too much to expect.

The novel "Hawaii" at 960 pages is roughly 1MB.  tsvector was intended for documents (web pages, news articles, corporate memos, ...), not for books.  What you're asking for is interesting, but you can't complain that an open-source project that was designed for a different purpose doesn't meet your needs.

> So how am I to use PostgreSQL FTS as a "full text search" when the above example can only handle a "small or partial text search"?
> If I'm not missing anything, maybe the documentation should be adjusted accordingly.

Maybe a better question is: how am I to use PGSQL FTS as a "massively huge text search" when it was designed for nothing bigger than "huge text search"?

> Any thoughts or alternatives are most welcome.

I'm curious how tsvector could be useful on a 29 MB document.  That's roughly one whole encyclopedia set.  A document that size should have a huge vocabulary, and tsvector's index would be saturated.

However, if the vocabulary in this 29 MB document isn't that big, then you might consider creating a smaller "document."  You could write a Perl script that scans the document and creates a dictionary which it writes out as a secondary "vocabulary" file that's a list of the unique words in your document.  Create an auxiliary column in your database to hold this vocabulary for each document, and use tsvector to index that.  The Perl program would be trivial, and tsvector would be happy.
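
Just as a sketch, the SQL-side equivalent of that Perl script might look like this; the table and column names are only illustrative:

-- hypothetical table: docs(id, body text, vocab tsvector)
-- index only the distinct words, discarding positions and counts
UPDATE docs
   SET vocab = to_tsvector('english',
         (SELECT string_agg(DISTINCT w, ' ')
            FROM regexp_split_to_table(lower(body), E'\\W+') AS w
           WHERE w <> ''));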

Craig

Re: tsvector limitations

From: "Kevin Grittner"
Date:
Tim <elatllat@gmail.com> wrote:

> So I ran this test:
> unzip -p text.docx word/document.xml | perl -p -e
> 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;'|grep ".." > text.txt
> ls -hal ./text.*
> #-rwxrwxrwx 1 postgres postgres 15M 2011-06-14 15:12 ./text.docx
> #-rwxrwxrwx 1 postgres postgres 29M 2011-06-14 15:17 ./text.txt
> mv /tmp/text.* /var/lib/postgresql/9.0/main/
> cd ~/;psql -d postgres
> #psql (9.0.4)
> CREATE DATABASE test;
> \q
> cd ~/;psql -d test
> CREATE TABLE test(title VARCHAR(256), data OID, words TSVECTOR);
> INSERT INTO test VALUES (  'text.docx',  LO_IMPORT('text.docx'),
> TO_TSVECTOR(pg_read_file('text.txt' ,0, 100000000))  );
>
> and I got this:
>  #ERROR:  string is too long for tsvector (30990860 bytes, max
> 1048575 bytes)

Your test (whatever data it is that you used) doesn't seem typical of
English text.  The entire PostgreSQL documentation in HTML form,
when all the HTML files are concatenated, is 11424165 bytes (11MB),
and the tsvector of that is 364410 bytes (356KB).  I don't suppose you
know of some publicly available file on the web that I could use to
reproduce your problem?
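
Something along these lines reproduces that measurement (the file name here is made up, and the file has to sit inside the data directory for pg_read_file() to reach it):

SELECT octet_length(pg_read_file('pgdocs.html', 0, 100000000)) AS input_bytes,
       pg_column_size(to_tsvector('english',
                      pg_read_file('pgdocs.html', 0, 100000000))) AS tsvector_bytes;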

> The year is 2011; I don't think searching a 2MB text file is too
> much to expect.

Based on the ratio for the PostgreSQL docs, it seems possible to
index documents considerably larger than that.  Without the markup
(as in the case of a PDF), I bet it would take a lot less than what
I saw for the docs.  A printed or typewritten page usually has about
2KB of text per page.  I used pdftotext to get as text the contents
of a 119 page technical book about database technology, and it came
to 235KB of text.  I made a tsvector for that, and it was 99KB.  So,
at *that* rate you'd need about 10 books that size, totaling roughly
1,200 pages of text in a document to hit the limit you showed.
Well, probably more than that, because some of the words might be
repeated from one book to another.

So, I'm back to wondering what problem you're trying to solve where
this is actually a limitation for you.

-Kevin

Re: tsvector limitations

From: Tim
Date:
Hi Craig,

Thanks for writing.

If one were to try to work around the tsvector limit (I'm not sure I need to yet; this thread is mainly to determine that),
then instead of a solution involving a "vocabulary" file,
one would probably be better off discarding tsvectors, building a vocabulary table, and linking it to documents with a (dict_id, hit_count, word_id) table, roughly as sketched below.
It would be faster, smaller, and more accurate, because it would not contain the now-useless position information, while it would contain the otherwise-lost word-count information.
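
Something like this is what I mean (only a sketch):

CREATE TABLE words (
    word_id serial PRIMARY KEY,
    word    text UNIQUE NOT NULL
);

CREATE TABLE document_words (
    dict_id   integer NOT NULL,                          -- the document id
    word_id   integer NOT NULL REFERENCES words(word_id),
    hit_count integer NOT NULL,                          -- the otherwise-lost word count
    PRIMARY KEY (dict_id, word_id)
);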

I wonder if anyone has any insight into the inner workings of Oracle/MSSQL/etc. FTS.
Maybe there is a common design pattern I/we can use to handle text files of non-trivial vocabulary that preserves position information.
I'm not sure a (dict_id, position, word_id) table would be queryable in a useful way,
because as far as I can think at the moment there is no good way to compare the positions in different rows with SQL.

I will collect a few more sample files to see how much I really need this; multi-language files are probably the worst offenders.



Re: tsvector limitations

From: Tim
Date:
Hi Kevin,

My test was indeed atypical vocabulary; it was a dictionary file.
I was intentionally trying to hit the limit to find out where it was, because the documentation did not directly address it.
I am mainly trying to find out if this actually will be a limitation for me.
Thank you for contributing the test data on the PostgreSQL docs (text=11MB,tsvector=0.356MB) and anonymous technical book (text=0.2MB, tsvector=0.1MB).
It seems that as long as a document uses a small percentage of the potential language vocabulary, it would be hard to hit the limit.
On the other hand, if someone tries to upload a dictionary, or a file using more than about 4% of one vocabulary, it would need special handling.
(I guess this would be more likely with larger vocabularies, like maybe the Japanese writing systems.)
The 4% figure is probably low due to misspellings, slang, etc.

Anyway, I now have an approximate answer to the original question of where the limit is,
and it's probably safe for my goals to just check and warn if a file can't be indexed (sketched below).
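
For the record, the "check and warn" I have in mind is roughly this (the name is a placeholder, and catching OTHERS is only to keep the sketch short):

CREATE OR REPLACE FUNCTION safe_to_tsvector(doc text) RETURNS tsvector AS $$
BEGIN
    RETURN to_tsvector('english', doc);
EXCEPTION WHEN others THEN
    -- e.g. "string is too long for tsvector (... bytes, max 1048575 bytes)"
    RAISE WARNING 'document could not be indexed as a single tsvector';
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;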



Re: tsvector limitations

From: Tim
Date:
Hi Cliff,

Apache Solr et al., especially the regex search abilities, look interesting.
They seem to handle files in databases as well as those in filesystems.
It is likely a bit detached, overkill, and heavy for my needs, but I'll keep it in mind if PostgreSQL can't fill them.



On Tue, Jun 14, 2011 at 7:23 PM, Cliff Pratt <enkiduonthenet@gmail.com> wrote:
Store the metadata in the database, and store the actual document on
the filesystem. Use Apache SOLR or any good alternative to index the
data for search.

Cheers,

Cliff

Re: tsvector limitations

From: Greg Williamson
Date:
Kevin Grittner wrote:



> Tim <elatllat@gmail.com> wrote:
>
<...>
> Your test (whatever data it is that you used) don't seem typical of
> English text.  The entire PostgreSQL documentation in HTML form,
> when all the html files are concatenated is 11424165 bytes (11MB),
> and the tsvector of that is 364410 (356KB).  I don't suppose you
> know of some publicly available file on the web that I could use to
> reproduce your problem?

Try trolling texts at the Internet Archive (archive.org) -- lots of stuff that
has been rendered into ASCII ... Government documents and the like from all
periods; novels and the like that are no longer under copyright, so lots of long
classics.

<http://www.archive.org/stream/ataleoftwocities00098gut/old/2city12p_djvu.txt>
for example ... 765K

HTH,

Greg Williamson



Re: tsvector limitations

From: "Kevin Grittner"
Date:
Greg Williamson <gwilliamson39@yahoo.com> wrote:

> Try trolling texts at the Internet Archive (archive.org) -- lots
> of stuff that has been rendered into ASCII ... Government
> documents and the like from all periods; novels and the like that
> are no longer under copyright, so lots of long classics.
>
> <http://www.archive.org/stream/ataleoftwocities00098gut/old/2city12p_djvu.txt>
> for example ... 765K

Thanks.  OK, for perspective, A Tale of Two Cities has a tsvector size
of 121KB.

-Kevin

Re: tsvector limitations

From: "Mark Johnson"
Date:
 

When this discussion first started, I immediately thought about people who full-text index their server's log files.  As a test I copied /var/log/messages to $PGDATA and then used the same pg_read_file() function you mentioned earlier to pull the data into a column of type text.  The original file was 4.3 MB, the db column had length 4334920, and pg_column_size reported a size of 1058747.

I then added a column named tsv of type tsvector and populated it using to_tsvector().  pg_column_size reported 201557.  So in this test a 4.2 MB text file produced a tsvector of size 200 KB.  If this scales linearly, then the max size of an input document would be 21.8 MB before you hit the tsvector limit of 1 MB.

If you run a "find" command on your server for files larger than 20 MB, the percentage is quite small, maybe 1% of files.  In the specific case of indexing PostgreSQL's log files, you could use the log_rotation_size parameter to ensure all files stay smaller than that and avoid the tsvector limits.
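
For reference, the test amounted to roughly the following (reconstructed from the description above; the copy of /var/log/messages is assumed to sit inside $PGDATA):

CREATE TABLE logtest (raw text, tsv tsvector);
INSERT INTO logtest(raw) VALUES (pg_read_file('messages', 0, 100000000));
UPDATE logtest SET tsv = to_tsvector('english', raw);
SELECT length(raw)         AS chars,
       pg_column_size(raw) AS text_bytes,
       pg_column_size(tsv) AS tsvector_bytes
  FROM logtest;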
-Mark


Re: tsvector limitations

From: Tom Lane
Date:
"Mark Johnson" <mark@remingtondatabasesolutions.com> writes:
> When this discussion first started, I immediately thought about people
> who full text index their server's log files. As a test I copied
> /var/log/messages to $PGDATA and then used the same pg_read_file()
> function you mentioned earlier to pull the data into a column of type
> text. The original file was 4.3 MB, and the db column had length
> 4334920 and the function pg_column_size reported a size of 1058747. I
> then added a column named tsv of type tsvector, and populated it using
> to_tsvector(). The function pg_column_size reported 201557. So in this
> test a 4.2 MB text file produced a tsvector of size 200 KB. If this
> scales linearly,

... which it won't.  There is no real-world full text indexing
application where there aren't many duplications of words.  (The OP
eventually admitted that his "test case" was a dictionary word list
and not an actual document.)  Any discussion of required tsvector
sizes that doesn't account for the actual, nonlinear scaling behavior
isn't worth the electrons it's printed on.

            regards, tom lane

Re: tsvector limitations

From: Oleg Bartunov
Date:
On Wed, 15 Jun 2011, Tom Lane wrote:
> and not an actual document.)  Any discussion of required tsvector
> sizes that doesn't account for the actual, nonlinear scaling behavior
> isn't worth the electrons it's printed on.

Tom, you're my hero !

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: tsvector limitations

From: Tim
Date:
Hi Tom,

My limited testing indicates that tsvector size has an approximately linear correlation with both "number of unique words" and "word count".
Presumably your "nonlinear" remark was (correctly) directed at the correlation between file size and tsvector size.

I found that

    VS = UW*AWS + 2*WC + 550    (valid while the per-word count is below ~160)

approximates the size of a tsvector, where VS is the vector size in bytes, UW the number of unique words, AWS the average word size, and WC the word count.

In other words:

    tsvector size (max 1048575 bytes)
        = (number of unique words) * (average word size)
        + 2 * (word count)
        + 550

where, as long as the per-word count stays below about 160 (which seems related to the "No more than 256 positions per lexeme" restriction), the average word size can be approximated.

I'm sure there are some things the calculation is not accounting for (maybe word variation), but it seemed to work decently for my limited test.
I would not be surprised if it lost accuracy by an order of magnitude when applied to a larger data set without improvement.
So a tsvector might hold about 147,609 unique words, or 69,405 with an average repeat of 10.
Practically, this limitation is unlikely to be important, though it is likely to be hit a few times among the millions of people that use PostgreSQL (knowingly or otherwise).
The Oxford English Dictionary claims there are about 228,132 unique words with an average word length of 5.1 (about 2.4MB).
The test file I took data from had 2,972,885 words (27MB), with an average word length of 9.

Some of my testing:
echo "(1*100)*(6.560)+(2*100)+550"|bc
1406.000  # calculated
1406      # expected
echo "(1*100)*(6.560)+(2*1000)+550"|bc
3206.000  # calculated
3202      # expected
echo "(1*100)*(6.560)+(2*5000)+550"|bc
11206.000 # calculated
11202     # expected
echo "(1*100)*(6.560)+(2*10000)+550"|bc
21206.000 # calculated
21202     # expected

echo "(1*100)*(6.560)+(2*100)+550"|bc
1406.000  # calculated
1406      # expected
echo "(1*200)*(6.575)+(2*200)+550"|bc
2265.000  # calculated
2726      # expected
echo "(1*500)*(7.572)+(2*500)+550"|bc
5336.000  # calculated
7378      # expected
echo "(1*1000)*(7.792)+(2*1000)+550"|bc
10342.000 # calculated
10736     # expected
echo "(1*1500)*(8.302)+(2*1500)+550"|bc
16003.000 # calculated
15738     # expected

File sizes:
ls -hal text.*
-rwxrwxrwx 1 postgres postgres 6.5K 2011-06-15 19:42 text.100x10.txt
-rwxrwxrwx 1 postgres postgres  33K 2011-06-15 19:49 text.100x50.txt
-rwxrwxrwx 1 postgres postgres  65K 2011-06-15 19:41 text.100x100.txt
-rwxrwxrwx 1 postgres postgres  97K 2011-06-15 20:05 text.100x150.txt

-rwxrwxrwx 1 postgres postgres  656 2011-06-15 18:01 text.100.txt
-rwxrwxrwx 1 postgres postgres 1.3K 2011-06-15 20:51 text.200.txt
-rwxrwxrwx 1 postgres postgres 3.7K 2011-06-15 20:52 text.500.txt
-rwxrwxrwx 1 postgres postgres 7.7K 2011-06-15 20:52 text.1000.txt
-rwxrwxrwx 1 postgres postgres  13K 2011-06-15 20:52 text.1500.txt


Average word lengths:
bash average_word_length.sh
text.100.txt     6.560
text.200.txt     6.575
text.500.txt     7.572
text.1000.txt    7.792
text.1500.txt    8.302

Tsvector sizes:
select title,pg_column_size(words) from test order by pg_column_size;
       title       | pg_column_size
-------------------+----------------
 text.100.txt      |           1406
 text.100x10.txt   |           3202
 text.100x50.txt   |          11202
 text.100x100.txt  |          21202
 text.100x150.txt  |          31112

 text.100.txt      |           1406
 text.200.txt      |           2726
 text.500.txt      |           7378
 text.1000.txt     |          10736
 text.1500.txt     |          15738






Re: tsvector limitations

From: Tom Lane
Date:
Tim <elatllat@gmail.com> writes:
> My limited testing indicates tsvector size has an approximately linear
> (binomial) correlation with "number of unique words", and "word count".

Yes, it would be linear in the number of unique words (assuming the
average word length stays constant).  It's not linear in word count,
because tsvector only keeps track of distinct word positions for the
first 16K words in the text.  So for large documents only the number of
unique words is really going to matter, and that grows *much* less than
linearly with document size for real-world documents.
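
(A quick way to see that effect, assuming any handy table with a text column, say a hypothetical docs table:)

SELECT pg_column_size(to_tsvector('english', body))             AS once,
       pg_column_size(to_tsvector('english', repeat(body, 10))) AS ten_times
  FROM docs
 WHERE doc_id = 1;
-- ten_times comes out far smaller than 10 x once, because repeating the text
-- adds no new unique words, only (capped) position information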

            regards, tom lane