Re: tsvector limitations

From Kevin Grittner
Subject Re: tsvector limitations
Msg-id 4DF79808020000250003E649@gw.wicourts.gov
In response to Re: tsvector limitations  (Tim <elatllat@gmail.com>)
Responses Re: tsvector limitations  (Tim <elatllat@gmail.com>)
Re: tsvector limitations  (Greg Williamson <gwilliamson39@yahoo.com>)
List pgsql-admin
Tim <elatllat@gmail.com> wrote:

> So I ran this test:
> unzip -p text.docx word/document.xml | perl -p -e
> 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;'|grep ".." > text.txt
> ls -hal ./text.*
> #-rwxrwxrwx 1 postgres postgres 15M 2011-06-14 15:12 ./text.docx
> #-rwxrwxrwx 1 postgres postgres 29M 2011-06-14 15:17 ./text.txt
> mv /tmp/text.* /var/lib/postgresql/9.0/main/
> cd ~/;psql -d postgres
> #psql (9.0.4)
> CREATE DATABASE test;
> \q
> cd ~/;psql -d test
> CREATE TABLE test(title VARCHAR(256), data OID, words TSVECTOR);
> INSERT INTO test VALUES (  'text.docx',  LO_IMPORT('text.docx'),
> TO_TSVECTOR(pg_read_file('text.txt' ,0, 100000000))  );
>
> and I got this:
>  #ERROR:  string is too long for tsvector (30990860 bytes, max
> 1048575 bytes)

Your test data (whatever it is that you used) doesn't seem typical of
English text.  The entire PostgreSQL documentation in HTML form,
with all the HTML files concatenated, is 11424165 bytes (11MB),
and the tsvector of that is 364410 bytes (356KB).  I don't suppose
you know of some publicly available file on the web that I could use
to reproduce your problem?
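
One rough way to check whether a text file is "typical" in this sense
is to compare its total word count with its distinct word count, since
a tsvector keeps each distinct lexeme only once.  The following is a
minimal sketch, not what to_tsvector actually does: it assumes words
are runs of ASCII letters and digits, and it applies no stemming or
stop-word removal.  The function names are made up for illustration.

```shell
#!/bin/sh
# Rough vocabulary check for a plain-text file.
# Assumption: a "word" is a run of ASCII letters/digits, lowercased.
# No stemming or stop-word removal, so this only approximates the
# set of lexemes that to_tsvector would keep.

# Print one lowercased word per line for the file given as $1.
words() {
    tr -cs '[:alnum:]' '\n' < "$1" | tr '[:upper:]' '[:lower:]' | grep .
}

# Print "<total words> <distinct words>" for the file given as $1.
word_stats() {
    total=$(words "$1" | wc -l | tr -d ' ')
    distinct=$(words "$1" | sort -u | wc -l | tr -d ' ')
    echo "$total $distinct"
}
```

Running `word_stats text.txt` on ordinary prose should show far fewer
distinct words than total words.  If the two numbers are close, the
lexeme list is likely to be nearly as large as the input itself.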

> The year is 2011; I don't think searching a 2MB text file is too
> much to expect.

Based on the ratio for the PostgreSQL docs, it seems possible to
index documents considerably larger than that.  Without the markup
(as in the case of a PDF), I bet it would take a lot less text to
reach the limit than what I saw for the docs.  A printed or
typewritten page usually holds about 2KB of text.  I used pdftotext
to get as text the contents of a 119 page technical book about
database technology, and it came to 235KB of text.  I made a tsvector
for that, and it was 99KB.  The limit you showed is 1048575 bytes,
which is a little over ten times 99KB.  So, at *that* rate you'd need
about 10 books that size, totaling roughly 1,200 pages of text in a
single document, to hit the limit.  Well, probably more than that,
because some of the words might be repeated from one book to another.
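
The extrapolation can be checked with a few lines of shell
arithmetic.  This is only a sketch under stated assumptions: tsvector
size is taken to grow linearly with input size, the measured tsvector
size is treated as directly comparable to the limit in the error
message, and 1KB is taken as 1024 bytes.  The function name
max_input_bytes is made up for illustration.

```shell
#!/bin/sh
# Back-of-the-envelope estimate from the figures quoted in this message.
# Integer arithmetic only, so every result is rounded down.

LIMIT=1048575   # maximum size in bytes, from the error message

# Estimate how many bytes of input would reach LIMIT, scaling linearly
# from one measured sample.
#   $1 = sample input size in bytes
#   $2 = tsvector size in bytes for that sample
max_input_bytes() {
    echo $(( LIMIT * $1 / $2 ))
}

# PostgreSQL docs as HTML: 11424165 bytes in, 364410-byte tsvector.
docs_bytes=$(max_input_bytes 11424165 364410)
echo "HTML docs: about $docs_bytes bytes of input"

# 119-page book: 235KB of text in, 99KB tsvector.
book_size=$(( 235 * 1024 ))
book_bytes=$(max_input_bytes "$book_size" $(( 99 * 1024 )))
books=$(( book_bytes / book_size ))
echo "book text: about $book_bytes bytes, or $books whole books"
echo "pages:     about $(( books * 119 ))"
```

With the book figures this comes out at 10 whole books.  Because
repeated words add no new lexemes, the real amount of text needed to
reach the limit is likely higher than a linear estimate suggests.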

So, I'm back to wondering what problem you're trying to solve where
this is actually a limitation for you.

-Kevin
