Re: Indexing MS/Open Office and PDF documents

From dennis jenkins
Subject Re: Indexing MS/Open Office and PDF documents
Date
Msg-id CAAEzAp8LC6RhfLQFbb=+PTG3fBB8uH6zAG+aU3NA1uoRv-KK0A@mail.gmail.com
In reply to Re: Indexing MS/Open Office and PDF documents  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Indexing MS/Open Office and PDF documents  (Samba <saasira@gmail.com>)
List pgsql-general
On Thu, Mar 15, 2012 at 4:12 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2012-03-16 at 01:57 +0530, Alexander.Bagerman@cognizant.com
> wrote:
>> Hi,
>>
>> We are looking to use Postgres 9 for the document storing and would
>> like to take advantage of the full text search capabilities. We have
>> hard time identifying MS/Open Office and PDF parsers to index stored
>> documents and make them available for text searching. Any advice would
>> be appreciated.
>
> The first step is to find a library that can parse such documents, or
> convert them to a format that can be parsed.

I don't know about MS Office document parsing, but PoDoFo (a PDF
parsing library) can extract text from PDFs.  Every now and then someone
posts to the PoDoFo mailing list with questions about extracting
text in order to index it in an FTS-capable database.  PoDoFo
has excellent developer support: the maintainer is quick to accept
patches, verify bugs, add features, and so on.  Disclaimer: I'm neither
a PDF nor a PoDoFo expert.  I can't help you accomplish what you want.
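
For what it's worth, once some extractor has produced plain text, the
indexing side might look roughly like the sketch below.  This is only
an illustration under assumptions of mine, not something from the
thread: the "documents" table, its columns, the 'english' text search
configuration, and the %s placeholder style (as used by drivers such
as psycopg2) are all hypothetical choices.  The extraction step itself
(PoDoFo or anything else) is deliberately left out.

```python
import re

# Hypothetical schema, assumed for illustration only:
#   CREATE TABLE documents (id serial PRIMARY KEY, filename text,
#                           body text, body_tsv tsvector);
#   CREATE INDEX documents_body_tsv_idx ON documents USING gin (body_tsv);

# Parameterized statements; %s is the placeholder style of drivers
# such as psycopg2 (an assumption, adjust for your driver).
INSERT_SQL = (
    "INSERT INTO documents (filename, body, body_tsv) "
    "VALUES (%s, %s, to_tsvector('english', %s))"
)
SEARCH_SQL = (
    "SELECT filename FROM documents "
    "WHERE body_tsv @@ plainto_tsquery('english', %s)"
)


def clean_extracted_text(raw):
    """Normalize text produced by a PDF/Office text extractor."""
    # PostgreSQL text values cannot contain NUL characters.
    text = raw.replace("\x00", "")
    # Collapse runs of whitespace (including line breaks) to one space.
    return re.sub(r"\s+", " ", text).strip()


def build_insert(filename, raw_text):
    """Return (sql, params) ready to pass to a driver's execute()."""
    body = clean_extracted_text(raw_text)
    return INSERT_SQL, (filename, body, body)


def build_search(query):
    """Return (sql, params) for a plain-text search over the index."""
    return SEARCH_SQL, (query,)
```

With a DB-API cursor one would then call something like
cur.execute(*build_insert("report.pdf", extracted_text)) and later
cur.execute(*build_search("quarterly results")).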
