Thread: [NOVICE] Database of articles, LaTeX code and pictures


[NOVICE] Database of articles, LaTeX code and pictures

From: philolilou
Date:
Hello everyone,

I wish to build a place to store articles (with pictures) that can be
easily searched later.

The articles I wish to store are magazine articles and interesting
articles found on the internet.

For the magazine articles, I plan to scan the interesting pages, run
OCR (character recognition) on them, convert the result into
LaTeX-formatted code, and insert everything into a PostgreSQL database.

Once that is done, a PHP-driven website will run searches against the
PostgreSQL database.

For this application, I want searches to work in the following ways:

-> I give a word or a topic, and the database searches the plain text
of all the documents, ranking results by relevance, i.e. by how many
times the word or topic appears in the article

-> by keywords: I specify keywords in the search and get all the
articles matching those keywords

Once the search is done and results are found, a single click delivers
the PDF of the matching article, or displays it in the browser as the
final rendered document (like the original article).


Questions:

1. Is this possible with PostgreSQL? (store LaTeX code and images,
search the text of the code plus keywords, and retrieve all of it for a
LaTeX program to compile again)

2. If the answer to question 1 is yes, how should I structure the
database so that searches can be run against the LaTeX code?


Thanks for your help.






Re: [NOVICE] Database of articles, LaTeX code and pictures

From: Kevin Grittner
Date:
On Wed, Jan 18, 2017 at 12:34 PM, philolilou <philolilou@free.fr> wrote:

> I wish to build a place to store articles (with pictures) that can be
> easily searched later.
>
> The articles I wish to store are magazine articles and interesting
> articles found on the internet.
>
> For the magazine articles, I plan to scan the interesting pages, run
> OCR (character recognition) on them, convert the result into
> LaTeX-formatted code, and insert everything into a PostgreSQL database.
>
> Once that is done, a PHP-driven website will run searches against the
> PostgreSQL database.
>
> For this application, I want searches to work in the following ways:
>
> -> I give a word or a topic, and the database searches the plain text
> of all the documents, ranking results by relevance, i.e. by how many
> times the word or topic appears in the article
>
> -> by keywords: I specify keywords in the search and get all the
> articles matching those keywords

I'm not clear on what you see as the difference between these.

> Once the search is done and results are found, a single click delivers
> the PDF of the matching article, or displays it in the browser as the
> final rendered document (like the original article).
>
>
> Questions:
>
> 1. Is this possible with PostgreSQL? (store LaTeX code and images,
> search the text of the code plus keywords, and retrieve all of it for a
> LaTeX program to compile again)

I helped do something very like this, except that the documents
were text-based PDFs, and we used the poppler library to pull the
text out for processing.  Your job will be easier, because LaTeX is
already text, so you probably won't need to use anything as messy
to work with from within PostgreSQL as the C++ based poppler
library.  You might get away with parsing the LaTeX source
directly, and if not a plperl function should be fairly easy to
write (or adapt from LaTeX2HTML).
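As a rough sketch of parsing the LaTeX source directly (table and
column names here are invented for illustration), a pair of
regexp_replace() calls can strip LaTeX commands and special characters
before handing the text to to_tsvector():

```sql
-- Hypothetical: strip LaTeX commands (\section, \textbf, ...) and
-- brace/math characters from the source, then build a tsvector.
-- A crude approximation; nested or verbatim environments would need
-- something smarter, such as the plperl function suggested above.
SELECT to_tsvector('english',
         regexp_replace(
           regexp_replace(latex_src, '\\[a-zA-Z]+\*?', ' ', 'g'),
           '[{}$%&~_^]', ' ', 'g'))
FROM articles;  -- "articles" and "latex_src" are assumed names
```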

> 2. If the answer to question 1 is yes, how should I structure the
> database so that searches can be run against the LaTeX code?

After experimenting and benchmarking different options, we chose to
store the document and a tsvector derived from the text of the
document as two columns in a single table, creating a GIN index on
the tsvector column.  We needed some special parsing capabilities,
and found the custom parsing feature unusable; so we achieved that
by using regular expressions to find the necessary information,
which we built as text and cast to tsvector, concatenating the
result with the output of the normal parser/dictionary processing.
The dictionary chain included stop word processing, a snowball
stemmer, and a thesaurus for legal terms (e.g. "power of attorney"
is a phrase which should match that exact sequence of words on a
search much more closely than just having "power" and "attorney"
somewhere near each other in the document).  We were able to give
words in the title of the document higher priority by concatenating
the to_tsvector() of the title (using the priority parameter) with
the body.  Of course, triggers were used to maintain the tsvector
column.
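A minimal sketch of that layout, with invented table and column names
(current PostgreSQL spells the title-priority trick as setweight()):

```sql
-- Hypothetical schema: LaTeX source, rendered document, and a
-- derived tsvector kept as columns of one table.
CREATE TABLE article (
    id        serial PRIMARY KEY,
    title     text NOT NULL,
    latex_src text NOT NULL,
    pdf       bytea,          -- rendered document, if stored in-db
    tsv       tsvector
);

CREATE INDEX article_tsv_idx ON article USING gin (tsv);

-- Trigger function to maintain tsv; title words get the higher
-- 'A' weight, body words get 'D', mirroring the title priority
-- described above.
CREATE FUNCTION article_tsv_update() RETURNS trigger AS $$
BEGIN
    NEW.tsv := setweight(to_tsvector('english', NEW.title), 'A')
            || setweight(to_tsvector('english', NEW.latex_src), 'D');
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER article_tsv_trg
    BEFORE INSERT OR UPDATE ON article
    FOR EACH ROW EXECUTE PROCEDURE article_tsv_update();
```

Feeding latex_src to to_tsvector() unfiltered is a simplification;
stripping the LaTeX markup first, as discussed above, gives cleaner
lexemes.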

We got accuracy of results that the users liked, with an average
query speed of about 300ms from real-world searches against a large
database of legal documents.
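For reference, a ranked search over such a setup might look like the
following (assuming an invented table "article" with a GIN-indexed
tsvector column "tsv"):

```sql
-- Hypothetical ranked query: the GIN index answers the @@ match,
-- ts_rank orders the hits by how well they match the query.
SELECT id, title,
       ts_rank(tsv, query) AS rank
FROM article,
     plainto_tsquery('english', 'power of attorney') AS query
WHERE tsv @@ query
ORDER BY rank DESC
LIMIT 20;
```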

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company