Re: [NOVICE] Database of articles, LaTeX code and pictures
From        | Kevin Grittner
Subject     | Re: [NOVICE] Database of articles, LaTeX code and pictures
Date        |
Msg-id      | CACjxUsOxtLTiTvNfJxPE375O9zcWZcA1h3UtFwi6L=Jxs1d8Cw@mail.gmail.com
In reply to | [NOVICE] Database of articles, LaTeX code and pictures (philolilou <philolilou@free.fr>)
List        | pgsql-novice
On Wed, Jan 18, 2017 at 12:34 PM, philolilou <philolilou@free.fr> wrote:

> I wish to build a place to store articles (with pictures) that can
> easily be searched later.
>
> The articles I wish to store are articles from magazines, or
> interesting articles from the internet.
>
> For the magazine articles, I thought I would scan all interesting
> pages, run OCR (character recognition) on them, eventually convert
> them into LaTeX-formatted code, and insert everything into a
> PostgreSQL database.
>
> Once this is done, a PHP-driven website will run searches against
> the PostgreSQL database.
>
> For this application, I want to search in the following ways:
>
> -> I give a word or a topic, and the database searches the plain
> text of all the documents and returns results ranked by relevance,
> based on the number of times that word or topic appears in the
> article.
>
> -> By keywords: I specify keywords in the search and get all the
> articles matching those keywords.

I'm not clear on what you see as the difference between these.

> Once the search is made and results are found, a simple click
> delivers the PDF of the article, or displays it in the browser as
> the final rendered document (like the original article).
>
> Questions:
>
> 1. Is this possible with PostgreSQL? (Store LaTeX code and images,
> search the text of the code plus keywords, and retrieve all of it
> into a LaTeX program to compile it again.)

I helped do something very like this, except that the documents were
text-based PDFs, and we used the poppler library to pull the text out
for processing.  Your job will be easier, because LaTeX is already
text, so you probably won't need to use anything as messy to work with
from within PostgreSQL as the C++ based poppler library.  You might
get away with parsing the LaTeX source directly, and if not, a plperl
function should be fairly easy to write (or adapt from LaTeX2HTML).
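As a rough illustration of parsing the LaTeX source directly, one
could strip the markup with regular expressions before feeding the
text to the full-text machinery.  The function name and the regexes
below are invented for the sketch and handle only the simplest cases;
a real implementation (or a plperl wrapper around an existing LaTeX
parser) would need to do much more:

```sql
-- Hypothetical sketch: crude LaTeX-to-text cleanup before indexing.
CREATE FUNCTION latex_to_text(src text) RETURNS text
  LANGUAGE sql IMMUTABLE AS
$func$
  SELECT regexp_replace(
           regexp_replace(
             regexp_replace(src, '%[^\n]*', '', 'g'),        -- strip comments
             '\\[a-zA-Z]+\*?(\[[^\]]*\])?', ' ', 'g'),       -- strip \commands[opts]
           '[{}~$]', ' ', 'g');                              -- strip grouping/math chars
$func$;

SELECT to_tsvector('english',
                   latex_to_text('\section{Intro} Hello \emph{world}!'));
```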
> 2. If the answer to question 1 is yes, how can I structure the
> database so that I can search within the LaTeX code?

After experimenting with and benchmarking different options, we chose
to store the document and a tsvector derived from its text as two
columns in a single table, creating a GIN index on the tsvector
column.

We needed some special parsing capabilities, and found the custom
parsing feature unusable; so we achieved that by using regular
expressions to find the necessary information, which we built as text
and cast to tsvector, concatenating the result with the output of the
normal parser/dictionary processing.  The dictionary chain included
stop word processing, a snowball stemmer, and a thesaurus for legal
terms (e.g., "power of attorney" is a phrase which should match that
exact sequence of words in a search much more closely than just having
"power" and "attorney" somewhere near each other in the document).

We were able to give words in the title of the document higher
priority by applying setweight() to the to_tsvector() of the title and
concatenating that with the tsvector of the body.  Of course, triggers
were used to maintain the tsvector column.

We got accuracy of results that the users liked, with an average query
speed of about 300ms for real-world searches against a large database
of legal documents.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
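The layout described above can be sketched roughly as follows.  The
table, column, and function names are invented for illustration, the
dictionary is plain 'english' (no thesaurus or custom regex step), and
the weights are just one reasonable choice:

```sql
-- Single table holding the document and its derived tsvector.
CREATE TABLE article (
    id    serial PRIMARY KEY,
    title text NOT NULL,
    body  text NOT NULL,   -- the LaTeX source
    tsv   tsvector
);

-- Trigger function: title words get weight 'A' (ranked higher),
-- body words get weight 'D'.
CREATE FUNCTION article_tsv_update() RETURNS trigger
  LANGUAGE plpgsql AS
$func$
BEGIN
  NEW.tsv := setweight(to_tsvector('english', NEW.title), 'A')
          || setweight(to_tsvector('english', NEW.body),  'D');
  RETURN NEW;
END
$func$;

CREATE TRIGGER article_tsv_trg
  BEFORE INSERT OR UPDATE ON article
  FOR EACH ROW EXECUTE PROCEDURE article_tsv_update();

-- GIN index to make the @@ searches fast.
CREATE INDEX article_tsv_idx ON article USING gin (tsv);

-- A ranked search, most relevant articles first:
SELECT id, title, ts_rank(tsv, q) AS rank
  FROM article, plainto_tsquery('english', 'power of attorney') AS q
 WHERE tsv @@ q
 ORDER BY rank DESC
 LIMIT 10;
```

Because the weighted tsvector is stored and indexed, ts_rank() only
has to score the rows the GIN index already matched, which is what
keeps queries like this in the few-hundred-millisecond range on large
tables.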