Re: [Fwd: PostgreSQL to index MEDLINE]

Поиск
Список
Период
Сортировка
От Oleg Bartunov
Тема Re: [Fwd: PostgreSQL to index MEDLINE]
Дата
Msg-id Pine.GSO.4.55.0305271035460.9108@ra.sai.msu.su
обсуждение исходный текст
Ответ на [Fwd: PostgreSQL to index MEDLINE]  (Justin Clift <justin@postgresql.org>)
Список pgsql-general
Olaf,

what's the number of unique words (you want to be searched) in your collection ?
You may try contrib/tsearch V2 from http://www.sai.msu.su/~megera/postgres/gist/
May be I miss something but it looks like rather ordinary search system.

    Oleg
On Tue, 27 May 2003, Justin Clift wrote:

> Hi everyone,
>
> Does anyone feel like assisting Olaf here?
>
> Regards and best wishes,
>
> Justin Clift
>
>
> -------- Original Message --------
> Subject: PostgreSQL to index MEDLINE
> Date: Wed, 21 May 2003 11:47:39 +0200
> From: Olaf Bininda-Emonds <Olaf.Bininda@tz.agrar.tu-muenchen.de>
> To: justin@postgresql.org
>
> Hi,
>
> I have a question concerning the suitability of PostgreSQL to yield a
> word index of MEDLINE, an XML-formatted database of over 12 000 000
> articles from the biomedical literature. A computer scientist friend
> suggested that PostgreSQL might be the answer that we require.
>
> Briefly, our project is as follows:
>
> We would like to establish word co-occurences within each MEDLINE
> entry, with one word representing a gene and the other word
> representing one or more user-input keywords of functional traits. As
> the MEDLINE database is over 30 GB in size (and split into 396 separate
> files), we need an efficient, indexed way to search it.
>
> Our specific needs are as follows:
>
> 1) An index of all the words in the fields "title", "abstract" (not
> always present), and "medical subject headings" such that I can search
> the index to reveal which article (represented by the MedlineID field)
> contains a desired word. This will have two purposes:
>
> 1a) To build a subindex detailing which human genes are being referred
> to in an article. This is slightly complicated, in part because of the
> ever-changing nomenclature of genes. Officially, a gene is known by an
> alphanumeric symbol (of up to eight characters; e.g., BRCA1) and a name
> of one or more words ("breast cancer 1, early onset"). Searching for
> the symbol itself is insufficient, either because the symbol has not
> been used (e.g., older references) or can have other, common meanings
> (e.g., the symbol for "pyruvate carboxylase" is PC). Therefore,
> determining whether a gene is being mentioned in a given article
> requires a combination of the gene symbol being present and/or a
> sufficient number of the words in its name ("sufficient" being
> determined by the length of the gene symbol). I have a perl script that
> can determine this; however, it currently searches through all 396
> files on a pattern matching basis. Using an index would obviously be
> that more efficient, allowing me to query for a lists of articles that
> contain for a given gene, the gene symbol and each of the words in the
> gene name. The intersection of these lists would then be the articles
> held to contain a reference to that gene.
>
> 1b) To enable dynamic searches of the index for user-input keywords of
> functional traits. Searching for phrases ("lipid metabolism") would be
> ideal, but not necessary.
>
> In both cases, the matches would be for exact words (i.e., no stem
> searches).
>
> 2) The program should run in a UNIX environment (as implemented in Mac
> OS X) as I will be writing my own (hideously-inefficient) perl script
> to access it for queries. Generally, the protocol will be as follows:
>
> a) input gene-article subindex into memory
> b) obtain list of user-input keywords (and phrases, which will be split
> into individual words if phrases cannot be searched for)
> c) query index for MEDLINE articles containing each keyword in turn;
> articles will be scored according to the number of keywords they
> contain (i.e., an article with two keywords is scored twice as heavily
> as an article with only one keyword)
> d) for each article in the list in (c), determine whether each of ca.
> 16000 genes is also mentioned in that article (via subindex); scores
> for genes will be positively or negatively incremented according to
> whether or not they are present
> e) for each gene, determine its association score (= number of articles
> in which it co-occurs with at least one keyword - number of articles
> containing a keyword but not that gene).
>
> Therefore, the index should be written to disk as it will be accessed
> repeatedly for each keyword.
>
> ---
>
> I would appreciate your advice as to whether PostgreSQL is suitable for
> this task and, if so, how best to implement it and set up the word
> index. I have attached a sample of the MEDLINE database for you.
>
> With thanks in advance for your help!
>
> Best,
>
> Olaf
>
> ------------------------------------------------------------------------
> -----------------
> Olaf Bininda-Emonds
> Lehrstuhl fЭr Tierzucht
> Technical University of Munich
> Alte Akademie 12
> 85354 Freising-Weihenstephan
> Germany
>
> Phone: +49 (0)8161 713741
> Fax: +49 (0)8161 713107
> e-mail: Olaf.Bininda@tierzucht.tum.de
> WWW: http://www.tierzucht.tum.de/Bininda-Emonds/
>
>
>

    Regards,
        Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

В списке pgsql-general по дате отправления:

Предыдущее
От: "Vincent Hikida"
Дата:
Сообщение: Re: newbie sql question...
Следующее
От: Justin Clift
Дата:
Сообщение: Re: PostgreSQL 7.3.1 on WinXP