[Fwd: PostgreSQL to index MEDLINE]

Поиск
Список
Период
Сортировка
От Justin Clift
Тема [Fwd: PostgreSQL to index MEDLINE]
Дата
Msg-id 3ED2EB2E.4060509@postgresql.org
обсуждение исходный текст
Ответы Re: [Fwd: PostgreSQL to index MEDLINE]  (Oleg Bartunov <oleg@sai.msu.su>)
Список pgsql-general
Hi everyone,

Does anyone feel like assisting Olaf here?

Regards and best wishes,

Justin Clift


-------- Original Message --------
Subject: PostgreSQL to index MEDLINE
Date: Wed, 21 May 2003 11:47:39 +0200
From: Olaf Bininda-Emonds <Olaf.Bininda@tz.agrar.tu-muenchen.de>
To: justin@postgresql.org

Hi,

I have a question concerning the suitability of PostgreSQL to yield a
word index of MEDLINE, an XML-formatted database of over 12 000 000
articles from the biomedical literature. A computer scientist friend
suggested that PostgreSQL might be the answer that we require.

Briefly, our project is as follows:

We would like to establish word co-occurences within each MEDLINE
entry, with one word representing a gene and the other word
representing one or more user-input keywords of functional traits. As
the MEDLINE database is over 30 GB in size (and split into 396 separate
files), we need an efficient, indexed way to search it.

Our specific needs are as follows:

1) An index of all the words in the fields "title", "abstract" (not
always present), and "medical subject headings" such that I can search
the index to reveal which article (represented by the MedlineID field)
contains a desired word. This will have two purposes:

1a) To build a subindex detailing which human genes are being referred
to in an article. This is slightly complicated, in part because of the
ever-changing nomenclature of genes. Officially, a gene is known by an
alphanumeric symbol (of up to eight characters; e.g., BRCA1) and a name
of one or more words ("breast cancer 1, early onset"). Searching for
the symbol itself is insufficient, either because the symbol has not
been used (e.g., older references) or can have other, common meanings
(e.g., the symbol for "pyruvate carboxylase" is PC). Therefore,
determining whether a gene is being mentioned in a given article
requires a combination of the gene symbol being present and/or a
sufficient number of the words in its name ("sufficient" being
determined by the length of the gene symbol). I have a perl script that
can determine this; however, it currently searches through all 396
files on a pattern matching basis. Using an index would obviously be
that more efficient, allowing me to query for a lists of articles that
contain for a given gene, the gene symbol and each of the words in the
gene name. The intersection of these lists would then be the articles
held to contain a reference to that gene.

1b) To enable dynamic searches of the index for user-input keywords of
functional traits. Searching for phrases ("lipid metabolism") would be
ideal, but not necessary.

In both cases, the matches would be for exact words (i.e., no stem
searches).

2) The program should run in a UNIX environment (as implemented in Mac
OS X) as I will be writing my own (hideously-inefficient) perl script
to access it for queries. Generally, the protocol will be as follows:

a) input gene-article subindex into memory
b) obtain list of user-input keywords (and phrases, which will be split
into individual words if phrases cannot be searched for)
c) query index for MEDLINE articles containing each keyword in turn;
articles will be scored according to the number of keywords they
contain (i.e., an article with two keywords is scored twice as heavily
as an article with only one keyword)
d) for each article in the list in (c), determine whether each of ca.
16000 genes is also mentioned in that article (via subindex); scores
for genes will be positively or negatively incremented according to
whether or not they are present
e) for each gene, determine its association score (= number of articles
in which it co-occurs with at least one keyword - number of articles
containing a keyword but not that gene).

Therefore, the index should be written to disk as it will be accessed
repeatedly for each keyword.

---

I would appreciate your advice as to whether PostgreSQL is suitable for
this task and, if so, how best to implement it and set up the word
index. I have attached a sample of the MEDLINE database for you.

With thanks in advance for your help!

Best,

Olaf

------------------------------------------------------------------------
-----------------
Olaf Bininda-Emonds
Lehrstuhl für Tierzucht
Technical University of Munich
Alte Akademie 12
85354 Freising-Weihenstephan
Germany

Phone: +49 (0)8161 713741
Fax: +49 (0)8161 713107
e-mail: Olaf.Bininda@tierzucht.tum.de
WWW: http://www.tierzucht.tum.de/Bininda-Emonds/


--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi


Вложения

В списке pgsql-general по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: ERROR: Memory exhausted in AllocSetAlloc(188)
Следующее
От: Jason Ziegler
Дата:
Сообщение: newbie sql question...