Re: Greenplum MapReduce

From	Suvankar Roy
Subject	Re: Greenplum MapReduce
Date	August 3, 2009, 21:50:01
Msg-id	OFBBEFFCD1.F9B0CC18-ON65257607.001AFFD4-65257607.001B4553@tcs.com
In reply to	Greenplum MapReduce (Suvankar Roy <suvankar.roy@tcs.com>)
List	pgsql-performance

Hi Robert,

Thanks very much for your valuable input.

This spaces-and-tabs problem is killing me; it is pretty cumbersome, to say the least.
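
One way to make the cleanup less tedious is to expand the tabs mechanically instead of hunting for them by hand. The following is only a minimal sketch: `expand_leading_tabs` and its `width` of two spaces per tab are assumptions made for illustration, and a fixed width may not reproduce the nesting that was intended, so the result still needs a visual check.

```python
def expand_leading_tabs(text, width=2):
    """Replace tabs in each line's leading whitespace with `width` spaces.

    Tabs that appear after the first non-blank character (for example
    inside a quoted value) are left untouched.
    """
    fixed = []
    for line in text.splitlines(keepends=True):
        body = line.lstrip(" \t")
        indent = line[:len(line) - len(body)]
        fixed.append(indent.replace("\t", " " * width) + body)
    return "".join(fixed)


# Hypothetical fragment: the tab before "- INPUT:" becomes two spaces.
print(expand_leading_tabs("DEFINE:\n\t- INPUT:\n"))
```

Only the indentation is rewritten, so the embedded Python and SQL bodies keep their content; their relative indentation should still be reviewed afterwards.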

Regards,

Suvankar Roy

"Robert Mah" <rmah@pobox.com>
Sent by: Robert Mah <robert.mah@gmail.com>

08/02/2009 10:52 PM

To	"'Suvankar Roy'" <suvankar.roy@tcs.com>, <pgsql-performance@postgresql.org>
cc
Subject	RE: [PERFORM] Greenplum MapReduce

Suvankar:

Check your file for spaces vs tabs (one of them is bad and yes, it matters).
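
YAML does not allow tab characters for indentation, which is why the scanner reports a "character that cannot start any token". A small check along the following lines can point at the offending lines; it is a sketch only, and `find_tab_indentation` and the sample text are names made up for illustration.

```python
def find_tab_indentation(text):
    """Return (line_number, line) pairs whose leading whitespace contains a tab."""
    offenders = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Leading whitespace is everything before the first non-blank character.
        indent = line[:len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            offenders.append((number, line))
    return offenders


# Hypothetical fragment with a tab in front of "NAME".
sample = "DEFINE:\n  - INPUT:\n\tNAME: doc\n      TABLE: documents\n"
for number, line in find_tab_indentation(sample):
    print("line %d: %r" % (number, line))
```

Printing with `%r` shows the tab as `\t`, so it can be told apart from spaces at a glance.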

And as a personal aside, this is yet another reason I hate YAML.

Cheers,
Rob

From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Suvankar Roy
Sent: Thursday, July 30, 2009 8:25 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Greenplum MapReduce

Hi all,

Has anybody worked on Greenplum MapReduce programming?

I am facing a problem while trying to execute the Greenplum MapReduce program below, written in YAML.

The error is thrown at the 7th line:
Error: YAML syntax error - found character that cannot start any token while scanning for the next token, at line 7

I would appreciate it if somebody could explain this and suggest a potential solution.

%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_db1
USER: gpadmin
DEFINE:
  - INPUT:
      NAME: doc
      TABLE: documents
  - INPUT:
      NAME: kw
      TABLE: keywords
  - MAP:
      NAME: doc_map
      LANGUAGE: python
      FUNCTION: |
        i = 0
        terms = {}
        for term in data.lower().split():
          i = i + 1
          if term in terms:
            terms[term] += ','+str(i)
          else:
            terms[term] = str(i)
        for term in terms:
          yield([doc_id, term, terms[term]])
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - doc_id integer
        - data text
      RETURNS:
        - doc_id integer
        - term text
        - positions text
  - MAP:
      NAME: kw_map
      LANGUAGE: python
      FUNCTION: |
        i = 0
        terms = {}
        for term in keyword.lower().split():
          i = i + 1
          if term in terms:
            terms[term] += ','+str(i)
          else:
            terms[term] = str(i)
        # emit one row per distinct term; i is the keyword's total term count
        for term in terms:
          yield([keyword_id, i, term, terms[term]])
      OPTIMIZE: STRICT IMMUTABLE
      PARAMETERS:
        - keyword_id integer
        - keyword text
      RETURNS:
        - keyword_id integer
        - nterms integer
        - term text
        - positions text
  - TASK:
      NAME: doc_prep
      SOURCE: doc
      MAP: doc_map
  - TASK:
      NAME: kw_prep
      SOURCE: kw
      MAP: kw_map
  - INPUT:
      NAME: term_join
      QUERY: |
        SELECT doc.doc_id, kw.keyword_id, kw.term, kw.nterms,
               doc.positions as doc_positions,
               kw.positions as kw_positions
        FROM doc_prep doc INNER JOIN kw_prep kw ON (doc.term = kw.term)
  - REDUCE:
      NAME: term_reducer
      TRANSITION: term_transition
      FINALIZE: term_finalizer
  - TRANSITION:
      NAME: term_transition
      LANGUAGE: python
      PARAMETERS:
        - state text
        - term text
        - nterms integer
        - doc_positions text
        - kw_positions text
      FUNCTION: |
        if state:
          kw_split = state.split(':')
        else:
          kw_split = []
          for i in range(0,nterms):
            kw_split.append('')
        for kw_p in kw_positions.split(','):
          kw_split[int(kw_p)-1] = doc_positions
        outstate = kw_split[0]
        for s in kw_split[1:]:
          outstate = outstate + ':' + s
        return outstate
  - FINALIZE:
      NAME: term_finalizer
      LANGUAGE: python
      RETURNS:
        - count integer
      MODE: MULTI
      FUNCTION: |
        if not state:
          return 0
        kw_split = state.split(':')
        previous = None
        for i in range(0,len(kw_split)):
          isplit = kw_split[i].split(',')
          if any(map(lambda x: x == '', isplit)):
            return 0
          adjusted = set(map(lambda x: int(x)-i, isplit))
          if (previous):
            previous = adjusted.intersection(previous)
          else:
            previous = adjusted
        if previous:
          return len(previous)
        return 0
  - TASK:
      NAME: term_match
      SOURCE: term_join
      REDUCE: term_reducer
  - INPUT:
      NAME: final_output
      QUERY: |
        SELECT doc.*, kw.*, tm.count
        FROM documents doc, keywords kw, term_match tm
        WHERE doc.doc_id = tm.doc_id
          AND kw.keyword_id = tm.keyword_id
          AND tm.count > 0
EXECUTE:
  - RUN:
      SOURCE: final_output
      TARGET: STDOUT
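
To check the Python logic separately from the YAML syntax problem, the map, transition, and finalize functions can be traced in plain Python outside Greenplum. The sketch below assumes the job's logic is as posted and is not Greenplum API: `count_phrase` is a hypothetical helper, its in-memory loop only stands in for the `term_join` query and the reduce step, and keyword terms are computed by reusing `doc_map`, which applies the same position logic as `kw_map`.

```python
def doc_map(doc_id, data):
    """Yield [doc_id, term, positions] with 1-based word positions."""
    i = 0
    terms = {}
    for term in data.lower().split():
        i += 1
        if term in terms:
            terms[term] += "," + str(i)
        else:
            terms[term] = str(i)
    for term in terms:
        yield [doc_id, term, terms[term]]


def term_transition(state, nterms, doc_positions, kw_positions):
    """Store the document positions in the slot of each keyword position."""
    kw_split = state.split(":") if state else [""] * nterms
    for kw_p in kw_positions.split(","):
        kw_split[int(kw_p) - 1] = doc_positions
    return ":".join(kw_split)


def term_finalizer(state):
    """Count the document offsets where all keyword terms appear in order."""
    if not state:
        return 0
    previous = None
    for i, positions in enumerate(state.split(":")):
        isplit = positions.split(",")
        if any(x == "" for x in isplit):
            return 0  # a keyword term never appeared in the document
        adjusted = set(int(x) - i for x in isplit)
        previous = adjusted & previous if previous else adjusted
    return len(previous) if previous else 0


def count_phrase(document, keyword):
    """Hypothetical driver: join on term in memory, then reduce."""
    doc_terms = {term: pos for _, term, pos in doc_map(1, document)}
    kw_terms = {term: pos for _, term, pos in doc_map(1, keyword)}
    nterms = len(keyword.split())
    state = None
    for term, kw_positions in kw_terms.items():
        if term in doc_terms:  # stands in for the INNER JOIN on term
            state = term_transition(state, nterms, doc_terms[term], kw_positions)
    return term_finalizer(state)


print(count_phrase("the quick brown fox", "quick brown"))
print(count_phrase("the quick brown fox", "brown quick"))
```

The state string holds one colon-separated slot per keyword term, and the finalizer shifts each slot's positions back by its index, so a phrase matches wherever the shifted sets intersect.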

Regards,

Suvankar Roy
=====-----=====-----=====
Notice: The information contained in this e-mail
message and/or attachments to it may contain
confidential or privileged information. If you are
not the intended recipient, any dissemination, use,
review, distribution, printing or copying of the
information contained in this e-mail message
and/or attachments to it are strictly prohibited. If
you have received this communication in error,
please notify us by reply e-mail or telephone and
immediately and permanently delete the message
and any attachments. Thank you

