Discussion: Multiple DB join
Hi All,

I have a database which consists of 20 million records, and I've split the db
up into 6-7 dbs. I have a base database which consists of the ids that link
all the databases. I'm performing searches on this single base table. After
searching, I get some ids which point to records in the other databases I
split off. Now I need to retrieve those records. Is there a way I can join
tables from multiple databases, the way we can join multiple tables in a
single database?

Thanks,
SA.
I think using the contrib module 'dblink' (http://developer.postgresql.org/cvsweb.cgi/pgsql/contrib/dblink/) can help you here.
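For reference, a minimal sketch of what a cross-database lookup with dblink might look like (the connection string, table, and column names here are hypothetical, not from the original poster's schema):

```sql
-- Hypothetical example: join local search hits in "base" against records
-- living in a second database, via the contrib/dblink set-returning function.
-- dblink(connstr, query) returns SETOF record, so the caller must supply
-- a column definition list in the alias.
SELECT b.doc_id, r.title, r.abs
FROM base b
JOIN dblink('dbname=dataset1 host=localhost',
            'SELECT doc_id, title, abs FROM records')
     AS r(doc_id integer, title text, abs text)
  ON b.doc_id = r.doc_id
WHERE b.db_name = 'dataset1';
```

Note that the remote query runs in full on the other database and its result is materialised locally before the join, so this is convenient but not free.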
Thanks,
--
Shoaib Mir
EnterpriseDB (www.enterprisedb.com)
On Mon, Aug 14, 2006 at 05:26:10PM -0400, Sumeet Ambre wrote:
> Hi All,
>
> I have a database which consists of 20 million records and I've split up
> the db into 6-7 dbs.

You can do this (as someone suggested with dblink), but I'm wondering
why the split? 20 million records isn't very big.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
Unfortunately reformatting the Internet is a little more painful than
reformatting your hard drive when it gets out of whack.
                --Scott Morris
On Tue, Aug 15, 2006 at 06:36:24AM -0400, Andrew Sullivan wrote:
> On Mon, Aug 14, 2006 at 05:26:10PM -0400, Sumeet Ambre wrote:
> > I have a database which consists of 20 million records and I've split up
> > the db into 6-7 dbs.
>
> You can do this (as someone suggested with dblink), but I'm wondering
> why the split? 20 million records isn't very big.

And why multiple databases instead of multiple schemas within the
same database? Or even all data in the same schema? Is there a
reason for the segregation?

--
Michael Fuhr
Michael Fuhr <mike@fuhr.org> writes:
> And why multiple databases instead of multiple schemas within the
> same database? Or even all data in the same schema? Is there a
> reason for the segregation?

I can think of a couple of reasons: spreading processing requirements is
one, and distributing load another. Disk space can be solved with new
disks and tablespaces, but we can't yet distribute the load through
several servers without partitioning.

--
Jorge Godoy <jgodoy@gmail.com>
On Tue, Aug 15, 2006 at 09:01:03AM -0300, Jorge Godoy wrote:
> I can think that spreading processing requirements should be one. And
> distributing load another one. Disk space can be solved with new disks and
> tablespaces, but we can't yet distribute the load through several servers
> without partitioning.

The cost of inter-node communication isn't nothing, though. It strikes
me as at least possible that the overhead of dblink is going to be
larger than whatever gains one makes from adding a new server. For only
20M rows, I find it pretty hard to believe the gain is going to be
demonstrable.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
Everything that happens in the world happens at some place.
                --Jane Jacobs
Andrew Sullivan <ajs@crankycanuck.ca> writes:
> On Tue, Aug 15, 2006 at 09:01:03AM -0300, Jorge Godoy wrote:
>> I can think that spreading processing requirements should be one. And
>> distributing load another one. Disk space can be solved with new disks and
>> tablespaces, but we can't yet distribute the load through several servers
>> without partitioning.
>
> The cost of inter-node communication isn't nothing, though. It
> strikes me as at least possible that the overhead of dblink is going
> to be larger than whatever gains one makes from adding a new server.
> For only 20M rows, I find it pretty hard to believe the gain is going
> to be demonstrable.

I totally agree with you. But when your base grows, it might start
getting interesting. I'd like to see some sort of data partitioning in
PostgreSQL.

--
Jorge Godoy <jgodoy@gmail.com>
Andrew Sullivan wrote:
> On Mon, Aug 14, 2006 at 05:26:10PM -0400, Sumeet Ambre wrote:
>> I have a database which consists of 20 million records and I've split up
>> the db into 6-7 dbs.
>
> You can do this (as someone suggested with dblink), but I'm wondering
> why the split? 20 million records isn't very big.

The design of the database is because our organization wants to split
different datasets into different entities, and there is a possibility
they'll run a separate instance of postgres for each dataset. The problem
is that one of the databases consists of 14 million records, and when I
query the base database, which consists of 20 million records, the query
runs very slowly. Below is the sample schema for my base table and the
query I try to run on it. It takes me more than 2-3 minutes to run a
query. Is there any way I could speed this up?

Sample schema for my base table:

 doc_id | title | authors | pub_year | abs | db_name | ...

In the above schema, the field db_name is the name of the other database
which contains the whole record. I'm trying to run a query which searches
on title, authors, pub_year, or abstract simultaneously, e.g.:

SELECT * FROM base
WHERE ((title ILIKE '%something%')
  AND (authors ~* '.*something.*')
  AND (db_name = 'something'));

Any suggestions to speed up this query?

Thanks,
Sumeet.
On Tue, Aug 15, 2006 at 10:11:41AM -0400, Sumeet Ambre wrote:
> The Design of the database is because our organization wants to split up
> different datasets into different entities, and there might be a
> possibility that they'll run different instances of postgres for each
> dataset.

It's this "different instances" thing that I'm having a tough time with.
Is this because they want to be able to query local things when
disconnected or something? I can think of applications for this, for
sure; I'm just suggesting that you make sure you're not building an
optimisation that is (1) premature and (2) possibly a pessimal operation.

> records, the query runs damn slow.......below is the sample schema for
> my base table and the query i try to run on it....it takes me more than
> 2-3 minutes to run a query....Is there any way i could speed this up......

The problem is not your design, nor even the size of the data exactly,
but the query:

> e.g. SELECT * FROM base WHERE ((title ILIKE '%something%') AND (authors
> ~* '.*something.*') AND (db_name='something'));

You have two initially-unbound search terms there: ILIKE '%' and ~* '.*'
are automatically seqscans, because you have nowhere in the index to
start. If you really want to do this kind of unbound-string query, you
need to look into full text search. The above approach is never going to
be fast.

--
Andrew Sullivan | ajs@crankycanuck.ca
This work was visionary and imaginative, and goes to show that visionary
and imaginative work need not end up well.
                --Dennis Ritchie
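As a rough illustration of the full-text route being suggested here, a sketch using the tsearch2 contrib module of that era (the column and index names are made up, and the exact function signatures vary between versions):

```sql
-- Add a tsvector column covering the searchable fields and index it
-- with GiST (tsearch2 contrib module assumed installed).
ALTER TABLE base ADD COLUMN fti tsvector;
UPDATE base
   SET fti = to_tsvector(coalesce(title, '') || ' ' || coalesce(authors, ''));
CREATE INDEX base_fti_idx ON base USING gist (fti);

-- A keyword search can then use the index instead of a seqscan:
SELECT doc_id, title
FROM base
WHERE fti @@ to_tsquery('something')
  AND db_name = 'dataset1';
```

In practice the tsvector column would also need a trigger to stay current as rows are inserted or updated.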
On Tue, Aug 15, 2006 at 10:25:00AM -0300, Jorge Godoy wrote:
> I'd like to see some sort of data partitioning in PostgreSQL.

Sure, I think everybody would. I think it's among the more interesting
problems we have.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
The plural of anecdote is not data.
                --Roger Brinner
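For completeness: what PostgreSQL did offer at the time, within a single server, was inheritance-based table partitioning with constraint exclusion (8.1 and later). A minimal sketch, with made-up table names, of how the poster's db_name split could have been expressed as partitions instead of separate databases:

```sql
-- Parent table plus one child per dataset; CHECK constraints let the
-- planner skip irrelevant children when constraint_exclusion is on.
CREATE TABLE base (doc_id integer, db_name text, title text, authors text);
CREATE TABLE base_ds1 (CHECK (db_name = 'dataset1')) INHERITS (base);
CREATE TABLE base_ds2 (CHECK (db_name = 'dataset2')) INHERITS (base);

SET constraint_exclusion = on;

-- With exclusion on, this query scans only base_ds1:
SELECT * FROM base WHERE db_name = 'dataset1';
```

This addresses splitting data, not splitting load across servers, which is the part the thread says PostgreSQL could not yet do.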
On 8/15/06, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
On Tue, Aug 15, 2006 at 10:11:41AM -0400, Sumeet Ambre wrote:
> >
> The Design of the database is because our organization wants to split up
> different datasets into different entities, and there might be a
> possibility that they'll run different instances of postgres for each
> dataset.
It's this "different instances" thing that I'm having a tough time
with. Is this because they want to be able to query local things
when disconnected or something? I can think of applications for
this, for sure, I'm just suggesting that you make sure you're not
building an optimisation that is (1) premature and (2) possibly a
pessimal operation.
The reason for splitting up the dbs into different instances is that if
one of the postgres instances on the server goes down for some reason, it
doesn't affect the other instances running on the same server. Even I
don't know the reason for this kind of approach, but I have to deal with
it.
> records, the query runs damn slow.......below is the sample schema for
> my base table and the query i try to run on it....it takes me more than
> 2-3 minutes to run a query....Is there any way i could speed this up......
The problem is not your design, nor even the size of the data
exactly, but the query:
> e.g. SELECT * FROM base WHERE ((title ILIKE '%something%') AND (authors
> ~* '.*something.*') AND (db_name='something'));
You have two initially-unbound search terms there: ILIKE '%' and ~*
'.*' are automatically seqscans, because you have nowhere in the
index to start. If you really want to do this kind of unbound-string
query, you need to look into full text search. The above approach is
never going to be fast.
I previously thought of doing the full text search indexing thing, but I
had an intuition that full text search is for fields which have very
large strings. In my case the strings are not above 200 chars in length,
so would full text search give me the performance we need? I also tried
combined indexes on title and authors in my base table. Would indexes of
any kind help me in this case?
Thanks for your prompt replies,
Sumeet.
On Tue, Aug 15, 2006 at 10:35:36AM -0400, Sumeet wrote:
> The reason for splitting up the dbs into differnet instances is that in case
> one of the postgres instances on the server goes down for some reason, it
> doesnt effect the other instances which are running on the same server. Even
> I dont know the reason for this kind of approach. But i've to deal with it.

I certainly sympathise with the situation where managers decide to solve
problems that never happen. Given that these are on the same server, it
is _way_ more likely that you'll introduce a problem due to running
several instances of the back end than that one instance of Postgres
will magically die (especially in such a way that other instances will
continue to work). But if Some Important Person decreed it, you have my
sympathy.

> I previously thought of doing the full text search indexing thing...but i
> had a intution that the full text search thing is for fields which have very
> large strings.......but in my case the strings are not above 200 chars in

No, it's for data where you're going to be searching in random pieces of
the text. I think you should look at tsearch2, probably.

If you had any way to enforce bounded searches, it'd be a different
matter: strings with initial matches but an unbound end are fast. (You
can do it the other way, too, by some tricks with reversing the strings.)

A

--
Andrew Sullivan | ajs@crankycanuck.ca
The fact that technology doesn't work is no bar to success in the
marketplace.
                --Philip Greenspun
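The string-reversal trick mentioned here can be sketched as follows (hypothetical index names; it also assumes a reverse() function, which was not built into PostgreSQL at the time and would have to be written, e.g. in PL/pgSQL):

```sql
-- A prefix match is bounded on the left, so a plain btree index works:
CREATE INDEX base_authors_idx ON base (authors);
SELECT * FROM base WHERE authors LIKE 'smith%';

-- A suffix match ('%smith') is unbounded on the left. Index the
-- reversed string instead, turning the suffix into a prefix:
CREATE INDEX base_authors_rev_idx ON base (reverse(authors));
SELECT * FROM base
WHERE reverse(authors) LIKE 'htims%';  -- authors ending in 'smith'
```

A match that is unbounded on both ends ('%smith%') cannot be helped by either index; that is the case where full text search is the right tool.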
On 8/15/06, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
> I previously thought of doing the full text search indexing thing...but i
> had a intution that the full text search thing is for fields which have very
> large strings.......but in my case the strings are not above 200 chars in
No, it's for data where you're going to be searching in random pieces
of the text. I think you should look at tsearch2, probably.
Will the full text search indexing help me achieve good speed in
searching keywords?
If you had any way to enforce bounded searches, it'd be a different
matter: strings with initial matches but an unbound end are fast.
(You can do it the other way, too, by some tricks with reversing the
strings.)
Can someone please elaborate a little on ways we can enforce bounded
searches? I'm basically trying a simple search, i.e. trying to match the
author names a user enters into an interface against the 20 million
records in my db. Can anyone suggest a good way to perform this kind of
search?
Sumeet.
On Fri, Aug 18, 2006 at 11:07:24AM -0400, Sumeet wrote:
> Will the full text search indexing help me achive a good speed in searching
> keywords???

I think this depends on how you use it.

> can someone plz ellaborate a little about ways we can enforce bounded
> searches?. I'm basically trying a simple search i.e. trying to find name of
> authors user enters into a interface against the 20 miliions records in my
> db. Can anyone suggest a good way to perform this kind of search ?.

The thing is, if they're keywords, why are you treating them as
fragments? Traditionally, keywords are not substrings, but full
barewords. A bareword match should be fast, because it's looking for the
whole string. So you shouldn't need the "%" characters.

Maybe the problem is that you have something like the following. If your
data is stored like this:

 subject  | keyword
 subject1 | keyword1 keyword2 keyword3

and you want every subject that matches on keyword2, then you have to
search it with SELECT subject ... WHERE keyword LIKE '%keyword2%'. The
reason you have to do that is that your data is badly normalised. Is
that it?

A

--
Andrew Sullivan | ajs@crankycanuck.ca
Information security isn't a technological problem. It's an economics
problem.
                --Bruce Schneier
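The normalisation point above can be sketched as follows: store one keyword per row in a child table, so the search becomes an indexable equality test instead of a substring scan (the table and column names here are illustrative, not from the poster's schema):

```sql
-- Badly normalised layout forces a substring scan:
--   SELECT subject FROM docs WHERE keyword LIKE '%keyword2%';

-- Normalised: one keyword per row, searchable by plain (indexed) equality.
CREATE TABLE docs (
    doc_id  integer PRIMARY KEY,
    subject text
);
CREATE TABLE doc_keywords (
    doc_id  integer REFERENCES docs (doc_id),
    keyword text NOT NULL
);
CREATE INDEX doc_keywords_kw_idx ON doc_keywords (keyword);

SELECT d.subject
FROM docs d
JOIN doc_keywords k ON k.doc_id = d.doc_id
WHERE k.keyword = 'keyword2';
```

The equality predicate is fully bounded, so the btree index can find matching rows directly rather than scanning every record.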