Discussion: Multiple DB join
Hi All,

I have a database which consists of 20 million records, and I've split the db
up into 6-7 dbs. I have a base database which consists of the ids that link
all the databases. I'm performing searches on this single base table. After
searching, I get some ids which point to records in the other databases I
split off. Now I need to retrieve those records. Is there a way I can join
tables from multiple databases, the way we can join multiple tables in a
single database?

Thanks,
SA.
I think using the contrib module 'dblink' (http://developer.postgresql.org/cvsweb.cgi/pgsql/contrib/dblink/) can help you here.
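For reference, a minimal sketch of what a cross-database lookup with dblink might look like (the connection string, table, and column names here are hypothetical, not from the original poster's schema):

```sql
-- Hypothetical example: join local search hits in "base" against records
-- living in a second database, via the contrib/dblink set-returning function.
-- dblink(connstr, query) returns SETOF record, so the caller must supply
-- a column definition list in the alias.
SELECT b.doc_id, r.title, r.abs
FROM base b
JOIN dblink('dbname=dataset1 host=localhost',
            'SELECT doc_id, title, abs FROM records')
     AS r(doc_id integer, title text, abs text)
  ON b.doc_id = r.doc_id
WHERE b.db_name = 'dataset1';
```

Note that the remote query runs in full on the other database and its result is materialised locally before the join, so this is convenient but not free.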
Thanks,
--
Shoaib Mir
EnterpriseDB (www.enterprisedb.com)
On Mon, Aug 14, 2006 at 05:26:10PM -0400, Sumeet Ambre wrote:
> Hi All,
>
> I have a database which consists of 20 million records and I've split up
> the db into 6-7 dbs.

You can do this (as someone suggested with dblink), but I'm wondering
why the split? 20 million records isn't very big.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
Unfortunately reformatting the Internet is a little more painful than
reformatting your hard drive when it gets out of whack.
                --Scott Morris
On Tue, Aug 15, 2006 at 06:36:24AM -0400, Andrew Sullivan wrote:
> On Mon, Aug 14, 2006 at 05:26:10PM -0400, Sumeet Ambre wrote:
> > I have a database which consists of 20 million records and I've split up
> > the db into 6-7 dbs.
>
> You can do this (as someone suggested with dblink), but I'm wondering
> why the split? 20 million records isn't very big.

And why multiple databases instead of multiple schemas within the
same database? Or even all data in the same schema? Is there a
reason for the segregation?

--
Michael Fuhr
Michael Fuhr <mike@fuhr.org> writes:
> And why multiple databases instead of multiple schemas within the
> same database? Or even all data in the same schema? Is there a
> reason for the segregation?

I can think of a couple of reasons: spreading processing requirements is
one, and distributing load another. Disk space can be solved with new
disks and tablespaces, but we can't yet distribute the load through
several servers without partitioning.

--
Jorge Godoy <jgodoy@gmail.com>
On Tue, Aug 15, 2006 at 09:01:03AM -0300, Jorge Godoy wrote:
> I can think that spreading processing requirements should be one. And
> distributing load another one. Disk space can be solved with new disks and
> tablespaces, but we can't yet distribute the load through several servers
> without partitioning.

The cost of inter-node communication isn't nothing, though. It strikes
me as at least possible that the overhead of dblink is going to be
larger than whatever gains one makes from adding a new server. For only
20M rows, I find it pretty hard to believe the gain is going to be
demonstrable.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
Everything that happens in the world happens at some place.
                --Jane Jacobs
Andrew Sullivan <ajs@crankycanuck.ca> writes:
> On Tue, Aug 15, 2006 at 09:01:03AM -0300, Jorge Godoy wrote:
>> I can think that spreading processing requirements should be one. And
>> distributing load another one. Disk space can be solved with new disks and
>> tablespaces, but we can't yet distribute the load through several servers
>> without partitioning.
>
> The cost of inter-node communication isn't nothing, though. It
> strikes me as at least possible that the overhead of dblink is going
> to be larger than whatever gains one makes from adding a new server.
> For only 20M rows, I find it pretty hard to believe the gain is going
> to be demonstrable.

I totally agree with you. But when your base grows, it might start
getting interesting. I'd like to see some sort of data partitioning in
PostgreSQL.

--
Jorge Godoy <jgodoy@gmail.com>
Andrew Sullivan wrote:
> On Mon, Aug 14, 2006 at 05:26:10PM -0400, Sumeet Ambre wrote:
>> I have a database which consists of 20 million records and I've split up
>> the db into 6-7 dbs.
>
> You can do this (as someone suggested with dblink), but I'm wondering
> why the split? 20 million records isn't very big.

The design of the database is because our organization wants to split
different datasets into different entities, and there is a possibility
they'll run a separate instance of postgres for each dataset. The problem
is that one of the databases consists of 14 million records, and when I
query the base database, which consists of 20 million records, the query
runs very slowly. Below is the sample schema for my base table and the
query I try to run on it. It takes me more than 2-3 minutes to run a
query. Is there any way I could speed this up?

Sample schema for my base table:

 doc_id | title | authors | pub_year | abs | db_name | ...

In the above schema, the field db_name is the name of the other database
which contains the whole record. I'm trying to run a query which searches
on title, authors, pub_year, or abstract simultaneously, e.g.:

SELECT * FROM base
WHERE ((title ILIKE '%something%')
  AND (authors ~* '.*something.*')
  AND (db_name = 'something'));

Any suggestions to speed up this query?

Thanks,
Sumeet.
On Tue, Aug 15, 2006 at 10:11:41AM -0400, Sumeet Ambre wrote:
> The Design of the database is because our organization wants to split up
> different datasets into different entities, and there might be a
> possibility that they'll run different instances of postgres for each
> dataset.

It's this "different instances" thing that I'm having a tough time with.
Is this because they want to be able to query local things when
disconnected or something? I can think of applications for this, for
sure; I'm just suggesting that you make sure you're not building an
optimisation that is (1) premature and (2) possibly a pessimal operation.

> records, the query runs damn slow.......below is the sample schema for
> my base table and the query i try to run on it....it takes me more than
> 2-3 minutes to run a query....Is there any way i could speed this up......

The problem is not your design, nor even the size of the data exactly,
but the query:

> e.g. SELECT * FROM base WHERE ((title ILIKE '%something%') AND (authors
> ~* '.*something.*') AND (db_name='something'));

You have two initially-unbound search terms there: ILIKE '%' and ~* '.*'
are automatically seqscans, because you have nowhere in the index to
start. If you really want to do this kind of unbound-string query, you
need to look into full text search. The above approach is never going to
be fast.

--
Andrew Sullivan | ajs@crankycanuck.ca
This work was visionary and imaginative, and goes to show that visionary
and imaginative work need not end up well.
                --Dennis Ritchie
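As a rough illustration of the full-text route being suggested here, a sketch using the tsearch2 contrib module of that era (the column and index names are made up, and the exact function signatures vary between versions):

```sql
-- Add a tsvector column covering the searchable fields and index it
-- with GiST (tsearch2 contrib module assumed installed).
ALTER TABLE base ADD COLUMN fti tsvector;
UPDATE base
   SET fti = to_tsvector(coalesce(title, '') || ' ' || coalesce(authors, ''));
CREATE INDEX base_fti_idx ON base USING gist (fti);

-- A keyword search can then use the index instead of a seqscan:
SELECT doc_id, title
FROM base
WHERE fti @@ to_tsquery('something')
  AND db_name = 'dataset1';
```

In practice the tsvector column would also need a trigger to stay current as rows are inserted or updated.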
On Tue, Aug 15, 2006 at 10:25:00AM -0300, Jorge Godoy wrote:
> I'd like to see some sort of data partitioning in PostgreSQL.

Sure, I think everybody would. I think it's among the more interesting
problems we have.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
The plural of anecdote is not data.
                --Roger Brinner
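For completeness: what PostgreSQL did offer at the time, within a single server, was inheritance-based table partitioning with constraint exclusion (8.1 and later). A minimal sketch, with made-up table names, of how the poster's db_name split could have been expressed as partitions instead of separate databases:

```sql
-- Parent table plus one child per dataset; CHECK constraints let the
-- planner skip irrelevant children when constraint_exclusion is on.
CREATE TABLE base (doc_id integer, db_name text, title text, authors text);
CREATE TABLE base_ds1 (CHECK (db_name = 'dataset1')) INHERITS (base);
CREATE TABLE base_ds2 (CHECK (db_name = 'dataset2')) INHERITS (base);

SET constraint_exclusion = on;

-- With exclusion on, this query scans only base_ds1:
SELECT * FROM base WHERE db_name = 'dataset1';
```

This addresses splitting data, not splitting load across servers, which is the part the thread says PostgreSQL could not yet do.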
On 8/15/06, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
On Tue, Aug 15, 2006 at 10:11:41AM -0400, Sumeet Ambre wrote:
> >
> The Design of the database is because our organization wants to split up
> different datasets into different entities, and there might be a
> possibility that they'll run different instances of postgres for each
> dataset.
It's this "different instances" thing that I'm having a tough time
with. Is this because they want to be able to query local things
when disconnected or something? I can think of applications for
this, for sure, I'm just suggesting that you make sure you're not
building an optimisation that is (1) premature and (2) possibly a
pessimal operation.
The reason for splitting up the dbs into different instances is that if
one of the postgres instances on the server goes down for some reason, it
doesn't affect the other instances running on the same server. Even I
don't know the reason for this kind of approach, but I have to deal with
it.
> records, the query runs damn slow.......below is the sample schema for
> my base table and the query i try to run on it....it takes me more than
> 2-3 minutes to run a query....Is there any way i could speed this up......
The problem is not your design, nor even the size of the data
exactly, but the query:
> e.g. SELECT * FROM base WHERE ((title ILIKE '%something%') AND (authors
> ~* '.*something.*') AND (db_name='something'));
You have two initially-unbound search terms there: ILIKE '%' and ~*
'.*' are automatically seqscans, because you have nowhere in the
index to start. If you really want to do this kind of unbound-string
query, you need to look into full text search. The above approach is
never going to be fast.
I previously thought of doing the full text search indexing thing, but I
had an intuition that full text search is for fields which have very
large strings. In my case the strings are not above 200 chars in length,
so would full text search give me the performance we need? I also tried
combined indexes on title and authors in my base table. Would indexes of
any kind help me in this case?
Thanks for your prompt replies,
Sumeet.
On Tue, Aug 15, 2006 at 10:35:36AM -0400, Sumeet wrote:
> The reason for splitting up the dbs into differnet instances is that in case
> one of the postgres instances on the server goes down for some reason, it
> doesnt effect the other instances which are running on the same server. Even
> I dont know the reason for this kind of approach. But i've to deal with it.

I certainly sympathise with the situation where managers decide to solve
problems that never happen. Given that these are on the same server, it
is _way_ more likely that you'll introduce a problem due to running
several instances of the back end than that one instance of Postgres
will magically die (especially in such a way that other instances will
continue to work). But if Some Important Person decreed it, you have my
sympathy.

> I previously thought of doing the full text search indexing thing...but i
> had a intution that the full text search thing is for fields which have very
> large strings.......but in my case the strings are not above 200 chars in

No, it's for data where you're going to be searching in random pieces of
the text. I think you should look at tsearch2, probably.

If you had any way to enforce bounded searches, it'd be a different
matter: strings with initial matches but an unbound end are fast. (You
can do it the other way, too, by some tricks with reversing the strings.)

A

--
Andrew Sullivan | ajs@crankycanuck.ca
The fact that technology doesn't work is no bar to success in the
marketplace.
                --Philip Greenspun
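The string-reversal trick mentioned here can be sketched as follows (hypothetical index names; it also assumes a reverse() function, which was not built into PostgreSQL at the time and would have to be written, e.g. in PL/pgSQL):

```sql
-- A prefix match is bounded on the left, so a plain btree index works:
CREATE INDEX base_authors_idx ON base (authors);
SELECT * FROM base WHERE authors LIKE 'smith%';

-- A suffix match ('%smith') is unbounded on the left. Index the
-- reversed string instead, turning the suffix into a prefix:
CREATE INDEX base_authors_rev_idx ON base (reverse(authors));
SELECT * FROM base
WHERE reverse(authors) LIKE 'htims%';  -- authors ending in 'smith'
```

A match that is unbounded on both ends ('%smith%') cannot be helped by either index; that is the case where full text search is the right tool.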
On 8/15/06, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
> I previously thought of doing the full text search indexing thing...but i
> had a intution that the full text search thing is for fields which have very
> large strings.......but in my case the strings are not above 200 chars in
No, it's for data where you're going to be searching in random pieces
of the text. I think you should look at tsearch2, probably.
Will the full text search indexing help me achieve good speed in
searching keywords?
If you had any way to enforce bounded searches, it'd be a different
matter: strings with initial matches but an unbound end are fast.
(You can do it the other way, too, by some tricks with reversing the
strings.)
Can someone please elaborate a little on ways we can enforce bounded
searches? I'm basically trying a simple search, i.e. trying to match the
author names a user enters into an interface against the 20 million
records in my db. Can anyone suggest a good way to perform this kind of
search?
Sumeet.
On Fri, Aug 18, 2006 at 11:07:24AM -0400, Sumeet wrote:
> Will the full text search indexing help me achive a good speed in searching
> keywords???

I think this depends on how you use it.

> can someone plz ellaborate a little about ways we can enforce bounded
> searches?. I'm basically trying a simple search i.e. trying to find name of
> authors user enters into a interface against the 20 miliions records in my
> db. Can anyone suggest a good way to perform this kind of search ?.

The thing is, if they're keywords, why are you treating them as
fragments? Traditionally, keywords are not substrings, but full
barewords. A bareword match should be fast, because it's looking for the
whole string. So you shouldn't need the "%" characters.

Maybe the problem is that you have something like the following. If your
data is stored like this:

 subject  | keyword
 subject1 | keyword1 keyword2 keyword3

and you want every subject that matches on keyword2, then you have to
search it with SELECT subject ... WHERE keyword LIKE '%keyword2%'. The
reason you have to do that is that your data is badly normalised. Is
that it?

A

--
Andrew Sullivan | ajs@crankycanuck.ca
Information security isn't a technological problem. It's an economics
problem.
                --Bruce Schneier
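The normalisation point above can be sketched as follows: store one keyword per row in a child table, so the search becomes an indexable equality test instead of a substring scan (the table and column names here are illustrative, not from the poster's schema):

```sql
-- Badly normalised layout forces a substring scan:
--   SELECT subject FROM docs WHERE keyword LIKE '%keyword2%';

-- Normalised: one keyword per row, searchable by plain (indexed) equality.
CREATE TABLE docs (
    doc_id  integer PRIMARY KEY,
    subject text
);
CREATE TABLE doc_keywords (
    doc_id  integer REFERENCES docs (doc_id),
    keyword text NOT NULL
);
CREATE INDEX doc_keywords_kw_idx ON doc_keywords (keyword);

SELECT d.subject
FROM docs d
JOIN doc_keywords k ON k.doc_id = d.doc_id
WHERE k.keyword = 'keyword2';
```

The equality predicate is fully bounded, so the btree index can find matching rows directly rather than scanning every record.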