Discussion: Mass-Data question
Hello friends,

I have a question. I am currently planning a new project that will collect a very large amount of data. My question is:

What should I do if the disk space is not enough? Is there a way to distribute data over several machines and gather it with a single SELECT statement when required? The stored information needs to be analyzed.

It is an enterprise-computing project, currently in the planning and early development phase. I want to use PostgreSQL, but how should I handle real mass data?

Don't tell me to add disk space; whatever we use, it's not enough. We need more than one machine, and we need to analyze the data across several machines, if possible with one SELECT statement... or is there a better way to handle this much data? Real-time analysis is important to us, so we cannot make the user wait.

Sorry for this question, but I am stuck on this problem.

--
Best regards, Boris Köster mailto:koester@x-itec.de
Boris Köster wrote:

> Hello friends,
>
> I have a question. Currently I am planning a new project that should
> collect really much data. My question is:
>
> What should I do if the disk-space is not enough? Is there something
> to distribute data over several machines and to collect the data with
> a select statement if required? The informations stored are needed to
> be analyzed.
>
> It is a enterprise-computing project currently in a development and a
> little bit planning phase, I want to use
> postgresql, but how should I handle real mass-data?
>
> Don't tell me to enhance disk-space, whatever we use it's not enough.
> We need more than one machine and we need to analyze the data over
> several machines if possible with one select statement... or is there
> a better idea how to handle really much data? Its important to us to
> have realtime-analysis so we can not let the user wait for whatever.
>
> Sorry for this question but I have a problem with that thingie.

Hmm, interesting. I have similar needs. Let's hear what the gurus have to say.

But asking independent of PostgreSQL, what do you want the RDBMS to do? You probably want a virtual shared disk storage, such as a RAID system to which you can connect multiple hosts. VMS clusters have that feature. The disks are independent of the hosts.

But then of course it's non-trivial to use multiple server hosts on the same database storage. Oracle can do something like that (but you pay heavy $$$).

So, what is it you want the system to do? Parallelize a single query over multiple hosts? I wouldn't count on that being available with PostgreSQL any time soon.

-Gunther

--
Gunther Schadow, M.D., Ph.D.         gschadow@regenstrief.org
Medical Information Scientist        Regenstrief Institute for Health Care
Adjunct Assistant Professor          Indiana University School of Medicine
tel:1(317)630-7960                   http://aurora.regenstrief.org
> Boris Köster wrote:
>
> > What should I do if the disk-space is not enough? Is there something
> > to distribute data over several machines and to collect the data with
> > a select statement if required?
>
> Hmm, interesting. I have similar needs.

As do I. Unfortunately, I'm not a guru. But I'll be testing out something like this in the next few weeks if all goes well. I was planning to do some fairly simple data partitioning. My initial plan is to drop the data into multiple tables across multiple servers, partitioned by date, and have a master table indicating the names of the various tables and the date ranges they cover.

The application will then deal with determining which tables the query will be spread across, construct and submit the appropriate queries (eventually in parallel, if I'm getting a lot of queries crossing multiple tables), and collate the results.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
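The date-partitioning plan in the message above can be sketched in a few lines. This is only an illustration under stated assumptions: the master table is modeled as an in-memory list, and the server names, table names, and the columns ts and payload are hypothetical, not taken from the thread.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Partition:
    server: str   # hypothetical host name
    table: str    # hypothetical table name
    start: date   # first day covered (inclusive)
    end: date     # first day NOT covered (exclusive)


# Stand-in for the "master table" that maps tables to date ranges.
MASTER = [
    Partition("db1", "events_2002_03", date(2002, 3, 1), date(2002, 4, 1)),
    Partition("db2", "events_2002_04", date(2002, 4, 1), date(2002, 5, 1)),
    Partition("db3", "events_2002_05", date(2002, 5, 1), date(2002, 6, 1)),
]


def partitions_for(start, end, master=MASTER):
    """Return the partitions whose range overlaps [start, end)."""
    return [p for p in master if p.start < end and start < p.end]


def build_queries(start, end, master=MASTER):
    """Build one (server, sql, params) tuple per overlapping partition."""
    queries = []
    for p in partitions_for(start, end, master):
        # Clip the requested range to what this partition actually holds.
        lo, hi = max(start, p.start), min(end, p.end)
        sql = (
            f"SELECT ts, payload FROM {p.table} "
            "WHERE ts >= %s AND ts < %s ORDER BY ts"
        )
        queries.append((p.server, sql, (lo, hi)))
    return queries


if __name__ == "__main__":
    for server, sql, params in build_queries(date(2002, 3, 20), date(2002, 4, 10)):
        print(server, sql, params)
```

The application would then submit each query to its server and collate the results; the identical ORDER BY on every query keeps that collation step simple.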
Hello Gunther,

Monday, April 15, 2002, 10:53:19 PM, you wrote:

GS> Boris Köster wrote:

GS> Hmm, interesting. I have similar needs. Let's hear what the gurus
GS> have to say. But asking independent of PostgreSQL, what do you want
GS> the RDBMS to do? You probably want a virtual shared disk storage,
GS> such as a RAID system to which you can connect multiple hosts. VMS
GS> clusters have that feature. The disks are independent of the hosts.

Yes, that sounds interesting as one option.

GS> But then of course it's non-trivial to use multiple server hosts on
GS> the same database storage. Oracle can do something like that (but
GS> you pay heavy $$$).

Hm, yes.

GS> So, what is it you want the system to do? Parallelize a single query
GS> over multiple hosts? I wouldn't count on that being available with
GS> PostgreSQL any time soon.

I am collecting up to 6000-7000 pieces of information within 1-3 seconds, 24 hours a day, 365 days a year. The hard drives may not be fast enough to collect the data; that's really heavy. All of this data must be analyzed in real time if required by the customer(s). A parallelized query is an interesting idea.

GS> -Gunther

--
Best regards, Boris mailto:koester@x-itec.de
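To put the figures above in perspective, here is a rough back-of-the-envelope calculation. It assumes that "6000-7000 within 1-3 seconds" means one burst of that many rows per interval, which is only one possible reading of the message; row size is not stated in the thread, so only row counts are computed.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365


def rows_per_second(rows_per_burst, burst_seconds):
    """Average insert rate for one burst of rows every burst_seconds."""
    return rows_per_burst / burst_seconds


def per_day(rate):
    return int(rate * SECONDS_PER_DAY)


def per_year(rate):
    return per_day(rate) * DAYS_PER_YEAR


# Slowest reading: 6000 rows every 3 s. Fastest reading: 7000 rows every 1 s.
low = rows_per_second(6000, 3)
high = rows_per_second(7000, 1)

if __name__ == "__main__":
    print(f"rate:     {low:,.0f} to {high:,.0f} rows/s")
    print(f"per day:  {per_day(low):,} to {per_day(high):,} rows")
    print(f"per year: {per_year(low):,} to {per_year(high):,} rows")
```

Under these assumptions the load is between 2,000 and 7,000 rows per second, or roughly 63 to 221 billion rows per year.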
Hello Curt,

Tuesday, April 16, 2002, 5:25:25 AM, you wrote:

>> Hmm, interesting. I have similar needs.

CS> As do I. Unfortunately, I'm not a guru. But I'll be testing out
CS> something like this in the next few weeks if all goes well. I was
CS> planning to do some fairly simple data partitioning. My initial
CS> plan is to drop the data into multiple tables across multiple
CS> servers, partitioned by date, and have a master table indicating
CS> the names of the various tables and the date ranges they cover.

Aha, interesting.

CS> The application will then deal with determining which tables the
CS> query will be spread across, construct and submit the appropriate
CS> queries (eventually in parallel, if I'm getting a lot of queries
CS> crossing multiple tables), and collate the results.

Parallel querying sounds very interesting to me. My current plan was to do parallel writing because the hard drives are not fast enough to collect all the data; your idea of parallel reading is very interesting.

I have written a C++ library to access MySQL and PostgreSQL databases. My OS is FreeBSD, but I think it should work with other OSes, too.

Parallelized reading/writing does not sound very complex in itself, but getting the results in the right order is a problem. Maybe I could collect data in parallel from several machines via threads and write the content to a (new) machine (?), as long as the number of rows is not higher than x, to avoid overrunning the disk. The advantage would be that, if this works, the feature could be used with both pgsql and mysql.

     ----------      ----------
       rdbms1         rdbms[n]
     ----------      ----------
         |               |
         |               |
         ---------------
         |
         |distributed writing for logfiles or similar into databases
         |
         |          ----------
         |--------  rdbms-tmp   temporary db-server (?)
         |          ----------  to analyze the data for parallelized
         |              |       reading like a temporary space... ?
         |              |
         |              |---- > Customer-Access for analyzing
     --------------
     Machine with Memory-Queue implementation for fast reading/writing
     "Collector for writing and distributing the content"
     --------------
         |
         | Internet
     ----------      ----------
      client1         client[n]
     ----------      ----------

What do the GURUs think about this? I need this functionality within the next 1-2 months, and I could try to code it as a C++ library. If the concept is not bogus, the only question left is whether I should release the source for free or not; this is no solution for a home user *gg I have no idea.

--
Best regards, Boris Köster mailto:koester@x-itec.de
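The "collector" box in the diagram above (an in-memory queue that takes in records and distributes writes across several database servers) could look roughly like the sketch below. It is written in Python rather than the C++ library mentioned in the message. The Collector class, the round-robin distribution, and the writer callables are all assumptions made for illustration; the thread does not specify how writes would be distributed.

```python
import queue
from itertools import cycle


class Collector:
    """Buffers records in memory and hands them to backends in batches."""

    def __init__(self, writers, batch_size=1000):
        if not writers:
            raise ValueError("at least one writer is required")
        self._queue = queue.Queue()        # thread-safe in-memory buffer
        self._writers = cycle(writers)     # round-robin over the backends
        self._batch_size = batch_size

    def submit(self, record):
        """Called by clients; returns immediately without touching a disk."""
        self._queue.put(record)

    def drain(self):
        """Write everything queued right now; return the number of records."""
        written = 0
        batch = []
        while True:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
            if len(batch) >= self._batch_size:
                written += self._send(batch)
                batch = []
        if batch:
            written += self._send(batch)
        return written

    def _send(self, batch):
        # Each writer is a hypothetical callable, e.g. one that runs a
        # multi-row INSERT or COPY against a single database server.
        writer = next(self._writers)
        writer(batch)
        return len(batch)


if __name__ == "__main__":
    rdbms1, rdbms2 = [], []                # lists stand in for real servers
    collector = Collector([rdbms1.append, rdbms2.append], batch_size=2)
    for i in range(5):
        collector.submit(i)
    print(collector.drain(), rdbms1, rdbms2)
```

In a real deployment drain() would run in a background thread or loop, and the batch size would be tuned so that each backend receives a few large writes rather than many small ones.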
On Tue, 16 Apr 2002, Boris Köster wrote:

> Normally it sounds not very complex to do parallelized
> reading/writing but getting the results in the right order that is a
> problem. Maybe I could collect data parallelized from several
> machines via threads, writing the content to a (new) machine (?) if the number of rows is
> not higher than x rows to avoid disk-overrun. The advantage could be
> that if this works, its possible to use that feature with pgsql+mysql.

Maybe you can use dblink to retrieve the results from the various "parallel servers" into one central server and then merge them (UNION, maybe?). That would work for simple SELECTs, but when you have a couple of triggers you start getting into trouble. Obviously you would have to split UPDATEs and INSERTs appropriately.

Who knows, maybe you can even get it to actually work.

--
Alvaro Herrera (<alvherre[@]atentus.com>)
"On the other flipper, one wrong move and we're Fatal Exceptions"
(T.U.X.: Term Unit X - http://www.thelinuxreview.com/TUX/)
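The dblink suggestion above can be illustrated by generating the central-server query. This sketch assumes the set-returning form of the contrib function, dblink(connstr, sql), used in a FROM clause with a column definition list; the interface available in a given PostgreSQL release may differ. The connection strings, the table events, and its columns are hypothetical. UNION ALL is used in place of plain UNION to avoid duplicate elimination, which is a choice made here and not something stated in the thread.

```python
def sql_literal(text):
    """Quote a string as an SQL literal by doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def build_union_query(conninfos, remote_sql, columns):
    """Build one SELECT per remote server and glue them with UNION ALL.

    conninfos  -- connection strings for the "parallel servers"
    remote_sql -- the SELECT to run on every remote server
    columns    -- (name, type) pairs describing the remote result rows
    """
    coldef = ", ".join(f"{name} {sqltype}" for name, sqltype in columns)
    parts = [
        f"SELECT * FROM dblink({sql_literal(conn)}, {sql_literal(remote_sql)}) "
        f"AS t({coldef})"
        for conn in conninfos
    ]
    return "\nUNION ALL\n".join(parts)


if __name__ == "__main__":
    print(build_union_query(
        ["host=db1 dbname=logs", "host=db2 dbname=logs"],
        "SELECT ts, payload FROM events WHERE ts >= '2002-04-01'",
        [("ts", "timestamp"), ("payload", "text")],
    ))
```

The generated statement would be run on the central server. An ORDER BY appended after the last branch would sort the combined result there.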
On Tue, 16 Apr 2002, Boris Köster wrote:

> Parallel querying sounds very interesting to me. My current plan was
> to do parallel writing because the hard-drives are not fast enough to
> collect all the data....

If it's really the hard drives that are not fast enough, you've got a serious problem. The raw write speed of a hard drive is much, much faster than Postgres. But even so, it sounds like you have basically the same problem as I do: how to get loads of data into the system really quickly.

> Normally it sounds not very complex to do parallelized
> reading/writing but getting the results in the right order that is a
> problem.

I don't see why. Just run the queries in parallel and merge the results as they come in. Just make sure you use the same ORDER BY on all the queries so you can do a merge sort.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
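The merge step described above is a k-way merge of already-sorted streams. A minimal sketch follows, assuming each server returns rows sorted by the same key (the effect of using the same ORDER BY everywhere). The row layout (ts, server) is made up for illustration.

```python
import heapq
from operator import itemgetter


def merge_sorted_results(result_sets, key):
    """Lazily merge several result sets that are each sorted by key.

    Only one pending row per result set is held in memory, so this also
    works with cursors that stream rows from the servers.
    """
    return heapq.merge(*result_sets, key=key)


if __name__ == "__main__":
    # Rows as (ts, server); each list is already ordered by ts.
    server_a = [(1, "a"), (4, "a"), (7, "a")]
    server_b = [(2, "b"), (4, "b"), (9, "b")]
    for row in merge_sorted_results([server_a, server_b], key=itemgetter(0)):
        print(row)
```

Because the merge is lazy, the first rows can be handed to the user before the slower servers have finished sending theirs.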