Support for distributed queries with semijoins could be possible, if ...
От | Gunther Schadow |
---|---|
Тема | Support for distributed queries with semijoins could be possible, if ... |
Дата | |
Msg-id | 3D0393D0.2070208@aurora.regenstrief.org обсуждение исходный текст |
Ответы |
Re: Support for distributed queries with semijoins could be possible, if ...
(Tom Lane <tgl@sss.pgh.pa.us>)
|
Список | pgsql-sql |
Hi again pgsql gurus, if I want to do a distributed join query, I would like to have a way for streaming as much of it as possible. I think there is a modest addition to pgsql that could do that, what do you think? The principle is similar to the COPY FROM STDIN form, just that we would not INSERT that data but do an inner join on data piped to the engine from STDIN. Here is an example. CREATE TABLE Observation ( id OID NOT NULL PRIMARY KEY, patient_id OID NOT NULL REFERENCES Patient(id); type_code VARCHAR NOT NULL; value DOUBLE; ); and the assumption is that this table is horizontally partitioned over different systems, each covering different types of observations (partitioned on type_code clusters.) But all would use the same set of patient ids. Consider we want to query for patients with fever over 100 and white blood cell count under 1000. Without horizontal partitioning, we could write the following query (I am choosing this particular form to prepare the reader for the distributed execution.) SELECT fever.patient_id AS patient_id, fever.value AS fever, wbc.value AS wbc, FROM (SELECT patient_id,value FROM Observation fever WHERE fever.type_code = 'FEVER' AND fever.value > 100)fever INNER JOIN Observation wbc ON fever.patient_id = wbc.patient_id WHERE wbc.type_code = 'WHITE BLOOD CELLCOUNT' AND wbc.value < 1000; Now, consider that fever observations are on system A and white blood cell count observations are on system B, while A and B share the same set of patient ids. We could then distribute as follows: to system A: SELECT fever.patient_id AS patient_id, fever.value AS fever FROM Observation fever WHERE fever.type_code = 'FEVER' AND fever.value > 100; then we use a cursor on that query such that we can get results as early as posible for streaming those results into the next query to system B: SELECT fever.patient_id AS patient_id, fever.value AS fever, wbc.value AS wbc FROM (SELECT patient_id,value FROM STDIN) fever INNER JOIN Observation wbc ON fever.patient_id = wbc.patient_id WHEREwbc.type_code = 'WHITE BLOOD CELL COUNT' AND wbc.value < 1000; Notice the "SELECT ... FROM STDIN" form, which is reminescent of the "COPY ... FORM STDIN". Presumably on the interface one could use the same syntax for tuples as for the COPY FORM (e.g., DELIMITES AS '|' WITH NULL AS '') and then ship the tuples line by line ending with a \. How hard would it be to add such a feature? On the JDBC interface, I presume that a useful form would be to somehow pipe the ResultSet from the first query into the second query, but I don't know if JDBC provides such a mechanism. I think this could be very useful for the time until PostgreSQL can internally deal with horizontal partitioning over distributed data base nodes ;-). thanks, -Gunther -- Gunther Schadow, M.D., Ph.D. gschadow@regenstrief.org Medical Information Scientist Regenstrief Institute for Health Care Adjunct Assistant Professor Indiana University School of Medicine tel:1(317)630-7960 http://aurora.regenstrief.org
В списке pgsql-sql по дате отправления:
Предыдущее
От: Gunther SchadowДата:
Сообщение: Q: will GROUP BY make use of an index to return tuples early?