Support for distributed queries with semijoins could be possible, if ...

Поиск
Список
Период
Сортировка
От Gunther Schadow
Тема Support for distributed queries with semijoins could be possible, if ...
Дата
Msg-id 3D0393D0.2070208@aurora.regenstrief.org
обсуждение исходный текст
Ответы Re: Support for distributed queries with semijoins could be possible, if ...  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-sql
Hi again pgsql gurus,

if I want to do a distributed join query, I would like to have a
way for streaming as much of it as possible. I think there is a
modest addition to pgsql that could do that, what do you think?

The principle is similar to the COPY FROM STDIN form, just that
we would not INSERT that data but do an inner join on data piped
to the engine from STDIN.

Here is an example.

CREATE TABLE Observation (  id          OID     NOT NULL PRIMARY KEY,  patient_id  OID     NOT NULL REFERENCES
Patient(id); type_code   VARCHAR NOT NULL;  value       DOUBLE;
 
);

and the assumption is that this table is horizontally partitioned
over different systems, each covering different types of
observations (partitioned on type_code clusters.) But all
would use the same set of patient ids.

Consider we want to query for patients with fever over 100 and
white blood cell count under 1000. Without horizontal partitioning,
we could write the following query (I am choosing this particular
form to prepare the reader for the distributed execution.)

SELECT fever.patient_id AS patient_id,       fever.value      AS fever,       wbc.value        AS wbc,  FROM (SELECT
patient_id,value          FROM Observation fever         WHERE fever.type_code  = 'FEVER'           AND fever.value >
100)fever     INNER JOIN Observation wbc        ON fever.patient_id = wbc.patient_id WHERE wbc.type_code = 'WHITE BLOOD
CELLCOUNT'   AND wbc.value < 1000;
 

Now, consider that fever observations are on system A and white
blood cell count observations are on system B, while A and B
share the same set of patient ids. We could then distribute as
follows:

to system A:

SELECT fever.patient_id AS patient_id,       fever.value      AS fever  FROM Observation fever WHERE fever.type_code =
'FEVER'  AND fever.value > 100;
 

then we use a cursor on that query such that we can get results
as early as posible for streaming those results into the next
query to system B:

SELECT fever.patient_id AS patient_id,       fever.value      AS fever,       wbc.value        AS wbc  FROM (SELECT
patient_id,value          FROM STDIN) fever     INNER JOIN Observation wbc        ON fever.patient_id = wbc.patient_id
WHEREwbc.type_code = 'WHITE BLOOD CELL COUNT'   AND wbc.value < 1000;
 

Notice the "SELECT ... FROM STDIN" form, which is reminescent of
the "COPY ... FORM STDIN". Presumably on the interface one could
use the same syntax for tuples as for the COPY FORM (e.g., DELIMITES
AS '|' WITH NULL AS '') and then ship the tuples line by line ending
with a \.

How hard would it be to add such a feature?

On the JDBC interface, I presume that a useful form would be to
somehow pipe the ResultSet from the first query into the second
query, but I don't know if JDBC provides such a mechanism.

I think this could be very useful for the time until PostgreSQL
can internally deal with horizontal partitioning over distributed
data base nodes ;-).

thanks,
-Gunther


-- 
Gunther Schadow, M.D., Ph.D.                    gschadow@regenstrief.org
Medical Information Scientist      Regenstrief Institute for Health Care
Adjunct Assistant Professor        Indiana University School of Medicine
tel:1(317)630-7960                         http://aurora.regenstrief.org




В списке pgsql-sql по дате отправления:

Предыдущее
От: Gunther Schadow
Дата:
Сообщение: Q: will GROUP BY make use of an index to return tuples early?
Следующее
От: Gunther Schadow
Дата:
Сообщение: Re: select failure