Discussion: Parallel queries in single transaction


Parallel queries in single transaction

From: Paul Muntyanu
Date:
Hello,

   I am working with a data warehouse based on PostgreSQL and would like to propose a feature. The idea is to give the developer control and the ability to execute queries in parallel within a single transaction. The usual flow is: START_TRANSACTION -> QUERY1 -> QUERY2 -> QUERY3 -> END_TRANSACTION. However, sometimes QUERY1 and QUERY2 are independent and can be executed in parallel. E.g.: START_TRANSACTION -> DEFINE_QUERY1(no execution) -> DEFINE_QUERY2(no execution) -> EXECUTE_QUERY1_AND_QUERY2(in parallel) -> QUERY3 -> END

Of course QUERY1 and QUERY2 can be dependent, and then this would not work, but sometimes it is useful, especially when a query is bound to a single resource, e.g. CPU, and gets stuck.
Going further, the PostgreSQL engine could possibly detect such cases by itself and run the queries in parallel, but that sounds too far off.

Here is also an example in Scala of how you can wait for several futures: https://stackoverflow.com/a/16257851/2439539

Syntax:

BEGIN;

-- create two temp tables because PREPARE does not support CTAS

-- query1
CREATE TEMP TABLE query1_result (…) ON COMMIT DROP;
PREPARE query1 (id int, val varchar) AS INSERT INTO query1_result (...) SELECT …;

-- query2
CREATE TEMP TABLE query2_result (…) ON COMMIT DROP;
PREPARE query2 (id int, val varchar) AS INSERT INTO query2_result (...) SELECT …;

-- proposed: run both prepared statements in parallel
EXECUTE PARALLEL (query1, query2);

-- query3
….

END;
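[Editorial note: something close to the proposed EXECUTE PARALLEL can already be approximated today with the contrib dblink module — dblink_send_query() submits a query on a separate connection and returns immediately, and dblink_get_result() blocks until it finishes. A rough sketch follows; the connection string, staging tables, and queries are illustrative. Note the caveat that each dblink connection is a separate session with its own transaction, so it cannot see the caller's temp tables and the single-transaction atomicity asked for in the proposal is lost.]

```sql
-- requires: CREATE EXTENSION dblink;

SELECT dblink_connect('c1', 'dbname=dwh');
SELECT dblink_connect('c2', 'dbname=dwh');

-- fire both queries without waiting for either to finish
SELECT dblink_send_query('c1', 'INSERT INTO query1_result SELECT …');
SELECT dblink_send_query('c2', 'INSERT INTO query2_result SELECT …');

-- block until each one completes and fetch its command status
SELECT * FROM dblink_get_result('c1') AS t(status text);
SELECT * FROM dblink_get_result('c2') AS t(status text);

SELECT dblink_disconnect('c1');
SELECT dblink_disconnect('c2');

-- query3 can now read query1_result and query2_result
```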


-Paul

Re: Parallel queries in single transaction

From: Tomas Vondra
Date:
Hi,

On 07/16/2018 09:45 AM, Paul Muntyanu wrote:
> Hello,
> 
>     I am working with data warehouse based on postgresql and would like 
> to propose a feature. The idea is to give control and ability for 
> developer to execute queries in parallel within single transaction. 
> Usual flow is next: START_TRANSACTION -> QUERY1 -> QUERY2 -> QUERY3 -> 
> END_TRANSACTION. However sometimes QUERY1 and QUERY2 are independent and 
> can be executed in parallel mode. E.g.: START_TRANSACTION -> 
> DEFINE_QUERY1(no execution) -> DEFINE_QUERY2(no_execution) -> 
> EXECUTE_QUERY1_AND_QUERY2(in parallel) -> QUERY3 -> END
> 
> Of course QUERY1 and QUERY2 can be dependent and then this would not 
> work, but sometimes it is useful, especially when you have bound to e.g. 
> CPU and query stuck.

I'm struggling to understand what the possible benefits would be. Either 
the queries are CPU-bound or stuck (e.g. waiting for I/O); they can't be 
both at the same time. If a query is stuck, running it concurrently is 
pointless. If they are CPU-bound, we can run them in parallel (which 
should produce the results faster).

I'd even dare to say that running the queries concurrently can easily 
hinder performance, because the queries will compete for parallel 
workers, preventing some of them from running in parallel mode.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Parallel queries in single transaction

From: Paul Muntyanu
Date:
Hi Tomas, thanks for looking into this. I am talking more about queries which cannot be optimized, e.g.:
* a full scan of one table plus heavy calculations for another one;
* both queries going through an FDW (e.g. one query fetches data from Kafka and the other from a remote Postgres; neither query is bound by anything except local CPU, the network, and the remote machine).

Being IO-bound is not a problem if you have multiple tablespaces. And being CPU-bound may not be the case when you have 32 cores and 6 max workers per query. During the nightly ETL I do not have anything running except a single query, i.e. 6 cores are occupied. If I could run queries in parallel, I would occupy two IO stacks (two tablespaces) + 12 cores, instead of 6 and then again 6 sequentially.

Hope that makes sense

-P


Re: Parallel queries in single transaction

From: Tomas Vondra
Date:

On 07/16/2018 12:03 PM, Paul Muntyanu wrote:
> Hi Tomas, thanks for looking into. I am more talking about queries which 
> can not be optimized, e.g.
> * fullscan of the table and heavy calculations for another one.
> * query through FDW for both queries(e.g. one query fetches data from 
> Kafka and another one is fetching from remote Postgres. There are no 
> bounds for both queries for anything except local CPU, network and 
> remote machine)
> 
> IO bound is not a problem in case if you have multiple tablesapces.

But it was you who mentioned "query stuck", not me. I merely pointed out 
that in such cases running the queries concurrently won't help.

> And CPU bound can be not the case when you have 32 cores and 6 max workers 
> per query. Then, during nigtly ETL, I do not have anything except single 
> query running) == 6 cores are occupied. If I can run queries in 
> parallel, I would occupy two IO stacks(two tablespaces) + 12 cores 
> instead of sequentially 6 and then again 6.
> 

Well, sure. But you could just as well open multiple connections and 
make the queries concurrent that way. Or change the GUC to increase the 
number of workers for the nightly ETL.
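
[Editorial note: the per-query worker limit referred to here is an ordinary GUC and can be raised just for the ETL session, within the server-wide worker budget. The values below are illustrative, not recommendations.]

```sql
-- per-session override, e.g. at the start of the nightly ETL job
SET max_parallel_workers_per_gather = 12;

-- the server-wide pools these workers are drawn from;
-- max_worker_processes can only be changed with a restart
SHOW max_parallel_workers;
SHOW max_worker_processes;
```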


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Parallel queries in single transaction

From: Paul Muntyanu
Date:

> Well, sure. But you could just as well open multiple connections and
> make the queries concurrent that way. Or change the GUC to increase the
> number of workers for the nightly ETL.



This is an option right now, using permanent staging tables for a future join. I mistakenly said ETL when it is ELT, which means most operations happen inside the database, so we try to keep all changes in the database code instead of changing the execution engine. In PG11 we have parallel CTAS, which is a dramatic improvement for us, but there will still be operations (query plans) which are not parallel.

Having PostgreSQL completely ACID is an amazing feature, so having to do the ELT operation outside a transaction, and to guarantee that the ELT job completed successfully by checking that all steps (multiple transactions with staging tables) succeeded (with graceful rollback + cleanup in case of failure), makes things more complex. Indeed, I still agree that it is possible to work around this at the application level.
-P

