Обсуждение: Fwd: Array of integer indexed nested-loop semi join

Поиск
Список
Период
Сортировка

Fwd: Array of integer indexed nested-loop semi join

От
Mickael van der Beek
Дата:
Hello everyone,

1) Context

I'm working with large tables containing arrays of integers, indexed with "gin__int_ops" GIN indexes offered by the "intarray" extension.
The goal I'm trying to achieve is to do a "nested loop semi join" using the array inclusion operation (@>) as join condition but in an indexed way.
(Basically an INNER JOIN without the duplicate rows and without needing to use columns from the joined table.)

2) Configuration

The queries are run on a PostgreSQL v14 server with 32GB RAM and 8 vCPUs on a 64 bit ARM Neoverse architecture (m6g.2xlarge AWS RDS instance).
PostgreSQL's configuration uses the following key values:

  • work_mem = 8GB (only set for this query)
  • shared_buffers = 8GB
  • effective_cache_size = 22GB
  • max_worker_processes = 8
  • max_parallel_workers_per_gather = 4
  • jit = on
3) Tables

The "light_pages_attributes" contains about 2 million rows, each with an "attributes" column containing on average 20 integers.

CREATE TABLE
  light_pages_attributes
  (
    id            INTEGER   NOT NULL,
    "attributes"  INTEGER[] NOT NULL
  )
;
CREATE INDEX
  light_pages_attributes_attributes
ON
  light_pages_attributes
USING
  gin
  (
    attributes gin__int_ops
  )
;

The "light_pages_views" contains about 25 million rows, each with a "page_ids" column containing on average 20 integers as well.

CREATE TABLE
  light_pages_views
  (
    vector_id     BIGINT    NOT NULL,
    page_ids      INTEGER[] NOT NULL
  )
;
CREATE INDEX
  light_pages_views_page_ids
ON
  light_pages_views
USING
  gin
  (
    page_ids gin__int_ops
  )
;

4) Query

The query I'm trying to optimise is the following:

BEGIN; 
 
SET LOCAL work_mem = '8GB'; 
 
CREATE TEMPORARY VIEW
  urls
  AS
  (
    SELECT ARRAY[lpa.id]
        AS page_id
      FROM
        light_pages_attributes
          AS lpa
      WHERE
        lpa."attributes" @> ARRAY[189376]
  );
EXPLAIN (
  ANALYZE,
  VERBOSE,
  COSTS,
  BUFFERS,
  TIMING
)
SELECT
  COUNT(*)
FROM
  light_pages_views
    AS lpv
WHERE
  EXISTS (
    SELECT
      1
    FROM
      urls
        AS u
    WHERE
      lpv.page_ids @> u.page_id
  )
 
COMMIT;

The last query does not finish after waiting for more than 15 minutes.
(The temporary view creation is very fast and required due to the same query in a CTE greatly reducing performance (by more than 5 min.) due to the optimisation barrier I'm guessing.)
This alternative query, which should be far slower due to the fact that it generates duplicate lines through the INNER JOIN, is in fact much faster, 1 min. and 39 s.:

EXPLAIN (
  ANALYZE,
  VERBOSE,
  COSTS,
  BUFFERS,
  TIMING
)
SELECT
  COUNT(*)
FROM
  (
    SELECT
      1
    FROM
      light_pages_views
        AS lpv
    INNER JOIN
      urls
        AS u
        ON lpv.page_ids @> u.page_id
    GROUP BY
      lpv.vector_id
  )
    AS t
;

Visual query plan: https://explain.dalibo.com/plan/bc3#plan
Raw query plan: https://explain.dalibo.com/plan/bc3#raw

Other strategies I've tried as well:
  • lpv.page_ids @> ANY(SELECT u.page_id FROM urls AS u)
  • FULL OUTER JOIN, not possible due to the condition not being merge-joinable
The end-goal would be to update all matching "light_pages_views" rows by appending an integer to their array of integer.
So possibly millions of tows to be updated. 

Thank you a lot in advance for your help!

Mickael

Re: Array of integer indexed nested-loop semi join

От
Jeff Janes
Дата:
On Wed, Apr 27, 2022 at 8:19 AM Mickael van der Beek <mickael.van.der.beek@gmail.com> wrote:

The last query does not finish after waiting for more than 15 minutes.
(The temporary view creation is very fast and required due to the same query in a CTE greatly reducing performance (by more than 5 min.) due to the optimisation barrier I'm guessing.)

How much over 15 minutes?  20 minutes doesn't seem that long to wait to get a likely definitive answer.  But at the least show us the EXPLAIN without ANALYZE of it, that should take no milliseconds.

And what does it mean for something to take 5 minutes longer than "never finishes"?

(Also, putting every or every other token on a separate line does not make it easier to read)

Cheer,

Jeff

Re: Array of integer indexed nested-loop semi join

От
Mickael van der Beek
Дата:
Hello Jeff,

I have waited a few hours without the query ever finishing which is the reason I said "never finishes".
Especially because the INNER JOIN version finishes within a few minutes while being combinatorial and less efficient.
The query probably only does sequential scans.

You will find the query plan using EXPLAIN here:

Thanks for your help,

Mickael

On Wed, Apr 27, 2022 at 4:28 PM Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Apr 27, 2022 at 8:19 AM Mickael van der Beek <mickael.van.der.beek@gmail.com> wrote:

The last query does not finish after waiting for more than 15 minutes.
(The temporary view creation is very fast and required due to the same query in a CTE greatly reducing performance (by more than 5 min.) due to the optimisation barrier I'm guessing.)

How much over 15 minutes?  20 minutes doesn't seem that long to wait to get a likely definitive answer.  But at the least show us the EXPLAIN without ANALYZE of it, that should take no milliseconds.

And what does it mean for something to take 5 minutes longer than "never finishes"?

(Also, putting every or every other token on a separate line does not make it easier to read)

Cheer,

Jeff



--

Mickael van der Beek

Web developer & Security analyst

Re: Array of integer indexed nested-loop semi join

От
Mickael van der Beek
Дата:
Hello Jeff,

Sorry for the delay, here are the EXPLAIN ANALYSE results for one single row in the inner-query:

Nested Loop Semi Join  (cost=10000000993.81..10004731160.70 rows=536206 width=28) (actual time=93765.182..93765.183 rows=0 loops=1)
  Output: fu.w2_page_idxs
  Join Filter: (fu.w2_page_idxs && (ARRAY[fact_pages.idx]))
  Rows Removed by Join Filter: 53762825
  Buffers: shared hit=569194 read=2821768
  I/O Timings: read=56586.955
  ->  Seq Scan on public.fact_users fu  (cost=10000000000.00..10003925857.68 rows=53620568 width=28) (actual time=79.139..67423.779 rows=53762825 loops=1)
        Output: fu.w2_page_idxs
        Buffers: shared hit=567884 read=2821768
        I/O Timings: read=56586.955
  ->  Materialize  (cost=993.81..994.50 rows=1 width=32) (actual time=0.000..0.000 rows=1 loops=53762825)
        Output: (ARRAY[fact_pages.idx])
        Buffers: shared hit=148
        ->  Limit  (cost=993.81..994.48 rows=1 width=32) (actual time=26.382..26.383 rows=1 loops=1)
              Output: (ARRAY[fact_pages.idx])
              Buffers: shared hit=148
              ->  Bitmap Heap Scan on public.fact_pages  (cost=993.81..70645.00 rows=103556 width=32) (actual time=26.378..26.379 rows=1 loops=1)
                    Output: ARRAY[fact_pages.idx]
                    Recheck Cond: (fact_pages.attribute_idxs && '{300000160}'::integer[])
                    Heap Blocks: exact=1
                    Buffers: shared hit=148
                    ->  Bitmap Index Scan on fact_pages_attribute_idxs_int  (cost=0.00..967.92 rows=103556 width=0) (actual time=14.865..14.865 rows=101462 loops=1)
                          Index Cond: (fact_pages.attribute_idxs && '{300000160}'::integer[])
                          Buffers: shared hit=147
Query Identifier: 6779965332684941204
Planning:
  Buffers: shared hit=2
Planning Time: 0.162 ms
JIT:
  Functions: 10
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 1.507 ms, Inlining 9.797 ms, Optimization 54.902 ms, Emission 14.314 ms, Total 80.521 ms
Execution Time: 93766.772 ms

Query:

EXPLAIN (
  ANALYZE,
  VERBOSE,
  COSTS,
  BUFFERS,
  TIMING
)
SELECT
  fu.w2_page_idxs
FROM
  fact_users
    AS fu
WHERE
  EXISTS (
    SELECT
    FROM
      (
        SELECT
          ARRAY[idx] AS page_idx
        FROM
          fact_pages
        WHERE
          attribute_idxs && ARRAY[300000160]
        FETCH FIRST 1 ROWS ONLY
      )
        AS fp
    WHERE
      fu.w2_page_idxs && fp.page_idx
  )
;

Without any surprises, the planner is using a sequential scan on the "fact_users" table which is very large instead of using the GIN index set on the "w2_page_idxs" column.

Link to the query plan visualiser: https://explain.dalibo.com/plan/1vC

Thank you very much in advance,

Mickael

On Wed, Apr 27, 2022 at 4:54 PM Mickael van der Beek <mickael.van.der.beek@gmail.com> wrote:
Hello Jeff,

I have waited a few hours without the query ever finishing which is the reason I said "never finishes".
Especially because the INNER JOIN version finishes within a few minutes while being combinatorial and less efficient.
The query probably only does sequential scans.

You will find the query plan using EXPLAIN here:

Thanks for your help,

Mickael

On Wed, Apr 27, 2022 at 4:28 PM Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Apr 27, 2022 at 8:19 AM Mickael van der Beek <mickael.van.der.beek@gmail.com> wrote:

The last query does not finish after waiting for more than 15 minutes.
(The temporary view creation is very fast and required due to the same query in a CTE greatly reducing performance (by more than 5 min.) due to the optimisation barrier I'm guessing.)

How much over 15 minutes?  20 minutes doesn't seem that long to wait to get a likely definitive answer.  But at the least show us the EXPLAIN without ANALYZE of it, that should take no milliseconds.

And what does it mean for something to take 5 minutes longer than "never finishes"?

(Also, putting every or every other token on a separate line does not make it easier to read)

Cheer,

Jeff



--

Mickael van der Beek

Web developer & Security analyst



--

Mickael van der Beek

Web developer & Security analyst

Re: Array of integer indexed nested-loop semi join

От
Jeff Janes
Дата:


On Fri, May 20, 2022 at 6:42 AM Mickael van der Beek <mickael.van.der.beek@gmail.com> wrote:

Query:

EXPLAIN (
  ANALYZE,
  VERBOSE,
  COSTS,
  BUFFERS,
  TIMING
)
SELECT
  fu.w2_page_idxs
FROM
  fact_users
    AS fu
WHERE
  EXISTS (
    SELECT
    FROM
      (
        SELECT
          ARRAY[idx] AS page_idx
        FROM
          fact_pages
        WHERE
          attribute_idxs && ARRAY[300000160]
        FETCH FIRST 1 ROWS ONLY
      )
        AS fp
    WHERE
      fu.w2_page_idxs && fp.page_idx
  )
;

Without any surprises, the planner is using a sequential scan on the "fact_users" table which is very large instead of using the GIN index set on the "w2_page_idxs" column.

For me, using the subquery in and expression, instead of the EXISTS, does get it to use the gin index.  And I think it must give the same results.

SELECT
  fu.w2_page_idxs
FROM  fact_users AS fu
WHERE
      fu.w2_page_idxs && ARRAY[(select idx from fact_pages where attribute_idxs && ARRAY[3003] FETCH FIRST 1 ROWS ONLY)];

But why are you using intarray?  That is unnecessary here, and by creating ambiguity about the array operators it might be harmful. 

Cheers,

Jeff

Re: Array of integer indexed nested-loop semi join

От
Mickael van der Beek
Дата:
Hello Jeff,

Sadly, the query you suggested won't work because you are only returning the first row of the matching inner query rows.
Example:

SELECT
  u.idx,
  u.page_idxs
FROM
  (
    VALUES
      (1, ARRAY[11, 21, 31]),
      (2, ARRAY[12, 21, 32]),
      (3, ARRAY[13, 23, 31])
  )
    AS u(idx, page_idxs)
WHERE
  u.page_idxs && ARRAY[(
    SELECT
      p.idx
    FROM
      (
        VALUES
          (11, ARRAY[101, 201, 301]),
          (21, ARRAY[102, 201, 302]),
          (13, ARRAY[103, 203, 301])
      )
        AS p(idx, attribute_idxs)
    WHERE
      p.attribute_idxs && ARRAY[201]
    FETCH FIRST 1 ROWS ONLY
  )]
;

This query only returns one row while it should actually return two:

1 {11,21,31}

The INNER JOIN version of the query will return all matching rows but also include duplicates:

SELECT
  u.idx,
  u.page_idxs
FROM
  (
    VALUES
      (1, ARRAY[11, 21, 31]),
      (2, ARRAY[12, 21, 32]),
      (3, ARRAY[13, 23, 31])
  )
    AS u(idx, page_idxs)
INNER JOIN
  (
    SELECT
      p.idx
    FROM
      (
        VALUES
          (11, ARRAY[101, 201, 301]),
          (21, ARRAY[102, 201, 302]),
          (13, ARRAY[103, 203, 301])
      )
        AS p(idx, attribute_idxs)
    WHERE
      p.attribute_idxs && ARRAY[201]
  )
  AS p2
  ON u.page_idxs && ARRAY[p2.idx]
;

Results:

1 {11,21,31}
1 {11,21,31}
2 {12,21,32}

As far as I know, the the IN + sub-expression query can't work since the left side of the operation is an array of integers and the right side a set of rows with a single integer column.
The reason I'm using integer arrays is because it is the only way I have found in PostgreSQL to get fast inclusion / exclusion checks on large datasets (hundreds of millions of values). 
Did I misunderstand your response?
Thank you for the ongoing help,

Mickael

On Mon, May 23, 2022 at 4:45 AM Jeff Janes <jeff.janes@gmail.com> wrote:


On Fri, May 20, 2022 at 6:42 AM Mickael van der Beek <mickael.van.der.beek@gmail.com> wrote:

Query:

EXPLAIN (
  ANALYZE,
  VERBOSE,
  COSTS,
  BUFFERS,
  TIMING
)
SELECT
  fu.w2_page_idxs
FROM
  fact_users
    AS fu
WHERE
  EXISTS (
    SELECT
    FROM
      (
        SELECT
          ARRAY[idx] AS page_idx
        FROM
          fact_pages
        WHERE
          attribute_idxs && ARRAY[300000160]
        FETCH FIRST 1 ROWS ONLY
      )
        AS fp
    WHERE
      fu.w2_page_idxs && fp.page_idx
  )
;

Without any surprises, the planner is using a sequential scan on the "fact_users" table which is very large instead of using the GIN index set on the "w2_page_idxs" column.

For me, using the subquery in and expression, instead of the EXISTS, does get it to use the gin index.  And I think it must give the same results.

SELECT
  fu.w2_page_idxs
FROM  fact_users AS fu
WHERE
      fu.w2_page_idxs && ARRAY[(select idx from fact_pages where attribute_idxs && ARRAY[3003] FETCH FIRST 1 ROWS ONLY)];

But why are you using intarray?  That is unnecessary here, and by creating ambiguity about the array operators it might be harmful. 

Cheers,

Jeff



--

Mickael van der Beek

Web developer & Security analyst

Re: Array of integer indexed nested-loop semi join

От
Jeff Janes
Дата:
On Mon, May 23, 2022 at 3:57 AM Mickael van der Beek <mickael.van.der.beek@gmail.com> wrote:
Hello Jeff,

Sadly, the query you suggested won't work because you are only returning the first row of the matching inner query rows.

Sure, but the query I replaced did the same thing.  (I thought that was what you wanted, but I guess that was just to get it to run fast enough to ever finish--in that case it is probably better to use EXPLAIN without the ANALYZE so that we can see the plan of the correct query).  To get around that one-row limit you have to write it somewhat differently, getting rid of the ARRAY and adding an array_agg():

SELECT fu.*
FROM
  fact_users AS fu
WHERE
      fu.w2_page_idxs && (select array_agg(idx) from fact_pages where attribute_idxs && ARRAY[201]);

This way of writing it is better, as it still works with the LIMIT 1 but also works without it.  This still uses the indexes for me, at least when enable_seqscan is off.


The INNER JOIN version of the query will return all matching rows but also include duplicates:

You could just add a DISTINCT to get rid of the duplicates.  Of course that will also take some time on a large returned data set, but probably less time than scanning a giant table.  I think this is probably cleaner than the alternatives.
 

The reason I'm using integer arrays is because it is the only way I have found in PostgreSQL to get fast inclusion / exclusion checks on large datasets (hundreds of millions of values). 
Did I misunderstand your response?

I don't know if you misunderstood.  I meant specifically the intarray extension.  You can use integer arrays with built-in GIN indexes without help from the intarray extension.  Maybe you know that already and are just saying that the extension is even faster than the built-in indexed operators are and you need that extra speed.

Cheers,

Jeff

Re: Array of integer indexed nested-loop semi join

От
Mickael van der Beek
Дата:
Hello Jeff,

Thank you again for your advice.

I did indeed think of the ARRAY_AGG() version of the query.
Although this method is very fast (and does use indexes) for smallish array sizes, this is sadly not practical in my case because the arrays of matching rows can reach multiple hundreds of thousands of rows.
I thought of maybe "batching" the ARRAY_AGG() in batches of max n rows in a subquery and then calculating intersection on that but it doesn't seem practical or faster in the end.

> You could just add a DISTINCT to get rid of the duplicates.  Of course that will also take some time on a large returned data set, but probably less time than scanning a giant table.  I think this is probably cleaner than the alternatives.

Yes, and a GROUP BY will do the trick as well.
The fact that the current solution is a "nested loop" instead of a "nested loop semi join" means that the query is much slower due to needing to GROUP BY the rows.
This is why I tried various version using EXISTS, ANY, ARRAY_AGG(), etc with no avail.
Would you have an idea on why PostgreSQL doesn't use the existing indexes for this type of subqueries ?

> I don't know if you misunderstood.  I meant specifically the intarray extension.  You can use integer arrays with built-in GIN indexes without help from the intarray extension.  Maybe you know that already and are just saying that the extension is even faster than the built-in indexed operators are and you need that extra speed.

Are there specific advantages to not using the intarray extension and it's indexes in this case?
I was under the impression that it supported more operation types and was generally faster for this niche use case.

Thank you again for your help,

Mickael
 


On Mon, May 23, 2022 at 4:11 PM Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, May 23, 2022 at 3:57 AM Mickael van der Beek <mickael.van.der.beek@gmail.com> wrote:
Hello Jeff,

Sadly, the query you suggested won't work because you are only returning the first row of the matching inner query rows.

Sure, but the query I replaced did the same thing.  (I thought that was what you wanted, but I guess that was just to get it to run fast enough to ever finish--in that case it is probably better to use EXPLAIN without the ANALYZE so that we can see the plan of the correct query).  To get around that one-row limit you have to write it somewhat differently, getting rid of the ARRAY and adding an array_agg():

SELECT fu.*
FROM
  fact_users AS fu
WHERE
      fu.w2_page_idxs && (select array_agg(idx) from fact_pages where attribute_idxs && ARRAY[201]);

This way of writing it is better, as it still works with the LIMIT 1 but also works without it.  This still uses the indexes for me, at least when enable_seqscan is off.


The INNER JOIN version of the query will return all matching rows but also include duplicates:

You could just add a DISTINCT to get rid of the duplicates.  Of course that will also take some time on a large returned data set, but probably less time than scanning a giant table.  I think this is probably cleaner than the alternatives.
 

The reason I'm using integer arrays is because it is the only way I have found in PostgreSQL to get fast inclusion / exclusion checks on large datasets (hundreds of millions of values). 
Did I misunderstand your response?

I don't know if you misunderstood.  I meant specifically the intarray extension.  You can use integer arrays with built-in GIN indexes without help from the intarray extension.  Maybe you know that already and are just saying that the extension is even faster than the built-in indexed operators are and you need that extra speed.

Cheers,

Jeff



--

Mickael van der Beek

Web developer & Security analyst