Discussion: How bad is using queries with thousands of values for operators IN or ANY?


How bad is using queries with thousands of values for operators IN or ANY?

From: Thorsten Schöning
Hi all,

I have lots of queries in which I need to restrict access to rows
using some numeric row ID, and I'm mostly doing so with the operator
IN in WHERE clauses. Additionally, I'm mostly embedding the IDs as a
comma-separated list directly into the query, e.g. because I already
hit the ~32k parameter limit of the JDBC driver[1] for Postgres.
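
Side note: that limit counts scalar placeholders, so binding the whole
list as a single array parameter sidesteps it - one placeholder, however
many IDs it carries. A minimal SQL-side sketch, reusing the "meter"
table from the attached plan (from JDBC this would be a single
setArray() bind rather than thousands of setInt() calls):

  -- one array parameter instead of thousands of scalar placeholders
  PREPARE meters_by_id(int[]) AS
    SELECT * FROM meter WHERE id = ANY($1);

  EXECUTE meters_by_id('{1,2,3}');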

I really thought that in most cases simply sending a large number of
IDs embedded in the query is better than looping, because it saves
round trips to the DB, the planner has all the information it needs to
choose the best strategy, etc. OTOH, in recent tests with the number
of IDs increased by a factor of about 100, I see additional load in
Tomcat even before the query is sent to the DB, and of course in the
DB itself as well. I've attached an example query and plan.

> ->  Hash  (cost=242592.66..242592.66 rows=6825 width=39) (actual time=91.117..91.117 rows=40044 loops=3)
>       Buckets: 65536 (originally 8192)  Batches: 1 (originally 1)  Memory Usage: 3016kB
>       ->  Hash Join  (cost=137.57..242592.66 rows=6825 width=39) (actual time=10.194..82.412 rows=40044 loops=3)
>             Hash Cond: (meter.meter_bcd = meter_bcd.id)
>             ->  Index Scan using pk_meter on meter  (cost=0.42..242237.10 rows=40044 width=25) (actual time=9.350..71.276 rows=40044 loops=3)
>                   Index Cond: (id = ANY ('{[...]}'::integer[]))
>             ->  Hash  (cost=135.73..135.73 rows=113 width=22) (actual time=0.830..0.830 rows=113 loops=3)

Do you know of any obvious limitations of the JDBC driver in handling
such large queries? In the end, the query is mostly large text with
only very few bind parameters.

Do you know of any obvious problem in Postgres itself with that query,
when parsing it or the like? Do things simply take as long as they
take, roughly comparable to looping, or is there additional overhead
the larger the query itself gets? My naive expectation is that
comparing IDs shouldn't care whether they are looped over or
transmitted all at once.

I'm just trying to collect some input on where to look to optimize
things in the future. Thanks!

[1]: https://github.com/pgjdbc/pgjdbc/issues/90

Kind regards,

Thorsten Schöning

Attachments: example query and plan

Re: How bad is using queries with thousands of values for operators IN or ANY?

From: Tim Cross
Thorsten Schöning <tschoening@am-soft.de> writes:

> [...]
>
> I'm just trying to collect some input on where to look to optimize
> things in the future. Thanks!

It would help to see the query as well as the plan.

Where are these hundreds of IDs coming from? I sometimes find this a
sign that you could restructure your query as a join between two
tables, where one table contains the IDs of interest, rather than
trying to embed them into the query as part of a WHERE clause.
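
For illustration, a minimal sketch of that shape (the "interesting_ids"
relation here is hypothetical; Thomas Kellerer spells out a VALUES-based
variant later in this thread):

  SELECT t.*
  FROM   t
  JOIN   interesting_ids i ON i.id = t.id;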

--
Tim Cross



Re: How bad is using queries with thousands of values for operators IN or ANY?

From: Pavel Stehule


On Mon, 31 Aug 2020 at 10:04, Thorsten Schöning <tschoening@am-soft.de> wrote:
> [...]
>
> I'm just trying to collect some input on where to look to optimize
> things in the future. Thanks!

It is not good - it increases the memory needed for query parsing, and
the optimizer and executor get slower.

Postgres currently has no optimization for searching such long lists,
so this search is very slow compared to other methods.

I think this is a signal that something in the database design or
architecture is wrong. Sure, there can be exceptions, but Postgres has
no optimization for this design.

Regards

Pavel



Re: How bad is using queries with thousands of values for operators IN or ANY?

From: Thorsten Schöning
Hi Tim Cross, on Monday, 31 August 2020 at 10:21 you wrote:

> It would help to see the query as well as the plan.

I've attached both to my mail, and my received copy shows they're
available. Maybe the following file host works better for you:

https://gofile.io/d/vWPliD
https://gofile.io/d/AiaDmN

> Where are these hundreds of IDs coming from? I sometimes find this a
> sign that you could restructure your query as a join between two
> tables, where one table contains the IDs of interest, rather than
> trying to embed them into the query as part of a WHERE clause.

It's somewhat like that, but on a higher level: my app has different
functions, one that searches for rows according to some criteria and
returns IDs, and one that reads those IDs and takes care of additional
things. It's like that because both were already available
individually and are needed individually, while it sometimes makes
sense to combine them.

Integrating things is an option of course; that's exactly why I asked
for opinions about how bad the current implementation is. If this
concrete aspect is not too bad, I might simply keep it and look
elsewhere for improvements.

Kind regards,

Thorsten Schöning





Re: How bad is using queries with thousands of values for operators IN or ANY?

From: Thorsten Schöning
Hi Laurenz Albe, on Monday, 31 August 2020 at 10:23 you wrote:

> Is it an option to change that usage pattern?

Not everywhere easily, because I simply have places in which I'm
working with IDs only and e.g. need to check whether they are owned by
some user. Those checks are abstracted away into individual functions
instead of always JOINing, because the latter makes queries harder to
maintain.
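
Purely as a hypothetical sketch of what such an abstracted check might
look like (table and column names are made up):

  -- returns the subset of p_ids owned by p_user
  CREATE FUNCTION owned_ids(p_user integer, p_ids integer[])
  RETURNS SETOF integer
  LANGUAGE sql STABLE
  AS $$
    SELECT m.id
    FROM meter m
    WHERE m.id = ANY(p_ids)
      AND m.owner_id = p_user;  -- "owner_id" is an assumption
  $$;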

Of course there are alternatives I might look at, like row-level
security. And that's exactly why I asked the question: to make up my
mind about whether it's worth or necessary to look into alternatives
at all.

Kind regards,

Thorsten Schöning





Re: How bad is using queries with thousands of values for operators IN or ANY?

From: Thorsten Schöning
Hi Pavel Stehule, on Monday, 31 August 2020 at 10:51 you wrote:

> It is not good - it increases the memory needed for query parsing, and
> the optimizer and executor get slower.

At least memory and I/O were not a problem in my tests; CPU load was
pretty high. But I was unable to see whether that comes from comparing
IDs or from parsing the query or the like.

Can one see the difference somehow? That would be the only way to tell
whether splitting the large query into multiple smaller ones is an
overall improvement.
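
For what it's worth, the server can report per-phase timings itself,
which separates parsing and planning from actually executing the
comparison. A sketch (these settings are real but superuser-only, and
the figures land in the server log):

  SET log_parser_stats   = on;  -- parser and parse-analysis statistics
  SET log_planner_stats  = on;  -- planner statistics
  SET log_executor_stats = on;  -- executor statistics
  -- run the large query once and compare the logged phases;
  -- EXPLAIN (ANALYZE, BUFFERS) additionally separates Planning Time
  -- from Execution Time in its own output.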

> Postgres currently has no optimization for searching such long lists,
> so this search is very slow compared to other methods.

Which other methods do you have in mind - populating a temporary
table and joining against it? I tested approaches like that in the
context of other queries, and creating those tables, indexes etc.
introduced more overhead than the query gained.
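
I.e. something roughly like this, I suppose (a sketch with made-up
names):

  CREATE TEMP TABLE wanted_ids (id integer PRIMARY KEY);
  INSERT INTO wanted_ids SELECT unnest('{1,2,3}'::int[]);

  SELECT m.*
  FROM meter m
  JOIN wanted_ids w USING (id);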

Additionally, it's not always these large queries. Most of the time
there are far fewer IDs and things are fast. I even introduced
embedding those IDs because the query was faster that way than with
some JOIN comparing the IDs, even though indexes were in place and
used.

So at what query size, or number of IDs compared with IN, would you
consider a different approach at all?

Kind regards,

Thorsten Schöning





Re: How bad is using queries with thousands of values for operators IN or ANY?

From: Thomas Kellerer
Thorsten Schöning wrote on 31.08.2020 at 12:37:
> So at what query size, or number of IDs compared with IN, would you
> consider a different approach at all?


In my experience "hundreds" of IDs tend to be quite slow if used with an IN clause.

Rewriting the IN to a JOIN against a VALUES clause is very often faster:

So instead of:

  select *
  from t
  where id in (1,2,3, .... ,500);

use this:

  select *
  from t
    join (
       values (1),(2),(3),...(500)
    ) as x(id) on x.id = t.id;

which more often than not produces a more efficient execution plan
(assuming no values are duplicated in the IN list).

Obviously I don't know if such a re-write is even feasible though.
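
One caveat worth hedging: the column type of the VALUES list is
inferred from its literals, so if t.id is not a plain integer (say,
bigint), an explicit cast on the first row avoids a type-resolution
surprise:

  select *
  from t
    join (
       values (1::bigint),(2),(3)
    ) as x(id) on x.id = t.id;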

Thomas




Re: How bad is using queries with thousands of values for operators IN or ANY?

From: Pavel Stehule


On Mon, 31 Aug 2020 at 13:29, Thomas Kellerer <shammat@gmx.net> wrote:
> [...]
>
> Rewriting the IN to a JOIN against a VALUES clause is very often faster:
>
> [...]
>
> Obviously I don't know if such a re-write is even feasible though.

Yes - this query will probably have a slow start, but the execution will be fast. Unfortunately, no statistics are available for the VALUES list.




Re: How bad is using queries with thousands of values for operators IN or ANY?

From: Kyotaro Horiguchi
At Mon, 31 Aug 2020 16:04:43 +0200, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> [...]
> yes - this query will probably have a slow start, but the execution will
> be fast. Unfortunately, no statistics are available for the VALUES list.

FWIW, the attached is the dusted-off version of part of a stalled
development of mine, which unconditionally(!) creates on-the-fly
statistics on VALUES lists. It seems to work for certain cases,
although the planning time increases significantly.

=$ CREATE TABLE t1 AS SELECT a, a * 2 AS b FROM generate_series(0, 99999) a;
=$ CREATE INDEX ON t1 (a);
> perl q.pl(*) | psql

*: q.pl:
> print "explain analyze select b from t1 join (values ";
> foreach $i (0..10000) {
>     print ", " if ($i > 0);
>     printf("(%d)", $i/10 + 1000);
> }
> print ") as v(v) on (v.v = t1.a);";


patched:

 Merge Join  (cost=824.25..1005.19 rows=10001 width=4) (actual time=13.513..24.285 rows=10001 loops=1)
   Merge Cond: (t1.a = "*VALUES*".column1)
   ->  Index Scan using t1_a_idx on t1  (cost=0.29..3050.29 rows=100000 width=8) (actual time=0.033..1.629 rows=2002 loops=1)
   ->  Sort  (cost=789.47..814.47 rows=10001 width=4) (actual time=12.557..14.546 rows=10001 loops=1)
         Sort Key: "*VALUES*".column1
         Sort Method: quicksort  Memory: 931kB
         ->  Values Scan on "*VALUES*"  (cost=0.00..125.01 rows=10001 width=4) (actual time=0.002..8.271 rows=10001 loops=1)
 Planning Time: 17.290 ms
 Execution Time: 26.344 ms
(9 rows)

master:
 Hash Join  (cost=250.03..2168.03 rows=10001 width=4) (actual time=14.482..77.205 rows=10001 loops=1)
   Hash Cond: (t1.a = "*VALUES*".column1)
   ->  Seq Scan on t1  (cost=0.00..1443.00 rows=100000 width=8) (actual time=0.017..23.540 rows=100000 loops=1)
   ->  Hash  (cost=125.01..125.01 rows=10001 width=4) (actual time=13.786..13.788 rows=10001 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 480kB
         ->  Values Scan on "*VALUES*"  (cost=0.00..125.01 rows=10001 width=4) (actual time=0.002..8.503 rows=10001 loops=1)
 Planning Time: 12.365 ms
 Execution Time: 78.567 ms
(8 rows)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
From c6680488178bb7eeae66c92f46d475ac69102481 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 6 Jul 2017 14:46:49 +0900
Subject: [PATCH] Temporary statistics on VALUES list

When using "IN (VALUES" instead of "IN (list.." to obtain better
plans, the final estimate sometimes goes bad because of the lack of
statistics for VALUES. This patch allows VALUES to have statistics and
improves the estimate in that case.
---
 src/backend/commands/analyze.c   | 289 +++++++++++++++++++++----------
 src/backend/utils/adt/selfuncs.c |  59 ++++++-
 src/include/commands/vacuum.h    |   2 +
 src/include/nodes/parsenodes.h   |   2 +
 4 files changed, 260 insertions(+), 92 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..aba06415c8 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -102,6 +102,9 @@ static int    compare_rows(const void *a, const void *b);
 static int    acquire_inherited_sample_rows(Relation onerel, int elevel,
                                           HeapTuple *rows, int targrows,
                                           double *totalrows, double *totaldeadrows);
+static HeapTuple construct_stats_tuple(VacAttrStats *stats,
+                                       Oid attrrelid, bool inh,
+                                       Relation statsrel, HeapTuple oldtup);
 static void update_attstats(Oid relid, bool inh,
                             int natts, VacAttrStats **vacattrstats);
 static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
@@ -278,6 +281,86 @@ analyze_rel(Oid relid, RangeVar *relation,
     pgstat_progress_end_command();
 }
 
+/*
+ * analyze_values() -- analyze one attribute in a VALUES list and return
+ * the resulting pg_statistic tuple.
+ */
+static List *valuesrows;
+
+static Datum
+values_attrFetchFunc(VacAttrStatsP stats, int rownum, bool *isNull)
+{
+    List *row = (List *) list_nth(valuesrows, rownum);
+    Const *con;
+
+    Assert(IsA(row, List));
+    Assert(stats->tupattnum <= list_length(row));
+    con = (Const *) list_nth(row, stats->tupattnum - 1);
+    Assert(IsA(con, Const));
+
+    return con->constvalue;
+}
+
+HeapTuple
+analyze_values(Var *var, List *values_lists)
+{
+    int i;
+    HeapTuple typetup;
+    HeapTuple result = NULL;
+    VacAttrStats *stats;
+    int rows;
+    bool ok;
+    MemoryContext analyze_cxt =
+        AllocSetContextCreate(CurrentMemoryContext, "Temp Analyze",
+                              ALLOCSET_DEFAULT_SIZES);
+
+    valuesrows = values_lists;
+    rows = list_length(valuesrows);
+    stats = (VacAttrStats *) palloc0(sizeof(VacAttrStats));
+    stats->attr = (Form_pg_attribute) palloc(ATTRIBUTE_FIXED_PART_SIZE);
+    stats->attr->attrelid = InvalidOid;
+    stats->attr->attstattarget = default_statistics_target;
+    stats->attrtypid = var->vartype;
+    stats->attrtypmod = var->vartypmod;
+    typetup = SearchSysCacheCopy1(TYPEOID, ObjectIdGetDatum(var->vartype));
+    if (!HeapTupleIsValid(typetup))
+        elog(ERROR, "cache lookup failed for type %u", var->vartype);
+    stats->attrtype = (Form_pg_type) GETSTRUCT(typetup);
+    stats->attrcollid = var->varcollid;
+    stats->anl_context = analyze_cxt;
+    stats->tupattnum = var->varattno;
+    stats->nodelay = true;        /* don't make a delay during analyze */
+
+    for (i = 0 ; i < STATISTIC_NUM_SLOTS ; i++)
+    {
+        stats->statypid[i] = stats->attrtypid;
+        stats->statyplen[i] = stats->attrtype->typlen;
+        stats->statypbyval[i] = stats->attrtype->typbyval;
+        stats->statypalign[i] = stats->attrtype->typalign;
+    }
+    if (OidIsValid(stats->attrtype->typanalyze))
+        ok = DatumGetBool(OidFunctionCall1(stats->attrtype->typanalyze,
+                                           PointerGetDatum(stats)));
+    else
+        ok = std_typanalyze(stats);
+    if (ok && stats->compute_stats != NULL && stats->minrows > 0)
+    {
+        Relation sd;
+
+        stats->compute_stats(stats, values_attrFetchFunc, rows, rows);
+        sd = table_open(StatisticRelationId, AccessShareLock);
+        result = construct_stats_tuple(stats, InvalidOid, false, sd, NULL);
+        table_close(sd, AccessShareLock);
+    }
+
+    heap_freetuple(typetup);
+    pfree(stats->attr);
+    pfree(stats);
+    MemoryContextDelete(analyze_cxt);
+
+    return result;
+}
+
 /*
  *    do_analyze_rel() -- analyze one relation, recursively or not
  *
@@ -1421,6 +1504,112 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
     return numrows;
 }
 
+/* construct_stats_tuple() -- construct a statistics tuple */
+static HeapTuple
+construct_stats_tuple(VacAttrStats *stats, Oid attrrelid, bool inh,
+                      Relation statsrel, HeapTuple oldtup)
+{
+    HeapTuple stup;
+    int            i,
+                k,
+                n;
+    Datum        values[Natts_pg_statistic];
+    bool        nulls[Natts_pg_statistic];
+    bool        replaces[Natts_pg_statistic];
+
+    /*
+     * Construct a new pg_statistic tuple
+     */
+    for (i = 0; i < Natts_pg_statistic; ++i)
+    {
+        nulls[i] = false;
+        replaces[i] = true;
+    }
+
+    values[Anum_pg_statistic_starelid - 1] = ObjectIdGetDatum(attrrelid);
+    values[Anum_pg_statistic_staattnum - 1] = Int16GetDatum(stats->attr->attnum);
+    values[Anum_pg_statistic_stainherit - 1] = BoolGetDatum(inh);
+    values[Anum_pg_statistic_stanullfrac - 1] = Float4GetDatum(stats->stanullfrac);
+    values[Anum_pg_statistic_stawidth - 1] = Int32GetDatum(stats->stawidth);
+    values[Anum_pg_statistic_stadistinct - 1] = Float4GetDatum(stats->stadistinct);
+    i = Anum_pg_statistic_stakind1 - 1;
+    for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
+    {
+        values[i++] = Int16GetDatum(stats->stakind[k]); /* stakindN */
+    }
+    i = Anum_pg_statistic_staop1 - 1;
+    for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
+    {
+        values[i++] = ObjectIdGetDatum(stats->staop[k]);    /* staopN */
+    }
+    i = Anum_pg_statistic_stacoll1 - 1;
+    for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
+    {
+        values[i++] = ObjectIdGetDatum(stats->stacoll[k]);    /* stacollN */
+    }
+    i = Anum_pg_statistic_stanumbers1 - 1;
+    for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
+    {
+        int            nnum = stats->numnumbers[k];
+
+        if (nnum > 0)
+        {
+            Datum       *numdatums = (Datum *) palloc(nnum * sizeof(Datum));
+            ArrayType  *arry;
+
+            for (n = 0; n < nnum; n++)
+                numdatums[n] = Float4GetDatum(stats->stanumbers[k][n]);
+            /* XXX knows more than it should about type float4: */
+            arry = construct_array(numdatums, nnum,
+                                   FLOAT4OID,
+                                   sizeof(float4), true, TYPALIGN_INT);
+            values[i++] = PointerGetDatum(arry);    /* stanumbersN */
+        }
+        else
+        {
+            nulls[i] = true;
+            values[i++] = (Datum) 0;
+        }
+    }
+    i = Anum_pg_statistic_stavalues1 - 1;
+    for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
+    {
+        if (stats->numvalues[k] > 0)
+        {
+            ArrayType  *arry;
+
+            arry = construct_array(stats->stavalues[k],
+                                   stats->numvalues[k],
+                                   stats->statypid[k],
+                                   stats->statyplen[k],
+                                   stats->statypbyval[k],
+                                   stats->statypalign[k]);
+            values[i++] = PointerGetDatum(arry);    /* stavaluesN */
+        }
+        else
+        {
+            nulls[i] = true;
+            values[i++] = (Datum) 0;
+        }
+    }
+
+    if (HeapTupleIsValid(oldtup))
+    {
+        /* construct a modified tuple */
+        stup = heap_modify_tuple(oldtup,
+                                 RelationGetDescr(statsrel),
+                                 values,
+                                 nulls,
+                                 replaces);
+    }
+    else
+    {
+        /* construct a new tuple */
+        stup = heap_form_tuple(RelationGetDescr(statsrel), values, nulls);
+    }
+
+    return stup;
+}
 
 /*
  *    update_attstats() -- update attribute statistics for one relation
@@ -1460,114 +1649,29 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
         VacAttrStats *stats = vacattrstats[attno];
         HeapTuple    stup,
                     oldtup;
-        int            i,
-                    k,
-                    n;
-        Datum        values[Natts_pg_statistic];
-        bool        nulls[Natts_pg_statistic];
-        bool        replaces[Natts_pg_statistic];
 
         /* Ignore attr if we weren't able to collect stats */
         if (!stats->stats_valid)
             continue;
 
-        /*
-         * Construct a new pg_statistic tuple
-         */
-        for (i = 0; i < Natts_pg_statistic; ++i)
-        {
-            nulls[i] = false;
-            replaces[i] = true;
-        }
-
-        values[Anum_pg_statistic_starelid - 1] = ObjectIdGetDatum(relid);
-        values[Anum_pg_statistic_staattnum - 1] = Int16GetDatum(stats->attr->attnum);
-        values[Anum_pg_statistic_stainherit - 1] = BoolGetDatum(inh);
-        values[Anum_pg_statistic_stanullfrac - 1] = Float4GetDatum(stats->stanullfrac);
-        values[Anum_pg_statistic_stawidth - 1] = Int32GetDatum(stats->stawidth);
-        values[Anum_pg_statistic_stadistinct - 1] = Float4GetDatum(stats->stadistinct);
-        i = Anum_pg_statistic_stakind1 - 1;
-        for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
-        {
-            values[i++] = Int16GetDatum(stats->stakind[k]); /* stakindN */
-        }
-        i = Anum_pg_statistic_staop1 - 1;
-        for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
-        {
-            values[i++] = ObjectIdGetDatum(stats->staop[k]);    /* staopN */
-        }
-        i = Anum_pg_statistic_stacoll1 - 1;
-        for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
-        {
-            values[i++] = ObjectIdGetDatum(stats->stacoll[k]);    /* stacollN */
-        }
-        i = Anum_pg_statistic_stanumbers1 - 1;
-        for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
-        {
-            int            nnum = stats->numnumbers[k];
-
-            if (nnum > 0)
-            {
-                Datum       *numdatums = (Datum *) palloc(nnum * sizeof(Datum));
-                ArrayType  *arry;
-
-                for (n = 0; n < nnum; n++)
-                    numdatums[n] = Float4GetDatum(stats->stanumbers[k][n]);
-                /* XXX knows more than it should about type float4: */
-                arry = construct_array(numdatums, nnum,
-                                       FLOAT4OID,
-                                       sizeof(float4), true, TYPALIGN_INT);
-                values[i++] = PointerGetDatum(arry);    /* stanumbersN */
-            }
-            else
-            {
-                nulls[i] = true;
-                values[i++] = (Datum) 0;
-            }
-        }
-        i = Anum_pg_statistic_stavalues1 - 1;
-        for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
-        {
-            if (stats->numvalues[k] > 0)
-            {
-                ArrayType  *arry;
-
-                arry = construct_array(stats->stavalues[k],
-                                       stats->numvalues[k],
-                                       stats->statypid[k],
-                                       stats->statyplen[k],
-                                       stats->statypbyval[k],
-                                       stats->statypalign[k]);
-                values[i++] = PointerGetDatum(arry);    /* stavaluesN */
-            }
-            else
-            {
-                nulls[i] = true;
-                values[i++] = (Datum) 0;
-            }
-        }
-
         /* Is there already a pg_statistic tuple for this attribute? */
         oldtup = SearchSysCache3(STATRELATTINH,
                                  ObjectIdGetDatum(relid),
                                  Int16GetDatum(stats->attr->attnum),
                                  BoolGetDatum(inh));
 
+        /* construct a statistics tuple */
+        stup = construct_stats_tuple(stats, relid, inh, sd, oldtup);
+
         if (HeapTupleIsValid(oldtup))
         {
             /* Yes, replace it */
-            stup = heap_modify_tuple(oldtup,
-                                     RelationGetDescr(sd),
-                                     values,
-                                     nulls,
-                                     replaces);
             ReleaseSysCache(oldtup);
             CatalogTupleUpdate(sd, &stup->t_self, stup);
         }
         else
         {
             /* No, insert new tuple */
-            stup = heap_form_tuple(RelationGetDescr(sd), values, nulls);
             CatalogTupleInsert(sd, stup);
         }
 
@@ -1776,7 +1880,8 @@ compute_trivial_stats(VacAttrStatsP stats,
         Datum        value;
         bool        isnull;
 
-        vacuum_delay_point();
+        if (!stats->nodelay)
+            vacuum_delay_point();
 
         value = fetchfunc(stats, i, &isnull);
 
@@ -1892,7 +1997,8 @@ compute_distinct_stats(VacAttrStatsP stats,
         int            firstcount1,
                     j;
 
-        vacuum_delay_point();
+        if (!stats->nodelay)
+            vacuum_delay_point();
 
         value = fetchfunc(stats, i, &isnull);
 
@@ -2239,7 +2345,8 @@ compute_scalar_stats(VacAttrStatsP stats,
         Datum        value;
         bool        isnull;
 
-        vacuum_delay_point();
+        if (!stats->nodelay)
+            vacuum_delay_point();
 
         value = fetchfunc(stats, i, &isnull);
 
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 00c7afc66f..225938df94 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5226,10 +5226,67 @@ examine_simple_variable(PlannerInfo *root, Var *var,
             examine_simple_variable(rel->subroot, var, vardata);
         }
     }
+    else if (rte->rtekind == RTE_VALUES)
+    {
+        HeapTuple    statstup;
+        bool        all_const = true;
+        ListCell   *lc1, *lc2;
+
+        /* known that this attr doesn't have statistics */
+        if (bms_is_member(var->varattno, rte->notmpstats))
+            return;
+
+        if (!rte->tmpstats ||
+            !HeapTupleIsValid(rte->tmpstats[var->varattno]))
+        {
+            /* no stats for this var; calculate stats if possible */
+
+            /* we allow only all-const values lists */
+            foreach (lc1, rte->values_lists)
+            {
+                foreach (lc2, (List *) lfirst(lc1))
+                {
+                    if (!IsA(lfirst(lc2), Const))
+                    {
+                        all_const = false;
+                        break;
+                    }
+                }
+                if (!all_const)
+                    break;
+            }
+            if (!all_const)
+            {
+                /* set negative info and return */
+                rte->notmpstats =
+                    bms_add_member(rte->notmpstats, var->varattno);
+                return;
+            }
+        }
+
+        if (!rte->tmpstats)
+        {
+            int ncols = list_length(rte->coltypes);
+            rte->tmpstats =
+                (HeapTuple*) palloc0(sizeof(HeapTuple) * (ncols + 1));
+        }
+
+        if (!rte->tmpstats[var->varattno])
+            rte->tmpstats[var->varattno] =
+                analyze_values(var, rte->values_lists);
+
+        statstup = heap_copytuple(rte->tmpstats[var->varattno]);
+        if (HeapTupleIsValid(statstup))
+        {
+            vardata->statsTuple = statstup;
+            vardata->freefunc = heap_freetuple;
+            vardata->acl_ok = true;
+        }
+    }
     else
     {
         /*
-         * Otherwise, the Var comes from a FUNCTION, VALUES, or CTE RTE.  (We
+         * Otherwise, the Var comes from a FUNCTION, or CTE RTE.  (We
          * won't see RTE_JOIN here because join alias Vars have already been
          * flattened.)    There's not much we can do with function outputs, but
          * maybe someday try to be smarter about VALUES and/or CTEs.
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index a4cd721400..41ac254f6e 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -126,6 +126,7 @@ typedef struct VacAttrStats
     Form_pg_type attrtype;        /* copy of pg_type row for attrtypid */
     Oid            attrcollid;        /* collation of data being analyzed */
     MemoryContext anl_context;    /* where to save long-lived data */
+    bool        nodelay;        /* true to skip delays */
 
     /*
      * These fields must be filled in by the typanalyze routine, unless it
@@ -283,6 +284,7 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
 extern void analyze_rel(Oid relid, RangeVar *relation,
                         VacuumParams *params, List *va_cols, bool in_outer_xact,
                         BufferAccessStrategy bstrategy);
+extern HeapTuple analyze_values(Var *var, List *values_list);
 extern bool std_typanalyze(VacAttrStats *stats);
 
 /* in utils/misc/sampling.c --- duplicate of declarations in utils/sampling.h */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 47d4c07306..7966e647cb 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1123,6 +1123,8 @@ typedef struct RangeTblEntry
     Bitmapset  *updatedCols;    /* columns needing UPDATE permission */
     Bitmapset  *extraUpdatedCols;    /* generated columns being updated */
     List       *securityQuals;    /* security barrier quals to apply, if any */
+    struct HeapTupleData **tmpstats;/*temporary statistics, indexed by attnum */
+    Bitmapset  *notmpstats;        /* negative cache of tmpstats */
 } RangeTblEntry;
 
 /*
-- 
2.18.4


Re: How bad is using queries with thousands of values for operators IN or ANY?

From: Michael Lewis
On Tue, Sep 1, 2020 at 1:22 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
 FWIW, the attached is the dusted-off version of part of a stalled
 development of mine, which unconditionally(!) creates on-the-fly
 statistics on VALUES lists. It seems to work for certain cases,
 although the planning time increases significantly.

[...]

We end up abusing the option of creating temp tables and analyzing them to get around the pain of queries going off the rails because of bad stats, or a lack of stats, on VALUES lists. I believe most/all of the core team and perhaps most contributors are against query hints in general (with some very good reasons), but it might be amazing to have the option to incur the planning-time cost in some cases at least.
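
A minimal sketch of that workaround (hypothetical names; the explicit
ANALYZE is the point, since autovacuum never processes temp tables):

  CREATE TEMP TABLE ids_tmp (id integer PRIMARY KEY);
  INSERT INTO ids_tmp SELECT g FROM generate_series(1, 10000) g;
  ANALYZE ids_tmp;  -- real row count and distribution for the planner

  SELECT t.*
  FROM some_table t
  JOIN ids_tmp USING (id);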

For my case, I think the changes coming in PG v13 or maybe v14 for joins being helped by multivariate statistics will mitigate the pain point.
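
For reference, a hedged sketch of such extended statistics as they
exist today (table and columns are hypothetical; whether join
estimation benefits is exactly the version-dependent work referred to
above):

  CREATE STATISTICS order_stats (ndistinct, dependencies)
    ON customer_id, region FROM orders;
  ANALYZE orders;  -- extended statistics are collected by ANALYZE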

Re: How bad is using queries with thousands of values for operators IN or ANY?

From: Pavel Stehule


On Tue, 1 Sep 2020 at 09:22, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> [...]
>
> FWIW, the attached is the dusted-off version of part of a stalled
> development of mine, which unconditionally(!) creates on-the-fly
> statistics on VALUES lists. [...]


nice :)

Pavel


Re: How bad is using queries with thousands of values for operators IN or ANY?

From: Kyotaro Horiguchi
At Tue, 1 Sep 2020 18:46:01 +0200, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> nice :)

thanks ;p

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center