Shared detoast Datum proposal

Поиск
Список
Период
Сортировка
От Andy Fan
Тема Shared detoast Datum proposal
Дата
Msg-id 87il4jrk1l.fsf@163.com
обсуждение исходный текст
Ответы Re: Shared detoast Datum proposal  (Andy Fan <zhihuifan1213@163.com>)
Список pgsql-hackers
Problem:
--------
Toast works well for its claimed purposes. However the current detoast
infrastructure doesn't shared the detoast datum in the query's
lifespan, which will cause some inefficiency like this:

SELECT big_jsonb_col->'a',  big_jsonb_col->'b',  big_jsonb_col->'c'
FROM t;

In the above case, the big_jsonb_col will be detoast 3 times for every
record.

a more common case maybe:

CREATE TABLE t(a numeric);
insert into t select i FROM generate_series(1, 10)i;
SELECT * FROM t WHERE a > 3;

Here we needs detoasted because of VARATT_CAN_MAKE_SHORT, and it needs to
be detoasted twice, one is in a > 3,  the other one is in the targetlist,
where we need to call numeric_out. 

Proposal
--------

When we access some desired toast column in EEOP_{SCAN|INNER_OUTER}_VAR
steps, we can detoast it immediately and save it back to
slot->tts_values.  With this way, when any other expressions seeks for
the Datum, it get the detoast version. Planner decides which Vars
should use this feature, executor manages it detoast action and memory
management. 

Planner design
---------------

1. This feature only happen at the Plan Node where the detoast would
   happen in the previous topology, for example:

   for example:
   
   SELECT t1.toast_col, t2.toast_col FROM t1 join t2 USING (toast_col);

   toast_col just happens at the Join node's slot even if we have a
   projection on t1 or t2 at the scan node (except the Parameterized path).

   However if

   SELECT t1.toast_col, t2.toast_col
   FROM t1 join t2
   USING (toast_col)
   WHERE t1.toast_col > 'a';

   the detoast *may* happen at the scan of level t1 since "t1.toast_col >
   'a'" accesses the Var within a FuncCall ('>' operator), which will
   cause a detoast. (However it should not happen if it is under a Sort
   node, for details see Planner Design section 2).

   At the implementation side, I added "Bitmapset  *reference_attrs;" to
   Scan node which show if the Var should be accessed with the
   pre-detoast way in expression execution engine. the value is
   maintained at the create_plan/set_plan_refs stage. 

   Two similar fields are added in Join node.

       Bitmapset  *outer_reference_attrs;
    Bitmapset  *inner_reference_attrs;

   In the both case, I tracked the level of walker/mutator, if the level
   greater than 1 when we access a Var, the 'var->varattno - 1' is added
   to the bitmapset. Some special node should be ignored, see
   increase_level_for_pre_detoast for details.

2. We also need to make sure the detoast datum will not increase the
   work_mem usage for the nodes like Sort/Hash etc, all of such nodes
   can be found with search 'CP_SMALL_TLIST' flags. 

   If the a node under a Sort-Hash-like nodes, we have some extra
   checking to see if a Var is a *directly input* of such nodes. If yes,
   we can't detoast it in advance, or else, we know the Var has been
   discarded before goes to these nodes, we still can use the shared
   detoast feature.

   The simplest cases to show this is:

   For example:

   2.1
   Query-1
   explain (verbose) select * from t1 where b > 'a';
   -- b can be detoast in advance.

   Query-2
   explain (verbose) select * from t1 where b > 'a' order by c;
   -- b can't be detoast since it will makes the Sort use more work_mem.

   Query-3
   explain (verbose) select a, c from t1 where b > 'a' order by c;
   -- b can be pre-detoasted, since it is discarded before goes to Sort
   node. In this case it doesn't do anything good, but for some complex
   case like Query-4, it does. 

Query-4   
explain (costs off, verbose)
select t3.*
from t1, t2, t3
where t2.c > '999999999999999'
and t2.c = t1.c
and t3.b = t1.b;

                           QUERY PLAN                          
--------------------------------------------------------------
 Hash Join
   Output: t3.a, t3.b, t3.c
   Hash Cond: (t3.b = t1.b)
   ->  Seq Scan on public.t3
         Output: t3.a, t3.b, t3.c
   ->  Hash
         Output: t1.b
         ->  Nested Loop
               Output: t1.b  <-- Note here 3
               ->  Seq Scan on public.t2
                     Output: t2.a, t2.b, t2.c
                     Filter: (t2.c > '9999...'::text) <--- Note Here 1
               ->  Index Scan using t1_c_idx on public.t1
                     Output: t1.a, t1.b, t1.c
                     Index Cond: (t1.c = t2.c)  <--- Note Here 2 
(15 rows) 
 
In this case, detoast datum for t2.c can be shared and it benefits for
t2.c = t1.c and no harm for the Hash node.


Execution side
--------------

Once we decide a Var should be pre-detoasted for a given plan node, a
special step named as EEOP_{INNER/OUTER/SCAN}_VAR_TOAST will be created
during ExecInitExpr stage. The special steps are introduced to avoid its
impacts on the non-related EEOP_{INNER/OUTER/SCAN}_VAR code path.

slot->tts_values is used to store the detoast datum so that any other
expressions can access it pretty easily.

Because of the above design, the detoast datum should have a same
lifespan as any other slot->tts_values[*], so the default
ecxt_per_tuple_memory is not OK for this. At last I used slot->tts_mcxt
to hold the memory, and maintaining these lifecycles in execTuples.c. To
know which datum in slot->tts_values is pre-detoasted, Bitmapset *
slot->detoast_attrs is introduced.

During the execution of these steps, below code like this is used:

static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
    if (!slot->tts_isnull[attnum] && VARATT_IS_EXTENDED(slot->tts_values[attnum]))
    {
        Datum        oldDatum;
        MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);

        oldDatum = slot->tts_values[attnum];
        slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
                                                                (struct varlena *) oldDatum));
        Assert(slot->tts_nvalid > attnum);
        if (oldDatum != slot->tts_values[attnum])
            slot->pre_detoasted_attrs = bms_add_member(slot->pre_detoasted_attrs, attnum);
        MemoryContextSwitchTo(old);
    }
}


Testing
-------
- shared_detoast_slow.sql is used to test the planner related codes changes.

  'set jit to off' will enable more INFO logs about which Var is
  pre-detoast in which node level.

- the cases the patch doesn't help much.

create table w(a int, b numeric);
insert into w select i, i from generate_series(1, 1000000)i;

Q3:
select a from w where a > 0;

Q4:
select b from w where b > 0;

pgbench -n -f 1.sql postgres -T 10 -M prepared

run 5 times and calculated the average value.

| Qry No |  Master | patched |  perf | comment                                 |
|--------+---------+---------+-------+-----------------------------------------|
|      3 | 309.438 | 308.411 |     0 | nearly zero impact on them              |
|      4 | 431.735 | 420.833 | +2.6% | save the detoast effort for numeric_out |

- good case

  setup:

create table b(big jsonb);
insert into b select
jsonb_object_agg( x::text,
random()::text || random()::text || random()::text )
from generate_series(1,600) f(x);
insert into b select (select big from b) from generate_series(1, 1000)i;


  workload:
  Q1:
  select big->'1', big->'2', big->'3', big->'5', big->'10' from b;

  Q2:
  select 1 from b where length(big->>'1') > 0 and length(big->>'2') > 2;


| No | Master | patched |
|----+--------+---------|
|  1 |   1677 |     357 |
|  2 |    638 |     332 |


Some Known issues:
------------------

1. Currently only Scan & Join nodes are considered for this feature.
2. JIT is not adapted for this purpose yet.
3. This feature builds a strong relationship between slot->tts_values
   and slot->pre_detoast_attrs. for example slot->tts_values[1] holds a
   detoast datum, so 1 is in the slot->pre_detoast_attrs bitmapset,
   however if someone changes the value directly, like
   "slot->tts_values[1] = 2;" but without touching the slot's
   pre_detoast_attrs then troubles comes. The good thing is the detoast
   can only happen on Scan/Join nodes. so any other slot is impossible
   to have such issue. I run the following command to find out such
   cases, looks none of them is a slot form Scan/Join node.

   egrep -nri 'slot->tts_values\[[^\]]*\] = *

Any thought?

-- 
Best Regards
Andy Fan


Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Amit Kapila
Дата:
Сообщение: Re: Synchronizing slots from primary to standby
Следующее
От: Amit Kapila
Дата:
Сообщение: Re: Track in pg_replication_slots the reason why slots conflict?