Shared detoast Datum proposal
От | Andy Fan |
---|---|
Тема | Shared detoast Datum proposal |
Дата | |
Msg-id | 87il4jrk1l.fsf@163.com обсуждение исходный текст |
Ответы |
Re: Shared detoast Datum proposal
(Andy Fan <zhihuifan1213@163.com>)
|
Список | pgsql-hackers |
Problem: -------- Toast works well for its claimed purposes. However the current detoast infrastructure doesn't shared the detoast datum in the query's lifespan, which will cause some inefficiency like this: SELECT big_jsonb_col->'a', big_jsonb_col->'b', big_jsonb_col->'c' FROM t; In the above case, the big_jsonb_col will be detoast 3 times for every record. a more common case maybe: CREATE TABLE t(a numeric); insert into t select i FROM generate_series(1, 10)i; SELECT * FROM t WHERE a > 3; Here we needs detoasted because of VARATT_CAN_MAKE_SHORT, and it needs to be detoasted twice, one is in a > 3, the other one is in the targetlist, where we need to call numeric_out. Proposal -------- When we access some desired toast column in EEOP_{SCAN|INNER_OUTER}_VAR steps, we can detoast it immediately and save it back to slot->tts_values. With this way, when any other expressions seeks for the Datum, it get the detoast version. Planner decides which Vars should use this feature, executor manages it detoast action and memory management. Planner design --------------- 1. This feature only happen at the Plan Node where the detoast would happen in the previous topology, for example: for example: SELECT t1.toast_col, t2.toast_col FROM t1 join t2 USING (toast_col); toast_col just happens at the Join node's slot even if we have a projection on t1 or t2 at the scan node (except the Parameterized path). However if SELECT t1.toast_col, t2.toast_col FROM t1 join t2 USING (toast_col) WHERE t1.toast_col > 'a'; the detoast *may* happen at the scan of level t1 since "t1.toast_col > 'a'" accesses the Var within a FuncCall ('>' operator), which will cause a detoast. (However it should not happen if it is under a Sort node, for details see Planner Design section 2). At the implementation side, I added "Bitmapset *reference_attrs;" to Scan node which show if the Var should be accessed with the pre-detoast way in expression execution engine. the value is maintained at the create_plan/set_plan_refs stage. Two similar fields are added in Join node. Bitmapset *outer_reference_attrs; Bitmapset *inner_reference_attrs; In the both case, I tracked the level of walker/mutator, if the level greater than 1 when we access a Var, the 'var->varattno - 1' is added to the bitmapset. Some special node should be ignored, see increase_level_for_pre_detoast for details. 2. We also need to make sure the detoast datum will not increase the work_mem usage for the nodes like Sort/Hash etc, all of such nodes can be found with search 'CP_SMALL_TLIST' flags. If the a node under a Sort-Hash-like nodes, we have some extra checking to see if a Var is a *directly input* of such nodes. If yes, we can't detoast it in advance, or else, we know the Var has been discarded before goes to these nodes, we still can use the shared detoast feature. The simplest cases to show this is: For example: 2.1 Query-1 explain (verbose) select * from t1 where b > 'a'; -- b can be detoast in advance. Query-2 explain (verbose) select * from t1 where b > 'a' order by c; -- b can't be detoast since it will makes the Sort use more work_mem. Query-3 explain (verbose) select a, c from t1 where b > 'a' order by c; -- b can be pre-detoasted, since it is discarded before goes to Sort node. In this case it doesn't do anything good, but for some complex case like Query-4, it does. Query-4 explain (costs off, verbose) select t3.* from t1, t2, t3 where t2.c > '999999999999999' and t2.c = t1.c and t3.b = t1.b; QUERY PLAN -------------------------------------------------------------- Hash Join Output: t3.a, t3.b, t3.c Hash Cond: (t3.b = t1.b) -> Seq Scan on public.t3 Output: t3.a, t3.b, t3.c -> Hash Output: t1.b -> Nested Loop Output: t1.b <-- Note here 3 -> Seq Scan on public.t2 Output: t2.a, t2.b, t2.c Filter: (t2.c > '9999...'::text) <--- Note Here 1 -> Index Scan using t1_c_idx on public.t1 Output: t1.a, t1.b, t1.c Index Cond: (t1.c = t2.c) <--- Note Here 2 (15 rows) In this case, detoast datum for t2.c can be shared and it benefits for t2.c = t1.c and no harm for the Hash node. Execution side -------------- Once we decide a Var should be pre-detoasted for a given plan node, a special step named as EEOP_{INNER/OUTER/SCAN}_VAR_TOAST will be created during ExecInitExpr stage. The special steps are introduced to avoid its impacts on the non-related EEOP_{INNER/OUTER/SCAN}_VAR code path. slot->tts_values is used to store the detoast datum so that any other expressions can access it pretty easily. Because of the above design, the detoast datum should have a same lifespan as any other slot->tts_values[*], so the default ecxt_per_tuple_memory is not OK for this. At last I used slot->tts_mcxt to hold the memory, and maintaining these lifecycles in execTuples.c. To know which datum in slot->tts_values is pre-detoasted, Bitmapset * slot->detoast_attrs is introduced. During the execution of these steps, below code like this is used: static inline void ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum) { if (!slot->tts_isnull[attnum] && VARATT_IS_EXTENDED(slot->tts_values[attnum])) { Datum oldDatum; MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt); oldDatum = slot->tts_values[attnum]; slot->tts_values[attnum] = PointerGetDatum(detoast_attr( (struct varlena *) oldDatum)); Assert(slot->tts_nvalid > attnum); if (oldDatum != slot->tts_values[attnum]) slot->pre_detoasted_attrs = bms_add_member(slot->pre_detoasted_attrs, attnum); MemoryContextSwitchTo(old); } } Testing ------- - shared_detoast_slow.sql is used to test the planner related codes changes. 'set jit to off' will enable more INFO logs about which Var is pre-detoast in which node level. - the cases the patch doesn't help much. create table w(a int, b numeric); insert into w select i, i from generate_series(1, 1000000)i; Q3: select a from w where a > 0; Q4: select b from w where b > 0; pgbench -n -f 1.sql postgres -T 10 -M prepared run 5 times and calculated the average value. | Qry No | Master | patched | perf | comment | |--------+---------+---------+-------+-----------------------------------------| | 3 | 309.438 | 308.411 | 0 | nearly zero impact on them | | 4 | 431.735 | 420.833 | +2.6% | save the detoast effort for numeric_out | - good case setup: create table b(big jsonb); insert into b select jsonb_object_agg( x::text, random()::text || random()::text || random()::text ) from generate_series(1,600) f(x); insert into b select (select big from b) from generate_series(1, 1000)i; workload: Q1: select big->'1', big->'2', big->'3', big->'5', big->'10' from b; Q2: select 1 from b where length(big->>'1') > 0 and length(big->>'2') > 2; | No | Master | patched | |----+--------+---------| | 1 | 1677 | 357 | | 2 | 638 | 332 | Some Known issues: ------------------ 1. Currently only Scan & Join nodes are considered for this feature. 2. JIT is not adapted for this purpose yet. 3. This feature builds a strong relationship between slot->tts_values and slot->pre_detoast_attrs. for example slot->tts_values[1] holds a detoast datum, so 1 is in the slot->pre_detoast_attrs bitmapset, however if someone changes the value directly, like "slot->tts_values[1] = 2;" but without touching the slot's pre_detoast_attrs then troubles comes. The good thing is the detoast can only happen on Scan/Join nodes. so any other slot is impossible to have such issue. I run the following command to find out such cases, looks none of them is a slot form Scan/Join node. egrep -nri 'slot->tts_values\[[^\]]*\] = * Any thought? -- Best Regards Andy Fan
Вложения
В списке pgsql-hackers по дате отправления:
Следующее
От: Amit KapilaДата:
Сообщение: Re: Track in pg_replication_slots the reason why slots conflict?