Thread: Enabling deduplication with system catalog indexes

Enabling deduplication with system catalog indexes

From
Peter Geoghegan
Date:
System catalog indexes do not support deduplication as a matter of
policy. I chose to do things that way during the Postgres 13
development cycle due to the restriction on using storage parameters
with system catalog indexes. At the time I felt that *forcing* the use
of deduplication with system catalog indexes might expose users to
problems. But this is something that seems worth revisiting now. (I
haven't actually investigated what it would take to make system
catalogs support the 'deduplicate_items' parameter, but that may not
matter now.)
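
For context, this is how the 'deduplicate_items' storage parameter is used with an ordinary (non-catalog) B-Tree index on Postgres 13 or later. It is only a minimal sketch: the table and index names are hypothetical and not part of the original message.

```sql
-- Hypothetical user table with a low-cardinality column.
CREATE TABLE dedup_demo (id bigint, status text);

-- Deduplication is on by default for B-Tree indexes; it can be
-- disabled explicitly through the storage parameter.
CREATE INDEX dedup_demo_status_idx
    ON dedup_demo (status)
    WITH (deduplicate_items = off);

-- Re-enable it later. This only affects future insertions.
ALTER INDEX dedup_demo_status_idx SET (deduplicate_items = on);

-- Inspect the setting that is currently stored for the index.
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'dedup_demo_status_idx';
```

System catalog indexes do not accept storage parameters this way, which is the restriction described above.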

I would like to enable deduplication within system catalog indexes for
Postgres 15. Leaving it disabled forever seems kind of arbitrary at
best. In general enabling deduplication (or not disabling it) has only
a fixed, small downside in the worst case. It has a huge upside in
favorable cases. Deduplication is part of our high level strategy for
avoiding nbtree index bloat from version churn (non-HOT updates with
several indexes that are never "logically modified"). It effectively
cooperates with and builds on the recent enhancements to index deletion in
Postgres 14. Plus these recent index deletion enhancements more or
less eliminated a theoretical downside of deduplication: now it
doesn't really matter that posting list tuples only have a single
LP_DEAD bit (if it ever did). This is because we can now do granular
posting list TID deletion, provided the deletion process visits the
same heap block in passing.

I can find no evidence that even one single user found it useful to
disable deduplication while using Postgres 13 in production (by
searching for "deduplicate_items" on Google). While I myself said that
there might be a regression of up to 2% of throughput back in early
2020, that was under highly unrealistic conditions that could never
apply to system catalogs -- I was being conservative. Most system
catalog indexes are unique indexes, where there is no possible
overhead from deduplication unless we already know for sure that the
index is subject to some kind of version churn (and so have high
confidence that deduplication will be at least somewhat effective at
buying time for VACUUM). The non-unique system catalog indexes seem
pretty likely to benefit from deduplication in the usual obvious way
(not so much because of versioning and bloat). The two pg_depend
non-unique indexes tend to have a fair number of duplicates.
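
One rough way to check the claim about duplicates is to compare the number of pg_depend rows with the number of distinct index keys. The sketch below assumes the key columns of pg_depend_reference_index are (refclassid, refobjid, refobjsubid), as in the standard catalog definition. It is not taken from the original message.

```sql
-- Each group is one distinct key value in pg_depend_reference_index.
-- The larger index_entries is relative to distinct_keys, the more
-- duplicates there are for deduplication to merge into posting lists.
SELECT count(*)  AS distinct_keys,
       sum(n)    AS index_entries,
       max(n)    AS largest_group
FROM (
    SELECT count(*) AS n
    FROM pg_depend
    GROUP BY refclassid, refobjid, refobjsubid
) AS g;
```

The same query with (classid, objid, objsubid) would give the corresponding figures for pg_depend_depender_index.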

-- 
Peter Geoghegan



Re: Enabling deduplication with system catalog indexes

From
Peter Geoghegan
Date:
On Wed, Sep 29, 2021 at 11:27 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I would like to enable deduplication within system catalog indexes for
> Postgres 15.

I decided to run a simple experiment, to give us some idea of what
benefits my proposal gives users: I ran "make installcheck" on a newly
initdb'd database (master branch), and then with the attached patch
(which enables deduplication with system catalog indexes) applied.

I ran a query that shows the 20 largest system catalog indexes in each
case. I'm interested in when and where we see improvements to space
utilization. Any reduction in index size must be a result of index
deduplication (excluding any noise-level changes).

Master branch:

regression=# SELECT
  pg_size_pretty(pg_relation_size(c.oid)) as sz,
  c.relname
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog'
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY pg_relation_size(c.oid) DESC LIMIT 20;
   sz    |              relname
---------+-----------------------------------
 1088 kB | pg_attribute_relid_attnam_index
 928 kB  | pg_depend_depender_index
 800 kB  | pg_attribute_relid_attnum_index
 736 kB  | pg_depend_reference_index
 352 kB  | pg_proc_proname_args_nsp_index
 216 kB  | pg_description_o_c_o_index
 200 kB  | pg_class_relname_nsp_index
 184 kB  | pg_type_oid_index
 176 kB  | pg_class_tblspc_relfilenode_index
 160 kB  | pg_type_typname_nsp_index
 104 kB  | pg_proc_oid_index
 64 kB   | pg_class_oid_index
 64 kB   | pg_statistic_relid_att_inh_index
 56 kB   | pg_collation_name_enc_nsp_index
 48 kB   | pg_constraint_conname_nsp_index
 48 kB   | pg_amop_fam_strat_index
 48 kB   | pg_amop_opr_fam_index
 48 kB   | pg_largeobject_loid_pn_index
 48 kB   | pg_operator_oprname_l_r_n_index
 48 kB   | pg_index_indexrelid_index
(20 rows)

Patch:

   sz    |              relname
---------+-----------------------------------
 1048 kB | pg_attribute_relid_attnam_index
 888 kB  | pg_depend_depender_index
 752 kB  | pg_attribute_relid_attnum_index
 616 kB  | pg_depend_reference_index
 352 kB  | pg_proc_proname_args_nsp_index
 216 kB  | pg_description_o_c_o_index
 192 kB  | pg_class_relname_nsp_index
 184 kB  | pg_type_oid_index
 152 kB  | pg_type_typname_nsp_index
 144 kB  | pg_class_tblspc_relfilenode_index
 104 kB  | pg_proc_oid_index
 72 kB   | pg_class_oid_index
 56 kB   | pg_collation_name_enc_nsp_index
 56 kB   | pg_statistic_relid_att_inh_index
 48 kB   | pg_index_indexrelid_index
 48 kB   | pg_amop_fam_strat_index
 48 kB   | pg_amop_opr_fam_index
 48 kB   | pg_largeobject_loid_pn_index
 48 kB   | pg_operator_oprname_l_r_n_index
 40 kB   | pg_index_indrelid_index
(20 rows)

The improvements to space utilization for the larger indexes
(especially the two pg_depend non-unique indexes) are smaller than I
remember from last time around, back in early 2020. This is probably
due to a combination of the Postgres 14 work and the pg_depend PIN
optimization work from commit a49d0812.

The single biggest difference is the decrease in the size of
pg_depend_reference_index -- it goes from 736 kB to 616 kB. Another
notable difference is that pg_class_tblspc_relfilenode_index shrinks,
going from 176 kB to 144 kB. These are not huge differences, but they
still seem worth having.
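
To put the differences in relative terms, the sizes reported in the two listings above can be fed into a small query. This is only an illustrative sketch: the figures are copied from the listings, and the column names are made up for the example.

```sql
-- Sizes in kB, taken from the "Master branch" and "Patch" listings.
SELECT relname,
       master_kb,
       patch_kb,
       round(100.0 * (master_kb - patch_kb) / master_kb, 1) AS pct_smaller
FROM (VALUES
        ('pg_depend_reference_index',          736,  616),
        ('pg_class_tblspc_relfilenode_index',  176,  144),
        ('pg_attribute_relid_attnum_index',    800,  752),
        ('pg_depend_depender_index',           928,  888),
        ('pg_attribute_relid_attnam_index',   1088, 1048)
     ) AS t(relname, master_kb, patch_kb)
ORDER BY pct_smaller DESC;
```

For the two indexes singled out above, this works out to reductions of roughly 16% and 18% respectively.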

The best argument in favor of my proposal is definitely the index
bloat argument, which this test case tells us little or nothing about.
I'm especially concerned about scenarios where logical replication is
used, or where index deletion and VACUUM are inherently unable to
remove older index tuple versions for some other reason.

--
Peter Geoghegan

Attachments

Re: Enabling deduplication with system catalog indexes

From
Peter Geoghegan
Date:
On Wed, Sep 29, 2021 at 3:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I decided to run a simple experiment, to give us some idea of what
> benefits my proposal gives users: I ran "make installcheck" on a newly
> initdb'd database (master branch), and then with the attached patch
> (which enables deduplication with system catalog indexes) applied.

I will commit this patch in a few days, barring objections.

-- 
Peter Geoghegan



Re: Enabling deduplication with system catalog indexes

From
"Bossart, Nathan"
Date:
On 9/30/21, 3:44 PM, "Peter Geoghegan" <pg@bowt.ie> wrote:
> On Wed, Sep 29, 2021 at 3:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> I decided to run a simple experiment, to give us some idea of what
>> benefits my proposal gives users: I ran "make installcheck" on a newly
>> initdb'd database (master branch), and then with the attached patch
>> (which enables deduplication with system catalog indexes) applied.
>
> I will commit this patch in a few days, barring objections.

+1

Nathan


Re: Enabling deduplication with system catalog indexes

From
Peter Geoghegan
Date:
On Fri, Oct 1, 2021 at 2:35 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> On 9/30/21, 3:44 PM, "Peter Geoghegan" <pg@bowt.ie> wrote:
> > I will commit this patch in a few days, barring objections.
>
> +1

Okay, pushed.

Thanks
-- 
Peter Geoghegan