Making strxfrm() blobs in indexes work

Поиск
Список
Период
Сортировка
От Peter Geoghegan
Тема Making strxfrm() blobs in indexes work
Дата
Msg-id CAM3SWZTcXrdDZSpA11qZXiyo4_jtxwjaNdZpnY54yjzq7d64=A@mail.gmail.com
обсуждение исходный текст
Ответы Re: Making strxfrm() blobs in indexes work  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Making strxfrm() blobs in indexes work  (Peter Geoghegan <pg@heroku.com>)
Список pgsql-hackers
On more occasions than I care to recall, someone has suggested that it
would be valuable to do something with strxfrm() blobs in order to
have cheaper locale-aware text comparisons. One obvious place to do so
would be in indexes, but in the past that has been dismissed on the
following grounds:

* Index-only scans need fully formed datums to work, and strxfrm() is
a one-way function (or so I believe). There is no reason to think that
the original string can be reassembled from the blob, so clearly that
won't fly.

* There is a high cost to be paid in storage overhead. Even for
collations like "en_US.UTF-8", that can mean that the blob is as much
as 3-4 times larger than the original text string. Who is to say that
we'll come out ahead even with the savings of just doing a strcmp()
rather than a strcoll()?

So, even though using strxfrm() blobs can be much faster, it seems
hard to find a place to use them. A compelling improvement in
performance for index scans is so close, and yet so far. But having
mulled it over during my recent work on various aspects of the B-Tree
code, I believe that there is a way that we can cheat and come out
ahead with strxfrm() blobs in indexes.

I have a sample dataset that I like to work off of. It's the database
of the Mouse Genome Database, available from
http://www.informatics.jax.org/. This is real data, from a real
Postgres database, and so I recommend working off one of their weekly
dumps if you're looking for a real dataset that is big enough to
exercise most interesting things, and yet not too big. It seems that
there is an entire ecosystem of tools for medical researchers around
the dataset, and they have the right attitude about opening it all up,
which is pretty cool.

Anyway, I quickly threw together the following query, which shows the
break-down of leaf pages, inner pages and root pages among the largest
indexes, and the proportion of each per index:

[local] pg@mouse=# with tots as (
SELECT count(*) c, type, relname from       (select relname, relpages, generate_series(1, relpages - 1) i
from pg_class c join pg_namespace n on c.relnamespace = n.oid where
relkind = 'i' and nspname = 'mgd') r,       lateral (select * from bt_page_stats('mgd.' || relname, i)) u
group by relname, type)
select tots.relname, relpages -1 as non_meta_pages, c, c/sum(c)
over(partition by tots.relname) as prop_of_index, type from tots join
pg_class c on c.relname = tots.relname order by 2 desc, 1, type;
                   relname                     | non_meta_pages |   c  |       prop_of_index        | type

------------------------------------------------+----------------+--------+----------------------------+------acc_accession_0
                              |         106181 |
 
520 |     0.00489729801000178940 | iacc_accession_0                                |         106181 |
105660 |     0.99509328410920974562 | lacc_accession_0                                |         106181 |1 |
0.000009417880788464979610| racc_accession_idx_accid                        |         106181 |
 
520 |     0.00489729801000178940 | iacc_accession_idx_accid                        |         106181 |
105660 |     0.99509328410920974562 | lacc_accession_idx_accid                        |         106181 |1 |
0.000009417880788464979610| rmgi_notechunk_idx_note                         |         101127 |
 
4817 |     0.04763317412758214918 | imgi_notechunk_idx_note                         |         101127 |
96309 |     0.95235693731644367973 | lmgi_notechunk_idx_note                         |         101127 |1 |
0.000009888555974171091795| racc_accession_1                                |          80507 |
 
302 |     0.00375122660141354168 | iacc_accession_1                                |          80507 |
80204 |     0.99623635211844932739 | lacc_accession_1                                |          80507 |1 |
0.000012421280137130932714| racc_accession_idx_prefixpart                   |          80507 |
 
302 |     0.00375122660141354168 | iacc_accession_idx_prefixpart                   |          80507 |
80204 |     0.99623635211844932739 | lacc_accession_idx_prefixpart                   |          80507 |1 |
0.000012421280137130932714| r
 

***SNIP***

To the surprise of no one, typically ~99.5% of B-Tree index pages are
leaf pages. I do see one index on a column of very large text strings
(a "notes" column) that is only about 95% leaf pages, which is omitted
here. But the general picture is that over 99% of non-meta pages are
leaf pages. Again, this is hardly surprising: That's more or less why
B-Tree indexes work so well when partially cached.

I'm sure anyone that has read this far knows where I'm going with
this: why can't we just have strxfrm() blobs in the inner pages,
implying large savings for a big majority of text comparisons that
service index scans, without bloating the indexes too badly, and
without breaking anything? We only use inner pages to find leaf pages.
They're redundant copies of the data within the index.

My reading of the page split code is that inserting a downlink into
the parent post-split is totally naive of the details (including the
size) of the pointed-to IndexTuple - _bt_insert_parent() does this:

/* Recursively update the parent */
_bt_insertonpg(rel, pbuf, stack->bts_parent,           new_item, stack->bts_offset + 1,           is_only);

So ISTM that we could come up with an infrastructure, possibly just
for insertion scanKeys (limiting the code footprint of all of this) in
order to inner-page-process datums at this juncture, and store a blob
instead, for later savings on walking the tree. I believe that this
technique has applicability beyond strxfrm(), and a type-naive
infrastructure could be developed to support it, with a degree of
complexity not too far in excess of, say, the SortSupport
infrastructure. The realization that we don't really need to recover
the original information directly from the inner pages gives us scope
to push things in several useful directions, I suspect. Furthermore,
because it takes a long time to fill inner pages, bloating them might
not be that bad in larger indexes. My guess is that on the whole it
would be well worth it.

This is obviously pretty hand-wavy, but I thought it was still worth
sharing while the basic idea was fresh in my mind. Thoughts?
-- 
Peter Geoghegan



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Jim Nasby
Дата:
Сообщение: Re: Fwd: Request for error explaination || Adding a new integer in indextupleData Structure
Следующее
От: Tom Lane
Дата:
Сообщение: Re: pg_upgrade: make the locale comparison more tolerating