Re: Feature request for adoptive indexes

Поиск

Список

Период

Сортировка

От	Hayk Manukyan
Тема	Re: Feature request for adoptive indexes
Дата	2 ноября 2021 г. 15:04:29
Msg-id	CAF+kZOEmFL2FQXODbshbUntYZ9tejjEcN_NwVtWTEg16uihmzQ@mail.gmail.com обсуждение исходный текст
Ответ на	Re: Feature request for adoptive indexes (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Ответы	Re: Feature request for adoptive indexes (Tomas Vondra <tomas.vondra@enterprisedb.com>) Re: Feature request for adoptive indexes (Pavel Borisov <pashkin.elfe@gmail.com>)
Список	pgsql-hackers

Дерево обсуждения

Tomas Vondra

> Are you suggesting those are not the actual best/worst cases and we

> should use some other indexes? If yes, which ones?

I would say yes.

In my case I am not querying only sequence column.

I have the following cases which I want to optimize.

1. Select * from Some_table where job = <somthing> and nlp = <something> and year = <something> and scan_id = <something>

2. Select * from Some_table where job = <somthing> and nlp = <something> and year = <something> and Issue_flag = <something>

3. Select * from Some_table where job = <somthing> and nlp = <something> and year = <something> and sequence = <something>

Those are queries that my app send to db that is why I said that from read perspective our best case is 3 separate indexes for

(job, nlp, year, sequence) AND (job, nlp, year, Scan_ID) and (job, nlp, year, issue_flag) and any other solution like

(job, nlp, year, sequence, Scan_ID, issue_flag) OR (job, nlp, year ) INCLUDE(sequence, Scan_ID, issue_flag) OR just (job, nlp, year) can be considered as worst case

I will remind that in real world scenario I have ~50m rows and about ~5k rows for each (job, nlp, year )

From write perspective as far as we want to have only one index our best case can be considered any of

(job, nlp, year, sequence, Scan_ID, issue_flag) OR (job, nlp, year ) INCLUDE(sequence, Scan_ID, issue_flag)

and the worst case will be having 3 separate queries like in read perspective

(job, nlp, year, sequence) AND (job, nlp, year, Scan_ID) and (job, nlp, year, issue_flag)

So I think the comparison that we did is not right because we are comparing different/wrong things.

For right results we need to compare this two cases when we are doing random queries with 1,2,3 and random writes.

вт, 2 нояб. 2021 г. в 01:16, Tomas Vondra <tomas.vondra@enterprisedb.com>:

On 11/1/21 21:06, Robert Haas wrote:
> On Tue, Oct 26, 2021 at 11:11 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> If I get what you propose, you want to have a "top" tree for (job, nlp,
>> year), which "splits" the data set into subsets of ~5000-7000 rows. And
>> then for each subset you want a separate "small" trees on each of the
>> other columns, so in this case three trees.
>>
>> Well, the problem with this is pretty obvious - each of the small trees
>> requires separate copies of the leaf pages. And remember, in a btree the
>> internal pages are usually less than 1% of the index, so this pretty
>> much triples the size of the index. And if you insert a row into the
>> index, it has to insert the item pointer into each of the small trees,
>> likely requiring a separate I/O for each.
>>
>> So I'd bet this is not any different from just having three separate
>> indexes - it doesn't save space, doesn't save I/O, nothing.
>
> I agree. In a lot of cases, it's actually useful to define the index
> on fewer columns, like (job, nlp, year) or even just (job, nlp) or
> even just (job) because it makes the index so much smaller and that's
> pretty important. If you have enough duplicate entries in a (job, nlp,
> year) index to justify create a (job, nlp, year, sequence) index, the
> number of distinct (job, nlp, year) tuples has to be small compared to
> the number of (job, nlp, year, sequence) tuples - and that means that
> you wouldn't actually save much by combining your (job, nlp, year,
> sequence) index with a (job, nlp, year, other-stuff) index. As you
> say, the internal pages aren't the problem.
>
> I don't intend to say that there's no possible use for this kind of
> technology. Peter G. says that people are writing research papers
> about that and they probably wouldn't be doing that unless they'd
> found some case where it's a big win. But this example seems extremely
> unconvincing.
>

I actually looked at the use case mentioned by Peter G, i.e. merged
indexes with master-detail clustering (see e.g. [1]), but that seems
like a rather different thing. The master-detail refers to storing rows
from multiple tables, interleaved in a way that allows faster joins. So
it's essentially a denormalization tool.

Perhaps there's something we could learn about efficient storage of the
small trees, but I haven't found any papers describing that (I haven't
spent much time on the search, though).

[1] Algorithms for merged indexes, Goetz Graefe
https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.7709

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Jeevan Ladhe
Дата: 02 ноября 2021 г., 14:52:29
Сообщение: Re: refactoring basebackup.c

Следующее

От: "osumi.takamichi@fujitsu.com"
Дата: 02 ноября 2021 г., 15:07:01
Сообщение: RE: Failed transaction statistics to measure the logical replication progress

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Feature request for adoptive indexes

Предыдущее

Следующее