Re: Feature request for adoptive indexes

Поиск
Список
Период
Сортировка
От Hayk Manukyan
Тема Re: Feature request for adoptive indexes
Дата
Msg-id CAF+kZOEmFL2FQXODbshbUntYZ9tejjEcN_NwVtWTEg16uihmzQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Feature request for adoptive indexes  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Ответы Re: Feature request for adoptive indexes  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Re: Feature request for adoptive indexes  (Pavel Borisov <pashkin.elfe@gmail.com>)
Список pgsql-hackers
Tomas Vondra
> Are you suggesting those are not the actual best/worst cases and we
> should use some other indexes? If yes, which ones?

I would say yes.
In my case I am not querying only sequence column.
I have the following cases which I want to optimize.
1. Select * from Some_table where job = <somthing> and nlp = <something> and year = <something> and  scan_id = <something>  
2. Select * from Some_table where job = <somthing> and nlp = <something> and year = <something> and  Issue_flag = <something>  
3. Select * from Some_table where job = <somthing> and nlp = <something> and year = <something> and  sequence = <something>  
Those are queries that my app send to db that is why I said that from read perspective our best case is 3 separate indexes for 
 (job, nlp, year, sequence) AND (job, nlp, year, Scan_ID) and (job, nlp, year,  issue_flag)  and any other solution like 
 (job, nlp, year, sequence, Scan_ID, issue_flag) OR  (job, nlp, year ) INCLUDE(sequence, Scan_ID, issue_flag)  OR just (job, nlp, year) can be considered as worst case 
I will remind that in real world scenario I have ~50m rows and about ~5k rows for each (job, nlp, year )
From write perspective as far as we want to have only one index our best case can be considered any of
(job, nlp, year, sequence, Scan_ID, issue_flag) OR  (job, nlp, year ) INCLUDE(sequence, Scan_ID, issue_flag) 
and the worst case will be having 3 separate queries like in read perspective 
(job, nlp, year, sequence) AND (job, nlp, year, Scan_ID) and (job, nlp, year,  issue_flag)  

So I think the comparison that we did is not right because we are comparing different/wrong things.

For right results we need to compare this two cases when we are doing random queries with 1,2,3  and random writes.


вт, 2 нояб. 2021 г. в 01:16, Tomas Vondra <tomas.vondra@enterprisedb.com>:


On 11/1/21 21:06, Robert Haas wrote:
> On Tue, Oct 26, 2021 at 11:11 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> If I get what you propose, you want to have a "top" tree for (job, nlp,
>> year), which "splits" the data set into subsets of ~5000-7000 rows. And
>> then for each subset you want a separate "small" trees on each of the
>> other columns, so in this case three trees.
>>
>> Well, the problem with this is pretty obvious - each of the small trees
>> requires separate copies of the leaf pages. And remember, in a btree the
>> internal pages are usually less than 1% of the index, so this pretty
>> much triples the size of the index. And if you insert a row into the
>> index, it has to insert the item pointer into each of the small trees,
>> likely requiring a separate I/O for each.
>>
>> So I'd bet this is not any different from just having three separate
>> indexes - it doesn't save space, doesn't save I/O, nothing.
>
> I agree. In a lot of cases, it's actually useful to define the index
> on fewer columns, like (job, nlp, year) or even just (job, nlp) or
> even just (job) because it makes the index so much smaller and that's
> pretty important. If you have enough duplicate entries in a (job, nlp,
> year) index to justify create a (job, nlp, year, sequence) index, the
> number of distinct (job, nlp, year) tuples has to be small compared to
> the number of (job, nlp, year, sequence) tuples - and that means that
> you wouldn't actually save much by combining your (job, nlp, year,
> sequence) index with a (job, nlp, year, other-stuff) index. As you
> say, the internal pages aren't the problem.
>
> I don't intend to say that there's no possible use for this kind of
> technology. Peter G. says that people are writing research papers
> about that and they probably wouldn't be doing that unless they'd
> found some case where it's a big win. But this example seems extremely
> unconvincing.
>

I actually looked at the use case mentioned by Peter G, i.e. merged
indexes with master-detail clustering (see e.g. [1]), but that seems
like a rather different thing. The master-detail refers to storing rows
from multiple tables, interleaved in a way that allows faster joins. So
it's essentially a denormalization tool.

Perhaps there's something we could learn about efficient storage of the
small trees, but I haven't found any papers describing that (I haven't
spent much time on the search, though).

[1] Algorithms for merged indexes, Goetz Graefe
     https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.7709


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Jeevan Ladhe
Дата:
Сообщение: Re: refactoring basebackup.c
Следующее
От: "osumi.takamichi@fujitsu.com"
Дата:
Сообщение: RE: Failed transaction statistics to measure the logical replication progress