Discussion: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)


ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From
Nikolay Samokhvalov
Date:
Very interesting read: https://arxiv.org/abs/1712.01208


Some of the comments (from Twitter https://twitter.com/schrockn/status/940037656494317568): "Jeff Dean and co at GOOG just released a paper showing how machine-learned indexes can replace B-Trees, Hash Indexes, and Bloom Filters. Execute 3x faster than B-Trees, 10-100x less space. Executes on GPU, which are getting faster unlike CPU. Amazing."

Can those ideas be applied to Postgres in its current state? Or it's not really down-to-earth?

Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From
Oleg Bartunov
Date:
On Mon, Dec 11, 2017 at 11:11 PM, Nikolay Samokhvalov
<samokhvalov@gmail.com> wrote:
> Very interesting read: https://arxiv.org/abs/1712.01208
>
> HN discussion: https://news.ycombinator.com/item?id=15894896
>
> Some of the comments (from Twitter
> https://twitter.com/schrockn/status/940037656494317568): "Jeff Dean and co
> at GOOG just released a paper showing how machine-learned indexes can
> replace B-Trees, Hash Indexes, and Bloom Filters. Execute 3x faster than
> B-Trees, 10-100x less space. Executes on GPU, which are getting faster
> unlike CPU. Amazing."
>
> Can those ideas be applied to Postgres in its current state? Or it's not
> really down-to-earth?

Oleg made some analysis of the paper.


Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From
Oleg Ivanov
Date:
On 12/12/2017 04:33 PM, Oleg Bartunov wrote:
> On Mon, Dec 11, 2017 at 11:11 PM, Nikolay Samokhvalov
> <samokhvalov@gmail.com> wrote:
>> Very interesting read: https://arxiv.org/abs/1712.01208
>>
>> HN discussion: https://news.ycombinator.com/item?id=15894896
>>
>> Some of the comments (from Twitter
>> https://twitter.com/schrockn/status/940037656494317568): "Jeff Dean and co
>> at GOOG just released a paper showing how machine-learned indexes can
>> replace B-Trees, Hash Indexes, and Bloom Filters. Execute 3x faster than
>> B-Trees, 10-100x less space. Executes on GPU, which are getting faster
>> unlike CPU. Amazing."
>>
>> Can those ideas be applied to Postgres in its current state? Or it's not
>> really down-to-earth?
> Oleg made some analysis of the paper.
If the keys are comparable, the data distribution is simple enough
(which seems to happen rather often), and the data does not change, we
can learn the distribution and so perform the search faster, while also
saving memory. That is what the authors wrote, and in my opinion that
is what should work.

The limitations of the approach, which the authors mention in their work:
+ The data does not change. The proposed method can be extended more or
less straightforwardly to nearly append-only workloads and to workloads
in which the data distribution does not change or changes very slowly
(the data itself or its amount may still change, nevertheless).
Otherwise it is challenging, because the key positions can no longer be
treated as independent values, but it still looks possible. The other
option proposed by the authors is to store new objects in a buffer and
retrain the model in the background, but that does not look like a nice
solution.
+ GPUs are fast, but the latency of operations on the GPU almost
certainly cancels out all speed benefits for such small models. The only
case in which it is reasonable to use a GPU is training the models
(i.e. building such indexes).

The following was left unclear to me from the paper:
+ How effectively the approach can be implemented taking into account
the limitations above.
+ For hash indexes the authors propose to replace the hash function with
the learned CDF, which can decrease the number of collisions, which in
turn decreases the memory consumption. That is reasonable, but this hash
function implicitly uses information about the ordering of the keys. I
suppose that by using that ordering, with access to the data histogram,
one could achieve a similar decrease in memory consumption.
+ The proposed hierarchical learning looks natural for the problem, but
it is not clear that it is the best option. For example, one might just
as well use an end-to-end learning procedure with weight-sparsity
regularizers. I wonder whether the latter approach can obtain results
competitive with the former, and if not, then why. Anyway, I have a
feeling that the proposed method has room for improvement.
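
The learned-CDF-as-hash idea in the second point can be sketched in a
few lines. This is a toy Python illustration built on an empirical CDF
from a sorted sample; the function names are invented and this is not
the paper's implementation:

```python
import bisect

def build_cdf_hash(sample_keys, num_buckets):
    """Build a hash function from the empirical CDF of a key sample.

    Keys drawn from the sampled distribution map almost uniformly onto
    the buckets, which reduces collisions compared to a plain modulo
    hash on skewed data. Note the implicit use of key ordering pointed
    out above: the model is just a rank lookup in a sorted sample.
    """
    sample = sorted(sample_keys)
    n = len(sample)

    def h(key):
        rank = bisect.bisect_right(sample, key)   # empirical CDF times n
        return min(int(rank * num_buckets / n), num_buckets - 1)

    return h
```

For example, on 1000 quadratically skewed keys (i*i) and 100 buckets,
this spreads the keys over all 100 buckets, while (key % 100) only ever
hits the quadratic-residue buckets.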

In general, the approach still needs some additional research (mostly
regarding changing data), and there are some points to think through
about how to implement it properly (paging, for example), but it looks
promising enough nevertheless.
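
The paper's headline "range index" idea discussed above (a model
predicts a key's position in sorted order, and a bounded local search
corrects the model's error) can also be sketched briefly. This is a toy
single-segment linear model in Python, not the paper's recursive model
index; the class and method names are invented:

```python
import bisect

class LearnedIndex:
    """Fit position ~ slope*key + intercept over a sorted key array,
    record the worst prediction error at build time, and at lookup time
    search only the window [pred - err, pred + err]."""

    def __init__(self, keys):
        self.keys = keys                      # must be sorted, distinct
        n = len(keys)
        # Closed-form least-squares fit of position on key (no numpy).
        mean_k = sum(keys) / n
        mean_p = (n - 1) / 2
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
        var = sum((k - mean_k) ** 2 for k in keys)
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Maximum model error bounds the search window for every lookup.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(keys))

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        pred = self._predict(key)
        lo = max(0, pred - self.err)
        hi = min(len(self.keys), pred + self.err + 1)
        i = lo + bisect.bisect_left(self.keys[lo:hi], key)
        return i if i < len(self.keys) and self.keys[i] == key else None
```

For simple (near-linear) distributions the recorded error is tiny, so a
lookup touches only a handful of entries; for badly fitting data the
structure stays correct but degrades toward a full binary search.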


Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From
Stefan Keller
Date:
Dear Olegs, dear Nikolay, dear all

Allow me to revive this thread:

Are there any advances in a learned index for PostgreSQL?

Background: I'm trying to benchmark those experimental indices. For
this I did some bibliography work (see below). Fun fact: Not only
Postgres people love high-proof drinks, sorry, high-proof index names:
see "From wisckey to bourbon" :-).

My main discovery is a discussion report - which included Stonebraker
- entitled "ML-In-Databases: Assessment and Prognosis" [1]. The first
two takeaways are "#1: The initial comparisons of learned indices with
optimized traditional indices should be further expanded to include
concurrency control and multi-user settings. (...) #2: A key benefit
of learned indices may come from the learned structures requiring
lower space utilization, rather than a reduction in search time."

Yours, Stefan


[1] Kraska, T., Minhas, U. F., Neumann, T., Papaemmanouil, O., Patel,
J. M., Ré, C., & Stonebraker, M. (2021). ML-In-Databases: Assessment
and Prognosis. Data Engineering, 3. Web access:
https://www.cs.brandeis.edu/~olga/publications/ieee2021.pdf


Bibliography (without any claim to completeness):

Kraska, T., Beutel, A., Chi, E. H., Dean, J., & Polyzotis, N. (2018,
May). The case for learned index structures. In Proceedings of the
2018 International Conference on Management of Data (pp. 489-504). Web
access https://dl.acm.org/doi/pdf/10.1145/3183713.3196909

"Recursive model index, a learned index structure" (based on Kraska et
al. 2017? 2018). Source code: https://github.com/BenJoyenConseil/rmi
(Go language), https://github.com/learnedsystems/RMI (Rust)

Kaur, T. (2018). Learned Index Structures: Practical Implementations
and Future Directions. Master Thesis. Web access:
https://wwwiti.cs.uni-magdeburg.de/iti_db/publikationen/ps/auto/kaur2018learned.pdf
 (claims to be open-source C++, but no code was found)

Kipf, A., Marcus, R., van Renen, A., Stoian, M., Kemper, A., Kraska,
T., & Neumann, T. (2020, June). RadixSpline: a single-pass learned
index. In Proceedings of the Third International Workshop on
Exploiting Artificial Intelligence Techniques for Data Management (pp.
1-5). Web access: https://dl.acm.org/doi/10.1145/3401071.3401659.
Source code: https://github.com/learnedsystems/RadixSpline (C++).

Dai, Y., Xu, Y., Ganesan, A., Alagappan, R., Kroth, B.,
Arpaci-Dusseau, A., & Arpaci-Dusseau, R. (2020). From wisckey to
bourbon: A learned index for log-structured merge trees. In 14th
{USENIX} Symposium on Operating Systems Design and Implementation
({OSDI} 20) (pp. 155-171). Web access:
https://www.usenix.org/system/files/osdi20-dai_0.pdf

Ding, J., Minhas, U. F., Yu, J., Wang, C., Do, J., Li, Y., ... &
Kraska, T. (2020, June). ALEX: an updatable adaptive learned index. In
Proceedings of the 2020 ACM SIGMOD International Conference on
Management of Data (pp. 969-984). Web access:
https://doi.org/10.1145/3318464.3389711 /
https://dblp.org/rec/conf/sigmod/DingMYWDLZCGKLK20 . Source code:
https://github.com/microsoft/ALEX (C++, MIT license)




Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From
Andrey Borodin
Date:

> On 20 Apr 2021, at 22:56, Stefan Keller <sfkeller@gmail.com> wrote:
>
> Are there any advances in a learned index for PostgreSQL?

BTW take a look at PGM [0]. I'm slowly working on implementing it.
I think it is kind of straightforward to implement it as an extension.
I've started by forking B-tree [1]. I've removed support for anything that is not int4.
Then I plan to remove the opclass/comparator abstraction layer.
And then add interpolation search over the page before binsearch.
And, maybe, add delta compression. Or, perhaps, fix WAL, which is completely broken. Or add some vectorisation.
The concurrency is from a regular B-tree, of course.
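
The "interpolation search over the page before binsearch" step can be
sketched as follows. This is a generic Python illustration of the
technique, not code from the pg2m fork; it assumes distinct integer
keys and a made-up window size:

```python
import bisect

def page_search(keys, target, window=8):
    """Locate target in a sorted page of distinct integer keys: one
    interpolation probe narrows the range, then a binary search runs on
    the small window around the guess; if the guess window misses, fall
    back to a plain full-page binary search."""
    n = len(keys)
    if n == 0 or not (keys[0] <= target <= keys[-1]):
        return None
    span = keys[-1] - keys[0]
    # Interpolated position guess, assuming roughly uniform key spacing.
    guess = (n - 1) * (target - keys[0]) // span if span else 0
    lo, hi = max(0, guess - window), min(n, guess + window + 1)
    if keys[lo] <= target and (hi == n or target < keys[hi]):
        i = lo + bisect.bisect_left(keys[lo:hi], target)  # narrow search
    else:
        i = bisect.bisect_left(keys, target)  # guess missed: full binsearch
    return i if i < n and keys[i] == target else None
```

On nearly uniform pages the probe lands within a few entries of the
answer, so the binary search touches a constant-size window; skewed
pages just pay the normal binary-search cost.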

This is my pet project. So, unsurprisingly, I've done only parts of the big plan :) I would appreciate any help.

Thanks!

Best regards, Andrey Borodin.


[0] https://pgm.di.unipi.it/
[1] https://github.com/x4m/pg2m


Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From
Peter Geoghegan
Date:
On Tue, Apr 20, 2021 at 11:18 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> BTW take a look into PGM [0]. I'm slowly working on implementing it.
> I think it is kind of straightforward to implement it as extension.
> I've started from forking B-tree[1]. I've removed support of anything that is not int4.
> Then I plan to remove opclass\comparator abstraction layer.
> And then add interpolation search over the page before binsearch.

FWIW I'm a skeptic of learned indexes. There are lots of reasons why I
don't think that they're practical, but it boils down to this: in
general data structures that work well don't need anybody to make a
case for them. They simply work well for the task they were designed
for.

Anything is possible, but I strongly doubt that learned indexes are
the exception to that general rule. Why hasn't anybody been able to
apply the techniques in any real-world database system to date? I'm
sure that it's possible to make things much faster if you're allowed to
make certain wide-reaching assumptions. They talk about big
improvements in space utilization compared to B-Trees as if there were
some very fixed idea of how that works in B-Trees. Why haven't they
compared their model to the grotty heuristics that practical experience
has shown to work, such as the rightmost leaf page split heuristic?
No paper about learned indexes that I've read even acknowledges the
existence of these heuristics.

-- 
Peter Geoghegan



Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From
Chapman Flack
Date:
On 04/20/21 15:24, Peter Geoghegan wrote:
> data structures that work well don't need anybody to make a case for them.
> They simply work well for the task they were designed for.

How would showing that to be true for data structure X be different from
making a case for data structure X?

Regards,
-Chap



Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From
Peter Geoghegan
Date:
On Tue, Apr 20, 2021 at 12:35 PM Chapman Flack <chap@anastigmatix.net> wrote:
> How would showing that to be true for data structure X be different from
> making a case for data structure X?

You don't have to understand the theoretical basis of B-Tree indexes
to see that they work well. In fact, it took at least a decade for
somebody to formalize how the space utilization works with B-Trees
containing random data. Of course theory matters, but the fact is that
B-Trees had been widely used for commercial and scientific
applications that whole time.

Maybe I'll be wrong about learned indexes - who knows? But the burden
of proof is not mine. I prefer to spend my time on things that I am
reasonably confident will work out well ahead of time.


--
Peter Geoghegan



Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From
"Jonah H. Harris"
Date:
On Tue, Apr 20, 2021 at 3:45 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Apr 20, 2021 at 12:35 PM Chapman Flack <chap@anastigmatix.net> wrote:
> > How would showing that to be true for data structure X be different from
> > making a case for data structure X?
>
> You don't have to understand the theoretical basis of B-Tree indexes
> to see that they work well. In fact, it took at least a decade for
> somebody to formalize how the space utilization works with B-Trees
> containing random data. Of course theory matters, but the fact is that
> B-Trees had been widely used for commercial and scientific
> applications that whole time.
>
> Maybe I'll be wrong about learned indexes - who knows? But the burden
> of proof is not mine. I prefer to spend my time on things that I am
> reasonably confident will work out well ahead of time.

Agreed on all of your takes, Peter. In time, they will probably become more realistic. But at present, the research papers tend to compare learned vs. traditional indexes by pitching the benefits of the former without any of the well-known optimizations of the latter - as if time had stood still since the original B-Tree. Similarly, where most academic research starts to fall apart in practicality is in its failure to address realistic write volumes and the related concurrency issues. I'm happy to be disproven on this, though.

--
Jonah H. Harris

Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From
Peter Geoghegan
Date:
On Tue, Apr 20, 2021 at 12:51 PM Jonah H. Harris <jonah.harris@gmail.com> wrote:
>> Maybe I'll be wrong about learned indexes - who knows? But the burden
>> of proof is not mine. I prefer to spend my time on things that I am
>> reasonably confident will work out well ahead of time.
>
>
> Agreed on all of your takes, Peter. In time, they will probably be more realistic.

A big problem when critically evaluating any complicated top-down
model in the abstract is that it's too easy for the designer to hide
*risk* (perhaps inadvertently). If you are allowed to make what
amounts to an assumption that you have perfect foreknowledge of the
dataset, then sure, you can do a lot with that certainty. You can
easily find a way to make things faster or more space efficient by
some ridiculous multiple that way (like 10x, 100x, whatever).

None of these papers ever get around to explaining why what they've
come up with is not simply fool's gold. The assumption that you can
have robust foreknowledge of the dataset seems incredibly fragile,
even if your model is *almost* miraculously good. I have no idea how
fair that is. But my job is to make Postgres better, not to judge
papers. My mindset is very matter-of-fact and practical.

-- 
Peter Geoghegan



Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From
Stefan Keller
Date:
Just for the record: a learned index has no more foreknowledge about
the dataset than other indices.

I'd give learned indexes at least a chance to provide a
proof-of-concept. And I want to learn more about the requirements for
being accepted as a new index (before undergoing months of code
sprints).

As you may have seen, the "Stonebraker paper" I cited [1] is also
skeptical, requiring full parity on features (like "concurrency control,
recovery, non main memory, and multi-user settings")! The
non-main-memory code I understand.
=> But aren't index read/write operations and multi-user settings part
of a separate software component (the transaction manager)?

~Stefan

[1] https://www.cs.brandeis.edu/~olga/publications/ieee2021.pdf





Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From
Tom Lane
Date:
Stefan Keller <sfkeller@gmail.com> writes:
> I'd give learned indexes at least a change to provide a
> proof-of-concept. And I want to learn more about the requirements to
> be accepted as a new index (before undergoing month's of code
> sprints).

There's enough support these days that you can build a new index
type as an extension, without touching the core code at all.
I'd advise proceeding that way, at least until you have a pretty
convincing prototype.

            regards, tom lane



Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From
Peter Geoghegan
Date:
On Tue, Apr 20, 2021 at 2:29 PM Stefan Keller <sfkeller@gmail.com> wrote:
> Just for the records: A learned index as no more foreknowledge about
> the dataset as other indices.

Maybe. ML models are famously prone to over-interpreting training
data. In any case I am simply not competent to assess how true this
is.

> I'd give learned indexes at least a change to provide a
> proof-of-concept. And I want to learn more about the requirements to
> be accepted as a new index (before undergoing month's of code
> sprints).

I have everything to gain and nothing to lose by giving them a chance
-- I'm not required to do anything to give them a chance, after all. I
just want to be clear that I'm a skeptic now rather than later. I'm
not the one making a big investment of my time here.

> As you may have seen, the "Stonebraker paper" I cited [1] is also
> sceptic requiring full parity on features (like "concurrency control,
> recovery, non main memory,and multi-user settings")! Non main memory
> code I understand.
> => But index read/write operations and multi-user settings are part of
> a separate software (transaction manager), aren't they?

It's easy for me to be a skeptic -- again, what do I have to lose by
freely expressing my opinion? Mostly I'm just saying that I wouldn't
work on this because ISTM that there is significant uncertainty about
the outcome, but much less uncertainty about the outcome of
alternative projects of comparable difficulty. That's fundamentally
how I assess what to work on. There is plenty of uncertainty on my end
-- but that's beside the point.

-- 
Peter Geoghegan



Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From
Stefan Keller
Date:
On Tue, 20 Apr 2021 at 23:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There's enough support these days that you can build a new index
> type as an extension, without touching the core code at all.

Thanks. I'm ramping up knowledge about extending PG with C++.

I'm still interested to understand, in principle, what an index has to
do with concurrency control, in order to divide the
concerns/responsibilities of the code.

On Tue, 20 Apr 2021 at 23:51, Peter Geoghegan <pg@bowt.ie> wrote:
> It's easy for me to be a skeptic

Isn't being a skeptic a requirement for all of us db engineers :-)

> but much less uncertainty about the outcome of alternative projects of comparable difficulty

Oh. As mentioned above, I'm trying to get an overview of indices. So
if you have hints about other new indexes (like PGM, VODKA for
text/ts, or Hippo), I'm interested.

 ~Stefan




Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From
Oleg Bartunov
Date:


On Tue, Apr 20, 2021 at 8:56 PM Stefan Keller <sfkeller@gmail.com> wrote:
> Dear Olegs, dear Nikolay, dear all
>
> Allow me to revive this thread:
>
> Are there any advances in a learned index for PostgreSQL?
>
> Background: I'm trying to benchmark those experimental indices. For
> this I did some bibliography work (see below). Fun fact: Not only
> Postgres people love high-proof drinks, sorry, high-proof index names:
> see "From wisckey to bourbon" :-).

Have you seen the recent paper "Benchmarking Learned Indexes"?
I haven't looked into the code: https://github.com/learnedsystems/sosd
 



--
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From
Bruce Momjian
Date:
On Wed, Apr 21, 2021 at 10:52:19AM +0200, Stefan Keller wrote:
> Di., 20. Apr. 2021 23:50 Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > There's enough support these days that you can build a new index
> > type as an extension, without touching the core code at all.
> 
> Thanks. I'm ramping up knowledge about extending PG with C++.
> 
> I'm still interested to understand in principle what an index has to
> do with concurrency control, in order to divide
> concerns/reponsibilities of code.

The issue is that some index structures, like bitmap indexes, have very
poor concurrent performance.  This means that some indexes perform very
well for a single user but poorly for multiple users.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From
Stefan Keller
Date:
On Wed, 21 Apr 2021 at 11:16, Oleg Bartunov <obartunov@postgrespro.ru> wrote:
> Have you seen recent paper "Benchmarking Learned Indexes" ?

Yes. I skipped it after seeing that this benchmark "just" compares the
algorithm implementations.

What's needed - and what many here, as well as the "ML-In-Databases"
paper from Kraska et al. (2021), are saying - is that a new index
(like a learned index) should be implemented as a PostgreSQL
extension.

On Wed, 21 Apr 2021 at 15:46, Bruce Momjian <bruce@momjian.us> wrote:
> The issue is that some index structures, like bitmap indexes, have very
> poor concurrent performance.  This means that some indexes perform very
> well for a single user but poorly for multiple users.

I see now. That looks to me like a second step of an experiment to
implement a possible new index.

~Stefan




Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From
Andrey Borodin
Date:

> On 21 Apr 2021, at 21:01, Stefan Keller <sfkeller@gmail.com> wrote:
>
> What's needed - and what many here as well as the "ML-In-Databases"
> paper from Kraska et al. (2021) are saying - is, that a new index
> (like a learned index) should be implemented as a PostgreSQL
> extension.

BTW, you don't have to implement an index from scratch. You can take one of the many existing ones and modify it towards a Learned Index.
An index comprises many parts: scans, builds, vacuums, WAL, opclasses, etc., and simply adhering to the API is a lot of work.
I'll be happy to help you with formulating the key differences and, probably, with the code of an extension.

Best regards, Andrey Borodin.