Discussion: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
From
Nikolay Samokhvalov
Date:
Very interesting read: https://arxiv.org/abs/1712.01208
HN discussion: https://news.ycombinator.com/item?id=15894896
Some of the comments (from Twitter https://twitter.com/schrockn/status/940037656494317568): "Jeff Dean and co at GOOG just released a paper showing how machine-learned indexes can replace B-Trees, Hash Indexes, and Bloom Filters. Execute 3x faster than B-Trees, 10-100x less space. Executes on GPU, which are getting faster unlike CPU. Amazing."
Can those ideas be applied to Postgres in its current state? Or are they not really down-to-earth?
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
From
Oleg Bartunov
Date:
On Mon, Dec 11, 2017 at 11:11 PM, Nikolay Samokhvalov <samokhvalov@gmail.com> wrote:
> Very interesting read: https://arxiv.org/abs/1712.01208
>
> HN discussion: https://news.ycombinator.com/item?id=15894896
>
> Some of the comments (from Twitter
> https://twitter.com/schrockn/status/940037656494317568): "Jeff Dean and co
> at GOOG just released a paper showing how machine-learned indexes can
> replace B-Trees, Hash Indexes, and Bloom Filters. Execute 3x faster than
> B-Trees, 10-100x less space. Executes on GPU, which are getting faster
> unlike CPU. Amazing."
>
> Can those ideas be applied to Postgres in its current state? Or it's not
> really down-to-earth?

Oleg made some analysis of the paper.
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
From
Oleg Ivanov
Date:
On 12/12/2017 04:33 PM, Oleg Bartunov wrote:
> On Mon, Dec 11, 2017 at 11:11 PM, Nikolay Samokhvalov
> <samokhvalov@gmail.com> wrote:
>> Very interesting read: https://arxiv.org/abs/1712.01208
>>
>> Can those ideas be applied to Postgres in its current state? Or it's not
>> really down-to-earth?
> Oleg made some analysis of the paper.

If the keys are comparable, the data distribution is simple enough (which seems to happen rather often), and the data does not change, we can learn the distribution and thus perform the search faster, and save memory as well. That is what the authors wrote, and that is what should work, in my opinion.

The limitations of the approach, which the authors mention in their work:

+ The data does not change. The proposed method can be extended more or less straightforwardly to nearly append-only workloads, and to workloads in which the data distribution does not change or changes very slowly (the data itself, or its amount, may still change). Otherwise it is challenging, because the key positions can no longer be treated as independent values, but it still looks possible. The other option the authors propose is to store new objects in a buffer and retrain the model in the background, but that does not look like a nice solution.

+ GPUs are fast, but the latency of operations on the GPU almost certainly negates all speed benefits for such small models. The only case in which it is reasonable to use a GPU is training the models (i.e. building such indexes).
The following remained unclear to me from the paper:

+ How effectively the approach can be implemented given the limitations above.

+ For hash indexes, the authors propose to replace the hash function with the learned CDF, which can decrease the number of collisions, which decreases the memory consumption. That is reasonable, but such a hash function implicitly uses information about the ordering of the keys. I suspect that by using the ordering of the data, with access to the data histogram, one could achieve a similar decrease in memory consumption.

+ The proposed hierarchical learning looks natural for the problem, but it is not clear that it is the best option. For example, one might use an end-to-end learning procedure with weight-sparsity regularizers as well. I wonder whether the latter approach can obtain results competitive with the former, and if not, why. Anyway, I have a feeling that the proposed method has room for improvement.

In general, the approach still needs some additional research (mostly around changing data), and there are some points to think about regarding how to implement it properly (paging, for example), but it looks promising enough nevertheless.
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
From
Stefan Keller
Date:
Dear Olegs, dear Nikolay, dear all,

Allow me to revive this thread: are there any advances in a learned index for PostgreSQL?

Background: I'm trying to benchmark those experimental indices. For this I did some bibliography work (see below). Fun fact: not only Postgres people love high-proof drinks, sorry, high-proof index names: see "From WiscKey to Bourbon" :-).

My main discovery is a discussion report - which included Stonebraker - entitled "ML-In-Databases: Assessment and Prognosis" [1]. The first two takeaways are: "#1: The initial comparisons of learned indices with optimized traditional indices should be further expanded to include concurrency control and multi-user settings. (...) #2: A key benefit of learned indices may come from the learned structures requiring lower space utilization, rather than a reduction in search time."

Yours, Stefan

[1] Kraska, T., Minhas, U. F., Neumann, T., Papaemmanouil, O., Patel, J. M., Ré, C., & Stonebraker, M. (2021). ML-In-Databases: Assessment and Prognosis. Data Engineering, 3. Web access: https://www.cs.brandeis.edu/~olga/publications/ieee2021.pdf

Bibliography (without any claim of completeness):

Kraska, T., Beutel, A., Chi, E. H., Dean, J., & Polyzotis, N. (2018, May). The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data (pp. 489-504). Web access: https://dl.acm.org/doi/pdf/10.1145/3183713.3196909

"Recursive model index, a learned index structure" (based on Kraska et al. 2017? 2018). Source code: https://github.com/BenJoyenConseil/rmi (Go), https://github.com/learnedsystems/RMI (Rust)

Kaur, T. (2018). Learned Index Structures: Practical Implementations and Future Directions. Master Thesis. Web access: https://wwwiti.cs.uni-magdeburg.de/iti_db/publikationen/ps/auto/kaur2018learned.pdf (claims to be open-source C++, but no code found)

Kipf, A., Marcus, R., van Renen, A., Stoian, M., Kemper, A., Kraska, T., & Neumann, T. (2020, June).
RadixSpline: a single-pass learned index. In Proceedings of the Third International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (pp. 1-5). Web access: https://dl.acm.org/doi/10.1145/3401071.3401659. Source code: https://github.com/learnedsystems/RadixSpline (C++).

Dai, Y., Xu, Y., Ganesan, A., Alagappan, R., Kroth, B., Arpaci-Dusseau, A., & Arpaci-Dusseau, R. (2020). From WiscKey to Bourbon: A learned index for log-structured merge trees. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20) (pp. 155-171). Web access: https://www.usenix.org/system/files/osdi20-dai_0.pdf

Ding, J., Minhas, U. F., Yu, J., Wang, C., Do, J., Li, Y., ... & Kraska, T. (2020, June). ALEX: an updatable adaptive learned index. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (pp. 969-984). Web access: https://doi.org/10.1145/3318464.3389711 / https://dblp.org/rec/conf/sigmod/DingMYWDLZCGKLK20. Source code: https://github.com/microsoft/ALEX (C++, MIT license)
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
From
Andrey Borodin
Date:
> On 20 Apr 2021, at 22:56, Stefan Keller <sfkeller@gmail.com> wrote:
>
> Are there any advances in a learned index for PostgreSQL?

BTW, take a look at PGM [0]. I'm slowly working on implementing it. I think it is kind of straightforward to implement it as an extension. I've started by forking B-tree [1]. I've removed support for anything that is not int4. Then I plan to remove the opclass/comparator abstraction layer, and then add interpolation search over the page before the binary search. And, maybe, add delta compression. Or, perhaps, fix WAL, which is completely broken. Or add some vectorisation. The concurrency is from a regular B-tree, of course.

This is my pet project, so, unsurprisingly, I've done only parts of the big plan :) I would appreciate any help. Thanks!

Best regards, Andrey Borodin.

[0] https://pgm.di.unipi.it/
[1] https://github.com/x4m/pg2m
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
From
Peter Geoghegan
Date:
On Tue, Apr 20, 2021 at 11:18 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> BTW take a look into PGM [0]. I'm slowly working on implementing it.
> I think it is kind of straightforward to implement it as extension.
> I've started from forking B-tree[1]. I've removed support of anything that is not int4.
> Then I plan to remove opclass\comparator abstraction layer.
> And then add interpolation search over the page before binsearch.

FWIW I'm a skeptic of learned indexes. There are lots of reasons why I don't think that they're practical, but it boils down to this: in general, data structures that work well don't need anybody to make a case for them. They simply work well for the task they were designed for. Anything is possible, but I strongly doubt that learned indexes are the exception to that general rule. Why hasn't anybody been able to apply the techniques in any real-world database system to date?

I'm sure that it's possible to make things much faster if one is allowed to make certain wide-reaching assumptions. They talk about big improvements in space utilization compared to B-Trees as if there were some very fixed idea of how that works in B-Trees. Why haven't they compared their model to grotty heuristics that practical experience has shown to work, such as the rightmost leaf page split heuristic? No paper about learned indexes that I've ever read even acknowledges the existence of these heuristics.

--
Peter Geoghegan
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
From
Chapman Flack
Date:
On 04/20/21 15:24, Peter Geoghegan wrote:
> data structures that work well don't need anybody to make a case for them.
> They simply work well for the task they were designed for.

How would showing that to be true for data structure X be different from making a case for data structure X?

Regards,
-Chap
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
From
Peter Geoghegan
Date:
On Tue, Apr 20, 2021 at 12:35 PM Chapman Flack <chap@anastigmatix.net> wrote:
> How would showing that to be true for data structure X be different from
> making a case for data structure X?

You don't have to understand the theoretical basis of B-Tree indexes to see that they work well. In fact, it took at least a decade for somebody to formalize how the space utilization works with B-Trees containing random data. Of course theory matters, but the fact is that B-Trees had been widely used for commercial and scientific applications that whole time.

Maybe I'll be wrong about learned indexes - who knows? But the burden of proof is not mine. I prefer to spend my time on things that I am reasonably confident will work out well ahead of time.

--
Peter Geoghegan
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
From
"Jonah H. Harris"
Date:
On Tue, Apr 20, 2021 at 3:45 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Apr 20, 2021 at 12:35 PM Chapman Flack <chap@anastigmatix.net> wrote:
> > How would showing that to be true for data structure X be different from
> > making a case for data structure X?
>
> You don't have to understand the theoretical basis of B-Tree indexes
> to see that they work well. In fact, it took at least a decade for
> somebody to formalize how the space utilization works with B-Trees
> containing random data. Of course theory matters, but the fact is that
> B-Trees had been widely used for commercial and scientific
> applications that whole time.
>
> Maybe I'll be wrong about learned indexes - who knows? But the burden
> of proof is not mine. I prefer to spend my time on things that I am
> reasonably confident will work out well ahead of time.
Agreed on all of your takes, Peter. In time, they will probably become more realistic. But, at present, I tend to see the research papers making comparisons between learned and traditional indexes that pitch the benefits of the former without any of the well-known optimizations of the latter - as if time stood still since the original B-Tree. Similarly, where most academic research starts to fall apart in practicality is in not addressing realistic write volumes and the related concurrency issues. I'm happy to be disproven on this, though.
Jonah H. Harris
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
From
Peter Geoghegan
Date:
On Tue, Apr 20, 2021 at 12:51 PM Jonah H. Harris <jonah.harris@gmail.com> wrote:
>> Maybe I'll be wrong about learned indexes - who knows? But the burden
>> of proof is not mine. I prefer to spend my time on things that I am
>> reasonably confident will work out well ahead of time.
>
> Agreed on all of your takes, Peter. In time, they will probably be more realistic.

A big problem when critically evaluating any complicated top-down model in the abstract is that it's too easy for the designer to hide *risk* (perhaps inadvertently). If you are allowed to make what amounts to an assumption that you have perfect foreknowledge of the dataset, then sure, you can do a lot with that certainty. You can easily find a way to make things faster or more space efficient by some ridiculous multiple that way (like 10x, 100x, whatever).

None of these papers ever gets around to explaining why what they've come up with is not simply fool's gold. The assumption that you can have robust foreknowledge of the dataset seems incredibly fragile, even if your model is *almost* miraculously good. I have no idea how fair that is. But my job is to make Postgres better, not to judge papers. My mindset is very matter-of-fact and practical.

--
Peter Geoghegan
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
From
Stefan Keller
Date:
Just for the record: a learned index has no more foreknowledge about the dataset than other indices.

I'd give learned indexes at least a chance to provide a proof of concept. And I want to learn more about the requirements for being accepted as a new index (before undergoing months of code sprints).

As you may have seen, the "Stonebraker paper" I cited [1] is also sceptical, requiring full parity on features (like "concurrency control, recovery, non main memory, and multi-user settings")! The non-main-memory code I understand.
=> But index read/write operations and multi-user settings are part of a separate piece of software (the transaction manager), aren't they?

~Stefan

[1] https://www.cs.brandeis.edu/~olga/publications/ieee2021.pdf
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
From
Tom Lane
Date:
Stefan Keller <sfkeller@gmail.com> writes:
> I'd give learned indexes at least a chance to provide a
> proof-of-concept. And I want to learn more about the requirements to
> be accepted as a new index (before undergoing months of code
> sprints).

There's enough support these days that you can build a new index type as an extension, without touching the core code at all. I'd advise proceeding that way, at least until you have a pretty convincing prototype.

regards, tom lane
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
From
Peter Geoghegan
Date:
On Tue, Apr 20, 2021 at 2:29 PM Stefan Keller <sfkeller@gmail.com> wrote:
> Just for the record: a learned index has no more foreknowledge about
> the dataset than other indices.

Maybe. ML models are famously prone to over-interpreting training data. In any case, I am simply not competent to assess how true this is.

> I'd give learned indexes at least a chance to provide a
> proof-of-concept. And I want to learn more about the requirements to
> be accepted as a new index (before undergoing months of code
> sprints).

I have everything to gain and nothing to lose by giving them a chance -- I'm not required to do anything to give them a chance, after all. I just want to be clear that I'm a skeptic now rather than later. I'm not the one making a big investment of my time here.

> As you may have seen, the "Stonebraker paper" I cited [1] is also
> sceptical, requiring full parity on features (like "concurrency control,
> recovery, non main memory, and multi-user settings")! The non-main-memory
> code I understand.
> => But index read/write operations and multi-user settings are part of
> a separate piece of software (the transaction manager), aren't they?

It's easy for me to be a skeptic -- again, what do I have to lose by freely expressing my opinion? Mostly I'm just saying that I wouldn't work on this because ISTM that there is significant uncertainty about the outcome, but much less uncertainty about the outcome of alternative projects of comparable difficulty. That's fundamentally how I assess what to work on. There is plenty of uncertainty on my end -- but that's beside the point.

--
Peter Geoghegan
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
From
Stefan Keller
Date:
On Tue, Apr 20, 2021 at 23:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There's enough support these days that you can build a new index
> type as an extension, without touching the core code at all.

Thanks. I'm ramping up knowledge about extending PG with C++. I'm still interested to understand, in principle, what an index has to do with concurrency control, in order to divide the concerns/responsibilities of the code.

On Tue, Apr 20, 2021 at 23:51, Peter Geoghegan <pg@bowt.ie> wrote:
> It's easy for me to be a skeptic

Isn't being a skeptic a requirement for all of us db engineers :-)

> but much less uncertainty about the outcome of alternative projects of comparable difficulty

Oh. As mentioned above, I'm trying to get an overview of indices. So, if you have hints about other new indexes (like PGM, VODKA for text/ts, or Hippo), I'm interested.

~Stefan
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
From
Oleg Bartunov
Date:
On Tue, Apr 20, 2021 at 8:56 PM Stefan Keller <sfkeller@gmail.com> wrote:
> Dear Olegs, dear Nikolay, dear all
>
> Allow me to revive this thread:
> Are there any advances in a learned index for PostgreSQL?
>
> Background: I'm trying to benchmark those experimental indices. For
> this I did some bibliography work (see below). Fun fact: Not only
> Postgres people love high-proof drinks, sorry, high-proof index names:
> see "From wisckey to bourbon" :-).
Have you seen the recent paper "Benchmarking Learned Indexes"?
I haven't looked into the code: https://github.com/learnedsystems/sosd
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
From
Bruce Momjian
Date:
On Wed, Apr 21, 2021 at 10:52:19AM +0200, Stefan Keller wrote:
> On Tue, Apr 20, 2021 at 23:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > There's enough support these days that you can build a new index
> > type as an extension, without touching the core code at all.
>
> Thanks. I'm ramping up knowledge about extending PG with C++.
>
> I'm still interested to understand in principle what an index has to
> do with concurrency control, in order to divide the
> concerns/responsibilities of the code.

The issue is that some index structures, like bitmap indexes, have very poor concurrent performance. This means that some indexes perform very well for a single user but poorly for multiple users.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
From
Stefan Keller
Date:
On Wed, Apr 21, 2021 at 11:16, Oleg Bartunov <obartunov@postgrespro.ru> wrote:
> Have you seen recent paper "Benchmarking Learned Indexes" ?

Yes. I skipped it after seeing that this benchmark "just" compares the algorithm implementations. What's needed - and what many here, as well as the "ML-In-Databases" paper from Kraska et al. (2021), are saying - is that a new index (like a learned index) should be implemented as a PostgreSQL extension.

On Wed, Apr 21, 2021 at 15:46, Bruce Momjian <bruce@momjian.us> wrote:
> The issue is that some index structures, like bitmap indexes, have very
> poor concurrent performance. This means that some indexes perform very
> well for a single user but poorly for multiple users.

I see now. That looks to me like a second step of an experiment to implement a possible new index.

~Stefan
Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
From
Andrey Borodin
Date:
> On 21 Apr 2021, at 21:01, Stefan Keller <sfkeller@gmail.com> wrote:
>
> What's needed - and what many here as well as the "ML-In-Databases"
> paper from Kraska et al. (2021) are saying - is that a new index
> (like a learned index) should be implemented as a PostgreSQL
> extension.

BTW, you don't have to implement an index from scratch. You can take one of the many existing ones and modify it toward a learned index. An index consists of many parts: scans, builds, vacuum, WAL, opclasses, etc., and simply adhering to the API is a lot of work. I'll be happy to help you with formulating the key differences and, probably, with the code of an extension.

Best regards, Andrey Borodin.