Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)

From: Oleg Ivanov
Subject: Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
Date:
Msg-id: 5A300B8C.9020000@postgrespro.ru
In response to: Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google) (Oleg Bartunov <obartunov@gmail.com>)
Responses: Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google) (Stefan Keller <sfkeller@gmail.com>)
List: pgsql-hackers
On 12/12/2017 04:33 PM, Oleg Bartunov wrote:
> On Mon, Dec 11, 2017 at 11:11 PM, Nikolay Samokhvalov
> <samokhvalov@gmail.com> wrote:
>> Very interesting read: https://arxiv.org/abs/1712.01208
>>
>> HN discussion: https://news.ycombinator.com/item?id=15894896
>>
>> Some of the comments (from Twitter
>> https://twitter.com/schrockn/status/940037656494317568): "Jeff Dean and co
>> at GOOG just released a paper showing how machine-learned indexes can
>> replace B-Trees, Hash Indexes, and Bloom Filters. Execute 3x faster than
>> B-Trees, 10-100x less space. Executes on GPU, which are getting faster
>> unlike CPU. Amazing."
>>
>> Can those ideas be applied to Postgres in its current state? Or it's not
>> really down-to-earth?
> Oleg made some analysis of the paper.
If the keys are comparable, the data distribution is simple enough 
(which seems to happen rather often), and the data does not change, we 
can learn the distribution and thus perform the search faster while 
also saving memory. That is what the authors wrote, and that is what 
should work, in my opinion.
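To make the idea above concrete, here is a minimal sketch (not the paper's actual implementation; the class name and the choice of a single linear model are my own assumptions) of learning the key-to-position mapping, i.e. the CDF, and then correcting the prediction with a binary search bounded by the model's worst-case error:

```python
import bisect

class LearnedIndex:
    """Hypothetical sketch: approximate position ~ a*key + b over sorted
    keys, then search only within the observed worst-case error window."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        xs, ys = self.keys, list(range(n))
        # Least-squares fit of position against key (pure Python, one model;
        # the paper uses a hierarchy of models instead).
        mx = sum(xs) / n
        my = sum(ys) / n
        var = sum((x - mx) ** 2 for x in xs)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        self.a = cov / var if var else 0.0
        self.b = my - self.a * mx
        # The maximal prediction error bounds the local search window,
        # which is what guarantees correct lookups.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(xs))

    def _predict(self, key):
        return int(self.a * key + self.b)

    def lookup(self, key):
        pos = self._predict(key)
        lo = max(0, pos - self.err)
        hi = min(len(self.keys), pos + self.err + 1)
        # Binary search restricted to the guaranteed window.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None
```

The speed and memory savings come from replacing a full B-tree traversal with one model evaluation plus a search over a (hopefully small) error window.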

The limitations of the approach, which the authors mention in their work:
+ The data does not change. The proposed method can be extended more or 
less straightforwardly to nearly append-only workloads and to workloads 
in which the data distribution does not change or changes very slowly 
(the data itself or its amount may change, nevertheless). Otherwise it 
is challenging, because the key positions can no longer be treated as 
independent values, but it still looks possible. The other option 
proposed by the authors is to store new objects in a buffer and retrain 
the model in the background, but that does not look like a nice solution.
+ GPUs are fast, but the latency of doing operations on the GPU almost 
certainly cancels out all speed benefits for such small models. The only 
case in which it is reasonable to use a GPU is training the models 
(i.e., building such indexes).

The following was left unclear to me from the paper:
+ How effectively the approach can be implemented, taking into account 
the limitations above.
+ For hash indexes, the authors propose to replace the hash function 
with the learned CDF, which can decrease the number of collisions and 
thereby the memory consumption. That is reasonable, but such a hash 
function implicitly uses information about the ordering of the keys. I 
suppose that, using the ordering of the data and with access to a 
histogram of the data, one could achieve a similar decrease in memory 
consumption.
+ The proposed hierarchical learning looks natural for the problem, but 
it is not clear that it is the best option. For example, one might use 
an end-to-end learning procedure with weight-sparsity regularizers 
instead. I wonder whether the latter approach can obtain results 
competitive with the former, and if not, why. Anyway, I have a feeling 
that the proposed method has room for improvement.

In general, the approach still needs some additional research (mostly 
regarding changing data), and there are some points to think about 
regarding a proper implementation (paging, for example), but it looks 
promising enough nevertheless.

