Re: Google Summer of Code: question about GiST API advancement project

Поиск
Список
Период
Сортировка
От GUO Rui
Тема Re: Google Summer of Code: question about GiST API advancement project
Дата
Msg-id CAEuz5PRfkd9S+MNsfzkbLtH-rtN6O4PseJcm7Zit-U4fVMZR7g@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Google Summer of Code: question about GiST API advancement project  (GUO Rui <ruig2@uci.edu>)
Ответы Re: Google Summer of Code: question about GiST API advancementproject
Список pgsql-hackers
I drafted my proposal about the above topic at https://docs.google.com/document/d/1X7Lw-c0rLYuSjwLNfw6qXpN5Cf1_0u2gXtgEgLkNezA/edit?usp=sharing  . Looking forward to your feedback.

On Wed, Apr 3, 2019 at 10:55 PM GUO Rui <ruig2@uci.edu> wrote:
Dear Andrey Borodin,

I discussed the above topic with the professors at my school, and got the following points:

1. The case that the volume of an MBB is 0 should be very rare, and if the data is skewed (e.g. only a few nodes have non-NULL value on a dimension) then the data can be pre-proceeded and normalized before it goes to the database, thus the storage and query can be much faster;
2. The performance of the R-tree family may depend on the specific data set and even the order of the data insertions, so one algorithm may be better on one dataset and slower on another, thus the benchmark should include different datasets;

I totally agree that by adopting the RR*-tree algorithm we can improve the performance of PostgreSQL. For my proposal, I'll:
1. Document the benchmarks I found available online (e.g. 
https://github.com/ambling/rtree-benchmark), and then state how we'd like to generate data ourselves (e.g. data with a Gaussian distribution, or the same dataset but different insertion order...) to test with for a wilder coverage;
2. Create tools to generate a report on current PostgreSQL performance with the benchmark;
3. Plan to improve the R-tree and GiST part of PostgreSQL. For the discussion in the email thread https://www.postgresql.org/message-id/flat/CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg%40mail.gmail.com#CAJEAwVFMo-FXaJ6Lkj8Wtb1br0MtBY48EGMVEJBOodROEGykKg@mail.gmail.com I prefer to do a scale-based trick rather than using bits in a float or creating a new struct;
4. Generate a performance report on PostgreSQL with the above R-tree patch;
The following would be marked as optional:
5. Optimize GiST with New APIs (e.g. non-penalty-based choose subtree function, also discussed in the above email thread);
6. For skewed data, try to warn the user, and then suggest methods to cook the data (e.g. the normalization algorithms in ML); pre-proceeding the data should not be the duty of the database;
7. Other advanced features of RR*-tree and GiST bulk loading;

Any comments or feedback on the above ideas? I'll work on a draft proposal ASAP.

Many thanks,
Rui Guo




On Sun, Mar 31, 2019 at 10:53 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Hi!

> 31 марта 2019 г., в 14:58, GUO Rui <ruig2@uci.edu> написал(а):
>
> I'm Rui Guo, a PhD student focusing on database at the University of California, Irvine. I'm interested in the "GiST API advancement" project for the Google Summer of Code 2019 which is listed at https://wiki.postgresql.org/wiki/GSoC_2019#GiST_API_advancement_.282019.29 .
>
> I'm still reading about RR*-tree, GiST and the PostgreSQL source code to have a better idea on my proposal. Meanwhile, I have a very basic and simple question:
>
> Since the chooseSubtree() algorithm in both R*-tree and RR*-tree are heuristic and somehow greedy (e.g. pick the MBB that needs to enlarge the least), is it possible to apply machine learning algorithm to improve it? The only related reference I got is to use deep learning in database join operation (https://arxiv.org/abs/1808.03196). Is it not suitable to use machine learning here or someone already did?

If you are interested in ML and DBs you should definitely look into [0]. You do not have to base your proposal on mentor ideas, you can use your own. Implementing learned indexes - seems reasonable.

RR*-tree algorithms are heuristic in some specific parts, but in general they are designed to optimize very clear metrics. Generally, ML algorithms tend to compose much bigger pile of heuristics and solve less mathematically clear tasks than splitting subtrees or choosing subtree for insertion.
R*-tree algorithms are heuristic only to be faster.

Best regards, Andrey Borodin.

[0] https://arxiv.org/pdf/1712.01208.pdf

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Thomas Munro
Дата:
Сообщение: Re: Using condition variables to wait for checkpoints
Следующее
От: Floris Van Nee
Дата:
Сообщение: Re: speeding up planning with partitions