Re: Out of Memory errors are frustrating as heck!
From | Tomas Vondra
---|---
Subject | Re: Out of Memory errors are frustrating as heck!
Date |
Msg-id | 20190422130415.3xv3p2izpyh7xhad@development
In reply to | Re: Out of Memory errors are frustrating as heck! (Gaetano Mendola <mendola@gmail.com>)
List | pgsql-performance
On Mon, Apr 22, 2019 at 10:07:52AM +0200, Gaetano Mendola wrote:
> Batch splitting shouldn't be followed by a hash function change?

What would be the value? That can help with hash collisions, but that's not the issue with the data sets discussed in this thread. The issue reported originally is about underestimates, and the sample data set has a large number of duplicate values (a single value representing ~10% of the data set). Neither of those issues is about hash collisions. The data set I used to demonstrate how the algorithms work is pretty much perfect, with uniform distribution and no hash collisions.

Furthermore, I don't think we can just change the hash function, for a couple of technical reasons.

Firstly, it's not like we totally redistribute the whole dataset from N old batches to (2*N) new ones. By using the same 32-bit hash value and considering one extra bit, the tuples either stay in the same batch (when the new bit is 0) or move to a single new batch (when it's 1). So each batch is split in two. If we changed the hash function, this would no longer be true, and we'd redistribute pretty much the whole data set.

The other issue is even more significant - we don't redistribute the tuples immediately. We only redistribute the current batch, but leave the other batches alone and handle them when we actually get to them. This is possible because the tuples never move backwards - when splitting batch K, the tuples either stay in K or move to 2K. Or something like that, I'm too lazy to recall the exact formula now.

And if I recall correctly, we can increase the number of batches while already performing the join, after some rows have already been processed. That would probably no longer be true if we just switched the hash function, because it might move rows backwards (to the already-processed region).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
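PS: For anyone following along, here is a minimal Python sketch of the split property described above. It assumes (as a simplification, not the actual PostgreSQL source) that the batch number is just the low-order bits of the 32-bit hash, with the number of batches a power of two. Under that assumption, doubling the batch count simply consults one extra hash bit, so a tuple in batch K either stays in K or moves forward to K + old_nbatch:

```python
import random

def batchno(h: int, nbatch: int) -> int:
    # Batch = low-order bits of the 32-bit hash; nbatch must be a power of two.
    return h & (nbatch - 1)

random.seed(42)
hashes = [random.getrandbits(32) for _ in range(100_000)]

old_n, new_n = 8, 16  # doubling the number of batches

for h in hashes:
    old = batchno(h, old_n)
    new = batchno(h, new_n)
    # One extra hash bit decides: stay put (bit 0) or move to one new batch (bit 1).
    assert new == old or new == old + old_n
    # In particular, tuples never move backwards to an earlier batch number.
    assert new >= old
```

Because each old batch spills into exactly one strictly higher-numbered new batch, already-processed batches never receive new tuples, which is what makes the lazy, one-batch-at-a-time redistribution safe. Changing the hash function would break both assertions.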