Re: Why is a hash join preferred when it does not fit in work_mem
From | Dimitrios Apostolou
---|---
Subject | Re: Why is a hash join preferred when it does not fit in work_mem
Date |
Msg-id | 78211cab-8f6e-52e8-8d06-26e7097c43bd@gmx.net
In reply to | Re: Why is a hash join preferred when it does not fit in work_mem (Tom Lane <tgl@sss.pgh.pa.us>)
List | pgsql-general
On Sat, 14 Jan 2023, Tom Lane wrote:

> Dimitrios Apostolou <jimis@gmx.net> writes:
>> Please correct me if I'm wrong, as I'm a newcomer to PostgreSQL, but here
>> is how I understand things according to posts I've read, and classical
>> algorithms:
>
>> + The Hash Join is fastest when one side fits in work_mem. Then on one
>> hand you have a hash table lookup (amortized O(1)) and on the other
>> hand, if the table has M rows that do not fit in memory, you have
>> sequential reads from the disk (given low fragmentation of the table or
>> index files): for every row you read from the disk, you look up the key
>> in the hash table.
>
>> If the hash table does not fit in RAM then the cost becomes prohibitive.
>> Every lookup is a random access possibly hitting the disk. The total
>> cost should be random_page_cost * M.
>
> That would be true of a simple hash join, but Postgres uses batched
> hash joins: we split up the hash key space into subsets, where hopefully
> each subset includes few enough inner-side rows to fit into work_mem.
> While this can go wrong given pathological distribution of the inner-side
> keys, it does mean that the join can perform well even when the inner
> side is much larger than work_mem. So it's not the case that the planner
> will simply disregard hash joins beyond work_mem. It will apply a cost
> penalty for the predicted batching overhead;

Thanks for this, I found a page [1] that describes the hash join and now
I understand a bit more.

[1] https://www.interdb.jp/pg/pgsql03.html

I'm not sure whether the key distribution is pathological in my case. The
join condition is:

    Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)

and workitem_ids.workitem_n is an integer GENERATED AS IDENTITY and
PRIMARY KEY. The table workitem_ids has 1.7M rows, and the other table
has 3.7M rows. Neither of them fits in work_mem.
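For my own understanding, here is a minimal sketch of the batched
(Grace-style) hash join as I understand it from Tom's description and
from [1]. In-memory lists stand in for the on-disk batch files, and all
names and sizes are made up for illustration; the real executor is of
course far more sophisticated:

```python
def partition(rows, key, nbatch):
    """Phase 1: split rows into nbatch buckets by hashing the join key,
    so matching keys from both inputs always land in the same batch."""
    batches = [[] for _ in range(nbatch)]
    for row in rows:
        batches[hash(row[key]) % nbatch].append(row)
    return batches

def batched_hash_join(inner, outer, inner_key, outer_key, nbatch=4):
    inner_batches = partition(inner, inner_key, nbatch)
    outer_batches = partition(outer, outer_key, nbatch)
    result = []
    # Phase 2: join each batch pair independently. Only one inner
    # batch's hash table must fit in memory (work_mem) at a time.
    for ib, ob in zip(inner_batches, outer_batches):
        table = {}
        for row in ib:
            table.setdefault(row[inner_key], []).append(row)
        for row in ob:
            for match in table.get(row[outer_key], []):
                result.append({**match, **row})
    return result

# Toy stand-ins for my two tables (names reused for illustration only):
workitem_ids = [{"workitem_n": n, "label": f"w{n}"} for n in range(100)]
tasks_mm_workitems = [{"task_n": t, "workitem_n": t % 100} for t in range(370)]
joined = batched_hash_join(workitem_ids, tasks_mm_workitems,
                           "workitem_n", "workitem_n")
print(len(joined))  # 370: every task row matches exactly one workitem
```

So each input is written out once during partitioning and read back once
during joining, which is where the "batching overhead" cost penalty
presumably comes from.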
In my (simplified and pathological) case of work_mem == 1MB, the hash
join does 512 batches (Buckets: 4,096 Batches: 512 Memory Usage: 759kB).
I'm not sure which hash join strategy is followed, but based on that
document it should be the "hybrid hash join with skew". I don't quite
follow the I/O requirements of this algorithm yet. :-)

> but that can still come out
> cheaper than merge join, because the sorting needed for merge is generally
> also far from cheap.

I was under the impression that an on-disk merge sort is a relatively
cheap (logN) operation, regarding random I/O.

>> So I would expect an increased random_page_cost to benefit the Merge Join
>> algorithm. And since my setup involves spinning disks, it does make sense
>> to increase it.
>
> What is probably really happening is that random_page_cost affects the
> estimated cost of performing the sort using an index scan instead of
> a bespoke sort step. AFAIR, cost_sort doesn't consider random_page_cost
> at all, and neither does cost_hashjoin.

On the last EXPLAIN I posted for the forced merge join, I see that it
uses an index scan on the "small" table. That makes sense, since the join
happens on the primary key of that table. On the large table it does not
use an index scan, because no index exists for that column; instead it
sorts the table's 3.7M rows (and FWIW that table only has two integer
columns). That is, if I understood correctly what you meant by
"performing the sort using an index scan".

The problem I see is that the estimated cost of the sort operation
(609,372.91..618,630.40) is already way above the whole hash join's
estimated cost (121,222.68..257,633.01). However the real timings are
very different: the actual time for the Sort is 4,602.569..5,414.072 ms,
while for the whole hash join it is 145,641.295..349,682.387 ms. Am I
missing some configuration knobs that would bring the planner's
estimates closer to reality?

Thanks,
Dimitris
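P.S. To convince myself where the 512 batches come from, here is a
back-of-the-envelope sketch, assuming the batch count is rounded up to a
power of two so that each batch's hash table fits in work_mem. The
inner-side size is only a rough guess (roughly 512 batches x 759kB),
not a number taken from EXPLAIN:

```python
import math

def nbatches(inner_bytes, work_mem_bytes):
    # Partitions needed so each one fits in work_mem,
    # rounded up to the next power of two.
    need = math.ceil(inner_bytes / work_mem_bytes)
    if need <= 1:
        return 1
    return 1 << math.ceil(math.log2(need))

work_mem = 1024 * 1024                 # 1MB, as in my test
inner_size = 380 * 1024 * 1024         # rough guess at total hash table size
print(nbatches(inner_size, work_mem))  # 512 with these guesses
```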