Re: Spilling hashed SetOps and aggregates to disk

From: David Gershuni
Subject: Re: Spilling hashed SetOps and aggregates to disk
Date:
Msg-id: 998FE0E3-CC36-43CC-96D0-5707959A4B0D@cs.cmu.edu
In reply to: Re: Spilling hashed SetOps and aggregates to disk (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
As Serge mentioned, we’ve implemented spill-to-disk for SetOps and Aggregates at Salesforce. We were hitting OOMs often enough that this became a high priority for us. However, our current spill implementation is based on dynahash from 9.6, and we’re not happy with its performance: it was primarily an OOM stop-gap, not an optimized design.

Because of this, we’ve spec’d out a new spill-to-disk design for hash-based aggregates (and later SetOps), and we plan to begin implementation very soon.  Since this is an important fix for the community as well, we would be happy (and would prefer) to share our spill-to-disk implementation. 

Our new spill approach is very similar to Jeff’s and Heikki’s and is designed to use simplehash.h. It’s based on an algorithm called “HybridHash with Pre-Partitioning” found in [1]. It may later make sense to implement the “HashSort” algorithm from [1] as well, which works better for highly skewed grouping keys; the optimizer could eventually choose between the two based on the available stats. We also like Heikki’s suggestions to use logtape.c to reduce disk usage and a trie-based approach to control partition sizes dynamically.
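
To make the control flow concrete, here is a minimal, self-contained C sketch of the hybrid-hash-with-pre-partitioning idea, not our actual patch: the names and constants (HASH_BUDGET standing in for work_mem, NPARTITIONS, a toy COUNT(*) over integer keys, a linear-scan table instead of simplehash.h) are all hypothetical simplifications. Groups are aggregated in memory while the table fits; tuples whose group is not resident are routed by hash to partition files, and each spilled partition is then re-aggregated recursively with a fresh table:

/*
 * Minimal, self-contained sketch of "HybridHash with Pre-Partitioning"
 * for COUNT(*) per integer key.  Error handling omitted; a linear-scan
 * "hash table" stands in for simplehash.h.
 */
#include <stdio.h>
#include <stdlib.h>

#define NPARTITIONS 4      /* spill fan-out */
#define HASH_BUDGET 1024   /* max in-memory groups, standing in for work_mem */

typedef struct Group { long key; long count; } Group;

static Group table[HASH_BUDGET];
static int   ngroups;

/* Find the group for a key, inserting it if the table still has room. */
static Group *lookup_or_insert(long key)
{
    for (int i = 0; i < ngroups; i++)
        if (table[i].key == key)
            return &table[i];
    if (ngroups >= HASH_BUDGET)
        return NULL;               /* table is full: caller must spill */
    table[ngroups] = (Group){ key, 0 };
    return &table[ngroups++];
}

/* Consume different hash bits at each recursion level, so a partition
 * that spills again keeps splitting instead of looping forever. */
static int partition_of(long key, int depth)
{
    unsigned long h = (unsigned long) key * 0x9E3779B97F4A7C15UL;
    return (int) ((h >> (depth * 8 % 56)) % NPARTITIONS);
}

/* One pass: aggregate in memory while the table fits, route overflow
 * tuples to partition files, then recurse on each spilled partition. */
static void agg_pass(const long *keys, size_t n, int depth)
{
    FILE *part[NPARTITIONS] = { NULL };

    ngroups = 0;
    for (size_t i = 0; i < n; i++)
    {
        Group *g = lookup_or_insert(keys[i]);
        if (g)
            g->count++;                    /* advance the transition value */
        else
        {
            int p = partition_of(keys[i], depth);
            if (!part[p])
                part[p] = tmpfile();
            fwrite(&keys[i], sizeof(long), 1, part[p]);
        }
    }

    for (int i = 0; i < ngroups; i++)      /* emit finished groups */
        printf("key=%ld count=%ld\n", table[i].key, table[i].count);

    for (int p = 0; p < NPARTITIONS; p++)  /* smaller instances of the same problem */
    {
        size_t cnt = 0, cap = 1024;
        long k, *buf;

        if (!part[p])
            continue;
        buf = malloc(cap * sizeof(long));
        rewind(part[p]);
        while (fread(&k, sizeof(long), 1, part[p]) == 1)
        {
            if (cnt == cap)
                buf = realloc(buf, (cap *= 2) * sizeof(long));
            buf[cnt++] = k;
        }
        fclose(part[p]);
        agg_pass(buf, cnt, depth + 1);
        free(buf);
    }
}

int main(void)
{
    static long keys[100000];
    for (size_t i = 0; i < 100000; i++)
        keys[i] = rand() % 5000;   /* 5000 groups > HASH_BUDGET forces a spill */
    agg_pass(keys, 100000, 0);
    return 0;
}

Note that a group already resident in the table keeps aggregating in memory even after the table fills; only tuples belonging to non-resident groups spill, so each group is emitted exactly once, in exactly one pass.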

We’ve also been grappling with how to handle the implementation challenges pointed out in this thread. These include:
1) tracking memory usage
2) choosing a smart eviction policy (which is touched on in [2])
3) serializing opaque user-defined transition values when eviction is required

For 1), we plan to use our WithStats memory context, which Serge mentioned.
For 2), we plan to start with a simple non-eviction policy, since we don’t yet have the stats to do anything smarter: insert until we fill the hash table, then spill until we finish processing the batch, with evictions only when a group’s transition value grows too large (essentially the policy in the sketch above).
For 3), we don’t have a good solution yet. We could handle serialization ourselves for built-in types and rely on users to provide serialize/deserialize functions for user-defined aggregates going forward; a sketch of what such a pair might look like follows.
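
For illustration only, here is a hedged sketch in plain C (hypothetical names, not PostgreSQL’s actual aggregate interface) of a serialize/deserialize pair for an AVG-style (sum, count) transition state. A pointer-free byte image like this can round-trip through a spill file; an opaque “internal” state cannot be flattened without user-supplied functions of this shape:

#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Transition state for an AVG-style aggregate over 64-bit integers. */
typedef struct AvgState { int64_t sum; int64_t count; } AvgState;

/* Flatten the state into a fixed-width byte image for the spill file. */
static size_t avg_serialize(const AvgState *state, char *buf)
{
    memcpy(buf, &state->sum, sizeof state->sum);
    memcpy(buf + sizeof state->sum, &state->count, sizeof state->count);
    return sizeof state->sum + sizeof state->count;
}

/* Rebuild the state from the byte image read back during re-aggregation. */
static void avg_deserialize(const char *buf, AvgState *state)
{
    memcpy(&state->sum, buf, sizeof state->sum);
    memcpy(&state->count, buf + sizeof state->sum, sizeof state->count);
}

int main(void)
{
    AvgState in = { 42, 7 }, out;
    char buf[sizeof(AvgState)];

    avg_serialize(&in, buf);      /* eviction: state -> bytes */
    avg_deserialize(buf, &out);   /* re-aggregation: bytes -> state */
    assert(out.sum == 42 && out.count == 7);
    return 0;
}

The point is just the shape: flatten to a self-contained byte image on eviction, rebuild on re-aggregation.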

But we’re open to suggestions :)

Regards,
David
Salesforce

[1] Revisiting Aggregation for Data Intensive Applications: A Performance Study
https://arxiv.org/pdf/1311.0059.pdf

[2] DB2 with BLU acceleration: so much more than just a column store
https://dl.acm.org/citation.cfm?id=2536233
