Discussion: Hash tables in dynamic shared memory
Hi hackers,

Here is an experimental hash table implementation that uses DSA memory so that hash tables can be shared by multiple backends and yet grow dynamically. Development name: "DHT".

It's impossible to write a general purpose hash table that will be suitable for every use case, considering all the combinations of design trade offs and subsets of functionality someone might want. What I'm trying to do here is produce something that is generally useful for many cases and easy to use, along the lines of dynahash, but using dynamic shared memory. Here is the bullet point summary:

* DSA memory-backed
* open hashing (separate chaining)
* incremental resizing
* built-in partition-based locking
* concurrent iteration

There is other work being done in this space: I'm aware of Andres's current thread[1] on Robin Hood-style closed hashing tables with macro-ised size, hash and comparator operations à la C++ container templates, and I'm following along with interest. He vigorously rejects chaining hash tables as the natural enemies of memory hardware. Obviously there are pros and cons: this node-based chaining design allows resizing, iteration with stronger guarantees, and users can point directly to (or into) the entries from other data structures, but yeah... it has to follow pointers into far away cache lines to do that. I guess Andres's simplehash should already work in DSA memory nicely anyway since it doesn't have any internal pointers IIUC (?). As for concurrency, Robert Haas also did some interesting experiments[2] with (nearly) lock-free hash tables and hazard pointers. I'm not sure what the optimum number of different implementations or levels of configurability is, and I'm certainly not wedded to this particular set of design tradeoffs.

Potential use cases for DHT include caches, in-memory database objects and working state for parallel execution.
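As an aside for readers less familiar with the terminology: "open hashing (separate chaining)" means each bucket heads a linked list of individually allocated entry nodes, which is what makes stable entry addresses and incremental resizing possible. A minimal illustrative sketch follows, in plain process-local C with raw pointers and made-up names; the actual patch allocates nodes in DSA memory and links them with dsa_pointer values:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One chained entry; in DHT these nodes live in DSA memory and are
 * linked by dsa_pointer values instead of raw pointers. */
typedef struct chain_node
{
	struct chain_node *next;
	uint32_t	hash;		/* cached hash: part of the per-item overhead */
	int			key;
	int			value;
} chain_node;

#define NBUCKETS 16
static chain_node *buckets[NBUCKETS];	/* zero-initialized: empty chains */

static uint32_t
hash_int(int key)
{
	return (uint32_t) key * 2654435761u;	/* Knuth multiplicative hash */
}

/* Insert at the head of the bucket's chain.  The node's address is
 * stable for its whole lifetime, so callers may keep pointers to it. */
static chain_node *
chain_insert(int key, int value)
{
	uint32_t	h = hash_int(key);
	chain_node *n = malloc(sizeof(chain_node));

	n->hash = h;
	n->key = key;
	n->value = value;
	n->next = buckets[h % NBUCKETS];
	buckets[h % NBUCKETS] = n;
	return n;
}

static chain_node *
chain_find(int key)
{
	uint32_t	h = hash_int(key);

	for (chain_node *n = buckets[h % NBUCKETS]; n != NULL; n = n->next)
		if (n->hash == h && n->key == key)
			return n;
	return NULL;
}
```

Because entries are individually allocated nodes, resizing only relinks them into a new bucket array; the entries themselves never move, which is why other data structures can safely hold direct pointers to them.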
(Not a use case: the shared hash table for my as-yet-unposted DSA-based shared parallel hash join table work, which follows the existing open-coded dense_alloc-based approach for reasons I'll explain later.)

There are a couple of things I'm not happy about in this version of DHT: the space overhead per item is too high (hash + wasted padding + next pointer), and the iterator system is overly complicated. I have another version in development which gets rid of the hash and padding and sometimes gets rid of the next pointer to achieve zero overhead, and I'm working on a simpler iteration algorithm that keeps the current guarantees but doesn't need to reorder bucket chains. More on that, with some notes on performance and a testing module, soon.

Please find attached dht-v1.patch, which depends on dsa-v2.patch[3]. I'm posting this version of DHT today because it is a dependency of another patch due on this list shortly. See also simple-cache.patch, which contains a simple smoke test extension so you can type SELECT simple_cache_put('42', 'Hello world'), SELECT simple_cache_get('42') etc.

Thanks to my colleagues Amit Khandekar, Amul Sul, Dilip Kumar and John Gorman for bug fixes and suggestions, and Robert Haas for feedback. Please let me know your thoughts, and thanks for reading.

[1] https://www.postgresql.org/message-id/flat/20161003041215.tfrifbeext3i7nkj%40alap3.anarazel.de
[2] https://www.postgresql.org/message-id/CA%2BTgmoYE4t-Pt%2Bv08kMO5u_XN-HNKBWtfMgcUXEGBrQiVgdV9Q%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/CAEepm%3D1z5WLuNoJ80PaCvz6EtG9dN0j-KuHcHtU6QEfcPP5-qA%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
Attachments
Hi,

> It's impossible to write a general purpose hash table that will be
> suitable for every use case, considering all the combinations of
> design trade offs and subsets of functionality someone might want.

Very much agreed.

> There is other work being done in this space: I'm aware of Andres's
> current thread[1] on Robin Hood-style closed hashing tables with
> macro-ised size, hash and comparator operations à la C++ container
> templates, and I'm following along with interest.

> He vigorously
> rejects chaining hash tables as the natural enemies of memory
> hardware.

I'd not go quite that far. There are good arguments for open addressing, and there are good arguments for open chaining. The latter is a lot more convincing if you want concurrent access with partition based locking, for example...

> I guess Andres's simplehash should already work in DSA memory nicely
> anyway since it doesn't have any internal pointers IIUC (?).

The data access doesn't have pointers, but the "metadata" does have a pointer to the data. But that's a very solvable problem. But incremental resizing and granular locking aren't going to happen for it, so it'd really only be suitable for presenting a read-only (or possibly read-mostly) view.

> Potential use cases for DHT include caches, in-memory database objects
> and working state for parallel execution.

Is there a more concrete example, i.e. a user we'd convert to this at the same time as introducing this hashtable?

> There are a couple of things I'm not happy about in this version of
> DHT: the space overhead per item is too high (hash + wasted padding +
> next pointer), and the iterator system is overly complicated. I have
> another version in development which gets rid of the hash and padding
> and sometimes gets rid of the next pointer to achieve zero overhead,
> and I'm working on a simpler iteration algorithm that keeps the
> current guarantees but doesn't need to reorder bucket chains.
> More on that, with some notes on performance and a testing module, soon.

FWIW, my experimentation showed that getting rid of the hash itself is quite often *NOT* a worthwhile tradeoff. Comparing keys and recomputing hashes is often quite expensive (e.g. for GROUP BY it's where the majority of time is spent after the simplehash patch).

Greetings,

Andres Freund
On Wed, Oct 5, 2016 at 11:22 AM, Andres Freund <andres@anarazel.de> wrote:
>> Potential use cases for DHT include caches, in-memory database objects
>> and working state for parallel execution.
>
> Is there a more concrete example, i.e. a user we'd convert to this at
> the same time as introducing this hashtable?

A colleague of mine will shortly post a concrete patch to teach an existing executor node how to be parallel aware, using DHT. I'll let him explain.

I haven't looked into whether it would make sense to convert any existing shmem dynahash hash table to use DHT. The reason for doing so would be to move it out to DSM segments and enable dynamically growing. I suspect that the bounded size of things like the hash tables involved in (for example) predicate locking is considered a feature, not a bug, so any such cluster-lifetime core-infrastructure hash table would not be a candidate. More likely candidates would be ephemeral data used by the executor, as in the above-mentioned patch, and long lived caches of dynamic size owned by core code or extensions. Like a shared query plan cache, if anyone can figure out the invalidation magic required.

--
Thomas Munro
http://www.enterprisedb.com
On Wed, Oct 5, 2016 at 12:11 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Oct 5, 2016 at 11:22 AM, Andres Freund <andres@anarazel.de> wrote: >>> Potential use cases for DHT include caches, in-memory database objects >>> and working state for parallel execution. >> >> Is there a more concrete example, i.e. a user we'd convert to this at >> the same time as introducing this hashtable? > > A colleague of mine will shortly post a concrete patch to teach an > existing executor node how to be parallel aware, using DHT. I'll let > him explain. > > I haven't looked into whether it would make sense to convert any > existing shmem dynahash hash table to use DHT. The reason for doing > so would be to move it out to DSM segments and enable dynamically > growing. I suspect that the bounded size of things like the hash > tables involved in (for example) predicate locking is considered a > feature, not a bug, so any such cluster-lifetime core-infrastructure > hash table would not be a candidate. More likely candidates would be > ephemeral data used by the executor, as in the above-mentioned patch, > and long lived caches of dynamic size owned by core code or > extensions. Like a shared query plan cache, if anyone can figure out > the invalidation magic required. Another thought: it could be used to make things like pg_stat_statements not have to be in shared_preload_libraries. -- Thomas Munro http://www.enterprisedb.com
On Oct 5, 2016 1:23 AM, "Thomas Munro" <thomas.munro@enterprisedb.com> wrote:
>
> On Wed, Oct 5, 2016 at 12:11 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > On Wed, Oct 5, 2016 at 11:22 AM, Andres Freund <andres@anarazel.de> wrote:
> >>> Potential use cases for DHT include caches, in-memory database objects
> >>> and working state for parallel execution.
> >>
> >> Is there a more concrete example, i.e. a user we'd convert to this at
> >> the same time as introducing this hashtable?
> >
> > A colleague of mine will shortly post a concrete patch to teach an
> > existing executor node how to be parallel aware, using DHT.  I'll let
> > him explain.
> >
> > I haven't looked into whether it would make sense to convert any
> > existing shmem dynahash hash table to use DHT.  The reason for doing
> > so would be to move it out to DSM segments and enable dynamically
> > growing.  I suspect that the bounded size of things like the hash
> > tables involved in (for example) predicate locking is considered a
> > feature, not a bug, so any such cluster-lifetime core-infrastructure
> > hash table would not be a candidate.  More likely candidates would be
> > ephemeral data used by the executor, as in the above-mentioned patch,
> > and long lived caches of dynamic size owned by core code or
> > extensions.  Like a shared query plan cache, if anyone can figure out
> > the invalidation magic required.
>
> Another thought: it could be used to make things like
> pg_stat_statements not have to be in shared_preload_libraries.

That would indeed be a great improvement. And possibly also allow the changing of the max number of statements it can track without a restart?

I was also wondering if it might be useful for a replacement for some of the pgstats stuff to get rid of the cost of spooling to file and then rebuilding the hash tables in the receiving end. I've been waiting for this patch to figure out if that's useful. I mean keep the stats collector doing what it does now over udp, but present the results in shared hash tables instead of files.

/Magnus
On Wed, Oct 5, 2016 at 3:10 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Here is an experimental hash table implementation that uses DSA memory
> so that hash tables can be shared by multiple backends and yet grow
> dynamically. Development name: "DHT".

+dht_iterate_begin(dht_hash_table *hash_table,
+				  dht_iterator *iterator,
+				  bool exclusive)
+{
+	Assert(hash_table->control->magic == DHT_MAGIC);
+
+	iterator->hash_table = hash_table;
+	iterator->exclusive = exclusive;
+	iterator->partition = 0;
+	iterator->bucket = 0;
+	iterator->item = NULL;
+	iterator->locked = false;

While reviewing, I found that in the dht_iterate_begin function we are not initializing iterator->last_item_pointer to InvalidDsaPointer, and during a scan this variable will be used in the advance_iterator function. (I think this could cause a problem, even the loss of some tuples?)

+advance_iterator(dht_iterator *iterator, dsa_pointer bucket_head,
+				 dsa_pointer *next)
+{
+	dht_hash_table_item *item;
+
+	while (DsaPointerIsValid(bucket_head))
+	{
+		item = dsa_get_address(iterator->hash_table->area,
+							   bucket_head);
+		if ((!DsaPointerIsValid(iterator->last_item_pointer) ||
+			 bucket_head < iterator->last_item_pointer) &&

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Oct 6, 2016 at 12:02 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > While reviewing , I found that in dht_iterate_begin function, we are > not initializing > iterator->last_item_pointer to InvalidDsaPointer; Fixed, thanks. -- Thomas Munro http://www.enterprisedb.com
Attachments
On Wed, Oct 5, 2016 at 7:02 PM, Magnus Hagander <magnus@hagander.net> wrote: > On Oct 5, 2016 1:23 AM, "Thomas Munro" <thomas.munro@enterprisedb.com> > wrote: >> >> On Wed, Oct 5, 2016 at 12:11 PM, Thomas Munro >> <thomas.munro@enterprisedb.com> wrote: >> > On Wed, Oct 5, 2016 at 11:22 AM, Andres Freund <andres@anarazel.de> >> > wrote: >> >>> Potential use cases for DHT include caches, in-memory database objects >> >>> and working state for parallel execution. >> >> >> >> Is there a more concrete example, i.e. a user we'd convert to this at >> >> the same time as introducing this hashtable? >> > >> > A colleague of mine will shortly post a concrete patch to teach an >> > existing executor node how to be parallel aware, using DHT. I'll let >> > him explain. >> > >> > I haven't looked into whether it would make sense to convert any >> > existing shmem dynahash hash table to use DHT. The reason for doing >> > so would be to move it out to DSM segments and enable dynamically >> > growing. I suspect that the bounded size of things like the hash >> > tables involved in (for example) predicate locking is considered a >> > feature, not a bug, so any such cluster-lifetime core-infrastructure >> > hash table would not be a candidate. More likely candidates would be >> > ephemeral data used by the executor, as in the above-mentioned patch, >> > and long lived caches of dynamic size owned by core code or >> > extensions. Like a shared query plan cache, if anyone can figure out >> > the invalidation magic required. >> >> Another thought: it could be used to make things like >> pg_stat_statements not have to be in shared_preload_libraries. >> > > That would indeed be a great improvement. And possibly also allow the > changing of the max number of statements it can track without a restart? Yeah. You don't explicitly size a DHT hash table, it just grows as required to keep the load factor low enough, possibly leading the DSA area it lives in to create DSM segments. 
Currently it never gets smaller though, so if you throw out a bunch of entries you will be freeing up the memory occupied by the entries themselves (meaning: giving it back to the DSA area, which might eventually give it back to the OS if the planets are correctly aligned, rendering a DSM segment entirely unused), but the hash table's bucket array won't ever shrink.

> I was also wondering if it might be useful for a replacement for some of the
> pgstats stuff to get rid of the cost of spooling to file and then rebuilding
> the hash tables in the receiving end. I've been waiting for this patch to
> figure out if that's useful. I mean keep the stats collector doing what it
> does now over udp, but present the results in shared hash tables instead of
> files.

Interesting thought. I haven't studied how that stuff works... I've put it on my reading list.

--
Thomas Munro
http://www.enterprisedb.com
On 2016-10-05 08:02:42 +0200, Magnus Hagander wrote:
> On Oct 5, 2016 1:23 AM, "Thomas Munro" <thomas.munro@enterprisedb.com>
> wrote:
> > Another thought: it could be used to make things like
> > pg_stat_statements not have to be in shared_preload_libraries.

> That would indeed be a great improvement. And possibly also allow the
> changing of the max number of statements it can track without a restart?

There's the issue of having to serialize / load the file around a restart...

Greetings,

Andres Freund
I reviewed the dht-v2.patch and found that it is in excellent shape.
Benchmarking shows that this performs somewhat faster than
dynahash which is surprising because it is doing DSA address
translations on the fly.
One area where this could run faster is to reduce the amount of
time when the hash is in the RESIZE_IN_PROGRESS state.
When a hash partition reaches 75% full a resize begins by allocating
an empty new hash bucket array with double the number of buckets.
This begins the RESIZE_IN_PROGRESS state where there is both an
old and a new hash bucket array.
During the RESIZE state all operations such as find, insert,
delete and iterate run slower due to having to look items up in
two hash bucket arrays.
Whenever a new item is inserted into the hash and the hash is in
the RESIZE state the current code takes the time to move one item
from the old hash partition to the new hash partition. During
insertion an exclusive lock is already held on the partition so
this is an efficient time to do the resize cleanup work.
However since we only clean up one item per insertion we do not
complete the cleanup and free the old hash bucket array until we
are due to start yet another resize operation.
So we are effectively always in the RESIZE state which slows down
operations and wastes some space.
Here are the timings for insert and find in nanoseconds on a
Macbook Pro. The insert_resize figure includes the resize work to
move one item from the old to the new hash bucket array.
insert_resize: 945.98
insert_clean: 450.39
find_resize: 348.90
find_clean: 293.16
The attached dht-v2-resize-cleanup.patch patch applies on top of
the dht-v2.patch and speeds up the resize cleanup process so that
hashes spend most of their time in the clean state.
It does this by cleaning up more than one old item during
inserts. This patch cleans up three old items.
There is also the case where a hash receives all of its inserts
at the beginning and the rest of the work load is all finds. This
patch also cleans up two items for each find.
The downside of doing extra cleanup early is some additional
latency. Here are the experimental figures and the approximate
formulas for different numbers of cleanups for inserts. Note that
we cannot go lower than one cleanup per insert.
Resize work in inserts: 729.79 insert + 216.19 * cleanups
1 945.98
2 1178.13
3 1388.73
4 1617.04
5 1796.91
Here are the timings for different numbers of cleanups for finds.
Note that since we do not hold an exclusive partition lock on a find
there is the additional overhead of taking that lock.
Resize work in finds: 348.90 dirty_find + 233.45 lock + 275.78 * cleanups
0 348.90
1 872.88
2 1138.98
3 1448.94
4 1665.31
5 1975.91
The new effective latencies during the resize vs. the clean states.
#define DHT_INSERT_CLEANS 3
#define DHT_SEARCH_CLEANS 2
insert_resize: 1388.73 -- 3 cleaned
insert_clean: 450.39
find_resize: 1138.98 -- 2 cleaned
find_clean: 293.16
The overall performance will be faster due to not having to examine
more than one hash bucket array most of the time.
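As a sanity check on the numbers above, both tables fit simple linear models. A hypothetical helper (the constants are the fitted values quoted in this review; the function names are not part of any patch) reproduces the one-cleanup insert figure exactly and the other measured points to within about 2%:

```c
#include <assert.h>

/* Fitted latency models from the measurements above, in nanoseconds.
 * Illustrative helpers only. */
static double
predicted_insert_ns(int cleanups)	/* cleanups >= 1 during a resize */
{
	return 729.79 + 216.19 * cleanups;
}

static double
predicted_find_ns(int cleanups)
{
	if (cleanups == 0)
		return 348.90;				/* dirty find, no exclusive lock taken */
	return 348.90 + 233.45 + 275.78 * cleanups;	/* lock taken to clean up */
}
```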
--
John Gorman
Attachments
On Sun, Nov 20, 2016 at 9:54 AM, John Gorman <johngorman2@gmail.com> wrote:
> I reviewed the dht-v2.patch and found that it is in excellent shape.
> The overall performance will be faster due to not having to examine
> more than one hash bucket array most of the time.
The reviewer didn't find any problems with the approach of the patch, and
provided some suggested improvements for the author to respond to.
Moved to next CF with "waiting on author" state.
Regards,
Hari Babu
Fujitsu Australia
On Sun, Nov 20, 2016 at 11:54 AM, John Gorman <johngorman2@gmail.com> wrote: > I reviewed the dht-v2.patch and found that it is in excellent shape. Thanks for reviewing! And sorry for the late reply. > Benchmarking shows that this performs somewhat faster than > dynahash which is surprising because it is doing DSA address > translations on the fly. That is indeed surprising and I think warrants more investigation. > One area where this could run faster is to reduce the amount of > time when the hash is in the RESIZE_IN_PROGRESS state. > > When a hash partition reaches 75% full a resize begins by allocating > an empty new hash bucket array with double the number of buckets. > This begins the RESIZE_IN_PROGRESS state where there is both an > old and a new hash bucket array. > > During the RESIZE state all operations such as find, insert, > delete and iterate run slower due to having to look items up in > two hash bucket arrays. > > Whenever a new item is inserted into the hash and the hash is in > the RESIZE state the current code takes the time to move one item > from the old hash partition to the new hash partition. During > insertion an exclusive lock is already held on the partition so > this is an efficient time to do the resize cleanup work. > > However since we only clean up one item per insertion we do not > complete the cleanup and free the old hash bucket array until we > are due to start yet another resize operation. > > So we are effectively always in the RESIZE state which slows down > operations and wastes some space. Right, that is the case in a workload that inserts a bunch of data but then becomes read-only forever. A workload that constantly does a mix of writing and reading should settle down at a reasonable size and stop resizing. > Here are the timings for insert and find in nanoseconds on a > Macbook Pro. The insert_resize figure includes the resize work to > move one item from the old to the new hash bucket array. 
> > insert_resize: 945.98 > insert_clean: 450.39 > find_resize: 348.90 > find_clean: 293.16 > > The attached dht-v2-resize-cleanup.patch patch applies on top of > the dht-v2.patch and speeds up the resize cleanup process so that > hashes spend most of their time in the clean state. > > It does this by cleaning up more than one old item during > inserts. This patch cleans up three old items. > > There is also the case where a hash receives all of its inserts > at the beginning and the rest of the work load is all finds. This > patch also cleans up two items for each find. > > The downside of doing extra cleanup early is some additional > latency. Here are the experimental figures and the approximate > formulas for different numbers of cleanups for inserts. Note that > we cannot go lower than one cleanup per insert. > > Resize work in inserts: 729.79 insert + 216.19 * cleanups > 1 945.98 > 2 1178.13 > 3 1388.73 > 4 1617.04 > 5 1796.91 > > Here are the timings for different numbers of cleanups for finds. > Note that since we do not hold an exclusive partition lock on a find > there is the additional overhead of taking that lock. > > Resize work in finds: 348.90 dirty_find + 233.45 lock + 275.78 * cleanups > 0 348.90 > 1 872.88 > 2 1138.98 > 3 1448.94 > 4 1665.31 > 5 1975.91 > > The new effective latencies during the resize vs. the clean states. > > #define DHT_INSERT_CLEANS 3 > #define DHT_SEARCH_CLEANS 2 > > insert_resize: 1388.73 -- 3 cleaned > insert_clean: 450.39 > find_resize: 1138.98 -- 2 cleaned > find_clean: 293.16 > > The overall performance will be faster due to not having to examine > more than one hash bucket array most of the time. Thanks for doing all these experiments. Yeah, I think it makes sense to merge this change to improve that case. Since Dilip Kumar's Parallel Bitmap Heap Scan project is no longer using this, I think I should park it here unless/until another potential use case pops up. 
Some interesting candidates have been mentioned already, and I'm fairly sure there are other uses too, but I don't expect anyone to be interested in committing this patch until something concrete shows up, so I'll go work on other things until then. -- Thomas Munro http://www.enterprisedb.com
On Mon, Dec 19, 2016 at 12:33 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Since Dilip Kumar's Parallel Bitmap Heap Scan project is no longer > using this, I think I should park it here unless/until another > potential use case pops up. Some interesting candidates have been > mentioned already, and I'm fairly sure there are other uses too, but I > don't expect anyone to be interested in committing this patch until > something concrete shows up, so I'll go work on other things until > then. Here is a rebased version. Changes since v2: 1. Out-of-memory conditions now raise errors (following dsa_allocate's change in behaviour to match palloc). 2. Moved into 'lib' where other reusable data structures live. There are still a couple of things I'd like to adjust in this code (including feedback from John and improvements to the iterator code which is overcomplicated) but I'm posting it today as infrastructure for another patch. -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers