Discussion: [HACKERS] Hash Functions
https://www.postgresql.org/message-id/CAMp0ubeo3fzzEfiE1vmc1AJkkRPxLnZQoOASeu6cCcco-c+9zw@mail.gmail.com

In that thread, I pointed out some important considerations for the hash functions themselves. This is a follow-up, after I looked more carefully.

1. The hash functions as they exist today aren't portable -- they can return different results on different machines. That means using these functions for hash partitioning would yield different contents for the same partition on different architectures (and that's bad, considering they are logical partitions and not some internal detail).

The core hashing algorithm is based on 32-bit chunks, so it's only portable if the input is an int32 (or array of int32s). That's not true for varchar (etc.) or numeric (which is based on an array of int16s). There is a hack for int8, but see issue #2 below.

We could try to mark hash functions as portable vs. unportable, but it seems quite valuable to be able to logically partition on varchar, so I think we should have some portable answer there. Another annoyance is that we would have to assume container types are unportable, or make them inherit the portability of the underlying type's hash function.

We could revert 26043592 (copied Tom because that was his patch) to make hash_any() go back to being portable -- do we know what that speedup actually was? Maybe the benefit is smaller on newer processors? Another option is to try some combination of byteswapping and word-at-a-time hashing, which might be better than byte-at-a-time if the byteswapping is done with a native instruction.

2. hashint8() is a little dubious. If the input is positive, it XORs the high bits; if negative, it XORs the complement of the high bits. That works for compatibility with hashint2/4, but it does not produce good hash properties[1].

I prefer that we either (a) upcast int2/4 to int8 before hashing and then hash the whole 64 bits; or (b) if the value is within range, downcast to int4, otherwise hash the whole 64 bits.

3. We should downcast numeric to int4/8 before hashing if it's in range, so that it can be a part of the same opfamily as int2/4/8.

4. text/varchar should use hashbpchar() so that they can be part of the same opfamily. Trailing blanks seem unlikely to be significant for most real-world data anyway.

5. For salts[2], I don't think it's too hard to support them in an optional way. We just allow the function to be a two-argument function with a default. Places that care about specifying the salt may do so if the function has pronargs==2; otherwise it just gets the default value. If we have salts, I don't think having 64-bit hashes is very important: if we run out of bits, we can just salt the hash function differently and get more hash bits. This is not urgent, and I believe we should just implement salts when and if some algorithm needs them.

Regards,
     Jeff Davis

[1] You can see a kind of mirroring in the hash outputs indicating bad mixing:

postgres=# select hashint8((2^32-100)::int8);
 hashint8
----------
 41320869
(1 row)

postgres=# select hashint8((-2^32-100)::int8);
 hashint8
----------
 41320869
(1 row)

postgres=# select hashint8((2^32-101)::int8);
  hashint8
-------------
 -1214482326
(1 row)

postgres=# select hashint8((-2^32-101)::int8);
  hashint8
-------------
 -1214482326
(1 row)

[2] Not sure I'm using the term "salt" properly. I really just mean a way to affect the initial state, which I think is good enough for our purposes.
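The weakness described in point 2 can be seen without running the server at all. Here is a sketch in Python modeling only the 64-to-32-bit folding step attributed to hashint8() above (the hash_uint32() mixing that follows it in the real C code is omitted); it shows distinct 64-bit inputs collapsing to the same 32-bit value before any mixing even happens:

```python
def fold_int8(val):
    """Model of hashint8()'s folding step (a sketch, not the actual C
    code): XOR the high half into the low half, complementing the high
    half for negative inputs so int2/int4-compatible values agree."""
    u = val & 0xFFFFFFFFFFFFFFFF          # two's-complement bit pattern
    lohalf = u & 0xFFFFFFFF
    hihalf = u >> 32
    if val >= 0:
        lohalf ^= hihalf
    else:
        lohalf ^= (~hihalf) & 0xFFFFFFFF
    return lohalf

# Distinct inputs that collapse to identical 32-bit values:
assert fold_int8(-1) == fold_int8(2**32 - 1)      # both fold to 0xFFFFFFFF
assert fold_int8((1 << 32) | 1) == fold_int8(0)   # both fold to 0x00000000
```

Any two inputs that collide in this fold collide in the final hash as well, no matter how good the subsequent 32-bit mixing is.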
On Fri, May 12, 2017 at 12:08 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> 1. The hash functions as they exist today aren't portable -- they can return different results on different machines. That means using these functions for hash partitioning would yield different contents for the same partition on different architectures (and that's bad, considering they are logical partitions and not some internal detail).

Hmm, yeah, that is bad.

> We could revert 26043592 (copied Tom because that was his patch) to make hash_any() go back to being portable -- do we know what that speedup actually was? Maybe the benefit is smaller on newer processors? Another option is to try to do some combination of byteswapping and word-at-a-time, which might be better than byte-at-a-time if the byteswapping is done with a native instruction.

With regard to portability, I find that in 2009, according to Tom, we had "already agreed" that it was dispensable:

http://postgr.es/m/23832.1234214526@sss.pgh.pa.us

I was not able to find where that was agreed. On performance, I found this:

https://www.postgresql.org/message-id/20081104202655.GP18362@it.is.rice.edu

It says at the end: "The average time to reindex the table using our current hash_any() without the separate mix()/final() was 1696ms and 1482ms with the separate mix()/final() stages giving almost 13% better performance for this stupid metric."

> 5. For salts[2], I don't think it's too hard to support them in an optional way. We just allow the function to be a two-argument function with a default. Places that care about specifying the salt may do so if the function has pronargs==2, otherwise it just gets the default value. If we have salts, I don't think having 64-bit hashes is very important. If we run out of bits, we can just salt the hash function differently and get more hash bits. This is not urgent and I believe we should just implement salts when and if some algorithm needs them.

The potential problem with that is that the extra argument might slow down the hash functions enough to matter. Argument unpacking is not free, and the hashing logic itself will get more complex.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On May 12, 2017 10:05:56 AM PDT, Robert Haas <robertmhaas@gmail.com> wrote:
>On Fri, May 12, 2017 at 12:08 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>> 1. The hash functions as they exist today aren't portable -- they can return different results on different machines. That means using these functions for hash partitioning would yield different contents for the same partition on different architectures (and that's bad, considering they are logical partitions and not some internal detail).
>
>Hmm, yeah, that is bad.

Given that a lot of data types have an architecture-dependent representation, it seems somewhat unrealistic and expensive to have a hard rule to keep them architecture agnostic. And if that's not guaranteed, then I'm doubtful it makes sense as a soft rule either.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Fri, May 12, 2017 at 1:12 PM, Andres Freund <andres@anarazel.de> wrote:
> Given that a lot of data types have an architecture-dependent representation, it seems somewhat unrealistic and expensive to have a hard rule to keep them architecture agnostic. And if that's not guaranteed, then I'm doubtful it makes sense as a soft rule either.

That's a good point, but the flip side is that, if we don't have such a rule, a pg_dump of a hash-partitioned table on one architecture might fail to restore on another architecture. Today, I believe that, while the actual database cluster is architecture-dependent, a pg_dump is architecture-independent. Is it OK to lose that property?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, May 12, 2017 at 1:12 PM, Andres Freund <andres@anarazel.de> wrote:
>> Given that a lot of data types have an architecture-dependent representation, it seems somewhat unrealistic and expensive to have a hard rule to keep them architecture agnostic. And if that's not guaranteed, then I'm doubtful it makes sense as a soft rule either.

> That's a good point, but the flip side is that, if we don't have such a rule, a pg_dump of a hash-partitioned table on one architecture might fail to restore on another architecture. Today, I believe that, while the actual database cluster is architecture-dependent, a pg_dump is architecture-independent. Is it OK to lose that property?

I'd vote that it's not, which means that this whole approach to hash partitioning is unworkable. I agree with Andres that demanding hash functions produce architecture-independent values will not fly.

Maintaining such a property for float8 (and the types that depend on it) might be possible if you believe that nobody ever uses anything but IEEE floats, but we've never allowed that as a hard assumption before.

Even architecture dependence isn't the whole scope of the problem. Consider for example dumping a LATIN1-encoded database and trying to reload it into a UTF8-encoded database. People will certainly expect that to be possible, and do you want to guarantee that the hash of a text value is encoding-independent?

			regards, tom lane
On 05/12/2017 10:17 AM, Robert Haas wrote:
> On Fri, May 12, 2017 at 1:12 PM, Andres Freund wrote:
>> Given that a lot of data types have an architecture-dependent representation, it seems somewhat unrealistic and expensive to have a hard rule to keep them architecture agnostic. And if that's not guaranteed, then I'm doubtful it makes sense as a soft rule either.
>
> That's a good point, but the flip side is that, if we don't have such a rule, a pg_dump of a hash-partitioned table on one architecture might fail to restore on another architecture. Today, I believe that, while the actual database cluster is architecture-dependent, a pg_dump is architecture-independent. Is it OK to lose that property?

Not from where I sit.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
On Fri, May 12, 2017 at 1:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'd vote that it's not, which means that this whole approach to hash partitioning is unworkable. I agree with Andres that demanding hash functions produce architecture-independent values will not fly.

If we can't produce architecture-independent hash values, then what's the other option?

One alternative would be to change the way that we dump and restore the data. Instead of dumping the data with the individual partitions, dump it all out for the parent and let tuple routing sort it out at restore time. Of course, this isn't very satisfying because now dump-and-restore hasn't really preserved the state of the database; indeed, you could make it into a hard failure by creating a foreign key referencing a hash partition. After dump-and-restore, the row ends up in some other partition and the foreign key can't be recreated because the relationship no longer holds. This isn't limited to foreign keys, either; similar problems could be created with CHECK constraints or other per-table properties that can vary between one child and another. I basically think it's pretty futile to suppose that we can get away with having a dump and restore move rows around between partitions without having that blow up in some cases.

> Maintaining such a property for float8 (and the types that depend on it) might be possible if you believe that nobody ever uses anything but IEEE floats, but we've never allowed that as a hard assumption before.

I don't know how standard that is. Is there any hardware that anyone's likely to be using that doesn't? TBH, I don't really care if obscure, nearly-dead platforms like VAX or whatever don't quite work with hash-partitioned tables. In practice, PostgreSQL only sorta works on that kind of platform anyway; there are far bigger problems than this. On the other hand, if there are servers being shipped in 2017 that don't use IEEE floats, that's another problem.

What about integers? I think we're already assuming two's-complement arithmetic, which I think means that the only problem with making the hash values portable for integers is big-endian vs. little-endian. That sounds solvable-ish.

> Even architecture dependence isn't the whole scope of the problem. Consider for example dumping a LATIN1-encoded database and trying to reload it into a UTF8-encoded database. People will certainly expect that to be possible, and do you want to guarantee that the hash of a text value is encoding-independent?

No, I think that's expecting too much. I'd be just fine telling people that if you hash-partition on a text column, it may not load into a database with another encoding. If you care about that, don't use hash partitioning, or don't change the encoding, or dump out the partitions one by one and reload all the rows into the parent table for re-routing, solving whatever problems come up along the way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, May 12, 2017 at 1:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'd vote that it's not, which means that this whole approach to hash partitioning is unworkable. I agree with Andres that demanding hash functions produce architecture-independent values will not fly.

> If we can't produce architecture-independent hash values, then what's the other option?

Forget hash partitioning. There's no law saying that that's a good idea and we have to have it. With a different set of constraints, maybe we could do it, but I think the existing design decisions have basically locked it out --- and I doubt that hash partitioning is so valuable that we should jettison other desirable properties to get it.

> One alternative would be to change the way that we dump and restore the data. Instead of dumping the data with the individual partitions, dump it all out for the parent and let tuple routing sort it out at restore time. Of course, this isn't very satisfying because now dump-and-restore hasn't really preserved the state of the database; indeed, you could make it into a hard failure by creating a foreign key referencing a hash partition.

Yeah, that isn't really appetizing at all. If we were doing physical partitioning below the user-visible level, we could make it fly. But the existing design makes the partition boundaries user-visible, which means we have to insist that the partitioning rule is immutable (in the strongest sense of being invariant across all installations you could possibly wish to transport data between).

Now, we already have some issues of that sort with range partitioning; if say you try to transport data to another system with a different sort ordering, you have probably got problems. But that issue already existed with the old manual partitioning approach, ie if you have a CHECK constraint like "(col >= 'foo' && col < 'gob')" then a transfer to such a system could fail already. So I'm okay with that. But it seems like hash partitioning is going to introduce new, exciting, and hard-to-document-or-avoid portability gotchas. Do we really need to go there?

			regards, tom lane
On Fri, May 12, 2017 at 02:23:14PM -0400, Robert Haas wrote:
>
> What about integers? I think we're already assuming two's-complement arithmetic, which I think means that the only problem with making the hash values portable for integers is big-endian vs. little-endian. That sounds solvable-ish.
>

xxhash produces identical hashes on big-endian and little-endian machines:

https://github.com/Cyan4973/xxHash

Regards,
Ken
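To illustrate how a word-at-a-time hash can stay endian-independent (a toy sketch, not xxHash itself): decode each chunk in one fixed byte order rather than the host's native order, so the only place a big-endian machine differs is a byteswap during the decode.

```python
import struct

def portable_chunk_hash(data, seed=0):
    """Toy FNV-1a-style hash consuming 32-bit chunks.  Decoding each
    chunk explicitly as little-endian ("<I") makes the result identical
    on big- and little-endian hosts; a sketch only, not xxHash."""
    h = (0x811C9DC5 ^ seed) & 0xFFFFFFFF       # arbitrary starting state
    padded = data + b"\x00" * (-len(data) % 4)  # zero-pad to a whole chunk
    for (chunk,) in struct.iter_unpack("<I", padded):
        h = ((h ^ chunk) * 0x01000193) & 0xFFFFFFFF
    return h
```

On little-endian hardware the "<I" decode is free; on big-endian hardware it costs one native byteswap per word, which is the combination of byteswapping and word-at-a-time hashing Jeff floated earlier in the thread.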
On Fri, May 12, 2017 at 2:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, that isn't really appetizing at all. If we were doing physical partitioning below the user-visible level, we could make it fly. But the existing design makes the partition boundaries user-visible which means we have to insist that the partitioning rule is immutable (in the strongest sense of being invariant across all installations you could possibly wish to transport data between).

I think you're right.

> Now, we already have some issues of that sort with range partitioning; if say you try to transport data to another system with a different sort ordering, you have probably got problems. But that issue already existed with the old manual partitioning approach, ie if you have a CHECK constraint like "(col >= 'foo' && col < 'gob')" then a transfer to such a system could fail already. So I'm okay with that. But it seems like hash partitioning is going to introduce new, exciting, and hard-to-document-or-avoid portability gotchas. Do we really need to go there?

That is a good question. I think it basically amounts to this question: is hash partitioning useful, and if so, for what?

Range and list partitioning inherently provide management benefits. For example, if you range-partition your data by year, then when you want to get rid of the oldest year worth of data, you can drop the entire partition at once, which is likely to be much faster than a DELETE statement. But hash partitioning provides no such benefits because, by design, the distribution of the data among the partitions is essentially random. Dropping one partition will not usually be a useful thing to do because the rows in that partition are logically unconnected.
So, if you have a natural way of partitioning a table by range or list, that's probably going to be superior to hash partitioning. But sometimes there isn't an obviously useful way of breaking up the data, yet you still want to partition it in order to reduce the size of the individual tables. That, in turn, allows maintenance operations to be performed on each partition separately.

For example, suppose your table is big enough that running CLUSTER or VACUUM FULL on it takes eight hours, but you can't really afford an eight-hour maintenance window when the table gets bloated. However, you can afford (on Sunday nights when activity is lowest) a window of, say, one hour. Well, if you hash-partition the table, you can CLUSTER or VACUUM FULL one partition a week for N weeks until you get through them all. Similarly, if you need to create an index, you can build it on one partition at a time. You can even add the index to one partition to see how well it works -- for example, does it turn too many HOT updates into non-HOT updates, causing bloat? -- and try it out before you go do it across the board. And an unpartitioned table can only accommodate one VACUUM process at a time, but a partitioned table can have one per partition.

Partitioning a table also means that you don't need a single disk volume large enough to hold the whole thing; instead, you can put each partition in a separate tablespace or (in some exciting future world where PostgreSQL looks more like a distributed system) perhaps even on another server.

For a table that is a few tens of gigabytes, none of this amounts to a hill of beans, but for a table that is a few tens of terabytes, the time it takes to perform administrative operations can become a really big problem. A good example is, say, a table full of orders. Imagine a high-velocity business like Amazon or UPS that has a constant stream of new orders, or a mobile app that has a constant stream of new user profiles.
If that data grows fast enough that the table needs to be partitioned, how should it be done? It's often desirable to create partitions of about equal size and about equal hotness, but range-partitioning on the creation date or order ID number means continually creating new partitions, and having all of the hot data in the same partition.

In my experience, it is *definitely* the case that users with very large tables are experiencing significant pain right now. The freeze map changes were a good step towards ameliorating some of that pain, but I think hash partitioning is another step in that direction, and I think there will be other applications as well.

Even for users who don't have data that large, there are also interesting query-performance implications of hash partitioning. Joins between very large tables tend to get implemented as merge joins, which often means sorting all the data, which is O(n lg n) and not easy to parallelize. But if those very large tables happened to be compatibly partitioned on the join key, you could do a partitionwise join per the patch Ashutosh proposed, which would be cheaper and easier to do in parallel.

Maybe a shorter argument for hash partitioning is that not one but two different people proposed patches for it within months of the initial partitioning patch going in. When multiple people are thinking about implementing the same feature almost immediately after the prerequisite patches land, that's a good clue that it's a desirable feature. So I think we should try to solve the problems, rather than giving up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 5/12/17 14:23, Robert Haas wrote:
> One alternative would be to change the way that we dump and restore the data. Instead of dumping the data with the individual partitions, dump it all out for the parent and let tuple routing sort it out at restore time.

I think this could be a pg_dump option. One way it dumps out the partitions, and another way it recomputes the partitions. I think that could be well within pg_dump's mandate.

(cough ... logical replication ... cough)

> Of course, this isn't very satisfying because now dump-and-restore hasn't really preserved the state of the database;

That depends on whether you consider the partitions to be a user-visible feature or an internal detail. The current approach is sort of in the middle, so it makes sense to allow the user to interpret it either way depending on need.

-- 
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut wrote:
> On 5/12/17 14:23, Robert Haas wrote:
>> One alternative would be to change the way that we dump and restore the data. Instead of dumping the data with the individual partitions, dump it all out for the parent and let tuple routing sort it out at restore time.
>
> I think this could be a pg_dump option. One way it dumps out the partitions, and another way it recomputes the partitions. I think that could be well within pg_dump's mandate.

I was thinking the same, but enable that option automatically for hash partitioning. Each partition is still dumped separately, but instead of referencing the specific partition in the TABLE DATA object, reference the parent table. This way, the restore can still be parallelized, but tuple routing is executed for each tuple.

> (cough ... logical replication ... cough)

I think for logical replication the tuple should appear as being in the parent table, not the partition. No?

-- 
Álvaro Herrera
https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 5/12/17 18:13, Alvaro Herrera wrote:
> I think for logical replication the tuple should appear as being in the parent table, not the partition. No?

Logical replication replicates base table to base table. How those tables are tied together into a partitioned table or an inheritance tree is up to the system catalogs on each side.

-- 
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, May 12, 2017 at 06:38:55PM -0400, Peter Eisentraut wrote:
> On 5/12/17 18:13, Alvaro Herrera wrote:
>> I think for logical replication the tuple should appear as being in the parent table, not the partition. No?
>
> Logical replication replicates base table to base table. How those tables are tied together into a partitioned table or an inheritance tree is up to the system catalogs on each side.

This seems like a totally reasonable approach to pg_dump, especially in light of the fact that logical replication already (and quite reasonably) does it this way. Hard work has been done to make tuple-routing cheap, and this is one of the payoffs.

Best,
David.

-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Fri, May 12, 2017 at 7:36 PM, David Fetter <david@fetter.org> wrote:
> On Fri, May 12, 2017 at 06:38:55PM -0400, Peter Eisentraut wrote:
>> On 5/12/17 18:13, Alvaro Herrera wrote:
>>> I think for logical replication the tuple should appear as being in the parent table, not the partition. No?
>>
>> Logical replication replicates base table to base table. How those tables are tied together into a partitioned table or an inheritance tree is up to the system catalogs on each side.
>
> This seems like a totally reasonable approach to pg_dump, especially in light of the fact that logical replication already (and quite reasonably) does it this way. Hard work has been done to make tuple-routing cheap, and this is one of the payoffs.

Cheap isn't free, though. It's got a double-digit percentage overhead rather than a large-multiple-of-the-runtime overhead as triggers do, but people still won't want to pay it unnecessarily, I think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-05-12 21:56:30 -0400, Robert Haas wrote:
> Cheap isn't free, though. It's got a double-digit percentage overhead rather than a large-multiple-of-the-runtime overhead as triggers do, but people still won't want to pay it unnecessarily, I think.

That should be partially addressable with reasonable amounts of engineering, though. Computing the target partition in a hash-partitioned table can be implemented very efficiently, and adding infrastructure for multiple bulk-insert targets in COPY should be quite doable as well. It's also work that's generally useful, not just for backups.

The bigger issue to me here wrt pg_dump is that partitions can be restored in parallel, but that'd probably not work as well if dumped separately. Unless we'd do the re-routing on load, rather than when dumping -- which'd also increase cache locality, by most of the time (same architecture/encoding/etc) having one backend insert into the same partition.

Greetings,

Andres Freund
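The bulk-routing idea above can be sketched as batching incoming rows by target partition so each partition receives a single bulk insert instead of per-row routing (a hypothetical illustration in Python, not PostgreSQL code; `key`, `hash_fn`, and `nparts` are stand-ins for a real partitioning scheme):

```python
from collections import defaultdict

def route_rows(rows, key, hash_fn, nparts):
    """Group rows by target hash partition so each partition can be
    loaded with one bulk insert (a sketch of the routing idea only)."""
    batches = defaultdict(list)
    for row in rows:
        batches[hash_fn(key(row)) % nparts].append(row)
    return batches

# Example: route ten integer-keyed rows into four partitions.
batches = route_rows(range(10), key=lambda r: r, hash_fn=hash, nparts=4)
assert sum(len(b) for b in batches.values()) == 10
```

Batching this way is also where the cache-locality argument comes from: each target partition is touched once per batch rather than once per row.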
On Sat, May 13, 2017 at 1:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 12, 2017 at 2:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Maybe a shorter argument for hash partitioning is that not one but two different people proposed patches for it within months of the initial partitioning patch going in. When multiple people are thinking about implementing the same feature almost immediately after the prerequisite patches land, that's a good clue that it's a desirable feature. So I think we should try to solve the problems, rather than giving up.

Can we think of defining separate portable hash functions which can be used for the purpose of hash partitioning?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, May 13, 2017 at 12:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Can we think of defining separate portable hash functions which can be used for the purpose of hash partitioning?

I think that would be a good idea. I think it shouldn't even be that hard. By data type:

- Integers. We'd need to make sure that we get the same results for the same value on big-endian and little-endian hardware, and that performance is good on both systems. That seems doable.

- Floats. There may be different representations in use on different hardware, which could be a problem. Tom didn't answer my question about whether any even-vaguely-modern hardware is still using non-IEEE floats, which I suspect means that the answer is "no". If every bit of hardware we are likely to find uses basically the same representation of the same float value, then this shouldn't be hard. (Also, even if this turns out to be hard for floats, using a float as a partitioning key would be a surprising choice because the default output representation isn't even unambiguous; you need extra_float_digits for that.)

- Strings. There's basically only one representation for a string. If we assume that the hash code only needs to be portable across hardware and not across encodings, a position for which I already argued upthread, then I think this should be manageable.

- Everything Else. Basically, everything else is just a composite of that stuff, I think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
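The integer case above can be sketched concretely (a hypothetical illustration, not a proposed implementation): widen every integer type to 64 bits, serialize in one fixed byte order, and hash the resulting bytes, so the same numeric value hashes identically regardless of declared width or host endianness.

```python
def portable_int_hash(value):
    """Portable integer hash (sketch): widen to 64 bits, encode in a
    fixed (little-endian) byte order, then hash byte-at-a-time, making
    the result independent of column width and host endianness."""
    raw = (value & 0xFFFFFFFFFFFFFFFF).to_bytes(8, "little")
    h = 0x811C9DC5
    for b in raw:                       # FNV-1a over the fixed encoding
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    return h
```

Because an int2, int4, and int8 holding the same value all widen to the same 64-bit pattern, they hash identically, which also addresses the opfamily-compatibility concern Jeff raised about hashint8() at the start of the thread.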
On Fri, May 12, 2017 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maintaining such a property for float8 (and the types that depend on it) might be possible if you believe that nobody ever uses anything but IEEE floats, but we've never allowed that as a hard assumption before.

This is not such a big practical problem (for me at least) because hashing of floats is of dubious value.

> Even architecture dependence isn't the whole scope of the problem. Consider for example dumping a LATIN1-encoded database and trying to reload it into a UTF8-encoded database. People will certainly expect that to be possible, and do you want to guarantee that the hash of a text value is encoding-independent?

That is a major problem. In an ideal world, we could make that work with something like ucol_getSortKey(), but we don't require ICU, and we can't mix getSortKey() with strxfrm(), or even strxfrm() results from different platforms. I don't think it's correct to hash the code points, either, because strings may be considered equal in a locale even if the code points aren't identical. But I don't think postgres lives up to that standard currently.

But hash partitioning is too valuable to give up on entirely. I think we should consider supporting a limited subset of types for now with something not based on the hash am.

Regards,
     Jeff Davis
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, May 13, 2017 at 12:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Can we think of defining separate portable hash functions which can be used for the purpose of hash partitioning?

> I think that would be a good idea. I think it shouldn't even be that hard. By data type:
>
> - Integers. We'd need to make sure that we get the same results for the same value on big-endian and little-endian hardware, and that performance is good on both systems. That seems doable.
>
> - Floats. There may be different representations in use on different hardware, which could be a problem. Tom didn't answer my question about whether any even-vaguely-modern hardware is still using non-IEEE floats, which I suspect means that the answer is "no". If every bit of hardware we are likely to find uses basically the same representation of the same float value, then this shouldn't be hard. (Also, even if this turns out to be hard for floats, using a float as a partitioning key would be a surprising choice because the default output representation isn't even unambiguous; you need extra_float_digits for that.)
>
> - Strings. There's basically only one representation for a string. If we assume that the hash code only needs to be portable across hardware and not across encodings, a position for which I already argued upthread, then I think this should be manageable.

Basically, this is simply saying that you're willing to ignore the hard cases, which reduces the problem to one of documenting the portability limitations. You might as well not even bother with worrying about the integer case, because porting between little- and big-endian systems is surely far less common than cases you've already said you're okay with blowing off.

That's not an unreasonable position to take, perhaps; doing better than that is going to be a lot more work and it's not very clear how much real-world benefit results. But I can't follow the point of worrying about endianness but not encoding.

			regards, tom lane
On Fri, May 12, 2017 at 11:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Forget hash partitioning. There's no law saying that that's a good
> idea and we have to have it. With a different set of constraints,
> maybe we could do it, but I think the existing design decisions have
> basically locked it out --- and I doubt that hash partitioning is so
> valuable that we should jettison other desirable properties to get it.

A lot of the optimizations that can make use of hash partitioning could also make use of range partitioning. But let me defend hash partitioning:

* hash partitioning requires fewer decisions by the user
* naturally balances data and workload among partitions in most cases
* easy to match with a degree of parallelism

But with range partitioning, you can have situations where different tables have different distributions of data. If you partition to balance the data between partitions in both cases, then that makes partition-wise join a lot harder because the boundaries don't line up. If you make the boundaries line up to do partition-wise join, the partitions might have wildly different amounts of data in them. Either way, it makes parallelism harder.

Even without considering joins, range partitioning could force you to make a choice between balancing the data and balancing the workload. If you are partitioning based on date, then a lot of the workload will be on more recent partitions. That's desirable sometimes (e.g. for vacuum) but not always desirable for parallelism.

Hash partitioning doesn't have these issues and goes very nicely with parallel query.

Regards,
Jeff Davis
On Fri, May 12, 2017 at 12:38 PM, Robert Haas <robertmhaas@gmail.com> wrote: > That is a good question. I think it basically amounts to this > question: is hash partitioning useful, and if so, for what? Two words: parallel query. To get parallelism, one of the best approaches is dividing the data, then doing as much work as possible before combining it again. If you have hash partitions on some key, then you can do partition-wise joins or partition-wise aggregation on that key in parallel with no synchronization/communication overhead (until the final result). You've taken postgres pretty far in this direction already; hash partitioning will take it one step further by allowing more pushdowns and lower sync/communication costs. Some of these things can be done with range partitioning, too, but see my other message here: https://www.postgresql.org/message-id/CAMp0ubfNMSGRvZh7N7TRzHHN5tz0ZeFP13Aq3sv6b0H37fdcPg%40mail.gmail.com Regards, Jeff Davis
On 2017-05-13 10:29:09 -0400, Robert Haas wrote:
> On Sat, May 13, 2017 at 12:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Can we think of defining separate portable hash functions which can be
> > used for the purpose of hash partitioning?
>
> I think that would be a good idea. I think it shouldn't even be that
> hard. By data type:
>
> - Integers. We'd need to make sure that we get the same results for
> the same value on big-endian and little-endian hardware, and that
> performance is good on both systems. That seems doable.
>
> - Floats. There may be different representations in use on different
> hardware, which could be a problem. Tom didn't answer my question
> about whether any even-vaguely-modern hardware is still using non-IEEE
> floats, which I suspect means that the answer is "no". If every bit
> of hardware we are likely to find uses basically the same
> representation of the same float value, then this shouldn't be hard.
> (Also, even if this turns out to be hard for floats, using a float as
> a partitioning key would be a surprising choice because the default
> output representation isn't even unambiguous; you need
> extra_float_digits for that.)
>
> - Strings. There's basically only one representation for a string.
> If we assume that the hash code only needs to be portable across
> hardware and not across encodings, a position for which I already
> argued upthread, then I think this should be manageable.
>
> - Everything Else. Basically, everything else is just a composite of
> that stuff, I think.

I seriously doubt that's true. A lot of more complex types have internal alignment padding and such. Consider e.g. something like jsonb, hstore, or postgis types - you *can* convert them to something that's unambiguous, but it's going to be fairly expensive. Essentially you'd have to do something like calling the output function, and then hashing the result of that. And hash-partitioning is particularly interesting for e.g. large amounts of points in a geospatial scenario, because other types of partitioning are quite hard to maintain.

- Andres
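Andres's "call the output function, then hash the result" approach can be sketched as follows. This is an illustration in Python, with JSON as a stand-in for a composite type's canonical text output; the point is that an unambiguous serialization exists but must be rebuilt on every hash call:

```python
import hashlib
import json

def hash_composite(value) -> str:
    """Hash a nested value (dicts/lists/scalars, a stand-in for jsonb
    or hstore) by first producing an unambiguous text serialization.
    sort_keys=True removes dict-ordering ambiguity, mimicking an
    output function -- correct, but the full serialization cost is
    paid on every hash."""
    canonical = json.dumps(value, sort_keys=True, separators=(',', ':'))
    return hashlib.sha256(canonical.encode('utf-8')).hexdigest()

# Logically equal values built in different orders hash alike:
assert hash_composite({'y': 2, 'x': 1}) == hash_composite({'x': 1, 'y': 2})
```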
On Sat, May 13, 2017 at 7:08 PM, Andres Freund <andres@anarazel.de> wrote:
> I seriously doubt that's true. A lot of more complex types have
> internal alignment padding and such.

True, but I believe we require those padding bytes to be zero. If we didn't, then hstore_hash would be broken already.

> Consider e.g. something like
> jsonb, hstore, or postgis types - you *can* convert them to something
> that's unambiguous, but it's going to be fairly expensive.

I'm fuzzy on what you think we'd need to do.

> Essentially
> you'd have to do something like calling the output function, and then
> hashing the result of that.

I really don't see why we'd have to go to nearly that length.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On May 13, 2017 8:44:22 PM PDT, Robert Haas <robertmhaas@gmail.com> wrote:
>On Sat, May 13, 2017 at 7:08 PM, Andres Freund <andres@anarazel.de> wrote:
>> I seriously doubt that's true. A lot of more complex types have
>> internal alignment padding and such.
>
>True, but I believe we require those padding bytes to be zero. If we
>didn't, then hstore_hash would be broken already.

It'll be differently sized on different platforms. So everyone will have to write hash functions that look at each member individually, rather than hashing the entire struct at once. And for each member you'll have to use a type-specific hash function...

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
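The padding problem Andres describes can be shown concretely. A Python illustration (the `struct` module's native layout mirrors C struct layout, including alignment padding):

```python
import struct

# With native alignment (the default), struct layout includes
# platform-dependent padding between members -- a char followed by an
# int typically occupies 8 bytes natively, but only 5 without padding.
# Hashing the raw struct bytes therefore isn't portable.
padded = struct.calcsize('ci')    # native alignment; padding included
packed = struct.calcsize('=ci')   # standard sizes; no padding

# A member-wise hash sidesteps padding: serialize each field in a
# fixed format and hash that, never the raw struct bytes.
def memberwise_key(c: bytes, i: int) -> bytes:
    return struct.pack('<c', c) + struct.pack('<i', i)

assert packed == 5          # 1-byte char + 4-byte int, no padding
assert padded >= packed     # native layout may insert padding
```

This is the mechanical-but-tedious work Robert refers to below: one serialization (or type-specific hash) per member, composed field by field.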
On Sat, May 13, 2017 at 1:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Basically, this is simply saying that you're willing to ignore the > hard cases, which reduces the problem to one of documenting the > portability limitations. You might as well not even bother with > worrying about the integer case, because porting between little- > and big-endian systems is surely far less common than cases you've > already said you're okay with blowing off. > > That's not an unreasonable position to take, perhaps; doing better > than that is going to be a lot more work and it's not very clear > how much real-world benefit results. But I can't follow the point > of worrying about endianness but not encoding. Encoding is a user choice, not a property of the machine. Or, looking at it from another point of view, the set of values that can be represented by an int4 is the same whether they are represented in big-endian form or in little-endian form, but the set of values that are representable changes when you switch encodings. You could argue that text-under-LATIN1 and text-under-UTF8 aren't really the same data type at all. It's one thing to say "you can pick up your data and move it to a different piece of hardware and nothing will break". It's quite another thing to say "you can pick up your data and convert it to a different encoding and nothing will break". The latter is generally false already. Maybe LATIN1 -> UTF8 is no-fail, but what about UTF8 -> LATIN1 or SJIS -> anything? Based on previous mailing list discussions, I'm under the impression that it is sometimes debatable how a character in one encoding should be converted to some other encoding, either because it's not clear whether there is a mapping at all or it's unclear what mapping should be used. 
See, e.g., 2dbbf33f4a95cdcce66365bcdb47c885a8858d3c, or https://www.postgresql.org/message-id/1739a900-30ab-f48e-aec4-2b35475ecf02%402ndquadrant.com where it was discussed that being able to convert encoding A -> encoding B does not guarantee the ability to perform the reverse conversion. Arguing that a given int4 value should hash to the same value on every platform seems like a request that is at least superficially reasonable, if possibly practically tricky in some cases. Arguing that every currently supported encoding should hash every character the same way when they don't all have the same set of characters and the mappings between them are occasionally debatable is asking for the impossible. I certainly don't want to commit to a design for hash partitioning that involves a compatibility break any time somebody changes any encoding conversion in the system, even if a hash function that involved translating every character to some sort of universal code point before hashing it didn't seem likely to be horribly slow. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, May 13, 2017 at 11:47 PM, Andres Freund <andres@anarazel.de> wrote:
> It'll be differently sized on different platforms. So everyone will have
> to write hash functions that look at each member individually, rather
> than hashing the entire struct at once. And for each member you'll have
> to use a type-specific hash function...

Well, that's possibly kind of annoying, but it still sounds like pretty mechanical work.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 13 May 2017 at 10:29, Robert Haas <robertmhaas@gmail.com> wrote:
> - Floats. There may be different representations in use on different
> hardware, which could be a problem. Tom didn't answer my question
> about whether any even-vaguely-modern hardware is still using non-IEEE
> floats, which I suspect means that the answer is "no".

Fwiw the answer to that is certainly no. The only caveat is that some platforms have incomplete IEEE implementations that miss corner cases such as denormalized values, but I don't think that's something that would change with a different hash function.

Personally, while I would like to avoid code that actively crashes or fails basic tests on Vax, even I don't think we need to worry about replication or federated queries in a heterogeneous environment where some servers are Vaxen and some are modern hardware. That seems a bit far-fetched.

-- greg
On 2017-05-14 15:59:09 -0400, Greg Stark wrote: > Personally while I would like to avoid code that actively crashes or > fails basic tests on Vax I personally vote for simply refusing to run/compile on non-IEEE platforms, including VAX. The benefit of even trying to get that right, not to speak of actually testing, seems just not there. > even I don't think we need to worry about > replication or federated queries in a heterogeneous environment where > some servers are Vaxen and some are modern hardware. That seems a bit > far-fetched. Imo there's little point in trying to delineate some subset that we want to support on platforms that are 20 years out of date. Either we do, or don't. And since there's no point aside of some intellectual curiosity... On the other hand, I don't believe my opinion that requiring hashing to be portable is unrealistic, is meaningfully affected by disallowing non-IEEE floats. Greetings, Andres Freund
On Sat, May 13, 2017 at 9:11 PM, Robert Haas <robertmhaas@gmail.com> wrote: > The latter is > generally false already. Maybe LATIN1 -> UTF8 is no-fail, but what > about UTF8 -> LATIN1 or SJIS -> anything? Based on previous mailing > list discussions, I'm under the impression that it is sometimes > debatable how a character in one encoding should be converted to some > other encoding, either because it's not clear whether there is a > mapping at all or it's unclear what mapping should be used. The express goal of the Unicode consortium is to replace all existing encodings with Unicode. My personal opinion is that a Unicode monoculture would be a good thing, provided reasonable differences can be accommodated. So, it might be that there is ambiguity about how one codepoint can be converted to another in another encoding, but that's because encodings like SJIS and BIG5 are needlessly ambiguous. It has nothing to do with cultural preferences leaving the question undecidable (at least by a panel of natural language experts), and everything to do with these regional character encoding systems being objectively bad. They richly deserve to die, and are in fact dying. Encoding actually *is* a property of the machine, even though regional encodings obfuscate things. There is a reason why MacOS and Java use UTF-16 rather than UTF-8, and there is a reason why the defacto standard on the web is UTF-8, and those reasons are completely technical. AFAICT, whatever non-technical reasons remain are actually technical debt in disguise. Where this leaves hash partitioning, I cannot say. -- Peter Geoghegan VMware vCenter Server https://www.vmware.com/
On Mon, May 15, 2017 at 7:59 AM, Greg Stark <stark@mit.edu> wrote: > On 13 May 2017 at 10:29, Robert Haas <robertmhaas@gmail.com> wrote: >> - Floats. There may be different representations in use on different >> hardware, which could be a problem. Tom didn't answer my question >> about whether any even-vaguely-modern hardware is still using non-IEEE >> floats, which I suspect means that the answer is "no". > > Fwiw the answer to that is certainly no. The only caveat is that some > platforms have not entirely complete implementations of IEEE missing > corner cases such as denormalized values but I don't think that would > be something that would be changed with a different hash function > though. Well... along with the Intel/IEEE-754 and VAX formats, there is a third floating point format that is/was in widespread use: IBM hex float[1]. It's base 16 rather than base 2, and might be the default on some IBM operating systems[2] for the C float and double types (but fortunately not xlc for AIX or Linux, and I have no clue about i/OS). This is probably irrelevant because it looks like people aren't running PostgreSQL on z/OS right now[3], but unlike VAXen these systems are alive and well so I just wanted to respond to the implication on this thread that the whole world is a VAX or an IEEE system :-) People really use this... I happen to know a shop that has petabytes of IBM hex float data. (IBM hardware also supports base 10 floats, but they show up as different types in C so not relevant.) [1] https://en.wikipedia.org/wiki/IBM_Floating_Point_Architecture [2] https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.1.0/com.ibm.zos.v2r1.cbcux01/flotcop.htm#flotcop [3] https://www.postgresql.org/message-id/flat/BLU437-SMTP4B3FF36035D8A3C3816D49C160%40phx.gbl -- Thomas Munro http://www.enterprisedb.com
Andres Freund <andres@anarazel.de> writes: > On 2017-05-14 15:59:09 -0400, Greg Stark wrote: >> Personally while I would like to avoid code that actively crashes or >> fails basic tests on Vax > I personally vote for simply refusing to run/compile on non-IEEE > platforms, including VAX. The point of wanting that is not because anybody thinks that VAX per se is an interesting platform anymore. The point is to avoid locking ourselves into narrow assumptions that we may need to break someday. Saying "nobody cares about floats other than IEEE floats" is morally indistinguishable from saying "nobody cares about running on any platform except Windows", which was a depressingly common opinion back in the nineties or so. It may well be that we can get away with saying "we're not going to make it simple to move hash-partitioned tables with float partition keys between architectures with different float representations". But there's a whole lot of daylight between that and denying any support for float representations other than the currently-most-popular one. regards, tom lane
On 2017-05-14 18:25:08 -0400, Tom Lane wrote: > It may well be that we can get away with saying "we're not going > to make it simple to move hash-partitioned tables with float > partition keys between architectures with different float > representations". But there's a whole lot of daylight between that > and denying any support for float representations other than the > currently-most-popular one. Note that I, IIRC in the mail you responded to, also argued that I don't think it'd be a good idea to rely on hashfunctions being portable. The amount of lock-in that'd create, especially for more complex datatypes, seems wholly inadvisable. I still think that dumping tables in a way they're reloaded via the top-partition (probably one copy statement for each child partition), and prohibiting incoming fkeys to partitions, is a better approach to all this. Greetings, Andres Freund
Peter Geoghegan <pg@bowt.ie> writes: > The express goal of the Unicode consortium is to replace all existing > encodings with Unicode. My personal opinion is that a Unicode > monoculture would be a good thing, provided reasonable differences can > be accommodated. Can't help remembering Randall Munroe's take on such things: https://xkcd.com/927/ I agree that the Far Eastern systems that can't easily be replaced by Unicode are that way mostly because they're a mess. But I'm still of the opinion that locking ourselves into Unicode is a choice we might regret, far down the road. regards, tom lane
On Mon, May 15, 2017 at 10:08 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > [2] https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.1.0/com.ibm.zos.v2r1.cbcux01/flotcop.htm#flotcop Though looking more closely I see that the default is IEEE in 64 bit builds, which seems like a good way to kill the older format off. If/when someone gets PostgreSQL ported to z/OS it probably won't be because they want to run it on an ancient 32 bit mainframe. -- Thomas Munro http://www.enterprisedb.com
On Sun, May 14, 2017 at 3:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I agree that the Far Eastern systems that can't easily be replaced > by Unicode are that way mostly because they're a mess. But I'm > still of the opinion that locking ourselves into Unicode is a choice > we might regret, far down the road. It's not a choice that has any obvious upside, so I have no reason to disagree. My point was only that Robert's contention that "You could argue that text-under-LATIN1 and text-under-UTF8 aren't really the same data type at all" seems wrong to me, because PostgreSQL seems to want to treat encoding as a property of the machine. This is evidenced by the fact that we expect applications to change client encoding "transparently". That is, client encoding may be changed without in any way affecting humans that speak a natural language that is provided for by the application's client encoding. That's a great ideal to have, and one that is very close to completely workable. -- Peter Geoghegan VMware vCenter Server https://www.vmware.com/
On Sun, May 14, 2017 at 6:29 PM, Andres Freund <andres@anarazel.de> wrote: > On 2017-05-14 18:25:08 -0400, Tom Lane wrote: >> It may well be that we can get away with saying "we're not going >> to make it simple to move hash-partitioned tables with float >> partition keys between architectures with different float >> representations". But there's a whole lot of daylight between that >> and denying any support for float representations other than the >> currently-most-popular one. > > Note that I, IIRC in the mail you responded to, also argued that I don't > think it'd be a good idea to rely on hashfunctions being portable. The > amount of lock-in that'd create, especially for more complex datatypes, > seems wholly inadvisable. I still think that dumping tables in a way > they're reloaded via the top-partition (probably one copy statement for > each child partition), and prohibiting incoming fkeys to partitions, is > a better approach to all this. You'd have to prohibit a heck of a lot more than that in order for this to work 100% reliably. You'd have to prohibit CHECK constraints, triggers, rules, RLS policies, and UNIQUE indexes, at the least. You might be able to convince me that some of those things are useless when applied to partitions, but wanting a CHECK constraint that applies to only one partition seems pretty reasonable (e.g. CHECK that records for older years are all in the 'inactive' state, or whatever). I think getting this to work 100% reliably in all cases would require an impractically large hammer. Now that's not to say that we shouldn't have a reload-through-the-top-parent switch; actually, I think that's a good idea. I just don't believe that it can ever be a complete substitute for portable hash functions. I think almost everybody here agrees that it isn't necessary to have hash functions that are 100% portable, e.g. to VAX. 
But it would be nice IMHO if you could use an integer column as the partitioning key and have that be portable between Intel and POWER. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2017-05-14 21:22:58 -0400, Robert Haas wrote:
> but wanting a CHECK constraint that applies to only one partition
> seems pretty reasonable (e.g. CHECK that records for older years are
> all in the 'inactive' state, or whatever).

On a hash-partitioned table?

> Now that's not to say that we shouldn't have a
> reload-through-the-top-parent switch; actually, I think that's a good
> idea. I just don't believe that it can ever be a complete substitute
> for portable hash functions. I think almost everybody here agrees
> that it isn't necessary to have hash functions that are 100% portable,
> e.g. to VAX. But it would be nice IMHO if you could use an integer
> column as the partitioning key and have that be portable between Intel
> and POWER.

I'm not saying it can't work for any datatype, I just think it'd be a very bad idea to make it work for any non-trivial ones. The likelihood of regularly breaking things, or preventing us from improving them, seems high. I'm not sure that having a design where this works most of the time for some datatypes is a good one.

Greetings,

Andres Freund
On Sun, May 14, 2017 at 01:06:03PM -0700, Andres Freund wrote:
> On 2017-05-14 15:59:09 -0400, Greg Stark wrote:
> > Personally while I would like to avoid code that actively crashes or
> > fails basic tests on Vax
>
> I personally vote for simply refusing to run/compile on non-IEEE
> platforms, including VAX. The benefit of even trying to get that right,
> not to speak of actually testing, seems just not there.

Do we even know that floats are precise enough to determine the partition? For example, if you have 6.000000001, is it possible for that to be 5.9999999 on some systems? Are IEEE systems all the same for these values? I would say we should disallow any approximate data type for partitioning completely.

-- Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
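To illustrate Bruce's question: on IEEE 754 hardware a given double has exactly one bit pattern (setting aside -0.0 and NaN), so hashing the bytes of a *stored* value is deterministic across IEEE machines. The hazard is upstream of hashing, as this Python sketch shows:

```python
import struct

# A stored IEEE double always has the same bytes on any IEEE machine:
assert struct.pack('<d', 6.000000001) == struct.pack('<d', 6.000000001)

# But values *computed* in float arithmetic may not be the double you
# expect, so two equal-looking numbers can be distinct doubles and
# would land in different hash partitions:
assert 0.1 + 0.2 != 0.3
assert struct.pack('<d', 0.1 + 0.2) != struct.pack('<d', 0.3)

# And -0.0 == +0.0 compares true while the bit patterns differ, so a
# byte-level hash would have to canonicalize zeros first:
assert -0.0 == 0.0
assert struct.pack('<d', -0.0) != struct.pack('<d', 0.0)
```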
On Sun, May 14, 2017 at 8:00 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Do we even know that floats are precise enough to determine the
> partition? For example, if you have 6.000000001, is it possible for
> that to be 5.9999999 on some systems? Are IEEE systems all the same for
> these values? I would say we should disallow any approximate data type
> for partitioning completely.

I'm inclined in this direction, as well. Hash partitioning is mostly useful for things that are likely to be join keys or group keys, and floats aren't. Same for complex user-defined types.

The real problem here is what Tom pointed out: that we would have trouble hashing strings in an encoding-insensitive way. Strings are useful as join/group keys, so it would be painful to not support them. Perhaps there's some kind of compromise, like a pg_dump mode that reloads through the root. Or maybe hash partitions are really a "semi-logical" partitioning that the optimizer understands, but where things like per-partition check constraints don't make sense.

Regards,
Jeff Davis
On Sun, May 14, 2017 at 6:22 PM, Robert Haas <robertmhaas@gmail.com> wrote: > You'd have to prohibit a heck of a lot more than that in order for > this to work 100% reliably. You'd have to prohibit CHECK constraints, > triggers, rules, RLS policies, and UNIQUE indexes, at the least. You > might be able to convince me that some of those things are useless > when applied to partitions, but wanting a CHECK constraint that > applies to only one partition seems pretty reasonable (e.g. CHECK that > records for older years are all in the 'inactive' state, or whatever). > I think getting this to work 100% reliably in all cases would require > an impractically large hammer. The more I think about it the more I think hash partitions are "semi-logical". A check constraint on a single partition of a range-partitioned table makes sense, but it doesn't make sense on a single partition of a hash-partitioned table. I think hash partitioning is mainly useful for parallel query (so the optimizer needs to know about it) and maintenance commands (as you listed in another email). But those don't need hash portability. FKs are a little more interesting, but I actually think they still work regardless of hash portability. If the two sides are in the same hash opfamily, they should hash the same and it's fine. And if they aren't, the FK has no hope of working regardless of hash portability. This would mean we need to reload through the root as Andres and others suggested, and disable a lot of logical partitioning capabilities. I'd be a little worried about what we do with attaching/detaching, though. Regards, Jeff Davis
On Mon, May 15, 2017 at 07:32:30AM -0700, Jeff Davis wrote:
> On Sun, May 14, 2017 at 8:00 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > Do we even know that floats are precise enough to determine the
> > partition? For example, if you have 6.000000001, is it possible for
> > that to be 5.9999999 on some systems? Are IEEE systems all the same for
> > these values? I would say we should disallow any approximate data type
> > for partitioning completely.
>
> I'm inclined in this direction, as well. Hash partitioning is mostly
> useful for things that are likely to be join keys or group keys, and
> floats aren't. Same for complex user-defined types.
>
> The real problem here is what Tom pointed out: that we would have
> trouble hashing strings in an encoding-insensitive way. Strings are
> useful as join/group keys, so it would be painful to not support them.

Well, since we can't mix encodings in the same column, why can't we just hash the binary representation of the string? My point is that with hashing we aren't comparing one string with another, right?

-- Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
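Hashing the raw bytes does work within one database, since a column has a single server encoding. The portability problem appears only when the same data moves to a database with a different encoding, because the byte representation of the same text changes. A small Python illustration:

```python
import hashlib

# One string, two encodings, two different byte sequences -- so a
# byte-level hash is stable inside one database but assigns the same
# row to a different partition after re-encoding the data.
text = 'caf\u00e9'  # "café"
utf8_bytes = text.encode('utf-8')      # b'caf\xc3\xa9'
latin1_bytes = text.encode('latin-1')  # b'caf\xe9'

assert utf8_bytes != latin1_bytes
assert hashlib.sha256(utf8_bytes).digest() != hashlib.sha256(latin1_bytes).digest()
```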
On Mon, May 15, 2017 at 07:48:14AM -0700, Jeff Davis wrote:
> This would mean we need to reload through the root as Andres and
> others suggested,

One refinement of this would be to traverse the partition tree, stopping at the first place where the next branch has hash partitions, or at any rate types which have no application-level significance, and load from there. This could allow for parallelizing where it's portable to do so.

Level           Table  Partition Type
------------------------------------------------
Base table:     Log    (N/A)
Next partition: Year   (range)
Next partition: Month  (range)
Next partition: Day    (range)  <---- Data gets loaded no lower than here
Next partition: *      (hash)

That last partition, and any below it, doesn't have a specific name because they're just an implementation detail, i.e. none has any application-level meaning.

> and disable a lot of logical partitioning capabilities. I'd be a
> little worried about what we do with attaching/detaching, though.

Attaching and detaching partitions only makes sense for partitions whose partition keys have application-level meaning anyway. Does it make sense at this point to separate our partitions into two categories: those which can have significance to applications, and those which can't?

Best,
David.
-- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Sun, May 14, 2017 at 9:35 PM, Andres Freund <andres@anarazel.de> wrote: > On 2017-05-14 21:22:58 -0400, Robert Haas wrote: >> but wanting a CHECK constraint that applies to only one partition >> seems pretty reasonable (e.g. CHECK that records for older years are >> all in the 'inactive' state, or whatever). > > On a hash-partitioned table? No, probably not. But do we really want the rules for partitioned tables to be different depending on the kind of partitioning in use? > I'm not saying it can't work for any datatype, I just think it'd be a > very bad idea to make it work for any non-trivial ones. The likelihood > of reguarly breaking or preventing us from improving things seems high. > I'm not sure that having a design where this most of the time works for > some datatypes is a good one. I think you might be engaging in undue pessimism here, but I suspect we need to actually try doing the work before we know how it will turn out. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On May 15, 2017, at 7:48 AM, Jeff Davis <pgsql@j-davis.com> wrote: > > On Sun, May 14, 2017 at 6:22 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> You'd have to prohibit a heck of a lot more than that in order for >> this to work 100% reliably. You'd have to prohibit CHECK constraints, >> triggers, rules, RLS policies, and UNIQUE indexes, at the least. You >> might be able to convince me that some of those things are useless >> when applied to partitions, but wanting a CHECK constraint that >> applies to only one partition seems pretty reasonable (e.g. CHECK that >> records for older years are all in the 'inactive' state, or whatever). >> I think getting this to work 100% reliably in all cases would require >> an impractically large hammer. > > The more I think about it the more I think hash partitions are > "semi-logical". A check constraint on a single partition of a > range-partitioned table makes sense, but it doesn't make sense on a > single partition of a hash-partitioned table. That depends on whether the user gets to specify the hash function, perhaps indirectly by specifying a user defined opfamily. I can imagine clever hash functions that preserve certain properties of the incoming data, and check constraints in development versions of the database that help verify the hash is not violating those properties. That's not to say such hash functions must be allowed in the hash partitioning implementation; just that it does make sense if you squint and look a bit sideways at it. Mark Dilger
On Mon, May 15, 2017 at 03:26:02PM -0400, Robert Haas wrote: > On Sun, May 14, 2017 at 9:35 PM, Andres Freund <andres@anarazel.de> wrote: > > On 2017-05-14 21:22:58 -0400, Robert Haas wrote: > >> but wanting a CHECK constraint that applies to only one partition > >> seems pretty reasonable (e.g. CHECK that records for older years > >> are all in the 'inactive' state, or whatever). > > > > On a hash-partitioned table? > > No, probably not. But do we really want the rules for partitioned > tables to be different depending on the kind of partitioning in use? As the discussion has devolved here, it appears that there are, at least conceptually, two fundamentally different classes of partition: public, which is to say meaningful to DB clients, and "private", used for optimizations, but otherwise opaque to DB clients. Mashing those two cases together appears to cause more problems than it solves. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Mon, May 15, 2017 at 1:04 PM, David Fetter <david@fetter.org> wrote:
> As the discussion has devolved here, it appears that there are, at
> least conceptually, two fundamentally different classes of partition:
> public, which is to say meaningful to DB clients, and "private", used
> for optimizations, but otherwise opaque to DB clients.
>
> Mashing those two cases together appears to cause more problems than
> it solves.

I concur at this point. I originally thought hash functions might be made portable, but I think Tom and Andres showed that to be too problematic -- the issue with different encodings is the real killer.

But I also believe hash partitioning is important and we shouldn't give up on it yet.

That means we need to have a concept of hash partitions that's different from range/list partitioning. The terminology "public"/"private" does not seem appropriate. Logical/physical or external/internal might be better.

With hash partitioning:
* User only specifies the number of partitions of the parent table; does not specify individual partition properties (modulus, etc.)
* Dump/reload goes through the parent table (though we may provide options so pg_dump/restore can optimize this)
* We could provide syntax to adjust the number of partitions, which would be expensive but still useful sometimes.
* All DDL should be on the parent table, including check constraints, FKs, unique constraints, exclusion constraints, indexes, etc.
  - Unique and exclusion constraints would only be permitted if the keys are a superset of the partition keys.
  - FKs would only be permitted if the two tables' partition schemes match and the keys are members of the same hash opfamily (this could be relaxed slightly, but it gets a little confusing if so)
* No attach/detach of partitions
* All partitions have the same permissions
* Individual partitions would only be individually addressable for maintenance (like reindex and vacuum), but not for arbitrary queries
  - perhaps also COPY for bulk loading/dumping, in case we get clients smart enough to do their own hashing.

The only real downside is that it could surprise users -- why can I add a CHECK constraint on my range-partitioned table but not the hash-partitioned one? We should try to document this so users don't find that out too far along. As long as they aren't surprised, I think users will understand why these aren't quite the same concepts.

Regards,
Jeff Davis
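To illustrate the routing model in the proposal above (clients address only the parent; the partition count and modulus are internal details, much like TOAST), here is a minimal Python mock-up. The names are made up and CRC-32 stands in for a portable, byte-oriented hash; this is a sketch of the idea, not PostgreSQL code:

```python
import zlib

NUM_PARTITIONS = 8

def route(key: bytes) -> int:
    # crc32 is defined over a byte sequence, so it returns the same
    # value on any architecture -- the portability property discussed
    # upthread. A real implementation would use a stronger hash.
    return zlib.crc32(key) % NUM_PARTITIONS

def insert(row: dict, partitions: list) -> None:
    # Clients only ever "insert into the parent"; picking the child
    # partition is an internal detail.
    partitions[route(row["id"].encode("utf-8"))].append(row)

partitions = [[] for _ in range(NUM_PARTITIONS)]
insert({"id": "abc", "val": 1}, partitions)
insert({"id": "xyz", "val": 2}, partitions)
```

Because the hash is computed over bytes rather than machine words, the routing decision is identical on any architecture, which is what makes a dump taken on one machine reloadable on another without rows moving between partitions.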
On Tue, May 16, 2017 at 11:10 AM, Jeff Davis <pgsql@j-davis.com> wrote: > With hash partitioning: > * User only specifies number of partitions of the parent table; does > not specify individual partition properties (modulus, etc.) > * Dump/reload goes through the parent table (though we may provide > options so pg_dump/restore can optimize this) > * We could provide syntax to adjust the number of partitions, which > would be expensive but still useful sometimes. > * All DDL should be on the parent table, including check constraints, > FKs, unique constraints, exclusion constraints, indexes, etc. > - Unique and exclusion constraints would only be permitted if the > keys are a superset of the partition keys. > - FKs would only be permitted if the two table's partition schemes > match and the keys are members of the same hash opfamily (this could > be relaxed slightly, but it gets a little confusing if so) > * No attach/detach of partitions > * All partitions have the same permissions > * Individual partitions would only be individually-addressable for > maintenance (like reindex and vacuum), but not for arbitrary queries > - perhaps also COPY for bulk loading/dumping, in case we get clients > smart enough to do their own hashing. I don't really find this a very practical design. If the table partitions are spread across different relfilenodes, then those relfilenodes have to have separate pg_class entries and separate indexes, and those indexes also need to have separate pg_class entries. Otherwise, nothing works. And if they do have separate pg_class entries, then the partitions have to have their own names, and likewise for their indexes, and a dump-and-reload has to preserve those names. If it doesn't, and those objects get new system-assigned names after the dump-and-reload, then dump restoration can fail when a system-assigned name collides with an existing name that is first mentioned later in the dump. 
If we had the ability to have anonymous pg_class entries -- relations that have no names -- then maybe it would be possible to make something like what you're talking about work. But that does not seem easy to do. There's a unique index on (relname, relnamespace) for good reason, and we can't make it partial on a system catalog. We could make the relname column allow nulls, but that would add overhead to any code that needs to access the relation name, and there's a fair amount of that. Similarly, if we had the ability to associate multiple relfilenodes with a single relation, and if index entries could point to <which-relfilenode, block, offset> rather than just <block, offset>, then we could also make this work. But either of those things would require significant re-engineering and would have downsides in other cases. If Java has portable hash functions, why can't we? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, May 16, 2017 at 08:10:39AM -0700, Jeff Davis wrote: > On Mon, May 15, 2017 at 1:04 PM, David Fetter <david@fetter.org> wrote: > > As the discussion has devolved here, it appears that there are, at > > least conceptually, two fundamentally different classes of partition: > > public, which is to say meaningful to DB clients, and "private", used > > for optimizations, but otherwise opaque to DB clients. > > > > Mashing those two cases together appears to cause more problems than > > it solves. > > I concur at this point. I originally thought hash functions might be > made portable, but I think Tom and Andres showed that to be too > problematic -- the issue with different encodings is the real killer. > > But I also believe hash partitioning is important and we shouldn't > give up on it yet. > > That means we need to have a concept of hash partitions that's > different from range/list partitioning. The terminology > "public"/"private" does not seem appropriate. Logical/physical or > external/internal might be better. I'm not attached to any particular terminology. > With hash partitioning: > * User only specifies number of partitions of the parent table; does > not specify individual partition properties (modulus, etc.) Maybe this is over-thinking it, but I'm picturing us ending up with something along the lines of: PARTITION BY INTERNAL(EXPRESSION) [ WITH ( [PARAMETERS] [, AS, APPROPRIATE] ) ] i.e. it's not clear that we should wire in "number of partitions" as a parameter. In a not that distant future, ANALYZE and similar could have a say in determining both the "how" and the "whether" of partitioning. > * Dump/reload goes through the parent table (though we may provide > options so pg_dump/restore can optimize this) Would it be simplest to default to routing through the immediate ancestor for now? It occurs to me that with the opaque partition system we're designing here, internal partitions would necessarily be leaves in the tree. 
> * We could provide syntax to adjust the number of partitions, which > would be expensive but still useful sometimes. Yep. I suspect that techniques for this are described in literature, and possibly even in code bases. Any pointers? > * All DDL should be on the parent table, including check constraints, > FKs, unique constraints, exclusion constraints, indexes, etc. Necessarily. > - Unique and exclusion constraints would only be permitted if the > keys are a superset of the partition keys. "Includes either all of the partition expression or none of it," maybe? > - FKs would only be permitted if the two table's partition schemes > match and the keys are members of the same hash opfamily (this could > be relaxed slightly, but it gets a little confusing if so) Relaxing sounds like a not-in-the-first-cut feature, and subtle. > * No attach/detach of partitions Since they're opaque, this is the only sane thing. > * All partitions have the same permissions Since they're opaque, this is the only sane thing. > * Individual partitions would only be individually-addressable for > maintenance (like reindex and vacuum), but not for arbitrary queries Since they're opaque, this is the only sane thing. > - perhaps also COPY for bulk loading/dumping, in case we get clients > smart enough to do their own hashing. This is appealing from a resource allocation point of view in the sense of deciding where the hash computing resources are spent. Do we want something like the NOT VALID/VALIDATE infrastructure to support it? > The only real downside is that it could surprise users -- why can I > add a CHECK constraint on my range-partitioned table but not the > hash-partitioned one? We should try to document this so users don't > find that out too far along. As long as they aren't surprised, I think > users will understand why these aren't quite the same concepts. +1 Best, David. 
On Tuesday, May 16, 2017, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't really find this a very practical design. If the table
> partitions are spread across different relfilenodes, then those
> relfilenodes have to have separate pg_class entries and separate
> indexes, and those indexes also need to have separate pg_class
> entries. Otherwise, nothing works. And if they do have separate
> pg_class entries, then the partitions have to have their own names,
> and likewise for their indexes, and a dump-and-reload has to preserve
> those names. If it doesn't, and those objects get new system-assigned
> names after the dump-and-reload, then dump restoration can fail when a
> system-assigned name collides with an existing name that is first
> mentioned later in the dump.
Why can't hash partitions be stored in tables the same way as we do TOAST? That should take care of the naming problem.
> If Java has portable hash functions, why can't we?
Java standardizes on a particular Unicode encoding (UTF-16). Are you suggesting that we do the same? Or is there another solution that I am missing?
Regards,
Jeff Davis
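The encoding problem Jeff raises is easy to demonstrate outside the database. This illustrative Python snippet shows that the same logical string has different byte representations under different encodings, so any hash defined over the raw bytes necessarily disagrees (SHA-256 here is just a convenient stand-in for any byte-oriented hash):

```python
import hashlib

s = "café"
utf8_bytes = s.encode("utf-8")      # 5 bytes: b'caf\xc3\xa9'
latin1_bytes = s.encode("latin-1")  # 4 bytes: b'caf\xe9'

# The byte strings differ, so any hash of the bytes differs, and the
# same row would be routed to different partitions in differently
# encoded databases -- unless the hash function first normalizes to a
# canonical encoding, as Java effectively does by defining String
# over UTF-16 code units.
h_utf8 = hashlib.sha256(utf8_bytes).hexdigest()
h_latin1 = hashlib.sha256(latin1_bytes).hexdigest()
```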
On 5/16/17 11:10, Jeff Davis wrote: > I concur at this point. I originally thought hash functions might be > made portable, but I think Tom and Andres showed that to be too > problematic -- the issue with different encodings is the real killer. I think it would be OK that if you want to move a hash-partitioned table to a database with a different encoding, you have to do dump/restore through the parent table. This is quite similar to what you have to do now if you want to move a range-partitioned table to a database with a different locale. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017/05/17 5:25, Jeff Davis wrote: > On Tuesday, May 16, 2017, Robert Haas <robertmhaas@gmail.com> wrote: >> I don't really find this a very practical design. If the table >> partitions are spread across different relfilenodes, then those >> relfilenodes have to have separate pg_class entries and separate >> indexes, and those indexes also need to have separate pg_class >> entries. Otherwise, nothing works. And if they do have separate >> pg_class entries, then the partitions have to have their own names, >> and likewise for their indexes, and a dump-and-reload has to preserve >> those names. If it doesn't, and those objects get new system-assigned >> names after the dump-and-reload, then dump restoration can fail when a >> system-assigned name collides with an existing name that is first >> mentioned later in the dump. > > Why can't hash partitions be stored in tables the same way as we do TOAST? > That should take care of the naming problem. There is only one TOAST table per relation though, containing all of the relation's TOASTed data. Doesn't the multiplicity of hash partitions pose a problem? Will a hash partition of a given name end up with the same subset of data in the target database as it did in the source database? I suppose it won't matter though if we make hash partitions an implementation detail of hash partitioned tables to the extent you described in your earlier email [1], whereby it doesn't matter to an application which partition contains what subset of the table's data. Thanks, Amit [1] https://www.postgresql.org/message-id/CAMp0ubcQ3VYdU1kNUCOmpj225U4hk6ZEoaUVeReP8h60p%2Bmv1Q%40mail.gmail.com
On Tue, May 16, 2017 at 8:40 PM, Jeff Davis <pgsql@j-davis.com> wrote: > On Mon, May 15, 2017 at 1:04 PM, David Fetter <david@fetter.org> wrote: >> As the discussion has devolved here, it appears that there are, at >> least conceptually, two fundamentally different classes of partition: >> public, which is to say meaningful to DB clients, and "private", used >> for optimizations, but otherwise opaque to DB clients. >> >> Mashing those two cases together appears to cause more problems than >> it solves. > > I concur at this point. I originally thought hash functions might be > made portable, but I think Tom and Andres showed that to be too > problematic -- the issue with different encodings is the real killer. > > But I also believe hash partitioning is important and we shouldn't > give up on it yet. > > That means we need to have a concept of hash partitions that's > different from range/list partitioning. The terminology > "public"/"private" does not seem appropriate. Logical/physical or > external/internal might be better. > > With hash partitioning: > * User only specifies number of partitions of the parent table; does > not specify individual partition properties (modulus, etc.) A well-distributed integer column doesn't even need to be hashed; a simple modulo works on it directly. If we are going towards "implicit" (yet another name) partitioning, we could choose the strategy based on the data type of the partition key, rather than always hashing. Although we might end up hashing in most cases. > * Dump/reload goes through the parent table (though we may provide > options so pg_dump/restore can optimize this) You probably mean the immediate hash-partitioned parent, but let me clarify a bit. We support multi-level partitioning, with each partitioned table anywhere in the partitioning hierarchy choosing any partitioning scheme.
So, we can have a range-partitioned table as a partition of a hash-partitioned table, or for that matter a non-hash-partitioned table somewhere in the hierarchy rooted at the hash-partitioned table. So, for range/list or even hash partitions that are grandchildren of a hash-partitioned table, we will need to route dump/reload through that hash-partitioned table, i.e. through the topmost hash-partitioned table. > * We could provide syntax to adjust the number of partitions, which > would be expensive but still useful sometimes. > * All DDL should be on the parent table, including check constraints, > FKs, unique constraints, exclusion constraints, indexes, etc. i.e. on the topmost hash-partitioned table, as explained above. > - Unique and exclusion constraints would only be permitted if the > keys are a superset of the partition keys. Do you think this constraint would apply even after we support global indexes? And isn't this applicable to all the partitioning strategies? > - FKs would only be permitted if the two table's partition schemes > match and the keys are members of the same hash opfamily (this could > be relaxed slightly, but it gets a little confusing if so) > * No attach/detach of partitions It would be good if we could support this for maintenance purposes. If a partition goes bad, we could replace it with a copy kept somewhere, using attach and detach, without affecting the whole table. Does that mean we would need some form of pg_dump/copy with a special flag to create copies of individual partitions? I think that proposal has already been floated. > * All partitions have the same permissions Why's that? > > The only real downside is that it could surprise users -- why can I > add a CHECK constraint on my range-partitioned table but not the > hash-partitioned one? We should try to document this so users don't > find that out too far along. As long as they aren't surprised, I think > users will understand why these aren't quite the same concepts.
For a transparent hash (as opposed to the non-transparent kind Mark Dilger proposed), any constraint other than the implicit partitioning constraint either applies to the whole table or doesn't apply at all, so it's better if the user adds it on the parent table. So, yes, with this reasoning, we could document that fact. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
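Ashutosh's point about well-distributed integers can be made concrete with a small sketch (Python, purely illustrative): plain modulo spreads sequential keys perfectly and is trivially portable, but collapses when the keys share a common factor with the modulus, which is where a real hash step earns its keep:

```python
NUM_PARTITIONS = 4

def modulo_route(key: int) -> int:
    # No hash step at all: portable by construction, since integer
    # arithmetic doesn't depend on encodings or byte order.
    return key % NUM_PARTITIONS

# Sequential keys spread perfectly across the four partitions...
counts = [0] * NUM_PARTITIONS
for key in range(1000):
    counts[modulo_route(key)] += 1

# ...but keys sharing a factor with the modulus collapse onto a
# single partition, motivating a hash step for skewed key sets.
skewed = [0] * NUM_PARTITIONS
for key in range(0, 4000, 4):  # 1000 keys, all divisible by 4
    skewed[modulo_route(key)] += 1
```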
On Tue, May 16, 2017 at 4:25 PM, Jeff Davis <pgsql@j-davis.com> wrote: > Why can't hash partitions be stored in tables the same way as we do TOAST? > That should take care of the naming problem. Hmm, yeah, something like that could be done, but every place where you are currently allowed to refer to a partition by name would have to be changed to accept some other syntax for specifying the partition. Even with all the things you propose to disallow, things like CLUSTER, VACUUM, ANALYZE, etc. would still have to be accepted on individual partitions. I suspect even CREATE INDEX and DROP INDEX would need to be accepted on individual partitions, because if an index on one partition somehow becomes bloated while the corresponding indexes on other partitions are OK, you'll want to create a replacement index concurrently and drop the old one. Of course, for similar reasons, you'd need some way for \d on the parent to display information on indexes on all the children, and all of that output would have to be frobbed to use whatever syntax is now required, in lieu of a name, to use an individual partition. Error messages would have to be adjusted in probably quite a few places to use the new notation, too. And on and on. It's not impossible to do, but we could end up chasing down loose ends for a very long time. Beyond that, I think it's a bad idea to make hash partitions behave completely differently from list and range partitions. That's a lot of extra code to maintain, and a lot of extra notional complexity for users, for really not a lot of benefit. I think we're taking a significant but not overwhelming problem -- our current hash functions aren't portable -- and through a chain of logical links eventually ending up with the conclusion that the design of partitioning needs to be totally overhauled. I want to resist that conclusion.
I'm not saying that the problem isn't a problem, or that there's not some logic to each step in the chain, but it's not that hard to blow a small problem up into a huge one by assuming the worst possible consequences or the tightest possible requirements at each step. http://tvtropes.org/pmwiki/pmwiki.php/Main/ForWantOfANail is not an argument for stricter regulation of the blacksmith industry. >> If Java has portable hash functions, why can't we? > > Java standardizes on a particular unicode encoding (utf-16). Are you > suggesting that we do the same? Or is there another solution that I am > missing? Well, I've already said several times (and Peter Eisentraut has agreed) that we don't really need the hash functions to be portable across encodings. I think there are at least three good arguments for that position. First, as Peter Geoghegan points out, the world is increasingly standardizing on Unicode, and that trend seems likely to continue. I strongly suspect that UTF-8 is the most common database encoding by a wide margin. There may occasionally be reasons to avoid it if, for example, you're using one of the Eastern languages that doesn't play entirely nicely with UTF-8, or if you happen to be storing a large number of characters that can be represented in a single byte in some other encoding but which require multiple bytes in UTF-8, but for an awful lot of people UTF-8 just works and there's no need to think any further. So, a lot of people will never hit the problem of needing to migrate a database between encodings because they'll just use UTF-8. Second, if the previous argument turns out to be wrong and the world abandons UTF-8 in favor of some new and better system (UTF-9?), or if users frequently want to make some other encoding conversion like Tom's original example of LATIN1 -> UTF-8, we've already got a proposed workaround for that case which seems like it will work just fine.
Just dump your data with pg_dump --insert-hash-partitioned-data-into-parent and reload on the new system. This isn't absolutely guaranteed to work if you've done something silly that will make the load of a particular row work on one partition and fail on some other one, but you probably won't do that because it would be dumb. Also, it will be a bit slower than a regular dump-and-reload cycle because tuple routing isn't free. Neither of these problems really sound very bad. If we're going to start fixing things that could cause database migrations/upgrades to occasionally fail in corner cases or run more slowly than expected, there's a long list of things that you can do in your DDL that will make pg_upgrade bomb out, and many of them are things that bite users with some regularity (e.g. tablespaces inside the data directory or other tablespaces, dependencies on system objects that are changed in the new version, ALTER USER .. SET ROLE). For whatever reason, we haven't viewed those warts as really high-priority items in need of fixing; in some cases, instead of actually trying to improve usability, we've all but mocked the people reporting those issues for having the temerity to configure the system in a way that we think isn't very smart. How can we then turn around and say "it's absolutely unacceptable for there to be any case where dump and reload fails even if there's a workaround switch for pg_dump that will almost always solve the problem"? That's holding hash partitioning to a stricter standard than our existing features, which seems unjustified. Third, the fact that we support multiple encodings ought to be a strong point of PostgreSQL, not an excuse for running away from features. This same issue came up with JSON support, and the question was, well, what do we do about the places where JSON assumes that everything is UTF-8, like in \uXXXX escape sequences, given that we support multiple encodings?
We could have answered that question by giving up and saying that JSON support is just too hard in our multi-encoding environment, but we found some solution that let us move forward. I think the result is that JSON is not really quite per-spec unless you are using UTF-8 as your encoding, but I am confident that it's better to have made that trade-off than to just not support JSON at all. Similarly, I think that if we use the fact that we support multiple encodings either as an excuse for why we shouldn't have hash partitioning at all or for why the design needs to be vastly more complex than would otherwise be required, we've made a liability of what should be an asset. Tom's willing to make those arguments, but he doesn't think hash partitioning is worthwhile in the first place. If we believe it is worthwhile (and I do), then we should be looking for a set of design decisions that make implementing it reasonably practical, maybe with some trade-offs, rather than a set of decisions that increase the complexity to the point where it will take us 5-10 years to unravel all the problems. Also, it's worth noting that our decision here doesn't need to be monolithic. If we want, we can ship one opfamily that hashes strings in the obvious way - i.e. fast but not portable across encodings - and another that converts the input string to UTF-8 and then hashes the result. The latter will be somewhat slow, but it will be portable across encodings, and I don't have any problem at all with providing it if somebody wants to do the legwork. It'll also fail in any encodings that can't be converted to UTF-8, but it seems a bit much to insist that hash partitioning must solve a problem that the Unicode consortium hasn't managed to overcome. Anyway, if we do something like that, then users can choose whether they want the portability for which Tom advocates or the speed which I believe most users will prefer. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
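Robert's two-opfamily idea can be sketched in a few lines. This is an illustration only, not PostgreSQL code: the function names are made up, and SHA-256 stands in for a real hash opclass. The "fast" variant hashes the server-encoding bytes directly, while the "portable" variant first transcodes to a canonical encoding (UTF-8):

```python
import hashlib

def fast_hash(raw_bytes: bytes) -> str:
    # Hash the server-encoding bytes directly: fast, but two
    # databases in different encodings disagree on the result.
    return hashlib.sha256(raw_bytes).hexdigest()

def portable_hash(raw_bytes: bytes, server_encoding: str) -> str:
    # Transcode to a canonical encoding (UTF-8) before hashing.
    # This costs a conversion, but any encoding that round-trips
    # through Unicode yields the same answer -- and it simply fails
    # for encodings that cannot be converted, as Robert notes.
    canonical = raw_bytes.decode(server_encoding).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# The same logical string under two server encodings:
latin1 = "café".encode("latin-1")
utf8 = "café".encode("utf-8")
```

Users who want Tom's portability would pick the second opfamily at partitioning time; everyone else keeps the fast path.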
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, May 16, 2017 at 4:25 PM, Jeff Davis <pgsql@j-davis.com> wrote: >> Why can't hash partitions be stored in tables the same way as we do TOAST? >> That should take care of the naming problem. > Hmm, yeah, something like that could be done, but every place where > you are currently allowed to refer to a partition by name would have > to be be changed to accept some other syntax for specifying the > partition. Uh ... toast tables have regular names, and can be specified in commands just like any other table. I don't see why these "auto" partition tables couldn't be handled the same way. > Beyond that, I think it's a bad idea to make hash partitions behave > completely differently from list and range partitions. I think the question is whether we are going to make a distinction between logical partitions (where the data division rule makes some sense to the user) and physical partitions (where it needn't). I think it might be perfectly reasonable for those to behave differently. regards, tom lane
On Wed, May 17, 2017 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, May 16, 2017 at 4:25 PM, Jeff Davis <pgsql@j-davis.com> wrote: >>> Why can't hash partitions be stored in tables the same way as we do TOAST? >>> That should take care of the naming problem. > >> Hmm, yeah, something like that could be done, but every place where >> you are currently allowed to refer to a partition by name would have >> to be be changed to accept some other syntax for specifying the >> partition. > > Uh ... toast tables have regular names, and can be specified in commands > just like any other table. I don't see why these "auto" partition tables > couldn't be handled the same way. Really? That seems like a huge usability fail to me. If somebody wants to create an index on one partition of a hash-partitioned table, or reindex an index, do you really want them to have to dig out an internal name to do it? And how exactly would you dump and restore the partitions and their indexes? It's true that there are some operations that can be performed directly on a TOAST table, but the saving grace is that you usually don't need to do any of them. That won't be true here. >> Beyond that, I think it's a bad idea to make hash partitions behave >> completely differently from list and range partitions. > > I think the question is whether we are going to make a distinction between > logical partitions (where the data division rule makes some sense to the > user) and physical partitions (where it needn't). I think it might be > perfectly reasonable for those to behave differently. I don't think I'd like to go so far as to say that it's unreasonable, but I certainly wouldn't say I'm optimistic about such a design. I do not think that it is going to work to conceal from the user that the partitions are really separate tables with their own indexes. 
I also think that trying to make such a thing work is just going to lead to a lot of time and energy spent trying to paper over problems that are basically self-inflicted, and that papering over those problems won't really end up having any value for users. Remember, the chain of reasoning here is something like:

1. To handle dump-and-reload the way partitioning does today, hash functions would need to be portable across encodings.
2. That's impractically difficult.
3. So let's always load data through the top-parent.
4. But that could fail due to e.g. a UNIQUE index on an individual child, so let's try to prohibit all of the things that could be done to an individual partition that could cause a reload failure.
5. And then for good measure let's hide the existence of the partitions, too.

Every step in that chain of logic has a certain sense to it, but none of them are exactly water-tight. #1 is basically a value judgement: would people rather (a) have faster hash functions, or (b) would they rather be able to port a database to a different encoding without having rows move between hash functions? The statement is only true if you think it's the latter, but I tend to think it's the former. #2 is a judgement that the performance characteristics of as-yet-unwritten portable hashing will be so bad that nobody could possibly be satisfied with it. #3 is a great idea as an optional behavior, but it's only a strict necessity if you're totally committed to #1 and #2. It also has some performance cost, which makes it somewhat undesirable as a default behavior. #4 is *probably* a necessary consequence of #3. I don't know what the argument for #5 is unless it's that #4 isn't hard enough already. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, May 17, 2017 at 12:10 PM, Robert Haas <robertmhaas@gmail.com> wrote: > 1. To handle dump-and-reload the way we partitioning does today, hash > functions would need to be portable across encodings. > 2. That's impractically difficult. > 3. So let's always load data through the top-parent. > 4. But that could fail due to e.g. a UNIQUE index on an individual > child, so let's try to prohibit all of the things that could be done > to an individual partition that could cause a reload failure. > 5. And then for good measure let's hide the existence of the partitions, too. That is one thread of logic, but the discussion also highlighted some of the consequences of treating hash partitions like range/list partitions. For instance, it makes little sense to have individual check constraints, indexes, permissions, etc. on a hash-partitioned table. It doesn't mean that we should necessarily forbid them, but it should make us question whether combining range and hash partitions is really the right design. Regards, Jeff Davis
On Wed, May 17, 2017 at 11:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think the question is whether we are going to make a distinction between > logical partitions (where the data division rule makes some sense to the > user) and physical partitions (where it needn't). I think it might be > perfectly reasonable for those to behave differently. Agreed. To summarize my perspective: * hash partitioning offers a nice way to divide the data for later processing by parallel query * range partitioning is good for partition elimination (constraint_exclusion) and separating hot/cold data (e.g. partitioning on date) * both offer some maintenance benefits (e.g. reindex one partition at a time), though range partitioning seems like it offers better flexibility here in some cases I lean toward separating the concepts, but Robert is making some reasonable arguments and I could be convinced. Regards, Jeff Davis
On Thu, May 18, 2017 at 1:53 AM, Jeff Davis <pgsql@j-davis.com> wrote: > For instance, it makes little sense to have individual check > constraints, indexes, permissions, etc. on a hash-partitioned table. > It doesn't mean that we should necessarily forbid them, but it should > make us question whether combining range and hash partitions is really > the right design. I think that it definitely makes sense to have individual indexes on a hash-partitioned table. If you didn't, then as things stand today, you'd have no indexes at all, which can't be good. In the future, we might have some system where an index created on the parent cascades down to all of the children, but even then, you might want to REINDEX just one of those child indexes, or better yet, create a replacement index concurrently and then drop the old one concurrently. You might also want to add the same sort of new index to every partition, but not in a single operation - for reasons of load, length of maintenance window, time for which a snapshot is held open, etc. I agree that separate constraints and permissions on hash partitions don't make much sense. To a lesser extent, that's true of other kinds of partitioning as well. I mean, there is probably some use case for setting separate permissions on a range-partitioned table, but it's a pretty thin use case. It certainly seems possible that many users would prefer a rule that enforces uniform permissions across the entire partitioning hierarchy. This is one of the key things that had to be decided in regard to the partitioning implementation we now have: for which things should we enforce uniformity, and for which things should we allow diversity? I advocated for enforcing uniformity only in areas where we could see a clear advantage to it, which led to the fairly minimal approach of enforcing only that we had no multiple inheritance and no extra columns in the children, but that's certainly an arguable position. 
Other people argued for more restrictions, I believe out of a desire to create more administrative simplicity, but there is a risk of cutting yourself off from useful configurations there, and it seems very difficult to me to draw a hard line between what is useful and what is useless. For example, consider a hash-partitioned table. Could it make sense to have some but not all partitions be unlogged? I think it could. Suppose you have a cluster of machines each of which has a replica of the same hash-partitioned table. Each server uses logged tables for the partitions for which it is the authoritative source of information, and unlogged tables for the others. In the event of a crash, the data for any tables that are lost is replicated from the master for that machine. I can think of some disadvantages of that design, but I can think of some advantages, too, and I think it's pretty hard to say that nobody should ever want to do it. And if it's legitimate to want to do that, then what if I want to use trigger-based replication rather than logical replication? Then I might need triggers on some partitions but not all, or maybe different triggers on different partitions. Even for a permissions grant, suppose my production system is having some problem that can't be replicated on the test data set. Is it reasonable to want to give a trusted developer access to a slice of, but not all of, my production data? I could allow them access to just one partition. Maybe not a common desire, but is that enough reason to ban it? I'd say it's arguable. I don't think that there are bright lines around any of this stuff. My experience with this area has led me to give up on the idea of complete uniformity as impractical, and instead look at it from the perspective of "what do we absolutely have to ban in order for this to be sane?". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thursday, May 18, 2017, Robert Haas <robertmhaas@gmail.com> wrote:
> My experience with this area has led
> me to give up on the idea of complete uniformity as impractical, and
> instead look at it from the perspective of "what do we absolutely have
> to ban in order for this to be sane?".
I could agree to something like that. Let's explore some of the challenges there and potential solutions:

1. Dump/reload of hash partitioned data.

Falling back to restore-through-the-root seems like a reasonable answer here. Moving to a different encoding is not an edge case, but it's not common either, so a performance penalty seems acceptable. I'm not immediately sure how we'd implement this in pg_dump/restore, so I'd feel a little more comfortable if I saw a sketch.

2. Having a lot of hash partitions would be cumbersome

The user would need to create and manage each partition, and try to do global operations in a sane way. The normal case would probably involve scripts to do things like add an index to all partitions, or a column. Many partitions would also just pollute the namespace unless you remember to put them in a separate schema (yes, it's easy, but most people will still forget). Some syntax sugar would go a long way here.

3. The user would need to specify details they really don't care about for each partition.

Things like "modulus 16, remainder 0", "modulus 16, remainder 1" are tedious boilerplate. And if the user makes a mistake, then 1/16 of inserts start failing. Probably would be caught during testing, but not exactly a good user experience. I'm not thrilled about this, considering that all the user really wants is 16 partitions, but it's not the end of the world.

4. Detach is a foot-gun

If you detach a partition, random inserts will start failing. Not thrilled about this, but a hapless user would accept most of the blame if they stumble over it. Another way of saying this is with hash partitioning you really need the whole set for the table to be online at all. But we can't really enforce that, because it would limit some of the flexibility that you have in mind.
Stepping back, your approach might be closer to the general postgres philosophy of allowing the user to assemble from spare parts first, then a few releases later we offer some pre-built subassemblies, and a few releases later we make the typical cases work out of the box. I'm fine with it as long as we don't paint ourselves into a corner.
Of course we still have work to do on the hash functions. We should solve at least the most glaring portability problems, and try to harmonize the hash opfamilies. If you agree, I can put together a patch or two.
Regards,
Jeff Davis
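To make "harmonize the hash opfamilies" concrete, here is a minimal sketch of the downcast rule proposed earlier in the thread: an int8 value that fits in int4 range is hashed from the same 4 bytes an int4 hash would see, so the integer hash functions agree on common values and can share an opfamily. This is illustrative Python only; FNV-1a is a stand-in mixer, not the hash PostgreSQL actually uses.

```python
# Sketch only: FNV-1a stands in for the real hash function.
def fnv1a32(data: bytes) -> int:
    h = 0x811C9DC5
    for b in data:
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    return h

INT4_MIN, INT4_MAX = -2**31, 2**31 - 1

def hash_int4(v: int) -> int:
    return fnv1a32(v.to_bytes(4, "little", signed=True))

def hash_int8(v: int) -> int:
    # Downcast rule: if the value fits in int4, hash exactly the
    # bytes hash_int4 would see, so the two functions agree and can
    # live in one opfamily; otherwise hash the full 64 bits.
    if INT4_MIN <= v <= INT4_MAX:
        return hash_int4(v)
    return fnv1a32(v.to_bytes(8, "little", signed=True))
```

The same rule would extend to numeric: downcast to int4/int8 when the value is exactly representable in that range, otherwise hash the wider representation.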
On Fri, May 19, 2017 at 2:36 AM, Jeff Davis <pgsql@j-davis.com> wrote: > I could agree to something like that. Let's explore some of the challenges > there and potential solutions: > > 1. Dump/reload of hash partitioned data. > > Falling back to restore-through-the-root seems like a reasonable answer > here. Moving to a different encoding is not an edge case, but it's not > common either, so a performance penalty seems acceptable. I'm not > immediately sure how we'd implement this in pg_dump/restore, so I'd feel a > little more comfortable if I saw a sketch. Right, I think this needs some investigation. I can't whip up a sketch on short notice, but I'll see if someone else at EnterpriseDB can work on it unless somebody else wants to take a crack at it. > 2. Having a lot of hash partitions would be cumbersome > > The user would need to create and manage each partition, and try to do > global operations in a sane way. The normal case would probably involve > scripts to do things like add an index to all partitions, or a column. Many > partitions would also just pollute the namespace unless you remember to put > them in a separate schema (yes, it's easy, but most people will still > forget). Some syntax sugar would go a long way here. I agree. Adding a column already cascades to all children, and there's a proposal to make the same thing true for indexes. See discussion beginning at http://postgr.es/m/c8fe4f6b-ff46-aae0-89e3-e936a35f0cfd@postgrespro.ru I haven't had time to review the code posted there yet, but I would like to see something along the lines discussed there committed to v11, and hopefully also something around foreign keys. It should be possible to create an outbound foreign key on a partitioned table and have that cascade to all children. 
Conversely, it should also be possible to create a foreign key referencing a partitioned table provided that the foreign key references the partitioning key, and that there's a unique index on those same columns on every partition. (Referencing a key that is not the partitioning key will have to wait for global indexes, I think.) These things are useful not just for hash partitioning, but also for list and range partitioning, and we'll save a lot of work if we can use the same infrastructure for all of those cases. > 3. The user would need to specify details they really don't care about for > each partition. > > Things like "modulus 16, remainder 0", "modulus 16, remainder 1" are tedious > boilerplate. And if the user makes a mistake, then 1/16 of inserts start > failing. Probably would be caught during testing, but not exactly a good > user experience. I'm not thrilled about this, considering that all the user > really wants is 16 partitions, but it's not the end of the world. As I said on the hash partitioning thread, I think the way to handle this is to get the basic feature in first, then add some convenience syntax to auto-create the partitions. That shouldn't be very difficult to implement; I just didn't want to complicate things more than necessary for the first version. The same issue arose when discussing list and range partitioning: Oracle has syntax like ours, but with the ability (requirement?) to create the partitions in the initial CREATE TABLE statement. However, I wanted to make sure we had the "inconvenient" syntax fully working and fully tested before we added that, because I believe pg_dump needs to dump everything out the long way. > 4. Detach is a foot-gun > > If you detach a partition, random inserts will start failing. Not thrilled > about this, but a hapless user would accept most of the blame if they > stumble over it. Another way of saying this is with hash partitioning you > really need the whole set for the table to be online at all. 
But we can't > really enforce that, because it would limit some of the flexibility that you > have in mind. Yes, I agree with all of that. I don't think it's really going to be a problem in practice. The only reason to detach a hash partition is if you want to split it, and we may eventually have convenience syntax to do that in an automated (i.e. less error-prone) way. If somebody does it manually and without a plan for putting back a replacement partition, they may be sad, but if somebody puts a CHECK (false) constraint on their table, they may be sad about that, too. It's more important to allow for flexibility than to prohibit every stupid thing somebody might try to do. Also, documentation helps. We've got a chapter on partitioning and it can be expanded to discuss these kinds of issues. > Stepping back, your approach might be closer to the general postgres > philosophy of allowing the user to assemble from spare parts first, then a > few releases later we offer some pre-built subassemblies, and a few releases > later we make the typical cases work out of the box. I'm fine with it as > long as we don't paint ourselves into a corner. That's basically my thinking here. Right now, our partitioning is primitive in numerous ways, and so the rough edges are pretty visible. However, I believe that with careful design we can file down many of those rough edges over time. Now, it's probably never going to be quite as smooth as if the system had been designed for partitions from the ground up, or at least not any time in the foreseeable future, but I think it can still be very good. 
If we add the above-referenced logic for index and foreign key handling, convenience syntax for creating partitions along with tables and for data-movement operations, better partition pruning in the planner, run-time partition pruning in the executor, partitionwise join and aggregate, MergeAppend -> Append strength reduction when the required sort order matches the partition order, UPDATE tuple routing, default partitions, and so on, I believe that partitioning will go from "eh, that's better than inheritance" to "hey, that's actually really good". There is a ton of work to do there, but I think it is all doable within the current infrastructure, and all of those things can be (and in a number of cases already are being) worked on as separate patches, so we can get into each release what is ready and push off to the next release what isn't. As we go, we'll have fewer and fewer cases where partitioning a table regresses performance and more and more cases where the stuff you want to do just works. Probably the toughest nut to crack is global indexes. An argument was made on this very mailing list a number of years ago that nobody should want global indexes because $REASONS, but I was a little skeptical of that argument at the time and it's been clear to me that EnterpriseDB's customers, at least, do not in any way accept those arguments. It works on Oracle or other systems they use, and they find it useful enough that they're unwilling to be without it. If PostgreSQL provides the same functionality, they'll use it for more things than if it doesn't. I find myself more than a bit intimidated at the prospect of actually trying to make this work, but I've been convinced that people won't stop asking until it does. > Of course we still have work to do on the hash functions. We should solve at > least the most glaring portability problems, and try to harmonize the hash > opfamilies. If you agree, I can put together a patch or two. 
I definitely agree with solving the portability problems to the extent that we can reasonably do so. I think adding more cross-type hash opfamilies is a mildly good thing: I don't object to it, it probably makes sense to do at the same time as any other hashing changes we want to make, and it's better than not doing it. At the same time, I wouldn't walk over hot coals for it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
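The "modulus 16, remainder 0" boilerplate and the detach foot-gun discussed above can be made concrete with a toy routing sketch. This is illustrative Python only: the names are made up, and Python's built-in hash stands in for the real partition hash function.

```python
def route(key, modulus, partitions):
    """partitions maps remainder -> partition name; a row goes to the
    partition whose declared remainder equals hash(key) % modulus."""
    r = hash(key) % modulus          # stand-in for the real hash
    try:
        return partitions[r]
    except KeyError:
        raise LookupError(f"no partition accepts remainder {r}") from None

# Sixteen partitions, but one remainder is missing -- either mistyped
# at CREATE time or detached later:
parts = {r: f"orders_p{r}" for r in range(16)}
del parts[3]

# Roughly 1/16 of inserts now fail to route:
failures = sum(1 for k in range(10_000) if (hash(k) % 16) == 3)
```

This is exactly why convenience syntax that generates the full (modulus, remainder) set, and documentation around detach, matter: the table is only fully online when every remainder is covered.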
On Fri, May 12, 2017 at 1:35 PM, Joe Conway <mail@joeconway.com> wrote: >> That's a good point, but the flip side is that, if we don't have >> such a rule, a pg_dump of a hash-partitioned table on one >> architecture might fail to restore on another architecture. Today, I >> believe that, while the actual database cluster is >> architecture-dependent, a pg_dump is architecture-independent. Is it >> OK to lose that property? > > Not from where I sit. It was pointed out at PGCon that we've actually already crossed this Rubicon. If you create a range-partitioned table, put a bunch of data into it, and then try to reload it on another system with a different set of encoding definitions, the proper partition for some row might be different. That would result in the dump failing to reload with a complaint about the partition key being violated. And, in fact, you could have the exact same issue on earlier releases which don't have partitioning, because a CHECK constraint of the form (a >= 'something' AND b < 'something else') could be true under one encoding and false under another, and you could define such a constraint on any release (from this millennium, anyway). I'm not actually aware of an instance where this has bitten anyone, even though it seems like it certainly could have and maybe should've gotten somebody at some point. Has anyone else? I think it's a reasonable guess that such problems will become more common with the advent of partitioning and more common still as we continue to improve partitioning, because people who otherwise would have given up on PostgreSQL -- or at least on partitioning -- will actually try to use it in earnest and then hit this problem. However, my guess is that it will still be pretty rare, and that having an optional --dump-partition-data-with-parent flag that can be used when it crops up will be an adequate workaround for most people. Of course, that is just my opinion. 
So now I think that the right way to think about the issues around hash partitioning is as a new variant of a problem that we already have rather than an altogether new problem. IMHO, the right questions are: 1. Are the new problems worse than the old ones? 2. What could we do about it? On #1, I'd say tentatively yes. The problem of portability across encodings is probably not new. Suppose you have a table partitioned by range, either using the new range partitioning or using the old table inheritance method and CHECK constraints. If you move that table to a different encoding, will the collation behavior you get under the new encoding match the collation behavior you got under the old encoding? The documentation says: "Also, a collation is tied to a character set encoding (see Section 22.3). The same collation name may exist for different encodings", which makes it sound like it is possible but not guaranteed. Even if the same collation name happens to exist, there's no guarantee it behaves the same way under the new encoding, and given our experiences with glibc so far, I'd bet against it. If glibc doesn't even think strcoll() and strxfrm() need to agree with each other for the same collation, or that minor releases shouldn't whack the behavior around, there doesn't seem to be room for optimism about the possibility that they carefully preserve behavior across similarly-named collations on different encodings. On the other hand, collation rules probably tend to vary only around the edges, so there's a reasonably good chance that even if the collation rules change when you switch encodings, every row will still get put into the same partition as before. If we implement hashing for hash partitioning in some trivial way like hashing the raw bytes, that will most certainly not be true -- *most* rows will move to a different partition when you switch encodings. 
Furthermore, neither range nor list partitioning depends on properties of the hardware, like how wide integers are, or whether they are stored big-endian. A naive approach to hash partitioning would depend on those things. That's clearly worse.

On #2, I'll attempt to list the approaches that have been proposed so far:

1. Don't implement hash partitioning (Tom Lane). I don't think this proposal will attract many votes.

2. Add an option like --dump-partition-data-with-parent. I'm not sure who originally proposed this, but it seems that everybody likes it. What we disagree about is the degree to which it's sufficient. Jeff Davis thinks it doesn't go far enough: what if you have an old plain-format dump that you don't want to hand-edit, and the source database is no longer available? Most people involved in the unconference discussion of partitioning at PGCon seemed to feel that wasn't really something we should worry about too much. I had been taking that position also, more or less because I don't see that there are better alternatives. For instance, Jeff proposed having the COPY command specify both the parent and the child and providing a run-time option of some kind to decide which table name gets used, but I think that would be a fairly unpleasant syntax wart with some risk of breaking other cases (e.g. what if you want to restore a single child on a system where the parent doesn't exist?). Someone else in the room had a different take on why we shouldn't worry about this problem, which I'll try to summarize: "Well, encoding conversions are already so painful that if you're laboring under any illusion that it's all just going to work, you're wrong, and this isn't going to make anything materially worse."

3. Implement portable hash functions (Jeff Davis or me, not sure which). Andres scoffed at this idea, but I still think it might have legs. 
Coming up with a hashing algorithm for integers that produces the same results on big-endian and little-endian systems seems pretty feasible, even with the additional constraint that it should still be fast. Coming up with a hashing algorithm that produces the same results for the various encodings of the same characters is definitely feasible in any case where we know how to do encoding conversion among all relevant encodings, but it is probably hard to make fast. Those two things also solve different parts of the problem; one is insulating the user from a difference in hardware architecture, while the other is insulating the user from a difference in user-selected settings. I think that the first of those things is more important than the second, because it's easier to change your settings than it is to change your hardware. Also, I think that there may be room for allowing for some user choice on this point; if people are willing to write multiple hash opfamilies with different portability-performance trade-offs, we can offer them all, either in core or contrib, and users can pick the ones that have the properties they want. My personal guess is that most people will prefer the fast hash functions over the ones that solve their potential future migration problems, but, hey, options are good. I think that some combination of #2 and #3 is probably as well as we're going to do here, and personally I think we can make that pretty good. Users who use PostgreSQL 11 to do hash partitioning using a composite key consisting of one float8 column and one SJIS-encoded text column on a VAX and attach to each partition a CHECK constraint that happens to pass only because of the vagaries of which rows end up in which partitions, and who then try to migrate to Linux64 with a numeric column and a UTF-8 encoded text column with the same CHECK constraints will have some problems to work out. 
However, for the vast majority of people, reloading the data through the top-parent is going to be good enough, and the more portable we can make the hash functions, the fewer people will need to do even that much. Of course, if there are other things we can do for a reasonable investment of effort to make things better still, great! I'm open to suggestions. But I don't think we need to panic about this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017-06-01 13:59:42 -0400, Robert Haas wrote: > I'm not actually aware of an instance where this has bitten anyone, > even though it seems like it certainly could have and maybe should've > gotten somebody at some point. Has anyone else? Two comments: First, citus has been doing hash-partitioning and append/range partitioning for a while now, and I'm not aware of anyone being bitten by this (although there've been plenty other things ;)), even though there've been cases upgrading to different collation & encodings. Secondly, I think that's to a significant degree caused by the fact that in practice people way more often partition on types like int4/int8/date/timestamp/uuid rather than text - there's rarely good reasons to do the latter. > Furthermore, neither range nor list partitioning depends on properties > of the hardware, like how wide integers are, or whether they are > stored big-endian. A naive approach to hash partitioning would depend > on those things. That's clearly worse. I don't think our current int4/8 hash functions depend on FLOAT8PASSBYVAL. > 3. Implement portable hash functions (Jeff Davis or me, not sure > which). Andres scoffed at this idea, but I still think it might have > legs. Coming up with a hashing algorithm for integers that produces > the same results on big-endian and little-endian systems seems pretty > feasible, even with the additional constraint that it should still be > fast. Just to clarify: I don't think it's a problem to do so for integers and most other simple scalar types. There are plenty of hash algorithms that are endianness independent, and the rest is just a bit of care. Where I see a lot more issues is doing so for more complex types like arrays, jsonb, postgis geometry/geography types and the like, where the fast and simple implementation is to just hash the entire datum - and that'll very commonly not be portable at all due to padding and type wideness differences. 
> My personal guess is that most people will prefer the fast > hash functions over the ones that solve their potential future > migration problems, but, hey, options are good. I'm pretty sure that will be the case. I'm not sure that adding infrastructure to allow for something that nobody will use in practice is a good idea. If there ends up being demand for it, we can still go there. I think that the number of people that migrate between architectures is low enough that this isn't going to be a very common issue. Having some feasible way around this is important, but I don't think we should optimize heavily for it by developing new infrastructure / complicating experience for the 'normal' uses. Greetings, Andres Freund
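Andres's padding point can be seen without any hashing at all: the raw in-memory bytes of a composite value include ABI-dependent alignment padding, while a packed, fixed-byte-order serialization does not. A small illustration using Python's struct module (nothing PostgreSQL-specific):

```python
import struct

# An (int32, int64) pair, laid out two ways:
native = struct.pack("@iq", 7, 7)     # native byte order + native alignment
canonical = struct.pack("<iq", 7, 7)  # little-endian, packed, no padding

# canonical is always 12 bytes (4 + 8). native is typically 16 bytes
# on 64-bit ABIs, because 4 alignment padding bytes precede the int64.
# Hashing the `native` layout bakes the platform's padding and
# alignment rules into the hash value, which is why "just hash the
# whole datum" is fast but not portable for arrays, jsonb, geometry,
# and similar variable-layout types.
```

A portable hash for such types would have to walk the value and hash a canonical serialization instead, which is exactly the extra cost being weighed in this thread.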
On 06/01/2017 11:25 AM, Andres Freund wrote: > On 2017-06-01 13:59:42 -0400, Robert Haas wrote: >> My personal guess is that most people will prefer the fast >> hash functions over the ones that solve their potential future >> migration problems, but, hey, options are good. > > I'm pretty sure that will be the case. I'm not sure that adding > infrastructure to allow for something that nobody will use in practice > is a good idea. If there ends up being demand for it, we can still go there. > > I think that the number of people that migrate between architectures is > low enough that this isn't going to be a very common issue. Having some > feasible way around this is important, but I don't think we should > optimize heavily for it by developing new infrastructure / complicating > experience for the 'normal' uses. +1 Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
On Thu, Jun 1, 2017 at 10:59 AM, Robert Haas <robertmhaas@gmail.com> wrote: > 1. Are the new problems worse than the old ones? > > 2. What could we do about it? Exactly the right questions. 1. For range partitioning, I think it's "yes, a little". As you point out, there are already some weird edge cases -- the main way range partitioning would make the problem worse is simply by having more users. But for hash partitioning I think the problems will become more substantial. Different encodings, endian issues, etc. will be a headache for someone, and potentially a bad day if they are urgently trying to restore on a new machine. We should remember that not everyone is a full-time postgres DBA, and users might reasonably think that the default options to pg_dump[all] will give them a portable dump. 2. I basically see two approaches to solve the problem: (a) Tom suggested at PGCon that we could have a GUC that automatically causes inserts to the partition to be re-routed through the parent. We could discuss whether to always route through the parent, or do a recheck on the partition constraints and only reroute tuples that will fail it. If the user gets into trouble, the worst that would happen is a helpful error message telling them to set the GUC. I like this idea. (b) I had suggested before that we could make the default text dump (and the default output from pg_restore, for consistency) route through the parent. Advanced users would dump with -Fc, and pg_restore would support an option to do partition-wise loading. To me, this is simpler, but users might forget to use (or not know about) the pg_restore option and then it would load more slowly. Also, the ship is sailing on range partitioning, so we might prefer option (a) just to avoid making any changes. I am fine with either option. > 2. Add an option like --dump-partition-data-with-parent. I'm not sure > who originally proposed this, but it seems that everybody likes it. 
> What we disagree about is the degree to which it's sufficient. Jeff > Davis thinks it doesn't go far enough: what if you have an old > plain-format dump that you don't want to hand-edit, and the source > database is no longer available? Most people involved in the > unconference discussion of partitioning at PGCon seemed to feel that > wasn't really something we should worry about too much. I had been > taking that position also, more or less because I don't see that there > are better alternatives. If the suggestions above are unacceptable, and we don't come up with anything better, then of course we have to move on. I am worrying now primarily because now is the best time to worry; I don't expect any horrible outcome. > 3. Implement portable hash functions (Jeff Davis or me, not sure > which). Andres scoffed at this idea, but I still think it might have > legs. I think it reduces the problem, which has value, but it's hard to make it rock-solid. > make fast. Those two things also solve different parts of the > problem; one is insulating the user from a difference in hardware > architecture, while the other is insulating the user from a difference > in user-selected settings. I think that the first of those things is > more important than the second, because it's easier to change your > settings than it is to change your hardware. Good point. Regards, Jeff Davis
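Option 2(a) above -- recheck the partition constraint and reroute only the tuples that would fail it -- can be sketched in a few lines. This is toy code: the names are invented, Python's built-in hash stands in for the real partition hash, and the proposed GUC is reduced to a boolean flag.

```python
def insert_into_partition(key, declared_remainder, modulus, reroute=False):
    """Insert aimed directly at one hash partition. If the row fails
    that partition's (modulus, remainder) constraint, either raise an
    error (today's behavior) or, with the proposed GUC enabled,
    re-route it as if it had been inserted through the parent."""
    r = hash(key) % modulus              # stand-in partition hash
    if r == declared_remainder:
        return f"p{declared_remainder}"  # constraint satisfied: keep it
    if reroute:
        return f"p{r}"                   # rerouted via the parent
    raise ValueError(
        "new row violates partition constraint; enable rerouting to proceed")
```

The recheck-first variant keeps the fast path (most reloaded rows still land where the dump put them) and pays the routing cost only for rows whose hash changed between source and destination.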
On Thu, Jun 1, 2017 at 11:25 AM, Andres Freund <andres@anarazel.de> wrote: > Secondly, I think that's to a significant degree caused by > the fact that in practice people way more often partition on types like > int4/int8/date/timestamp/uuid rather than text - there's rarely good > reasons to do the latter. Once we support more pushdowns to partitions, the only question is: what are your join keys and what are your grouping keys? Text is absolutely a normal join key or group key. Consider joins on a user ID or grouping by a model number. Regards, Jeff Davis
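Text as a partition key matters for portability because a string only has a byte representation relative to an encoding. A two-line illustration of why any byte-wise text hash changes under encoding conversion:

```python
s = "Grüße"                   # the same characters...
utf8 = s.encode("utf-8")      # ...are 7 bytes in UTF-8
latin1 = s.encode("latin-1")  # ...but 5 bytes in LATIN1

# Any hash computed over the raw bytes therefore changes when the
# database is converted between encodings. A portable text hash would
# have to normalize first (e.g. always hash the UTF-8 form), which is
# feasible but costs a conversion per value.
```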
On Fri, Jun 2, 2017 at 1:24 AM, Jeff Davis <pgsql@j-davis.com> wrote: > 1. For range partitioning, I think it's "yes, a little". As you point > out, there are already some weird edge cases -- the main way range > partitioning would make the problem worse is simply by having more > users. I agree. > But for hash partitioning I think the problems will become more > substantial. Different encodings, endian issues, etc. will be a > headache for someone, and potentially a bad day if they are urgently > trying to restore on a new machine. We should remember that not > everyone is a full-time postgres DBA, and users might reasonably think > that the default options to pg_dump[all] will give them a portable > dump. I agree to an extent. I think the problem will be worse for hash partitioning but I might disagree with you on how much worse. I think that most people don't do encoding conversions very often, and that those who do know (or should know) enough to expect trouble. I think most people do endian-ness conversions almost never, but since that's a matter of hardware not configuration I'd like to paper over that case if we can. > 2. I basically see two approaches to solve the problem: > (a) Tom suggested at PGCon that we could have a GUC that > automatically causes inserts to the partition to be re-routed through > the parent. We could discuss whether to always route through the > parent, or do a recheck on the partition constraints and only reroute > tuples that will fail it. If the user gets into trouble, the worst > that would happen is a helpful error message telling them to set the > GUC. I like this idea. Yeah, that's not crazy. I find it a bit surprising in terms of the semantics, though. SET when_i_try_to_insert_into_a_specific_partition_i_dont_really_mean_it = true? > (b) I had suggested before that we could make the default text dump > (and the default output from pg_restore, for consistency) route > through the parent. 
Advanced users would dump with -Fc, and pg_restore > would support an option to do partition-wise loading. To me, this is > simpler, but users might forget to use (or not know about) the > pg_restore option and then it would load more slowly. Also, the ship > is sailing on range partitioning, so we might prefer option (a) just > to avoid making any changes. I think this is a non-starter. The contents of the dump shouldn't depend on the format chosen; that is bound to confuse somebody. I also do not wish to inflict a speed penalty on the users of plain-format dumps. >> 2. Add an option like --dump-partition-data-with-parent. I'm not sure >> who originally proposed this, but it seems that everybody likes it. >> What we disagree about is the degree to which it's sufficient. Jeff >> Davis thinks it doesn't go far enough: what if you have an old >> plain-format dump that you don't want to hand-edit, and the source >> database is no longer available? Most people involved in the >> unconference discussion of partitioning at PGCon seemed to feel that >> wasn't really something we should worry about too much. I had been >> taking that position also, more or less because I don't see that there >> are better alternatives. > > If the suggestions above are unacceptable, and we don't come up with > anything better, then of course we have to move on. I am worrying now > primarily because now is the best time to worry; I don't expect any > horrible outcome. OK. >> 3. Implement portable hash functions (Jeff Davis or me, not sure >> which). Andres scoffed at this idea, but I still think it might have >> legs. > > I think it reduces the problem, which has value, but it's hard to make > it rock-solid. I agree. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 06/02/2017 05:47 AM, Robert Haas wrote:
> On Fri, Jun 2, 2017 at 1:24 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>> 2. I basically see two approaches to solve the problem:
>> (a) Tom suggested at PGCon that we could have a GUC that automatically causes inserts to the partition to be re-routed through the parent. We could discuss whether to always route through the parent, or do a recheck on the partition constraints and only reroute tuples that will fail it. If the user gets into trouble, the worst that would happen is a helpful error message telling them to set the GUC. I like this idea.
>
> Yeah, that's not crazy. I find it a bit surprising in terms of the semantics, though. SET when_i_try_to_insert_into_a_specific_partition_i_dont_really_mean_it = true?

Maybe

SET partition_tuple_retry = true;
-or-
SET partition_tuple_reroute = true;

?

I like the idea of only rerouting tuples that fail the partition constraint, although I can envision use cases where you essentially want to re-partition and therefore reroute everything, leading to:

SET partition_tuple_reroute = (none | error | all);

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
On Fri, Jun 2, 2017 at 10:19 AM, Joe Conway <mail@joeconway.com> wrote: >> Yeah, that's not crazy. I find it a bit surprising in terms of the >> semantics, though. SET >> when_i_try_to_insert_into_a_specific_partition_i_dont_really_mean_it = >> true? > > Maybe > SET partition_tuple_retry = true; > -or- > SET partition_tuple_reroute = true; > ? > > I like the idea of only rerouting when failing constraints although I > can envision where there might be use cases where you essentially want > to re-partition and therefore reroute everything, leading to: > > SET partition_tuple_reroute = (none | error | all); Personally, I think it's more elegant to make this a pg_dump option than to make it a server GUC, but I'm not going to spend time fighting the server GUC idea if other people like it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 1, 2017 at 2:25 PM, Andres Freund <andres@anarazel.de> wrote:
> Just to clarify: I don't think it's a problem to do so for integers and most other simple scalar types. There's plenty of hash algorithms that are endianness independent, and the rest is just a bit of care.

Do you have any feeling for which of those endianness-independent hash functions might be a reasonable choice for us?

https://github.com/markokr/pghashlib implements various hash functions for PostgreSQL, and claims that, of those implemented, crc32, Jenkins, lookup3be and lookup3le, md5, and siphash24 are endian-independent.

An interesting point here is that Jeff Davis asserted in the original post on this thread that our existing hash_any() wasn't portable, but our current hash_any seems to be the Jenkins algorithm -- so I'm confused. Part of the problem seems to be that, according to https://en.wikipedia.org/wiki/Jenkins_hash_function there are 4 of those. I don't know whether the one in pghashlib is the same one we've implemented.

Kenneth Marshall suggested xxhash as an endian-independent algorithm upthread. Code for that is available under a 2-clause BSD license.

PostgreSQL page checksums use an algorithm based on, but not exactly, FNV-1a. See storage/checksum_impl.h. The comments there say this algorithm was chosen with speed in mind. Our version is not endian-independent because it folds in 4-byte integers rather than 1-byte integers, but plain old FNV-1a *is* endian-independent and could be used.

We also have an implementation of CRC32C in core - see port/pg_crc32.h and port/pg_crc32c_sb8.c. It's not clear to me whether this is endian-independent or not, although there is stuff that depends on WORDS_BIGENDIAN, so, uh, maybe?
Some other possibly-interesting links: https://research.neustar.biz/2011/12/29/choosing-a-good-hash-function-part-2/ http://greenrobot.org/essentials/features/performant-hash-functions-for-java/comparison-of-hash-functions/ https://www.strchr.com/hash_functions Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2017-08-03 17:09:41 -0400, Robert Haas wrote:
> On Thu, Jun 1, 2017 at 2:25 PM, Andres Freund <andres@anarazel.de> wrote:
>> Just to clarify: I don't think it's a problem to do so for integers and most other simple scalar types. There's plenty of hash algorithms that are endianness independent, and the rest is just a bit of care.
>
> Do you have any feeling for which of those endianness-independent hash functions might be a reasonable choice for us?

Not a strong / very informed one, TBH.

I'm not convinced it's worth trying to achieve this in the first place, now that we "nearly" have the insert-via-parent feature. Isn't this a lot of work, for very little practical gain? Having to select that when switching architectures still seems unproblematic. People will just about never switch endianness, so even a tiny performance & effort overhead doesn't seem worth it to me.

Leaving that aside:

> https://github.com/markokr/pghashlib implements various hash functions for PostgreSQL, and claims that, of those implemented, crc32, Jenkins, lookup3be and lookup3le, md5, and siphash24 are endian-independent.
>
> An interesting point here is that Jeff Davis asserted in the original post on this thread that our existing hash_any() wasn't portable, but our current hash_any seems to be the Jenkins algorithm -- so I'm confused. Part of the problem seems to be that, according to https://en.wikipedia.org/wiki/Jenkins_hash_function there are 4 of those. I don't know whether the one in pghashlib is the same one we've implemented.

IIUC lookup3be/le from Marko's hashlib just has an endianness conversion added. I'd personally not go for jenkins3; it's not particularly fast, nor does it balance that out w/ being cryptographically secure.

> Kenneth Marshall suggested xxhash as an endian-independent algorithm upthread. Code for that is available under a 2-clause BSD license.

Yea, that'd have been one of my suggestions too. Especially as I still want to implement better compression using lz4, and that'll depend on xxhash in turn.

> PostgreSQL page checksums use an algorithm based on, but not exactly, FNV-1a. See storage/checksum_impl.h. The comments there say this algorithm was chosen with speed in mind. Our version is not endian-independent because it folds in 4-byte integers rather than 1-byte integers, but plain old FNV-1a *is* endian-independent and could be used.

Non-SIMDed (which we hope to achieve with our implementation, which is why we have separate compiler flags for that file) implementations of FNV are, to my knowledge, not particularly fast. And the SIMD tricks are, to my knowledge, not really applicable to the case at hand here. So I'm not a fan of choosing FNV.

> We also have an implementation of CRC32C in core - see port/pg_crc32.h and port/pg_crc32c_sb8.c. It's not clear to me whether this is endian-independent or not, although there is stuff that depends on WORDS_BIGENDIAN, so, uh, maybe?

The results should be endian independent. It depends on WORDS_BIGENDIAN because we need different pre-computed tables depending on endianness.

Greetings,

Andres Freund
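For what it's worth, the endianness claim about plain FNV-1a is easy to see in code: it consumes the input strictly one byte at a time, so the result depends only on the byte sequence, never on the host's word layout. A minimal sketch for illustration (this is textbook FNV-1a, not the word-folding variant in storage/checksum_impl.h):

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Textbook 32-bit FNV-1a: folds one byte per step, so the same byte
 * string hashes identically on big- and little-endian machines.
 * Hashing a binary value (say, an int4) is still only portable if the
 * caller serializes it to a fixed byte order first.
 */
static uint32_t
fnv1a_32(const unsigned char *data, size_t len)
{
    uint32_t    hash = 0x811C9DC5;  /* FNV-1a 32-bit offset basis */
    size_t      i;

    for (i = 0; i < len; i++)
    {
        hash ^= data[i];
        hash *= 0x01000193;         /* FNV 32-bit prime */
    }
    return hash;
}
```

For example, fnv1a_32 over the bytes "abc" yields 0x1A47E90B on any architecture. The speed concern raised above is about exactly this one-multiply-per-byte structure.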
On Thu, Aug 3, 2017 at 5:32 PM, Andres Freund <andres@anarazel.de> wrote: >> Do you have any feeling for which of those endianness-independent hash >> functions might be a reasonable choice for us? > > Not a strong / very informed one, TBH. > > I'm not convinced it's worth trying to achieve this in the first place, > now that we "nearly" have the insert-via-parent feature. Isn't this a > lot of work, for very little practical gain? Having to select that when > switching architectures still seems unproblematic. People will just > about never switch endianess, so even a tiny performance & effort > overhead doesn't seem worth it to me. I kind of agree with you. There are some advantages of being endian-independent, like maybe your hash partitioning is really across multiple shards, and not all the shards are the same machine architecture, but it's not going to come up for most people. For me, the basic point here is that we need a set of hash functions for hash partitioning that are different than what we use for hash indexes and hash joins -- otherwise when we hash partition a table and create hash indexes on each partition, those indexes will have nasty clustering. Partitionwise hash joins will have similar problems. So, a new set of hash functions specifically for hash partitioning is quite desirable. Given that, if it's not a big problem to pick ones that have the portability properties at least some people want, I'd be inclined to do it. If it results in a noticeable slowdown on macrobenchmarks, then not so much, but otherwise, I'd rather do what people are asking for than spend time arguing about it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2017-08-03 17:43:44 -0400, Robert Haas wrote:
> For me, the basic point here is that we need a set of hash functions for hash partitioning that are different than what we use for hash indexes and hash joins -- otherwise when we hash partition a table and create hash indexes on each partition, those indexes will have nasty clustering. Partitionwise hash joins will have similar problems. So, a new set of hash functions specifically for hash partitioning is quite desirable.

Couldn't that just as well be solved by being a bit smarter with an IV? I doubt we want to end up with different hash functions for sharding, partitioning, hashjoins (which seems to form a hierarchy). Having a working hash-combine function, or even better a hash API that can continue to use the hash's internal state, seems a more scalable solution.

Greetings,

Andres Freund
On Thu, Aug 3, 2017 at 5:50 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-08-03 17:43:44 -0400, Robert Haas wrote:
>> For me, the basic point here is that we need a set of hash functions for hash partitioning that are different than what we use for hash indexes and hash joins -- otherwise when we hash partition a table and create hash indexes on each partition, those indexes will have nasty clustering. Partitionwise hash joins will have similar problems. So, a new set of hash functions specifically for hash partitioning is quite desirable.
>
> Couldn't that just as well be solved by being a bit smarter with an IV? I doubt we want to end up with different hash functions for sharding, partitioning, hashjoins (which seems to form a hierarchy). Having a working hash-combine function, or even better a hash API that can continue to use the hash's internal state, seems a more scalable solution.

That's another way to go, but it requires inventing a way to thread the IV through the hash opclass interface. That's actually sort of a problem anyway. Maybe I ought to have started with the question of how we're going to make that end of things work. We could:

- Invent a new hash_partition AM that doesn't really make indexes but supplies hash functions for hash partitioning.
- Add a new, optional support function 2 to the hash AM that takes a value of the type *and* an IV as an argument.
- Something else.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-08-03 17:57:37 -0400, Robert Haas wrote:
> On Thu, Aug 3, 2017 at 5:50 PM, Andres Freund <andres@anarazel.de> wrote:
>> On 2017-08-03 17:43:44 -0400, Robert Haas wrote:
>>> For me, the basic point here is that we need a set of hash functions for hash partitioning that are different than what we use for hash indexes and hash joins -- otherwise when we hash partition a table and create hash indexes on each partition, those indexes will have nasty clustering. Partitionwise hash joins will have similar problems. So, a new set of hash functions specifically for hash partitioning is quite desirable.
>>
>> Couldn't that just as well be solved by being a bit smarter with an IV? I doubt we want to end up with different hash functions for sharding, partitioning, hashjoins (which seems to form a hierarchy). Having a working hash-combine function, or even better a hash API that can continue to use the hash's internal state, seems a more scalable solution.
>
> That's another way to go, but it requires inventing a way to thread the IV through the hash opclass interface.

Only if we really want to do it really well :P. Using a hash_combine() like

/*
 * Combine two hash values, resulting in another hash value, with decent bit
 * mixing.
 *
 * Similar to boost's hash_combine().
 */
static inline uint32
hash_combine(uint32 a, uint32 b)
{
	a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
	return a;
}

between hash(IV) and the hash function should do the trick (the IV needs to be hashed once, otherwise the bit mix is bad).

> That's actually sort of a problem anyway. Maybe I ought to have started with the question of how we're going to make that end of things work.

+1 for that plan.

> We could:
>
> - Invent a new hash_partition AM that doesn't really make indexes but supplies hash functions for hash partitioning.
> - Add a new, optional support function 2 to the hash AM that takes a value of the type *and* an IV as an argument.
> - Something else.

Not arguing for it, but one option could also be to have pg_type.hash* function(s).

One thing that I think might be advisable to think about is that we're atm stuck with a relatively bad hash function for hash indexes (and hash joins/aggs), and we should probably evolve it at some point. At the same time there's currently people out there relying on the current hash functions remaining stable.

Greetings,

Andres Freund
On Thu, Aug 3, 2017 at 6:08 PM, Andres Freund <andres@anarazel.de> wrote:
>> That's another way to go, but it requires inventing a way to thread
>> the IV through the hash opclass interface.
>
> Only if we really want to do it really well :P. Using a hash_combine()
> like
>
> /*
>  * Combine two hash values, resulting in another hash value, with decent bit
>  * mixing.
>  *
>  * Similar to boost's hash_combine().
>  */
> static inline uint32
> hash_combine(uint32 a, uint32 b)
> {
>     a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
>     return a;
> }
>
> between hash(IV) and the hash function should do the trick (the IV needs
> to be hashed once, otherwise the bit mix is bad).

That seems pretty lame, although it's sufficient to solve the immediate problem, and I have to admit to a certain predilection for things that solve the immediate problem without creating lots of additional work.

>> We could:
>>
>> - Invent a new hash_partition AM that doesn't really make indexes but supplies hash functions for hash partitioning.
>> - Add a new, optional support function 2 to the hash AM that takes a value of the type *and* an IV as an argument.
>> - Something else.
>
> Not arguing for it, but one option could also be to have pg_type.hash* function(s).

True. That is a bit less configurable because you can't then have multiple functions for the same type. Going through the opclass interface means you can have hash_portable_ops and hash_fast_ops and let people choose. But this would be easy to implement and enough for most people in practice.

> One thing that I think might be advisable to think about is that we're atm stuck with a relatively bad hash function for hash indexes (and hash joins/aggs), and we should probably evolve it at some point. At the same time there's currently people out there relying on the current hash functions remaining stable.

That to me is a bit of a separate problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
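To make the IV discussion concrete, here is a hedged sketch (standalone code, not a backend API) of how the quoted hash_combine() could seed an existing 32-bit hash function without changing its signature. hash_u32 below is a made-up stand-in for a per-type hash support function, not a real PostgreSQL routine:

```c
#include <stdint.h>

/* The combiner quoted above (boost-style). */
static inline uint32_t
hash_combine(uint32_t a, uint32_t b)
{
    a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
    return a;
}

/* Hypothetical per-type hash function; here just a generic 32-bit mixer. */
static uint32_t
hash_u32(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x85ebca6b;
    x ^= x >> 13;
    x *= 0xc2b2ae35;
    x ^= x >> 16;
    return x;
}

/*
 * Seeded hash: the IV is itself hashed before being combined, per the
 * point above that feeding a raw IV into hash_combine() mixes poorly.
 */
static uint32_t
hash_u32_seeded(uint32_t value, uint32_t iv)
{
    return hash_combine(hash_u32(iv), hash_u32(value));
}
```

Each IV then selects a different hash family while the underlying per-type function stays untouched — which is the appeal of the approach, and also why the IV would still have to be threaded through the opclass interface somehow.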
On Thu, Aug 3, 2017 at 6:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> That seems pretty lame, although it's sufficient to solve the immediate problem, and I have to admit to a certain predilection for things that solve the immediate problem without creating lots of additional work.

After some further thought, I propose the following approach to the issues raised on this thread:

1. Allow hash functions to have a second, optional support function, similar to what we did for btree opclasses in c6e3ac11b60ac4a8942ab964252d51c1c0bd8845. The second function will have a signature of (opclass_datatype, int64) and should return int64. The int64 argument is a salt. When the salt is 0, the low 32 bits of the return value should match what the existing hash support function returns. Otherwise, the salt should be used to perturb the hash calculation. This design kills two birds with one stone: it gives callers a way to get 64-bit hash values if they want them (which should make Tom happy, and we could later think about plugging it into hash indexes) and it gives us a way of turning a single hash function into many (which should allow us to prevent hash indexes or hash tables built on a hash-partitioned table from having a heavily lopsided distribution, and probably will also make people who are interested in topics like Bloom filters happy).

2. Introduce new hash opfamilies which are faster, more portable, and/or better in other ways than the ones we have today. Given our current rather simplistic notion of a "default" opclass, there doesn't seem to be an easy way to make whatever we introduce here the default for hash partitioning while keeping the existing default for other purposes. That should probably be fixed at some point. However, given the amount of debate this topic has generated, it also doesn't seem likely that we'd actually wish to decide on a different default in the v11 release cycle, so I don't think there's any rush to figure out exactly how we want to fix it. Focusing on introducing the new opfamilies at all is probably a better use of time, IMHO.

Unless anybody strongly objects, I'm going to write a patch for #1 (or convince somebody else to do it) and leave #2 for someone else to tackle if they wish. In addition, I'll tackle (or convince someone else to tackle) the project of adding that second optional support function to every hash opclass in the core repository. Then Amul can update the core hash partitioning patch to use the new infrastructure when it's available and fall back to the existing method when it's not, and I think we'll be in reasonably good shape.

Objections to this plan of attack?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> After some further thought, I propose the following approach to the issues raised on this thread:
>
> 1. Allow hash functions to have a second, optional support function, similar to what we did for btree opclasses in c6e3ac11b60ac4a8942ab964252d51c1c0bd8845. The second function will have a signature of (opclass_datatype, int64) and should return int64. The int64 argument is a salt. When the salt is 0, the low 32 bits of the return value should match what the existing hash support function returns. Otherwise, the salt should be used to perturb the hash calculation.

+1

> 2. Introduce new hash opfamilies which are faster, more portable, and/or better in other ways than the ones we have today.

This part seems, uh, under-defined and/or over-ambitious and/or unrelated to the problem at hand. What are the concrete goals?

regards, tom lane
On Wed, Aug 16, 2017 at 12:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> After some further thought, I propose the following approach to the issues raised on this thread:
>>
>> 1. Allow hash functions to have a second, optional support function, similar to what we did for btree opclasses in c6e3ac11b60ac4a8942ab964252d51c1c0bd8845. The second function will have a signature of (opclass_datatype, int64) and should return int64. The int64 argument is a salt. When the salt is 0, the low 32 bits of the return value should match what the existing hash support function returns. Otherwise, the salt should be used to perturb the hash calculation.
>
> +1

Cool.

>> 2. Introduce new hash opfamilies which are faster, more portable, and/or better in other ways than the ones we have today.
>
> This part seems, uh, under-defined and/or over-ambitious and/or unrelated to the problem at hand. What are the concrete goals?

In my view, it's for the person who proposes a new opclass to say what goals they're trying to satisfy with that opclass. A variety of goals that could justify a new opclass have been proposed upthread -- especially in Jeff Davis's original post (q.v.). Such goals could include endian-ness independence, collation-independence, speed, better bit-mixing, and opfamilies that span more types than our current ones. These goals are probably mutually exclusive, in the sense that endian-ness independence and collation-independence are probably pulling in the exact opposite direction as speed, so conceivably there could be multiple opclasses proposed with different trade-offs. I take no position for the present on which of those would be worth accepting into core.

I agree with you that all of this is basically unrelated to the problem at hand, if "the problem at hand" means "hash partitioning". In my mind, there are two really serious issues that have been raised on that front. One is the problem of hash joins/aggregates/indexes on hash-partitioned tables coming out lopsided, and adding an optional salt will let us fix that problem. The other is that if you migrate your data to a different encoding or endianness, you might have dump/reload problems, but IMHO the already-committed patch for --load-via-partition-root is as much as really *needs* to be done there. I am less skeptical about the idea of endianness-independent hash functions than Andres is, but "we can't have $FEATURE_MANY_PEOPLE_WANT until we solve $PROBLEM_ANDRES_THINKS_IS_NOT_PRACTICALLY_SOLVABLE" is not a route to swift victory.

In short, I'm proposing to add a method to seed the existing hash functions and get 64 bits out of them, and leaving any other potential improvements to someone who wants to argue for them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 16, 2017 at 12:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> After some further thought, I propose the following approach to the >> issues raised on this thread: > >> 1. Allow hash functions to have a second, optional support function, >> similar to what we did for btree opclasses in >> c6e3ac11b60ac4a8942ab964252d51c1c0bd8845. The second function will >> have a signature of (opclass_datatype, int64) and should return int64. >> The int64 argument is a salt. When the salt is 0, the low 32 bits of >> the return value should match what the existing hash support function >> returns. Otherwise, the salt should be used to perturb the hash >> calculation. > > +1 Attached is a quick sketch of how this could perhaps be done (ignoring for the moment the relatively-boring opclass pushups). It introduces a new function hash_any_extended which differs from hash_any() in that (a) it combines both b and c into the result and (b) it accepts a seed which is mixed into the initial state if it's non-zero. Comments? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachments
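The contract sketched in the message above — a seed mixed into the initial state, with b and c combined into a 64-bit result — can be illustrated with a toy stand-in. To be clear, this is not hash_any_extended() or any real PostgreSQL code; the hash and the constants are invented purely to show the shape of the extended interface:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy 32-bit byte hash, standing in for the legacy hash function. */
static uint32_t
toy_hash(const unsigned char *k, size_t len)
{
    uint32_t    h = 0x9747b28c;
    size_t      i;

    for (i = 0; i < len; i++)
    {
        h ^= k[i];
        h *= 0x5bd1e995;
        h ^= h >> 15;
    }
    return h;
}

/*
 * Extended variant: returns 64 bits, and folds a seed into the state
 * only when it is nonzero -- so seed == 0 reproduces the legacy value
 * in the low 32 bits, mirroring the proposed contract.  The high word
 * here is derived crudely from the low word; a real implementation
 * would carry genuinely independent internal state.
 */
static uint64_t
toy_hash_extended(const unsigned char *k, size_t len, uint64_t seed)
{
    uint32_t    lo = toy_hash(k, len);
    uint32_t    hi = lo * 0x85ebca6b;

    if (seed != 0)
    {
        lo ^= (uint32_t) seed * 0xcc9e2d51;
        hi ^= (uint32_t) (seed >> 32) * 0x1b873593;
    }
    return ((uint64_t) hi << 32) | lo;
}
```

Callers wanting the old behavior pass seed 0 and take the low 32 bits; callers wanting a differently-salted hash family pass any nonzero seed.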
Robert Haas <robertmhaas@gmail.com> writes: > Attached is a quick sketch of how this could perhaps be done (ignoring > for the moment the relatively-boring opclass pushups). It introduces > a new function hash_any_extended which differs from hash_any() in that > (a) it combines both b and c into the result and (b) it accepts a seed > which is mixed into the initial state if it's non-zero. > Comments? Hm. Despite the comment at lines 302-304, I'm not sure that we ought to do this simply by using "b" as the high order bits. AFAICS that exposes little or no additional randomness; in particular it seems unlikely to meet Jenkins' original design goal that "every 1-bit and 2-bit delta achieves avalanche". There might be some simple way to extend the existing code to produce a mostly-independent set of 32 more bits, but I wonder if we wouldn't be better advised to just keep Jenkins' code as-is and use some other method entirely for producing the 32 new result bits. ... In fact, on perusing the linked-to page http://burtleburtle.net/bob/hash/doobs.html Bob says specifically that taking b and c from this hash does not produce a fully random 64-bit result. He has a new one that does, lookup3.c, but probably he hasn't tried to make that bit-compatible with the 1997 algorithm. Your injection of the seed as prepended data seems unassailable from the randomness standpoint. But I wonder whether we could do it more cheaply by xoring the seed into the initial a/b/c values --- it's not very clear whether those are magic in any interesting way, or just some randomly chosen constants. Anyway, I'd certainly suggest that whoever embarks on this for real spend some time perusing Jenkins' website. regards, tom lane
On Wed, Aug 16, 2017 at 05:58:41PM -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > Attached is a quick sketch of how this could perhaps be done (ignoring > > for the moment the relatively-boring opclass pushups). It introduces > > a new function hash_any_extended which differs from hash_any() in that > > (a) it combines both b and c into the result and (b) it accepts a seed > > which is mixed into the initial state if it's non-zero. > > > Comments? > > Hm. Despite the comment at lines 302-304, I'm not sure that we ought > to do this simply by using "b" as the high order bits. AFAICS that > exposes little or no additional randomness; in particular it seems > unlikely to meet Jenkins' original design goal that "every 1-bit and > 2-bit delta achieves avalanche". There might be some simple way to > extend the existing code to produce a mostly-independent set of 32 more > bits, but I wonder if we wouldn't be better advised to just keep Jenkins' > code as-is and use some other method entirely for producing the > 32 new result bits. > > ... In fact, on perusing the linked-to page > http://burtleburtle.net/bob/hash/doobs.html > Bob says specifically that taking b and c from this hash does not > produce a fully random 64-bit result. He has a new one that does, > lookup3.c, but probably he hasn't tried to make that bit-compatible > with the 1997 algorithm. > Hi, The updated hash functions that we currently use are based on Bob Jenkins lookup3.c and using b as the higher order bits is pretty darn good. I had lobbied to present the 64-bit b+c hash in the original work for similar reasons. We are definitely not using a lookup2.c version from 1997. Regards, Ken
Kenneth Marshall <ktm@rice.edu> writes: > On Wed, Aug 16, 2017 at 05:58:41PM -0400, Tom Lane wrote: >> ... In fact, on perusing the linked-to page >> http://burtleburtle.net/bob/hash/doobs.html >> Bob says specifically that taking b and c from this hash does not >> produce a fully random 64-bit result. He has a new one that does, >> lookup3.c, but probably he hasn't tried to make that bit-compatible >> with the 1997 algorithm. > The updated hash functions that we currently use are based on Bob Jenkins > lookup3.c and using b as the higher order bits is pretty darn good. I had > lobbied to present the 64-bit b+c hash in the original work for similar > reasons. We are definitely not using a lookup2.c version from 1997. Oh --- I overlooked the bit about "Bob's 2006 update". Really that comment block should have been completely rewritten, rather than leaving the original text there, especially since as it stands there are only pointers to the old algorithm. regards, tom lane
On Wed, Aug 16, 2017 at 5:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Attached is a quick sketch of how this could perhaps be done (ignoring for the moment the relatively-boring opclass pushups).

Here it is with some relatively-boring opclass pushups added. I just did the int4 bit; the same thing will need to be done, uh, 35 more times. But you get the gist. No, not that kind of gist.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On Fri, Aug 18, 2017 at 8:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 16, 2017 at 5:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Attached is a quick sketch of how this could perhaps be done (ignoring
>> for the moment the relatively-boring opclass pushups).
>
> Here it is with some relatively-boring opclass pushups added. I just
> did the int4 bit; the same thing will need to be done, uh, 35 more
> times. But you get the gist. No, not that kind of gist.
I will work on this.
I have a small query: what if I want a cache entry with the extended hash function instead of the standard one? I might require that while adding the hash_array_extended function. Do you think we need to extend lookup_type_cache() as well?
Regards,
Amul
On Fri, Aug 18, 2017 at 1:12 PM, amul sul <sulamul@gmail.com> wrote: > I have a small query, what if I want a cache entry with extended hash > function instead standard one, I might require that while adding > hash_array_extended function? Do you think we need to extend > lookup_type_cache() as well? Hmm, I thought you had changed the hash partitioning stuff so that it didn't rely on lookup_type_cache(). You have to look up the function using the opclass provided in the partition key definition; lookup_type_cache() will give you the default one for the datatype. Maybe just use get_opfamily_proc? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Aug 18, 2017 at 11:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Aug 18, 2017 at 1:12 PM, amul sul <sulamul@gmail.com> wrote:
>> I have a small query: what if I want a cache entry with the extended hash
>> function instead of the standard one? I might require that while adding the
>> hash_array_extended function. Do you think we need to extend
>> lookup_type_cache() as well?
>
> Hmm, I thought you had changed the hash partitioning stuff so that it
> didn't rely on lookup_type_cache(). You have to look up the function
> using the opclass provided in the partition key definition;
> lookup_type_cache() will give you the default one for the datatype.
> Maybe just use get_opfamily_proc?
Yes, we can do that for the partitioning code, but my concern is a little bit different. I apologize, I wasn't clear enough.
I am trying to extend the hash_array & hash_range functions. The hash_array and hash_range functions calculate the hash by using the respective hash function for the given argument type (i.e. the array/range element type), and those hash functions are made available in the TypeCacheEntry via lookup_type_cache(). But the extended versions of hash_array & hash_range require the respective extended hash functions for those element types.
I have added hash_array_extended & hash_range_extended functions in the attached patch 0001, which maintains a local copy of the TypeCacheEntry with extended hash functions. But I am a little bit skeptical about this logic; any advice/suggestions will be greatly appreciated.
The logic in the rest of the extended hash functions is the same as in the standard ones.
Attaching patch 0002 for the reviewer's testing.
Regards,
Amul
Attachments
On Tue, Aug 22, 2017 at 5:44 PM, amul sul <sulamul@gmail.com> wrote:
> On Fri, Aug 18, 2017 at 11:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Hmm, I thought you had changed the hash partitioning stuff so that it
>> didn't rely on lookup_type_cache(). [...]
>> Maybe just use get_opfamily_proc?
>
> [...]
> I have added hash_array_extended & hash_range_extended function in the attached
> patch 0001, which maintains a local copy of TypeCacheEntry with extended hash
> functions. But I am a little bit skeptical about this logic, any
> advice/suggestions will be greatly appreciated.
Instead, in the attached patch, I have modified lookup_type_cache() to
request the extended hash function in the TypeCacheEntry.
For that, I've introduced new flags TYPECACHE_HASH_EXTENDED_PROC,
TYPECACHE_HASH_EXTENDED_PROC_FINFO & TCFLAGS_CHECKED_HASH_EXTENDED_PROC, and
additional variables in the TypeCacheEntry structure to hold the extended hash
proc information.
The logic in the rest of the extended hash functions is the same as in the
standard ones, and likewise for the hash_array_extended() &
hash_range_extended() functions.
Regards,
Amul
Attachments
On Tue, Aug 22, 2017 at 8:14 AM, amul sul <sulamul@gmail.com> wrote:
> Attaching patch 0002 for the reviewer's testing.

I think that this 0002 is not something we can think of committing
because there's no guarantee that hash functions will return the same
results on all platforms. However, what we could and, I think, should
do is hash some representative values of each data type and verify
that hash(x) and hashextended(x, 0) come out the same at least as to
the low-order 32 bits -- and maybe that hashextended(x, 1) comes out
different as to the low-order 32 bits. The easiest way to do this
seems to be to cast to bit(32). For example:

SELECT v, hashint4(v)::bit(32) as standard,
       hashint4extended(v, 0)::bit(32) as extended0,
       hashint4extended(v, 1)::bit(32) as extended1
FROM (VALUES (0), (1), (17), (42), (550273), (207112489)) x(v)
WHERE hashint4(v)::bit(32) != hashint4extended(v, 0)::bit(32)
   OR hashint4(v)::bit(32) = hashint4extended(v, 1)::bit(32);

I suggest adding a version of this for each data type.

From your latest version of 0001, I get:

rangetypes.c:1297:8: error: unused variable 'rngtypid'
[-Werror,-Wunused-variable]
        Oid rngtypid = RangeTypeGetOid(r);

I suggest not duplicating the comments from each regular function into
the extended function, but just writing /* Same approach as hashfloat8 */
when the implementation is non-trivial (no need for this if the
extended function is a single line or the original function had no
comments anyway).

hash_aclitem seems embarrassingly stupid. I suggest that we make the
extended version slightly less stupid -- i.e. if the seed is non-zero,
actually call hash_any_extended on the sum and pass the seed through.

     * Reset info about hash function whenever we pick up new info about
     * equality operator. This is so we can ensure that the hash function
     * matches the operator.
     */
    typentry->flags &= ~(TCFLAGS_CHECKED_HASH_PROC);
+   typentry->flags &= ~(TCFLAGS_CHECKED_HASH_EXTENDED_PROC);

Adjust comment: "about hash function" -> "about hash functions", "hash
functions matches" -> "hash functions match".

+extern Datum
+hash_any_extended(register const unsigned char *k, register int
+ keylen, uint64 seed);

Formatting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Aug 29, 2017 at 11:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Aug 22, 2017 at 8:14 AM, amul sul <sulamul@gmail.com> wrote:
> Attaching patch 0002 for the reviewer's testing.
I think that this 0002 is not something we can think of committing
because there's no guarantee that hash functions will return the same
results on all platforms. However, what we could and, I think, should
do is hash some representative values of each data type and verify
that hash(x) and hashextended(x, 0) come out the same at least as to
the low-order 32 bits -- and maybe that hashextend(x, 1) comes out
different as to the low-order 32 bits. The easiest way to do this
seems to be to cast to bit(32). For example:
SELECT v, hashint4(v)::bit(32) as standard,
hashint4extended(v, 0)::bit(32) as extended0,
hashint4extended(v, 1)::bit(32) as extended1
FROM (VALUES (0), (1), (17), (42), (550273), (207112489)) x(v)
WHERE hashint4(v)::bit(32) != hashint4extended(v, 0)::bit(32)
OR hashint4(v)::bit(32) = hashint4extended(v, 1)::bit(32);
I suggest adding a version of this for each data type.
Thanks for the suggestion, I have updated the 0002 patch accordingly.
Using this I found some strange behaviours, as follows:

1) The standard and extended0 outputs for the jsonb_hash case are not the same.
2) The standard and extended0 outputs for the hash_range case are not the same
when the input is int4range(550274, 1550274); the other cases in the patch are
fine. This can be reproduced with other values as well, e.g.
int8range(1570275, 208112489).

Will look into this tomorrow.

Another case which I want to discuss: the extended and standard versions of the
hashfloat4, hashfloat8 & hash_numeric functions will have the same output for a
zero value irrespective of the seed. Do you think we need a fix for this?
From your latest version of 0001, I get:
rangetypes.c:1297:8: error: unused variable 'rngtypid'
[-Werror,-Wunused-variable]
Oid rngtypid = RangeTypeGetOid(r);
Fixed in the attached version.
I suggest not duplicating the comments from each regular function into
the extended function, but just writing /* Same approach as hashfloat8
*/ when the implementation is non-trivial (no need for this if the
extended function is a single line or the original function had no
comments anyway).
Fixed in the attached version.
hash_aclitem seems embarrassingly stupid. I suggest that we make the
extended version slightly less stupid -- i.e. if the seed is non-zero,
actually call hash_any_extended on the sum and pass the seed through.
* Reset info about hash function whenever we pick up new info about
* equality operator. This is so we can ensure that the hash function
* matches the operator.
*/
typentry->flags &= ~(TCFLAGS_CHECKED_HASH_PROC);
+ typentry->flags &= ~(TCFLAGS_CHECKED_HASH_EXTENDED_PROC);
Adjust comment: "about hash function" -> "about hash functions", "hash
functions matches" -> "hash functions match".
Fixed in the attached version.
+extern Datum
+hash_any_extended(register const unsigned char *k, register int
+ keylen, uint64 seed);
Formatting.
Fixed in the attached version.
Thanks !
Regards,
Amul
Attachments
On Wed, Aug 30, 2017 at 10:43 AM, amul sul <sulamul@gmail.com> wrote:
> Thanks for the suggestion, I have updated 0002-patch accordingly.
> Using this I found some strange behaviours as follow:
>
> 1) standard and extended0 output for the jsonb_hash case is not same.
> 2) standard and extended0 output for the hash_range case is not same when input
> is int4range(550274, 1550274) other case in the patch are fine. This can be
> reproducible with other values as well, for e.g. int8range(1570275, 208112489).
>
> Will look into this tomorrow.

Those sound like bugs in your patch. Specifically:

+   /* Same approach as hash_range */
+   result = hash_uint32_extended((uint32) flags, seed);
+   result ^= lower_hash;
+   result = (result << 1) | (result >> 63);
+   result ^= upper_hash;

That doesn't give compatible results. The easiest thing to do might
be to rotate the high 32 bits and the low 32 bits separately.
JsonbHashScalarValueExtended has the same problem. Maybe create a
helper function that does something like this (untested):

((x << 1) & UINT64CONST(0xfffffffefffffffe)) | ((x >> 31) & UINT64CONST(0x100000001))

> Another case which I want to discuss is, extended and standard version of
> hashfloat4, hashfloat8 & hash_numeric function will have the same output for
> zero value irrespective of seed value. Do you think we need a fix for this?

Yes, I think you should return the seed rather than 0 in the cases
where the current code hard-codes a 0 return. That will give the same
behavior in the seed == 0 case while cheaply getting us something a
bit different when there is a seed.

BTW, you should probably invent and use a PG_RETURN_UINT64 macro in this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 30, 2017 at 9:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 30, 2017 at 10:43 AM, amul sul <sulamul@gmail.com> wrote:
> Thanks for the suggestion, I have updated 0002-patch accordingly.
> Using this I found some strange behaviours as follow:
>
> 1) standard and extended0 output for the jsonb_hash case is not same.
> 2) standard and extended0 output for the hash_range case is not same when
> input
> is int4range(550274, 1550274) other case in the patch are fine. This can
> be
> reproducible with other values as well, for e.g. int8range(1570275,
> 208112489).
>
> Will look into this tomorrow.
Those sound like bugs in your patch. Specifically:
+ /* Same approach as hash_range */
+ result = hash_uint32_extended((uint32) flags, seed);
+ result ^= lower_hash;
+ result = (result << 1) | (result >> 63);
+ result ^= upper_hash;
Yes, you are correct.
That doesn't give compatible results. The easiest thing to do might
be to rotate the high 32 bits and the low 32 bits separately.
JsonbHashScalarValueExtended has the same problem. Maybe create a
helper function that does something like this (untested):
((x << 1) & UINT64CONST(0xfffffffefffffffe)) | ((x >> 31) & UINT64CONST(0x100000001))
This works as expected; I also tested it by executing the following SQL multiple times:
SELECT v as value, hash_range(v)::bit(32) as standard,
hash_range_extended(v, 0)::bit(32) as extended0,
hash_range_extended(v, 1)::bit(32) as extended1
FROM (VALUES (int8range(floor(random() * 100)::int8, floor(random() * 1000)::int8)),
(int8range(floor(random() * 1000)::int8, floor(random() * 10000)::int8)),
(int8range(floor(random() * 10000)::int8, floor(random() * 100000)::int8)),
(int8range(floor(random() * 10000000)::int8, floor(random() * 100000000)::int8)),
(int8range(floor(random() * 100000000)::int8, floor(random() * 1000000000)::int8))) x(v)
WHERE hash_range(v)::bit(32) != hash_range_extended(v, 0)::bit(32)
OR hash_range(v)::bit(32) = hash_range_extended(v, 1)::bit(32);
> Another case which I want to discuss is, extended and standard version of
> hashfloat4, hashfloat8 & hash_numeric function will have the same output for
> zero
> value irrespective of seed value. Do you think we need a fix for this?
Yes, I think you should return the seed rather than 0 in the cases
where the current code hard-codes a 0 return. That will give the same
behavior in the seed == 0 case while cheaply getting us something a
bit different when there is a seed.
Fixed in the attached version.
BTW, you should probably invent and use a PG_RETURN_UINT64 macro in this patch.
Added in the attached version.
Thanks for your all the suggestions.
Regards,
Amul
Attachments
On Thu, Aug 31, 2017 at 8:40 AM, amul sul <sulamul@gmail.com> wrote:
> Fixed in the attached version.

I fixed these up a bit and committed them. Thanks.

I think this takes care of adding not only the infrastructure but
support for all the core data types, but I'm not quite sure how to
handle upgrading types in contrib. It looks like citext, hstore, and
several data types provided by isn have hash opclasses, and I think
that there's no syntax for adding a support function to an existing
opclass. We could add that, but I'm not sure how safe it would be.

TBH, I really don't care much about fixing isn, but it seems like
fixing citext and hstore would be worthwhile.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I think this takes care of adding not only the infrastructure but
> support for all the core data types, but I'm not quite sure how to
> handle upgrading types in contrib. It looks like citext, hstore, and
> several data types provided by isn have hash opclasses, and I think
> that there's no syntax for adding a support function to an existing
> opclass. We could add that, but I'm not sure how safe it would be.

ALTER OPERATOR FAMILY ADD FUNCTION ... ?

That would result in the functions being considered "loose" in the
family rather than bound into an operator class. I think that's
actually the right thing, because they shouldn't be considered
to be required.

			regards, tom lane
On Thu, Aug 31, 2017 at 10:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I think this takes care of adding not only the infrastructure but
>> support for all the core data types, but I'm not quite sure how to
>> handle upgrading types in contrib. It looks like citext, hstore, and
>> several data types provided by isn have hash opclasses, and I think
>> that there's no syntax for adding a support function to an existing
>> opclass. We could add that, but I'm not sure how safe it would be.
>
> ALTER OPERATOR FAMILY ADD FUNCTION ... ?
>
> That would result in the functions being considered "loose" in the
> family rather than bound into an operator class. I think that's
> actually the right thing, because they shouldn't be considered
> to be required.

But wouldn't that result in a different effect than the core data type
changes I just did?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Aug 31, 2017 at 10:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ALTER OPERATOR FAMILY ADD FUNCTION ... ?
>>
>> That would result in the functions being considered "loose" in the
>> family rather than bound into an operator class. I think that's
>> actually the right thing, because they shouldn't be considered
>> to be required.

> But wouldn't that result in a different effect than the core data type
> changes I just did?

Possibly --- I have not read that patch. But considering that all core
functions are pinned anyway, it doesn't seem like it much matters
whether we consider them to be loosely or tightly bound to opclasses.
That should only matter if one tries to drop the function.

			regards, tom lane
On Fri, Sep 1, 2017 at 8:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Aug 31, 2017 at 8:40 AM, amul sul <sulamul@gmail.com> wrote:
> Fixed in the attached version.
I fixed these up a bit and committed them. Thanks.
I think this takes care of adding not only the infrastructure but
support for all the core data types, but I'm not quite sure how to
handle upgrading types in contrib. It looks like citext, hstore, and
several data types provided by isn have hash opclasses, and I think
that there's no syntax for adding a support function to an existing
opclass. We could add that, but I'm not sure how safe it would be.
TBH, I really don't care much about fixing isn, but it seems like
fixing citext and hstore would be worthwhile.
The attached patch proposes fixes for the citext and hstore contrib modules.
To make it easy to understand I've split the patch into two parts: 0001 adds
a new file for the contrib upgrade & renames an existing file to the higher
version, and 0002 is the actual implementation of the extended hash function
for that contrib's data type.
Regards,
Amul