Thread: Improving on MAX_CONVERSION_GROWTH
Thinking about the nearby thread[1] about overrunning MaxAllocSize during encoding conversion, it struck me that another thing we could usefully do to improve that situation is to be smarter about the growth factor --- the existing one-size-fits-all choice of MAX_CONVERSION_GROWTH = 4 is leaving a lot on the table.

In particular, it seems like we could usefully frame things as having a constant max growth factor associated with each target encoding, stored as a new field in pg_wchar_table[].  By definition, the max growth factor cannot be more than the maximum character length in the target encoding.  So this approach immediately gives us a growth factor of 1 for any single-byte output encoding, and even many of the multibyte encodings would have max growth 2 or 3 without having to think any harder than that.

But we can do better, I think, by recognizing that all the supported encodings are ASCII extensions.  The only possible way to expend 4 output bytes per input byte is if some 1-byte character translates to a 4-byte character, and I believe that is not the case when converting any of our encodings to UTF8.  If it always takes at least a 2-byte input character to produce a 3-byte or 4-byte UTF8 character, then UTF8 has max growth 2.  I'm not quite sure whether that holds for every source encoding, but I'm pretty certain it couldn't be worse than 3.  It might be worth getting a bit more complex and having a 2-D array indexed by both source and destination encodings to determine the max growth factor.  I haven't run tests to verify empirically what the max growth factor actually is.

A fly in this ointment is: could a custom encoding conversion function violate our conclusions about the max growth factor?  Maybe it would be worth treating the growth factor as a property of a particular conversion (i.e., add a column to pg_conversion) rather than making it a hard-wired property.

In any case, it seems likely that we could end up with a multiplier of 1, 2, or 3 rather than 4 in just about every case of practical interest.  That sure seems like a win when converting long strings.

Thoughts?

			regards, tom lane

[1] https://www.postgresql.org/message-id/flat/20190816181418.GA898@alvherre.pgsql
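[A minimal sketch of how conversion code might consume such a per-target-encoding bound, assuming a hypothetical max_growth field in pg_wchar_table[] — no such field exists; only the maxmblen field does:]

```c
#include "postgres.h"
#include "mb/pg_wchar.h"

/*
 * Sketch only: size an encoding-conversion output buffer from a
 * per-target-encoding growth bound instead of the one-size-fits-all
 * MAX_CONVERSION_GROWTH.  The max_growth field is hypothetical; by
 * the argument above it could never exceed the target encoding's
 * maximum character length (the existing maxmblen field).
 */
static Size
conversion_buf_size(Size src_len, int dest_encoding)
{
	int			growth = pg_wchar_table[dest_encoding].max_growth;

	return src_len * growth + 1;	/* +1 for terminating zero byte */
}
```

[With something like that in place, a single-byte target yields growth 1 and most multibyte targets 2 or 3, so the worst-case buffer for a long string shrinks accordingly.]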
On Tue, Sep 24, 2019 at 5:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In any case, it seems likely that we could end up with a
> multiplier of 1, 2, or 3 rather than 4 in just about every
> case of practical interest.  That sure seems like a win
> when converting long strings.

+1.  From what I've seen, I'd say this is a significant practical problem for people who are trying to store large blobs of data in the database.  A lot of that is because they hit the 1GB allocation limit, and I wonder whether we shouldn't be trying harder to avoid imposing that limit in multiple places.  It's reasonable - and necessary - to impose a limit on the size of an individual datum, but when that same limit is imposed on other things, like the worst-case size of the encoding conversion, the size of an individual message sent via the wire protocol, etc., you end up with a situation where users have trouble predicting what the behavior is going to be.  >=1GB definitely won't work, but things will probably break at some point before you even get that far, depending on a bunch of complex factors that are hard to understand, not really documented, and mostly the result of applying the 1GB limit to every single memory allocation across the whole backend without really thinking about what that does to the user-visible behavior.

Now, that's not to say we should abandon MaxAllocSize, which I agree serves as a useful backstop.  But IMHO it would be smart to start with the desired user-facing behavior -- we want to support datums up to X size -- and then consider how we can get there while maintaining MaxAllocSize as a general-purpose backstop.  Our current strategy seems to be mostly the reverse: write the code the way that feels natural, enforce MaxAllocSize everywhere, and if that breaks things for a user, well, then that means - by definition - that the user was trying to do something we don't support.

One approach I think we should consider is, for larger strings, to actually scan the string and figure out how much memory we're going to need for the conversion, and then allocate exactly that amount (and fail if it's >=1GB).  An extra scan over the string is somewhat costly, but allocating hundreds of megabytes of memory on the theory that we could hypothetically have needed it is costly in a different way.  Memory is more abundant today than it's ever been, but there are still plenty of systems where a couple of extra allocations in the multi-hundred-MB range can make the whole thing fall over.  And even if it doesn't make the whole thing fall over, the CPU saved by avoiding an extra pass over the string really ought to be weighed against the memory wasted by allocating extra storage.  Getting down from a worst-case multiple of 4 to 2 is a great idea, but it still means that converting a 100MB string will allocate 200MB when what you actually need will very often be between 100MB and 105MB.  That's not an insignificant cost, even though it's much better than allocating 400MB.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
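[A rough sketch of the measure-first idea, with heavy assumptions: measure_converted_size() and convert_into() are invented helper names, not existing APIs; MaxAllocSize, palloc, and the error report are real backend constructs:]

```c
#include "postgres.h"
#include "utils/memutils.h"		/* for MaxAllocSize */

/*
 * Sketch only: walk the input once to learn the exact output size,
 * then allocate precisely that much instead of a worst-case multiple.
 * measure_converted_size() and convert_into() are hypothetical.
 */
static char *
convert_exact(const char *src, int src_len, int src_enc, int dest_enc)
{
	Size		need = measure_converted_size(src, src_len, src_enc, dest_enc);
	char	   *dst;

	if (need >= MaxAllocSize)
		ereport(ERROR,
				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
				 errmsg("out of memory"),
				 errdetail("String of %d bytes is too long for encoding conversion.",
						   src_len)));

	dst = palloc(need + 1);		/* exact size rather than len * growth */
	convert_into(dst, src, src_len, src_enc, dest_enc);
	dst[need] = '\0';
	return dst;
}
```

[The trade-off described above is visible here: one extra pass inside measure_converted_size() buys an allocation of exactly "need" bytes instead of four times the input length.]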
Hi,

On 2019-09-27 10:53:48 -0400, Robert Haas wrote:
> A lot of that is because they hit the 1GB allocation limit, and I
> wonder whether we shouldn't be trying harder to avoid imposing that
> limit in multiple places.

> It's reasonable - and necessary - to impose
> a limit on the size of an individual datum, but when that same limit
> is imposed on other things, like the worst-case size of the encoding
> conversion, the size of an individual message sent via the wire
> protocol, etc., you end up with a situation where users have trouble
> predicting what the behavior is going to be.  >=1GB definitely won't
> work, but things will probably break at some point before you even get
> that far, depending on a bunch of complex factors that are hard to
> understand, not really documented, and mostly the result of applying
> the 1GB limit to every single memory allocation across the whole
> backend without really thinking about what that does to the
> user-visible behavior.

+1 - that will be a long, piecemeal project, I think... But deciding that we should do so is a good first step.

Note that one of the additional reasons for the 1GB limit is that it protects against int overflows.  I'm somewhat unconvinced that that's a sensible approach, but ...

I wonder if we shouldn't make stringinfos use size_t lengths, btw.  Only supporting INT32_MAX (not even UINT32_MAX) seems weird these days.  But we'd presumably have to make it opt-in.

> One approach I think we should consider is, for larger strings, to
> actually scan the string and figure out how much memory we're going to
> need for the conversion, and then allocate exactly that amount (and
> fail if it's >=1GB).  An extra scan over the string is somewhat costly,
> but allocating hundreds of megabytes of memory on the theory that we
> could hypothetically have needed it is costly in a different way.

My proposal for this is something like
https://www.postgresql.org/message-id/20190924214204.mav4md77xg5u5wq6%40alap3.anarazel.de
which should avoid the overallocation without a second pass, and hopefully without losing much efficiency.

It's worthwhile to note that additional passes over data are often quite expensive; memory latency hasn't shrunk that much in the last decade or so.  I have frequently seen all the memcpys from one StringInfo/char* into another StringInfo show up in profiles.

Greetings,

Andres Freund
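[For context, a rough sketch of the chunkwise shape that kind of proposal implies: convert a bounded slice at a time into a growable StringInfo, so the allocation tracks actual output size without a second pass. convert_slice() is an invented stand-in for a call into the conversion machinery; StringInfo and pg_encoding_mbcliplen() are real:]

```c
#include "postgres.h"
#include "lib/stringinfo.h"
#include "mb/pg_wchar.h"

#define CONVERSION_CHUNK_SIZE 8192	/* arbitrary illustrative chunk */

/*
 * Sketch only: convert the input a slice at a time, appending to a
 * StringInfo that grows on demand, so total allocation tracks the
 * real output size rather than a worst-case multiple of the input.
 */
static void
convert_chunkwise(StringInfo dst, const char *src, int src_len,
				  int src_encoding, int dest_encoding)
{
	while (src_len > 0)
	{
		/* clip to a character boundary so no multibyte character is
		 * split across chunk boundaries */
		int			thislen = pg_encoding_mbcliplen(src_encoding, src, src_len,
											Min(src_len, CONVERSION_CHUNK_SIZE));

		convert_slice(dst, src, thislen, src_encoding, dest_encoding);	/* hypothetical */
		src += thislen;
		src_len -= thislen;
	}
}
```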
On Fri, Sep 27, 2019 at 11:40 AM Andres Freund <andres@anarazel.de> wrote:
> Note that one of the additional reasons for the 1GB limit is that it
> protects against int overflows.  I'm somewhat unconvinced that that's a
> sensible approach, but ...

It's not crazy.  People use 'int' rather casually, just as they use 'palloc' rather casually, without necessarily thinking about what could go wrong at the edges.  I don't have any beef with that as a general strategy; I just think we should be trying to do better in the cases where it negatively affects the user experience.

> It's worthwhile to note that additional passes over data are often quite
> expensive; memory latency hasn't shrunk that much in the last decade or
> so.  I have frequently seen all the memcpys from one StringInfo/char*
> into another StringInfo show up in profiles.

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
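[To make the overflow hazard concrete, a minimal illustration — the function and the flat multiplier are illustrative, not backend code. Any value under the 1GB cap is individually safe in an int, but multiplying by a growth factor pushes the product past INT_MAX unless the arithmetic is widened first:]

```c
#include "postgres.h"

/*
 * Illustrative only: with plain int arithmetic, "len * 4" wraps for
 * any input over 512MB even though len itself is under the 1GB cap;
 * widening to Size before multiplying avoids the overflow.
 */
static Size
worst_case_alloc(int len)
{
	return (Size) len * 4 + 1;	/* widen first, then multiply */
}
```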
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Sep 27, 2019 at 11:40 AM Andres Freund <andres@anarazel.de> wrote:
>> Note that one of the additional reasons for the 1GB limit is that it
>> protects against int overflows.  I'm somewhat unconvinced that that's a
>> sensible approach, but ...

> It's not crazy.  People use 'int' rather casually, just as they use
> 'palloc' rather casually, without necessarily thinking about what
> could go wrong at the edges.  I don't have any beef with that as a
> general strategy; I just think we should be trying to do better in the
> cases where it negatively affects the user experience.

A small problem with doing anything very interesting here is that the int-is-enough-for-a-string-length assumption is baked into the wire protocol (read the DataRow message format spec and weep).  We could probably bend the COPY protocol enough to support multi-gig row values --- dropping the rule that the backend doesn't split rows across CopyData messages wouldn't break too many clients, hopefully.  That would at least dodge some problems in dump/restore scenarios.

In the meantime, I still think we should commit what I proposed in the other thread (<974.1569356381@sss.pgh.pa.us>), or something close to it.  Andres' proposal would perhaps be an improvement on that, but I don't think it'll be ready anytime soon; and for sure we wouldn't risk back-patching it, while I think we could back-patch what I suggested.  In any case, that patch is small enough that dropping it would be no big loss if a better solution comes along.

Also, as far as the immediate subject of this thread is concerned, I'm inclined to get rid of MAX_CONVERSION_GROWTH in favor of using the target encoding's max char length.  The one use (in printtup.c) where we don't know the target encoding could use MAX_MULTIBYTE_CHAR_LEN instead.  Being smarter than that could help in some cases (mostly, conversion of ISO encodings to UTF8), but it's not that big a win.  (I did some checks and found that some ISO encodings could provide a max growth of 2x, but many are max 3x, so 4x isn't that far out of line.)  If Andres' ideas don't pan out we could come back and work harder on this, but for now something simple and back-patchable seems like a useful stopgap improvement.

			regards, tom lane
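[A minimal sketch of that stopgap: pg_encoding_max_length() and MAX_MULTIBYTE_CHAR_LEN are real; the wrapper function and its use of -1 for "target unknown" are assumptions of this sketch:]

```c
#include "postgres.h"
#include "mb/pg_wchar.h"

/*
 * Sketch only: derive worst-case growth from the target encoding's
 * max char length, falling back to MAX_MULTIBYTE_CHAR_LEN when the
 * target encoding is unknown (the printtup.c case).
 */
static int
conversion_growth(int dest_encoding)
{
	if (dest_encoding < 0)		/* target not known, assume the worst */
		return MAX_MULTIBYTE_CHAR_LEN;

	return pg_encoding_max_length(dest_encoding);
}
```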
I wrote:
> In the meantime, I still think we should commit what I proposed in the
> other thread (<974.1569356381@sss.pgh.pa.us>), or something close to it.
> Andres' proposal would perhaps be an improvement on that, but I don't
> think it'll be ready anytime soon; and for sure we wouldn't risk
> back-patching it, while I think we could back-patch what I suggested.
> In any case, that patch is small enough that dropping it would be no big
> loss if a better solution comes along.

Not having heard any objections, I'll proceed with that.  Andres is welcome to work on replacing it with his more-complicated idea...

> Also, as far as the immediate subject of this thread is concerned,
> I'm inclined to get rid of MAX_CONVERSION_GROWTH in favor of using
> the target encoding's max char length.

I realized after re-reading the comment for MAX_CONVERSION_GROWTH that this thread is based on a false premise, namely that encoding conversions always produce one "character" out per "character" in.  In the presence of combining characters and suchlike, that premise fails (a single input character might, for example, be decomposed into a base character plus combining marks), and it becomes quite unclear just what the max growth ratio actually is.  So I'm going to leave that alone for now.

Maybe this point is an argument for pushing forward with Andres' approach, but I'm still dubious about the overall cost/benefit ratio of that concept.

			regards, tom lane
Hi,

On 2019-10-03 12:12:40 -0400, Tom Lane wrote:
> I wrote:
> > In the meantime, I still think we should commit what I proposed in the
> > other thread (<974.1569356381@sss.pgh.pa.us>), or something close to it.
> > Andres' proposal would perhaps be an improvement on that, but I don't
> > think it'll be ready anytime soon; and for sure we wouldn't risk
> > back-patching it, while I think we could back-patch what I suggested.
> > In any case, that patch is small enough that dropping it would be no big
> > loss if a better solution comes along.
>
> Not having heard any objections, I'll proceed with that.  Andres is
> welcome to work on replacing it with his more-complicated idea...

Yea, what I'm proposing is clearly not backpatchable.  So +1.

> Maybe this point is an argument for pushing forward with Andres'
> approach, but I'm still dubious about the overall cost/benefit ratio
> of that concept.

I think if it were just for MAX_CONVERSION_GROWTH, I'd be inclined to agree.  But I think it has other advantages, so I'm mildly positive that it'll be an overall win...

Greetings,

Andres Freund