Обсуждение: refactoring - share str2*int64 functions
Hello Tom, >> Yep, but ISTM that it is down to 32 bits, > > Only on 32-bit-long machines, which are a dwindling minority (except > for Windows, which I don't really care about). > >> So the third short is now always 0. Hmmm. I'll propose another option over >> the week-end. > > I suppose we could put pg_strtouint64 somewhere where pgbench can use it, > but TBH I don't think it's worth the trouble. The set of people using > the --random-seed=int option at all is darn near empty, I suspect, > and the documentation only says you can write an int there. Although I agree it is not worth a lot of trouble, and even if I don't do Windows, I think it valuable that the behavior is the same on all platform. The attached match shares pg_str2*int64 functions between frontend and backend by moving them to "common/", which avoids some code duplication. This is more refactoring, and it fixes the behavior change on 32 bit architectures. -- Fabien.
As usual, it is better with the attachement. Sorry for the noise. >>> Yep, but ISTM that it is down to 32 bits, >> >> Only on 32-bit-long machines, which are a dwindling minority (except >> for Windows, which I don't really care about). >> >>> So the third short is now always 0. Hmmm. I'll propose another option over >>> the week-end. >> >> I suppose we could put pg_strtouint64 somewhere where pgbench can use it, >> but TBH I don't think it's worth the trouble. The set of people using >> the --random-seed=int option at all is darn near empty, I suspect, >> and the documentation only says you can write an int there. > > Although I agree it is not worth a lot of trouble, and even if I don't do > Windows, I think it valuable that the behavior is the same on all platform. > The attached match shares pg_str2*int64 functions between frontend and > backend by moving them to "common/", which avoids some code duplication. > > This is more refactoring, and it fixes the behavior change on 32 bit > architectures. > > -- Fabien.
Вложения
>> Although I agree it is not worth a lot of trouble, and even if I don't do >> Windows, I think it valuable that the behavior is the same on all platform. >> The attached match shares pg_str2*int64 functions between frontend and >> backend by moving them to "common/", which avoids some code duplication. >> >> This is more refactoring, and it fixes the behavior change on 32 bit >> architectures. V2 is a rebase. -- Fabien.
Вложения
On Fri, May 24, 2019 at 3:23 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote: > >> Although I agree it is not worth a lot of trouble, and even if I don't do > >> Windows, I think it valuable that the behavior is the same on all platform. > >> The attached match shares pg_str2*int64 functions between frontend and > >> backend by moving them to "common/", which avoids some code duplication. > >> > >> This is more refactoring, and it fixes the behavior change on 32 bit > >> architectures. > > V2 is a rebase. Hi Fabien, Here's some semi-automated feedback, noted while going through failures on cfbot.cputube.org. You have a stray editor file src/backend/parser/parse_node.c.~1~. Something is failing to compile while doing the temp-install in make check-world, which probably indicates that some test or contrib module is using the interface you changed? -- Thomas Munro https://enterprisedb.com
On Mon, Jul 8, 2019 at 3:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's some semi-automated feedback, noted while going through
> failures on cfbot.cputube.org.  You have a stray editor file
> src/backend/parser/parse_node.c.~1~.  Something is failing to compile
> while doing the temp-install in make check-world, which probably
> indicates that some test or contrib module is using the interface you
> changed?
Please disregard the the comment about the ".~1~" file, my mistake.
As for the check-world failure, it's here:
pg_stat_statements.c:1024:11: error: implicit declaration of function
'pg_strtouint64' is invalid in C99
[-Werror,-Wimplicit-function-declaration]
                        rows = pg_strtouint64(completionTag + 5, NULL, 10);
                               ^
Apparently it needs to include common/string.h.
-- 
Thomas Munro
https://enterprisedb.com
			
		Hello Thomas, > pg_stat_statements.c:1024:11: error: implicit declaration of function > 'pg_strtouint64' is invalid in C99 > [-Werror,-Wimplicit-function-declaration] > rows = pg_strtouint64(completionTag + 5, NULL, 10); > ^ > Apparently it needs to include common/string.h. Yep, I gathered that as well, but did not act promptly on your help. Thanks for it! Here is the updated patch on which I checked "make check-world". -- Fabien.
Вложения
On Sun, Jul 14, 2019 at 3:22 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote: > Here is the updated patch on which I checked "make check-world". Thanks! So, we're moving pg_strtouint64() to a place where frontend code can use it, and getting rid of some duplication. I like it. I wanted this once before myself[1]. +extern bool pg_strtoint64(const char *str, bool errorOK, int64 *result); +extern uint64 pg_strtouint64(const char *str, char **endptr, int base); One of these things is not like the other. Let's see... the int64 version is used only by pgbench and is being promoted to common where it can be used by more code. With a name like that, wouldn't it make sense to bring it into line with the uint64 interface, and then move pgbench's error reporting stuff back into pgbench? The uint64 one derives its shape from the family of standard functions like strtol() so I think it wins. [1] https://www.postgresql.org/message-id/CAEepm=2KeC8xDbEWgDTDObXGqPHFW4kcD7BZXR6NMfiHjjnKhQ@mail.gmail.com -- Thomas Munro https://enterprisedb.com
Hello Thomas, > +extern bool pg_strtoint64(const char *str, bool errorOK, int64 *result); > +extern uint64 pg_strtouint64(const char *str, char **endptr, int base); > > One of these things is not like the other. Indeed. I agree that it is unfortunate, and it was bothering me a little as well. > Let's see... the int64 version is used only by pgbench and is being > promoted to common where it can be used by more code. No and yes. The pgbench code was a copy of server-side internal "scanint8", so it is used both by pgbench and the server-side handling of "int8", it is used significantly, taking advantage of its versatile error reporting feature on both sides. > With a name like that, wouldn't it make sense to bring it into line with > the uint64 interface, and then move pgbench's error reporting stuff back > into pgbench? That would need moving the server-side error handling as well, which I would not really be happy with. > The uint64 one derives its shape from the family of standard functions > like strtol() so I think it wins. Yep, it cannot be changed either. I do not think that changing the error handling capability is appropriate, it is really a feature of the function. The function could try to use an internal pg_strtoint64 which would look like the other unsigned version, but then it would not differentiate the various error conditions (out of range vs syntax error). The compromise I can offer is to change the name of the first one, say to "pg_scanint8" to reflect its former backend name. Attached a v4 which does a renaming so as to avoid the name similarity but signature difference. I also made both error messages identical. -- Fabien.
Вложения
On Mon, Jul 15, 2019 at 5:08 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote: > The compromise I can offer is to change the name of the first one, say to > "pg_scanint8" to reflect its former backend name. Attached a v4 which does > a renaming so as to avoid the name similarity but signature difference. I > also made both error messages identical. Cool. I'm not exactly sure when we should include 'pg_' in identifier names. It seems to be used for functions/macros that wrap or replace something else with a similar name, like pg_pwrite(), pg_attribute_noreturn(), ... In this case it's just our own code that we're moving, so I'm wondering if we should just call it scanint8(). If you agree, I think this is ready to commit. -- Thomas Munro https://enterprisedb.com
On 2019-Jul-16, Thomas Munro wrote: > On Mon, Jul 15, 2019 at 5:08 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote: > > The compromise I can offer is to change the name of the first one, say to > > "pg_scanint8" to reflect its former backend name. Attached a v4 which does > > a renaming so as to avoid the name similarity but signature difference. I > > also made both error messages identical. > > Cool. I'm not exactly sure when we should include 'pg_' in identifier > names. It seems to be used for functions/macros that wrap or replace > something else with a similar name, like pg_pwrite(), > pg_attribute_noreturn(), ... In this case it's just our own code that > we're moving, so I'm wondering if we should just call it scanint8(). Isn't it annoying that pg_strtouint64() has an implementation that suggests that it ought to be in src/port? The fact that the signatures are so different suggests to me that we should indeed put them separate. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jul 16, 2019 at 11:16:31AM +1200, Thomas Munro wrote:
> Cool.  I'm not exactly sure when we should include 'pg_' in identifier
> names.  It seems to be used for functions/macros that wrap or replace
> something else with a similar name, like pg_pwrite(),
> pg_attribute_noreturn(), ...  In this case it's just our own code that
> we're moving, so I'm wondering if we should just call it scanint8().
FWIW, I was looking forward to putting my hands on this patch and try
to get it merged so as we can get rid of those duplications.  Here are
some comments.
+#ifdef FRONTEND
+       fprintf(stderr,
+               "invalid input syntax for type %s: \"%s\"\n",
"bigint", str);
+#else
+       ereport(ERROR,
+               (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+                errmsg("invalid input syntax for type %s: \"%s\"",
+                       "bigint", str)));
+#endif
Have you looked at using the wrapper pg_log_error() here?
+extern bool pg_scanint8(const char *str, bool errorOK, int64
*result);
+extern uint64 pg_strtouint64(const char *str, char **endptr, int
base);
Hmm.  With this patch we have strtoint and pg_strtouint64, which makes
the whole set inconsistent.
+
 #endif                         /* COMMON_STRING_H */
Noise diff..
numutils.c also has pg_strtoint16 and pg_strtoint32, so the locations
become rather inconsistent with inconsistent APIs for the manipulation
of int2 and int4 fields, and scanint8 is just a derivative of the same
logic.  We have two categories of routines here:
- The wrappers on top of strtol and strtoul & co, which are named
respectively strtoint and pg_strtouint64 with the patch.  The naming
part is inconsistent, and we only handle uint64 and int32.  We don't
need to worry about int64 and uint32 because they are not used?
That's fine by me, but at least let's have a consistent naming.
Prefixing the functions with pg_* is a better practice in my opinion
as we will unlikely run into conflicts this way.
- The str->integer conversion routines, which actually have very
similar characteristics to the strtol families as they remove trailing
whitespaces first, check for a sign, etc, except that they work only
on base 10.  And here we get into a state where pg_scanint8 should be
actually called pg_strtoint64, with an interface inconsistent with its
int32/int16 relatives now only in the backend.  Could we consider more
consolidation here please?  Like moving the whole set to src/common/?
> If you agree, I think this is ready to commit.
Thomas, are you planning to look at this patch as committer?  I had it
in my agenda, and was planning to look at it sooner than later.  Now
if you are on it, I won't step on your toes.
--
Michael
			
		Вложения
On Tue, Jul 16, 2019 at 7:11 PM Michael Paquier <michael@paquier.xyz> wrote: > Thomas, are you planning to look at this patch as committer? I had it > in my agenda, and was planning to look at it sooner than later. Now > if you are on it, I won't step on your toes. Hi Michael, please go ahead, it's all yours. -- Thomas Munro https://enterprisedb.com
Hello Thomas, > On Mon, Jul 15, 2019 at 5:08 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote: >> The compromise I can offer is to change the name of the first one, say to >> "pg_scanint8" to reflect its former backend name. Attached a v4 which does >> a renaming so as to avoid the name similarity but signature difference. I >> also made both error messages identical. > > Cool. I'm not exactly sure when we should include 'pg_' in identifier > names. It seems to be used for functions/macros that wrap or replace > something else with a similar name, like pg_pwrite(), > pg_attribute_noreturn(), ... In this case it's just our own code that > we're moving, so I'm wondering if we should just call it scanint8(). I added the pg_ prefix as a poor man's namespace because the function can be used by external tools (eg contribs), so as to avoid potential name conflicts. I agree that such conflicts are less probable if the name does not replace something existing. > If you agree, I think this is ready to commit. It can be removed, or not. So you do as you feel. -- Fabien.
On Jul 16, 2019, at 3:30 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote: >> Cool. I'm not exactly sure when we should include 'pg_' in identifier >> names. It seems to be used for functions/macros that wrap or replace >> something else with a similar name, like pg_pwrite(), >> pg_attribute_noreturn(), ... In this case it's just our own code that >> we're moving, so I'm wondering if we should just call it scanint8(). > > I added the pg_ prefix as a poor man's namespace because the function can be used by external tools (eg contribs), so asto avoid potential name conflicts. Yeah, I think if we are going to expose it to front end code there is a good argument for some kind of prefix that makesit sound PostgreSQL-related. ...Robert
Robert Haas <robertmhaas@gmail.com> writes:
> On Jul 16, 2019, at 3:30 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>>> Cool.  I'm not exactly sure when we should include 'pg_' in identifier
>>> names.
>> I added the pg_ prefix as a poor man's namespace because the function can be used by external tools (eg contribs),
soas to avoid potential name conflicts. 
> Yeah, I think if we are going to expose it to front end code there is a good argument for some kind of prefix that
makesit sound PostgreSQL-related. 
Yeah, I'd tend to err in favor of including "pg_".  We might get away
without that as long as the name is never exposed to non-PG code, but
for stuff that's going into src/common/ or src/port/ I think that's
a risky assumption to make.
I'm also in agreement with Michael's comments in
<20190716071144.GF1439@paquier.xyz> that this would be a good time
to bring some consistency to the naming of related functions.
            regards, tom lane
			
		Hi,
On 2019-07-15 07:08:42 +0200, Fabien COELHO wrote:
> I do not think that changing the error handling capability is appropriate,
> it is really a feature of the function. The function could try to use an
> internal pg_strtoint64 which would look like the other unsigned version, but
> then it would not differentiate the various error conditions (out of range
> vs syntax error).
> The compromise I can offer is to change the name of the first one, say to
> "pg_scanint8" to reflect its former backend name. Attached a v4 which does a
> renaming so as to avoid the name similarity but signature difference. I also
> made both error messages identical.
I think the interface of that function is not that good, and the "scan"
in the name isn't great for discoverability (for one it's a different
naming than pg_strtoint16 etc), and the *8 meaning 64bit is confusing
enough in the backend, we definitely shouldn't extend that to frontend
code.
Referencing "bigint" and "input syntax" from frontend code imo doesn't
make a lot of sense. And int8in is the only caller that uses
errorOK=False anyway, so there's currently no need for the frontend
error strings afaict.
ISTM that something like
typedef enum StrToIntConversion
{
        STRTOINT_OK = 0,
        STRTOINT_SYNTAX_ERROR = 1,
        STRTOINT_OUT_OF_RANGE = 2
} StrToIntConversion;
StrToIntConversion pg_strtoint64(const char *str, int64 *result);
would make more sense.
There is the issue that there already is pg_strtoint16 and
pg_strtoint32, which do not have the option to not raise an error.  I'd
probably name the non-error throwing ones something like pg_strtointNN_e
(for extended, or error handling), and have pg_strtointNN wrappers that
just handle the errors, or reverse the naming (which'd cause a bit of
churn, but not that much).
That'd also make the code for pg_strtointNN a bit nicer, because we'd
not need the gotos anymore, they're just there to avoid redundant error
messages - which'd not be an issue if the error handling were just a
switch in a separate function. E.g.
int32
pg_strtoint32(const char *s)
{
    int32 result;
    switch (pg_strtoint32_e(s, &result))
    {
            case STRTOINT_OK:
                 return result;
            case STRTOINT_SYNTAX_ERROR:
                ereport(ERROR,
                        (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                         errmsg("value \"%s\" is out of range for type %s",
                                s, "integer")));
            case STRTOINT_OUT_OF_RANGE:
                ereport(ERROR,
                        (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                         errmsg("invalid input syntax for type %s: \"%s\"",
                                "integer", s)));
    }
    return 0;                   /* keep compiler quiet */
}
which does seem nicer than what we have right now.
Greetings,
Andres Freund
			
		Hi, On 2019-07-16 16:11:44 +0900, Michael Paquier wrote: > numutils.c also has pg_strtoint16 and pg_strtoint32, so the locations > become rather inconsistent with inconsistent APIs for the manipulation > of int2 and int4 fields, and scanint8 is just a derivative of the same > logic. We have two categories of routines here: > - The wrappers on top of strtol and strtoul & co, which are named > respectively strtoint and pg_strtouint64 with the patch. The naming > part is inconsistent, and we only handle uint64 and int32. We don't > need to worry about int64 and uint32 because they are not used? > That's fine by me, but at least let's have a consistent naming. Yea, consistent naming seems like a strong requirement here. Additionally I think we should just provide a consistent set rather than what's needed just now. That'll just lead to people inventing their own again down the line. > Prefixing the functions with pg_* is a better practice in my opinion > as we will unlikely run into conflicts this way. +1 > - The str->integer conversion routines, which actually have very > similar characteristics to the strtol families as they remove trailing > whitespaces first, check for a sign, etc, except that they work only > on base 10. There's afaict neither a caller that needs the base argument at the moment, nor one in the tree previously. I'd argue for just making pg_strtouint64's API consistent. I'd probably also just use the implementation we have for signed integers (minus the relevant negation and overflow checks, obviously) - it's a lot faster, and I think there's value in keeping the implementations in sync. Greetings, Andres Freund
On Tue, Jul 16, 2019 at 01:18:38PM -0700, Andres Freund wrote: > Hi, > > On 2019-07-16 16:11:44 +0900, Michael Paquier wrote: > Yea, consistent naming seems like a strong requirement > here. Additionally I think we should just provide a consistent set > rather than what's needed just now. That'll just lead to people > inventing their own again down the line. Agreed. The first versions of pg_rewind in the tree have been using copy_file_range(), which has been introduced in Linux. > > - The str->integer conversion routines, which actually have very > > similar characteristics to the strtol families as they remove trailing > > whitespaces first, check for a sign, etc, except that they work only > > on base 10. > > There's afaict neither a caller that needs the base argument at the > moment, nor one in the tree previously. I'd argue for just making > pg_strtouint64's API consistent. Good point, indeed, this could be much more simplified. I have not paid attention at that part. > I'd probably also just use the implementation we have for signed > integers (minus the relevant negation and overflow checks, obviously) - > it's a lot faster, and I think there's value in keeping the > implementations in sync. You mean that it is much faster than the set of wrappers for strtol than we have? Is that because we don't care about the base? -- Michael
Вложения
On Tue, Jul 16, 2019 at 01:04:38PM -0700, Andres Freund wrote: > There is the issue that there already is pg_strtoint16 and > pg_strtoint32, which do not have the option to not raise an error. I'd > probably name the non-error throwing ones something like pg_strtointNN_e > (for extended, or error handling), and have pg_strtointNN wrappers that > just handle the errors, or reverse the naming (which'd cause a bit of > churn, but not that much). > > That'd also make the code for pg_strtointNN a bit nicer, because we'd > not need the gotos anymore, they're just there to avoid redundant error > messages - which'd not be an issue if the error handling were just a > switch in a separate function. E.g. Agreed on that. I am wondering if we should use a common wrapper for all the internal functions taking in input a set of bits16 flags to control its behavior and put all that into common/script.c: - Issue an error. - Check for signedness. - Base length: 16, 32 or 64. This would have the advantage to move the error string generation, the trailing whitespace check, sign handling and such in a single place. We could have the internal routine return uint64 which is casted afterwards to a proper result depending on what we use. (Perhaps that's what you actually meant?) I would also rather not touch the strtol wrappers that we have able to handle the base. There is nothing in the tree using it, but likely there are extensions relying on it. Switching all the existing callers in the tree to the new routines sounds good to me of course. Consolidating all that still needs more work, so for now I am switching the patch as waiting on author. -- Michael
Вложения
Bonjour Michaël,
> FWIW, I was looking forward to putting my hands on this patch and try
> to get it merged so as we can get rid of those duplications.  Here are
> some comments.
>
> +#ifdef FRONTEND
> +       fprintf(stderr,
> +               "invalid input syntax for type %s: \"%s\"\n",
> "bigint", str);
> +#else
> +       ereport(ERROR,
> +               (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
> +                errmsg("invalid input syntax for type %s: \"%s\"",
> +                       "bigint", str)));
> +#endif
> Have you looked at using the wrapper pg_log_error() here?
I have not.
I have simply merged the two implementations (pgbench & backend) as they 
were.
> +extern bool pg_scanint8(const char *str, bool errorOK, int64 *result);
> +extern uint64 pg_strtouint64(const char *str, char **endptr, int base);
> Hmm.  With this patch we have strtoint and pg_strtouint64, which makes
> the whole set inconsistent.
I understand that you mean bits vs bytes? Indeed it can bite!
> +
> #endif                         /* COMMON_STRING_H */
> Noise diff..
Indeed.
> numutils.c also has pg_strtoint16 and pg_strtoint32, so the locations
> become rather inconsistent with inconsistent APIs for the manipulation
> of int2 and int4 fields, and scanint8 is just a derivative of the same
> logic.  We have two categories of routines here:
Yep, but the int2/4 functions are not used elsewhere.
> - The wrappers on top of strtol and strtoul & co, which are named
> respectively strtoint and pg_strtouint64 with the patch.  The naming
> part is inconsistent, and we only handle uint64 and int32.  We don't
> need to worry about int64 and uint32 because they are not used?
Indeed, it seems that they are not needed/used by client code, AFAICT.
> That's fine by me, but at least let's have a consistent naming.
Ok.
> Prefixing the functions with pg_* is a better practice in my opinion
> as we will unlikely run into conflicts this way.
Ok.
> - The str->integer conversion routines, which actually have very
> similar characteristics to the strtol families as they remove trailing
> whitespaces first, check for a sign, etc, except that they work only
> on base 10.  And here we get into a state where pg_scanint8 should be
> actually called pg_strtoint64,
I just removed that:-)
ISTM that the issue is that the error handling of these functions is 
pretty different.
> with an interface inconsistent with its int32/int16 relatives now only 
> in the backend.
We can, but I'm not at ease with changing the error handling approach.
> Could we consider more consolidation here please?  Like moving the whole 
> set to src/common/?
My initial plan was simply to remove direct code duplications between 
front-end and back-end, not to re-engineer the full set of string to int 
conversion functions:-)
On the re-engineering front: Given the various points on the thread, ISTM 
that there should probably be two behaviors for str to signed/unsigned 
int{16,32,64}, and having only one kind of signature for all types would 
be definitely better.
One low-level one that does the conversion or return an error.
Another higher-level one which possibly adds an error message (stderr for 
front-end, log for back-end).
One choice is whether there are two functions (the higher one calling the 
lower one and adding the messages) or just one with a boolean to trigger 
the messages. I do not have a strong opinion. Maybe one function would be 
simpler. Andres boolean-compatible enum return looks like a good option.
Overall, this leads to something like:
enum { STRTOINT_OK, STRTOINT_OVERFLOW_ERROR, STRTOINT_SYNTAX_ERROR }
   pg_strto{,u}int{8?,16,32,64}
     (const char * string, const TYPE * result, const bool verbose);
-- 
Fabien.
			
		On Wed, Jul 17, 2019 at 07:55:39AM +0000, Fabien COELHO wrote:
>> numutils.c also has pg_strtoint16 and pg_strtoint32, so the locations
>> become rather inconsistent with inconsistent APIs for the manipulation
>> of int2 and int4 fields, and scanint8 is just a derivative of the same
>> logic.  We have two categories of routines here:
>
> Yep, but the int2/4 functions are not used elsewhere.
The worry is more about having people invent the same stuff all over
again.  If we can get a clean interface, that would ease adoption.
Hopefully.
> Overall, this leads to something like:
>
> enum { STRTOINT_OK, STRTOINT_OVERFLOW_ERROR, STRTOINT_SYNTAX_ERROR }
>   pg_strto{,u}int{8?,16,32,64}
>     (const char * string, const TYPE * result, const bool verbose);
Something like that.  "verbose" may mean "error_ok" though.  Not
having 6 times the same trailing whitespace checks and such would be
nice.
Actually, one thing which may be a problem is that we lack currently
the equivalents of pg_mul_s16_overflow and such for unsigned
integers.  The point of contention comes from pgbench's
set_random_seed() in this case as we can expect an unsigned seed as
the docs say.  But if we give up on the signedness of the random seed
which remains with 8 bytes, then we could let pg_strtouint64 as
backend-only and only worry about porting this set of APIs for signed
integers.
--
Michael
			
		Вложения
Hi, On 2019-07-17 12:18:19 +0900, Michael Paquier wrote: > On Tue, Jul 16, 2019 at 01:04:38PM -0700, Andres Freund wrote: > > There is the issue that there already is pg_strtoint16 and > > pg_strtoint32, which do not have the option to not raise an error. I'd > > probably name the non-error throwing ones something like pg_strtointNN_e > > (for extended, or error handling), and have pg_strtointNN wrappers that > > just handle the errors, or reverse the naming (which'd cause a bit of > > churn, but not that much). > > > > That'd also make the code for pg_strtointNN a bit nicer, because we'd > > not need the gotos anymore, they're just there to avoid redundant error > > messages - which'd not be an issue if the error handling were just a > > switch in a separate function. E.g. > > Agreed on that. I am wondering if we should use a common wrapper for > all the internal functions taking in input a set of bits16 flags to > control its behavior and put all that into common/script.c: > - Issue an error. > - Check for signedness. > - Base length: 16, 32 or 64. That'd be considerably slower, so I'm *strongly* against that. These conversion routines are *really* hot in a number of workloads, e.g. bulk-loading with COPY. Check e.g. https://www.postgresql.org/message-id/20171208214437.qgn6zdltyq5hmjpk%40alap3.anarazel.de > I would also rather not touch the strtol wrappers that we have able to > handle the base. There is nothing in the tree using it, but likely > there are extensions relying on it. I doubt it - it's not of that long-standing vintage (23a27b039d9, 2016-03-12), and if so they are very likely to use base 10. We shouldn't keep some barely tested function around, just for the hypothetical scenario that some extension uses it. Especially if that function is considerably slower than the potential replacement. Greetings, Andres Freund
Hi, On 2019-07-17 12:04:32 +0900, Michael Paquier wrote: > On Tue, Jul 16, 2019 at 01:18:38PM -0700, Andres Freund wrote: > > I'd probably also just use the implementation we have for signed > > integers (minus the relevant negation and overflow checks, obviously) - > > it's a lot faster, and I think there's value in keeping the > > implementations in sync. > > You mean that it is much faster than the set of wrappers for strtol > than we have? Is that because we don't care about the base? Yes: https://www.postgresql.org/message-id/20171208214437.qgn6zdltyq5hmjpk%40alap3.anarazel.de Not caring about the base is one significant part, that removes a fair bit of branches and more importantly allows the compiler to replace divisions with much faster code (glibc tries to avoid the division too, with lookup tables, but that's still expensive). Additionally there's also some locale awareness in strtoll etc that we don't need. It's also plainly not that well implemented at least in glibc and musl. Having an implementation that reliably works the same across all platforms is also advantageous. Greetings, Andres Freund
Hi,
On 2019-07-17 07:55:39 +0000, Fabien COELHO wrote:
> > - The str->integer conversion routines, which actually have very
> > similar characteristics to the strtol families as they remove trailing
> > whitespaces first, check for a sign, etc, except that they work only
> > on base 10.  And here we get into a state where pg_scanint8 should be
> > actually called pg_strtoint64,
> 
> I just removed that:-)
What do you mean by that?
> > with an interface inconsistent with its int32/int16 relatives now only
> > in the backend.
> 
> We can, but I'm not at ease with changing the error handling approach.
Why?
> > Could we consider more consolidation here please?  Like moving the whole
> > set to src/common/?
> 
> My initial plan was simply to remove direct code duplications between
> front-end and back-end, not to re-engineer the full set of string to int
> conversion functions:-)
Well, if you expose functions to more places - e.g. now the whole
frontend - it becomes more important that they're reasonably designed.
> On the re-engineering front: Given the various points on the thread, ISTM
> that there should probably be two behaviors for str to signed/unsigned
> int{16,32,64}, and having only one kind of signature for all types would be
> definitely better.
I don't understand why we'd want to have different behaviours for
signed/unsigned? Maybe I'm mis-understanding your sentence, and you just
mean that there should be one that throws, and one that returns an
errorcode?
> Another higher-level one which possibly adds an error message (stderr for
> front-end, log for back-end).
Is there actually any need for a non-backend one that has an
error-message?  I'm not convinced that in the frontend it's very useful
to have such a function that exits - in the backend we have a much more
complete way to handle that, including pointing to the right location in
the query strings etc.
> One choice is whether there are two functions (the higher one calling the
> lower one and adding the messages) or just one with a boolean to trigger the
> messages. I do not have a strong opinion. Maybe one function would be
> simpler. Andres boolean-compatible enum return looks like a good option.
The boolean makes the calling code harder to understand, the function
slower, and the code harder to grep.
> Overall, this leads to something like:
> 
> enum { STRTOINT_OK, STRTOINT_OVERFLOW_ERROR, STRTOINT_SYNTAX_ERROR }
>   pg_strto{,u}int{8?,16,32,64}
>     (const char * string, const TYPE * result, const bool verbose);
What's with hthe const for the result? I assume that was just a copy&pasto?
Greetings,
Andres Freund
			
		Hi,
On 2019-07-17 17:29:58 +0900, Michael Paquier wrote:
> Actually, one thing which may be a problem is that we lack currently
> the equivalents of pg_mul_s16_overflow and such for unsigned
> integers.
It's much simpler to implement them for unsigned than for signed,
because unsigned overflow is well-defined. So I'd not be particularly
worried about just adding them.  E.g. comparing the "slow" version of
pg_mul_s64_overflow() with an untested implementation of
pg_mul_u64_overflow():
pg_mul_s64_overflow:
    /*
     * Overflow can only happen if at least one value is outside the range
     * sqrt(min)..sqrt(max) so check that first as the division can be quite a
     * bit more expensive than the multiplication.
     *
     * Multiplying by 0 or 1 can't overflow of course and checking for 0
     * separately avoids any risk of dividing by 0.  Be careful about dividing
     * INT_MIN by -1 also, note reversing the a and b to ensure we're always
     * dividing it by a positive value.
     *
     */
    if ((a > PG_INT32_MAX || a < PG_INT32_MIN ||
         b > PG_INT32_MAX || b < PG_INT32_MIN) &&
        a != 0 && a != 1 && b != 0 && b != 1 &&
        ((a > 0 && b > 0 && a > PG_INT64_MAX / b) ||
         (a > 0 && b < 0 && b < PG_INT64_MIN / a) ||
         (a < 0 && b > 0 && a < PG_INT64_MIN / b) ||
         (a < 0 && b < 0 && a < PG_INT64_MAX / b)))
    {
        *result = 0x5EED;        /* to avoid spurious warnings */
        return true;
    }
    *result = a * b;
    return false;
pg_mul_s64_overflow:
        /*
         * Checking for unsigned overflow is simple, just check
         * if reversing the multiplication indicates that the
         * multiplication overflowed.
         */
        int64 res = a * b;
        if (a != 0 && b != res / a)
        {
        *result = 0x5EED;        /* to avoid spurious warnings */
        return true;
    }
    *result = res;
    return false;
The cases for addition/subtraction are even easier:
addition:
res = a + b;
if (res < a)
   /* overflow */
subtraction:
if (a < b)
   /* underflow */
res  = a - b;
Greetings,
Andres Freund
			
		
>>> - The str->integer conversion routines, which actually have very
>>> similar characteristics to the strtol families as they remove trailing
>>> whitespaces first, check for a sign, etc, except that they work only
>>> on base 10.  And here we get into a state where pg_scanint8 should be
>>> actually called pg_strtoint64,
>>
>> I just removed that:-)
>
> What do you mean by that?
That I renamed something from a previous patch version and someone is 
complaining that I did.
>>> with an interface inconsistent with its int32/int16 relatives now only
>>> in the backend.
>>
>> We can, but I'm not at ease with changing the error handling approach.
>
> Why?
If a function reports an error to log, it should keep on doing it, 
otherwise there would be a regression.
>>> Could we consider more consolidation here please?  Like moving the whole
>>> set to src/common/?
>>
>> My initial plan was simply to remove direct code duplications between
>> front-end and back-end, not to re-engineer the full set of string to int
>> conversion functions:-)
>
> Well, if you expose functions to more places - e.g. now the whole
> frontend - it becomes more important that they're reasonably designed.
I can somehow only agree with that. Note that the contraposite assumption 
that badly designed functions would be okay for backend seems doubtful.
>> On the re-engineering front: Given the various points on the thread, 
>> ISTM that there should probably be two behaviors for str to 
>> signed/unsigned int{16,32,64}, and having only one kind of signature 
>> for all types would be definitely better.
>
> I don't understand why we'd want to have different behaviours for
> signed/unsigned? Maybe I'm mis-understanding your sentence, and you just
> mean that there should be one that throws, and one that returns an
> errorcode?
Yep for the backend (if reporting an error generates a longjump), for the 
frontend there is no exception mechanism, so it is showing the error or 
not to stderr, and returning whether it was ok.
>> Another higher-level one which possibly adds an error message (stderr for
>> front-end, log for back-end).
>
> Is there actually any need for a non-backend one that has an
> error-message?
Pgbench uses it. If the function is shared and one is loging something, it 
looks ok to send to stderr for front-end?
>  I'm not convinced that in the frontend it's very useful to have such a 
> function that exits - in the backend we have a much more complete way to 
> handle that, including pointing to the right location in the query 
> strings etc.
Sure. There is not exit though, just messages to stderr and return false.
>> One choice is whether there are two functions (the higher one calling the
>> lower one and adding the messages) or just one with a boolean to trigger the
>> messages. I do not have a strong opinion. Maybe one function would be
>> simpler. Andres boolean-compatible enum return looks like a good option.
>
> The boolean makes the calling code harder to understand, the function
> slower,
Hmmm. So I understand that you would prefer 2 functions, one raw (fast) 
one and the other with the other with the better error reporting facity, 
and the user must chose the one they like. I'm fine with that as well.
> and the code harder to grep.
>> Overall, this leads to something like:
>>
>> enum { STRTOINT_OK, STRTOINT_OVERFLOW_ERROR, STRTOINT_SYNTAX_ERROR }
>>   pg_strto{,u}int{8?,16,32,64}
>>     (const char * string, const TYPE * result, const bool verbose);
>
> What's with hthe const for the result? I assume that was just a copy&pasto?
Yep. The pointer is constant, not the value pointed, maybe it should be 
"TYPE * const result" or something like that. Or no const at all on 
result.
-- 
Fabien.
			
		Hi, On 2019-07-17 22:59:01 +0000, Fabien COELHO wrote: > > > > with an interface inconsistent with its int32/int16 relatives now only > > > > in the backend. > > > > > > We can, but I'm not at ease with changing the error handling approach. > > > > Why? > > If a function reports an error to log, it should keep on doing it, otherwise > there would be a regression. Err, huh. Especially if we change the signature, I fail to see how it's a regression if we change the behaviour. > > > Another higher-level one which possibly adds an error message (stderr for > > > front-end, log for back-end). > > > > Is there actually any need for a non-backend one that has an > > error-message? > > Pgbench uses it. If the function is shared and one is loging something, it > looks ok to send to stderr for front-end? > > I'm not convinced that in the frontend it's very useful to have such a > > function that exits - in the backend we have a much more complete way to > > handle that, including pointing to the right location in the query > > strings etc. > > Sure. There is not exit though, just messages to stderr and return false. I think it's a seriously bad idea to have a function that returns depending in the error case depending on whether it's frontend or backend code. We shouldn't do stuff like that, it just leads to bugs. > > > One choice is whether there are two functions (the higher one calling the > > > lower one and adding the messages) or just one with a boolean to trigger the > > > messages. I do not have a strong opinion. Maybe one function would be > > > simpler. Andres boolean-compatible enum return looks like a good option. > > > > The boolean makes the calling code harder to understand, the function > > slower, > > Hmmm. So I understand that you would prefer 2 functions, one raw (fast) one > and the other with the other with the better error reporting facity, and the > user must chose the one they like. I'm fine with that as well. Well, the one with error reporting would use the former. Greetings, Andres Freund
On Wed, Jul 17, 2019 at 11:14:28AM -0700, Andres Freund wrote: > That'd be considerably slower, so I'm *strongly* against that. These > conversion routines are *really* hot in a number of workloads, > e.g. bulk-loading with COPY. Check e.g. > https://www.postgresql.org/message-id/20171208214437.qgn6zdltyq5hmjpk%40alap3.anarazel.de Thanks for the link. That makes sense! So stacking more function calls could also be an issue. Even if using static inline for the inner wrapper? That may sound like a stupid question but you have likely more experience than me regarding that with profiling. > I doubt it - it's not of that long-standing vintage (23a27b039d9, > 2016-03-12), and if so they are very likely to use base 10. We shouldn't > keep some barely tested function around, just for the hypothetical > scenario that some extension uses it. Especially if that function is > considerably slower than the potential replacement. Okay, I won't fight hard on that either. -- Michael
Вложения
Hi, On 2019-07-18 09:28:28 +0900, Michael Paquier wrote: > On Wed, Jul 17, 2019 at 11:14:28AM -0700, Andres Freund wrote: > > That'd be considerably slower, so I'm *strongly* against that. These > > conversion routines are *really* hot in a number of workloads, > > e.g. bulk-loading with COPY. Check e.g. > > https://www.postgresql.org/message-id/20171208214437.qgn6zdltyq5hmjpk%40alap3.anarazel.de > > Thanks for the link. That makes sense! So stacking more function > calls could also be an issue. Even if using static inline for the > inner wrapper? That may sound like a stupid question but you have > likely more experience than me regarding that with profiling. A static inline would be fine, depending on how you do that. I'm not quite sure what you mean with "inner wrapper" - isn't a wrapper normally outside? I'd probably do something like static inline int64 strtoint64(const char *str) { int64 res; e = strtoint64_e(str, &res); if (likely(e == STRTOINT_OK)) return res; else { report_strtoint_error(str, e, "int64"); return 0; /* pacify compiler */ } } and then have one non-inline report_strtoint_error() shared across the various functions. Even leaving code-duplication aside, not having the elog call itself in the inline function is nice, as that's quite a few instructions. Greetings, Andres Freund
Hello Andres,
>> If a function reports an error to log, it should keep on doing it, otherwise
>> there would be a regression.
>
> Err, huh. Especially if we change the signature, I fail to see how it's
> a regression if we change the behaviour.
ISTM that we do not understand one another, because I'm only trying to 
state the obvious. Currently error messages on overflow or syntax error 
are displayed. I just mean that such messages must keep on being emitted 
somehow otherwise there would be a regression.
   sql> SELECT INT8 '12345678901234567890';
   ERROR:  value "12345678901234567890" is out of range for type bigint
   LINE 1: SELECT INT8 '12345678901234567890';
>> Sure. There is not exit though, just messages to stderr and return false.
>
> I think it's a seriously bad idea to have a function that returns
> depending in the error case depending on whether it's frontend or
> backend code.  We shouldn't do stuff like that, it just leads to bugs.
Ok, you really want two functions and getting read of the errorOK boolean.
I got that. The scanint8 already exists with its ereport ERROR vs return 
based on a boolean, so the point is to get read of this kind of signature.
>>> The boolean makes the calling code harder to understand, the function
>>> slower,
>>
>> Hmmm. So I understand that you would prefer 2 functions, one raw (fast) one
>> and the other with the other with the better error reporting facity, and the
>> user must chose the one they like. I'm fine with that as well.
>
> Well, the one with error reporting would use the former.
Yep, possibly two functions, no boolean.
Having a common function will not escape the fact that there is no 
exception mechanism for front-end, and none seems desirable just for 
parsing ints. So if you want NOT to have a same function with return 
something vs raises an error, that would make three functions.
One basic int parsing in the fe/be common part.
   typedef enum { STRTOINT_OK, STRTOINT_OVERFLOW, STRTOINT_SYNTAX_ERROR }
     strtoint_error;
   strtoint_error pg_strtoint64(const char * str, int64 * result);
Then for backend, probably in backend somewhere:
   // raises an exception on errors
   void pg_strtoint64_log(const char * str, int64 * result) {
     switch (pg_strtoint64) {
       case STRTOINT_OK: return;
       case STRTOINT...: ereport(ERROR, ...);
     }
   }
And for frontend, only in frontend somewhere:
   // print to stderr and return false on error
   bool pg_strtoint64_err(const char * str, int64 * result);
I'm unhappy with the function names though, feel free to improve.
-- 
Fabien.
			
		On Thu, Jul 18, 2019 at 07:57:41AM +0000, Fabien COELHO wrote:
> I'm unhappy with the function names though, feel free to improve.
I would have something rather close to what you are suggesting, still
not exactly that because we just don't care about the error strings
generated for the frontend.  And my bet is that each frontend would
like to have their own error message depending on the error case.
FWIW, I had a similar experience with pg_strong_random() not so long
ago, which required a backend-specific handling because the fallback
random implementation needed some tweaks at postmaster startup (that's
why we have an alias pg_backend_random in include/port.h).  So I would
recommend the following, roughly:
- One set of functions in src/port/ which return the status code for
the error handling, without any error reporting in it to avoid any
ifdef FRONTEND business, which have a generic name pg_strto[u]intXX
(XX = {16,32,64}).  And have all that in a new, separate file, say
strtoint.c?
- One set of functions for the backend, called pg_stro[u]intXX_backend
or pg_backend_stro[u]intXX which can take as extra argument error_ok,
calling the portions in src/port/, and move those functions in a new
file prefixed with "backend_" in src/backend/utils/misc/ with a name
consistent with the one in src/port/ (with the previous naming that
would be backend_strtoint.c)
- We also need the unsigned-specific equivalents of
pg_mul_s64_overflow and such, so I would suggest putting that in a new
header, simply uint.h.  If I finish by committing this stuff, I would
handle that in a separate commit.
--
Michael
			
		Вложения
Hi,
On 2019-07-19 12:21:27 +0900, Michael Paquier wrote:
> On Thu, Jul 18, 2019 at 07:57:41AM +0000, Fabien COELHO wrote:
> > I'm unhappy with the function names though, feel free to improve.
> 
> I would have something rather close to what you are suggesting, still
> not exactly that because we just don't care about the error strings
> generated for the frontend.  And my bet is that each frontend would
> like to have their own error message depending on the error case.
Yea, the error messages pgbench is currently generating, for example,
don't make a lot of sense.
> FWIW, I had a similar experience with pg_strong_random() not so long
> ago, which required a backend-specific handling because the fallback
> random implementation needed some tweaks at postmaster startup (that's
> why we have an alias pg_backend_random in include/port.h).  So I would
> recommend the following, roughly:
> - One set of functions in src/port/ which return the status code for
> the error handling, without any error reporting in it to avoid any
> ifdef FRONTEND business, which have a generic name pg_strto[u]intXX
> (XX = {16,32,64}).  And have all that in a new, separate file, say
> strtoint.c?
Why not common? It's not a platform dependent bit. Could even be put
into the already existing string.c.
> - One set of functions for the backend, called pg_stro[u]intXX_backend
> or pg_backend_stro[u]intXX which can take as extra argument error_ok,
> calling the portions in src/port/, and move those functions in a new
> file prefixed with "backend_" in src/backend/utils/misc/ with a name
> consistent with the one in src/port/ (with the previous naming that
> would be backend_strtoint.c)
I'm not following. What would be the point of any of this?  The error_ok
bit is unnecessary, because the function is exactly the same as the
generic function.  And the backend_ prefix would be pretty darn weird,
given that that's already below src/backend.
> - We also need the unsigned-specific equivalents of
> pg_mul_s64_overflow and such, so I would suggest putting that in a new
> header, simply uint.h.  If I finish by committing this stuff, I would
> handle that in a separate commit.
Why not the same header? I fail to see what we'd gain by splitting it
up.
Greetings,
Andres Freund
			
		On Thu, Jul 18, 2019 at 09:16:22PM -0700, Andres Freund wrote: > On 2019-07-19 12:21:27 +0900, Michael Paquier wrote: > Why not common? It's not a platform dependent bit. Could even be put > into the already existing string.c. That would be fine to me, it is not like this file is bloated now. >> - One set of functions for the backend, called pg_stro[u]intXX_backend >> or pg_backend_stro[u]intXX which can take as extra argument error_ok, >> calling the portions in src/port/, and move those functions in a new >> file prefixed with "backend_" in src/backend/utils/misc/ with a name >> consistent with the one in src/port/ (with the previous naming that >> would be backend_strtoint.c) > > I'm not following. What would be the point of any of this? The error_ok > bit is unnecessary, because the function is exactly the same as the > generic function. And the backend_ prefix would be pretty darn weird, > given that that's already below src/backend. Do you have a better idea of name for those functions? >> - We also need the unsigned-specific equivalents of >> pg_mul_s64_overflow and such, so I would suggest putting that in a new >> header, simply uint.h. If I finish by committing this stuff, I would >> handle that in a separate commit. > > Why not the same header? I fail to see what we'd gain by splitting it > up. No objections to that at the end. Fabien, are you planning to send an updated patch? This stuff has value. -- Michael
Вложения
Bonjour Michaël, > Fabien, are you planning to send an updated patch? This stuff has > value. Yep, I will try for this week. -- Fabien.
Hi,
On 2019-07-29 10:48:41 +0900, Michael Paquier wrote:
> On Thu, Jul 18, 2019 at 09:16:22PM -0700, Andres Freund wrote:
> >> - One set of functions for the backend, called pg_stro[u]intXX_backend
> >> or pg_backend_stro[u]intXX which can take as extra argument error_ok,
> >> calling the portions in src/port/, and move those functions in a new
> >> file prefixed with "backend_" in src/backend/utils/misc/ with a name
> >> consistent with the one in src/port/ (with the previous naming that
> >> would be backend_strtoint.c)
> > 
> > I'm not following. What would be the point of any of this?  The error_ok
> > bit is unnecessary, because the function is exactly the same as the
> > generic function.  And the backend_ prefix would be pretty darn weird,
> > given that that's already below src/backend.
> 
> Do you have a better idea of name for those functions?
I don't understand why they need any prefix. pg_strto[u]int{32,64}{,_checked}.
The unchecked variant would be for both frontend backend. The checked one
either for both frontend/backend, or just for backend. I also could live with
_raises, _throws or such instead of _checked. Implement all of them in one
file in /common/, possibly hidin gthe ones not currently implemented for the
frontend.
Imo if _checked is implemented for both frontend/backend they'd need
different error messages. In my opinion
out_of_range:
    if (!errorOK)
        fprintf(stderr,
                "value \"%s\" is out of range for type bigint\n", str);
    return false;
invalid_syntax:
    if (!errorOK)
        fprintf(stderr,
                "invalid input syntax for type bigint: \"%s\"\n", str);
is unsuitable for generic code. In fact, I'm doubtful that it's applicable for
any use, except for int8in(), which makes me think it better ought to use the
a non-checked variant, and include the errors directly.  If we still want to
have _checked - which is reasonable imo - it should refer to 64bit integers or somesuch.
Greetings,
Andres Freund
			
		On Mon, Jul 29, 2019 at 07:04:09AM +0200, Fabien COELHO wrote: > Bonjour Michaël, > Yep, I will try for this week. Please note that for now I have marked the patch as returned with feedback as the CF is ending. -- Michael
Вложения
Bonjour Michaël, >> Yep, I will try for this week. > > Please note that for now I have marked the patch as returned with > feedback as the CF is ending. Ok. I have looked quickly at it, but I'm not sure that there is an agreement about what should be done precisely, so the feedback is not clearly actionable. -- Fabien.
On Thu, Aug 01, 2019 at 09:00:41AM +0200, Fabien COELHO wrote: > I have looked quickly at it, but I'm not sure that there is an agreement > about what should be done precisely, so the feedback is not clearly > actionable. Per the latest trends, it seems that the input of Andres was kind of the most interesting pieces. If you don't have room for it, I would not mind doing the legwork myself. -- Michael
Вложения
Michaël-san,
>> I have looked quickly at it, but I'm not sure that there is an agreement
>> about what should be done precisely, so the feedback is not clearly
>> actionable.
>
> Per the latest trends, it seems that the input of Andres was kind of
> the most interesting pieces.
Yes, definitely. I understood that we want in "string.h" something like
(just the spirit):
   typedef enum {
     STRTOINT_OK, STRTOINT_RANGE_ERROR, STRTOINT_SYNTAX_ERROR
   } strtoint_status;
   strtoint_status pg_strtoint64(const char * str, int64 * result);
However there is a contrary objective to have a unified interface,
but there also exists a:
   extern uint64 pg_strtouint64(const char *str, char **endptr, int base);
called 3 times, always with base == 10. We have a similar name but a 
totally different interface, so basically it would have to be replaced
by something like the first interface.
> If you don't have room for it, I would not mind doing the legwork 
> myself.
I think that it would be quick if the what is clear enough, so I can do 
it.
-- 
Fabien.
			
		On Thu, Aug 01, 2019 at 11:34:34AM +0200, Fabien COELHO wrote: > However there is a contrary objective to have a unified interface, > but there also exists a: > > extern uint64 pg_strtouint64(const char *str, char **endptr, int base); > > called 3 times, always with base == 10. We have a similar name but a totally > different interface, so basically it would have to be replaced > by something like the first interface. My understanding on this one was to nuke the base argument and unify the interface with our own, faster routines: https://www.postgresql.org/message-id/20190716201838.rwrd7xzbrybq7dop%40alap3.anarazel.de -- Michael
Вложения
>> extern uint64 pg_strtouint64(const char *str, char **endptr, int base); >> >> called 3 times, always with base == 10. We have a similar name but a totally >> different interface, so basically it would have to be replaced >> by something like the first interface. > > My understanding on this one was to nuke the base argument and unify > the interface with our own, faster routines: > https://www.postgresql.org/message-id/20190716201838.rwrd7xzbrybq7dop%40alap3.anarazel.de Ok, so there is an agreement on reworking the unsigned function. I missed this bit. So I'll set out to write and use "pg_strtou?int64", i.e. 2 functions, and then possibly generalize to lower sizes, 32, 16, depending on what is actually needed. -- Fabien.
Hi Fabien, On Thu, Aug 01, 2019 at 04:48:35PM +0200, Fabien COELHO wrote: > Ok, so there is an agreement on reworking the unsigned function. I missed > this bit. > > So I'll set out to write and use "pg_strtou?int64", i.e. 2 functions, and > then possibly generalize to lower sizes, 32, 16, depending on what is > actually needed. I am interested in this patch, and the next commit fest is close by. Are you working on an updated version? If not, do you mind if I work on it and post a new version by the beginning of next week based on all the feedback gathered? -- Michael
Вложения
Bonjour Michaël, >> So I'll set out to write and use "pg_strtou?int64", i.e. 2 functions, and >> then possibly generalize to lower sizes, 32, 16, depending on what is >> actually needed. > > I am interested in this patch, and the next commit fest is close by. > Are you working on an updated version? If not, do you mind if I work > on it and post a new version by the beginning of next week based on > all the feedback gathered? I have started to do something, and I can spend some time on that this week, but I'm pretty unclear about what exactly should be done. The error returning stuff is simple enough, but I'm unclear about what to do with pg_uint64, which has a totally different signature. Should it be aligned? -- Fabien Coelho - CRI, MINES ParisTech
On Mon, Aug 26, 2019 at 11:05:55AM +0200, Fabien COELHO wrote: > I have started to do something, and I can spend some time on that this week, > but I'm pretty unclear about what exactly should be done. Thanks. > The error returning stuff is simple enough, but I'm unclear about what to do > with pg_uint64, which has a totally different signature. Should it be > aligned? I am not sure what you mean with aligned here. If you mean consistent, getting into a state where we have all functions for all three sizes, signed and unsigned, would be nice. -- Michael
Вложения
Bonjour Michaël, >> The error returning stuff is simple enough, but I'm unclear about what to do >> with pg_uint64, which has a totally different signature. Should it be >> aligned? > > I am not sure what you mean with aligned here. I meant same signature. > If you mean consistent, getting into a state where we have all functions > for all three sizes, signed and unsigned, would be nice. Ok, I look into it. -- Fabien.
Bonjour Michaël, >> So I'll set out to write and use "pg_strtou?int64", i.e. 2 functions, and >> then possibly generalize to lower sizes, 32, 16, depending on what is >> actually needed. > > I am interested in this patch, and the next commit fest is close by. > Are you working on an updated version? If not, do you mind if I work > on it and post a new version by the beginning of next week based on > all the feedback gathered? Here is an updated patch for the u?int64 conversion functions. I have taken the liberty to optimize the existing int64 function by removing spurious tests. I have not created uint64 specific inlined overflow functions. If it looks ok, a separate patch could address the 32 & 16 versions. -- Fabien.
Вложения
On Wed, Aug 28, 2019 at 08:51:29AM +0200, Fabien COELHO wrote: > Here is an updated patch for the u?int64 conversion functions. Thanks! > I have taken the liberty to optimize the existing int64 function by removing > spurious tests. Which are? > I have not created uint64 specific inlined overflow functions. Why? There is a comment below ;p > If it looks ok, a separate patch could address the 32 & 16 versions. I am surprised to see a negative diff actually just by doing that (adding the 32 and 16 parts will add much more code of course). At quick glance, I think that this is on the right track. Some comments I have on the top of my mind: - It would me good to have the unsigned equivalents of pg_mul_s64_overflow, etc. These are simple enough, and per the feedback from Andres they could live in common/int.h. - It is more consistent to use upper-case statuses in the enum strtoint_status. Could it be renamed to pg_strtoint_status? -- Michael
Вложения
Michaël, >> I have taken the liberty to optimize the existing int64 function by removing >> spurious tests. > > Which are? - *ptr && WHATEVER(*ptr) *ptr is redundant, WHATEVER yields false on '\0', and it costs on each char but at the end. It might be debatable in some places, e.g. it is likely that there are no spaces in the string, but likely that there are more than one digit. If you want all/some *ptr added back, no problem. - isdigit repeated on if and following while, used if/do-while instead. >> I have not created uint64 specific inlined overflow functions. > > Why? There is a comment below ;p See comment about comment below:-) >> If it looks ok, a separate patch could address the 32 & 16 versions. > > I am surprised to see a negative diff Is it? Long duplicate functions are factored out (this was my initial intent), one file is removed… > actually just by doing that (adding the 32 and 16 parts will add much > more code of course). At quick glance, I think that this is on the > right track. Some comments I have on the top of my mind: > - It would me good to have the unsigned equivalents of > pg_mul_s64_overflow, etc. These are simple enough, Hmmm. Have you looked at the fallback cases when the corresponding builtins are not available? I'm unsure of a reliable way to detect a generic unsigned int overflow without expensive dividing back and having to care about zero… So I was pretty happy with my two discreet, small and efficient tests. > and per the feedback from Andres they could live in common/int.h. Could be, however "string.c" already contains a string to int conversion function, so I put them together. Probably this function should be removed in the end, though. > - It is more consistent to use upper-case statuses in the enum > strtoint_status. I thought of that, but first enum I found used lower case, so it did not seem obvious that pg style was to use upper case. Indeed, some enum constants use upper cases. > Could it be renamed to pg_strtoint_status? Sure. I also prefixed the enum constants for consistency. Attached patch uses a prefix and uppers constants. Waiting for further input about other comments. -- Fabien.
Вложения
On Wed, Aug 28, 2019 at 09:50:44AM +0200, Fabien COELHO wrote: > - *ptr && WHATEVER(*ptr) > *ptr is redundant, WHATEVER yields false on '\0', and it costs on each > char but at the end. It might be debatable in some places, e.g. it is > likely that there are no spaces in the string, but likely that there are > more than one digit. Still this makes the checks less robust? > If you want all/some *ptr added back, no problem. > > - isdigit repeated on if and following while, used if/do-while instead. I see, you don't check twice if the first character is a digit this way. > Hmmm. Have you looked at the fallback cases when the corresponding builtins > are not available? > > I'm unsure of a reliable way to detect a generic unsigned int overflow > without expensive dividing back and having to care about zero… Mr Freund has mentioned that here: https://www.postgresql.org/message-id/20190717184820.iqz7schxdbucmdmu@alap3.anarazel.de > So I was pretty happy with my two discreet, small and efficient tests. That's also a matter of code and interface consistency IMHO. -- Michael
Вложения
Bonjour Michaël,
>>  - *ptr && WHATEVER(*ptr)
>>   *ptr is redundant, WHATEVER yields false on '\0', and it costs on each
>>   char but at the end. It might be debatable in some places, e.g. it is
>>   likely that there are no spaces in the string, but likely that there are
>>   more than one digit.
>
> Still this makes the checks less robust?
I do not see any downside, but maybe I lack imagination.
On my laptop isWHATEVER is implementd through an array mapping characters 
to a bitfield saying whether each char is WHATEVER, depending on the bit. 
This array is well defined for index 0 ('\0').
If an implementation is based on comparisons, for isdigit it would be:
    c >= '0' && c <= '9'
Then checking c != '\0' is redundant with c >= '0'.
Given the way the character checks are used in the function, we do not go 
beyond the end of the string because we only proceed further when a 
character is actually recognized, else we return.
So I cannot see any robustness issue, just a few cycles saved.
>> Hmmm. Have you looked at the fallback cases when the corresponding builtins
>> are not available?
>>
>> I'm unsure of a reliable way to detect a generic unsigned int overflow
>> without expensive dividing back and having to care about zero…
>
> Mr Freund has mentioned that here:
> https://www.postgresql.org/message-id/20190717184820.iqz7schxdbucmdmu@alap3.anarazel.de
Yep, that is what I mean by expensive: there is an integer division, which 
is avoided if b is known to be 10, hence is not zero and the limit value 
can be checked directly on the input without having to perform a division 
each time.
>> So I was pretty happy with my two discreet, small and efficient tests.
>
> That's also a matter of code and interface consistency IMHO.
Possibly.
ISTM that part of the motivation is to reduce costs for heavily used 
conversions, eg on COPY. Function "scanf" is overly expensive because it 
has to interpret its format, and we have to check for overflows…
Anyway, if we assume that the builtins exist and rely on efficient 
hardware check, maybe we do not care about the fallback cases which would 
just be slow but never executed.
Note that all this is moot, as all instances of string to uint64 
conversion do not check for errors.
Attached v7 does create uint64 overflow inline functions. The stuff yet is 
not moved to "common/int.c", a file which does not exists, even if 
"common/int.h" does.
-- 
Fabien.
			
		Вложения
On Thu, Aug 29, 2019 at 08:14:54AM +0200, Fabien COELHO wrote: > Attached v7 does create uint64 overflow inline functions. The stuff yet is > not moved to "common/int.c", a file which does not exists, even if > "common/int.h" does. Thanks for the new version. I have begun reviewing your patch, and attached is a first cut that I would like to commit separately which adds all the compatibility overflow routines to int.h for uint16, uint32 and uint64 with all the fallback implementations (int128-based method added as well if available). I have also grouped at the top of the file the comments about each routine's return policy to avoid duplication. For the fallback implementations of uint64 using int128, I think that it is cleaner to use uint128 so as there is no need to check after negative results for multiplications, and I have made the various expressions consistent for each size. Attached is a small module called "overflow" with various regression tests that I used to check each implementation. I don't propose that for commit as I am not sure if that's worth the extra CPU, so let's consider it as a toy for now. What do you think? -- Michael
Вложения
Michaël, > attached is a first cut that I would like to commit separately which > adds all the compatibility overflow routines to int.h for uint16, uint32 > and uint64 with all the fallback implementations (int128-based method > added as well if available). I have also grouped at the top of the file > the comments about each routine's return policy to avoid duplication. > For the fallback implementations of uint64 using int128, I think that it > is cleaner to use uint128 so as there is no need to check after negative > results for multiplications, and I have made the various expressions > consistent for each size. Patch applies cleanly, compiles, "make check" ok, but the added functions are not used (yet). I think that factoring out comments is a good move. For symmetry and efficiency, ISTM that uint16 mul overflow could use uint32 and uint32 could use uint64, and the division-based method be dropped in these cases. Maybe I would add a comment before each new section telling about the type, eg: /* * UINT16 */ add/sub/mul uint16 functions… I would tend to commit working solutions per type rather than by installment, so that at least all added functions are actually used somewhere, but it does not matter much, really. I was unsure that having int128 implies uint128 availability, so I did not use it. The checking extension is funny, but ISTM that these checks should be (are already?) included in some standard sql test, which will test the macros from direct SQL operations: fabien=# SELECT INT2 '1234512434343'; ERROR: value "1234512434343" is out of range for type smallint Well, a quick look at "src/test/regress/sql/int2.sql" suggests that there the existing tests should be extended… :-( -- Fabien.
On Fri, Aug 30, 2019 at 10:06:11AM +0200, Fabien COELHO wrote: > Patch applies cleanly, compiles, "make check" ok, but the added functions > are not used (yet). Thanks. > I think that factoring out comments is a good move. > > For symmetry and efficiency, ISTM that uint16 mul overflow could use uint32 > and uint32 could use uint64, and the division-based method be dropped in > these cases. Yes, the division would be worse than the other. What do you think about using the previous module I sent and measure how long it takes to evaluate the overflows in some specific cases N times in loops? > Maybe I would add a comment before each new section telling about the type, > eg: > > /* > * UINT16 > */ > add/sub/mul uint16 functions. Let's do as you suggest here. > I would tend to commit working solutions per type rather than by > installment, so that at least all added functions are actually used > somewhere, but it does not matter much, really. I prefer by section, with testing dedicated to each part properly done so as we can move to the next parts. > I was unsure that having int128 implies uint128 availability, so I did not > use it. The recent Ryu-floating point work has begun using them (see f2s.c and d2s.c). > The checking extension is funny, but ISTM that these checks should be (are > already?) included in some standard sql test, which will test the macros > from direct SQL operations: Sure. But not for the unsigned part as there are no equivalent in-core data types, still it is possible to trick things with signed integer arguments. I found my toy useful to check test all implementations consistently. > fabien=# SELECT INT2 '1234512434343'; > ERROR: value "1234512434343" is out of range for type smallint > > Well, a quick look at "src/test/regress/sql/int2.sql" suggests that > there the existing tests should be extended… :-( We can tackle that separately. -32768 is perfectly legal for smallint, and the test is wrong here. -- Michael
Вложения
Michaël, >> For symmetry and efficiency, ISTM that uint16 mul overflow could use uint32 >> and uint32 could use uint64, and the division-based method be dropped in >> these cases. > > Yes, the division would be worse than the other. What do you think > about using the previous module I sent and measure how long it takes > to evaluate the overflows in some specific cases N times in loops? Given the overheads of the SQL interpreter, I'm unsure about what you would measure at the SQL level. You could just write a small standalone C program to test perf and accuracy. Maybe this is what you have in mind. >> I would tend to commit working solutions per type rather than by >> installment, so that at least all added functions are actually used >> somewhere, but it does not matter much, really. > > I prefer by section, with testing dedicated to each part properly > done so as we can move to the next parts. This suggests that you will test twice: once when adding the inlined functions, once when calling from SQL. >> The checking extension is funny, but ISTM that these checks should be (are >> already?) included in some standard sql test, which will test the macros >> from direct SQL operations: > > Sure. But not for the unsigned part as there are no equivalent > in-core data types, Yep, it bothered me sometimes, but not enough to propose to add them. > still it is possible to trick things with signed integer arguments. Is it? > I found my toy useful to check test all implementations consistently. Ok. >> fabien=# SELECT INT2 '1234512434343'; >> ERROR: value "1234512434343" is out of range for type smallint >> >> Well, a quick look at "src/test/regress/sql/int2.sql" suggests that >> there the existing tests should be extended… :-( > > We can tackle that separately. -32768 is perfectly legal for > smallint, and the test is wrong here. Do you mean: sql> SELECT -32768::INT2; ERROR: smallint out of range This is not a negative constant but the reverse of a positive, which is indeed out of range, although the error message could help more. sql> SELECT (-32768)::INT2; -32768 # ok sql> SELECT INT2 '-32768'; -32768 # ok -- Fabien.
On Fri, Aug 30, 2019 at 04:50:21PM +0200, Fabien COELHO wrote:
> Given the overheads of the SQL interpreter, I'm unsure about what you would
> measure at the SQL level. You could just write a small standalone C program
> to test perf and accuracy. Maybe this is what you have in mind.
After a certain threshold, you can see the difference anyway by paying
once the overhead of the function.  See for example the updated module
attached that I used for my tests.
I have been testing the various implementations, and doing 2B
iterations leads to roughly the following with a non-assert, -O2
build using mul_u32:
- __builtin_sub_overflow => 5s
- cast to uint64 => 5.9s
- division => 8s
You are right as well that having symmetry with the signed methods is
much better.  In order to see the difference, you can just do that
with the extension attached, after of course hijacking int.h with some
undefs and recompiling the backend and the module:
select pg_overflow_check(10000, 10000, 2000000000, 'uint32', 'mul');
>> still it is possible to trick things with signed integer arguments.
>
> Is it?
If you pass -1 and then you can fall back to the maximum of each 16,
32 or 64 bits for the unsigned (see the regression tests I added with
the module).
> Do you mean:
>
>  sql> SELECT -32768::INT2;
>  ERROR:  smallint out of range
You are incorrect here, as the minus sign is ignored by the cast.
This works though:
=# SELECT (-32768)::INT2;
  int2
--------
 -32768
(1 row)
If you look at int2.sql, we do that:
-- largest and smallest values
INSERT INTO INT2_TBL(f1) VALUES ('32767');
INSERT INTO INT2_TBL(f1) VALUES ('-32767');
That's the part I mean is wrong, as the minimum is actually -32768,
but the test fails to consider that.  I'll go fix that after
double-checking other similar tests for int4 and int8.
Attached is an updated patch to complete the work for common/int.h,
with the following changes:
- Changed the multiplication methods for uint16 and uint32 to not be
division-based.  uint64 can use that only if int128 exists.
- Added comments on top of each sub-sections for the types checked.
Attached is also an updated version of the module I used to validate
this stuff.  Fabien, any thoughts?
--
Michael
			
		Вложения
Bonjour Michaël, > You are right as well that having symmetry with the signed methods is > much better. In order to see the difference, you can just do that with > the extension attached, after of course hijacking int.h with some undefs > and recompiling the backend and the module: select > pg_overflow_check(10000, 10000, 2000000000, 'uint32', 'mul'); Ok. >>> still it is possible to trick things with signed integer arguments. >> >> Is it? > > If you pass -1 and then you can fall back to the maximum of each 16, > 32 or 64 bits for the unsigned (see the regression tests I added with > the module). > Attached is also an updated version of the module I used to validate > this stuff. Fabien, any thoughts? Patch apply cleanly, compiles, "make check" ok (although changes are untested). I would put back unlikely() on overflow tests, as there are indeed unlikely to occur and it may help some compilers, and cannot be harmful. It also helps the code reader to know that these path are not expected to be taken often. On reflection, I'm not sure that add_u64 and sub_u64 overflow with uint128 are useful. The res < a or b > a tricks should suffice, just like for u16 and u32 cases, and it may cost a little less anyway. I would suggest keep the overflow extension as "contrib/overflow_test". For mul tests, I'd suggest not to try only min/max values like add/sub, but also "standard" multiplications that overflow or not. It would be good if "make check" could be made to work", for some reason it requires "installcheck". I could not test performance directly, loops are optimized out by the compiler. I added "volatile" on input value declarations to work around that. On 2B iterations I got on my laptop: int16: mul = 2770 ms, add = 1830 ms, sub = 1826 ms int32: mul = 1838 ms, add = 1835 ms, sub = 1840 ms int64: mul = 1836 ms, add = 1834 ms, sub = 1833 ms uint16: mul = 3670 ms, add = 1830 ms, sub = 2148 ms uint32: mul = 2438 ms, add = 1834 ms, sub = 1831 ms uint64: mul = 2139 ms, add = 1841 ms, sub = 1882 ms Why int16 mul, uint* mul and uint16 sub are bad is unclear. With fallback code triggered with: #undef HAVE__BUILTIN_OP_OVERFLOW int16: mul = 1433 ms, add = 1424 ms, sub = 1254 ms int32: mul = 1433 ms, add = 1425 ms, sub = 1443 ms int64: mul = 1430 ms, add = 1429 ms, sub = 1441 ms uint16: mul = 1445 ms, add = 1291 ms, sub = 1265 ms uint32: mul = 1419 ms, add = 1434 ms, sub = 1493 ms uint64: mul = 1266 ms, add = 1430 ms, sub = 1440 ms For some unclear reason, 4 tests are significantly faster. Forcing further down fallback code with: #undef HAVE_INT128 int64: mul = 1424 ms, add = 1429 ms, sub = 1440 ms uint64: mul = 24145 ms, add = 1434 ms, sub = 1435 ms There is no doubt that dividing 64 bits integers is a very bad idea, at least on my architecture! Note that checks depends on value, so actual performance may vary depending on actual val1 and val2 passed. I used 10000 10000 like your example. These results are definitely depressing because the fallback code is nearly twice as fast as the builtin overflow detection version. For the record: gcc 7.4.0 on ubuntu 18.04 LTS. Not sure what to advise, relying on the builtin should be the better idea… -- Fabien.
On Sun, Sep 01, 2019 at 01:57:06PM +0200, Fabien COELHO wrote: > I would put back unlikely() on overflow tests, as there are indeed unlikely > to occur and it may help some compilers, and cannot be harmful. It also > helps the code reader to know that these path are not expected to be taken > often. Hm. I don't agree about that part, and the original signed portions don't do that. I think that we should let the callers of the routines decide if a problem is unlikely to happen or not as we do now. > On reflection, I'm not sure that add_u64 and sub_u64 overflow with uint128 > are useful. The res < a or b > a tricks should suffice, just like for u16 > and u32 cases, and it may cost a little less anyway. Actually, I agree and this is something I can see as well with some extra measurements. mul_u64 without int128 is twice slower, while add_u64 and sub_u64 are 15~20% faster. > I would suggest keep the overflow extension as "contrib/overflow_test". For > mul tests, I'd suggest not to try only min/max values like add/sub, but also > "standard" multiplications that overflow or not. It would be good if "make > check" could be made to work", for some reason it requires "installcheck". Any extensions out of core can only work with "installcheck", and check is not supported (see pgxs.mk). I am still not convinced that this module is worth the extra cycles to justify its existence though. > There is no doubt that dividing 64 bits integers is a very bad idea, at > least on my architecture! That's surprising. I cannot reproduce that. Are you sure that you didn't just undefine HAVE_INT128? This would cause HAVE__BUILTIN_OP_OVERFLOW to still be active in all the code paths. Here are a couple of results from my side with this query, FWIW, and some numbers for all the compile flags (-O2 used): select pg_overflow_check(10000, 10000, 2000000000, 'XXX', 'XXX'); 1) uint16: 1-1) mul: - built-in: 5.5s - fallback: 5.5s 1-2) sub: - built-in: 5.3s - fallback: 5.4s 1-3) add: - built-in: 5.3s - fallback: 6.2s 2) uint32: 2-1) mul: - built-in: 5.1s - fallback: 5.9s 2-2) sub: - built-in: 5.2s - fallback: 5.4s 2-3) add: - built-in: 5.2s - fallback: 6.2s 2) uint64: 2-1) mul: - built-in: 5.1s - fallback (with uint128): 8.0s - fallback (without uint128): 18.1s 2-2) sub: - built-in: 5.2s - fallback (with uint128): 7.1s - fallback (without uint128): 5.5s 2-3) add: - built-in: 5.2s - fallback (with uint128): 7.1s - fallback (without uint128): 6.3s So, the built-in option is always faster, and keeping the int128 path if available for the multiplication makes sense, but not for the subtraction and the addition. I am wondering if we should review further the signed part for add and sub, but I'd rather not touch it in this patch. > Note that checks depends on value, so actual performance may vary depending > on actual val1 and val2 passed. I used 10000 10000 like your example. Sure. Still that offers helpful hints as we do the same operations for all code paths the same number of times. If you have done any changes on my previous patch, or if you have a script to share I could use to attempt to reproduce your results, I would be happy to do so. So, do you have more comments? -- Michael
Вложения
Michaël, >> I would put back unlikely() on overflow tests, as there are indeed unlikely >> to occur and it may help some compilers, and cannot be harmful. It also >> helps the code reader to know that these path are not expected to be taken >> often. > > Hm. I don't agree about that part, and the original signed portions > don't do that. I think that we should let the callers of the routines > decide if a problem is unlikely to happen or not as we do now. Hmmm. Maybe inlining propates them, but otherwise they make sense for a compiler perspective. > I am still not convinced that this module is worth the extra cycles to > justify its existence though. They allow to quickly do performance tests, for me it is useful to keep it around, but you are the committer, you do as you feel. >> [...] >> There is no doubt that dividing 64 bits integers is a very bad idea, at >> least on my architecture! > > That's surprising. I cannot reproduce that. It seems to me that somehow you can, you have a 5 to 18 seconds drop below. There are actual reasons why some processors are more expensive than others, it is not just marketing:-) > 2-1) mul: > - built-in: 5.1s > - fallback (with uint128): 8.0s > - fallback (without uint128): 18.1s > So, the built-in option is always faster, and keeping the int128 path > if available for the multiplication makes sense, but not for the > subtraction and the addition. Yep. > I am wondering if we should review further the signed part for add and > sub, but I'd rather not touch it in this patch. The signed overflows are trickier even, I have not paid attention to the fallback code. I agree that it is better left untouched for know. > If you have done any changes on my previous patch, or if you have a > script to share I could use to attempt to reproduce your results, I > would be happy to do so. Hmmm. I did manual tests really. Attached a psql script replicating them. # with builtin overflow detection sh> psql < oc.sql NOTICE: int 16 mul: 00:00:02.747269 # slow NOTICE: int 16 add: 00:00:01.83281 NOTICE: int 16 sub: 00:00:01.8501 NOTICE: uint 16 mul: 00:00:03.68362 # slower NOTICE: uint 16 add: 00:00:01.835294 NOTICE: uint 16 sub: 00:00:02.136895 # slow NOTICE: int 32 mul: 00:00:01.828065 NOTICE: int 32 add: 00:00:01.840269 NOTICE: int 32 sub: 00:00:01.843557 NOTICE: uint 32 mul: 00:00:02.447052 # slow NOTICE: uint 32 add: 00:00:01.849899 NOTICE: uint 32 sub: 00:00:01.840773 NOTICE: int 64 mul: 00:00:01.839051 NOTICE: int 64 add: 00:00:01.839065 NOTICE: int 64 sub: 00:00:01.838599 NOTICE: uint 64 mul: 00:00:02.161346 # slow NOTICE: uint 64 add: 00:00:01.839404 NOTICE: uint 64 sub: 00:00:01.838549 DO > So, do you have more comments? No other comments. -- Fabien.
Вложения
On Sun, Sep 01, 2019 at 08:07:06PM +0200, Fabien COELHO wrote: > They allow to quickly do performance tests, for me it is useful to keep it > around, but you are the committer, you do as you feel. If there are more voices for having that in core, we could consider it. For now I have added that into my own plugin repository with all the functions discussed on this thread: https://github.com/michaelpq/pg_plugins/ > The signed overflows are trickier even, I have not paid attention to the > fallback code. I agree that it is better left untouched for know. Thanks. > Hmmm. I did manual tests really. Attached a psql script replicating them. > > # with builtin overflow detection > sh> psql < oc.sql > NOTICE: int 16 mul: 00:00:02.747269 # slow > NOTICE: int 16 add: 00:00:01.83281 > NOTICE: int 16 sub: 00:00:01.8501 > NOTICE: uint 16 mul: 00:00:03.68362 # slower > NOTICE: uint 16 add: 00:00:01.835294 > NOTICE: uint 16 sub: 00:00:02.136895 # slow > NOTICE: int 32 mul: 00:00:01.828065 > NOTICE: int 32 add: 00:00:01.840269 > NOTICE: int 32 sub: 00:00:01.843557 > NOTICE: uint 32 mul: 00:00:02.447052 # slow > NOTICE: uint 32 add: 00:00:01.849899 > NOTICE: uint 32 sub: 00:00:01.840773 > NOTICE: int 64 mul: 00:00:01.839051 > NOTICE: int 64 add: 00:00:01.839065 > NOTICE: int 64 sub: 00:00:01.838599 > NOTICE: uint 64 mul: 00:00:02.161346 # slow > NOTICE: uint 64 add: 00:00:01.839404 > NOTICE: uint 64 sub: 00:00:01.838549 Actually that's much faster than a single core on my debian SID with gcc 9.2.1. Here are more results from me: Built-in undef Built-in int16 mul 00:00:05.425207 00:00:05.634417 int16 add 00:00:05.389738 00:00:06.38885 int16 sub 00:00:05.446529 00:00:06.39569 uint16 mul 00:00:05.499066 00:00:05.541617 uint16 add 00:00:05.281622 00:00:06.252511 uint16 sub 00:00:05.366424 00:00:05.457148 int32 mul 00:00:05.121209 00:00:06.154989 int32 add 00:00:05.228722 00:00:06.344721 int32 sub 00:00:05.237594 00:00:06.323543 uint32 mul 00:00:05.126339 00:00:05.921738 uint32 add 00:00:05.212085 00:00:06.183031 uint32 sub 00:00:05.201884 00:00:05.363667 int64 mul 00:00:05.136129 00:00:06.148101 int64 add 00:00:05.200201 00:00:06.329091 int64 sub 00:00:05.218028 00:00:06.313114 uint64 mul 00:00:05.444733 00:00:08.089742 uint64 add 00:00:05.603978 00:00:06.377753 uint64 sub 00:00:05.544838 00:00:05.490873 This part has been committed, now let's move to the next parts of your patch. -- Michael
Вложения
> This part has been committed, now let's move to the next parts of your > patch. Attached a rebased version which implements the int64/uint64 stuff. It is basically the previous patch without the overflow inlined functions. -- Fabien Coelho - CRI, MINES ParisTech
Вложения
On Tue, Sep 03, 2019 at 08:10:37PM +0200, Fabien COELHO wrote:
> Attached a rebased version which implements the int64/uint64 stuff. It is
> basically the previous patch without the overflow inlined functions.
-     if (!strtoint64(yytext, true, &yylval->ival))
+     if (unlikely(pg_strtoint64(yytext, &yylval->ival) != PG_STRTOINT_OK))
It seems to me that we should have unlikely() only within
pg_strtoint64(), pg_strtouint64(), etc.
-   /* skip leading spaces; cast is consistent with strtoint64 */
-   while (*ptr && isspace((unsigned char) *ptr))
+   /* skip leading spaces; cast is consistent with pg_strtoint64 */
+   while (isspace((unsigned char) *ptr))
        ptr++;
What do you think about splitting this part in two?  I would suggest
to do the refactoring in a first patch, and the consider all the
optimizations for the routines you have in mind afterwards.
I think that we don't actually need is_an_int() and str2int64() at all
in pgbench.c as we could just check for the return code of
pg_strtoint64() and switch to the double parsing only if we don't have
PG_STRTOINT_OK.
scanint8() only has one caller in the backend with your patch, and we
don't check after its return result in int8.c, so I would suggest to
move the error handling directly in int8in() and to remove scanint8().
I think that we should also introduce the [u]int[16|32] flavors and
expand them in the code in a single patch, in a way consistent with
what you have done for int64/uint64 as the state that we reach with
the patch is kind of weird as there are existing versions numutils.c.
Have you done some performance testing of the patch?  The COPY
bulkload is a good example of that:
https://www.postgresql.org/message-id/20190717181428.krqpmduejbqr4m6g%40alap3.anarazel.de
--
Michael
			
		Вложения
Hi,
On 2019-09-03 20:10:37 +0200, Fabien COELHO wrote:
> @@ -113,7 +113,7 @@ parse_output_parameters(List *options, uint32 *protocol_version,
>                           errmsg("conflicting or redundant options")));
>              protocol_version_given = true;
>
> -            if (!scanint8(strVal(defel->arg), true, &parsed))
> +            if (unlikely(pg_strtoint64(strVal(defel->arg), &parsed) != PG_STRTOINT_OK))
>                  ereport(ERROR,
>                          (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>                           errmsg("invalid proto_version")));
Unexcited about adding unlikely() to any place that's not a hot path -
which this certainly is not.
But I also wonder if we shouldn't just put the branch likelihood of
pg_strtoint64 not failing into one central location. E.g. something like
static inline pg_strtoint_status
pg_strtoint64(const char *str, int64 *result)
{
        pg_strtoint_status status;
        status = pg_strtoint64_impl(str, result);
        likely(status == PG_STRTOINT_OK);
        return status;
}
I've not tested this, but IIRC that should be sufficient to propagate
that knowledge up to callers.
> +    if (likely(stat == PG_STRTOINT_OK))
> +        return true;
> +    else if (stat == PG_STRTOINT_RANGE_ERROR)
>      {
> -        /* could fail if input is most negative number */
> -        if (unlikely(tmp == PG_INT64_MIN))
> -            goto out_of_range;
> -        tmp = -tmp;
> -    }
> -
> -    *result = tmp;
> -    return true;
> -
> -out_of_range:
> -    if (!errorOK)
>          ereport(ERROR,
>                  (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
>                   errmsg("value \"%s\" is out of range for type %s",
>                          str, "bigint")));
> -    return false;
> -
> -invalid_syntax:
> -    if (!errorOK)
> +        return false;
> +    }
> +    else if (stat == PG_STRTOINT_SYNTAX_ERROR)
> +    {
>          ereport(ERROR,
>                  (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
>                   errmsg("invalid input syntax for type %s: \"%s\"",
>                          "bigint", str)));
> -    return false;
> +        return false;
> +    }
> +    else
> +        /* cannot get here */
> +        Assert(0);
I'd write this as a switch over the enum - that way we'll get a
compile-time warning if we're not handling a value.
> +static bool
> +str2int64(const char *str, int64 *val)
> +{
> +    pg_strtoint_status        stat = pg_strtoint64(str, val);
> +
I find it weird to have a wrapper that's named 'str2...' that then calls
'strto' to implement itself.
> +/*
> + * pg_strtoint64 -- convert a string to 64-bit integer
> + *
> + * The function returns if the conversion failed, or
> + * "*result" is set to the result.
> + */
> +pg_strtoint_status
> +pg_strtoint64(const char *str, int64 *result)
> +{
> +    const char *ptr = str;
> +    int64        tmp = 0;
> +    bool        neg = false;
> +
> +    /*
> +     * Do our own scan, rather than relying on sscanf which might be broken
> +     * for long long.
> +     *
> +     * As INT64_MIN can't be stored as a positive 64 bit integer, accumulate
> +     * value as a negative number.
> +     */
> +
> +    /* skip leading spaces */
> +    while (isspace((unsigned char) *ptr))
> +        ptr++;
> +
> +    /* handle sign */
> +    if (*ptr == '-')
> +    {
> +        ptr++;
> +        neg = true;
> +    }
> +    else if (*ptr == '+')
> +        ptr++;
> +
> +    /* require at least one digit */
> +    if (unlikely(!isdigit((unsigned char) *ptr)))
> +        return PG_STRTOINT_SYNTAX_ERROR;
> +
> +    /* process digits, we know that we have one ahead */
> +    do
> +    {
> +        int64        digit = (*ptr++ - '0');
> +
> +        if (unlikely(pg_mul_s64_overflow(tmp, 10, &tmp)) ||
> +            unlikely(pg_sub_s64_overflow(tmp, digit, &tmp)))
> +            return PG_STRTOINT_RANGE_ERROR;
> +    }
> +    while (isdigit((unsigned char) *ptr));
Hm. If we're concerned about the cost of isdigit etc - and I think
that's reasonable, after looking at their implementation in glibc (it
performs a lookup in a global table to handle potential locale changes)
- I think we ought to just not use the provided isdigit, and probably
not isspace either.  This code effectively relies on the input being
ascii anyway, so we don't need any locale specific behaviour afaict
(we'd e.g. get wrong results if isdigit() returned true for something
other than the ascii chars).
To me the generated code looks considerably better if I use something
like
static inline bool
pg_isdigit(char c)
{
    return c >= '0' && c <= '9';
}
static inline bool
pg_isspace(char c)
{
    return c == ' ';
}
(if we were to actually go for this, we'd probably want to add some docs
that we don't expect EOF, or make the code safe against that).
I've not benchmarked that, but I'd be surprised if it didn't improve
matters.
And once coded using the above, there's no point in the do/while
conversion imo, as any compiler can trivially optimize redundant checks.
> +/*
> + * pg_strtouint64 -- convert a string to unsigned 64-bit integer
> + *
> + * The function returns if the conversion failed, or
> + * "*result" is set to the result.
> + */
> +pg_strtoint_status
> +pg_strtouint64(const char *str, uint64 *result)
> +{
> +    const char *ptr = str;
> +    uint64        tmp = 0;
> +
> +    /* skip leading spaces */
> +    while (isspace((unsigned char) *ptr))
> +        ptr++;
> +
> +    /* handle sign */
> +    if (*ptr == '+')
> +        ptr++;
> +    else if (unlikely(*ptr == '-'))
> +        return PG_STRTOINT_SYNTAX_ERROR;
Hm. Seems like that should return PG_STRTOINT_RANGE_ERROR?
> +typedef enum {
Please don't define anonyous types (the enum itself, which now is only
reachable via the typedef). Also, there's a missing newline here).
Greetings,
Andres Freund
			
		Bonjour Michaël, >> Attached a rebased version which implements the int64/uint64 stuff. It is >> basically the previous patch without the overflow inlined functions. > > - if (!strtoint64(yytext, true, &yylval->ival)) > + if (unlikely(pg_strtoint64(yytext, &yylval->ival) != PG_STRTOINT_OK)) > It seems to me that we should have unlikely() only within > pg_strtoint64(), pg_strtouint64(), etc. From a compiler perspective, the (un)likely tip is potentially useful on any test. We know when parsing a that it is very unlikely that the string conversion would fail, so we can tell that, so that the compiler knows which branch it should optimize first. You can argue against that if the functions are inlined, because maybe the compiler would propagate the information, but for distinct functions compiled separately the information is useful at each level. > - /* skip leading spaces; cast is consistent with strtoint64 */ > - while (*ptr && isspace((unsigned char) *ptr)) > + /* skip leading spaces; cast is consistent with pg_strtoint64 */ > + while (isspace((unsigned char) *ptr)) > ptr++; > What do you think about splitting this part in two? I would suggest > to do the refactoring in a first patch, and the consider all the > optimizations for the routines you have in mind afterwards. I would not bother. > I think that we don't actually need is_an_int() and str2int64() at all > in pgbench.c as we could just check for the return code of > pg_strtoint64() and switch to the double parsing only if we don't have > PG_STRTOINT_OK. Yep, you are right, now that the conversion functions does not error out a message, its failure can be used as a test. The version attached changes slightly the semantics, because on int overflows a double conversion will be attempted instead of erroring. I do not think that it is worth the effort of preserving the previous semantic of erroring. > scanint8() only has one caller in the backend with your patch, and we > don't check after its return result in int8.c, so I would suggest to > move the error handling directly in int8in() and to remove scanint8(). Ok. > I think that we should also introduce the [u]int[16|32] flavors and > expand them in the code in a single patch, in a way consistent with > what you have done for int64/uint64 as the state that we reach with > the patch is kind of weird as there are existing versions numutils.c. Before dealing with the 16/32 versions, which involve quite a significant amount of changes, I would want a clear message that "the 64 bit approach" is the model to follow. Moreover, I'm unsure how to rename the existing pg_strtoint32 and others which call ereport, if the name is used for the common error returning version. > Have you done some performance testing of the patch? The COPY > bulkload is a good example of that: > https://www.postgresql.org/message-id/20190717181428.krqpmduejbqr4m6g%40alap3.anarazel.de I have done no such thing for now. I would not expect any significant performance difference when loading int8 things because basically scanint8 has just been renamed pg_strtoint64 and made global, and that is more or less all. It might be a little bit slower because possible the compiler cannot inline the conversion, but on the other hand, the *likely hints and removed tests may marginaly help performance. I think that the only way to test performance significantly would be to write a specific program that loops over a conversion. -- Fabien.
Вложения
On Wed, Sep 04, 2019 at 02:08:39AM -0700, Andres Freund wrote:
>> +static bool
>> +str2int64(const char *str, int64 *val)
>> +{
>> +    pg_strtoint_status        stat = pg_strtoint64(str, val);
>> +
>
> I find it weird to have a wrapper that's named 'str2...' that then calls
> 'strto' to implement itself.
It happens that this wrapper in pgbench.c is not actually needed.
> Hm. If we're concerned about the cost of isdigit etc - and I think
> that's reasonable, after looking at their implementation in glibc (it
> performs a lookup in a global table to handle potential locale changes)
> - I think we ought to just not use the provided isdigit, and probably
> not isspace either.  This code effectively relies on the input being
> ascii anyway, so we don't need any locale specific behaviour afaict
> (we'd e.g. get wrong results if isdigit() returned true for something
> other than the ascii chars).
Yeah.  It seems to me that we have more optimizations that could come
in line here, and actually we have perhaps more refactoring at hand
with each one of the 6 functions we'd like to add at the end.  I had
in mind about first shaping the refactoring patch, consolidating all
the interfaces, and then evaluate the improvements we can come up with
as after the refactoring we'd need to update only common/string.c.
> I've not benchmarked that, but I'd be surprised if it didn't improve
> matters.
I think that you are right here, there is something to gain.  Looking
at their stuff this makes use of __isctype as told by ctype/ctype.h.
--
Michael
			
		Вложения
On Wed, Sep 04, 2019 at 12:49:17PM +0200, Fabien COELHO wrote: > From a compiler perspective, the (un)likely tip is potentially useful on any > test. We know when parsing a that it is very unlikely that the string > conversion would fail, so we can tell that, so that the compiler knows which > branch it should optimize first. > > You can argue against that if the functions are inlined, because maybe the > compiler would propagate the information, but for distinct functions > compiled separately the information is useful at each level. Hmm. There has been an extra recent argument on the matter here: https://www.postgresql.org/message-id/20190904090839.stp3madovtynq3px@alap3.anarazel.de I am not sure that we should tackle that as part of the first refactoring though, as what we'd want is first to put all the interfaces in a single place we can deal with afterwards. > Yep, you are right, now that the conversion functions does not error out a > message, its failure can be used as a test. > > The version attached changes slightly the semantics, because on int > overflows a double conversion will be attempted instead of erroring. I do > not think that it is worth the effort of preserving the previous semantic of > erroring. Yes. I would move things in this direction. I may reconsider this part again with more testing but switching from one to the other is simple enough so let's keep the code as you suggest for now. >> scanint8() only has one caller in the backend with your patch, and we >> don't check after its return result in int8.c, so I would suggest to >> move the error handling directly in int8in() and to remove scanint8(). > > Ok. As per the extra comments of upthread, this should use a switch without a default. > Before dealing with the 16/32 versions, which involve quite a significant > amount of changes, I would want a clear message that "the 64 bit approach" > is the model to follow. > > Moreover, I'm unsure how to rename the existing pg_strtoint32 and others > which call ereport, if the name is used for the common error returning > version. Right, there was this part. This brings also the point of having one interface for the backend as all the error messages for the backend are actually the same, with the most simple name being that: pg_strtoint(value, size, error_ok). This then calls all the sub-routines we have in src/common/. There were more suggestions here: https://www.postgresql.org/message-id/20190729050539.d7mbjabcrlv7bxc3@alap3.anarazel.de > I would not expect any significant performance difference when loading int8 > things because basically scanint8 has just been renamed pg_strtoint64 and > made global, and that is more or less all. It might be a little bit slower > because possible the compiler cannot inline the conversion, but on the other > hand, the *likely hints and removed tests may marginaly help performance. I > think that the only way to test performance significantly would be to write > a specific program that loops over a conversion. I would suspect a change for pg_strtouint64(). -- Michael
Вложения
On Thu, Sep 05, 2019 at 03:52:48PM +0900, Michael Paquier wrote:
> Right, there was this part.  This brings also the point of having one
> interface for the backend as all the error messages for the backend
> are actually the same, with the most simple name being that:
> pg_strtoint(value, size, error_ok).
I have been looking at that for the last couple of days.  First, I
have consolidated all the error strings in a single routine like this
one, except that error_ok is not really needed if you take this
approach: callers that don't care about failures could just call the
set of routines in common/string.c and be done with it.
Attached is an update of my little module that I used to check that
the refactoring was done correctly (regression tests attached), it
also includes a couple of routines to check the performance difference
between one approach and the other, with focus on two things:
- Is pg_strtouint64 really improved?
- How much do we lose by moving to a common interface in the backend
with pg_strtoint?
The good news is that removing strtol from pg_strtouint64 really
improves the performance as already reported, and with one billion
calls in a tight loop you see a clear difference:
=# select pg_strtouint64_old_check('10000', 1000000000);
 pg_strtouint64_old_check
--------------------------
                    10000
(1 row)
Time: 15576.539 ms (00:15.577)
=# select pg_strtouint64_new_check('10000', 1000000000);
 pg_strtouint64_new_check
--------------------------
                    10000
(1 row)
Time: 8820.544 ms (00:08.821)
So the new implementation is more than 40% faster with this
micro-benchmark on my Debian box.
The bad news is that a pg_strtoint() wrapper is not a good idea:
=# select pg_strtoint_check('10000', 1000000000, 4);
 pg_strtoint_check
-------------------
             10000
(1 row)
Time: 11178.101 ms (00:11.178)
=# select pg_strtoint32_check('10000', 1000000000);
 pg_strtoint32_check
---------------------
               10000
(1 row)
Time: 9252.894 ms (00:09.253)
So trying to consolidate all error messages leads to a 15% hit with
this test, which sucks.
So, it seems to me that if we want to have a consolidation of
everything, we basically need to have the generic error messages for
the backend directly into common/string.c where the routines are
refactored, and I think that we should do the following based on the
patch attached:
- Just remove pg_strtoint(), and move the error generation logic in
common/string.c.
- Add an error_ok flag for all the pg_strto[u]int{16|32|64} routines,
which generates errors only for FRONTEND is not defined.
- Use pg_log_error() for the frontend errors.
If those errors are added everywhere, we would basically have no code
paths in the frontend or the backend (for the unsigned part only)
which use them yet.  Another possibility would be to ignore the
error_ok flag in those cases, but that's inconsistent.  My take would
be here to have a more generic error message, like that:
"value \"%s\" is out of range for [unsigned] {16|32|64}-bit integer"
"invalid input syntax for [unsigned] {16|32|64}-bit integer: \"%s\"\n"
I do not suggest to change the messages for the backend for signed
entries though, as these are linked to the data types used.
Attached is an update of my toy module, and an updated patch with what
I have done up to now.  This stuff already does a lot, so for now I
have not worked on the removal strtoint() in string.c yet.  We could
just do that with a follow-up patch and make it use the new conversion
routines once we are sure that they are in a correct shape.  As
strtoint() makes use of strtol(), switching to the new routines will
be much faster anyway...
Fabien, any thoughts?
--
Michael
			
		Вложения
Hi,
On 2019-09-09 14:28:14 +0900, Michael Paquier wrote:
> On Thu, Sep 05, 2019 at 03:52:48PM +0900, Michael Paquier wrote:
> > Right, there was this part.  This brings also the point of having one
> > interface for the backend as all the error messages for the backend
> > are actually the same, with the most simple name being that:
> > pg_strtoint(value, size, error_ok).
I *VEHEMENTLY* oppose the introduction of any future pseudo-generic
routines taking the type width as a parameter. They're weakening the
type system and they're unnecessarily inefficient.
I don't really buy that saving a few copies of a strings is worth that
much. But if you really want to do that, the right approach imo would be
to move the error reporting into a separate function. I.e. something
void pg_attribute_noreturn()
pg_strtoint_error(pg_strtoint_status status, const char *input, const char *type)
that you'd call in small wrappers. Something like
static inline int32
pg_strtoint32_check(const char* s)
{
    int32 result;
    pg_strtoint_status status = pg_strtoint32(s, &result);
    if (unlikely(status == PG_STRTOINT_OK))
       pg_strtoint_error(status, s, "int32");
    return result;
}
> So, it seems to me that if we want to have a consolidation of
> everything, we basically need to have the generic error messages for
> the backend directly into common/string.c where the routines are
> refactored, and I think that we should do the following based on the
> patch attached:
> - Just remove pg_strtoint(), and move the error generation logic in
> common/string.c.
I'm not quite sure what you mean by moving the "error generation logic"?
> - Add an error_ok flag for all the pg_strto[u]int{16|32|64} routines,
> which generates errors only for FRONTEND is not defined.
I think this is a bad idea.
> - Use pg_log_error() for the frontend errors.
>
> If those errors are added everywhere, we would basically have no code
> paths in the frontend or the backend (for the unsigned part only)
> which use them yet.  Another possibility would be to ignore the
> error_ok flag in those cases, but that's inconsistent.
Yea, ignoring it would be terrible idea.
> diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
> index a9bd47d937..593a5ef54e 100644
> --- a/src/backend/libpq/pqmq.c
> +++ b/src/backend/libpq/pqmq.c
> @@ -286,10 +286,10 @@ pq_parse_errornotice(StringInfo msg, ErrorData *edata)
>                  edata->hint = pstrdup(value);
>                  break;
>              case PG_DIAG_STATEMENT_POSITION:
> -                edata->cursorpos = pg_strtoint32(value);
> +                edata->cursorpos = pg_strtoint(value, 4);
>                  break;
I'd be really upset if this type of change went in.
>  #include "fmgr.h"
>  #include "miscadmin.h"
> +#include "common/string.h"
>  #include "nodes/extensible.h"
>  #include "nodes/parsenodes.h"
>  #include "nodes/plannodes.h"
> @@ -80,7 +81,7 @@
>  #define READ_UINT64_FIELD(fldname) \
>      token = pg_strtok(&length);        /* skip :fldname */ \
>      token = pg_strtok(&length);        /* get field value */ \
> -    local_node->fldname = pg_strtouint64(token, NULL, 10)
> +    (void) pg_strtouint64(token, &local_node->fldname)
Seems like these actually could just ought to use the error-checked
variants. And I think it ought to change all of
READ_{INT,UINT,LONG,UINT64,OID}_FIELD, rather than just redirecting one
of them to the new routines.
>  static void pcb_error_callback(void *arg);
> @@ -496,7 +496,7 @@ make_const(ParseState *pstate, Value *value, int location)
>
>          case T_Float:
>              /* could be an oversize integer as well as a float ... */
> -            if (scanint8(strVal(value), true, &val64))
> +            if (pg_strtoint64(strVal(value), &val64) == PG_STRTOINT_OK)
>              {
>                  /*
>                   * It might actually fit in int32. Probably only INT_MIN can
Not for this change, but we really ought to move away from this crappy
logic. It's really bonkers to have T_Float represent large integers and
floats.
> +/*
> + * pg_strtoint16
> + *
> + * Convert input string to a signed 16-bit integer.  Allows any number of
> + * leading or trailing whitespace characters.
> + *
> + * NB: Accumulate input as a negative number, to deal with two's complement
> + * representation of the most negative number, which can't be represented as a
> + * positive number.
> + *
> + * The function returns immediately if the conversion failed with a status
> + * value to let the caller handle the error.  On success, the result is
> + * stored in "*result".
> + */
> +pg_strtoint_status
> +pg_strtoint16(const char *s, int16 *result)
> +{
> +    const char *ptr = s;
> +    int16        tmp = 0;
> +    bool        neg = false;
> +
> +    /* skip leading spaces */
> +    while (likely(*ptr) && isspace((unsigned char) *ptr))
> +        ptr++;
> +
> +    /* handle sign */
> +    if (*ptr == '-')
> +    {
> +        ptr++;
> +        neg = true;
> +    }
> +    else if (*ptr == '+')
> +        ptr++;
> +    /* require at least one digit */
> +    if (unlikely(!isdigit((unsigned char) *ptr)))
> +        return PG_STRTOINT_SYNTAX_ERROR;
Wonder if there's an argument for moving this behaviour to someplace
else - in most cases we really don't expect whitespace, and checking for
it is unnecessary overhead.
Greetings,
Andres Freund
			
		On Mon, Sep 09, 2019 at 03:17:38AM -0700, Andres Freund wrote:
> On 2019-09-09 14:28:14 +0900, Michael Paquier wrote:
> I *VEHEMENTLY* oppose the introduction of any future pseudo-generic
> routines taking the type width as a parameter. They're weakening the
> type system and they're unnecessarily inefficient.
I saw that, using the previous wrapper increases the time of one call
from roughly 0.9ms to 1.1ns.
> I don't really buy that saving a few copies of a strings is worth that
> much. But if you really want to do that, the right approach imo would be
> to move the error reporting into a separate function. I.e. something
>
> void pg_attribute_noreturn()
> pg_strtoint_error(pg_strtoint_status status, const char *input, const char *type)
>
> that you'd call in small wrappers. Something like
I am not completely sure if we should do that either anyway.  Another
approach would be to try to make the callers of the routines generate
their own error messages.  The errors we have now are really linked to
the data types we have in core for signed integers (smallint, int,
bigint).  In most cases do they really make sense (varlena.c, pqmq.c,
etc.)?  And for errors which should never happen we could just use
elog().  For the input functions of int2/4/8 we still need the
existing errors of course.
>> So, it seems to me that if we want to have a consolidation of
>> everything, we basically need to have the generic error messages for
>> the backend directly into common/string.c where the routines are
>> refactored, and I think that we should do the following based on the
>> patch attached:
>
>> - Just remove pg_strtoint(), and move the error generation logic in
>> common/string.c.
>
> I'm not quite sure what you mean by moving the "error generation logic"?
I was referring to the error messages we have on HEAD in scanint8()
and pg_strtoint16() for bad inputs and overflows.
> Seems like these actually could just ought to use the error-checked
> variants. And I think it ought to change all of
> READ_{INT,UINT,LONG,UINT64,OID}_FIELD, rather than just redirecting one
Right.
>>  static void pcb_error_callback(void *arg);
>> @@ -496,7 +496,7 @@ make_const(ParseState *pstate, Value *value, int location)
>>
>>          case T_Float:
>>              /* could be an oversize integer as well as a float ... */
>> -            if (scanint8(strVal(value), true, &val64))
>> +            if (pg_strtoint64(strVal(value), &val64) == PG_STRTOINT_OK)
>>              {
>>                  /*
>>                   * It might actually fit in int32. Probably only INT_MIN can
>
> Not for this change, but we really ought to move away from this crappy
> logic. It's really bonkers to have T_Float represent large integers and
> floats.
I am not sure but what are you suggesting here?
>> +    /* skip leading spaces */
>> +    while (likely(*ptr) && isspace((unsigned char) *ptr))
>> +        ptr++;
>> +
>> +    /* handle sign */
>> +    if (*ptr == '-')
>> +    {
>> +        ptr++;
>> +        neg = true;
>> +    }
>> +    else if (*ptr == '+')
>> +        ptr++;
>
>> +    /* require at least one digit */
>> +    if (unlikely(!isdigit((unsigned char) *ptr)))
>> +        return PG_STRTOINT_SYNTAX_ERROR;
>
> Wonder if there's an argument for moving this behaviour to someplace
> else - in most cases we really don't expect whitespace, and checking for
> it is unnecessary overhead.
Not sure about that.  I would keep the scope of the patch simple as of
now, where we make sure that we have the right interface for
everything.  There are a couple of extra improvements which could be
done afterwards, and if we move everything in the same place that
should be easier to move on with more improvements.  Hopefully.
--
Michael
			
		Вложения
Hi,
On 2019-09-09 20:57:46 +0900, Michael Paquier wrote:
> > I don't really buy that saving a few copies of a strings is worth that
> > much. But if you really want to do that, the right approach imo would be
> > to move the error reporting into a separate function. I.e. something
> > 
> > void pg_attribute_noreturn()
> > pg_strtoint_error(pg_strtoint_status status, const char *input, const char *type)
> > 
> > that you'd call in small wrappers. Something like
> 
> I am not completely sure if we should do that either anyway.  Another
> approach would be to try to make the callers of the routines generate
> their own error messages.  The errors we have now are really linked to
> the data types we have in core for signed integers (smallint, int,
> bigint).  In most cases do they really make sense (varlena.c, pqmq.c,
> etc.)?
I think there's plenty places that ought to use the checked functions,
even if they currently don't. And typically calling in the caller will
actually be slightly less efficient, than an out-of-line function like I
was proposing above, because it'll be in-line code for infrequent code.
But ISTM all of them ought to just use the C types, rather than the SQL
types however. Since in the above proposal the caller determines the
type names, if you want a different type - like the SQL input routines -
can just invoke pg_strtoint_error() themselves (or just have it open
coded).
> And for errors which should never happen we could just use
> elog().  For the input functions of int2/4/8 we still need the
> existing errors of course.
Right, there it makes sense to continue to refer the SQL level types.
> >> So, it seems to me that if we want to have a consolidation of
> >> everything, we basically need to have the generic error messages for
> >> the backend directly into common/string.c where the routines are
> >> refactored, and I think that we should do the following based on the
> >> patch attached:
> > 
> >> - Just remove pg_strtoint(), and move the error generation logic in
> >> common/string.c.
> > 
> > I'm not quite sure what you mean by moving the "error generation logic"?
> 
> I was referring to the error messages we have on HEAD in scanint8()
> and pg_strtoint16() for bad inputs and overflows.
Not the right direction imo.
> >>  static void pcb_error_callback(void *arg);
> >> @@ -496,7 +496,7 @@ make_const(ParseState *pstate, Value *value, int location)
> >>
> >>          case T_Float:
> >>              /* could be an oversize integer as well as a float ... */
> >> -            if (scanint8(strVal(value), true, &val64))
> >> +            if (pg_strtoint64(strVal(value), &val64) == PG_STRTOINT_OK)
> >>              {
> >>                  /*
> >>                   * It might actually fit in int32. Probably only INT_MIN can
> > 
> > Not for this change, but we really ought to move away from this crappy
> > logic. It's really bonkers to have T_Float represent large integers and
> > floats.
> 
> I am not sure but what are you suggesting here?
I'm saying that we shouldn't need the whole logic of trying to parse the
string as an int, and then fail to float if it's not that. But that it's
not this patchset's task to fix this.
> >> +    /* skip leading spaces */
> >> +    while (likely(*ptr) && isspace((unsigned char) *ptr))
> >> +        ptr++;
> >> +
> >> +    /* handle sign */
> >> +    if (*ptr == '-')
> >> +    {
> >> +        ptr++;
> >> +        neg = true;
> >> +    }
> >> +    else if (*ptr == '+')
> >> +        ptr++;
> > 
> >> +    /* require at least one digit */
> >> +    if (unlikely(!isdigit((unsigned char) *ptr)))
> >> +        return PG_STRTOINT_SYNTAX_ERROR;
> > 
> > Wonder if there's an argument for moving this behaviour to someplace
> > else - in most cases we really don't expect whitespace, and checking for
> > it is unnecessary overhead.
> 
> Not sure about that.  I would keep the scope of the patch simple as of
> now, where we make sure that we have the right interface for
> everything.  There are a couple of extra improvements which could be
> done afterwards, and if we move everything in the same place that
> should be easier to move on with more improvements.  Hopefully.
The only reason for thinking about it now is that we'd then avoid
changing the API twice.
Greetings,
Andres Freund
			
		On Mon, Sep 09, 2019 at 03:17:38AM -0700, Andres Freund wrote:
> On 2019-09-09 14:28:14 +0900, Michael Paquier wrote:
>> @@ -80,7 +81,7 @@
>>  #define READ_UINT64_FIELD(fldname) \
>>      token = pg_strtok(&length);        /* skip :fldname */ \
>>      token = pg_strtok(&length);        /* get field value */ \
>> -    local_node->fldname = pg_strtouint64(token, NULL, 10)
>> +    (void) pg_strtouint64(token, &local_node->fldname)
>
> Seems like these actually could just ought to use the error-checked
> variants. And I think it ought to change all of
> READ_{INT,UINT,LONG,UINT64,OID}_FIELD, rather than just redirecting one
> of them to the new routines.
Okay for these changes, except for READ_INT_FIELD where we have short
variables using it as well (for example StrategyNumber) so this
generates a justified warning.  I think that a correct solution
here would be to add a new READ_SHORT_FIELD which uses pg_strtoint16.
I am not adding that for now.
--
Michael
			
		Вложения
On Mon, Sep 09, 2019 at 05:27:04AM -0700, Andres Freund wrote: > On 2019-09-09 20:57:46 +0900, Michael Paquier wrote: > But ISTM all of them ought to just use the C types, rather than the SQL > types however. Since in the above proposal the caller determines the > type names, if you want a different type - like the SQL input routines - > can just invoke pg_strtoint_error() themselves (or just have it open > coded). Yep, that was my line of thoughts. >> And for errors which should never happen we could just use >> elog(). For the input functions of int2/4/8 we still need the >> existing errors of course. > > Right, there it makes sense to continue to refer the SQL level types. Actually, I found your suggestion of using a noreturn function for the error reporting to be a very clean alternative. I didn't know though that gcc is not able to detect that a function does not return if you don't have a default in the switch for all the status codes. And this even if all the values of the enum for the switch are listed. > I'm saying that we shouldn't need the whole logic of trying to parse the > string as an int, and then fail to float if it's not that. But that it's > not this patchset's task to fix this. Ah, sure. Agreed. >> Not sure about that. I would keep the scope of the patch simple as of >> now, where we make sure that we have the right interface for >> everything. There are a couple of extra improvements which could be >> done afterwards, and if we move everything in the same place that >> should be easier to move on with more improvements. Hopefully. > > The only reason for thinking about it now is that we'd then avoid > changing the API twice. What I think we would be looking for here is an extra argument for the low-level routines to control the behavior of the function in an extensible way, say a bits16 for a set of flags, with one flag to ignore checks for trailing and leading whitespace. This feels a bit over-engineered though for this purpose. Attached is an updated patch? How does it look? I have left the parts of readfuncs.c for now as there are more issues behind that than doing a single switch, short reads are one, long reads a second. And the patch already does a lot. There could be also an argument for having extra _check wrappers for the unsigned portions but these would be mostly unused in the backend code, so I have left that out on purpose. After all that stuff, there are still some issues which need more care, in short: - the T_Float conversion. - removal of strtoint() - the part for readfuncs.c -- Michael
Вложения
On Tue, Sep 10, 2019 at 12:05:25PM +0900, Michael Paquier wrote: > Attached is an updated patch? How does it look? I have left the > parts of readfuncs.c for now as there are more issues behind that than > doing a single switch, short reads are one, long reads a second. And > the patch already does a lot. There could be also an argument for > having extra _check wrappers for the unsigned portions but these would > be mostly unused in the backend code, so I have left that out on > purpose. I have looked at this patch again today after letting it aside a couple of days, and I quite like the resulting shape of the routines. Does anybody else have any comments? Would it make sense to extend more the string-to-int conversion routines with a set of control flags to bypass the removal of leading and trailing whitespaces? -- Michael
Вложения
On 2019-09-10 12:05:25 +0900, Michael Paquier wrote:
> On Mon, Sep 09, 2019 at 05:27:04AM -0700, Andres Freund wrote:
> > On 2019-09-09 20:57:46 +0900, Michael Paquier wrote:
> > But ISTM all of them ought to just use the C types, rather than the SQL
> > types however. Since in the above proposal the caller determines the
> > type names, if you want a different type - like the SQL input routines -
> > can just invoke pg_strtoint_error() themselves (or just have it open
> > coded).
> 
> Yep, that was my line of thoughts.
> 
> >> And for errors which should never happen we could just use
> >> elog().  For the input functions of int2/4/8 we still need the
> >> existing errors of course.
> > 
> > Right, there it makes sense to continue to refer the SQL level types.
> 
> Actually, I found your suggestion of using a noreturn function for the
> error reporting to be a very clean alternative.  I didn't know though
> that gcc is not able to detect that a function does not return if you
> don't have a default in the switch for all the status codes.  And this
> even if all the values of the enum for the switch are listed.
As I proposed they'd be in different translation units, so the compiler
wouldn't see the definition of the function, just the declaration.
> >> Not sure about that.  I would keep the scope of the patch simple as of
> >> now, where we make sure that we have the right interface for
> >> everything.  There are a couple of extra improvements which could be
> >> done afterwards, and if we move everything in the same place that
> >> should be easier to move on with more improvements.  Hopefully.
> > 
> > The only reason for thinking about it now is that we'd then avoid
> > changing the API twice.
> 
> What I think we would be looking for here is an extra argument for the
> low-level routines to control the behavior of the function in an
> extensible way, say a bits16 for a set of flags, with one flag to
> ignore checks for trailing and leading whitespace.
That'd probably be a bad idea, for performance reasons.
> Attached is an updated patch?  How does it look?  I have left the
> parts of readfuncs.c for now as there are more issues behind that than
> doing a single switch, short reads are one, long reads a second.
Hm? I don't know what you mean by those issues.
> And the patch already does a lot.  There could be also an argument for
> having extra _check wrappers for the unsigned portions but these would
> be mostly unused in the backend code, so I have left that out on
> purpose.
I'd value consistency higher here.
> diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
> index 2c0ae395ba..8e75d52b06 100644
> --- a/src/backend/executor/spi.c
> +++ b/src/backend/executor/spi.c
> @@ -21,6 +21,7 @@
>  #include "catalog/heap.h"
>  #include "catalog/pg_type.h"
>  #include "commands/trigger.h"
> +#include "common/string.h"
>  #include "executor/executor.h"
>  #include "executor/spi_priv.h"
>  #include "miscadmin.h"
> @@ -2338,8 +2339,7 @@ _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
>                      CreateTableAsStmt *ctastmt = (CreateTableAsStmt *) stmt->utilityStmt;
>  
>                      if (strncmp(completionTag, "SELECT ", 7) == 0)
> -                        _SPI_current->processed =
> -                            pg_strtouint64(completionTag + 7, NULL, 10);
> +                        (void) pg_strtouint64(completionTag + 7, &_SPI_current->processed);
I'd just use the checked version here, seems like a good thing to check
for, and I can't imagine it matters performance wise.
> @@ -63,8 +63,16 @@ Datum
>  int2in(PG_FUNCTION_ARGS)
>  {
>      char       *num = PG_GETARG_CSTRING(0);
> +    int16        res;
> +    pg_strtoint_status status;
>  
> -    PG_RETURN_INT16(pg_strtoint16(num));
> +    /* Use a custom set of error messages here adapted to the data type */
> +    status = pg_strtoint16(num, &res);
I don't know what that comment is supposed to mean?
> +/*
> + * pg_strtoint64_check
> + *
> + * Convert input string to a signed 64-bit integer.
> + *
> + * This throws ereport() upon bad input format or overflow.
> + */
> +int64
> +pg_strtoint64_check(const char *s)
> +{
> +    int64        result;
> +    pg_strtoint_status status = pg_strtoint64(s, &result);
> +
> +    if (unlikely(status != PG_STRTOINT_OK))
> +        pg_strtoint_error(status, s, "int64");
> +    return result;
> +}
I think I'd just put these as inlines in the header.
Greetings,
Andres Freund
			
		On Fri, Sep 13, 2019 at 06:38:31PM -0700, Andres Freund wrote:
> On 2019-09-10 12:05:25 +0900, Michael Paquier wrote:
>> On Mon, Sep 09, 2019 at 05:27:04AM -0700, Andres Freund wrote:
>> Attached is an updated patch?  How does it look?  I have left the
>> parts of readfuncs.c for now as there are more issues behind that than
>> doing a single switch, short reads are one, long reads a second.
>
> Hm? I don't know what you mean by those issues.
I think that we have more issues than it looks.  For example:
- Switching UINT to use pg_strtouint32() causes an incompatibility
issue compared to atoui().
- Switching INT to use pg_strtoint32() causes a set of warnings as for
example with AttrNumber:
72 |  (void) pg_strtoint32(token, &local_node->fldname)
   |                              ^~~~~~~~~~~~~~~~~~~~~
   |                              |
   |                              AttrNumber * {aka short int *}
And it is not like we should use a cast either, as we could hide real
issues.     Hence it seems to me that we need to have a new routine
definition for shorter integers and switch more flags to that.
- Switching LONG to use pg_strtoint64() leads to another set of
issues, particularly one could see an assertion failure related to Agg
nodes.  I am not sure either that we should use int64 here as the size
can be at least 32b.
- Switching OID to use pg_strtoint32 causes a failure with initdb.
So while I agree with you that a switch should be doable, there is a
large set of issues to ponder about here, and the patch already does a
lot, so I really think that we had better do a closer lookup at those
issues separately, once the basics are in place, and consider them if
they actually make sense.  There is much more than just doing a direct
switch in this area with the family of ato*() system calls.
>> And the patch already does a lot.  There could be also an argument for
>> having extra _check wrappers for the unsigned portions but these would
>> be mostly unused in the backend code, so I have left that out on
>> purpose.
>
> I'd value consistency higher here.
Okay, no objections to that.
>> @@ -2338,8 +2339,7 @@ _SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
>>                      CreateTableAsStmt *ctastmt = (CreateTableAsStmt *) stmt->utilityStmt;
>>
>>                      if (strncmp(completionTag, "SELECT ", 7) == 0)
>> -                        _SPI_current->processed =
>> -                            pg_strtouint64(completionTag + 7, NULL, 10);
>> +                        (void) pg_strtouint64(completionTag + 7, &_SPI_current->processed);
>
> I'd just use the checked version here, seems like a good thing to check
> for, and I can't imagine it matters performance wise.
Yeah, makes sense.  I don't think it matters either for
pg_stat_statements in the same context.  So changed that part as
well.
>> @@ -63,8 +63,16 @@ Datum
>>  int2in(PG_FUNCTION_ARGS)
>>  {
>>      char       *num = PG_GETARG_CSTRING(0);
>> +    int16        res;
>> +    pg_strtoint_status status;
>>
>> -    PG_RETURN_INT16(pg_strtoint16(num));
>> +    /* Use a custom set of error messages here adapted to the data type */
>> +    status = pg_strtoint16(num, &res);
>
> I don't know what that comment is supposed to mean?
I mean here that the _check equivalent cannot be used as any error
messages generated need to be consistent with the SQL data type.  I
have updated the comment, does it look better now?
>> +/*
>> + * pg_strtoint64_check
>> + *
>> + * Convert input string to a signed 64-bit integer.
>> + *
>> + * This throws ereport() upon bad input format or overflow.
>> + */
>> +int64
>> +pg_strtoint64_check(const char *s)
>> +{
>> +    int64        result;
>> +    pg_strtoint_status status = pg_strtoint64(s, &result);
>> +
>> +    if (unlikely(status != PG_STRTOINT_OK))
>> +        pg_strtoint_error(status, s, "int64");
>> +    return result;
>> +}
>
> I think I'd just put these as inlines in the header.
I have not considered that.  This bloats a bit more builtins.h.  We
could live with that, or just move that into a separate header in
include/utils/, say int.h?  Even if common/int.h exists?
Attached is an updated patch.  Perhaps you have something else in
mind?
--
Michael
			
		Вложения
Bonjour Michaël,
> - Switching INT to use pg_strtoint32() causes a set of warnings as for
> example with AttrNumber:
> 72 |  (void) pg_strtoint32(token, &local_node->fldname)
>   |                              ^~~~~~~~~~~~~~~~~~~~~
>   |                              |
>   |                              AttrNumber * {aka short int *}
> And it is not like we should use a cast either, as we could hide real
> issues.
It should rather call pg_strtoint16? And possibly switch the "short int" 
declaration to int16?
About batch v14: applies cleanly and compiles without warnings. "make 
check" ok.
I do not think that "pg_strtoint_error" should be inlinable. The function 
is unlikely to be called, so it is not performance critical to inline it, 
and would enlarge the executable needlessly. However, the 
"pg_strto*_check" variants should be inlinable, as you have done.
About the code, on several instances of:
        /* skip leading spaces */
          while (likely(*ptr) && isspace((unsigned char) *ptr)) …
I would drop the "likely(*ptr)".
And on several instances of:
    !unlikely(isdigit((unsigned char) *ptr)))
ISTM that we want "unlikely(!isdigit((unsigned char) *ptr)))". Parsing 
!unlikely leads to false conclusion and a headache:-)
Otherwise, this batch of changes looks ok to me.
-- 
Fabien.
			
		On Sat, Sep 14, 2019 at 10:24:10AM +0200, Fabien COELHO wrote: > It should rather call pg_strtoint16? And possibly switch the "short int" > declaration to int16? Sure, but you get into other problems if using the 16-bit version for some other fields, which is why it seems to me that we should add an extra routine for shorts. So for now I prefer discarding this part. > I do not think that "pg_strtoint_error" should be inlinable. The function is > unlikely to be called, so it is not performance critical to inline it, and > would enlarge the executable needlessly. However, the "pg_strto*_check" > variants should be inlinable, as you have done. Makes sense. > About the code, on several instances of: > > /* skip leading spaces */ > while (likely(*ptr) && isspace((unsigned char) *ptr)) … > > I would drop the "likely(*ptr)". Right as well. There were two places out of six with that pattern. > And on several instances of: > > !unlikely(isdigit((unsigned char) *ptr))) > > ISTM that we want "unlikely(!isdigit((unsigned char) *ptr)))". Parsing > !unlikely leads to false conclusion and a headache:-) That part was actually inconsistent with the rest. > Otherwise, this batch of changes looks ok to me. Thanks. -- Michael
Вложения
Hi,
On 2019-09-14 15:02:36 +0900, Michael Paquier wrote:
> On Fri, Sep 13, 2019 at 06:38:31PM -0700, Andres Freund wrote:
> > On 2019-09-10 12:05:25 +0900, Michael Paquier wrote:
> >> On Mon, Sep 09, 2019 at 05:27:04AM -0700, Andres Freund wrote:
> >> Attached is an updated patch?  How does it look?  I have left the
> >> parts of readfuncs.c for now as there are more issues behind that than
> >> doing a single switch, short reads are one, long reads a second.
> > 
> > Hm? I don't know what you mean by those issues.
> 
> I think that we have more issues than it looks.  For example:
> - Switching UINT to use pg_strtouint32() causes an incompatibility
> issue compared to atoui().
"An incompatibility" is, uh, vague.
> - Switching INT to use pg_strtoint32() causes a set of warnings as for
> example with AttrNumber:
> 72 |  (void) pg_strtoint32(token, &local_node->fldname)
>    |                              ^~~~~~~~~~~~~~~~~~~~~
>    |                              |
>    |                              AttrNumber * {aka short int *}
> And it is not like we should use a cast either, as we could hide real
> issues.     Hence it seems to me that we need to have a new routine
> definition for shorter integers and switch more flags to that.
Yea.
> - Switching LONG to use pg_strtoint64() leads to another set of
> issues, particularly one could see an assertion failure related to Agg
> nodes.  I am not sure either that we should use int64 here as the size
> can be at least 32b.
That seems pretty clearly something that needs to be debugged before
applying this series. If there's such a failure, it indicates that
there's either a problem in this patchset, or a pre-existing problem in
readfuncs.
> - Switching OID to use pg_strtoint32 causes a failure with initdb.
Needs to be debugged too. Although I suspect this might just be that you
need to use unsigned variant.
> So while I agree with you that a switch should be doable, there is a
> large set of issues to ponder about here, and the patch already does a
> lot, so I really think that we had better do a closer lookup at those
> issues separately, once the basics are in place, and consider them if
> they actually make sense.  There is much more than just doing a direct
> switch in this area with the family of ato*() system calls.
I have no problme applying this separately, but I really don't think
it's wise to apply this before these problems have been debugged.
Greetings,
Andres Freund
			
		Bonjour Michaël, >> Otherwise, this batch of changes looks ok to me. > > Thanks. About v15: applies cleanly, compiles, "make check" ok. While re-reading the patch, there are bit of repetitions on pg_strtou?int* comments. I'm wondering whether it would make sense to write a global comments before each 3-type series to avoid that. -- Fabien.
On Mon, Sep 16, 2019 at 10:08:19AM -0700, Andres Freund wrote: > On 2019-09-14 15:02:36 +0900, Michael Paquier wrote: >> - Switching OID to use pg_strtoint32 causes a failure with initdb. > > Needs to be debugged too. Although I suspect this might just be that you > need to use unsigned variant. No, that's not it. My last message had a typo as I used the unsigned variant. Anyway, by switching the routines of readfuncs.c to use the new _check wrappers it is easy to see what the actual issue is. And here we go with one example: "FATAL: invalid input syntax for type uint32: "12089 :relkind v" So, the root of the problem is that the optimized conversion routines would complain if the end of the string includes incorrect characters when converting a node from text, which is not something that strtoXX complains about. So our wrappers atooid() and atoui() accept more types of strings in input as they rely on the system's strtol(). And that counts for the failures with UINT and OID. atoi() also is more flexible on that, which explains the failures for INT, as well as atol() for LONG (this shows a failure in the regression tests, not at initdb time though). One may think that this restriction does not actually apply to READ_UINT64_FIELD because the routine involves no other things than queryId. However once I enable -DWRITE_READ_PARSE_PLAN_TREES -DWRITE_READ_PARSE_PLAN_TREES -DCOPY_PARSE_PLAN_TREES in the builds, queryId parsing also complains with the patch. So except if we redesign the node reads we are bound to keep around the wrapper of strtoXX on HEAD called pg_strtouint64() to avoid an incompatibility when parsing the 64-bit query ID. We could keep that isolated in readfuncs.c close to the declarations of atoui() and strtobool() though. This also points out that pg_strtouint64 of HEAD is inconsistent with its signed relatives in the treatment of the input string. The code paths of the patch calling pg_strtouint64_check to parse completion tags (spi.c and pg_stat_statements.c) should complain in those cases as the format of the tags for SELECT and COPY is fixed. In order to unify the parsing interface and put all the conversion routines in a single place, I still think that the patch has value so I would still keep it (with a fix for the queryId parsing of course), but there is much more to it. >> So while I agree with you that a switch should be doable, there is a >> large set of issues to ponder about here, and the patch already does a >> lot, so I really think that we had better do a closer lookup at those >> issues separately, once the basics are in place, and consider them if >> they actually make sense. There is much more than just doing a direct >> switch in this area with the family of ato*() system calls. > > I have no problem applying this separately, but I really don't think > it's wise to apply this before these problems have been debugged. Sure. No problem with that line of reasoning. -- Michael
Вложения
On Tue, Sep 17, 2019 at 11:29:13AM +0900, Michael Paquier wrote: > The code paths of the patch calling pg_strtouint64_check to parse > completion tags (spi.c and pg_stat_statements.c) should complain in > those cases as the format of the tags for SELECT and COPY is fixed. > > In order to unify the parsing interface and put all the conversion > routines in a single place, I still think that the patch has value so > I would still keep it (with a fix for the queryId parsing of course), > but there is much more to it. Forgot to mention that another angle of attack would be of course to add some control flags in the optimized parsing functions to make them more permissive regarding the handling of the trailing characters, by not considering as a syntax error the case where the last character is not a zero-termination. -- Michael
Вложения
On Tue, Sep 17, 2019 at 11:29:13AM +0900, Michael Paquier wrote: > In order to unify the parsing interface and put all the conversion > routines in a single place, I still think that the patch has value so > I would still keep it (with a fix for the queryId parsing of course), > but there is much more to it. As of now, here is an updated patch which takes the path to not complicate the refactored APIs and fixes the issue with queryID in readfuncs.c. Thoughts? -- Michael
Вложения
Is there any specific reason for hard coding the *base* of a number representing the string in strtouint64(). I understand that currently strtouint64() is being used just to convert an input string to decimal unsigned value but what if we want it to be used for hexadecimal values or may be some other values, in that case it can't be used. Further, the function name is strtouint64() but the comments atop it's definition says it's pg_strtouint64(). That needs to be corrected. At few places, I could see that the function call to pg_strtoint32_check() is followed by an error handling. Isn't that already being done in pg_strtoint32_check function itself. For e.g. in refint.c the function call to pg_strtoint32_check is followed by a if condition that checks for an error which I assume shouldn't be there as it is already being done by pg_strtoint32_check. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Wed, Sep 18, 2019 at 6:43 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Sep 17, 2019 at 11:29:13AM +0900, Michael Paquier wrote: > > In order to unify the parsing interface and put all the conversion > > routines in a single place, I still think that the patch has value so > > I would still keep it (with a fix for the queryId parsing of course), > > but there is much more to it. > > As of now, here is an updated patch which takes the path to not > complicate the refactored APIs and fixes the issue with queryID in > readfuncs.c. Thoughts? > -- > Michael
Hi, On 2019-10-04 14:27:44 +0530, Ashutosh Sharma wrote: > Is there any specific reason for hard coding the *base* of a number > representing the string in strtouint64(). I understand that currently > strtouint64() is being used just to convert an input string to decimal > unsigned value but what if we want it to be used for hexadecimal > values or may be some other values, in that case it can't be used. It's a lot slower if the base is variable, because the compiler cannot replace the division by shifts. Greetings, Andres Freund
On Fri, Oct 4, 2019 at 8:58 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-10-04 14:27:44 +0530, Ashutosh Sharma wrote: > > Is there any specific reason for hard coding the *base* of a number > > representing the string in strtouint64(). I understand that currently > > strtouint64() is being used just to convert an input string to decimal > > unsigned value but what if we want it to be used for hexadecimal > > values or may be some other values, in that case it can't be used. > > It's a lot slower if the base is variable, because the compiler cannot > replace the division by shifts. > Thanks Andres for the reply. I didn't know that the compiler won't be able to replace division with shifts operator if the base is variable and it's true that it would make the things a lot slower. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
On Fri, Oct 04, 2019 at 02:27:44PM +0530, Ashutosh Sharma wrote: > Is there any specific reason for hard coding the *base* of a number > representing the string in strtouint64(). I understand that currently > strtouint64() is being used just to convert an input string to decimal > unsigned value but what if we want it to be used for hexadecimal > values or may be some other values, in that case it can't be used. > Further, the function name is strtouint64() but the comments atop it's > definition says it's pg_strtouint64(). That needs to be corrected. Performance, as Andres has already stated upthread. Moving away from strtol gives roughly a 40% improvement with a call-to-call comparison: https://www.postgresql.org/message-id/20190909052814.GA26605@paquier.xyz > At few places, I could see that the function call to > pg_strtoint32_check() is followed by an error handling. Isn't that > already being done in pg_strtoint32_check function itself. For e.g. in > refint.c the function call to pg_strtoint32_check is followed by a if > condition that checks for an error which I assume shouldn't be there > as it is already being done by pg_strtoint32_check. pg_strtoint32_check is used for a signed integer, so it would complain about incorrect input syntax, but not when the parsed integer is less or equal than 0, which is what refint.c complains about. -- Michael
Вложения
On Wed, Sep 18, 2019 at 10:13:20AM +0900, Michael Paquier wrote: > As of now, here is an updated patch which takes the path to not > complicate the refactored APIs and fixes the issue with queryID in > readfuncs.c. Thoughts? For now, and seeing that there is little interest in it. I have marked the patch as returned with feedback in this CF. The thread has gone long, so if there is a new submission I would suggest using a new thread with a fresh start point.. Not sure if I'll work on that or not. -- Michael