Обсуждение: Redacting information from logs

Поиск

Список

Период

Сортировка

Redacting information from logs

От

Jeff Davis

Дата:

30 июля 2019 г., 21:54:55

Logs are important to diagnose problems or monitor operations, but logs
can contain sensitive information which is often unnecessary for these
purposes. Redacting the sensitive information would enable easier
access and simpler integration with analysis tools without compromising
the sensitive information.

The challenge is that nobody wants to classify all of the log messages;
and even if someone did that today, there would be never-ending work in
the future to try to maintain that classification.

My proposal is:

 * redact every '%s' in an ereport by having a special mode for
snprintf.c (this is possible because we now own snprintf)
 * generate both redacted and unredacted messages (if redaction is
enabled)
 * choose which destinations (stderr, eventlog, syslog, csvlog) get
redacted or plain messages
 * emit_log_hook always has both redacted and plain messages available
 * allow specifying a custom redaction function, e.g. a function that
hashes the string rather than completely redacting it

I think '%s' in a log message is a pretty close match to the kind of
information that might be sensitive. All data goes through type output
functions (e.g. the conflicting datum for a unique constraint violation
message), and most other things that a user might type would go through
%s. A lot of other information useful in logs, like LSNs, %m's, PIDs,
etc. would be preserved.

All object names would be redacted, but that's not as bad as it sounds:
  (a) You can specify a custom redaction function that hashes rather
than completely redacts. That allows you to see if different messages
refer to the same object, and also map back to suspected objects if you
really need to.
  (b) The unredacted object names are still a part of ErrorData, so you
can do something interesting with emit_log_hook.
  (c) You still might have the unredacted logs in a more protected
place, and can access them when you really need to.

A weakness of this proposal is that it could be confusing to use
ereport() in combination with snprintf(). If using snprintf to build
the format string, nothing would be redacted, so you'd have to be
careful not to expand any %s that might be sensitive. If using snprintf
to build up an argument, the entire argument would be redacted. The
first case should not be common, because good coding generally avoids
non-constant format strings. The second case is just over-redaction,
which is not necessarily bad.

One annoying case would be if some of the arguments to ereport() are
used for things like the right number of commas or tabs -- redacting
those would just make the message look horrible. I didn't find such
cases but I'm pretty sure they exist. Another annoying case is time,
which is useful for debugging, but formatted with %s so it gets
redacted (I did find plenty of these cases).

But I don't see a better solution. Right now, it's a pain to treat log
files as sensitive things when there are so many ways they can help
with smooth operations and so many tools available to analyze them.
This proposal seems like a practical solution to enable better use of
log files while protecting potentially-sensitive information.

Attached is a WIP patch.

Regards,
    Jeff Davis

Вложения

0001-WIP-redaction.patch

Re: Redacting information from logs

От

Tomas Vondra

Дата:

04 августа 2019 г., 00:34:46

On Tue, Jul 30, 2019 at 11:54:55AM -0700, Jeff Davis wrote:
>Logs are important to diagnose problems or monitor operations, but logs
>can contain sensitive information which is often unnecessary for these
>purposes. Redacting the sensitive information would enable easier
>access and simpler integration with analysis tools without compromising
>the sensitive information.
>

OK, that's a worthwhile goal. I assume by "sensitive data" you mean user
data, right?

>The challenge is that nobody wants to classify all of the log messages;
>and even if someone did that today, there would be never-ending work in
>the future to try to maintain that classification.
>
>My proposal is:
>
> * redact every '%s' in an ereport by having a special mode for
>snprintf.c (this is possible because we now own snprintf)
> * generate both redacted and unredacted messages (if redaction is
>enabled)
> * choose which destinations (stderr, eventlog, syslog, csvlog) get
>redacted or plain messages
> * emit_log_hook always has both redacted and plain messages available
> * allow specifying a custom redaction function, e.g. a function that
>hashes the string rather than completely redacting it
>
>I think '%s' in a log message is a pretty close match to the kind of
>information that might be sensitive. All data goes through type output
>functions (e.g. the conflicting datum for a unique constraint violation
>message), and most other things that a user might type would go through
>%s. A lot of other information useful in logs, like LSNs, %m's, PIDs,
>etc. would be preserved.
>

IMHO the crucial part here is 'might be sensitive'. How often is that
actually true? My guess is 99% of places using %s are not sensitive at
all, and are used for things like filenames, table/attribute names,
and so on. And redacting those parts will make the logs essentially
useless, because we'll get things like this:

    ERROR:  column "******" does not exist at character 10

    ERROR:  division by zero
    CONTEXT:  SQL function "******" during inlining

I'm not sure those are the logs I'd like to see on a production system
while investigating an issue.

>All object names would be redacted, but that's not as bad as it sounds:
>  (a) You can specify a custom redaction function that hashes rather
>than completely redacts. That allows you to see if different messages
>refer to the same object, and also map back to suspected objects if you
>really need to.
>  (b) The unredacted object names are still a part of ErrorData, so you
>can do something interesting with emit_log_hook.

Isn't hashing essentially an information leak, i.e. somewhat undesirable
for sensitive data?

>  (c) You still might have the unredacted logs in a more protected
>place, and can access them when you really need to.
>

The question is whether that's actually an acceptable solution for
deployments that do handle sensitive data ...

>A weakness of this proposal is that it could be confusing to use
>ereport() in combination with snprintf(). If using snprintf to build
>the format string, nothing would be redacted, so you'd have to be
>careful not to expand any %s that might be sensitive. If using snprintf
>to build up an argument, the entire argument would be redacted. The
>first case should not be common, because good coding generally avoids
>non-constant format strings. The second case is just over-redaction,
>which is not necessarily bad.
>
>One annoying case would be if some of the arguments to ereport() are
>used for things like the right number of commas or tabs -- redacting
>those would just make the message look horrible. I didn't find such
>cases but I'm pretty sure they exist. Another annoying case is time,
>which is useful for debugging, but formatted with %s so it gets
>redacted (I did find plenty of these cases).
>
>But I don't see a better solution. Right now, it's a pain to treat log
>files as sensitive things when there are so many ways they can help
>with smooth operations and so many tools available to analyze them.
>This proposal seems like a practical solution to enable better use of
>log files while protecting potentially-sensitive information.
>

Hmm. I wonder how difficult would it be to actually go through the
ereport calls and classify those that can leak sensitive data, and then
do redaction only for those. That's about the only alternative approach
I can think of.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Redacting information from logs

От

Andres Freund

Дата:

04 августа 2019 г., 01:47:57

Hi,

On 2019-07-30 11:54:55 -0700, Jeff Davis wrote:
> My proposal is:
>
>  * redact every '%s' in an ereport by having a special mode for
> snprintf.c (this is possible because we now own snprintf)

I'm extremely doubtful this is a sane approach. We use snprintf for a
heck of a lot of things. The likelihood of this having unintended
consequences seems high (consider an error being thrown while trying to
report another error message and such). Nor do I think that snprintf.c
is a good layer to perform redaction - it's too low level. It's used for
both frontend/backend. It's used for both non-error and error purposes.

I also don't think you're actually going to get that far with it -
there's plenty places where we concatenate error messages without using
*printf, but e.g. appendStringInfoString().

> But I don't see a better solution. Right now, it's a pain to treat log
> files as sensitive things when there are so many ways they can help
> with smooth operations and so many tools available to analyze them.
> This proposal seems like a practical solution to enable better use of
> log files while protecting potentially-sensitive information.

I don't really see a low-effort way either. But I'm fairly certain that
this will cause at least many problems as it'll help solve.

I think incrementally moving to messages where portions of information
are separated out (e.g. the things we'd inline with %s) is, although a
lengthy process, the better approach. It'll make richer output formats
possible, it'll allow for proper redaction, etc.

I.e. something very roughly like

ereport(ERROR,
        errmsg_rich("string with %{named}s references to %{parameter}s"),
        errparam("named", somevar),
        errparam("parameter", othervar, .redact = CONTEXT));

Which would allow us to add annotate whether a specific parameter needs
to be redacted for certain contexts.

I'd probably add a errredact(bool) to annotate whether a message needs
to be redacted, mostly so we can easily flag a lot of current messages
as OK. When not present, I'd redact the entire message when errmsg() is
being used, and redact nothing if errmsg_rich() is used, and none of the
parameters flag an error.

That'd then also allow us to reference parameters that clients /
exception handlers may not see, e.g. the arguments to leakproof
functions. Which currently makes a lot of issues harder to debug,
because we don't get the values for e.g. overflows, input syntax errors
etc.

Allowing errparam()s to be specified that are not used in the error
messages, we can provide more detail to errors for people using richer
log outputs. I'd assume we'd fairly quickly have logfmt/json logging
target/format.

Greetings,

Andres Freund

Re: Redacting information from logs

От

Chapman Flack

Дата:

04 августа 2019 г., 01:57:44

On 7/30/19 2:54 PM, Jeff Davis wrote:
> Logs are important to diagnose problems or monitor operations, but logs
> can contain sensitive information which is often unnecessary for these
> purposes. Redacting the sensitive information would enable easier
> access and simpler integration with analysis tools without compromising
> the sensitive information.

This seems like a thread that could contain a link to this other thread:

https://www.postgresql.org/message-id/1055919c-98ea-56d2-9ea7-f8a7c72e16b4%40anastigmatix.net

Implementing ideas are different, but motivating concerns the same.

Regards,
-Chap

Re: Redacting information from logs

От

Tom Lane

Дата:

04 августа 2019 г., 02:14:13

Andres Freund <andres@anarazel.de> writes:
> On 2019-07-30 11:54:55 -0700, Jeff Davis wrote:
>> My proposal is:
>> * redact every '%s' in an ereport by having a special mode for
>> snprintf.c (this is possible because we now own snprintf)

> I'm extremely doubtful this is a sane approach.

Yeah, it's really hard to believe that messing with snprintf isn't
going to have a lot of unintended consequences.  If we want to do
something about this, we're going to have to bite the bullet and
go around to annotate all those arguments with redaction-worthiness
info.  It'd be a lot of work, but as long as we can make sure to
do it incrementally, we could get there.  (It's not like ereport
itself has been there forever ...)

> I.e. something very roughly like

> ereport(ERROR,
>         errmsg_rich("string with %{named}s references to %{parameter}s"),
>         errparam("named", somevar),
>         errparam("parameter", othervar, .redact = CONTEXT));

I'm not terribly attracted by that specific approach, though.  With
this, we'd get to maintain *two* variants of snprintf, and I think
the one with annotation knowledge would have significant performance
problems.  (Something I'm sensitive to, on account of certain persons
beating me over the head about snprintf.c's performance.)  It'd be
an amazing pain in the rear for translators, too, on account of
their tools not understanding this format-string language.

It seems to me that it'd be sufficient to do the annotation by
inserting wrapper functions, like the errparam() you suggest above.
If we just had errparam() choosing whether to return "..." instead of
its argument string, we'd have what we need, without messing with
the format language.

            regards, tom lane

Re: Redacting information from logs

От

Andres Freund

Дата:

04 августа 2019 г., 21:17:57

Hi,

On 2019-08-03 19:14:13 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I.e. something very roughly like
>
> > ereport(ERROR,
> >         errmsg_rich("string with %{named}s references to %{parameter}s"),
> >         errparam("named", somevar),
> >         errparam("parameter", othervar, .redact = CONTEXT));
>
> I'm not terribly attracted by that specific approach, though.  With
> this, we'd get to maintain *two* variants of snprintf, and I think
> the one with annotation knowledge would have significant performance
> problems.  (Something I'm sensitive to, on account of certain persons
> beating me over the head about snprintf.c's performance.)

I don't think that the performance issues of *printf in general - which
we use in a lot of hot paths like output functions - are the same as
when we log messages/throw errors.  My gut feeling that at least >=
DEBUG1 the difference would be hard to measure.  And given how we
e.g. build up the log_line_prefix, we could probably also recoup a lot
potentially lost performance, around logging.

We would probably just allow both positional references and named
references. I.e. for smaller messages we can just use the current
format, and just for the very long ones with many parameters we'd
normally use a positional reference - but still have the name associated
with the parameter. Even if we don't want to allow reference parameters
by name in the format string, I think it's worthwhile. See below.

It's really kind of a shame that we constantly re-parse the format
strings. With something like modern C++ it'd be fairly easy to not do
that (by using compile time evaluated constexprs when possible) - but I
wonder if we couldn't also manage to do that within plain C, at the cost
of a branch and some macro magic.  I'm imagining something like

#define pg_printf(fmt, ...) \
    do { \
        if ( __builtin_constant_p(fmt)) \
        { \
            static processed_fmt processed_fmt_ = {.format = fmt}; \
            if (unlikely(!processed_fmt_.parsed)) \
               preprocess_format_string(&processed_fmt_) \
            pg_printf_processed(&processed_fmt_, __VA_ARGS__); \
        } \
        else \
            pg_printf_unprocessed(fmt, __VA_ARGS__); \
    } while (0) \

having to use a static variable for this purpose sucks somewhat, but I
can't think of something in C that'd let us avoid that.

> It'd be an amazing pain in the rear for translators, too, on account
> of their tools not understanding this format-string language.

The limitations of that world are starting to be really frustrating.

> It seems to me that it'd be sufficient to do the annotation by
> inserting wrapper functions, like the errparam() you suggest above.
> If we just had errparam() choosing whether to return "..." instead of
> its argument string, we'd have what we need, without messing with
> the format language.

I agree that if redaction were the only reason, that the expanded format
strings wouldn't be worth it. And perhaps they still aren't.

I think we've quite repeatedly had requests for a log format that can be
parsed reasonably, and annotated with additional information that's too
verbose for the main message.  It's a pain to have to parse quite
complex strings in production, just to extract the value of the actual
information in the message.

As an *easy* example of something I wanted to do in the past: Plot the
total wait time of backends that had to wait for locks longer than
deadlock_timeout. In that case, because we saw the system failing with
too much waiting, but didn't really know when that started become problematic, and what the neormal amount of waiting
is
So we need to parse
                ereport(LOG,
                        (errmsg("process %d acquired %s on %s after %ld.%03d ms",
                                MyProcPid, modename, buf.data, msecs, usecs)));
into something usable.  First problem is of course that the
log_line_prefix is variable, and that we'll likely need information from
it to hunt down the source of locks. Parsing the above with regexes
isn't too hard - but then we also need to deal with the fact that the
message can be translated.  Oh, and finding the associated statement
requires looking forward in the log stream by a hard to predict amount,
and requires knowledge about log verbosity was set to.

In the past I had also wanted to parse the log to understand how much
work manual and auto-vacuum an installation was doing, and whether
that's perhaps correlated to problems. But for that one need a parser
that can handle this mess:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/vacuumlazy.c;h=a3c4a1df3b4b28953ab1dfcd43f3d566eecac555;hb=HEAD#l396

which is obviously much harder.

If we instead had the relevant variables separated out in a named
fashion, we could - like many other tools these days - have a richer log
target, that'd first have typical log message as a human readable
string, but then also all the interpolated values separately.

E.g. in theaforementioned simpler case it could be something like

ts="2019-08-04 10:19:21.286 PDT" pid=32416 vxid=12/9 sev=LOG \
    msg="Prozess 15027 erlangte AccessShareLock-Sperre auf Relation 1259 der Datenbank 13416 nach 29808,865 ms" \
    fmt="process %d acquired %s on %s after %ld.%03d ms" \
    p:modename="AccessShareLock-Sperre auf Relation 1259" \
    p:msec=18343 p:usec=173 \
    p:locktag=190345/134343/0/0/relation/default \
    c:statement="LOCK TABLE pg_class;" \
    l:func=ProcSleep l:file=proc.c l:line=1495

So, the prefix fields are just included as k=v, the evaluated message as
one field, the *untranslated* format as another. All the parameters are
included prefixed with p:, context information prefixed with c:, and
location with l:. Note how e.g. p:modename is a pre-translated string,
which is why p:locktag was added as an intentionall untranslated
paramter that's not included in the message.

This isn't perfect, but still a heck of a lot easier to parse than what
we have now. All of the data is one block rather than in consecutive log
lines that may or may not belong together, it's key=value in a way that
can be parsed, the untranslated format string is available, so tools can
actually analyze logs independent of the current locale.

This isn't a proposal to adopt precisely the above log format - that's
just something I came up with while writing this email - but to have
enough information available to actually produce something like it when
emitting logs.

If we have enough information to pass to the logging hook, we don't even
need to define how all of this is going to look like exactly (although
I'd probably argue that a logfmt or json target ought to be in core).

The cool part - and people from other projects/company projects might be
laughing about me describing it as cool in this day and age - is that
once you have a format that can readily be parsed it's pretty simple to
write tools that process a logfile and extract just the information you
want.

Greetings,

Andres Freund

Re: Redacting information from logs

От

Jeff Davis

Дата:

05 августа 2019 г., 23:37:50

On Sat, 2019-08-03 at 19:14 -0400, Tom Lane wrote:
> It seems to me that it'd be sufficient to do the annotation by
> inserting wrapper functions, like the errparam() you suggest above.
> If we just had errparam() choosing whether to return "..." instead of
> its argument string, we'd have what we need, without messing with
> the format language.

I'm having trouble getting the ergonomics to work out here so that it
can generate both a redacted and an unredacted message.

If errparam() is a normal argument to errmsg(), then errparam() will be
evaluated first. Will it return the redacted version, the unredacted
version, or a special type that holds both?

If I try to use macros to force multiple evaluation (to get one
redacted and one unredacted string), then it seems like that would
happen for all arguments (not just errparam arguments), which would be
bad.

Suggestions?

Regards,
        Jeff Davis

Re: Redacting information from logs

От

Andres Freund

Дата:

06 августа 2019 г., 00:10:26

Hi,

On 2019-08-05 13:37:50 -0700, Jeff Davis wrote:
> On Sat, 2019-08-03 at 19:14 -0400, Tom Lane wrote:
> > It seems to me that it'd be sufficient to do the annotation by
> > inserting wrapper functions, like the errparam() you suggest above.
> > If we just had errparam() choosing whether to return "..." instead of
> > its argument string, we'd have what we need, without messing with
> > the format language.
> 
> I'm having trouble getting the ergonomics to work out here so that it
> can generate both a redacted and an unredacted message.
> 
> If errparam() is a normal argument to errmsg(), then errparam() will be
> evaluated first. Will it return the redacted version, the unredacted
> version, or a special type that holds both?

I was thinking that it'd just store a struct ErrParam, which'd reference
the passed value and metadata like the name (for structured log
output) and redaction category.  The bigger problem I see is handling
the different types of arguments - but perhaps the answer there would be
to just make the macro typesafe? Or have one for scalar values and one
for pointer types?

We could even allocate the necessary information for this on the stack,
with some macro trickery. Not sure if it's worth it, but ...

Doing something like this does however require not directly using plain
sprintf. I'm personally not bothered by that, but I'd not be surprised
if e.g. Tom saw that differently.

> If I try to use macros to force multiple evaluation (to get one
> redacted and one unredacted string), then it seems like that would
> happen for all arguments (not just errparam arguments), which would be
> bad.

Yea, multiple evaluation clearly is a no-go.

Greetings,

Andres Freund

Re: Redacting information from logs

От

Jeff Davis

Дата:

06 августа 2019 г., 00:26:44

On Sun, 2019-08-04 at 11:17 -0700, Andres Freund wrote:
> I'm imagining something like
> 
> #define pg_printf(fmt, ...) \
>     do { \
>         if ( __builtin_constant_p(fmt)) \
>         { \
>             static processed_fmt processed_fmt_ = {.format = fmt}; \
>             if (unlikely(!processed_fmt_.parsed)) \
>                preprocess_format_string(&processed_fmt_) \
>             pg_printf_processed(&processed_fmt_, __VA_ARGS__); \
>         } \
>         else \
>             pg_printf_unprocessed(fmt, __VA_ARGS__); \
>     } while (0) \

What would you do in the preprocessing exactly? Create a list of
indexes into the string where the format codes are?

> I think we've quite repeatedly had requests for a log format that can
> be
> parsed reasonably, and annotated with additional information that's
> too
> verbose for the main message.  It's a pain to have to parse quite
> complex strings in production, just to extract the value of the
> actual
> information in the message.
> 
> ts="2019-08-04 10:19:21.286 PDT" pid=32416 vxid=12/9 sev=LOG \
>     msg="Prozess 15027 erlangte AccessShareLock-Sperre auf Relation
> 1259 der Datenbank 13416 nach 29808,865 ms" \
>     fmt="process %d acquired %s on %s after %ld.%03d ms" \
>     p:modename="AccessShareLock-Sperre auf Relation 1259" \
>     p:msec=18343 p:usec=173 \
>     p:locktag=190345/134343/0/0/relation/default \
>     c:statement="LOCK TABLE pg_class;" \
>     l:func=ProcSleep l:file=proc.c l:line=1495
> 

...

> If we have enough information to pass to the logging hook, we don't
> even
> need to define how all of this is going to look like exactly
> (although
> I'd probably argue that a logfmt or json target ought to be in core).

I think I see where you are going with this now: it is almost
orthogonal to your new-style format strings ( %{foo}s ), but not quite.

You're suggesting that we save all of the arguments, along with some
annotation, in the ErrorData struct, and then let emit_log_hook sort it
out (perhaps by constructing some JSON message, perhaps translating the
message_id, etc.).

I like the idea, but still playing with the ergonomics a bit, and how
it interacts with various message parts (msg, hint, detail, etc.). If
we had the name-based format strings, then the message parts could all
share a common set of parameters; but with the current positional
format strings, I think each message part would need its own set of
arguments.

Positional:

   ereport(ERROR,
           (errcode(ERRCODE_UNIQUE_VIOLATION),
            errmsg_params("duplicate key value violates unique
constraint \"%s\"",
                          errparam("constraintname", MSGDEFAULT,
                                   RelationGetRelationName(rel)),
            errdetail_params("Key %s already exists.",
                             errparam("key", MSGUSERDATA, key_desc)))
           );

Named:

   ereport(ERROR,
           (errcode(ERRCODE_UNIQUE_VIOLATION),
            errmsg_rich("duplicate key value violates unique constraint
\"%{constraintname}s\""),
            errdetail_rich("Key %{key}s already exists."),
            errparam("key", MSGUSERDATA, key_desc))
            errparam("constraintname", MSGDEFAULT, 
                     RelationGetRelationName(rel)))
           );

I think either one needs some ergonomic improvements, but it seems like
we are going in the right direction.

Maybe we can make the parameters common to different message parts by
using an integer index to reference the parameter, like:

   ereport(ERROR,
           (errcode(ERRCODE_UNIQUE_VIOLATION),
            errmsg_rich("duplicate key value violates unique constraint
\"%s\"", 1 /* second errparam */),
            errdetail_rich("Key %s already exists.", 0 /* first
errparam */),
            errparam("key", MSGUSERDATA, key_desc))
            errparam("constraintname", MSGDEFAULT, 
                     RelationGetRelationName(rel)))
           );

Not quite ideal, but might get us closer.

Regards,
    Jeff Davis

Re: Redacting information from logs

От

Jeff Davis

Дата:

06 августа 2019 г., 00:32:36

On Mon, 2019-08-05 at 14:10 -0700, Andres Freund wrote:
> I was thinking that it'd just store a struct ErrParam, which'd
> reference
> the passed value and metadata like the name (for structured log
> output) and redaction category.  The bigger problem I see is handling
> the different types of arguments - but perhaps the answer there would
> be
> to just make the macro typesafe? Or have one for scalar values and
> one
> for pointer types?

Losing the compile-time checks for compatibility between format codes
and arguments would be a shame. Are you saying there's a potential
solution for that?

> Doing something like this does however require not directly using
> plain
> sprintf. I'm personally not bothered by that, but I'd not be
> surprised
> if e.g. Tom saw that differently.

It may be possible to still use sprintf if we translate the ErrParams
into plain values first.

Regards,
    Jeff Davis

Re: Redacting information from logs

От

Andres Freund

Дата:

06 августа 2019 г., 00:44:58

Hi,

On 2019-08-05 14:26:44 -0700, Jeff Davis wrote:
> On Sun, 2019-08-04 at 11:17 -0700, Andres Freund wrote:
> > I'm imagining something like
> > 
> > #define pg_printf(fmt, ...) \
> >     do { \
> >         if ( __builtin_constant_p(fmt)) \
> >         { \
> >             static processed_fmt processed_fmt_ = {.format = fmt}; \
> >             if (unlikely(!processed_fmt_.parsed)) \
> >                preprocess_format_string(&processed_fmt_) \
> >             pg_printf_processed(&processed_fmt_, __VA_ARGS__); \
> >         } \
> >         else \
> >             pg_printf_unprocessed(fmt, __VA_ARGS__); \
> >     } while (0) \
> 
> What would you do in the preprocessing exactly? Create a list of
> indexes into the string where the format codes are?

Yea, basically. If you look at snprint.c's dopr(), there's a fair bit of
parsing going on that is going to stay the same from call to call in
nearly all cases (and the cases where not we possibly ought to fix). And
things like the $ processing for argument order, or having named
arguments as I suggest, make that more pronounced.


> > If we have enough information to pass to the logging hook, we don't
> > even
> > need to define how all of this is going to look like exactly
> > (although
> > I'd probably argue that a logfmt or json target ought to be in core).
> 
> I think I see where you are going with this now: it is almost
> orthogonal to your new-style format strings ( %{foo}s ), but not quite.
> 
> You're suggesting that we save all of the arguments, along with some
> annotation, in the ErrorData struct, and then let emit_log_hook sort it
> out (perhaps by constructing some JSON message, perhaps translating the
> message_id, etc.).

Right.


> I like the idea, but still playing with the ergonomics a bit, and how
> it interacts with various message parts (msg, hint, detail, etc.). If
> we had the name-based format strings, then the message parts could all
> share a common set of parameters; but with the current positional
> format strings, I think each message part would need its own set of
> arguments.

Right, I think that's a good part of where I was coming from.


> Maybe we can make the parameters common to different message parts by
> using an integer index to reference the parameter, like:
> 
>    ereport(ERROR,
>            (errcode(ERRCODE_UNIQUE_VIOLATION),
>             errmsg_rich("duplicate key value violates unique constraint
> \"%s\"", 1 /* second errparam */),
>             errdetail_rich("Key %s already exists.", 0 /* first
> errparam */),
>             errparam("key", MSGUSERDATA, key_desc))
>             errparam("constraintname", MSGDEFAULT, 
>                      RelationGetRelationName(rel)))
>            );
> 
> Not quite ideal, but might get us closer.

If we insist that errmsg_rich/errdetail_rich may not have parameters,
then they can just use the same set of arguments, without any of this,
at the cost of sometimes more complicated % syntax (i.e. %1$d to refer
to the first argument).

I think the probable loss of gcc format warnings would be the biggest
issue with this whole proposal, and potential translator trouble the
biggest impediment for named parameters.

Greetings,

Andres Freund

Re: Redacting information from logs

От

Andres Freund

Дата:

06 августа 2019 г., 00:49:52

Hi,

On 2019-08-05 14:32:36 -0700, Jeff Davis wrote:
> On Mon, 2019-08-05 at 14:10 -0700, Andres Freund wrote:
> > I was thinking that it'd just store a struct ErrParam, which'd
> > reference
> > the passed value and metadata like the name (for structured log
> > output) and redaction category.  The bigger problem I see is handling
> > the different types of arguments - but perhaps the answer there would
> > be
> > to just make the macro typesafe? Or have one for scalar values and
> > one
> > for pointer types?
> 
> Losing the compile-time checks for compatibility between format codes
> and arguments would be a shame. Are you saying there's a potential
> solution for that?

Yea, I'd just written that in another reply to yours. I did actually
think about this some recently:
https://www.postgresql.org/message-id/20190730181845.jyyk4selyohagwnf%40alap3.anarazel.de

Not sure if any of that is really applicable.  Once more I really am
regretting that PG is C rather than C++.


> > Doing something like this does however require not directly using
> > plain
> > sprintf. I'm personally not bothered by that, but I'd not be
> > surprised
> > if e.g. Tom saw that differently.
> 
> It may be possible to still use sprintf if we translate the ErrParams
> into plain values first.

It's pretty hard to call those functions with a runtime variable number
of arguments. va_list as an interface is not great...

Greetings,

Andres Freund

Re: Redacting information from logs

От

Jeff Davis

Дата:

06 августа 2019 г., 02:05:11

On Mon, 2019-08-05 at 14:44 -0700, Andres Freund wrote:
> at the cost of sometimes more complicated % syntax (i.e. %1$d to
> refer
> to the first argument).
> 
> I think the probable loss of gcc format warnings would be the biggest
> issue with this whole proposal, and potential translator trouble the
> biggest impediment for named parameters.

I'd be OK with '%1$d' syntax.

That leaves type safety as the main problem. Your solution involving
_Generic is interesting -- I didn't even know that existed. I don't
think it needs to be supported on all compilers, as long as we are
getting errors from somewhere. They would be runtime errors instead of
compile-time errors, though.

Regards,
    Jeff Davis

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Redacting information from logs

Вложения