Thread: Fix XML handling with DOCTYPE
Hi all,
I'm investigating the issue I reported here: https://www.postgresql.org/message-id/flat/153478795159.1302.9617586466368699403%40wrigleys.postgresql.org
As Tom Lane mentioned there, the docs (8.13) indicate xmloption = CONTENT should accept all valid XML. At this time, XML with a DOCTYPE declaration is not accepted with this setting even though it is considered valid XML. I'd like to work on a patch to address this issue and make it work as advertised.
I traced the source of the error to line ~1500 in /src/backend/utils/adt/xml.c
res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0, utf8string + count, NULL);
It looks like it is xmlParseBalancedChunkMemory from libxml that fails when there's a DOCTYPE in the XML data. My assumption is that the DOCTYPE declaration makes the XML not well-balanced. From the libxml documentation:
This function returns:
0 if the chunk is well balanced, -1 in case of args problem and the parser error code otherwise
I see xmlParseBalancedChunkMemoryRecover that might provide the functionality needed. That function returns:
0 if the chunk is well balanced, -1 in case of args problem and the parser error code otherwise. In case recover is set to 1, the nodelist will not be empty even if the parsed chunk is not well balanced, assuming the parsing succeeded to some extent.
I haven't yet tested whether this parses data with a DOCTYPE successfully. If it does, I don't think it would be difficult to update the check on res_code to not fail. I'm also assuming libxml returns a distinct error code that would let us differentiate this case from other errors, but I couldn't find those codes quickly. The current check is this:
if (res_code != 0 || xmlerrcxt->err_occurred)
Does this sound reasonable? Have I missed some major aspect? If this is on the right track I can work on creating a patch to move this forward.
Thanks,
On 03/16/19 16:10, Ryan Lambert wrote:
> As Tom Lane mentioned there, the docs (8.13) indicate xmloption = CONTENT
> should accept all valid XML. At this time, XML with a DOCTYPE declaration
> is not accepted with this setting even though it is considered valid XML.

Hello Ryan,

A patch for your issue is currently registered in the 2019-03 commitfest[1]. If it attracts somebody to review it before the end of the month, it might make it into PG v12. It is the xml-content-2006-2.patch found on the email thread [2]. (The other patch found there is associated documentation fixes, and also needs to be reviewed.)

Further conversation should probably be on that email thread so that it stays associated with the commitfest entry.

Thanks for your interest in the issue!

Regards,
Chapman Flack

[1] https://commitfest.postgresql.org/22/1872/
[2] https://www.postgresql.org/message-id/flat/5C81F8C0.6090901@anastigmatix.net
Ryan Lambert <ryan@rustprooflabs.com> writes:
> I'm investigating the issue I reported here:
> https://www.postgresql.org/message-id/flat/153478795159.1302.9617586466368699403%40wrigleys.postgresql.org
> I'd like to work on a patch to address this issue and make it work as
> advertised.

Good idea, because it doesn't seem like anybody else cares ...

> I see xmlParseBalancedChunkMemoryRecover that might provide the
> functionality needed.

TBH, our experience with libxml has not been so positive that I'd think adding dependencies on new parts of its API would be a good plan.

Experimenting with different inputs, it seems like removing the "<!DOCTYPE ...>" tag is enough to make it work. So what I'm wondering about is writing something like parse_xml_decl() to skip over that.

Bear in mind though that I know next to zip about XML. There may be some good reason why we don't want to strip off the !DOCTYPE part from what libxml sees.

			regards, tom lane
Chapman Flack <chap@anastigmatix.net> writes:
> A patch for your issue is currently registered in the 2019-03 commitfest[1].

Oh! I apologize for saying nobody was working on this issue.

Taking a very quick look at your patch, though, I dunno --- it seems like it adds a boatload of new assumptions about libxml's data structures and error-reporting behavior. I'm sure it works for you, but will it work across a wide range of libxml versions?

What do you think of the idea I just posted about parsing off the DOCTYPE thing for ourselves, and not letting libxml see it?

			regards, tom lane
On 03/16/19 16:42, Tom Lane wrote:
> Ryan Lambert <ryan@rustprooflabs.com> writes:
>> I'm investigating the issue I reported here:
>> https://www.postgresql.org/message-id/flat/153478795159.1302.9617586466368699403%40wrigleys.postgresql.org
>> I'd like to work on a patch to address this issue and make it work as
>> advertised.
>
> Good idea, because it doesn't seem like anybody else cares ...

ahem
On 03/16/19 16:55, Tom Lane wrote:
> What do you think of the idea I just posted about parsing off the DOCTYPE
> thing for ourselves, and not letting libxml see it?

The principled way of doing that would be to pre-parse to find a DOCTYPE, and if there is one, leave it there and parse the input as we do for 'document'. Per XML, if there is a DOCTYPE, the document must satisfy the 'document' syntax requirements, and per SQL/XML:2006-and-later, 'content' is a proper superset of 'document', so if we were asked for 'content' and can successfully parse it as 'document', we're good, and if we see a DOCTYPE and yet it incurs a parse error as 'document', well, that's what needed to happen.

The DOCTYPE can appear arbitrarily far in, but the only things that can precede it are the XML decl, whitespace, XML comments, and XML processing instructions. None of those things nest, so the preceding stuff makes a regular language, and a regular expression that matches any amount of that stuff ending in <!DOCTYPE would be enough to detect that the parse should be shunted off to get 'document' treatment.

The patch I submitted essentially relies on libxml to do that same parsing up to that same point and detect the error, so it would add no upfront cost in the majority of cases that aren't this corner one. But keeping a little compiled regex around and testing the input with that would hardly break the bank, either.

Regards,
-Chap
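[A pre-scan of that regular prolog language could be sketched in plain C along these lines. This is purely illustrative — the function name and simplifications are mine, not from any patch; in particular it treats the XML declaration as just another <? ... ?> unit and leaves all real validation to libxml:]

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if a DOCTYPE declaration begins the "document part" of the
 * input, i.e. is preceded only by whitespace, comments, and <? ... ?>
 * units (the XML declaration and processing instructions).  Nothing here
 * validates anything; malformed prologs are left for libxml to reject.
 */
static bool
prolog_has_doctype(const char *p)
{
    for (;;)
    {
        while (isspace((unsigned char) *p))
            p++;

        if (strncmp(p, "<!--", 4) == 0)     /* skip a comment */
        {
            const char *end = strstr(p + 4, "-->");

            if (end == NULL)
                return false;               /* malformed; let libxml complain */
            p = end + 3;
        }
        else if (strncmp(p, "<?", 2) == 0)  /* skip XML decl or a PI */
        {
            const char *end = strstr(p + 2, "?>");

            if (end == NULL)
                return false;
            p = end + 2;
        }
        else
            return strncmp(p, "<!DOCTYPE", 9) == 0;
    }
}
```

Anything other than those three kinds of prolog item ends the scan immediately, so on typical inputs the answer comes within the first few characters, as described above.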
Chapman Flack <chap@anastigmatix.net> writes:
> On 03/16/19 16:55, Tom Lane wrote:
>> What do you think of the idea I just posted about parsing off the DOCTYPE
>> thing for ourselves, and not letting libxml see it?

> The principled way of doing that would be to pre-parse to find a DOCTYPE,
> and if there is one, leave it there and parse the input as we do for
> 'document'. Per XML, if there is a DOCTYPE, the document must satisfy
> the 'document' syntax requirements, and per SQL/XML:2006-and-later,
> 'content' is a proper superset of 'document', so if we were asked for
> 'content' and can successfully parse it as 'document', we're good,
> and if we see a DOCTYPE and yet it incurs a parse error as 'document',
> well, that's what needed to happen.

Hm, so, maybe just

(1) always try to parse as document. If successful, we're done.

(2) otherwise, if allowed by xmloption, try to parse using our current logic for the CONTENT case.

This avoids adding any new assumptions about how libxml acts, which is what I was hoping to achieve.

One interesting question is which error to report if both (1) and (2) fail.

			regards, tom lane
On 03/16/19 17:21, Tom Lane wrote:
> Chapman Flack <chap@anastigmatix.net> writes:
>> On 03/16/19 16:55, Tom Lane wrote:
>>> What do you think of the idea I just posted about parsing off the DOCTYPE
>>> thing for ourselves, and not letting libxml see it?
>
>> The principled way of doing that would be to pre-parse to find a DOCTYPE,
>> and if there is one, leave it there and parse the input as we do for
>> 'document'. Per XML, if there is a DOCTYPE, the document must satisfy
>> the 'document' syntax requirements, and per SQL/XML:2006-and-later,
>> 'content' is a proper superset of 'document', so if we were asked for
>> 'content' and can successfully parse it as 'document', we're good,
>> and if we see a DOCTYPE and yet it incurs a parse error as 'document',
>> well, that's what needed to happen.
>
> Hm, so, maybe just
>
> (1) always try to parse as document. If successful, we're done.
>
> (2) otherwise, if allowed by xmloption, try to parse using our
> current logic for the CONTENT case.

What I don't like about that is that (a) the input could be arbitrarily long and complex to parse (not that you can't imagine a database populated with lots of short little XML snippets, but at the same time, a query could quite plausibly deal in yooge ones), and (b), step (1) could fail at the last byte of the input, followed by total reparsing as (2).

I think the safer structure is clearly that of the current patch, modulo whether the "has a DOCTYPE" test is done by libxml itself (with the assumptions you don't like) or by a pre-scan.

So the current structure is:

    restart:
      asked for document?
        parse as document, or fail
      else asked for content:
        parse as content
        failed?
          because DOCTYPE? restart as if document
          else fail

and a pre-scan structure could be very similar:

    restart:
      asked for document?
        parse as document, or fail
      else asked for content:
        pre-scan finds DOCTYPE?
          restart as if document
        else parse as content, or fail

The pre-scan is a simple linear search and will ordinarily say yes or no within a couple dozen characters--you could *have* an input with 20k of leading whitespace and comments, but it's hardly the norm. Just trying to parse as 'document' first could easily parse a large fraction of the input before discovering it's followed by something that can't follow a document element.

Regards,
-Chap
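[The pre-scan dispatch structure can be sketched in C with toy stand-ins for the actual parse steps. Every name here is illustrative, not from the actual patch; the stubs deliberately trivialize the parsing so that only the control flow carries the idea:]

```c
#include <stdbool.h>
#include <string.h>

typedef enum { XMLOPTION_DOCUMENT, XMLOPTION_CONTENT } XmlOptionType;

/*
 * Toy stand-ins for the real libxml-based parse steps: here "document"
 * just means the input starts with '<', and "content" accepts anything.
 */
static bool parse_as_document(const char *s) { return s[0] == '<'; }
static bool parse_as_content(const char *s)  { (void) s; return true; }

/*
 * Crude stand-in for the pre-scan; the real one would also skip
 * whitespace, comments, and PIs preceding the DOCTYPE.
 */
static bool
prolog_has_doctype(const char *s)
{
    return strncmp(s, "<!DOCTYPE", 9) == 0;
}

static bool
parse_xml(const char *s, XmlOptionType xmloption)
{
    if (xmloption == XMLOPTION_DOCUMENT)
        return parse_as_document(s);    /* parse as document, or fail */

    /*
     * Asked for content: a DOCTYPE means the input must satisfy the
     * 'document' production, which SQL/XML:2006 counts as content too,
     * so restart as if document.
     */
    if (prolog_has_doctype(s))
        return parse_as_document(s);

    return parse_as_content(s);         /* parse as content, or fail */
}
```

Note that the CONTENT branch never parses twice: the cheap pre-scan decides up front which single parse to run.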
Thank you both! I had glanced at that item in the commitfest but didn't notice it would fix this issue.
I'll try to test/review this before the end of the month, much better than starting from scratch myself. A quick glance at the patch looks logical and looks like it should work for my use case.
Thanks,
Ryan Lambert
On 03/16/19 18:33, Chapman Flack wrote:
> The pre-scan is a simple linear search and will ordinarily say yes or no
> within a couple dozen characters--you could *have* an input with 20k of
> leading whitespace and comments, but it's hardly the norm. Just trying to

If the available regexp functions want to start by munging the entire input into a pg_wchar array, then it may be better to implement the pre-scan as open code, the same way parse_xml_decl() is already implemented.

Given that parse_xml_decl() already covers the first optional thing that can precede the doctype, the remaining scan routine would only need to recognize comments, PIs, and whitespace. That would be pretty straightforward.

Regards,
-Chap
Chapman Flack <chap@anastigmatix.net> writes:
> On 03/16/19 17:21, Tom Lane wrote:
>> Hm, so, maybe just
>>
>> (1) always try to parse as document. If successful, we're done.
>>
>> (2) otherwise, if allowed by xmloption, try to parse using our
>> current logic for the CONTENT case.

> What I don't like about that is that (a) the input could be
> arbitrarily long and complex to parse (not that you can't imagine
> a database populated with lots of short little XML snippets, but
> at the same time, a query could quite plausibly deal in yooge ones),
> and (b), step (1) could fail at the last byte of the input, followed
> by total reparsing as (2).

That doesn't seem particularly likely to me: based on what's been said here, I'd expect parsing with the wrong expectation to usually fail near the start of the input. In any case, the other patch also requires repeat parsing, no? It's just doing that in a different set of cases.

The reason I'm pressing you for a simpler patch is that dump/reload failures are pretty bad, so ideally we'd find a fix that we're comfortable with back-patching into the released branches. Personally I would never dare to back-patch the proposed patch: it's too complex, so it's not real clear that it doesn't have unwanted side-effects, and it's not at all certain that there aren't libxml version dependencies in it. (Maybe another committer with more familiarity with libxml would evaluate the risks differently, but I doubt it.) But I think that something close to what I sketched above would pass muster as safe-to-backpatch.

			regards, tom lane
On 03/17/19 11:45, Tom Lane wrote:
> Chapman Flack <chap@anastigmatix.net> writes:
>> On 03/16/19 17:21, Tom Lane wrote:
>>> (1) always try to parse as document. If successful, we're done.
>>> (2) otherwise, if allowed by xmloption, try to parse using our
>
>> What I don't like about that is that (a) the input could be
>> arbitrarily long and complex to parse (not that you can't imagine
>> a database populated with lots of short little XML snippets, but
>> at the same time, a query could quite plausibly deal in yooge ones),
>> and (b), step (1) could fail at the last byte of the input, followed
>> by total reparsing as (2).
>
> That doesn't seem particularly likely to me: based on what's been
> said here, I'd expect parsing with the wrong expectation to usually
> fail near the start of the input. In any case, the other patch
> also requires repeat parsing, no? It's just doing that in a different
> set of cases.

I'll do up a version with the open-coded prescan I proposed last night.

Whether parsing with the wrong expectation is likely to fail near the start of the input depends on both the input and the expectation. If your expectation is DOCUMENT and the input is CONTENT, it's possible for the determining difference to be something that follows the first element, and a first element can be (and often is) nearly all of the input.

What I was doing in the patch is the reverse: parsing with the expectation of CONTENT to see if a DTD gets tripped over. It isn't allowed for an element to precede a DTD, so that approach can be expected to fail fast if the other branch needs to be taken. But a quick pre-scan for the same thing would have the same property, without the libxml dependencies that bother you here. Watch this space.

Regards,
-Chap
Chapman Flack <chap@anastigmatix.net> writes:
> What I was doing in the patch is the reverse: parsing with the expectation
> of CONTENT to see if a DTD gets tripped over. It isn't allowed for an
> element to precede a DTD, so that approach can be expected to fail fast
> if the other branch needs to be taken.

Ah, right. I don't have any problem with trying the CONTENT approach before the DOCUMENT approach rather than vice-versa. What I was concerned about was adding a lot of assumptions about exactly how libxml would report the failure. IMO a maximally-safe patch would just rearrange things we're already doing without adding new things.

> But a quick pre-scan for the same thing would have the same property,
> without the libxml dependencies that bother you here. Watch this space.

Do we need a pre-scan at all?

			regards, tom lane
On 03/17/19 13:16, Tom Lane wrote:
> Chapman Flack <chap@anastigmatix.net> writes:
>> What I was doing in the patch is the reverse: parsing with the expectation
>> of CONTENT to see if a DTD gets tripped over. It isn't allowed for an
>> element to precede a DTD, so that approach can be expected to fail fast
>> if the other branch needs to be taken.
>
> Ah, right. I don't have any problem with trying the CONTENT approach
> before the DOCUMENT approach rather than vice-versa. What I was concerned
> about was adding a lot of assumptions about exactly how libxml would
> report the failure. IMO a maximally-safe patch would just rearrange
> things we're already doing without adding new things.
>
>> But a quick pre-scan for the same thing would have the same property,
>> without the libxml dependencies that bother you here. Watch this space.
>
> Do we need a pre-scan at all?

Without it, we double the time to a failure result in every case that should actually fail, as well as in this one corner case that we want to see succeed, and the question you posed earlier about which error message to return becomes thornier.

If the query asked for CONTENT, any error result should be one you could get when parsing as CONTENT. If we switch and try parsing as DOCUMENT _because the input is claiming to have the form of a DOCUMENT_, then it's defensible to return errors explaining why it's not a DOCUMENT ... but not in the general case of just throwing DOCUMENT at it any time CONTENT parse fails.

Regards,
-Chap
Chapman Flack <chap@anastigmatix.net> writes:
> On 03/17/19 13:16, Tom Lane wrote:
>> Do we need a pre-scan at all?

> Without it, we double the time to a failure result in every case that
> should actually fail, as well as in this one corner case that we want to
> see succeed, and the question you posed earlier about which error message
> to return becomes thornier.

I have absolutely zero concern about whether it takes twice as long to detect bad input; nobody should be sending bad input if they're concerned about performance. (The costs of the ensuing transaction abort would likely dwarf xml_in's runtime in any case.) Besides, with what we're talking about doing here, (1) the extra runtime is consumed only in cases that would fail up to now, so nobody can complain about a performance regression; (2) doing a pre-scan *would* be a performance regression for cases that work today; not a large one we hope, but still...

The error message issue is indeed a concern, but I don't see why it's complicated if you agree that

> If the query asked for CONTENT, any error result should be one you could get
> when parsing as CONTENT.

That just requires us to save the first error message and be sure to issue that one not the DOCUMENT one, no? That's what we'd want to do from a backwards-compatibility standpoint anyhow, since that's the error message wording you'd get with today's code.

			regards, tom lane
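[The save-the-first-error idea can be sketched with toy parsers standing in for the libxml calls. All names, failure conditions, and message texts here are made up for illustration; only the save-and-reissue flow is the point:]

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Toy parsers standing in for the libxml-based ones; each "fails" by
 * writing a message into errbuf and returning false.
 */
static bool
toy_parse_content(const char *s, char *errbuf, size_t len)
{
    if (strstr(s, "<!DOCTYPE") != NULL)
    {
        snprintf(errbuf, len, "invalid XML content");
        return false;
    }
    return true;
}

static bool
toy_parse_document(const char *s, char *errbuf, size_t len)
{
    if (s[0] != '<')
    {
        snprintf(errbuf, len, "invalid XML document");
        return false;
    }
    return true;
}

/*
 * Try CONTENT first; on failure retry as DOCUMENT.  If both fail,
 * report the saved CONTENT error -- the wording a caller would have
 * seen before any retry logic existed.
 */
static bool
parse_content_with_retry(const char *s, char *errbuf, size_t len)
{
    char        first_err[256];

    if (toy_parse_content(s, first_err, sizeof first_err))
        return true;
    if (toy_parse_document(s, errbuf, len))
        return true;
    snprintf(errbuf, len, "%s", first_err);
    return false;
}
```

The copy into a local buffer is the easy part here; as the following message notes, the hard part with real libxml is that its errors are structured data, not a ready-made string.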
On 03/17/19 15:06, Tom Lane wrote:
> The error message issue is indeed a concern, but I don't see why it's
> complicated if you agree that
>
>> If the query asked for CONTENT, any error result should be one you could get
>> when parsing as CONTENT.
>
> That just requires us to save the first error message and be sure to issue
> that one not the DOCUMENT one, no?

I confess I haven't looked hard yet at how to do that. The way errors come out of libxml is more involved than "here's a message", so there's a choice of (a) try to copy off that struct in a way that's sure to survive re-executing the parser, and then use the copy, or (b) generate a message right away from the structured information and save that, and I guess b might not be too bad; a might not be too bad, or it might, and slide right back into the kind of libxml-behavior-assumptions you're wanting to avoid.

Meanwhile, here is a patch on the lines I proposed earlier, with a pre-check. Any performance hit that it could entail (which I'd really expect to be de minimis, though I haven't benchmarked) ought to be compensated by the strlen I changed to strnlen in parse_xml_decl (as there's really no need to run off and count the whole rest of the input just to know if 1, 2, 3, or 4 bytes are available to decode a UTF-8 char).

... and, yes, I know that could be an independent patch, and then the performance effect here should be measured from there. But it was near what I was doing anyway, so I included it here.

Attaching both still-outstanding patches (this one and docfix) so the CF app doesn't lose one.

Regards,
-Chap
There might be too many different email threads on this with patches, but in case it went under the radar, xml-content-2006-3.patch appeared in my previous message on this thread[1].

It is based on a simple pre-check of the prefix of the input, determining which form of parse to apply. That may or may not be simpler than parse-once-save-error-parse-again-report-first-error, but IMV it's a more direct solution and clearer (the logic is clearly about "how do I determine the way this input should be parsed?" which is the problem on the table, rather than "how should I save and regurgitate this libxml error?" which turns the problem on the table to a different one).

I decided, for a first point of reference, to wear the green eyeshade and write a pre-check that exactly implements the applicable rules. That gives a starting point for simplifications that are probably safe.

For example, a bunch of lines at the end have to do with verifying the content inside of a processing-instruction, after finding where it ends. We could reasonably decide that, for the purpose of skipping it, knowing where it ends is enough, as libxml will parse it next and report any errors anyway. That would slightly violate my intention of sending input to (the parser that wasn't asked for) /only/ when it's completely clear (from the prefix we've seen) that that's where it should go. The relaxed version could do that in completely-clear cases and cases with an invalid PI ahead of what looks like a DTD. But you'd pretty much expect both parsers to produce the same message for a bad PI anyway.

That made me just want to try it now, and--surprise!--the messages from libxml are not the same. So maybe I would lean to keeping the green-eyeshade rules in the test, if you can stomach them, but I would understand taking them out.

Regards,
-Chap

[1] https://www.postgresql.org/message-id/5C8ECAA4.3090301@anastigmatix.net
Chapman Flack <chap@anastigmatix.net> writes:
> I decided, for a first point of reference, to wear the green eyeshade and
> write a pre-check that exactly implements the applicable rules. That gives
> a starting point for simplifications that are probably safe.
> For example, a bunch of lines at the end have to do with verifying the
> content inside of a processing-instruction, after finding where it ends.
> We could reasonably decide that, for the purpose of skipping it, knowing
> where it ends is enough, as libxml will parse it next and report any errors
> anyway.

Yeah, I did not like that code too much, particularly not all the magic Unicode-code-point numbers. I removed that, made some other changes to bring the patch more in line with PG coding style, and pushed it.

> That made me just want to try it now, and--surprise!--the messages from
> libxml are not the same. So maybe I would lean to keeping the green-eyeshade
> rules in the test, if you can stomach them, but I would understand taking
> them out.

I doubt anyone will care too much about whether error messages for bad XML input are exactly like what they were before; and even if someone does, I doubt that these extra tests would be enough to ensure that the messages don't change. You're not really validating that the input is something that libxml would accept, unless its processing of XML PIs is far stupider than I would expect it to be.

			regards, tom lane
On 03/23/19 16:59, Tom Lane wrote:
> Unicode-code-point numbers. I removed that, made some other changes to
> bring the patch more in line with PG coding style, and pushed it.

Thanks! It looks good. I'm content with the extra PI checking being gone. The magic Unicode-code-point numbers come straight from the XML standard; I couldn't make that stuff up. :)

> You're not really validating that the input
> is something that libxml would accept, unless its processing of XML PIs
> is far stupider than I would expect it to be.

Out of curiosity, what further processing would you expect libxml to do? XML parsers are supposed to be transparent PI-preservers, except in the rare case of seeing a PI that actually means something to the embedding application, which isn't going to be the case for a database simply implementing an XML data type. The standard literally requires that the target must be a NAME, and can't match [Xx][Mm][Ll], and if there's whitespace and anything after that, there can't be an embedded ?> ... and that's it.

Regards,
-Chap
Chapman Flack <chap@anastigmatix.net> writes:
> On 03/23/19 16:59, Tom Lane wrote:
>> You're not really validating that the input
>> is something that libxml would accept, unless its processing of XML PIs
>> is far stupider than I would expect it to be.

> Out of curiosity, what further processing would you expect libxml to do?

Hm, I'd have thought it'd try to parse the arguments to some extent, but maybe not. Does everybody reimplement attribute parsing for themselves when using PIs?

			regards, tom lane
On 03/23/19 18:22, Tom Lane wrote:
>> Out of curiosity, what further processing would you expect libxml to do?
>
> Hm, I'd have thought it'd try to parse the arguments to some extent,
> but maybe not. Does everybody reimplement attribute parsing for
> themselves when using PIs?

Yeah, the content of a PI (whatever's after the target name) is left all to be defined by whatever XML-using application might care about that PI. It could have an attribute=value syntax inspired by XML elements, or some other form entirely, but there'd just better not be any ?> in it.

Regards,
-Chap
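[The PI rules summarized above — target is a Name, must not match [Xx][Mm][Ll], everything after it is opaque up to the first ?> — can be sketched as a check of a PI's interior. This is a hypothetical helper, not from any patch, and it uses an ASCII-only approximation of the XML Name production, which really allows far more characters:]

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>            /* strncasecmp */

/*
 * Check a PI's interior -- the text between "<?" and the first "?>"
 * (so by construction no "?>" can appear in it).  The target must be
 * a Name not matching [Xx][Mm][Ll]; whatever follows whitespace after
 * the target is opaque to the XML parser.
 */
static bool
pi_is_wellformed(const char *pi)
{
    size_t      tlen = 0;

    /* Name must start with a letter, '_' or ':' (ASCII approximation) */
    if (!(isalpha((unsigned char) pi[0]) || pi[0] == '_' || pi[0] == ':'))
        return false;
    while (pi[tlen] != '\0' &&
           (isalnum((unsigned char) pi[tlen]) ||
            strchr("_:.-", pi[tlen]) != NULL))
        tlen++;

    /* a target matching (('X'|'x')('M'|'m')('L'|'l')) is reserved */
    if (tlen == 3 && strncasecmp(pi, "xml", 3) == 0)
        return false;

    /* after the target: either nothing, or whitespace then opaque content */
    return pi[tlen] == '\0' || isspace((unsigned char) pi[tlen]);
}
```

Note that targets merely *starting* with "xml", such as the standard xml-stylesheet PI, are allowed; only the bare three-letter target is reserved.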
I am unable to get latest patches I found [1] to apply cleanly to current branches. It's possible I missed the latest patches so if I'm using the wrong ones please let me know. I tried against master, 11.2 stable and the 11.2 tag with similar results. It's quite possible it's user error on my end, I am new to this process but didn't have issues with the previous patches when I tested those a couple weeks ago.
Ryan Lambert
On 03/24/19 21:04, Ryan Lambert wrote:
> I am unable to get latest patches I found [1] to apply cleanly to current
> branches. It's possible I missed the latest patches so if I'm using the
> wrong ones please let me know. I tried against master, 11.2 stable and the
> 11.2 tag with similar results.

Tom pushed the content-with-DOCTYPE patch, it's now included in master, REL_11_STABLE, REL_10_STABLE, REL9_6_STABLE, REL9_5_STABLE, and REL9_4_STABLE.

The only patch that's left to be reviewed and applied is the documentation fix, latest in [1]. If you were interested in giving a review opinion on some XML documentation.

Regards,
-Chap

[1] https://www.postgresql.org/message-id/5C96DBB5.2080103@anastigmatix.net
Chapman Flack <chap@anastigmatix.net> writes:
> On 03/24/19 21:04, Ryan Lambert wrote:
>> I am unable to get latest patches I found [1] to apply cleanly to current
>> branches. It's possible I missed the latest patches so if I'm using the
>> wrong ones please let me know. I tried against master, 11.2 stable and the
>> 11.2 tag with similar results.

> Tom pushed the content-with-DOCTYPE patch, it's now included in master,
> REL_11_STABLE, REL_10_STABLE, REL9_6_STABLE, REL9_5_STABLE, and
> REL9_4_STABLE.

Right. If you want to test (and please do!) you could grab the relevant branch tip from our git repo, or download a nightly snapshot tarball from https://www.postgresql.org/ftp/snapshot/

			regards, tom lane
Perfect, thank you! I do remember seeing that message now, but hadn't understood what it really meant.
I will test later today. Thanks
Ryan
The following review has been posted through the commitfest application:

    make installcheck-world: tested, passed
    Implements feature:      tested, passed
    Spec compliant:          not tested
    Documentation:           not tested

I tested the master branch (commit 8edd0e7), REL_11_STABLE (commit 24df866) and REL9_6_STABLE (commit 5097368) and verified functionality. This patch fixes the bug I had reported [1] previously. With this in the stable branches, is it safe to assume this will be included with the next minor releases? Thanks for your work on this!!

Ryan

[1] https://www.postgresql.org/message-id/flat/153478795159.1302.9617586466368699403%40wrigleys.postgresql.org

The new status of this patch is: Ready for Committer
On 03/25/19 18:03, Ryan Lambert wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world: tested, passed
> Implements feature: tested, passed
> Spec compliant: not tested
> Documentation: not tested

Hi,

Thanks for the review! Yes, that part of this commitfest entry has been committed already and will appear in the next minor releases of those branches.

That leaves only one patch in this commitfest entry that is still in need of review, namely the update to the documentation. If you happened to feel moved to look over a documentation patch, that would be what this CF entry most needs in the waning days of the commitfest.

There seem to be community members reluctant to review it because of not feeling sufficiently expert in XML to scrutinize every technical detail, but there are other valuable angles for documentation review. (And the reason there *is* a documentation patch is the plentiful room for improvement in the documentation that's already there, so as far as reviewing goes, the old yarn about the two guys, the running shoes, and the bear comes to mind.)

I can supply pointers to specs, etc., for anyone who does see some technical details in the patch and has questions about them.

Regards,
-Chap
Chapman Flack <chap@anastigmatix.net> writes:
> Thanks for the review! Yes, that part of this commitfest entry has been
> committed already and will appear in the next minor releases of those
> branches.

Indeed, thanks for verifying that this fixes your problem.

> That leaves only one patch in this commitfest entry that is still in
> need of review, namely the update to the documentation.

Yeah. Since it *is* in need of review, I changed the CF entry's state back to Needs Review.

regards, tom lane
Thanks for putting up with a new reviewer!
--with-libxml is the PostgreSQL configure option to make it use libxml2.
The very web page http://xmlsoft.org/index.html says "The XML C parser
and toolkit of Gnome: libxml" and is all about libxml2.
So I think I was unsure what convention to follow, and threw up my hands
and went with libxml. I could just as easily throw them up and go with
libxml2. Which do you think would be better?
I think leaving it as libxml makes sense with all that. Good point that --with-libxml is used to build so I think staying with that works and is consistent. I agree that having this point included does clarify the how and why of the limitations of this implementation.
I also over-parenthesize so I'm used to looking for that in my own writing. The full sentences were the ones that seemed excessive to me, I think the others are ok and I won't nit-pick either way on those (unless you want me to!).
If you agree, I should go through and fix my nodesets to be node-sets.
Yes, I like node-sets better, especially knowing it conforms to the spec's language.
Thanks,
Ryan Lambert
On Tue, Mar 26, 2019 at 11:05 PM Chapman Flack <chap@anastigmatix.net> wrote:
On 03/26/19 21:39, Ryan Lambert wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world: tested, passed
> Implements feature: tested, passed
> Spec compliant: not tested
> Documentation: tested, passed
Thanks for the review!
> I have two recommendations for features.sgml. You state:
>
>> relies on the libxml library
>
> Should this be clarified as the libxml2 library? That's what I installed
> to build postgres from source (Ubuntu 16/18). If it is the libxml library
> and the "2" is irrelevant
That's a good catch. I'm not actually sure whether there is any "libxml"
library that isn't libxml2. Maybe there was once and nobody admits to
hanging out with it. Most Google hits on "libxml" seem to be modules
that have libxml in their names and libxml2 as their actual dependency.
Perl XML::LibXML: "This module is an interface to libxml2"
Haskell libxml: "Binding to libxml2"
libxml-ruby: "The Libxml-Ruby project provides Ruby language bindings
for the GNOME Libxml2 ..."
--with-libxml is the PostgreSQL configure option to make it use libxml2.
The very web page http://xmlsoft.org/index.html says "The XML C parser
and toolkit of Gnome: libxml" and is all about libxml2.
So I think I was unsure what convention to follow, and threw up my hands
and went with libxml. I could just as easily throw them up and go with
libxml2. Which do you think would be better?
On 03/26/19 23:52, Tom Lane wrote:
> Do we need to mention that at all? If you're not building from source,
> it doesn't seem very interesting ... but maybe I'm missing some reason
> why end users would care.
The three places I've mentioned it were the ones where I thought users
might care:
- why are we stuck at XPath 1.0? It's what we get from the library we use.
- in what order do we get things out from a (hmm) node-set? Per XPath 1.0,
it's indeterminate (it's a set!), unlike XPath 2.0/XQuery where there's
a sequence type and you have order control. Observable behavior from
libxml2 (and you could certainly want to know this) is you get things out
in document order, whether that's what you wanted or not, even though
this is undocumented, and even counter-documented[1], libxml2 behavior.
So it's an example of something you would fundamentally like to know,
where the only available answer depends precariously on the library
we happen to be using.
- which limits in our implementation are inherent to the library, and
which are just current limits in our embedding of it? (Maybe this is
right at the border of what a user would care to know, but I know it's
a question that crosses my mind when I bonk into a limit I wasn't
expecting.)
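The second point above — that libxml2 observably hands back node-set members in document order, whatever order the query might suggest — can be sketched in Python. This uses the standard-library ElementTree as a stand-in (it is not libxml2, but it too happens to yield matches in document order, which makes the behavior easy to see):

```python
import xml.etree.ElementTree as ET

# A small document whose element order differs from any "query order".
doc = ET.fromstring("<root><b/><a/><c/><a/></root>")

# Select every <a> and <c> element.  As with libxml2's node-sets, the
# matches come back in document order -- a, c, a -- regardless of how
# the selection is phrased.
hits = [el.tag for el in doc.iter() if el.tag in ("a", "c")]
print(hits)  # ['a', 'c', 'a']
```

The point being: nothing in the selection says "a before c"; document position alone determines what you observe.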
> There are a few places where the parenthesis around a block of text
> seem unnecessary.
)blush( that's a long-standing wart in my writing ... seems I often think
in parentheses, then look and say "those aren't needed" and take them out,
only sometimes I don't.
I skimmed just now and found a few instances of parenthesized whole
sentence: the one you quoted, and some (if argument is null, the result
is null), and (No rows will be produced if ....). Shall I deparenthesize
them all? Did you have other instances in mind?
> It seems you are standardizing from "node set" to "nodeset", is that
> the preferred nomenclature or preference?
Another good catch. I remember consciously making a last pass to get them
all consistent, and I wanted them consistent with the spec, and I see now
I messed up.
XPath 1.0 [2] has zero instances of "nodeset", two of "node set" and about
six dozen of "node-set". The only appearances of "node set" without the
hyphen are in a heading and its ToC entry. The stuff under that heading
consistently uses node-set. It seems that's the XPath 1.0 term for sure.
When I made my consistency pass, I must have been looking too recently
in libxml2 C source, rather than the spec.
On 03/26/19 23:52, Tom Lane wrote:
> That seemed a bit jargon-y to me too. If that's standard terminology
> in the XML world, maybe it's fine; but I'd have stuck with "node set".
It really was my intention (though I flubbed it) to use XPath 1.0's term
for XPath 1.0's concept; in my doc philosophy, that gives readers
the most breadcrumbs to follow for the rest of the details if they want
them. "Node set" might be some sort of squishy expository concept I'm
using, but node-set is a thing, in a spec.
If you agree, I should go through and fix my nodesets to be node-sets.
I do think the terminology matters here, especially because of the
differences between what you can do with a node-set (XPath 1.0 thing)
and with a sequence (XPath 2.0+, XQuery, SQL/XML thing).
Let me know what you'd like best on these points and I'll revise the patch.
Regards,
-Chap
[1] http://xmlsoft.org/html/libxml-xpath.html#xmlNodeSet : "array of nodes
in no particular order"
[2] https://www.w3.org/TR/1999/REC-xpath-19991116/
On 2019-Mar-27, Chapman Flack wrote:
> On 03/26/19 21:39, Ryan Lambert wrote:
> > Should this be clarified as the libxml2 library? That's what I installed
> > to build postgres from source (Ubuntu 16/18). If it is the libxml library
> > and the "2" is irrelevant
>
> That's a good catch. I'm not actually sure whether there is any "libxml"
> library that isn't libxml2. Maybe there was once and nobody admits to
> hanging out with it. Most Google hits on "libxml" seem to be modules
> that have libxml in their names and libxml2 as their actual dependency.
>
> Perl XML::LibXML: "This module is an interface to libxml2"
> Haskell libxml: "Binding to libxml2"
> libxml-ruby: "The Libxml-Ruby project provides Ruby language bindings
> for the GNOME Libxml2 ..."
>
> --with-libxml is the PostgreSQL configure option to make it use libxml2.
>
> The very web page http://xmlsoft.org/index.html says "The XML C parser
> and toolkit of Gnome: libxml" and is all about libxml2.
>
> So I think I was unsure what convention to follow, and threw up my hands
> and went with libxml. I could just as easily throw them up and go with
> libxml2. Which do you think would be better?

Daniel Veillard actually had libxml version 1 in that repository (mostly of GNOME provenance, it seems, put together during some W3C meeting in 1998). The version number changed to 2 sometime during year 2000. Version 1 was mostly abandoned at that point, and for some reason everyone keeps using "libxml2" as the name as though it were a different thing from "libxml". I suppose the latter name is just too generic, or they wanted to differentiate from the old (probably API-incompatible) code.

https://gitlab.gnome.org/GNOME/libxml2/tree/LIB_XML_1_BRANCH

Everyone calls it "libxml2" nowadays. Let's just use that and avoid any possible confusion. If some libxml3 emerges one day, it's quite likely we'll need to revise much more than our docs in order to use it.
> On 03/26/19 23:52, Tom Lane wrote:
> > Do we need to mention that at all? If you're not building from source,
> > it doesn't seem very interesting ... but maybe I'm missing some reason
> > why end users would care.
>
> The three places I've mentioned it were the ones where I thought users
> might care:

These seem relevant details.

> If you agree, I should go through and fix my nodesets to be node-sets.

+1

> [1] http://xmlsoft.org/html/libxml-xpath.html#xmlNodeSet : "array of nodes
> in no particular order"

What this means is "we don't guarantee any specific order". It's like a query without ORDER BY: you may currently always get document order, but if you upgrade the library one day, it's quite possible to get the nodes in another order and you'll not get a refund. So you (the user) should not rely on the order, or at least be mindful that it may change in the future.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
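Álvaro's "query without ORDER BY" analogy suggests the defensive pattern: if a particular order matters, request it explicitly rather than leaning on the library's current traversal. A minimal sketch, again using Python's standard-library ElementTree rather than libxml2 itself:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<r><item n='2'/><item n='3'/><item n='1'/></r>")

# findall() happens to return matches in document order, but -- as with
# libxml2's node-sets -- that order is not a contract.  If order matters,
# impose it explicitly, the moral equivalent of adding ORDER BY.
items = sorted(doc.findall("item"), key=lambda e: int(e.get("n")))
print([e.get("n") for e in items])  # ['1', '2', '3']
```

The same idea applies on the SQL side: wrap the XML-producing query in an outer SELECT with ORDER BY on something recoverable from the data, instead of trusting the node-set's iteration order.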
On 3/27/19 9:31 AM, Alvaro Herrera wrote:
> Everyone calls it "libxml2" nowadays. Let's just use that and avoid any
> possible confusion. If some libxml3 emerges one day, it's quite likely
> we'll need to revise much more than our docs in order to use it.

That's persuasive to me. I'll change the references to say libxml2 and let a committer serve as tiebreaker.

>> [1] http://xmlsoft.org/html/libxml-xpath.html#xmlNodeSet : "array of nodes
>> in no particular order"
>
> What this means is "we don't guarantee any specific order". It's like a
> query without ORDER BY: you may currently always get document order, but
> if you upgrade the library one day, it's quite possible to get the nodes
> in another order and you'll not get a refund. So you (the user) should
> not rely on the order, or at least be mindful that it may change in the
> future.

Exactly. I called the behavior "counter-documented" to distinguish this from the usual "undocumented" case, where you notice that a library is behaving in a way you like, but its docs are utterly silent on the matter, so you know you're going out on a limb to count on what you've noticed. In this case, you can notice the handy behavior but the doc *comes right out and disclaims it*, so if you count on it, you're going out on a limb that has no bark left and looks punky.

And yet it seems worthwhile to mention how the library does in fact seem to behave, because you might well be in the situation of porting code over from SQL/XML:2006+ or XQuery or XPath 2+, or those are the languages you've learned, so you may have order assumptions you've made, and be surprised that XPath 1 doesn't let you make them. At least we can say "in a pinch, if you don't mind standing on this punky limb here, you may be able to use the code you've got without having to refactor every XMLTABLE() or xpath() into something wrapped in an outer SQL query with ORDER BY. You just don't get your money back if a later library upgrade changes the order."
The wiki page remembers[1] that I had tried some pretty gnarly XPath 1 queries to see if I could make libxml2 return things in a different order, but no, got document order every time.

Regards,
-Chap

[1] https://www.postgresql.org/message-id/5C465A65.4030305%40anastigmatix.net
Hi,

xml-functions-type-docfix-5.patch attached, with node-sets instead of nodesets, libxml2 instead of libxml, and parenthesized full sentences now au naturel.

I ended up turning the formerly-parenthesized note about libxml2's node-set ordering into a DocBook <note>: there is really something parenthetical about it, with the official statement of node-set element ordering being that there is none, and the description of what the library happens to do being of possible interest, but set apart, with the necessary caveats about relying on it.

Spotted and fixed a couple more typos in the process.

Regards,
-Chap
On 03/27/19 19:07, Chapman Flack wrote:
> xml-functions-type-docfix-5.patch attached, with node-sets instead of
> nodesets, libxml2 instead of libxml, and parenthesized full sentences
> now au naturel.
>
> I ended up turning the formerly-parenthesized note about libxml2's
> node-set ordering into a DocBook <note>: there is really something
> parenthetical about it, with the official statement of node-set
> element ordering being that there is none, and the description of
> what the library happens to do being of possible interest, but set
> apart, with the necessary caveats about relying on it.

I have just suffered a giant sinking feeling upon re-reading this sentence in our XMLTABLE doc:

    A column marked FOR ORDINALITY will be populated with row numbers
    matching the order in which the output rows appeared in the original
    input XML document.

I've been skimming right over it all this time, and that right there is a glaring built-in reliance on the observable-but-disclaimed iteration order of a libxml2 node-set.

I'm a bit unsure what any clarifying language should even say.

Regards,
-Chap
On 03/27/19 19:27, Chapman Flack wrote:
> A column marked FOR ORDINALITY will be populated with row numbers
> matching the order in which the output rows appeared in the original
> input XML document.
>
> I've been skimming right over it all this time, and that right there is
> a glaring built-in reliance on the observable-but-disclaimed iteration
> order of a libxml2 node-set.

So, xml-functions-type-docfix-6.patch.

I changed that language to say "populated with row numbers, starting with 1, in the order of nodes retrieved from the row_expression's result node-set."

That's not such a terrible thing to have to say; in fact, it's the *correct* description for the standard, XQuery-based, XMLTABLE (where the language gives you control of the result sequence's order).

I followed that with a short note saying that since XPath 1.0 doesn't specify that order, relying on it is implementation-dependent, and linked to the existing Appendix D discussion.

I would have liked to link directly to the <listitem>, but of course <xref> doesn't know what to call that, so I linked to the <sect3> instead.

Regards,
-Chap
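The revised wording — row numbers starting with 1, in the order nodes are retrieved from the row_expression's result — has a simple operational reading. A toy model of the FOR ORDINALITY numbering (this is an illustration in Python's standard-library ElementTree, not PostgreSQL's implementation):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<rows><row id='x'/><row id='y'/><row id='z'/></rows>")

# FOR ORDINALITY, roughly: number the output rows 1, 2, 3, ... in the
# order nodes come out of the row_expression's result -- here, the
# order findall() happens to yield them.
table = [(ordinality, row.get("id"))
         for ordinality, row in enumerate(doc.findall("row"), start=1)]
print(table)  # [(1, 'x'), (2, 'y'), (3, 'z')]
```

Whether that numbering matches document order is exactly the implementation-dependent point the accompanying note flags: under XPath 1.0 the retrieval order itself is unspecified.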
I applied and reviewed xml-functions-type-docfix-6.patch. Looks good to me.
I like the standardization (e.g. libxml2, node-set) and I didn't catch any spots that used the other versions. I agree that the <note> is appropriate for that block.
It also looks like you incorporated Alvaro's feedback about sorting, or the lack thereof.
Let me know if there's anything else I can do to help get this accepted. Thanks,
Ryan
On Thu, Mar 28, 2019 at 5:45 PM Chapman Flack <chap@anastigmatix.net> wrote:
On 03/27/19 19:27, Chapman Flack wrote:
> A column marked FOR ORDINALITY will be populated with row numbers
> matching the order in which the output rows appeared in the original
> input XML document.
>
> I've been skimming right over it all this time, and that right there is
> a glaring built-in reliance on the observable-but-disclaimed iteration
> order of a libxml2 node-set.
So, xml-functions-type-docfix-6.patch.
I changed that language to say "populated with row numbers, starting
with 1, in the order of nodes retrieved from the row_expression's
result node-set."
That's not such a terrible thing to have to say; in fact, it's the
*correct* description for the standard, XQuery-based, XMLTABLE (where
the language gives you control of the result sequence's order).
I followed that with a short note saying since XPath 1.0 doesn't
specify that order, relying on it is implementation-dependent, and
linked to the existing Appendix D discussion.
I would have liked to link directly to the <listitem>, but of course
<xref> doesn't know what to call that, so I linked to the <sect3>
instead.
Regards,
-Chap
Chapman Flack <chap@anastigmatix.net> writes:
> So, xml-functions-type-docfix-6.patch.

Pushed with some light(?) copy-editing.

I believe this closes out everything discussed in

https://commitfest.postgresql.org/22/1872/

but I haven't gone through all three threads in detail. Please confirm whether that CF entry can be closed or not.

regards, tom lane
On 4/1/19 4:22 PM, Tom Lane wrote:
> Chapman Flack <chap@anastigmatix.net> writes:
>> So, xml-functions-type-docfix-6.patch.
>
> Pushed with some light(?) copy-editing.
>
> I believe this closes out everything discussed in
>
> https://commitfest.postgresql.org/22/1872/
>
> but I haven't gone through all three threads in detail.
> Please confirm whether that CF entry can be closed or not.

I think that does wrap up everything in the CF entry. Thanks! And thanks for the copy-edits; they do read better than what I came up with.

When I get a moment, I'll update the PostgreSQL vs. SQL/XML wiki page to reflect the things that were fixed.

Regards,
-Chap
On 2019-Apr-01, Chapman Flack wrote:
> When I get a moment, I'll update the PostgreSQL vs. SQL/XML wiki page
> to reflect the things that were fixed.

I think there were some outright bugs in the docs, at least for XMLTABLE, that maybe we should backpatch. If you have the energy to cherry-pick a minimal doc update to 10/11, I offer to back-patch it.

Thanks everyone for taking care of this!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04/01/19 17:34, Alvaro Herrera wrote:
> I think there were some outright bugs in the docs, at least for
> XMLTABLE, that maybe we should backpatch. If you have the energy to
> cherry-pick a minimal doc update to 10/11, I offer to back-patch it.

I'll see what I can do. There's breathing room for that after the end of the CF, right?

It seems to me that the conformance-appendix part is worth using, along with all of the clarifications in datatype.sgml and func.sgml except the ones clarifying fixed behavior, where the behavior fix wasn't backpatched. That'll be where the cherry-picking effort lies.

Regards,
-Chap
On 04/01/19 17:34, Alvaro Herrera wrote:
> I think there were some outright bugs in the docs, at least for
> XMLTABLE, that maybe we should backpatch. If you have the energy to
> cherry-pick a minimal doc update to 10/11, I offer to back-patch it.

I don't know if this fits your intention for "minimal". What I've done is taken the doc commit made by Tom for 12 (12d46a), then revised it so it describes the unfixed behavior for the bugs whose fixes weren't backpatched to 11 or 10.

I don't know if it's too late to get in the upcoming minor releases, but maybe it can, if it looks ok, or the next ones, if that's too rushed.

11.patch applies cleanly to 11, 10.patch to 10. I've confirmed the 11 docs build successfully, but without sgml tools, I haven't confirmed that for 10.

Regards,
-Chap
On 2019-Aug-03, Chapman Flack wrote:
> I don't know if it's too late to get in the upcoming minor releases,
> but maybe it can, if it looks ok, or the next ones, if that's too rushed.

Hmm, I'm travelling back home from a conference this weekend, so yeah, I think it would be rushed for me to handle for the upcoming set. But I can look at it before the *next* set.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Alvaro,

On 08/03/19 12:15, Alvaro Herrera wrote:
>> I don't know if it's too late to get in the upcoming minor releases,
>> but maybe it can, if it looks ok, or the next ones, if that's too rushed.
>
> Hmm, I'm travelling back home from a conference the weekend, so yeah I
> think it would be rushed for me to handle for the upcoming set. But I
> can look at it before the *next* set.

Are these on your radar to maybe backpatch in this round of activity?

The latest patches I did for 11 and 10 are in
https://www.postgresql.org/message-id/5D45A44F.8010803%40anastigmatix.net

Cheers,
-Chap
Hi Chapman,

On 2019-Sep-05, Chapman Flack wrote:
> Are these on your radar to maybe backpatch in this round of activity?
>
> The latest patches I did for 11 and 10 are in
> https://www.postgresql.org/message-id/5D45A44F.8010803%40anastigmatix.net

Thanks! I just pushed these to those branches. I think we're finally done with these. Many thanks for your persistence.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services