On 03/16/19 17:21, Tom Lane wrote:
> Chapman Flack <chap@anastigmatix.net> writes:
>> On 03/16/19 16:55, Tom Lane wrote:
>>> What do you think of the idea I just posted about parsing off the DOCTYPE
>>> thing for ourselves, and not letting libxml see it?
>
>> The principled way of doing that would be to pre-parse to find a DOCTYPE,
>> and if there is one, leave it there and parse the input as we do for
>> 'document'. Per XML, if there is a DOCTYPE, the document must satisfy
>> the 'document' syntax requirements, and per SQL/XML:2006-and-later,
>> 'content' is a proper superset of 'document', so if we were asked for
>> 'content' and can successfully parse it as 'document', we're good,
>> and if we see a DOCTYPE and yet it incurs a parse error as 'document',
>> well, that's what needed to happen.
>
> Hm, so, maybe just
>
> (1) always try to parse as document. If successful, we're done.
>
> (2) otherwise, if allowed by xmloption, try to parse using our
> current logic for the CONTENT case.
What I don't like about that is that (a) the input could be
arbitrarily long and complex to parse (not that you can't imagine
a database populated with lots of short little XML snippets, but
at the same time, a query could quite plausibly deal in yooge ones),
and (b), step (1) could fail at the last byte of the input, followed
by total reparsing as (2).
I think the safer structure is clearly that of the current patch,
modulo whether the "has a DOCTYPE" test is done by libxml itself
(with the assumptions you don't like) or by a pre-scan.
So the current structure is:
restart:
asked for document?
parse as document, or fail
else asked for content:
parse as content
failed?
because DOCTYPE? restart as if document
else fail
and a pre-scan structure could be very similar:
restart:
asked for document?
parse as document, or fail
else asked for content:
pre-scan finds DOCTYPE?
restart as if document
else parse as content, or fail
The pre-scan is a simple linear search and will ordinarily say yes or no
within a couple dozen characters--you could *have* an input with 20k of
leading whitespace and comments, but it's hardly the norm. Just trying to
parse as 'document' first could easily parse a large fraction of the input
before discovering it's followed by something that can't follow a document
element.
Regards,
-Chap