Re: Fix XML handling with DOCTYPE

From Ryan Lambert
Subject Re: Fix XML handling with DOCTYPE
Date
Msg-id CAN-V+g884QQLJu+guDArhmNMejgb7e5f6b7i1mfTRgHdQFzSQQ@mail.gmail.com
In reply to Re: Fix XML handling with DOCTYPE  (Chapman Flack <chap@anastigmatix.net>)
List pgsql-hackers
Thank you both!  I had glanced at that item in the commitfest but didn't notice it would fix this issue.
I'll try to test and review it before the end of the month, which is much better than starting from scratch myself. At a quick glance, the patch looks logical and should work for my use case.

Thanks, 

Ryan Lambert


On Sat, Mar 16, 2019 at 4:33 PM Chapman Flack <chap@anastigmatix.net> wrote:
On 03/16/19 17:21, Tom Lane wrote:
> Chapman Flack <chap@anastigmatix.net> writes:
>> On 03/16/19 16:55, Tom Lane wrote:
>>> What do you think of the idea I just posted about parsing off the DOCTYPE
>>> thing for ourselves, and not letting libxml see it?
>
>> The principled way of doing that would be to pre-parse to find a DOCTYPE,
>> and if there is one, leave it there and parse the input as we do for
>> 'document'. Per XML, if there is a DOCTYPE, the document must satisfy
>> the 'document' syntax requirements, and per SQL/XML:2006-and-later,
>> 'content' is a proper superset of 'document', so if we were asked for
>> 'content' and can successfully parse it as 'document', we're good,
>> and if we see a DOCTYPE and yet it incurs a parse error as 'document',
>> well, that's what needed to happen.
>
> Hm, so, maybe just
>
> (1) always try to parse as document.  If successful, we're done.
>
> (2) otherwise, if allowed by xmloption, try to parse using our
> current logic for the CONTENT case.

What I don't like about that is that (a) the input could be
arbitrarily long and complex to parse (not that you can't imagine
a database populated with lots of short little XML snippets, but
at the same time, a query could quite plausibly deal in yooge ones),
and (b), step (1) could fail at the last byte of the input, followed
by total reparsing as (2).

I think the safer structure is clearly that of the current patch,
modulo whether the "has a DOCTYPE" test is done by libxml itself
(with the assumptions you don't like) or by a pre-scan.

So the current structure is:

restart:
  asked for document?
    parse as document, or fail
  else asked for content:
    parse as content
    failed?
      because DOCTYPE? restart as if document
      else fail

and a pre-scan structure could be very similar:

restart:
  asked for document?
    parse as document, or fail
  else asked for content:
    pre-scan finds DOCTYPE?
      restart as if document
    else parse as content, or fail
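
For illustration only, the pre-scan structure above might be sketched in Python roughly as follows. This is not the patch's code: parse_xml, has_doctype, parse_document and parse_content are hypothetical names, and the two parsers and the pre-scan are passed in as callables so the sketch shows only the control flow.

```python
from typing import Callable


def parse_xml(
    data: str,
    want_document: bool,
    has_doctype: Callable[[str], bool],       # hypothetical pre-scan
    parse_document: Callable[[str], object],  # hypothetical 'document' parser
    parse_content: Callable[[str], object],   # hypothetical 'content' parser
) -> object:
    """Dispatch to one parser; the input is never parsed twice."""
    if want_document:
        # Asked for document: parse as document, or let the error propagate.
        return parse_document(data)
    if has_doctype(data):
        # Asked for content, but a DOCTYPE requires 'document' syntax:
        # "restart as if document".
        return parse_document(data)
    # Asked for content and no DOCTYPE found: parse as content, or fail.
    return parse_content(data)
```

Whichever branch is taken, a parse failure is final, so the only extra cost over a single parse is the pre-scan.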

The pre-scan is a simple linear search and will ordinarily say yes or no
within a couple dozen characters--you could *have* an input with 20k of
leading whitespace and comments, but it's hardly the norm. Just trying to
parse as 'document' first could easily parse a large fraction of the input
before discovering it's followed by something that can't follow a document
element.
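
A minimal sketch of such a pre-scan, assuming the only things allowed ahead of a DOCTYPE are whitespace, comments, and processing instructions (with the XML declaration treated as just another `<?...?>`). The name has_doctype is hypothetical, and the sketch ignores encoding issues that a real implementation would have to handle.

```python
def has_doctype(data: str) -> bool:
    """Return True if the first construct after any leading whitespace,
    comments, and processing instructions is a DOCTYPE declaration."""
    i, n = 0, len(data)
    while i < n:
        if data[i] in " \t\r\n":
            i += 1                          # skip XML whitespace
        elif data.startswith("<!--", i):
            end = data.find("-->", i + 4)   # skip a comment
            if end < 0:
                return False                # unterminated: leave it to the parser
            i = end + 3
        elif data.startswith("<?", i):
            end = data.find("?>", i + 2)    # skip a PI or the XML declaration
            if end < 0:
                return False
            i = end + 2
        else:
            # First real construct: the answer is decided here, without
            # looking at the rest of the input.
            return data.startswith("<!DOCTYPE", i)
    return False
```

The scan stops at the first character that is not part of the skippable prefix, so for typical input it inspects only the first few dozen characters, however long the document is.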

Regards,
-Chap
