Re: Fix XML handling with DOCTYPE

From	Ryan Lambert
Subject	Re: Fix XML handling with DOCTYPE
Date	March 17, 2019, 01:43:43
Msg-id	CAN-V+g884QQLJu+guDArhmNMejgb7e5f6b7i1mfTRgHdQFzSQQ@mail.gmail.com
In reply to	Re: Fix XML handling with DOCTYPE (Chapman Flack <chap@anastigmatix.net>)
List	pgsql-hackers

Thank you both! I had glanced at that item in the commitfest but didn't notice it would fix this issue.

I'll try to test and review this before the end of the month, which is much better than starting from scratch myself. At a quick glance, the patch looks logical and should work for my use case.

Thanks,

Ryan Lambert

On Sat, Mar 16, 2019 at 4:33 PM Chapman Flack <chap@anastigmatix.net> wrote:

On 03/16/19 17:21, Tom Lane wrote:
> Chapman Flack <chap@anastigmatix.net> writes:
>> On 03/16/19 16:55, Tom Lane wrote:
>>> What do you think of the idea I just posted about parsing off the DOCTYPE
>>> thing for ourselves, and not letting libxml see it?
>
>> The principled way of doing that would be to pre-parse to find a DOCTYPE,
>> and if there is one, leave it there and parse the input as we do for
>> 'document'. Per XML, if there is a DOCTYPE, the document must satisfy
>> the 'document' syntax requirements, and per SQL/XML:2006-and-later,
>> 'content' is a proper superset of 'document', so if we were asked for
>> 'content' and can successfully parse it as 'document', we're good,
>> and if we see a DOCTYPE and yet it incurs a parse error as 'document',
>> well, that's what needed to happen.
>
> Hm, so, maybe just
>
> (1) always try to parse as document. If successful, we're done.
>
> (2) otherwise, if allowed by xmloption, try to parse using our
> current logic for the CONTENT case.

What I don't like about that is that (a) the input could be
arbitrarily long and complex to parse (not that you can't imagine
a database populated with lots of short little XML snippets, but
at the same time, a query could quite plausibly deal in yooge ones),
and (b), step (1) could fail at the last byte of the input, followed
by total reparsing as (2).

I think the safer structure is clearly that of the current patch,
modulo whether the "has a DOCTYPE" test is done by libxml itself
(with the assumptions you don't like) or by a pre-scan.

So the current structure is:

restart:
  asked for document?
    parse as document, or fail
  else asked for content:
    parse as content
    failed?
      because DOCTYPE? restart as if document
      else fail

and a pre-scan structure could be very similar:

restart:
  asked for document?
    parse as document, or fail
  else asked for content:
    pre-scan finds DOCTYPE?
      restart as if document
    else parse as content, or fail
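For illustration only, the pre-scan structure above might be sketched in Python as follows. This is not the patch's code: `parse_xml`, `parse_document`, `parse_content`, and `has_doctype` are hypothetical names, and the parsers and the pre-scan are passed in as callables so the sketch shows only the control flow. Each parser is assumed to raise an exception on failure.

```python
from typing import Any, Callable


def parse_xml(
    text: str,
    want_document: bool,
    parse_document: Callable[[str], Any],  # hypothetical: parses as 'document' or raises
    parse_content: Callable[[str], Any],   # hypothetical: parses as 'content' or raises
    has_doctype: Callable[[str], bool],    # hypothetical: cheap linear pre-scan
) -> Any:
    """Control flow of the pre-scan variant; no input is ever parsed twice."""
    # "restart as if document" collapses into a single condition here:
    # either the caller asked for a document, or the pre-scan found a DOCTYPE.
    if want_document or has_doctype(text):
        return parse_document(text)
    return parse_content(text)
```

Under these assumptions the input is handed to exactly one parser, so a failure near the end of a large value does not trigger a second full parse.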

The pre-scan is a simple linear search and will ordinarily say yes or no
within a couple dozen characters--you could *have* an input with 20k of
leading whitespace and comments, but it's hardly the norm. Just trying to
parse as 'document' first could easily parse a large fraction of the input
before discovering it's followed by something that can't follow a document
element.
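As a rough sketch of what such a pre-scan could look like (an assumption for illustration, not code from the patch), the hypothetical Python function `has_doctype` below skips leading whitespace, comments, and processing instructions (including an XML declaration), then checks whether the next thing is `<!DOCTYPE`. It is deliberately simplified: it does not handle a byte-order mark or validate what it skips.

```python
def has_doctype(text: str) -> bool:
    """Return True if a DOCTYPE appears before any element or character data.

    Simplified linear pre-scan: skips whitespace, comments, and processing
    instructions only. Anything malformed or unterminated yields False and
    is left for the real parser to report.
    """
    i, n = 0, len(text)
    while i < n:
        if text[i] in " \t\r\n":
            i += 1
        elif text.startswith("<!--", i):
            end = text.find("-->", i + 4)
            if end == -1:
                return False  # unterminated comment
            i = end + 3
        elif text.startswith("<?", i):
            end = text.find("?>", i + 2)
            if end == -1:
                return False  # unterminated processing instruction
            i = end + 2
        else:
            # First significant construct: DOCTYPE or something else.
            return text.startswith("<!DOCTYPE", i)
    return False
```

In the common case the loop stops at the first `<` that opens neither a comment nor a processing instruction, so the answer arrives after examining only the prolog, however long the rest of the input is.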

Regards,
-Chap
