Re: Html parsing and inline elements

From: Ryan Pedela
Subject: Re: Html parsing and inline elements
Message-ID: CACu89FSEhvJ451pRymAJb9ij-449o1GW9dvsu5hUPg8xGygZtg@mail.gmail.com
In reply to: Re: Html parsing and inline elements (Marcelo Zabani <mzabani@gmail.com>)
List: pgsql-hackers
On Wed, Apr 13, 2016 at 9:57 AM, Marcelo Zabani <mzabani@gmail.com> wrote:
> Hi, Tom,
>
> You're right, I don't think one can argue that the default parser should know HTML.
> How about your suggestion of there being an HTML parser, is it feasible? I ask this because I think that a lot of people store HTML documents these days, and although there probably aren't lots of HTML with words written along multiple inline elements, it would certainly be nice to have a proper parser for these use cases.
>
> What do you think?

I recommend using Apache Tika [1] for plain-text extraction from HTML. HTML parsing has so many weird edge cases that it is easier to use a mature tool than to reinvent the wheel.
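
To make the inline-element case concrete, here is a minimal sketch of the extraction step, run before the text reaches the full-text parser. It is not Tika and does not use Tika's API; it uses only Python's standard `html.parser` module. The names `BLOCK_TAGS`, `TextExtractor`, and `html_to_text` are made up for illustration, and the list of block-level tags is an assumption that is far from complete. It also keeps the contents of `<script>` and `<style>`, which is one of the edge cases a mature tool handles for you.

```python
from html.parser import HTMLParser

# Assumed, incomplete set of block-level tags that should separate words.
BLOCK_TAGS = {"p", "div", "br", "li", "ul", "ol", "td", "tr", "table",
              "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre"}


class TextExtractor(HTMLParser):
    """Collect text: inline tags add nothing, block tags add a space."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # A block boundary ends the current word; an inline tag does not.
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text of `html` with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Collapse the runs of whitespace left behind by block boundaries.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(html_to_text("<p>Hel<b>lo</b> world</p><p>again</p>"))
```

With this sketch a word split across inline elements, such as `Hel<b>lo</b>`, comes out as the single token `Hello`, while text in adjacent `<p>` elements stays separated.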


Thanks,
Ryan Pedela
