Обсуждение: xPath in a database with LATIN1 encoding

Поиск
Список
Период
Сортировка

xPath in a database with LATIN1 encoding

От
Jorge Silva
Дата:
Dear all,

We have a database, which is encoded with LATIN1, and we are adding some tables in it to store xml files with the “xml” type. After creating the table and populating it, we use the xPath function to retrieve some data from the XML, but it seems to not work properly, because it returns: 
 ERROR:  could not parse XML document
DETAIL:  line 1: Input is not proper UTF-8, indicate encoding !
Bytes: 0xE7 0xE3 0x6F 0x20
50550001974291546464219"><ide><cUF>31</cUF><cNF>54646421</cNF><natOp>Venda produ

The characters that it is not recognizing are both “ç” and “ã” because I think they are encoded differently in latin1 and utf-8. Is it possible to somehow use the xPath function with special characters in the XML and in a database which is not encoded with utf-8? 

Thanks in advance.

Regards,

Jorge Luiz Moreira Silva
Polimet Indústria Metalúrgica EIRELI

Re: xPath in a database with LATIN1 encoding

От
Tom Lane
Дата:
Jorge Silva <jorge.silva93@gmail.com> writes:
> The characters that it is not recognizing are both “ç” and “ã” because I think they are encoded differently in latin1
andutf-8. Is it possible to somehow use the xPath function with special characters in the XML and in a database which
isnot encoded with utf-8?  

I don't have a lot of expertise in this area, but I think you need
an explicit encoding indicator in the xml header, a la

    <?xml encoding="latin1"?> ...

On the whole, the xml type is definitely easier to use with database
encoding set to utf8.  I think you'll be paying for encoding conversion
every time we interact with libxml, for instance.

            regards, tom lane



Re: xPath in a database with LATIN1 encoding

От
Jorge Silva
Дата:
Thanks for the quick reply, Tom. I am trying something simpler. I am trying to find a way to run the xPath function with a xml file type, which has latin characters, such as:
SELECT xpath(‘//xml_test/text()’, convert_from(convert_to(‘<xml_test>çã</xml_test>','utf-8’),'utf-8')::xml)

This line returns the same error, as follows:
[Code: 0, SQL State: 2200M]  ERROR: could not parse XML document
  Detail: line 1: Input is not proper UTF-8, indicate encoding !
Bytes: 0xE7 0xE3 0x3C 0x2F
<xml_test>çã</xml_test>
          ^
This happens because the “text”, which is the output from convert_from() function is encoded with LATIN1, the database encoding set, and not UTF-8.  Is there a way that a text variable is not encoded as the database encoding set, but some other encoding set, such as UTF-8? From what I’ve searched for, it seems that it would be something similar to nvarchar that only exists in Microsoft SQL Server.



On 24 Apr 2021, at 13:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jorge Silva <jorge.silva93@gmail.com> writes:
The characters that it is not recognizing are both “ç” and “ã” because I think they are encoded differently in latin1 and utf-8. Is it possible to somehow use the xPath function with special characters in the XML and in a database which is not encoded with utf-8?

I don't have a lot of expertise in this area, but I think you need
an explicit encoding indicator in the xml header, a la

   <?xml encoding="latin1"?> ...

On the whole, the xml type is definitely easier to use with database
encoding set to utf8.  I think you'll be paying for encoding conversion
every time we interact with libxml, for instance.

regards, tom lane

Re: xPath in a database with LATIN1 encoding

От
Holger Jakobs
Дата:


Am 24.04.21 um 20:20 schrieb Jorge Silva:
Thanks for the quick reply, Tom. I am trying something simpler. I am trying to find a way to run the xPath function with a xml file type, which has latin characters, such as:
SELECT xpath(‘//xml_test/text()’, convert_from(convert_to(‘<xml_test>çã</xml_test>','utf-8’),'utf-8')::xml)

This line returns the same error, as follows:
[Code: 0, SQL State: 2200M]  ERROR: could not parse XML document
  Detail: line 1: Input is not proper UTF-8, indicate encoding !
Bytes: 0xE7 0xE3 0x3C 0x2F
<xml_test>çã</xml_test>
          ^
This happens because the “text”, which is the output from convert_from() function is encoded with LATIN1, the database encoding set, and not UTF-8.  Is there a way that a text variable is not encoded as the database encoding set, but some other encoding set, such as UTF-8? From what I’ve searched for, it seems that it would be something similar to nvarchar that only exists in Microsoft SQL Server.



On 24 Apr 2021, at 13:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jorge Silva <jorge.silva93@gmail.com> writes:
The characters that it is not recognizing are both “ç” and “ã” because I think they are encoded differently in latin1 and utf-8. Is it possible to somehow use the xPath function with special characters in the XML and in a database which is not encoded with utf-8?

I don't have a lot of expertise in this area, but I think you need
an explicit encoding indicator in the xml header, a la

   <?xml encoding="latin1"?> ...

On the whole, the xml type is definitely easier to use with database
encoding set to utf8.  I think you'll be paying for encoding conversion
every time we interact with libxml, for instance.

regards, tom lane

The point is that the XML standard dictates that every document without an explicit encoding, must be encodied in UTF-8. So, whenever you want to use a different encoding, this has to be stated in the XML header, as indicated in Tom Lane's answer.

So the encoding of XML columns isn't actually the one of the database (although UTF-8 is best in any case), but simply is the XML standard. Your client encoding may vary, though, but this would limit the characters which can be transferred.

Regards,

Holger


-- 
Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012
Вложения

Re: xPath in a database with LATIN1 encoding

От
Peter Eisentraut
Дата:
On 25.04.21 10:44, Holger Jakobs wrote:
> The point is that the XML standard dictates that every document without 
> an explicit encoding, must be encodied in UTF-8. So, whenever you want 
> to use a different encoding, this has to be stated in the XML header, as 
> indicated in Tom Lane's answer.

If you are transmitting an XML datum from the client to the server in 
the text protocol, then the encoding declaration is ignored.  See this 
documentation:

https://www.postgresql.org/docs/devel/datatype-xml.html#id-1.5.7.21.7



Re: xPath in a database with LATIN1 encoding

От
Peter Eisentraut
Дата:
On 24.04.21 17:20, Jorge Silva wrote:
> We have a database, which is encoded with LATIN1, and we are adding some 
> tables in it to store xml files with the “xml” type. After creating the 
> table and populating it, we use the xPath function to retrieve some data 
> from the XML, but it seems to not work properly, because it returns:

As documented at 
<https://www.postgresql.org/docs/devel/datatype-xml.html#id-1.5.7.21.7>, 
xpath only works if the server encoding is UTF8.