Обсуждение: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?

Поиск
Список
Период
Сортировка

invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?

От
Rural Hunter
Дата:
My db is in utf-8, I have a row in my table say tmp_article and I wanted
to generate ts_vector from the article content:
select to_tsvector(content) from tmp_article;
But I got this error:
ERROR:  invalid byte sequence for encoding "UTF8": 0xf481

I am wondering how this could happen. I think if there was invalid UTF8
bytes in the content, it shouldn't have been able to inserted into the
tmp_article table as I sometimes see similar errors when inserting
records to tmp_article. Am I right?

Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?

От
raghu ram
Дата:


2012/4/14 Rural Hunter <ruralhunter@gmail.com>
My db is in utf-8, I have a row in my table say tmp_article and I wanted to generate ts_vector from the article content:
select to_tsvector(content) from tmp_article;
But I got this error:
ERROR:  invalid byte sequence for encoding "UTF8": 0xf481

I am wondering how this could happen. I think if there was invalid UTF8 bytes in the content, it shouldn't have been able to inserted into the tmp_article table as I sometimes see similar errors when inserting records to tmp_article. Am I right?


This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

Try to set client_encoding='LATIN1' 

and then execute 

select to_tsvector(content) from tmp_article;

--

Thanks & Regards,

Raghu Ram

EnterpriseDB: http://www.enterprisedb.com

Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?

От
Rural Hunter
Дата:
doesn't work either.<br /><br /> db=# show client_encoding;<br />  client_encoding <br /> -----------------<br />
 UTF8<br/> (1 row)<br /><br /> db=# set client_encoding='LATIN1';<br /> SET<br /> db=# show client_encoding;<br />
 client_encoding<br /> -----------------<br />  LATIN1<br /> (1 row)<br /><br /> db=# select to_tsvector(content) from
tmp_article;<br/> ERROR:  invalid byte sequence for encoding "UTF8": 0xf481<br /><br /> 于 2012/4/14 10:15, raghu ram
写道:<blockquote cite="mid:CALnrrJTPzsNvaFP_qHisMS=wwp2bkvK=gk_wPDzC-fdaXEkypA@mail.gmail.com" type="cite"><br /><br
/><divclass="gmail_quote">2012/4/14 Rural Hunter <span dir="ltr"><<a href="mailto:ruralhunter@gmail.com"
moz-do-not-send="true">ruralhunter@gmail.com</a>></span><br/><blockquote class="gmail_quote" style="margin:0 0 0
   .8ex;border-left:1px #ccc solid;padding-left:1ex"> My db is in utf-8, I have a row in my table say tmp_article and I
wantedto generate ts_vector from the article content:<br /> select to_tsvector(content) from tmp_article;<br /> But I
gotthis error:<br /> ERROR:  invalid byte sequence for encoding "UTF8": 0xf481<br /><br /> I am wondering how this
couldhappen. I think if there was invalid UTF8 bytes in the content, it shouldn't have been able to inserted into the
tmp_articletable as I sometimes see similar errors when inserting records to tmp_article. Am I right?<span
class="HOEnZb"><fontcolor="#888888"><br /><br /></font></span></blockquote><div><br /></div><div><p class="p1">This
errorcan also happen if the <span class="s1">byte</span> <span class="s1">sequence</span> does not match the <span
class="s1">encoding</span>expected by the server, which is controlled by "client_encoding".</div><div>Try to set
client_encoding='LATIN1' </div><div><br/></div><div>and then execute </div></div><div><br /></div> select
to_tsvector(content)from tmp_article;<br clear="all" /><div><br /></div> -- <br /><p>Thanks & Regards,<p>Raghu
Ram<p>EnterpriseDB:<a href="http://www.enterprisedb.com/" moz-do-not-send="true"
target="_blank"><span>http://www.enterprisedb.com</span></a><br/></blockquote><br /> 

Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?

От
raghu ram
Дата:


On Sat, Apr 14, 2012 at 9:31 AM, Rural Hunter <ruralhunter@gmail.com> wrote:
doesn't work either.

db=# show client_encoding;
 client_encoding
-----------------
 UTF8
(1 row)

db=# set client_encoding='LATIN1';
SET
db=# show client_encoding;
 client_encoding
-----------------
 LATIN1
(1 row)

db=# select to_tsvector(content) from tmp_article;

ERROR:  invalid byte sequence for encoding "UTF8": 0xf481

Try to set client_encoding='SQL_ASCII' 

and then execute 

select to_tsvector(content) from tmp_article;

--

Thanks & Regards,

Raghu Ram

EnterpriseDB: http://www.enterprisedb.com

Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?

От
Rural Hunter
Дата:
still doesn't work.<br /> db=# set client_encoding='SQL_ASCII' ;<br /> SET<br /> db=# select to_tsvector(content) from
tmp_article;<br/> ERROR:  invalid byte sequence for encoding "UTF8": 0xf481<br /><br /><br /> 于 2012/4/15 1:38, raghu
ram写道: <blockquote cite="mid:CALnrrJSCuGpJTRx_Vg8efcs+o-t0=PABpcOz5AJ+jkZ=jTNq+g@mail.gmail.com" type="cite"><br /><br
/><divclass="gmail_quote">On Sat, Apr 14, 2012 at 9:31 AM, Rural Hunter <span dir="ltr"><<a
href="mailto:ruralhunter@gmail.com"moz-do-not-send="true">ruralhunter@gmail.com</a>></span> wrote:<br /><blockquote
class="gmail_quote"style="margin:0 0 0         .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF"
text="#000000">doesn't work either.<br /><br /> db=# show client_encoding;<br />  client_encoding <br />
-----------------<br/>  UTF8<br /> (1 row)<br /><br /> db=# set client_encoding='LATIN1';<br /> SET<br /> db=# show
client_encoding;<br/>  client_encoding <br /> -----------------<br />  LATIN1<br /> (1 row)<br /><br /> db=# select
to_tsvector(content)from tmp_article; <div class="im"><br /> ERROR:  invalid byte sequence for encoding "UTF8":
0xf481<br/></div></div></blockquote><div><br /></div><div><blockquote type="cite"><div class="gmail_quote"><div>Try to
setclient_encoding='SQL_ASCII' </div><div><br /></div><div>and then execute </div></div><div><br /></div> select
to_tsvector(content)from tmp_article;<br clear="all" /></blockquote></div><div><br /></div></div> -- <br /><p>Thanks
&Regards,<p>Raghu Ram<p>EnterpriseDB: <a href="http://www.enterprisedb.com/" moz-do-not-send="true"
target="_blank"><span>http://www.enterprisedb.com</span></a><br/></blockquote><br /> 

Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?

От
"Albe Laurenz"
Дата:
Rural Hunter wrote:
> My db is in utf-8, I have a row in my table say tmp_article and I
wanted
> to generate ts_vector from the article content:
> select to_tsvector(content) from tmp_article;
> But I got this error:
> ERROR:  invalid byte sequence for encoding "UTF8": 0xf481
>
> I am wondering how this could happen. I think if there was invalid
UTF8
> bytes in the content, it shouldn't have been able to inserted into the
> tmp_article table as I sometimes see similar errors when inserting
> records to tmp_article. Am I right?

You are right in theory.  A lot depends on your PostgreSQL version,
because
the efforts to prevent invalid strings from entering the database have
led to changes over the versions.  Older versions are more permissive.

To test the theory that the contents of the table are bad, you can
test if the same happens if you

SELECT convert_to(content, 'UTF8') FROM tmp_article;

Yours,
Laurenz Albe

Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?

От
Rural Hunter
Дата:
=>SELECT convert_to(content, 'UTF8') FROM tmp_article;
This works. My pg is at latest 9.1.3 on ubuntu 10.04 server. We have
millions of data in the db but this is the only one we met the problem.
The bad data is inserted in recent days and we upgraded to 9.1.3 right
after it was released.

于 2012/4/16 16:31, Albe Laurenz 写道:
> Rural Hunter wrote:
>> My db is in utf-8, I have a row in my table say tmp_article and I
> wanted
>> to generate ts_vector from the article content:
>> select to_tsvector(content) from tmp_article;
>> But I got this error:
>> ERROR:  invalid byte sequence for encoding "UTF8": 0xf481
>>
>> I am wondering how this could happen. I think if there was invalid
> UTF8
>> bytes in the content, it shouldn't have been able to inserted into the
>> tmp_article table as I sometimes see similar errors when inserting
>> records to tmp_article. Am I right?
> You are right in theory.  A lot depends on your PostgreSQL version,
> because
> the efforts to prevent invalid strings from entering the database have
> led to changes over the versions.  Older versions are more permissive.
>
> To test the theory that the contents of the table are bad, you can
> test if the same happens if you
>
> SELECT convert_to(content, 'UTF8') FROM tmp_article;
>
> Yours,
> Laurenz Albe
>


Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?

От
"Albe Laurenz"
Дата:
Please don't top post.

Rural Hunter wrote:
>>> My db is in utf-8, I have a row in my table say tmp_article and I wanted
>>> to generate ts_vector from the article content:
>>> select to_tsvector(content) from tmp_article;
>>> But I got this error:
>>> ERROR:  invalid byte sequence for encoding "UTF8": 0xf481
>>>
>>> I am wondering how this could happen. I think if there was invalid UTF8
>>> bytes in the content, it shouldn't have been able to inserted into the
>>> tmp_article table as I sometimes see similar errors when inserting
>>> records to tmp_article. Am I right?

>> You are right in theory.  A lot depends on your PostgreSQL version,
>> because
>> the efforts to prevent invalid strings from entering the database have
>> led to changes over the versions.  Older versions are more permissive.
>>
>> To test the theory that the contents of the table are bad, you can
>> test if the same happens if you
>>
>> SELECT convert_to(content, 'UTF8') FROM tmp_article;

> =>SELECT convert_to(content, 'UTF8') FROM tmp_article;
> This works. My pg is at latest 9.1.3 on ubuntu 10.04 server. We have
> millions of data in the db but this is the only one we met the problem.
> The bad data is inserted in recent days and we upgraded to 9.1.3 right
> after it was released.

Hmm, that is strange.
Could you find the row that causes the problem and post the
result of "CAST (content AS bytea)" for this row?

Do you use any nonstandard text search parsers or dictionaries?
What is the text search configuration you use (parameter
default_text_search_config)?

Yours,
Laurenz Albe

Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?

От
Rural Hunter
Дата:
Hi Albe,
Yes, I'm using a Chinese tsearch2 plugin called nlpbamboo and the
default default_text_search_config is 'chinesecfg'. I tested "select
to_tsvector('english',content) from tmp_article" and this works. So it's
the problem of nlpbamboo? The result of 'CAST (content AS bytea)'  is
shown below. btw, what's the meaning of "Please don't top post."? I
don't understand but sorry if I caused any problem.


\xe6a0b8e5bf83e68f90e7a4baefbc9ae69caae69da5e4b889e5b9b4e5b086e698afe694bfe5ba9ce99c80e6b182e9878ae694beefbc8ce8a18ce4b89ae9ab98e9809fe68890e995bfe79a84e697b6e69c9fefbc8ce585ac

e58fb8e79baee5898de582a8e5a487e9a1b9e79baee4bc97e5a49aefbc8ce4b89ae7bba9e5ad98e59ca8e59091e4b88ae8b685e9a284e69c9fe79a84e58fafe883bdefbc8ce5bd93e5898de79a84e4bcb0e580bce6b0b4e5b

9b3e4b88de9ab98efbc8ce68c89e785a73230313220e5b9b43235efbd9e333020e5808de79a84502f45efbc8ce585ace58fb8e4bcb0e580bce58cbae997b4e59ca832322e3439efbd9e32362e393920e58583e58cbae997b4

efbc8ce8be83e79baee5898de882a1e4bbb7e7baa6e69c893425efbd9e323525e79a84e4b88ae58d87e7a9bae997b4efbc8ce7bbb4e68c81e2809ce68ea8e88d90e2809de79a84e68a95e8b584e8af84e7baa7e380820ae8b

083e7a094e58685e5aeb90ae8bf91e697a5efbc8ce58f82e58aa0e4ba86e585ace58fb8e7bb84e7bb87e79a84e68a95e8b584e88085e68ea5e5be85e6b4bbe58aa8efbc8ce4b88ee585ace58fb8e9ab98e7aea1e8bf9be8a1

8ce4ba86e4baa4e6b581e380820ae4b8bbe8a681e8a782e782b90ae2809ce58d81e4ba8cc2b7e4ba94e2809de69c9fe997b4e694bfe5ba9ce99c80e6b182e6988ee698bee5a29ee58aa0efbc8ce585ace58fb8e58aa0e9809

fe8b791e9a9ace59c88e59cb0e380820ae7bb8fe8bf87e58c97e4baace38081e4b88ae6b5b7e38081e69dade5b79ee38081e5ae81e6b3a2e7ad89e58588e8bf9be59f8ee5b882e794b5e5ad90e694bfe58aa1e5bbbae8aebe

e79a84e7a4bae88c83e4bd9ce794a8efbc8ce59084e59cb0e694bfe5ba9ce68488e58f91e58585e58886e8aea4e8af86e588b0e695b0e5ad97e58c96e59f8ee5b882e7aea1e79086e7b3bbe7bb9fe5afb9e694bfe5ba9ce59

f8ee5b882e7aea1e79086e6b0b4e5b9b3e68f90e58d87e68980e8b5b7e588b0e79a84e9878de5a4a7e4bd9ce794a8efbc8ce8808ce99a8fe79d80e68890e58a9fe6a188e4be8be79a84e4b88de696ade7a7afe7b4afefbc8c

e694bfe5ba9ce887aae8baabe5928ce585ace58fb8e5afb9e99c80e6b182e79a84e79086e8a7a3e4b99fe983bde69bb4e58aa0e6b7b1e585a5e38082e7bb8fe8bf873230313020e5b9b4e5928c3230313120e5b9b4e79a84e

694bfe5ba9ce68da2e5b18ae4b88ee8a784e58892e588b6e5ae9aefbc8c32303132efbd9e3230313420e5b9b4e698afe59cb0e696b9e694bfe5ba9ce2809ce58d81e4ba8cc2b7e4ba94e2809de5bbbae8aebee585b7e4bd93

e5ae9ee696bde5b9b4efbc8ce79baee5898de59bbde58685e7baa63236303020e4b8aae58ebfe7baa7e68896e4bba5e4b88ae79a84e5b882efbc88e58ebfefbc89e4b8adefbc8ce5b7b2e7bb8fe5bbbae8aebee4ba86e695b

0e5ad97e58c96e59f8ee5b882e7aea1e79086e7b3bbe7bb9fe79a84e4bb8533303020e69da5e5aeb6efbc8ce794b1e4ba8ee59cb0e696b9e694bfe5ba9ce59ca8e5bbbae8aebee696b9e99da2e79a84e68a95e585a5e585b7

e69c89e78886e58f91e5bc8fe79a84e789b9e782b9efbc8ce88083e89991e588b0e585ace58fb83230313220e5b9b4e4b880e5ada3e5baa6e4b89ae7bba9e79a84e5a4a7e5b985e5a29ee995bfefbc8ce79bb8e4bfa1e4bb8

ee4bb8ae5b9b4e5bc80e5a78be79a843320e5b9b4efbc8ce5b086e698afe695b4e4b8aae8a18ce4b89ae5bfabe9809fe58f91e5b195e79a84e697b6e69c9fe38082e585ace58fb8e698afe8a18ce4b89ae4b8ade79a84e9be

99e5a4b4e4bc81e4b89aefbc8ce5b882e59cbae58da0e69c89e78e87e59ca8363025e4bba5e4b88ae38082e8808ce4b894e585ace58fb8e59ca8e59084e4bba3e8a1a8e680a7e59f8ee5b882e79a84e9a1b9e79baee585b7e

69c89e6a0b7e69dbfe69588e5ba94efbc8ce58fafe4bba5e4b8bae585ace58fb8e5b8a6e69da5e69bb4e5a49ae79a84e694bfe5ba9ce5aea2e688b7efbc8ce5b9b6e4b894e5ae9ee78eb0e4ba86e99c80e6b182e79086e8a7

a32de99c80e6b182e5ae9ee78eb02de99c80e6b182e5bc95e5afbce79a84e889afe680a7e5beaae78eafe38082e59ba0e6ada4efbc8ce8aea4e4b8bae59ca8e69c80e8bf91e79a84e4b880e4b8a4e5b9b4efbc8ce585ace58

fb8e5b086e4bba5e9ab98e4ba8ee8a18ce4b89ae6b0b4e5b9b3e79a84e5a29ee9809fe8bf85e78c9be58f91e5b195efbc8ce4b99fe58db3e8bf9be585a5e8b791e9a9ace59c88e59cb0efbc8ce5b7a9e59bbae5b882e59cba

e58da0e69c89e78e87e79a84e697b6e69c9fe38082e6ada4e5908eefbc8ce99a8fe79d80e694bfe5ba9ce5908ee7bbade68a95e585a5e58fafe883bde587bae78eb0e79a84e591a8e69c9fe680a7e6b3a2e58aa8efbc8ce58

5ace58fb8e9809ae8bf87e59084e9a1b9e4b89ae58aa1e7bb93e69e84e79a84e8b083e88a82e5ae9ee78eb0e5b9b3e7a8b3e8be83e5bfabe9809fe79a84e5a29ee995bfefbc8ce59ca8e8bf99e4b880e998b6e6aeb5efbc8c

e9a284e8aea1e5a29ee9809fe887b3e5b091e4b99fe59ca8323025efbd9e333025e58cbae997b4e38082e5a29ee58aa0e7b3bbe7bb9fe99b86e68890e4b89ae58aa1e9878fefbc8ce689a9e5a4a7e694b6e585a5e8a784e6a

8a1e5b9b6e58aa0e5bcbae5aea2e688b7e7b298e680a7e7b297e795a5e4bcb0e7ae97efbc8ce79bb4e8be96e5b882e38081e79c81e4bc9ae7baa7e59f8ee5b882e38081e59cb0e7baa7e5b882efbc88e4b88de590abe79c81

e4bc9ae59f8ee5b882efbc89e38081e5b882e8be96e58cbae38081e58ebfe7baa7e5b882efbc88e58ebfefbc89e695b0e5ad97e58c96e59f8ee5b882e7aea1e79086e7b3bbe7bb9fe79a84e5bbbae8aebee8b4b9e794a8e7b

aa6e4b8ba3530303020e4b887e58583e380813330303020e4b887e58583e380813235303020e4b887e58583e380813130303020e4b887e58583e5928c38303020e4b887e58583e38082e68c89e785a7e79baee5898de68891

e59bbde59f8ee5b882e695b0e9878fe8aea1e7ae97efbc8ce4b88de7ae97e5908ee69c9fe58d87e7baa7e7bbb4e68aa4e79a84e8b4b9e794a8efbc8ce695b4e4b8aae5b882e59cbae8a784e6a8a1e59ca832353020e4babfe

4bba5e4b88ae38082e9a284e8aea1e887b33230313520e5b9b4e5ba95efbc8ce695b0e5ad97e58c96e59f8ee5b882e7aea1e79086e7b3bbe7bb9fe5b086e887b3e5b091e8a686e79b96e585a8e983a8e59cb0e7baa7e4bba5

e4b88ae59f8ee5b882e38081333025e79a84e5b882e8be96e58cbae5928c313025e79a84e58ebfe7baa7e5b88228e58ebf29e38082e784b6e8808ce79bb8e5afb9e4ba8ee5a682e6ada4e5a4a7e79a84e5b882e59cbaefbc8

ce585ace58fb8e8bdafe4bbb6e4baa7e59381e79a84e4bbbde9a29de4b990e8a782e4bcb0e8aea1e4bb85e59ca8333025e79a84e6af94e4be8be38082e794b1e4ba8ee694bfe5ba9ce8aea2e58d95e69bb4e5a49ae580bee5

9091e4ba8ee4bba5e7a1ace4bbb6e98787e8b4ade4b8bae789b9e782b9e79a84e7b3bbe7bb9fe99b86e68890efbc8ce585ace58fb8e58f82e4b88ee7b3bbe7bb9fe99b86e68890e4b89ae58aa1e4b880e696b9e99da2e58fa

fe4bba5e9809ae8bf87e68f90e4be9be695b4e4bd93e8a7a3e586b3e696b9e6a188e68f90e58d87e7ab9ee4ba89e4bc98e58abfefbc8ce5a29ee58aa0e4b8ade6a087e6a682e78e87efbc8ce4bb8ee8808ce689a9e5a4a7e6

94b6e585a5e8a784e6a8a1efbc8ce58fa6e4b880e696b9e99da2e4b99fe58fafe4bba5e5a29ee58aa0e5aea2e688b7e7b298e680a7efbc8ce7a8b3e5ae9ae69caae69da5e79a84e5a29ee995bfe38082e59ba0e6ada4efbc8

ce9a284e8aea1e585ace58fb8e69caae69da5e4bc9ae58aa0e5bcbae7b3bbe7bb9fe99b86e68890e4b89ae58aa1efbc8ce69bb4e58aa0e58585e58886e79a84e4baabe58f97e5b882e59cbae79a84e5a29ee995bfe380820a

e695b0e68daee699aee69fa5e4b88ee4bfa1e681afe98787e99b86e58db3e5b086e78886e58f91efbc8ce69caae69da5e5b086e68890e9878de8a681e588a9e6b6a6e69da5e6ba90e4b98be4b880e38082e887aa323031302

0e5b9b4e5bc80e5a78befbc8ce585ace58fb8e5bc80e5a78be4b8bae694bfe5ba9ce79bb8e585b3e983a8e997a8e68f90e4be9be695b0e5ad97e58c96e59f8ee5b882e7aea1e79086e7b3bbe7bb9fe5bbbae8aebee5928ce8

bf90e890a5e4b8ade79a84e79bb8e585b3e695b0e68daee699aee69fa5e5928ce4bfa1e681afe98787e99b86e69c8de58aa1efbc8ce5b9b6e4b8bae6ada4e68a95e585a53430303020e4bd99e4b887e58583e7949fe4baa73

13020e58fb0e59fbae4ba8e4c6164796275673320e79a84e8bda6e8bdbde6bf80e58589e689abe68f8fe7b3bbe7bb9fe5928c3320e58fb0e4bebfe690bae5bc8fe6bf80e58589e689abe68f8fe7b3bbe7bb9fe38082323031

3120e5b9b4e5b7b2e5ae8ce688903220e58fb0e8bda6efbc8ce9a284e8aea13230313220e5b9b4e4b88ae58d8ae5b9b4e69c8934efbd9e3520e58fb0e8bda6e58fafe68a95e585a53320e7bbb4e5ae9ee699afe6b58be9878

fe79a84e5ba94e794a8e38082e5afb9e4ba8ee8bf99e4b880e696b0e79a84e5b882e59cbaefbc8ce6ada6e6b189e7ab8be5be97e7a9bae997b4e4bd9ce4b8bae58588e8a18ce88085efbc8ce5b882e59cbae58da0e69c89e7

8e87e8bebe363025e5b7a6e58fb3efbc8ce585ace58fb8e58899e4b88de588b0333025efbc8ce4bd86e585ace58fb8e794b1e4ba8ee58fafe68f90e4be9be59fbae4ba8ee695b0e5ad97e59f8ee5b882e7aea1e79086e7b3b

be7bb9fe79a84e4b880e4bd93e58c96e8a7a3e586b3e696b9e6a188efbc8ce4b894e59ca8e6b58be9878fe7b2bee5baa6e4b88ae79bb8e6af94e5afb9e6898be585b7e69c89e7ab9ee4ba89e4bc98e58abfefbc8ce9a284e8

aea1e8b685e8b68ae5afb9e6898be68890e4b8bae9be99e5a4b4e4bc81e4b89ae5b086e698afe5be88e5bfabe4bc9ae58f91e7949fe79a84e4ba8be68385e38082e9a284e8aea13230313220e5b9b4e8afa5e4b89ae58aa1e

8bf9be585a5e78886e58f91e69c9fefbc8ce4b894e69caae69da533efbd9e3520e5b9b4e58fafe4bf9de68c81353025e5b7a6e58fb3e79a84e5b9b4e59d87e5a48de59088e5a29ee995bfe380820ae4babae58a9be68890e6

9cace4b88ae58d87e58e8be58a9be69c89e68980e7bc93e8a7a3efbc8ce5a4a7e9a29de9a1b9e79baee6918ae4bd8ee7a094e58f91e68890e69cace38082e794b1e4ba8ee8bf91e4b8a4e5b9b4e9809ae88380e78e87e9ab9

8e4bc81efbc8ce4bba5e58f8ae7a7bbe58aa8e4ba92e88194e7bd91e7ad89e696b0e585b4495420e7bb86e58886e8a18ce4b89ae79a84e585b4e8b5b7efbc8ce5afb9e4bca0e7bb9fe8bdafe4bbb6e4bc81e4b89ae980a0e6

8890e4babae58a9be68890e69cace4b88ae6b6a8e79a84e5b7a8e5a4a7e58e8be58a9befbc8ce585ace58fb8e4babae58a9be68890e69cace4b99fe5a4a7e5b985e4b88ae58d87efbc8ce4bcb0e7ae97e585b6e5a29ee5b98

5e887b3e5b091e59ca8333025e4bba5e4b88ae38082e8808ce99a8fe79d80e5ae8fe8a782e7bb8fe6b58ee5bda2e58abfe79a84e58f98e58c96efbc8ce4bba5e58f8ae585ace58fb8e6ada6e6b189e7a094e58f91e4b8ade5

bf83e79a84e5bbbae8aebee5ae8ce68890efbc8ce9a284e8aea13230313220e5b9b4e4babae58a9be68890e69cace4b88ae58d87e79a84e58e8be58a9be5b086e5be97e588b0e8be83e5a4a7e7bc93e8a7a3e38082e8808ce

794b1e4ba8ee585ace58fb8e8bdafe4bbb6e7b3bbe7bb9fe79a84e4baa7e59381e58c96e7a88be5baa6e8be83e9ab98efbc8ce7a094e58f91e4b88ee5ae9ee696bde79a84e880a6e59088e5baa6e8be83e69dbeefbc8ce99a

8fe79d80e59088e5908ce9a29de79a84e4b88ae58d87efbc8ce8bf99e4b880e4bc98e58abfe5b086e68488e58f91e6988ee698bee79a84e6918ae4bd8ee7a094e58f91e68890e69cace38082e5b0bde7aea1e7b3bbe7bb9fe

99b86e68890e9a1b9e79baee5b086e69c89e68980e5a29ee58aa0efbc8ce4bd86e7bbbce59088e4b88ae8bfb0e4b8a4e696b9e99da2e59ba0e7b4a0efbc8ce4bf9de5ae88e4bcb0e8aea1efbc8ce585ace58fb8e6af9be588

a9e78e87e6b0b4e5b9b3e887b3e5b091e58fafe4bba5e7bb93e69d9fe4b8a4e5b9b4e69da5e79a84e4b88be9998de8b68be58abfe8808ce7a8b3e5ae9ae59ca8363025e4bba5e4b88ae380820af48186bff483bfb2f48393b

ff4828dbe203ff484998ff4848ebbf4839583f480aabcf481adaff4838686f4819997203ff4838d90f482b2b4f4848d8ff484b692f48196a1f480abbd0ae69cace69687e794b1e882a1e7a5a8e68a95e8b584e5b7a5e585b7

3ae9b9b0e79cbce4b8aae882a1e699bae883bde8b79fe8b8aae7b3bbe7bb9fe68f90e4be9befbc8ce6aca2e8bf8ee8bdace8bdbdefbc8ce8afb7e6b3a8e6988ee587bae5a484e380820ae79bb8e585b3e696b0e997bbefbc9
a0ae8b4a3e4bbbbe7bc96e8be91efbc9ae5bb96e6b3bde5878c

于 2012/4/16 21:34, Albe Laurenz 写道:
> Please don't top post.
>
> Rural Hunter wrote:
>>>> My db is in utf-8, I have a row in my table say tmp_article and I wanted
>>>> to generate ts_vector from the article content:
>>>> select to_tsvector(content) from tmp_article;
>>>> But I got this error:
>>>> ERROR:  invalid byte sequence for encoding "UTF8": 0xf481
>>>>
>>>> I am wondering how this could happen. I think if there was invalid UTF8
>>>> bytes in the content, it shouldn't have been able to inserted into the
>>>> tmp_article table as I sometimes see similar errors when inserting
>>>> records to tmp_article. Am I right?
>>> You are right in theory.  A lot depends on your PostgreSQL version,
>>> because
>>> the efforts to prevent invalid strings from entering the database have
>>> led to changes over the versions.  Older versions are more permissive.
>>>
>>> To test the theory that the contents of the table are bad, you can
>>> test if the same happens if you
>>>
>>> SELECT convert_to(content, 'UTF8') FROM tmp_article;
>> =>SELECT convert_to(content, 'UTF8') FROM tmp_article;
>> This works. My pg is at latest 9.1.3 on ubuntu 10.04 server. We have
>> millions of data in the db but this is the only one we met the problem.
>> The bad data is inserted in recent days and we upgraded to 9.1.3 right
>> after it was released.
> Hmm, that is strange.
> Could you find the row that causes the problem and post the
> result of "CAST (content AS bytea)" for this row?
>
> Do you use any nonstandard text search parsers or dictionaries?
> What is the text search configuration you use (parameter
> default_text_search_config)?
>
> Yours,
> Laurenz Albe


Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?

От
"Albe Laurenz"
Дата:
Rural Hunter wrote:
>>>>> My db is in utf-8, I have a row in my table say tmp_article and I wanted
>>>>> to generate ts_vector from the article content:
>>>>> select to_tsvector(content) from tmp_article;
>>>>> But I got this error:
>>>>> ERROR:  invalid byte sequence for encoding "UTF8": 0xf481

>> Do you use any nonstandard text search parsers or dictionaries?
>> What is the text search configuration you use (parameter
>> default_text_search_config)?

> Yes, I'm using a Chinese tsearch2 plugin called nlpbamboo and the
> default default_text_search_config is 'chinesecfg'. I tested "select
> to_tsvector('english',content) from tmp_article" and this works. So it's
> the problem of nlpbamboo?

That's probably the problem - it seems to emit something that is
not proper UTF-8 sometimes.

Do you get the error if you try Chinese settings without nlpbamboo?

>              btw, what's the meaning of "Please don't top post."? I
> don't understand but sorry if I caused any problem.

http://en.wikipedia.org/wiki/Posting_style

Yours,
Laurenz Albe

Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?

От
Rural Hunter
Дата:
于 2012/4/17 18:06, Albe Laurenz 写道:
> Rural Hunter wrote:
> That's probably the problem - it seems to emit something that is
> not proper UTF-8 sometimes.
>
> Do you get the error if you try Chinese settings without nlpbamboo?
How can I do this? The Chinese processing is provided by nlpbamboo.

Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?

От
"Albe Laurenz"
Дата:
Rural Hunter wrote:
>> Do you get the error if you try Chinese settings without nlpbamboo?

> How can I do this? The Chinese processing is provided by nlpbamboo.

Er, sorry, forget what I said.  I have no experience with Chinese.

I guess you should try to ask the nlpbamboo people.
Is there anything about encodings in the documentation?

Yours,
Laurenz Albe