Обсуждение: [GENERAL] Postgres, apps, special characters and UTF-8 encoding

Поиск
Список
Период
Сортировка

[GENERAL] Postgres, apps, special characters and UTF-8 encoding

От
Ken Tanzer
Дата:
Hi.  I've got a recurring problem with character encoding for a Postgres-based web PHP app, and am hoping someone can clue me in or at least point me in the right direction.  I'll confess upfront my understanding of encoding issues is extremely limited.  Here goes.

The app uses a Postgres database, UTF-8 encoded.  Through their browsers, users can add and edit records often including text.  Most of the time this works fine.  Though sometimes this will fail with Postgres complaining, for example, "Could query with ... , The error text was: ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x20 0x67"

So this generally happens when people copy and paste things out of their word documents and such.

As I understand it, those are likely encoded in something non-UTF-8, like WIN-1251 or something.  And that one way or another, the encoding needs to be translated before it can be placed into the database.  I'm not clear how this is supposed to happen though.  Automatically by the browser?  Done in the app?  Some other way?  And if in the app, how is one supposed to know what the incoming encoding is?

Thanks in advance for any help or pointers.

Ken


--
AGENCY Software  
A Free Software data system
By and for non-profits
(253) 245-3801

learn more about AGENCY or
follow the discussion.

Re: [GENERAL] Postgres, apps, special characters and UTF-8 encoding

От
"David G. Johnston"
Дата:
On Tue, Mar 7, 2017 at 4:20 PM, Ken Tanzer <ken.tanzer@gmail.com> wrote:
As I understand it, those are likely encoded in something non-UTF-8, like WIN-1251 or something.  And that one way or another, the encoding needs to be translated before it can be placed into the database.  I'm not clear how this is supposed to happen though.  Automatically by the browser?  Done in the app?  Some other way?  And if in the app, how is one supposed to know what the incoming encoding is?

​Haven't run into this problem personally, probably accidentally lucky, but ISTM that you need to tell the browser what character set you are working in.  It is in the best position to mediate between the user and the server.

Ideally, on the server, you can examine HTTP headers to learn about the incoming data charset/encoding (I may not be using these terms precisely but you should get the idea).

Googling "html input field encoding" seems to provide a decent start.

Note that technically the data encoding issues can occur without HTML, its really an HTTP layer thing, but the medium of use you care about is HTTP/Browsers.

David J.

Re: [GENERAL] Postgres, apps, special characters and UTF-8 encoding

От
Adrian Klaver
Дата:
On 03/07/2017 03:20 PM, Ken Tanzer wrote:
> Hi.  I've got a recurring problem with character encoding for a
> Postgres-based web PHP app, and am hoping someone can clue me in or at
> least point me in the right direction.  I'll confess upfront my
> understanding of encoding issues is extremely limited.  Here goes.
>
> The app uses a Postgres database, UTF-8 encoded.  Through their
> browsers, users can add and edit records often including text.  Most of
> the time this works fine.  Though sometimes this will fail with Postgres
> complaining, for example, "Could query with ... , The error text was:
> ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x20 0x67"
>
> So this generally happens when people copy and paste things out of their
> word documents and such.
>
> As I understand it, those are likely encoded in something non-UTF-8,
> like WIN-1251 or something.  And that one way or another, the encoding
> needs to be translated before it can be placed into the database.  I'm
> not clear how this is supposed to happen though.  Automatically by the
> browser?  Done in the app?  Some other way?  And if in the app, how is
> one supposed to know what the incoming encoding is?

I don't use PHP, but found this:

http://www.php.net/manual/en/function.mb-detect-encoding.php

and this:

http://php.net/manual/en/function.mb-convert-encoding.php


>
> Thanks in advance for any help or pointers.
>
> Ken
>
>
> --
> AGENCY Software
> A Free Software data system
> By and for non-profits
> /http://agency-software.org//
> /https://agency-software.org/demo/client/
> ken.tanzer@agency-software.org <mailto:ken.tanzer@agency-software.org>
> (253) 245-3801
>
> Subscribe to the mailing list
> <mailto:agency-general-request@lists.sourceforge.net?body=subscribe> to
> learn more about AGENCY or
> follow the discussion.


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [GENERAL] Postgres, apps, special characters and UTF-8 encoding

От
rob stone
Дата:
Hi Ken,

On Tue, 2017-03-07 at 15:20 -0800, Ken Tanzer wrote:
> Hi.  I've got a recurring problem with character encoding for a
> Postgres-based web PHP app, and am hoping someone can clue me in or
> at least point me in the right direction.  I'll confess upfront my
> understanding of encoding issues is extremely limited.  Here goes.
>
> The app uses a Postgres database, UTF-8 encoded.  Through their
> browsers, users can add and edit records often including text.  Most
> of the time this works fine.  Though sometimes this will fail with
> Postgres complaining, for example, "Could query with ... , The error
> text was: ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x20
> 0x67"
>
> So this generally happens when people copy and paste things out of
> their word documents and such.
>
> As I understand it, those are likely encoded in something non-UTF-8,
> like WIN-1251 or something.  And that one way or another, the
> encoding needs to be translated before it can be placed into the
> database.  I'm not clear how this is supposed to happen though. 
> Automatically by the browser?  Done in the app?  Some other way?  And
> if in the app, how is one supposed to know what the incoming encoding
> is?
>
> Thanks in advance for any help or pointers.
>
> Ken
>
>
>

1) Make sure the text editor you use to create your pages, etc. uses
UTF-8 as its encoding. That way the file's BOM is set correctly.
2) Make sure your headers contain the following:-

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

or

<meta charset="UTF-8"/> which is HTML5, however the long version is
still recognised by HTML5.

I understand that some versions of IE have issues with correctly
determining the character set, so, unfortunately, you still have to
verify that user input is UTF-8 compatible.

HTH,
Rob


Re: [GENERAL] Postgres, apps, special characters and UTF-8 encoding

От
vinny
Дата:
On 2017-03-08 00:20, Ken Tanzer wrote:
> Hi.  I've got a recurring problem with character encoding for a
> Postgres-based web PHP app, and am hoping someone can clue me in or at
> least point me in the right direction.  I'll confess upfront my
> understanding of encoding issues is extremely limited.  Here goes.
>

>   And that one way or another, the encoding
> needs to be translated before it can be placed into the database.

> Ken
>
> --
>

You don't really have to translate the encoding, because all parts of
the system
are capable of dealing with all encodings.

What you have to make sure that that they are indeed all working in the
same encoding.
You have to set the encoding of the HTML document, the database, and the
database connection
to the same encoding, like utf8. People tend to forget the "set names"
on the database connection,
which can make the database think you are sending latin1, but you are
really sending utf-8, and presto problemo.

Then the only problem left is that PHP doesn't do utf-8 very well
internally
  so if you receive data from an UTF-8 page and want to substring etc
then you have to use the multibyte variants
of those functions. You could convert everything back to latin1 first,
but then
you might as well just do everything in latin1 in the first place.


Re: [GENERAL] Postgres, apps, special characters and UTF-8 encoding

От
Albe Laurenz
Дата:
Ken Tanzer wrote:
> Hi.  I've got a recurring problem with character encoding for a Postgres-based web PHP app, and am
> hoping someone can clue me in or at least point me in the right direction.  I'll confess upfront my
> understanding of encoding issues is extremely limited.  Here goes.
> 
> The app uses a Postgres database, UTF-8 encoded.  Through their browsers, users can add and edit
> records often including text.  Most of the time this works fine.  Though sometimes this will fail with
> Postgres complaining, for example, "Could query with ... , The error text was: ERROR: invalid byte
> sequence for encoding "UTF8": 0xe9 0x20 0x67"
> 
> So this generally happens when people copy and paste things out of their word documents and such.
> 
> As I understand it, those are likely encoded in something non-UTF-8, like WIN-1251 or something.  And
> that one way or another, the encoding needs to be translated before it can be placed into the
> database.  I'm not clear how this is supposed to happen though.  Automatically by the browser?  Done
> in the app?  Some other way?  And if in the app, how is one supposed to know what the incoming
> encoding is?
> 
> Thanks in advance for any help or pointers.

The byte sequence 0xe9 0x20 0x67 means "é g" in ISO-8859-1 and WINDOWS-1252,
so I think that your setup is as follows:

- The PHP application gets data encoded in ISO-8859-1 or WINDOWS-1252
  and tries to store it in a database.
- The PHP application has a database connection with client_encoding
  set to UTF8.

Then the database thinks it gets UTF-8 and will choke if it gets something
different.

The solution:

- Make sure that your web application gets data in only one encoding.
- Set client_encoding to that encoding.

Yours,
Laurenz Albe