Обсуждение: FInding "corrupt" values in UTF-8 tables (regexp question, I think)

Поиск
Список
Период
Сортировка

FInding "corrupt" values in UTF-8 tables (regexp question, I think)

От
"Phoenix Kiula"
Дата:
I'm noticing that some of my data has been imported as junk text:

For instance:

    klciã«"

What would be the SQL to find data of this nature? My column can only
have alphanumeric data, and the only symbols allowed are "-" and "_",
so I tried this regexp query:

    select id, t_code
    from traders
    where t_code ~ '[^A-Za-z1-9\-]'
    limit 100;

But this starts to return values such as "181xn-807199" which is valid
as per the above regexp? Also, when I try to include the underscore,
as follows...

    select id, t_code
    from traders
    where t_code ~ '[^A-Za-z1-9\-\_]'
    limit 100;

This gives me an error: "ERROR:  invalid regular expression: invalid
character range".

What am I missing? Does this have something to do with erroneous
encodings? I want my data to be utf-8 but I do want to find it with
latin1 queries when the text in columns is supposed to be only latin1
characters! Or is "a-z" in utf-8 considered different from "a-z" in
latin1?

Re: FInding "corrupt" values in UTF-8 tables (regexp question, I think)

От
Manuel Sugawara
Дата:
"Phoenix Kiula" <phoenix.kiula@gmail.com> writes:

>
>     select id, t_code
>     from traders
>     where t_code ~ '[^A-Za-z1-9\-\_]'
>     limit 100;
>
> This gives me an error: "ERROR:  invalid regular expression: invalid
> character range".

Put the dash at the start of the character class: [^-A-Za-z1-9_]

>
> What am I missing?

In a character class expression the dash has an special meaning. If
you need to match a dash it has to be the first character.

Regards,
Manuel.

Re: FInding "corrupt" values in UTF-8 tables (regexp question, I think)

От
Michael Glaesemann
Дата:
On Aug 17, 2007, at 10:58 , Phoenix Kiula wrote:

> What would be the SQL to find data of this nature? My column can only
> have alphanumeric data, and the only symbols allowed are "-" and "_",
> so I tried this regexp query:
>
>     select id, t_code
>     from traders
>     where t_code ~ '[^A-Za-z1-9\-]'

If you're including - in a range as a character, doesn't it have to
go first?
Try this:

WHERE t_code ~ $re$[^-A-Za-z1-9_]$re$

Michael Glaesemann
grzm seespotcode net



Re: FInding "corrupt" values in UTF-8 tables (regexp question, I think)

От
"Belinda M. Giardine"
Дата:

On Fri, 17 Aug 2007, Michael Glaesemann wrote:

>
> On Aug 17, 2007, at 10:58 , Phoenix Kiula wrote:
>
>> What would be the SQL to find data of this nature? My column can only
>> have alphanumeric data, and the only symbols allowed are "-" and "_",
>> so I tried this regexp query:
>>
>>    select id, t_code
>>    from traders
>>    where t_code ~ '[^A-Za-z1-9\-]'
>
> If you're including - in a range as a character, doesn't it have to go first?
> Try this:
>
> WHERE t_code ~ $re$[^-A-Za-z1-9_]$re$
>
> Michael Glaesemann
> grzm seespotcode net
>

How about
WHERE t_code ~ $re$[^-A-Za-z0-9_]$re$

So that zeros are allowed?

Belinda


Re: FInding "corrupt" values in UTF-8 tables (regexp question, I think)

От
Michael Glaesemann
Дата:
[Please reply to the list so that others may benefit from and
participate in the discussion.]

On Aug 17, 2007, at 12:50 , Phoenix Kiula wrote:


> On 18/08/07, Michael Glaesemann <grzm@seespotcode.net> wrote:
>
>>
>> On Aug 17, 2007, at 10:58 , Phoenix Kiula wrote:
>>
>>
>>> What would be the SQL to find data of this nature? My column can
>>> only
>>> have alphanumeric data, and the only symbols allowed are "-" and
>>> "_",
>>> so I tried this regexp query:
>>>
>>>     select id, t_code
>>>     from traders
>>>     where t_code ~ '[^A-Za-z1-9\-]'
>>>
>>
>> If you're including - in a range as a character, doesn't it have to
>> go first?
>> Try this:
>>
>> WHERE t_code ~ $re$[^-A-Za-z1-9_]$re$
>>
>>
>
>
>
> Thanks, yes, this is sweet!
>
> If I include this into a check constraint on the table, would that be
> very resource intensive for INSERTs and UPDATEs?
>

Maybe. I don't know. What's very? Measure, change, and measure again.
Premature optimization and all that. :) First I make it correct.

Michael Glaesemann
grzm seespotcode net




Re: FInding "corrupt" values in UTF-8 tables (regexp question, I think)

От
"Phoenix Kiula"
Дата:
On 18/08/07, Michael Glaesemann <grzm@seespotcode.net> wrote:
> [Please reply to the list so that others may benefit from and
> participate in the discussion.]
>
> >> If you're including - in a range as a character, doesn't it have to
> >> go first?
> >> Try this:
> >>
> >> WHERE t_code ~ $re$[^-A-Za-z1-9_]$re$
> >>
> >>
> >
> >
> >
> > Thanks, yes, this is sweet!
> >
> > If I include this into a check constraint on the table, would that be
> > very resource intensive for INSERTs and UPDATEs?
> >
>
> Maybe. I don't know. What's very? Measure, change, and measure again.
> Premature optimization and all that. :) First I make it correct.



Yes, I meant to reply all. Thanks much for your help. I'll play around
with the check constraint.