Обсуждение: FInding "corrupt" values in UTF-8 tables (regexp question, I think)
I'm noticing that some of my data has been imported as junk text: For instance: klciã«" What would be the SQL to find data of this nature? My column can only have alphanumeric data, and the only symbols allowed are "-" and "_", so I tried this regexp query: select id, t_code from traders where t_code ~ '[^A-Za-z1-9\-]' limit 100; But this starts to return values such as "181xn-807199" which is valid as per the above regexp? Also, when I try to include the underscore, as follows... select id, t_code from traders where t_code ~ '[^A-Za-z1-9\-\_]' limit 100; This gives me an error: "ERROR: invalid regular expression: invalid character range". What am I missing? Does this have something to do with erroneous encodings? I want my data to be utf-8 but I do want to find it with latin1 queries when the text in columns is supposed to be only latin1 characters! Or is "a-z" in utf-8 considered different from "a-z" in latin1?
"Phoenix Kiula" <phoenix.kiula@gmail.com> writes: > > select id, t_code > from traders > where t_code ~ '[^A-Za-z1-9\-\_]' > limit 100; > > This gives me an error: "ERROR: invalid regular expression: invalid > character range". Put the dash at the start of the character class: [^-A-Za-z1-9_] > > What am I missing? In a character class expression the dash has an special meaning. If you need to match a dash it has to be the first character. Regards, Manuel.
On Aug 17, 2007, at 10:58 , Phoenix Kiula wrote: > What would be the SQL to find data of this nature? My column can only > have alphanumeric data, and the only symbols allowed are "-" and "_", > so I tried this regexp query: > > select id, t_code > from traders > where t_code ~ '[^A-Za-z1-9\-]' If you're including - in a range as a character, doesn't it have to go first? Try this: WHERE t_code ~ $re$[^-A-Za-z1-9_]$re$ Michael Glaesemann grzm seespotcode net
Re: FInding "corrupt" values in UTF-8 tables (regexp question, I think)
От
"Belinda M. Giardine"
Дата:
On Fri, 17 Aug 2007, Michael Glaesemann wrote: > > On Aug 17, 2007, at 10:58 , Phoenix Kiula wrote: > >> What would be the SQL to find data of this nature? My column can only >> have alphanumeric data, and the only symbols allowed are "-" and "_", >> so I tried this regexp query: >> >> select id, t_code >> from traders >> where t_code ~ '[^A-Za-z1-9\-]' > > If you're including - in a range as a character, doesn't it have to go first? > Try this: > > WHERE t_code ~ $re$[^-A-Za-z1-9_]$re$ > > Michael Glaesemann > grzm seespotcode net > How about WHERE t_code ~ $re$[^-A-Za-z0-9_]$re$ So that zeros are allowed? Belinda
[Please reply to the list so that others may benefit from and participate in the discussion.] On Aug 17, 2007, at 12:50 , Phoenix Kiula wrote: > On 18/08/07, Michael Glaesemann <grzm@seespotcode.net> wrote: > >> >> On Aug 17, 2007, at 10:58 , Phoenix Kiula wrote: >> >> >>> What would be the SQL to find data of this nature? My column can >>> only >>> have alphanumeric data, and the only symbols allowed are "-" and >>> "_", >>> so I tried this regexp query: >>> >>> select id, t_code >>> from traders >>> where t_code ~ '[^A-Za-z1-9\-]' >>> >> >> If you're including - in a range as a character, doesn't it have to >> go first? >> Try this: >> >> WHERE t_code ~ $re$[^-A-Za-z1-9_]$re$ >> >> > > > > Thanks, yes, this is sweet! > > If I include this into a check constraint on the table, would that be > very resource intensive for INSERTs and UPDATEs? > Maybe. I don't know. What's very? Measure, change, and measure again. Premature optimization and all that. :) First I make it correct. Michael Glaesemann grzm seespotcode net
On 18/08/07, Michael Glaesemann <grzm@seespotcode.net> wrote: > [Please reply to the list so that others may benefit from and > participate in the discussion.] > > >> If you're including - in a range as a character, doesn't it have to > >> go first? > >> Try this: > >> > >> WHERE t_code ~ $re$[^-A-Za-z1-9_]$re$ > >> > >> > > > > > > > > Thanks, yes, this is sweet! > > > > If I include this into a check constraint on the table, would that be > > very resource intensive for INSERTs and UPDATEs? > > > > Maybe. I don't know. What's very? Measure, change, and measure again. > Premature optimization and all that. :) First I make it correct. Yes, I meant to reply all. Thanks much for your help. I'll play around with the check constraint.