"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Alvaro Herrera wrote:
>> Tom Lane wrote:
>>
>>> ISTM that perhaps a more generally useful definition would be
>>>
>>>     lword    Only ASCII letters
>>>     nlword   Entirely letters per iswalpha(), but not lword
>>>     word     Entirely alphanumeric per iswalnum(), but not nlword
>>>              (hence, includes at least one digit)
>> ...
>> I am not sure if there are any western european languages where words can
>> only be formed with non-ascii chars.
>
> At least Swedish has some: "ö" (island) and "å" (river). They're both a
> bit special because they're just one letter each.

For what it's worth, I did the same search last night and found three French
words, including "çà", which admittedly is likely to be a noise word. Other
dictionaries, such as Italian and Irish, also have one-letter words like this.
The only other language with multi-letter words is actually Faroese, with "íð"
and "óð".

> I like the "aword" name more than "lword", BTW. If we change the meaning
> of the classes, surely we can change the name as well, right?

I'm not very familiar with the use case here. Is there a good reason to
abbreviate these names? I think I would expect "ascii", "word", and "token"
for the three categories Tom describes.
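
To check my reading of the three categories, here is a rough sketch in Python.
It is only an illustration, not the parser's actual code: classify() is a
made-up name, and Python's str.isalpha() and str.isalnum() stand in for
iswalpha() and iswalnum(), which in C depend on the current locale.

```python
import string

# Assumption: "ASCII letters" means exactly a-z and A-Z.
ASCII_LETTERS = set(string.ascii_letters)


def classify(token: str) -> str:
    """Classify a token using the three classes from Tom's proposal.

    Hypothetical helper for illustration only. Checks run from the
    narrowest class to the widest, so each class excludes the previous one.
    """
    if not token:
        return "other"
    # lword: only ASCII letters
    if all(ch in ASCII_LETTERS for ch in token):
        return "lword"
    # nlword: entirely letters, but not lword
    if all(ch.isalpha() for ch in token):
        return "nlword"
    # word: entirely alphanumeric, but not nlword
    # (hence it contains at least one non-letter such as a digit)
    if all(ch.isalnum() for ch in token):
        return "word"
    return "other"


if __name__ == "__main__":
    for tok in ["island", "ö", "çà", "íð", "abc123", "a-b"]:
        print(tok, classify(tok))
```

With these rules the Swedish, French and Faroese examples above all land in
the second class, which is the case the proposed definition is meant to cover.
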
> Note that the default parser is useless for languages like Japanese,
> where words are not separated by whitespace, anyway.

I also wonder about languages like Arabic and Hindi, which do have words,
though I'm not sure they use whitespace as simply as Latin-script languages do.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com