Обсуждение: Initial ugly reverse-translator
Hi all I've chucked together a quick and very ugly script to read the .po files from the backend and produce a simple database to map translations back to the original strings and their source locations. It's a very dirty .po reader that doesn't try to parse the format properly, but it does the job. There's no search interface yet, this is just intended to get to the point where useful queries can be run on the data and the most effective queries can be figured out. Right now queries against errors without format-string substitutions work ok, if not great, with pg_tgrm based lookups, eg: test=# SELECT message_id, is_format, message, translation test-# FROM po_translation INNER JOIN po_message ON po_translation.message_id = po_message.id INNER JOIN test-# WHERE 'el valor de array debe comenzar con «{» o información de dimensión' % translation test-# ORDER BY similarity('el valor de array debe comenzar con «{» o información de dimensión', translation) desc; message_id | is_format | message | translation ------------+-----------+------------------------------------------------------------+--------------------------------------------------------------------- 4470 | f | array value must start with \"{\" or dimension information | el valor de array debe comenzar con «{» o información de dimensión" 4437 | f | argument must be empty or one-dimensional array | el argumento debe ser vacío o un array unidimensional" (2 rows) test=# SELECT DISTINCT srcfile, srcline FROM po_location WHERE message_id = 4437; srcfile | srcline -------------------------------------------------------------+--------- /a/pgsql/HEAD/pgtst/src/backend/utils/adt/array_userfuncs.c | 121 utils/adt/array_userfuncs.c | 99 utils/adt/array_userfuncs.c | 121 utils/adt/array_userfuncs.c | 124 (4 rows) It's also useful for format-string based messages, but more thought is needed on how best to handle them. A LIKE query using the format-string message as the pattern (after converting the pattern syntax to SQL style) would be (a) slow and (b) very sensitive to formatting and other variation. I haven't spent any time on that bit yet, but if anybody has any ideas I'd be glad to hear them. Anyway, the initial version of the script can be found at: http://www.postnewspapers.com.au/~craig/poread.py Consider running it in a new database as it's extremely poorly tested, written very quickly and dirtily, and contains DDL commands. The schema can be found inline in the script. The psycopg2 Python module is required, and the pg_tgrm contrib module must be loaded in the database you use the script with. Once I'm happy with the queries for translation lookups I'll bang together a quick web interface for the script and clean it up. At that point it might start being useful to people here. -- Craig Ringer
Craig Ringer <craig@postnewspapers.com.au> writes: > It's also useful for format-string based messages, but more thought is > needed on how best to handle them. A LIKE query using the format-string > message as the pattern (after converting the pattern syntax to SQL > style) would be (a) slow and (b) very sensitive to formatting and other > variation. I haven't spent any time on that bit yet, but if anybody has > any ideas I'd be glad to hear them. I don't really see the problem. I assume from your reference to pg_trgm that you're using trigram similarity as the prefilter for potential matches, so a slow final LIKE match shouldn't be an issue really. (And besides, speed doesn't seem like the be-all and end-all here.) AFAICS you just need to translate %-string format escapes to %, quote any other % or _, and away you go. One thing that might be worth doing is avoiding spacing sensitivity, since whitespace is frequently mangled in copy-and-paste. Perhaps strip all spaces from both strings before matching? regards, tom lane
Tom Lane wrote: > I don't really see the problem. I assume from your reference to pg_trgm > that you're using trigram similarity as the prefilter for potential > matches It turns out that's no good anyway, as it appears to ignore characters outside the ASCII range. Rather less than useful for searching a database of translated strings ;-) > so a slow final LIKE match shouldn't be an issue really. > (And besides, speed doesn't seem like the be-all and end-all here.) True. It's not so much the speed as the fragility when faced with small changes to formatting. In addition to whitespace, some clients mangle punctuation with features like automatic "curly"-quoting. > AFAICS you just need to translate %-string format escapes to %, quote > any other % or _, and away you go. > > One thing that might be worth doing is avoiding spacing sensitivity, > since whitespace is frequently mangled in copy-and-paste. Perhaps > strip all spaces from both strings before matching? Yep, that sounds pretty reasonable. As usual I'm making things more complicated than they need to be. I suspect it'll be necessary to strip quotes and some other punctuation too, but that's not a big deal. -- Craig Ringer
Craig Ringer <craig@postnewspapers.com.au> writes: > Tom Lane wrote: >> I don't really see the problem. I assume from your reference to pg_trgm >> that you're using trigram similarity as the prefilter for potential >> matches > It turns out that's no good anyway, as it appears to ignore characters > outside the ASCII range. Rather less than useful for searching a > database of translated strings ;-) A quick look at the pg_trgm code suggests that it is only prepared to deal with single-byte encodings; if you're working in UTF8, which I suppose you'd have to be, it's dead in the water :-(. Perhaps fixing that should be on the TODO list. But in any case maybe the full-text-search stuff would be more useful as a prefilter? Although honestly, for the speed we need here, I'm not sure a prefilter is needed at all. Full text might be useful if a LIKE-based match fails, though. >> (And besides, speed doesn't seem like the be-all and end-all here.) > True. It's not so much the speed as the fragility when faced with small > changes to formatting. In addition to whitespace, some clients mangle > punctuation with features like automatic "curly"-quoting. Yeah. I was wondering whether encoding differences wouldn't be a huge problem in practice, as well. regards, tom lane
On Sat, 19 Apr 2008, Tom Lane wrote: > Craig Ringer <craig@postnewspapers.com.au> writes: >> Tom Lane wrote: >>> I don't really see the problem. I assume from your reference to pg_trgm >>> that you're using trigram similarity as the prefilter for potential >>> matches > >> It turns out that's no good anyway, as it appears to ignore characters >> outside the ASCII range. Rather less than useful for searching a >> database of translated strings ;-) > > A quick look at the pg_trgm code suggests that it is only prepared to > deal with single-byte encodings; if you're working in UTF8, which I > suppose you'd have to be, it's dead in the water :-(. Perhaps fixing > that should be on the TODO list. as well as ltree. they are in our todo list: http://www.sai.msu.su/~megera/wiki/TODO > > But in any case maybe the full-text-search stuff would be more useful > as a prefilter? Although honestly, for the speed we need here, I'm > not sure a prefilter is needed at all. Full text might be useful > if a LIKE-based match fails, though. > >>> (And besides, speed doesn't seem like the be-all and end-all here.) > >> True. It's not so much the speed as the fragility when faced with small >> changes to formatting. In addition to whitespace, some clients mangle >> punctuation with features like automatic "curly"-quoting. > > Yeah. I was wondering whether encoding differences wouldn't be a huge > problem in practice, as well. > > regards, tom lane > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Tom Lane wrote: >> True. It's not so much the speed as the fragility when faced with small >> changes to formatting. In addition to whitespace, some clients mangle >> punctuation with features like automatic "curly"-quoting. > > Yeah. I was wondering whether encoding differences wouldn't be a huge > problem in practice, as well. I'm not *too* worried about text encoding issues. In general it's very obvious when text has been mangled due to bad encoding handling, and it's extremely rare to see anything subtle like an app that transforms accented chars to their base variants. Demangling strings damaged by bad encoding handling is way out of scope, and sometimes not possible anyway. I guess that UTF-8's delightful support for various composed and decomposed forms of same glyph might be a problem. It's something I may face in some other works I'm doing too, so I might have to see how hard it'd be to put together a DB function that normalizes a UTF-8 string to its fully composed variant. I don't think the decomposed forms see much use in the wild though; they mostly come up as a security issue for path/URL matching and the like. http://unicode.org/reports/tr15/ http://msdn2.microsoft.com/en-us/library/ms776393(VS.85).aspx http://earthlingsoft.net/ssp/blog/2006/07/unicode_normalisation I don't know much about the CJK text representations, though, either in Unicode or in other encodings like Big5 . I *hope* the Unicode normalization rules will be enough there but I'm not sure. All strings must be converted from their original encoding to utf-8 for queries of course. That might be troublesome when using something like a web form where it might be hard to know the encoding of the input text (and where browser bugs are the rule rather than the exception) but it's thankfully not necessary to cater to every weird and broken browser. So in this case I don't think encodings will be *too* much trouble unless alternate unicode normalization forms turn out to be more common than I think they are. -- Craig Ringer
On Sat, Apr 19, 2008 at 6:10 PM, Oleg Bartunov <oleg@sai.msu.su> wrote: > On Sat, 19 Apr 2008, Tom Lane wrote: > >> Craig Ringer <craig@postnewspapers.com.au> writes: >>> >>> Tom Lane wrote: >>>> >>>> I don't really see the problem. I assume from your reference to pg_trgm >>>> that you're using trigram similarity as the prefilter for potential >>>> matches >> >>> It turns out that's no good anyway, as it appears to ignore characters >>> outside the ASCII range. Rather less than useful for searching a >>> database of translated strings ;-) >> >> A quick look at the pg_trgm code suggests that it is only prepared to >> deal with single-byte encodings; if you're working in UTF8, which I >> suppose you'd have to be, it's dead in the water :-(. Perhaps fixing >> that should be on the TODO list. > > as well as ltree. they are in our todo list: > http://www.sai.msu.su/~megera/wiki/TODO > Hi Oleg In your TODO list says that UTF8 was added to ltree, is this code currently available for download? Regards, José >> >> But in any case maybe the full-text-search stuff would be more useful >> as a prefilter? Although honestly, for the speed we need here, I'm >> not sure a prefilter is needed at all. Full text might be useful >> if a LIKE-based match fails, though. >> >>>> (And besides, speed doesn't seem like the be-all and end-all here.) >> >>> True. It's not so much the speed as the fragility when faced with small >>> changes to formatting. In addition to whitespace, some clients mangle >>> punctuation with features like automatic "curly"-quoting. >> >> Yeah. I was wondering whether encoding differences wouldn't be a huge >> problem in practice, as well. >> >> regards, tom lane >> >> > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > Sternberg Astronomical Institute, Moscow University, Russia > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(495)939-16-83, +007(495)939-23-83 > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general >
Hi, ltree and pg_trgm with UTF8 support are available from CVS HEAD, see See http://archives.postgresql.org/pgsql-committers/2008-06/msg00356.php http://archives.postgresql.org/pgsql-committers/2008-11/msg00139.php Oleg On Fri, 16 Jan 2009, pepone.onrez wrote: > On Sat, Apr 19, 2008 at 6:10 PM, Oleg Bartunov <oleg@sai.msu.su> wrote: >> On Sat, 19 Apr 2008, Tom Lane wrote: >> >>> Craig Ringer <craig@postnewspapers.com.au> writes: >>>> >>>> Tom Lane wrote: >>>>> >>>>> I don't really see the problem. I assume from your reference to pg_trgm >>>>> that you're using trigram similarity as the prefilter for potential >>>>> matches >>> >>>> It turns out that's no good anyway, as it appears to ignore characters >>>> outside the ASCII range. Rather less than useful for searching a >>>> database of translated strings ;-) >>> >>> A quick look at the pg_trgm code suggests that it is only prepared to >>> deal with single-byte encodings; if you're working in UTF8, which I >>> suppose you'd have to be, it's dead in the water :-(. Perhaps fixing >>> that should be on the TODO list. >> >> as well as ltree. they are in our todo list: >> http://www.sai.msu.su/~megera/wiki/TODO >> > > Hi Oleg > > In your TODO list says that UTF8 was added to ltree, is this code > currently available for download? > > Regards, > JosЪЪ >>> >>> But in any case maybe the full-text-search stuff would be more useful >>> as a prefilter? Although honestly, for the speed we need here, I'm >>> not sure a prefilter is needed at all. Full text might be useful >>> if a LIKE-based match fails, though. >>> >>>>> (And besides, speed doesn't seem like the be-all and end-all here.) >>> >>>> True. It's not so much the speed as the fragility when faced with small >>>> changes to formatting. In addition to whitespace, some clients mangle >>>> punctuation with features like automatic "curly"-quoting. >>> >>> Yeah. I was wondering whether encoding differences wouldn't be a huge >>> problem in practice, as well. >>> >>> regards, tom lane >>> >>> >> >> Regards, >> Oleg >> _____________________________________________________________ >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), >> Sternberg Astronomical Institute, Moscow University, Russia >> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ >> phone: +007(495)939-16-83, +007(495)939-23-83 >> >> -- >> Sent via pgsql-general mailing list (pgsql-general@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-general >> > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83