Discussion: OCR
Hello,
This might seem a little off topic, but maybe someone here can help me.
I need to extract toponymic data from old digitized paper maps. I wish to explore optical character recognition (OCR).
Does anyone have a suggestion or experience with this kind of challenge?
Thank you,
André Mano
--
---------------------------------------------------------------
Associação Leonel Trindade
SOCIEDADE DE HISTÓRIA NATURAL
Apartado 25 2564-909 Torres Vedras Portugal
Sede e Biblioteca: rua Cavaleiros da Espora Dourada, 27A 2560 Torres Vedras
Laboratório de Paleontologia e Paleoecologia: Polígono Industrial do Alto do Ameal 2565-641 Ramalhal
http://alt-shn.blogspot.com
www.alt-shn.org
On 04/05/2011, at 17:18, ALT SHN wrote:
> Hello,
>
> This might seem a little off topic, but maybe someone here can help me.
>
> I need to extract toponymic data from old digitized paper maps. I wish to explore optical character recognition (OCR).
>
> Does anyone have a suggestion or experience with this kind of challenge?
>
> Thank you,
>
> André Mano

I'd argue about the "little". I've spent a lot of the last few weeks arguing with OCR software about tables of sample data from old reports, which become point data to plot. Perfectly relevant :)

I've never tried getting data from digitized maps, but I'll offer the following generalisations in case they help. Generally, I have OCR programmes pass me the results as plain text. I lose the formatting, but I don't have to fix stupid guesses about the formatting.

This is from my experience, so yours may differ.

1. OCR loves paragraphs.
2. Different OCR programmes handle column text differently. Some understand columns, some just assume L->R straight across both columns.
3. OCR does not get along with handwritten anything. (Unless the person was extra-extra neat and consistent in their writing, and even then it's a maybe.)
4. OCR on tabular data works best if the data is lined up in columns and doesn't have random big gaps.
5. OCR will almost certainly be confused if there is a line on your map running through or near a word.
5a. Actually, lines could confuse it quite a bit - I remember one that tried to recreate an in-line sketch map out of ASCII characters. Quite amusing.
6. You *will* need to check the results.

I'd love to hear what OCR makes of maps. Very curious.

-ramon.
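
Point 6 (checking the results) can be partly automated when a list of expected place names is available. The following is only a minimal sketch under that assumption: the `gazetteer` list, the sample readings, and the `cutoff` value are hypothetical, and the approach (fuzzy string matching with Python's standard `difflib`) is one possible technique, not something proposed in the thread.

```python
import difflib


def match_toponym(ocr_text, gazetteer, cutoff=0.75):
    """Return the closest known place name for an OCR reading, or None."""
    # Compare case-insensitively, but return the gazetteer's own spelling.
    lookup = {name.lower(): name for name in gazetteer}
    hits = difflib.get_close_matches(
        ocr_text.lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


# Hypothetical reference list and hypothetical OCR readings.
gazetteer = ["Torres Vedras", "Ramalhal", "Lisboa"]
readings = ["Torrcs Vedras", "Rama1hal", "xqzt"]

for reading in readings:
    # Readings with no sufficiently close match come back as None
    # and still need a human to look at them.
    print(reading, "->", match_toponym(reading, gazetteer))
```

Anything that comes back as `None` still needs manual checking, and a close match is a suggestion rather than proof that the OCR reading was that name.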
On Wed, May 4, 2011 at 6:04 AM, Ramon Andinach <custard@westnet.com.au> wrote:
> I'd argue about the "little". I've spent a lot of the last few weeks arguing with OCR software about tables of sample data from old reports, which become point data to plot. Perfectly relevant :)
>
> I've never tried getting data from digitized maps, but I'll offer the following generalisations in case they help. Generally, I have OCR programmes pass me the results as plain text. I lose the formatting, but I don't have to fix stupid guesses about the formatting.
>
> This is from my experience, so yours may differ.
>
> 1. OCR loves paragraphs.
> 2. Different OCR programmes handle column text differently. Some understand columns, some just assume L->R straight across both columns.
> 3. OCR does not get along with handwritten anything. (Unless the person was extra-extra neat and consistent in their writing, and even then it's a maybe.)
> 4. OCR on tabular data works best if the data is lined up in columns and doesn't have random big gaps.
> 5. OCR will almost certainly be confused if there is a line on your map running through or near a word.
> 5a. Actually, lines could confuse it quite a bit - I remember one that tried to recreate an in-line sketch map out of ASCII characters. Quite amusing.
> 6. You *will* need to check the results.
>
> I'd love to hear what OCR makes of maps. Very curious.
>
> -ramon.

Going along with the poster above, I think you may have problems if any of your text is skewed or 'wraps' in any way such that any character is not horizontal to the scan.

Recognizing 7 out of 10, then cleaning up, may be more time-consuming than data entry. You may be able to:

1) scan
2) drag a 'box' over the various data points and assign an x,y for the box
3) assign each box an identifier
4) then enter the data per box.

At least you'll have a loose representation of the data. One thing I'd google for is 'OCR skewed text', to see if there is an app that can handle that potential problem before going much further.

--
Mike Ellsworth
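
The box-by-box workflow above could be recorded with a very small data structure. This is a hedged sketch only: the `LabelBox` class, its field names, the use of scan pixel coordinates, and the CSV layout are all assumptions made for illustration, not anything specified in the thread.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class LabelBox:
    """One hand-drawn box around a place name on the scanned map."""
    box_id: str     # step 3: identifier assigned to the box
    x: float        # step 2: box position, assumed to be scan pixel coordinates
    y: float
    text: str = ""  # step 4: the name, typed in by hand


def boxes_to_csv(boxes):
    """Serialise the boxes as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["box_id", "x", "y", "text"])
    for box in boxes:
        writer.writerow([box.box_id, box.x, box.y, box.text])
    return buf.getvalue()


# Hypothetical example: two boxes digitised from one scan.
boxes = [
    LabelBox("b1", 120, 340, "Torres Vedras"),
    LabelBox("b2", 410, 95, "Ramalhal"),
]
print(boxes_to_csv(boxes))
```

A table like this can later be loaded as point data once the scan has been georeferenced; the pixel coordinates by themselves are only the "loose representation" mentioned above.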
On Wed, May 4, 2011 at 11:18 AM, ALT SHN <i.geografica@alt-shn.org> wrote:
> Hello,
>
> This might seem a little off topic, but maybe someone here can help me.
>
> I need to extract toponymic data from old digitized paper maps. I wish to
> explore optical character recognition (OCR).
>
> Does anyone have a suggestion or experience with this kind of challenge?

You could try this software:
http://www.gnu.org/software/ocrad/ocrad.html

Markus
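
For anyone who wants to script Ocrad rather than run it by hand, here is a rough sketch of calling it from Python. It rests on several assumptions that should be checked against `ocrad --help` for the installed version: that the `ocrad` executable is on the PATH, that the input is a PNM-family image (pbm, pgm or ppm), that recognised text is written to standard output by default, that `--invert` handles light-on-dark scans, and that the default output encoding is ISO-8859-15. The function names and the file name are hypothetical.

```python
import shutil
import subprocess


def ocrad_command(image_path, invert=False):
    """Build the argument list for an ocrad call on one image."""
    cmd = ["ocrad"]
    if invert:
        # Assumed flag for white-on-black images; verify with `ocrad --help`.
        cmd.append("--invert")
    cmd.append(image_path)
    return cmd


def run_ocrad(image_path, invert=False):
    """Run ocrad and return its text output, or None if it is not installed."""
    if shutil.which("ocrad") is None:
        return None
    result = subprocess.run(
        ocrad_command(image_path, invert),
        capture_output=True,
        check=True,
    )
    # Assumption: default byte output is ISO-8859-15; undecodable bytes
    # are replaced rather than raising an error.
    return result.stdout.decode("iso8859-15", errors="replace")


# Hypothetical file name; only the command is built here, nothing is run.
print(ocrad_command("map_tile.pgm"))
```

Calling `run_ocrad("map_tile.pgm")` would then return the raw recognised text for that tile, which still has to be checked by hand as discussed earlier in the thread.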
Thank you for your suggestions!
I'll begin to explore Ocrad (thanks for the tip, Markus).
If I achieve relevant results, I will share them here.
Best regards,
André
2011/5/5 Markus Neteler <neteler@osgeo.org>
> On Wed, May 4, 2011 at 11:18 AM, ALT SHN <i.geografica@alt-shn.org> wrote:
>> Hello,
>>
>> This might seem a little off topic, but maybe someone here can help me.
>>
>> I need to extract toponymic data from old digitized paper maps. I wish to
>> explore optical character recognition (OCR).
>>
>> Does anyone have a suggestion or experience with this kind of challenge?
>
> You could try this software:
> http://www.gnu.org/software/ocrad/ocrad.html
>
> Markus