Got you worried there? Unlike a lot of other letter combinations, OCR is very nice to have.
Optical character recognition (OCR), sometimes called optical character reader, is the mechanical or electronic conversion of images of typed, handwritten or printed text into text that you can edit on your computer.
In this post I will stick to conversion of typed and printed material.
Sometimes we come across information in a typed or printed format that would be really nice to include in our genealogy data. I often feel like the guy in the illustration when I have to type a whole lot of text. In these cases I find it very convenient to just put the book on the scanner, hit a few buttons and in a minute have the text saved as a text document that I can edit.
In the heading I write that many genealogists may have OCR capabilities (without knowing it). If you have bought a scanner in recent years, there is a good chance that the software that came with it includes an OCR feature. You may have installed it on your computer when you installed the drivers and user interface. You might want to have a look at the documentation and CDs that came with your scanner. (Yes, I know. Reading manuals is boring 🙂 )
If your scanner software does not include OCR capabilities, there is no reason to despair, because there are many free options to choose from.
Even though my scanner software includes OCR, I have chosen to use a program called FreeOCR. It can be downloaded here. The reason is simply that it is very easy to use (and obviously FREE!). The rest of this article is about FreeOCR, but I will also give you some advice that is relevant no matter which OCR software you use.
Here is the opening page in FreeOCR:
You can open an image containing text with the “Open” button, and open and transcribe PDF files with the “Open PDF” button. When you hit the “Scan” button in the upper left part, this menu appears:
In the menu “FreeOCR Scanning” your scanner should appear (in my case it says CanoScan 8800F). You may have to choose your scanner from “Select Scan Device”. If your scanner does not show up in this menu, chances are that it is not installed properly, and you will have to deal with that first. In this menu I suggest you choose “Use Devices Scan Interface”, and I will explain why in a little bit. To move on, hit “Scan”.
Above is the menu that appears when you choose “Use Devices Scan Interface”. The menus I have seen with different scanners are fairly similar. (This one is in Norwegian.) Here I have already used the pre-scan feature and have scanned about one and a half pages of text.
When you use the default scan option (i.e. don’t use “Use Devices Scan Interface”), the half page of text can get mixed up with the rest of the text. This can also happen if there is a picture with a caption or a table on the page, and the same goes for headers and footers. (In all fairness, FreeOCR does a pretty good job in automatic mode, but this has been a problem with other OCR software I have used.)
When you use the device’s scan interface, you get to choose which part of the page to scan. On a page where the text is split up by tables and/or pictures, it can be smart to scan the text in several portions. In the picture below I have marked the part of the page I want to scan with a crop frame, which is selected from the upper left part of the menu (the square with arrows in the corner). It is also smart to make sure the text in the book or document is as horizontal as possible. This makes the OCR easier and more accurate.
You can now hit the “Scan” button. After the scanning is done, you are automatically brought back to the FreeOCR interface.
You are now ready to perform the actual text recognition. Before you do, you should choose the language of the original document. In the picture above I have not yet chosen Norwegian, which is the language of the book I have scanned. The text recognition is performed by hitting the “OCR” button. A little menu appears, and you can choose “Current page” or “All pages”. If you have scanned one page, you hit “Current page”. If you want to recognize a PDF document with several pages, you hit “All pages”.
Here the software has done its magic and the text appears in the right window. The interpretation is good. It does not, however, recognize headlines. In the bar between the windows you can choose from several options. The red cross discards the OCR result. The next button (the disk) saves the text as a simple text file. The “Enter arrow” lets you remove line breaks in the text. The next two options save the text in Word format or as a Rich Text file. The “A” lets you choose the text style and/or size.
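If you have saved the text as a simple text file and forgot to remove the line breaks first, you can do something similar afterwards. Below is a minimal Python sketch of the idea. It is not how FreeOCR does it internally, and the function name remove_line_breaks is just one I made up for this illustration. It assumes that paragraphs are separated by blank lines and that a line ending in a hyphen followed by a lowercase letter is a word split across two lines.

```python
import re


def remove_line_breaks(ocr_text: str) -> str:
    """Join the lines inside each paragraph; keep blank lines as paragraph breaks."""
    # Normalize Windows and old Mac line endings to plain "\n".
    text = ocr_text.replace("\r\n", "\n").replace("\r", "\n")

    cleaned = []
    # Assumption: one or more blank lines separate paragraphs.
    for paragraph in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in paragraph.split("\n") if line.strip()]
        if not lines:
            continue
        joined = lines[0]
        for line in lines[1:]:
            if joined.endswith("-") and line[:1].islower():
                # Assumption: the word was hyphenated at the line end, so glue it back.
                joined = joined[:-1] + line
            else:
                joined = joined + " " + line
        cleaned.append(joined)

    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "My great-grand-\nfather was born\nin Bergen.\n\nHe emigrated\nin 1882."
    print(remove_line_breaks(sample))
```

Be aware that the hyphen rule is only a guess: a word that really is spelled with a hyphen and happens to be split at that hyphen will lose it, so it is worth reading through the result afterwards.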
As I hope I have been able to convey in this little article, OCR is very convenient. It is also easy and quick to use.
If you have questions or comments about this article, don’t hesitate to comment below or send me a message through the contact page.