Scanning Documents with OCR
Chris
http://live.pirillo.com/ - I received an email the other day from someone wondering why documents they scan into their computer come out with the text rearranged and the formatting lost. Well, user-whose-name-I-can’t-pronounce, what you’re asking about deals with OCR.
OCR, or Optical Character Recognition, software allows you to scan a document and edit it. If the OCR software isn’t very good, then it probably won’t retain the layout that the original document had. You are probably using whatever software came with your all-in-one machine, and I don’t know what brand that is.
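To picture why a weaker engine can rearrange text, here is a toy sketch in Python. It is not how any particular OCR product works; the `Word` type, the pixel coordinates, and the `column_split_x` value are all made up for illustration. It only shows that reading recognized words strictly top to bottom scrambles a two-column page, while reading one column at a time does not.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    x: int  # left edge in pixels (hypothetical coordinates)
    y: int  # top edge in pixels (hypothetical coordinates)


def naive_order(words: List[Word]) -> str:
    """Read strictly top-to-bottom, then left-to-right, ignoring columns."""
    return " ".join(w.text for w in sorted(words, key=lambda w: (w.y, w.x)))


def column_aware_order(words: List[Word], column_split_x: int) -> str:
    """Read the whole left column first, then the whole right column."""
    left = sorted((w for w in words if w.x < column_split_x), key=lambda w: (w.y, w.x))
    right = sorted((w for w in words if w.x >= column_split_x), key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in left + right)


# A made-up two-column page: the left column reads "scan the page",
# the right column reads "then edit it".
PAGE = [
    Word("scan", 10, 0), Word("then", 300, 0),
    Word("the", 10, 20), Word("edit", 300, 20),
    Word("page", 10, 40), Word("it", 300, 40),
]

print(naive_order(PAGE))                            # scan then the edit page it
print(column_aware_order(PAGE, column_split_x=200)) # scan the page then edit it
```

Real packages have to work out where the columns are on their own, which is presumably part of what you pay for.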
The problem is, you’re not going to get really good OCR software unless you pay hundreds of dollars. Probably the best I know of is OmniPage. OmniPage Professional 16 is the fastest and most precise way to convert high volumes of paper, PDF and forms into files you can edit and search.
Unfortunately, I don’t know of any open source or cheaper software titles that are on the same level as OmniPage. Do you? If you use something that works just as well, without the price tag, I’d love to hear about it. I’m sure that user-whose-name-I-can’t-pronounce would love to hear your suggestions, as well.

14 Comments
Coyette
September 29th, 2007
at 9:41am
I have run into similar problems with OCR. I have tried quite a few applications, including OmniPage (version 14), which I did not like at all. OmniPage 14 does not retain the original settings of a document. The best solution that I found is Solid Converter PDF Professional (www.solidpdf.com/). Gene should do the following:
1. Scan the document (at least at 300 dpi)
2. Save the scan as a pdf file
3. Open that pdf file in MS Word, after you have installed Solid Converter PDF Professional.
Solid Converter retains all the original settings of the document. I do not know any better solution at an affordable price. Highly recommended.
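As a rough check on the 300 dpi advice in step 1, the arithmetic can be sketched in Python. The function names and the US Letter page size below are only illustrative assumptions; nothing here is tied to Solid Converter or to any scanner driver.

```python
from typing import Tuple


def pixels_needed(width_in: float, height_in: float, dpi: int = 300) -> Tuple[int, int]:
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(pixel_width: int, page_width_in: float) -> float:
    """Resolution a scan was captured at, judged by its pixel width."""
    return pixel_width / page_width_in


def sharp_enough(pixel_width: int, page_width_in: float, min_dpi: int = 300) -> bool:
    """True if the scan meets the minimum resolution."""
    return effective_dpi(pixel_width, page_width_in) >= min_dpi


# US Letter (8.5 x 11 inches) at 300 dpi
print(pixels_needed(8.5, 11))     # (2550, 3300)
# A 1275-pixel-wide scan of the same page is only 150 dpi
print(effective_dpi(1275, 8.5))   # 150.0
print(sharp_enough(1275, 8.5))    # False
```

So if a scan of a letter-size page is much narrower than about 2550 pixels, it was probably captured below 300 dpi and may be worth rescanning before conversion.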
ANON INDIAN
September 29th, 2007
at 10:02am
Hi,
Here are a few other OCR packages:
1) This is one of the best: ABBYY FineReader, an award-winning Optical Character Recognition (OCR) software that allows users to convert paper documents, PDF files, and various images including photographs taken by a digital camera to editable formats for changing and repurposing.
http://www.abbyy.com/
2) Tesseract is a free optical character recognition engine. It was originally developed at Hewlett-Packard from 1985 until 1995. After ten years with no development, Hewlett Packard and UNLV released it in 2005. Tesseract is currently developed by Google and released under the Apache License, Version 2.0. The current version of Tesseract is 2.01, released August 30, 2007. http://code.google.com/p/tesseract-ocr/
3) The SimpleOCR freeware demonstrates the power of our engine and is the only OCR application that is completely free. http://www.simpleocr.com/
Hope this is of help to the person whose name you cannot pronounce…
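For anyone who wants to try Tesseract from a script, here is a minimal sketch in Python. It assumes the `tesseract` command-line program is installed and on the PATH, and it relies on the basic invocation `tesseract imagename outputbase [-l lang]`, which writes its result to `outputbase.txt`. The file names are hypothetical, and which image formats are accepted depends on the Tesseract version you have.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def tesseract_command(image: str, output_base: str, lang: Optional[str] = None) -> List[str]:
    """Build the argument list: tesseract imagename outputbase [-l lang]."""
    cmd = ["tesseract", image, output_base]
    if lang:
        cmd += ["-l", lang]
    return cmd


def ocr_to_text(image: str, output_base: str, lang: Optional[str] = None) -> Optional[str]:
    """Run Tesseract and return the recognized text, or None if it is not installed."""
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(tesseract_command(image, output_base, lang), check=True)
    # Tesseract appends .txt to the output base name.
    return Path(output_base + ".txt").read_text(encoding="utf-8")


# "scan.tif" is a hypothetical file name used only for illustration.
print(tesseract_command("scan.tif", "scan", lang="eng"))
# ['tesseract', 'scan.tif', 'scan', '-l', 'eng']
```

Note that this gives you plain text only, so it does nothing for the layout problem in the original question.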
MildBill
September 29th, 2007
at 5:17pm
As good as OmniPage is, I find that ReadIris Pro is excellent. Amazingly, I sometimes use the older TextBridge Pro (precursor to OmniPage) because it formats complex documents really well. By complex, I mean columns and offsets with photos.
Leo
September 30th, 2007
at 9:04pm
Personally, I really like the ABBYY FINE READER software package that came bundled with my Epson 6400. The program is awesome!
R. Bassett Jr.
September 30th, 2007
at 9:17pm
A basic version of Read-Iris is bundled with most HP Inkjet all-in-one printers and the full version of Read-Iris Pro is bundled with select HP Scanjet scanners. The pro version is excellent, as it is the culmination of something like 30 years of research and development. It costs under $130 now, which is well worth the money if you have hoards of documents you need to update on a regular basis.
As for the bundled copy that comes with the HP printers, it’s basically just something that is there to show the customer that indeed it is possible for their AiO to do OCR. As many all-in-ones can be had for a whopping $40, it wouldn’t add any value to the products for HP to include niche-market software with them. This is especially true now that the full software is so inexpensive (it used to be REALLY expensive not all that long ago). However, with great free programs like HP Photosmart Premier (photo management and manipulation program) and their Snapfish online service, it’s not like HP is just selling you the hardware and throwing you to the wolves!
Your friendly HP tech support personnel, at 1-800-HPINVENT, will be more than happy to show how to install and test the functionality of the OCR software bundled with an in-warranty HP all-in-one printer. However, they’ll also politely set your expectations and point you in the direction of Read-Iris Pro if OCR is something you are serious about doing on a regular basis. Again, Read-Iris Pro is worth the money if OCR is something you need - and it works great in combination with an all-in-one printer with an automatic document feeder!
Read-Iris. Check it out,
http://www.irislink.com/c2-480-189/Readiris-Pro-11-OCR-software.aspx
Incidentally, if you don’t need to edit a document and you’re just making a digital archive, your best bet is to scan it to a pdf (Portable Document Format) file, as it will retain the formatting and store as a single file with multiple pages in high quality. The pdf format is great, as it is compatible with all major operating systems via free software and will likely be supported for many years to come in its current state.
SFCurley
October 1st, 2007
at 6:00am
I also use ABBYY FineReader, and it is excellent. I use it to OCR entire books and then convert them to audio using TextAloud (nextup.com), and when scanned at 300 or 400 dpi, the accuracy is amazing. (TextAloud is fantastic, too, for converting text to audio.)
Kevin C. Tofel
October 1st, 2007
at 6:10am
This might be a stretch, but I think Microsoft OneNote is worth a mention here. You can import pics and OneNote will OCR them in the background, plus the data is searchable. OneNote Mobile comes with OneNote for free and works on a Windows Mobile device: using that app and your device camera, you can snap and sync pics which then get OCR’d like any other imported pics. This definitely wouldn’t work if you need to manipulate the OCR’d text, but it’s a great solution in certain circumstances.
John Harrower
October 1st, 2007
at 8:57am
I agree with Leo. ABBYY Fine Reader is very good. I got my copy free with PC Plus a couple of years ago, fully registered!
Renee
October 11th, 2007
at 8:34am
Thank you! This was very helpful and was exactly the information I needed.
Prof. Dr. Laszlo Jamf
October 28th, 2007
at 5:14pm
A really good and free OCR package is TopOCR. A very interesting feature of this software is that it not only works with scanners but also digital cameras. It also has a text to speech interface that I use to convert images to MP3 files that I then listen to on an iPod on the train every morning. You can find TopOCR at http://www.topocr.com
Charly Victor
July 11th, 2008
at 4:45am
If you are not interested in layout retention, the most powerful and accurate OCR engine is RecoStar Full Page Reader.
RecoStar Full Page Reader makes scanned documents and faxes searchable. It is capable of processing all types of documents, but is particularly suited for the processing of business-related documents.
For more than 15 years, RecoStar has been renowned for its robustness and reliability. RecoStar is standard in almost all applications defined as “mission-critical”.
For further information see:
http://www.captaris-dt.com/product/recostar-fullpagereader/en/
bala
October 6th, 2008
at 2:22pm
I have a scanned document to convert to a text document.