OCR is an abbreviation for Optical Character Recognition, the technology of converting images of printed text into editable files by optically identifying the words.
Why must we use OCR?
Suppose that you have a paper document such as a deed, a book or an RFP, and you have to enter it into a computer as editable text in order to use it in your research, a report and so on.
The first approach that comes to mind, and unfortunately the usual one for Farsi texts, is to retype the text manually.
This is obviously a time-consuming process, all the more so when there are many pages to retype. Another approach, which has emerged with the spread of information technology, is to scan the documents and keep them as digital images.
Although this method produces an electronic archive, which improves the archiving process and eliminates the need for large office spaces to store paper documents, it offers no way of searching the text of these documents or of applying computer technologies such as data mining to them.
OCR software converts scanned images into searchable files. It creates digital files by identifying the different parts of a document image and converting the text parts into editable text.
If we look at OCR software as a black box, it is an entity that takes images of documents and generates editable and searchable digital files.
After getting the image, the first step is to analyze its layout. The image is divided into table, text and image zones.
Afterwards, according to the zone type, ARAX performs the required steps and recovers the information:
- Text zones are processed, and their content and font information are read.
- Images are kept as is.
- Tables are read cell by cell and placed in the output as a table, preserving the layout.
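The per-zone handling described above can be sketched as a simple dispatch on the zone type. This is only an illustration and not ARAX's actual code: the `Zone` class, `process_zone`, `process_page` and the `recognize` callable are hypothetical names introduced here, and the recognizer itself is replaced by a stub.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical recognizer type: takes the pixels of a zone or cell and
# returns (recognized text, font name). A real OCR engine would go here.
Recognizer = Callable[[Any], Tuple[str, str]]


@dataclass
class Zone:
    kind: str  # "text", "image" or "table"
    data: Any  # text/image: zone pixels; table: rows of cell pixels


def process_zone(zone: Zone, recognize: Recognizer) -> Dict[str, Any]:
    """Handle one zone according to its type."""
    if zone.kind == "text":
        # Text zones: read both the content and the font information.
        content, font = recognize(zone.data)
        return {"kind": "text", "content": content, "font": font}
    if zone.kind == "image":
        # Image zones: kept as is.
        return {"kind": "image", "data": zone.data}
    if zone.kind == "table":
        # Table zones: read cell by cell, keeping the row/column layout.
        rows = [[recognize(cell)[0] for cell in row] for row in zone.data]
        return {"kind": "table", "rows": rows}
    raise ValueError(f"unknown zone kind: {zone.kind!r}")


def process_page(zones: List[Zone], recognize: Recognizer) -> List[Dict[str, Any]]:
    """Process all zones of a page in reading order."""
    return [process_zone(zone, recognize) for zone in zones]


def stub_recognize(pixels: Any) -> Tuple[str, str]:
    """Stand-in for a real recognizer, used only to make the sketch runnable."""
    return f"text({pixels})", "unknown"


page = [
    Zone("text", "t1"),
    Zone("image", b"\x00"),
    Zone("table", [["a", "b"], ["c", "d"]]),
]
result = process_page(page, stub_recognize)
```

Passing the recognizer in as a parameter keeps the layout logic separate from the character recognition, so the same dispatch works whichever engine reads the text.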
In the next stage, ARAX shows the recognized document in a WYSIWYG editor. You can correct any mistakes with the help of a spellchecker.
At the end of the process, ARAX generates files in your desired format, containing all the information from the document that the format can hold.
Comparison between Farsi and Latin OCR
For languages written in the Latin script, such as English and French, OCR software has existed for years and has gone through a long history of change and improvement. Unfortunately, there has not been a suitable OCR for Farsi, despite the 2000-year history of this language.
One of the reasons for this is the much more complex structure of Farsi writing in comparison to Latin. In printed Latin texts the characters are written separately, which makes identification relatively easy. In Farsi the letters are joined, so the words must be identified first, and each word must then be broken into the segments that make it up. Because of the variety of Farsi fonts, this segmentation is the most difficult part.
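To give a feel for the word-finding step, here is a minimal sketch of one common textbook technique, vertical projection: count the ink pixels in each column of a text line and cut wherever there is a run of empty columns. This is not presented as ARAXPage's method. The function names, the 0/1 image representation and the `min_gap` parameter are assumptions made for illustration, and the sketch does not attempt the harder task of separating joined letters inside a connected piece.

```python
from typing import List, Tuple


def column_projection(image: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in each column of a binary text-line image."""
    return [sum(column) for column in zip(*image)]


def split_on_gaps(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, of ink runs that are
    separated by at least `min_gap` consecutive empty columns."""
    segments = []
    start = None  # first column of the current run, or None between runs
    gap = 0       # number of empty columns seen since the last ink column
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The run ended at the last ink column before this gap.
                segments.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close a run that reaches the right edge (ignoring trailing blanks).
        segments.append((start, len(profile) - gap))
    return segments


# A toy 2-row "text line": 1 = ink, 0 = background.
line = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
]
profile = column_projection(line)
pieces = split_on_gaps(profile, min_gap=1)  # every gap separates a piece
words = split_on_gaps(profile, min_gap=2)   # only wide gaps separate words
```

With a small `min_gap` every break in the ink is a cut, while a larger value cuts only at wider gaps. Choosing that threshold is font dependent, which is one reason the variety of fonts makes this step hard.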
ARAXPage, which is the result of continuous effort in the research and development department of HODA System, has solved many of the problems facing Farsi OCR systems and, after years, has equipped the Farsi language with a powerful OCR system. To provide users with as much capability as possible, ARAXPage can currently also read English texts as well as English OCR software does. In addition, ARAXPage can identify English words and phrases in the midst of Farsi texts.
As business and office processes are still based on paper documents, OCR can be used in every part of governmental and private organizations. This section describes some of the applications of OCR.
OCR as the optimal way of entering information
Typing information manually from printed documents is a common task, done every day in the office activities of many organizations. This job is time-consuming and costly. In addition, the typed information always contains a percentage of operator mistakes. These errors are reduced by several stages of review, but in some cases errors remain even after all the stages. ARAX, as the most powerful OCR software, can eliminate this tedious procedure by automating it.
Some OCR applications in this regard:
- Fast recovery of letters, contracts and similar documents that are available only in print.
- Faster completion of tender documents and answers to RFPs.
- Completion and update of technical and financial reports, marketing plans and more, using available printed papers.
OCR as the only way of creating digital libraries
The Farsi language, as one of the oldest living languages, is not only a source of pride for Iran but has also gained a most valuable place in world literature. Despite this, and despite the fact that many books have been written in this language, the absence of a good digital library has been a serious obstacle to its expansion.
ARAXPage, as the most powerful Farsi OCR, can close the gap between the current situation and the creation of rich digital libraries, with high speed and accuracy.