Fast and Accurate Word Spotting in Images of Arabic Handwriting from Text Queries

Project PIs: Dr. Mohamed E. Hussein, Dr. Marwan Torki
Project Team: Eng. Mahmoud Fayyaz and Eng. Ahmed ElSallamy
Google scholar:
Dr. Elsayed’s http://scholar.google.com/citations?user=jCUt0o0AAAAJ
Dr. Torki’s http://scholar.google.com/citations?user=aYLNZT4AAAAJ&hl=en
Research area: ICT

Abstract

Optical Character Recognition (OCR) is a challenging task whose difficulty depends on the nature of the document and on its language. Handwritten documents are more difficult than printed ones, and some writing systems are more difficult than others. The Arabic script is one of the most difficult for OCR. Moreover, research on Arabic OCR is limited compared to research on Latin and Asian languages. One of the obstacles facing research in Arabic OCR is the lack of large, publicly available datasets for training intelligent OCR systems, along with standard benchmark tests on these datasets.

The goal of this project is to advance the state of the art in recognizing Arabic handwritten documents. The specific objectives are the following:

Collect a very large dataset of Arabic handwritten documents with ground truth.
Create a benchmark evaluation framework for the collected dataset and make it publicly available to the research community.
Conduct a thorough comparative study of state-of-the-art techniques on the collected dataset.
Use the dataset to experiment with multiple modern machine learning techniques that have not previously been applied to Arabic OCR.