Introduction
Data science is a growing field. According to the CRISP-DM model and other data mining models, we need to collect data before we can mine knowledge from it and conduct predictive analysis. Data collection can involve data scraping, which includes web scraping (HTML to text), image-to-text conversion, and video-to-text conversion. Once the data is in text format, we usually use text mining techniques to extract knowledge from it.
In this article, I am going to introduce you to Optical Character Recognition (OCR), which converts images to text. I developed Just Another Tesseract Interface (JATI) to convert images into text files and consolidate them into a set of text data for text mining and natural language processing.
JATI interfaces with the Tesseract OCR engine to convert images into text. I have included the source code. In this article, I am going to explain how to interface with the popular open source Tesseract OCR engine using C#.
Selecting the Image Portion to Convert
Running OCR on a whole image is easy, but I want to select a portion of the image to OCR, which can also improve the accuracy of the result. Hence, in JATI, the user can click on the PictureBox image and drag to draw a rectangle that selects the portion. The selected area is then cropped. The steps to accomplish this follow.
References:
- http://www.c-sharpcorner.com/UploadFile/hirendra_singh/how-to-make-image-editor-tool-in-C-Sharp-cropping-image/
- https://stackoverflow.com/questions/34551800/get-the-exact-size-of-the-zoomed-image-inside-the-picturebox
Include the System.Drawing namespace:
using System.Drawing;
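Mouse Down event for PictureBox1 (startX, startY, curX, and curY are int fields of the form, and selPen is a Pen field):
void PictureBox1MouseDown(object sender, MouseEventArgs e)
{
    try {
        if (e.Button == System.Windows.Forms.MouseButtons.Left)
        {
            // Remember where the drag started and prepare the pen for the selection outline.
            Cursor = Cursors.Cross;
            startX = e.X;
            startY = e.Y;
            selPen = new Pen(Color.Red, 1);
        }
        pictureBox1.Refresh();
    }
    catch(Exception ex) {
        // Log the error instead of silently discarding it.
        System.Diagnostics.Debug.WriteLine(ex.Message);
    }
}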
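Mouse Move event for PictureBox1:
void PictureBox1MouseMove(object sender, MouseEventArgs e)
{
    try {
        if(e.Button == System.Windows.Forms.MouseButtons.Left) {
            // Erase the previous outline, then draw the rectangle up to the current mouse position.
            pictureBox1.Refresh();
            curX = e.X;
            curY = e.Y;
            Rectangle rect = new Rectangle(startX, startY, curX - startX, curY - startY);
            // Dispose the Graphics object after drawing so that its handle is not leaked.
            using (Graphics g = pictureBox1.CreateGraphics()) {
                g.DrawRectangle(selPen, rect);
            }
        }
    }
    catch(Exception ex) {
        // Log the error instead of silently discarding it.
        System.Diagnostics.Debug.WriteLine(ex.Message);
    }
}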
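The rectangle built from the start and current points has a negative width or height whenever the user drags up or to the left, and such a rectangle cannot be used to size a Bitmap. A minimal sketch of one way to handle this follows. SelectionHelper and its methods are hypothetical names introduced here for illustration; they are not part of JATI.

```csharp
using System;
using System.Drawing;

// Hypothetical helper, not part of the JATI source.
public static class SelectionHelper
{
    // Builds a rectangle with non-negative width and height from the two
    // drag points, regardless of the direction in which the user dragged.
    public static Rectangle Normalize(int startX, int startY, int curX, int curY)
    {
        int left = Math.Min(startX, curX);
        int top = Math.Min(startY, curY);
        int right = Math.Max(startX, curX);
        int bottom = Math.Max(startY, curY);
        return Rectangle.FromLTRB(left, top, right, bottom);
    }

    // A selection can only be cropped when it covers at least one pixel.
    public static bool IsCroppable(Rectangle selection)
    {
        return selection.Width > 0 && selection.Height > 0;
    }
}
```

With this helper, a handler could call SelectionHelper.Normalize(startX, startY, curX, curY) in place of constructing the Rectangle directly, and skip the crop when IsCroppable returns false.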
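Mouse Up event for PictureBox1:
void PictureBox1MouseUp(object sender, MouseEventArgs e)
{
    try {
        Cursor = Cursors.Arrow;
        // A Bitmap needs a positive width and height, so skip empty or reversed selections.
        if (curX <= startX || curY <= startY) return;
        Rectangle rect = new Rectangle(startX, startY, curX-startX, curY-startY);
        // Scale the source image to the PictureBox size so that mouse coordinates match its pixels.
        Bitmap OriginalImage = new Bitmap(pictureBox1.Image, pictureBox1.Width, pictureBox1.Height);
        Bitmap _img = new Bitmap(curX-startX, curY-startY);
        Graphics g = Graphics.FromImage(_img);
        g.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
        g.PixelOffsetMode = System.Drawing.Drawing2D.PixelOffsetMode.HighQuality;
        g.CompositingQuality = System.Drawing.Drawing2D.CompositingQuality.HighQuality;
        // Copy the selected region into the top-left corner of the new bitmap.
        g.DrawImage(OriginalImage, 0, 0, rect, GraphicsUnit.Pixel);
        // Release the drawing surface and the scaled copy once the crop is done.
        g.Dispose();
        OriginalImage.Dispose();
        pictureBox2.Image = _img;
        pictureBox2.SizeMode = PictureBoxSizeMode.Zoom;
        pictureBox2.Width = _img.Width;
        pictureBox2.Height = _img.Height;
    }
    catch(Exception ex) {
        // Log the error instead of silently discarding it.
        System.Diagnostics.Debug.WriteLine(ex.Message);
    }
}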
The above code crops the selected image portion and places it into pictureBox2. A detailed explanation follows.
Create a new Rectangle object for the selection:
Rectangle rect = new Rectangle(startX, startY, curX-startX, curY-startY);
Save the original image, scaled to the size of pictureBox1, into a Bitmap object:
Bitmap OriginalImage = new Bitmap(pictureBox1.Image, pictureBox1.Width, pictureBox1.Height);
Create a new Bitmap object with the size of the selection:
Bitmap _img = new Bitmap(curX-startX, curY-startY);
Create a Graphics object based on the new Bitmap object:
Graphics g = Graphics.FromImage(_img);
Configure the quality settings of the Graphics object:
g.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
g.PixelOffsetMode = System.Drawing.Drawing2D.PixelOffsetMode.HighQuality;
g.CompositingQuality = System.Drawing.Drawing2D.CompositingQuality.HighQuality;
Crop the image based on the selection and put the result into pictureBox2:
g.DrawImage(OriginalImage, 0, 0, rect, GraphicsUnit.Pixel);
pictureBox2.Image = _img;
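Because the crop is taken from a copy of the image scaled to the PictureBox size, a large scan loses resolution before it reaches the OCR engine. One possible alternative, sketched below, maps the selection back to the pixel coordinates of the original image so that the crop can be taken at full resolution. This is only a sketch under a stated assumption: pictureBox1 uses PictureBoxSizeMode.StretchImage, so the image fills the whole control. SelectionMapper and ToImagePixels are hypothetical names, not part of JATI.

```csharp
using System;
using System.Drawing;

// Hypothetical helper, not part of the JATI source.
public static class SelectionMapper
{
    // Maps a selection given in PictureBox coordinates to pixel coordinates
    // of the original image. Assumes the image is stretched to fill the box.
    public static Rectangle ToImagePixels(Rectangle selection, Size boxSize, Size imageSize)
    {
        if (boxSize.Width <= 0 || boxSize.Height <= 0)
            throw new ArgumentException("The PictureBox size must be positive.", "boxSize");

        // Horizontal and vertical scale factors from control space to image space.
        double scaleX = (double)imageSize.Width / boxSize.Width;
        double scaleY = (double)imageSize.Height / boxSize.Height;

        int left = (int)Math.Round(selection.Left * scaleX);
        int top = (int)Math.Round(selection.Top * scaleY);
        int right = (int)Math.Round(selection.Right * scaleX);
        int bottom = (int)Math.Round(selection.Bottom * scaleY);

        Rectangle mapped = Rectangle.FromLTRB(left, top, right, bottom);
        // Clip to the image bounds in case the drag ended outside the control.
        mapped.Intersect(new Rectangle(Point.Empty, imageSize));
        return mapped;
    }
}
```

The mapped rectangle could then be passed to g.DrawImage as the source rectangle, with pictureBox1.Image itself as the source instead of the scaled copy. Other size modes, such as Zoom, would also need the offset of the displayed image inside the control, which this sketch does not cover.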
To record the coordinates of the selection (its start point and end point), I use:
string selCoordinates = "(" + startX.ToString() + "," + startY.ToString() +
    "," + curX.ToString() + "," + curY.ToString() + ")";
Image to Text Recognition using Tesseract
I use the Tesseract OCR engine to convert images into text. To interface with the Tesseract OCR engine, include the System.Diagnostics namespace (for Process), along with System.IO (for Directory):
using System.Diagnostics;
using System.IO;
Save the cropped image selection from pictureBox2 into a temporary directory:
pictureBox2.Image.Save(Directory.GetCurrentDirectory() + "/JATI/temp/temp" + ".png");
Set the input file and output file for Tesseract OCR engine:
string input = Directory.GetCurrentDirectory() + "/JATI/temp/temp" + ".png";
string output = Directory.GetCurrentDirectory() + "/JATI/temp/temp" + ".txt";
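Saving the image fails if the temporary directory does not exist yet, and string concatenation makes the separators easy to get wrong. The sketch below shows one way to build the same paths with Path.Combine and to create the directory up front. OcrPaths and its methods are hypothetical names introduced for illustration; only the JATI/temp/temp layout comes from the article.

```csharp
using System.IO;

// Hypothetical helper, not part of the JATI source.
public static class OcrPaths
{
    // <base>/JATI/temp, the temporary directory used in the article.
    public static string TempDirectory(string baseDirectory)
    {
        return Path.Combine(baseDirectory, "JATI", "temp");
    }

    // Path of the cropped image that is handed to Tesseract.
    public static string InputImage(string baseDirectory)
    {
        return Path.Combine(TempDirectory(baseDirectory), "temp.png");
    }

    // Path of the text file that Tesseract is expected to write.
    public static string OutputText(string baseDirectory)
    {
        return Path.Combine(TempDirectory(baseDirectory), "temp.txt");
    }

    // Creates the temporary directory if it is missing and returns its path.
    // Directory.CreateDirectory does nothing when the directory already exists.
    public static string EnsureTempDirectory(string baseDirectory)
    {
        string directory = TempDirectory(baseDirectory);
        Directory.CreateDirectory(directory);
        return directory;
    }
}
```

A caller could run OcrPaths.EnsureTempDirectory(Directory.GetCurrentDirectory()) once before calling pictureBox2.Image.Save with OcrPaths.InputImage(Directory.GetCurrentDirectory()).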
Create the Process and pass in the arguments (the input and output paths are quoted so that directories containing spaces still work):
Process myProcess = Process.Start(Directory.GetCurrentDirectory() +
    "/JATI/tesseract.exe", "--tessdata-dir ./JATI/ \"" + input + "\" \"" +
    output.Replace(".txt", "") + "\" -l " + languageTextBox.Text + " -psm " + psmTextBox.Text);
Wait for the process to exit:
myProcess.WaitForExit();
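Once the process has exited, the recognized text can be read back from the output file. The sketch below wraps the steps above in one place, checks the exit code, and returns the text. TesseractRunner and its methods are hypothetical names, not part of JATI. The argument layout follows the command line used in this article; newer Tesseract releases spell the page segmentation option --psm, so the flag may need adjusting for the version you bundle.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper, not part of the JATI source.
public static class TesseractRunner
{
    // Wraps a path in double quotes so that spaces do not split it
    // into several command-line arguments.
    public static string Quote(string path)
    {
        return "\"" + path + "\"";
    }

    // Builds the argument string in the same order as the article's command.
    // outputBase is the output path without the ".txt" extension.
    public static string BuildArguments(string tessdataDir, string inputImage,
        string outputBase, string language, string psm)
    {
        return "--tessdata-dir " + Quote(tessdataDir) + " " + Quote(inputImage) +
            " " + Quote(outputBase) + " -l " + language + " -psm " + psm;
    }

    // Runs tesseract, waits for it to finish, and returns the recognized text.
    public static string Recognize(string tesseractExe, string arguments, string outputBase)
    {
        ProcessStartInfo info = new ProcessStartInfo(tesseractExe, arguments);
        info.UseShellExecute = false;  // start the executable directly
        info.CreateNoWindow = true;    // do not flash a console window

        using (Process process = Process.Start(info))
        {
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    "tesseract exited with code " + process.ExitCode);
        }

        // Tesseract appends ".txt" to the output base name.
        return File.ReadAllText(outputBase + ".txt");
    }
}
```

The returned string could then be shown in a text box or appended to the consolidated text data set mentioned in the introduction.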
Eric Goh is a data scientist, software engineer, adjunct faculty member, and entrepreneur with years of experience in multiple industries. His varied career includes data science, data and text mining, natural language processing, machine learning, intelligent system development, and engineering product design. He founded SVBook Pte. Ltd. and extended it with DSTK.Tech and EMHAcademy.com. DSTK.Tech is where Eric develops his own DSTK data science software (public version). Eric also published “Learn R for Applied Statistics” with Apress, and has published books through LeanPub, Google Books, Amazon Kindle, and SVBook Pte. Ltd. He teaches the content at EMHAcademy.com, Udemy, SkillShare, BitDegree, and Simpliv, and has developed 28 courses and 7 advanced certificates. Eric is also an adjunct faculty member at universities and institutions.
Eric Goh has led his teams on various industrial projects, including the advanced product code classification system project, which automates Singapore Customs' trade facilitation process, and Nanyang Technological University's data science projects, where he developed his own DSTK data science software (NTU version). He has years of experience in C#, Java, C/C++, SPSS Statistics and Modeler, SAS Enterprise Miner, R, Python, Excel, Excel VBA, and more. He won the Tan Kah Kee Young Inventors' Merit Award 2007 and was a shortlisted entry for the TelR Data Mining Challenge.
Eric holds a Masters of Technology degree from the National University of Singapore, an Executive MBA degree from U21Global (currently GlobalNxt) and IGNOU, a Graduate Diploma in Mechatronics from A*STAR SIMTech (a national research institute located in Nanyang Technological University), Coursera Specialization Certificate in Business Statistics and Analysis (Excel) from Rice University, IBM Data Science Professional Certificate (Python, SQL), and Coursera Verified Certificate in R Programming from Johns Hopkins University. He possessed a Bachelor of Science degree in Computing fr