Data Scraping from Image using Tesseract

Eric M. H. Goh

31 Mar 2018

Scrape data from image using Tesseract OCR engine

Introduction

Data science is a growing field. According to the CRISP-DM model and other data mining process models, data must be collected before knowledge can be mined from it and predictive analysis conducted. Data collection can involve data scraping, which includes web scraping (HTML to text) as well as image-to-text and video-to-text conversion. Once the data is in text format, text mining techniques are typically used to extract knowledge from it.

In this article, I introduce Optical Character Recognition (OCR) for converting images to text. I developed Just Another Tesseract Interface (JATI) to convert images into text files and consolidate them into a set of text data for text mining and natural language processing.

JATI interfaces with the Tesseract OCR engine to convert images into text, and I have included its source code. The rest of this article explains how to interface with the popular open source Tesseract OCR engine from C#.

Selecting the Image Portion to Convert

Running OCR on a whole image is easy, but I want to OCR only a selected portion of the image, which can also improve the accuracy of the result. In JATI, the user clicks on the PictureBox image and drags to draw a rectangle around the portion to convert. The selected area is then cropped. The following steps accomplish this.

References:

Import the System.Drawing namespace (the event handlers below also use types from System.Windows.Forms):

using System.Drawing;
using System.Windows.Forms;

Mouse Down event for PictureBox1:

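void PictureBox1MouseDown(object sender, MouseEventArgs e)
{
    try
    {
        if (e.Button == System.Windows.Forms.MouseButtons.Left)
        {
            // Remember where the drag starts. startX, startY, and selPen
            // are fields shared with the other mouse handlers.
            Cursor = Cursors.Cross;
            startX = e.X;
            startY = e.Y;

            selPen = new Pen(Color.Red, 1);
        }

        // Repaint to clear any previous selection outline.
        pictureBox1.Refresh();
    }
    catch (Exception ex)
    {
        // Log the error instead of silently swallowing it.
        System.Diagnostics.Debug.WriteLine(ex);
    }
}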

Mouse Move event for PictureBox1:

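void PictureBox1MouseMove(object sender, MouseEventArgs e)
{
    try
    {
        if (e.Button == System.Windows.Forms.MouseButtons.Left)
        {
            // Repaint to erase the outline drawn for the previous mouse position.
            pictureBox1.Refresh();
            curX = e.X;
            curY = e.Y;

            // Outline from the drag start to the current position. The width and
            // height are negative when the user drags up or to the left.
            Rectangle rect = new Rectangle(startX, startY, curX - startX, curY - startY);
            using (Graphics g = pictureBox1.CreateGraphics())
            {
                g.DrawRectangle(selPen, rect);
            }
        }
    }
    catch (Exception ex)
    {
        // Log the error instead of silently swallowing it.
        System.Diagnostics.Debug.WriteLine(ex);
    }
}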

Mouse Up event for PictureBox1:

void PictureBox1MouseUp(object sender, MouseEventArgs e)
{
    try
    {
        Cursor = Cursors.Arrow;

        // Selection in PictureBox coordinates. A drag up or to the left gives a
        // negative size, so new Bitmap throws below and nothing is cropped.
        Rectangle rect = new Rectangle(startX, startY, curX - startX, curY - startY);

        // Copy of the source image scaled to the PictureBox size.
        Bitmap OriginalImage = new Bitmap(pictureBox1.Image, pictureBox1.Width, pictureBox1.Height);
        Bitmap _img = new Bitmap(curX - startX, curY - startY);

        using (Graphics g = Graphics.FromImage(_img))
        {
            g.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
            g.PixelOffsetMode = System.Drawing.Drawing2D.PixelOffsetMode.HighQuality;
            g.CompositingQuality = System.Drawing.Drawing2D.CompositingQuality.HighQuality;

            // Copy the selected region of the source into the new bitmap.
            g.DrawImage(OriginalImage, 0, 0, rect, GraphicsUnit.Pixel);
        }

        pictureBox2.Image = _img;
        pictureBox2.SizeMode = PictureBoxSizeMode.Zoom;
        pictureBox2.Width = _img.Width;
        pictureBox2.Height = _img.Height;
    }
    catch (Exception ex)
    {
        // Log the error instead of silently swallowing it.
        System.Diagnostics.Debug.WriteLine(ex);
    }
}

The code above crops the selected portion of the image and places it into pictureBox2. A step-by-step explanation follows.

Create a new rectangle object for the selection:

Rectangle rect = new Rectangle(startX, startY, curX-startX, curY-startY);

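Copy the original image into a Bitmap object scaled to the PictureBox size, so that mouse coordinates can be used directly as pixel coordinates (this works as intended when the PictureBox stretches the image to fill the control):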

Bitmap OriginalImage = new Bitmap(pictureBox1.Image, pictureBox1.Width, pictureBox1.Height);

Create a new Bitmap object sized to the selection:

Bitmap _img = new Bitmap(curX-startX, curY-startY);

Create a Graphics object that draws onto the new Bitmap object:

Graphics g = Graphics.FromImage(_img);

Configure the Graphics object for high-quality rendering:

g.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
g.PixelOffsetMode = System.Drawing.Drawing2D.PixelOffsetMode.HighQuality;
g.CompositingQuality = System.Drawing.Drawing2D.CompositingQuality.HighQuality;

Crop the image based on the selection and show the result in pictureBox2:

g.DrawImage(OriginalImage, 0, 0, rect, GraphicsUnit.Pixel);
pictureBox2.Image = _img;
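
The rectangle above has a positive width and height only when the user drags from the upper left to the lower right. The following is a minimal sketch of one way to handle drags in any direction and to keep the selection inside the image. SelectionHelper, FromDrag, and ClampTo are hypothetical names that are not part of JATI.

```csharp
using System;
using System.Drawing;

// Hypothetical helper, not part of the JATI source.
static class SelectionHelper
{
    // Builds a rectangle with non-negative width and height from the
    // drag start point and the current mouse position.
    public static Rectangle FromDrag(int startX, int startY, int curX, int curY)
    {
        return new Rectangle(
            Math.Min(startX, curX),
            Math.Min(startY, curY),
            Math.Abs(curX - startX),
            Math.Abs(curY - startY));
    }

    // Restricts the selection to the area (0, 0)-(bounds.Width, bounds.Height).
    // The caller should still check that Width and Height are greater than zero
    // before creating a Bitmap of that size.
    public static Rectangle ClampTo(Rectangle selection, Size bounds)
    {
        return Rectangle.Intersect(selection, new Rectangle(Point.Empty, bounds));
    }
}
```

In the Mouse Up handler, such a helper could be called as SelectionHelper.FromDrag(startX, startY, curX, curY), and the result used both as the crop rectangle and for the size of the new Bitmap.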

To record the selected coordinates (the start and end points of the drag) as text, I use:

string selCoordinates = "(" + startX.ToString() + "," + startY.ToString() + 
                        "," + curX.ToString() + "," + curY.ToString() + ")";
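
These coordinates are in PictureBox units, which is why the cropping code first scales the image to the PictureBox size. As an alternative, the sketch below maps a selection back to pixel coordinates of the full-size image, so that a crop could keep the original resolution. It assumes the image is stretched to fill the whole control. CoordinateMapper and ToImagePixels are hypothetical names that are not part of JATI.

```csharp
using System;
using System.Drawing;

// Hypothetical helper, not part of the JATI source.
static class CoordinateMapper
{
    // Converts a selection given in control coordinates into pixel
    // coordinates of the underlying image. Assumes the image fills the
    // control completely (no letterboxing) and controlSize is non-zero.
    public static Rectangle ToImagePixels(Rectangle selection, Size controlSize, Size imageSize)
    {
        double scaleX = (double)imageSize.Width / controlSize.Width;
        double scaleY = (double)imageSize.Height / controlSize.Height;

        return new Rectangle(
            (int)Math.Round(selection.X * scaleX),
            (int)Math.Round(selection.Y * scaleY),
            (int)Math.Round(selection.Width * scaleX),
            (int)Math.Round(selection.Height * scaleY));
    }
}
```

For example, a selection of (10, 20) with size 100 x 50 in a 200 x 100 control maps to (20, 60) with size 200 x 150 in a 400 x 300 image.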

Image to Text Recognition using Tesseract

I use the Tesseract OCR engine to convert images into text. To launch Tesseract as an external process, import the System.Diagnostics namespace (System.IO is also needed for the Directory class used below):

using System.Diagnostics;
using System.IO;

Save the cropped image selection from pictureBox2 into a temporary directory:

pictureBox2.Image.Save(Directory.GetCurrentDirectory() + "/JATI/temp/temp" + ".png");

Set the input file and output file for Tesseract OCR engine:

string input = Directory.GetCurrentDirectory() + "/JATI/temp/temp" + ".png";
string output = Directory.GetCurrentDirectory() + "/JATI/temp/temp" + ".txt";

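Start the Tesseract process and pass the arguments. Tesseract appends .txt to the output base name itself, so the extension is stripped from the output path. The paths are wrapped in double quotes so that directories containing spaces are passed as single arguments:

Process myProcess = Process.Start(Directory.GetCurrentDirectory() + 
"/JATI/tesseract.exe", "--tessdata-dir ./JATI/ \"" + input + "\" \"" + 
output.Replace(".txt", "") + "\" -l " + languageTextBox.Text + " -psm " + psmTextBox.Text);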

Wait for the process to exit:

myProcess.WaitForExit();
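
Once the process has exited, the recognized text can be read back from the output file. The sketch below puts the steps of this section together under a few assumptions: TesseractRunner, Quote, BuildArguments, and Run are hypothetical names that are not part of JATI, and a non-zero exit code is assumed to signal failure. It keeps the -psm flag used in this article; newer Tesseract releases document this option as --psm.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper, not part of the JATI source.
static class TesseractRunner
{
    // Wraps a path in double quotes so that spaces survive argument parsing.
    // Paths ending in a backslash are not handled by this simple version.
    static string Quote(string path)
    {
        return "\"" + path + "\"";
    }

    // Builds the same argument string as the article, with quoted paths.
    public static string BuildArguments(string tessdataDir, string inputImage,
                                        string outputBase, string language, string psm)
    {
        return "--tessdata-dir " + Quote(tessdataDir) + " " + Quote(inputImage) + " " +
               Quote(outputBase) + " -l " + language + " -psm " + psm;
    }

    // Runs Tesseract without a console window and returns the recognized text.
    public static string Run(string exePath, string arguments, string outputBase)
    {
        var info = new ProcessStartInfo(exePath, arguments)
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(info))
        {
            process.WaitForExit();
            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "tesseract exited with code " + process.ExitCode);
            }
        }

        // Tesseract appends ".txt" to the output base name.
        return File.ReadAllText(outputBase + ".txt");
    }
}
```

With the variables from this section, the arguments could be built as TesseractRunner.BuildArguments("./JATI/", input, output.Replace(".txt", ""), languageTextBox.Text, psmTextBox.Text) and then passed to Run together with the path to tesseract.exe.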

License

This article, along with any associated source code and files, is licensed under The Apache License, Version 2.0