PDF Extraction Tool (PET)

3 Dec 2015 | CPOL | 2 min read
PET is a tool for extracting selected information from PDF documents. We developed it to extract an individual's details from E-Aadhar (a unique identity issued by the Govt. of India). Users can modify the code to suit other PDF documents.

Introduction

The tool is a Java application and runs on any platform that supports the Java Runtime Environment (JRE). It extracts text from password-protected PDF documents and can store the required information in a database. (We developed this tool to extract an individual's information from E-Aadhar and store it in a Microsoft Access database.)

The main objectives of PET are as follows:

  • Extraction of text and images: PET extracts text and images from password-protected PDF documents.
  • Storing data in a database: PET stores the extracted data in a database (Microsoft Access), which is created automatically when the user clicks the Save button.

Overview of PET

PET lets the user extract the desired information and images from PDF documents (with or without a password) and store them in a database. The user can also retrieve the desired information from the database. We developed it to extract an individual's information from E-Aadhar (which is password protected) and store it in an Access database.
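
The store-and-retrieve flow described above can be sketched with plain JDBC. This is only an illustration, not PET's actual code: the class `AadharStore`, the table `person`, and its columns are hypothetical names, and the table is assumed to exist already. Opening the `Connection` is left to the caller, since connecting to an Access file from Java depends on whichever JDBC driver is available.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Optional;

// Hypothetical store: table and column names are assumptions, not taken from PET.
final class AadharStore {
    private static final String INSERT_SQL =
        "INSERT INTO person (aadhar_no, full_name, pin_code) VALUES (?, ?, ?)";
    private static final String SELECT_SQL =
        "SELECT full_name FROM person WHERE aadhar_no = ?";

    // Strip whitespace so "1234 5678 9012" and "123456789012" map to the same key.
    static String normalizeKey(String aadharNo) {
        String digits = aadharNo.replaceAll("\\s+", "");
        if (!digits.matches("\\d{12}")) {
            throw new IllegalArgumentException("Expected a 12-digit number");
        }
        return digits;
    }

    // Insert one record; the caller supplies an open connection.
    static void save(Connection conn, String aadharNo, String fullName, String pinCode)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, normalizeKey(aadharNo));
            ps.setString(2, fullName);
            ps.setString(3, pinCode);
            ps.executeUpdate();
        }
    }

    // Look up the stored name by key; empty if no row matches.
    static Optional<String> findName(Connection conn, String aadharNo) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(SELECT_SQL)) {
            ps.setString(1, normalizeKey(aadharNo));
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    return Optional.ofNullable(rs.getString("full_name"));
                }
                return Optional.empty();
            }
        }
    }
}
```

Normalizing the key before both the insert and the lookup means a number typed with or without spaces finds the same row, and the parameterized statements keep user input out of the SQL text.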

Operations of PET

When the user launches the installed application, the following interface appears:

Image 1

Fig. i. Main interface of application

The main operations of PET involve the following steps:

  1. Selecting the PDF file: When the user clicks the File button, a dialog box opens that lets the user select the file (Fig. ii).

    Image 2

    Fig. ii. Selecting a PDF file to open

    If the file is password protected, the user must provide the password. In our case, the password is the user's PIN code (Fig. iii).

    Image 3

    Fig. iii. Entering credentials
  2. Digitization of information from the PDF: Once the file is loaded, the application extracts the desired information from it and shows it in a new window.

    In our case, we extract an individual's information from E-Aadhar, as shown in Fig. iv.

    Image 4

    Fig. iv. Displaying extracted information from the PDF
  3. Storing the information in the database: To save the data, the user clicks the Save button, and a database is created in the folder (Aadhar Data in our case).

    Image 5

    Fig. v. Location where database is created. (C:\Aadhardata)
  4. Retrieving the data from the database: To retrieve the information, the user enters the unique key value; we use the user's Aadhar no. as the key. As shown in Fig. vi, the user inputs the value and the desired data is fetched from the database (Fig. vii).

    Image 6

    Fig. vi
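
As a rough illustration of the digitization step, the sketch below pulls two fields out of text that has already been extracted from the PDF. It is not PET's actual parsing logic: the class `AadharTextParser` and both patterns are assumptions. It relies only on the Aadhar number being 12 digits (often printed in groups of four) and on Indian PIN codes being six digits.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the field patterns are assumptions about the extracted
// text, not PET's actual parsing logic.
final class AadharTextParser {
    // 12 digits, optionally printed in groups of four ("1234 5678 9012").
    private static final Pattern AADHAR_NO =
        Pattern.compile("\\b\\d{4}\\s?\\d{4}\\s?\\d{4}\\b");
    // A standalone run of exactly six digits.
    private static final Pattern PIN_CODE = Pattern.compile("\\b\\d{6}\\b");

    // Returns the first 12-digit number found, with any spacing removed.
    static Optional<String> findAadharNumber(String text) {
        Matcher m = AADHAR_NO.matcher(text);
        if (m.find()) {
            return Optional.of(m.group().replaceAll("\\s", ""));
        }
        return Optional.empty();
    }

    // Returns the first six-digit number found.
    static Optional<String> findPinCode(String text) {
        Matcher m = PIN_CODE.matcher(text);
        if (m.find()) {
            return Optional.of(m.group());
        }
        return Optional.empty();
    }
}
```

Both methods return an `Optional` so that a document lacking a field yields an empty result instead of a null. A real document layout would likely need stricter, label-anchored patterns than these.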

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)