
From PDF files to plain text in a WebMatrix site

Gianmaria Gregori


14 Mar 2013

How to use the PDFBox Java library in an ASP.NET Web Pages project

Download Pdf2Text.zip - 4 KB

Introduction

If you want to let users search the documents stored on your site by content, the first task is to convert the formatted documents into plain text.

If your documents are PDF files and your site uses the ASP.NET framework, you have several options: the Converting PDF to Text in C# article lists some solutions you could try.

Following that article, which gave me some useful information, I opted for the PDFBox library.

Although PDFBox is a Java library, you can use it in the .NET Framework with the help of IKVM.NET, an implementation of Java for Mono and the Microsoft .NET Framework.

Through IKVM it's possible to build a .NET version of PDFBox: I used an unofficial .NET release based on the official PDFBox 1.7.0 library, which is hosted at http://pdfbox.lehmi.de/.

As the site I want to build is developed with Web Pages, I created a sample site in WebMatrix to try out the process.

Code description

The site is based on a PdfFile class that reads all the available properties from the PDF file's metadata and extracts the file's text with the PDFBox library:

using System;
using System.Collections.Generic;
using System.Web;
using java.util;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

public class PdfFile
{

    public string Author { get; set; }

    public string Content { get; set; }

    public DateTime Created { get; set; }

    public string Creator { get; set; }

    public string Keywords { get; set; }

    public DateTime Modified { get; set; }

    public string Producer { get; set; }

    public string Subject { get; set; }

    public string Title { get; set; }

    public string Trapped { get; set; }

    public static DateTime CalendarToDateTime(Calendar calendar)
    {
        if (calendar != null)
        {
            int year = calendar.get(Calendar.YEAR);
            int month = calendar.get(Calendar.MONTH) + 1;
            int day = calendar.get(Calendar.DAY_OF_MONTH);
            int hour = calendar.get(Calendar.HOUR_OF_DAY);
            int minute = calendar.get(Calendar.MINUTE);
            int second = calendar.get(Calendar.SECOND);
            int millis = calendar.get(Calendar.MILLISECOND);

            var date = new DateTime(year, month, day, hour, minute, second, millis);

            return date;
        }

        else {
            return DateTime.MinValue;
        }
    }
    
    
    public PdfFile(string FilePath)
    {
        PDDocument PdfDoc = PDDocument.load(FilePath);
        try
        {
            PDDocumentInformation PdfInfo = PdfDoc.getDocumentInformation();

            // Metadata entries are optional: a missing string becomes ""
            Title = (PdfInfo.getTitle() ?? "");
            Subject = (PdfInfo.getSubject() ?? "");
            Author = (PdfInfo.getAuthor() ?? "");
            Creator = (PdfInfo.getCreator() ?? "");
            Producer = (PdfInfo.getProducer() ?? "");
            Keywords = (PdfInfo.getKeywords() ?? "");
            Trapped = (PdfInfo.getTrapped() ?? "");
            Created = CalendarToDateTime(PdfInfo.getCreationDate());
            Modified = CalendarToDateTime(PdfInfo.getModificationDate());

            // Extract the text of the whole document
            PDFTextStripper stripper = new PDFTextStripper();
            Content = stripper.getText(PdfDoc);
        }
        finally
        {
            // Release the file and the resources held by PDFBox
            PdfDoc.close();
        }
    }
}
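
CalendarToDateTime copies the calendar fields one by one, so the resulting DateTime is expressed in whatever time zone the calendar carries. As a possible alternative, the following minimal sketch starts from the milliseconds-since-epoch value that java.util.Calendar exposes through getTimeInMillis() and produces a UTC DateTime. The JavaTime class and the FromJavaMillis method are hypothetical names and are not part of the sample.

```csharp
using System;

// Hypothetical helper, not part of the downloadable sample.
public static class JavaTime
{
    // Java counts time in milliseconds from 1970-01-01T00:00:00Z.
    private static readonly DateTime Epoch =
        new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // Converts a value such as calendar.getTimeInMillis() to a UTC DateTime.
    public static DateTime FromJavaMillis(long millis)
    {
        return Epoch.AddMilliseconds(millis);
    }
}
```

Inside PdfFile this could be called as JavaTime.FromJavaMillis(calendar.getTimeInMillis()), keeping the null check that returns DateTime.MinValue.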

The simple home page lets you upload a .pdf file, saves it into the UploadedFiles folder, passes it to the PdfFile class and saves its content into the Temp folder as a .txt file:

@using Microsoft.Web.Helpers; 

@{
    TimeSpan elapsed = TimeSpan.Zero;
    var fileName = ""; 
    var fileTitle = "";
    var fileSubject = "";
    var fileAuthor = "";
    var fileCreator = "";
    var fileProducer = "";
    var fileKeywords = "";
    DateTime fileCreation = DateTime.MinValue;
    DateTime fileModify = DateTime.MinValue;
    long fileLength = 0;


    if (IsPost){
        var start = DateTime.Now;
        var fileSavePath = ""; 
        var uploadedFile = Request.Files[0]; 
        fileName = Path.GetFileName(uploadedFile.FileName); 
        fileSavePath = Server.MapPath("~/UploadedFiles/" + fileName); 
        uploadedFile.SaveAs(fileSavePath);

        PdfFile file = new PdfFile(fileSavePath);
        fileTitle = file.Title;
        fileSubject = file.Subject;
        fileAuthor = file.Author;
        fileCreator = file.Creator;
        fileProducer = file.Producer;
        fileKeywords = file.Keywords;
        fileCreation = file.Created;
        fileModify = file.Modified;
        fileLength = file.Content.Length;
 
        var destFile = Server.MapPath("~/Temp/Content.txt");
        using (StreamWriter sw = new StreamWriter(destFile)){
            sw.WriteLine(file.Content);
        }
        elapsed = (DateTime.Now - start);
    }   
}

<!DOCTYPE html>

<html lang="en">
    <head>
        <meta charset="utf-8" />
        <title>From PDF to Text</title>
        <link href="~/favicon.ico" rel="shortcut icon" type="image/x-icon" />
        <link href="~/Content/Style.css" rel="stylesheet" type="text/css" />
        <script type="text/javascript">
            function myFunction()
            {
                alert("Hello World!");
            }
        </script>
    </head>
    <body>
        <h2>From PDF to Text</h2>
        <div>
            <form enctype="multipart/form-data" method="post">
                <p><label for="fileUpload">PDF file</label></p>
                @FileUpload.GetHtml( 
                    initialNumberOfFiles:1, 
                    allowMoreFilesToBeAdded:false, 
                    includeFormTag:false, 
                    uploadText:"")
                <div>
                    <input type="submit" name="action" value="Upload" />
                </div>
            </form>
        </div>
        <hr>
        @if(IsPost){
            <div>
                <h3>Uploaded file: @fileName</h3>
                <p>Title: @fileTitle</p>
                <p>Subject: @fileSubject</p>
                <p>Author: @fileAuthor</p>
                <p>Creator: @fileCreator</p>
                <p>Producer: @fileProducer</p>
                <p>Keywords: @fileKeywords</p>
                <p>Created: @fileCreation</p>
                <p>Modified: @fileModify</p>
            </div>
            <hr>
            <div>
                <h3>@fileLength characters extracted in @elapsed</h3>
                @if (fileLength > 0) {
                    var fname = "Content.txt";
                    <input type="button" 
                        onclick="location.href='download.cshtml?filename=/Temp/@fname';" value="Open">
                }
            </div>
        }
    </body>
</html>
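
The page hands every uploaded file straight to PdfFile, whatever it contains. A well-formed PDF begins with the ASCII signature %PDF-, so a cheap check before parsing is possible. The sketch below is only an illustration under that assumption; the PdfSniffer class and the LooksLikePdf method are hypothetical names, not part of the sample.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the downloadable sample.
public static class PdfSniffer
{
    // The ASCII bytes of "%PDF-", the signature at the start of a PDF file.
    private static readonly byte[] Magic = { 0x25, 0x50, 0x44, 0x46, 0x2D };

    // Returns true when the stream begins with the PDF signature.
    public static bool LooksLikePdf(Stream input)
    {
        var header = new byte[Magic.Length];
        int total = 0;

        // Stream.Read may return fewer bytes than requested, so loop.
        while (total < header.Length)
        {
            int read = input.Read(header, total, header.Length - total);
            if (read == 0)
            {
                return false; // the stream ended before the signature was complete
            }
            total += read;
        }

        for (int i = 0; i < Magic.Length; i++)
        {
            if (header[i] != Magic[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

In the home page this could run on a stream opened with File.OpenRead(fileSavePath), after SaveAs and before the PdfFile object is created.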

At the end of the process, the user can download the text file by requesting it from a handler page (download.cshtml):

@{ 
    if(!Request["filename"].IsEmpty()){ 
        var filename = Request["filename"]; 
        Functions.DownloadFile(Server.MapPath(filename));
    } 
}
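
The handler passes whatever arrives in the filename parameter to Server.MapPath, so a request could name any file the site is able to read. One possible guard, sketched below, is to resolve the path and accept it only when it stays inside a known folder. The SafePath class and the IsInsideFolder method are hypothetical names, not part of the sample.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the downloadable sample.
public static class SafePath
{
    // Returns true when candidatePath resolves to a location inside baseFolder.
    public static bool IsInsideFolder(string baseFolder, string candidatePath)
    {
        // GetFullPath collapses "." and ".." segments.
        string root = Path.GetFullPath(baseFolder)
            .TrimEnd(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar)
            + Path.DirectorySeparatorChar;
        string full = Path.GetFullPath(candidatePath);

        // The trailing separator on root stops "Temp" from matching "TempOther".
        return full.StartsWith(root, StringComparison.OrdinalIgnoreCase);
    }
}
```

In download.cshtml the call could look like SafePath.IsInsideFolder(Server.MapPath("~/Temp"), Server.MapPath(filename)), with the download performed only when it returns true.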

Another Point of Interest

Another small point of interest is the function the handler page uses to download the file: I got it from a DotNetSlackers blog, and it is a handy solution for downloading any kind of file:

@functions {
    public static void DownloadFile(string filePath)
    {
        // Create new instance of FileInfo class to get the properties of the file being downloaded
        FileInfo file = new FileInfo(filePath);

        // Checking if file exists
        if (file.Exists)
        {
            // Clear the content of the response
            Response.ClearContent();

            // Add the file name and attachment, which will force the open/cancel/save dialog
            // to show, to the header
            Response.AddHeader("Content-Disposition", "attachment; filename=" + file.Name);

            // Add the file size into the response header
            Response.AddHeader("Content-Length", file.Length.ToString());

            // Set the ContentType
            Response.ContentType = ReturnExtension(file.Extension.ToLower());

            // Write the file into the response
            Response.TransmitFile(file.FullName);

            // End the response
            Response.End();

        }
    }

    private static string ReturnExtension(string fileExtension)
    {
        switch (fileExtension)
        {
            case ".htm":
            case ".html":
            case ".log":
                return "text/HTML";
            case ".txt":
                return "text/plain";
            case ".doc":
                return "application/msword";
            case ".tiff":
            case ".tif":
                return "image/tiff";
            case ".asf":
                return "video/x-ms-asf";
            case ".avi":
                return "video/avi";
            case ".zip":
                return "application/zip";
            case ".xls":
            case ".csv":
                return "application/vnd.ms-excel";
            case ".gif":
                return "image/gif";
            case ".jpg":
            case ".jpeg":
                return "image/jpeg";
            case ".bmp":
                return "image/bmp";
            case ".wav":
                return "audio/wav";
            case ".mp3":
                return "audio/mpeg";
            case ".mpg":
            case ".mpeg":
                return "video/mpeg";
            case ".rtf":
                return "application/rtf";
            case ".asp":
                return "text/asp";
            case ".pdf":
                return "application/pdf";
            case ".fdf":
                return "application/vnd.fdf";
            case ".ppt":
                return "application/mspowerpoint";
            case ".dwg":
                return "image/vnd.dwg";
            case ".msg":
                return "application/msoutlook";
            case ".xml":
            case ".sdxl":
                return "application/xml";
            case ".xdp":
                return "application/vnd.adobe.xdp+xml";
            default:
                return "application/octet-stream";
        }
    }
}
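
The long switch in ReturnExtension can also be written as a table lookup, which makes entries easier to add and compare. The sketch below is one possible shape with only a handful of extensions; the ContentTypes class and the FromExtension method are hypothetical names, not part of the sample.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical alternative to ReturnExtension, not part of the sample.
public static class ContentTypes
{
    // Keys are compared case-insensitively, so ".PDF" and ".pdf" both match.
    private static readonly Dictionary<string, string> Map =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { ".txt", "text/plain" },
            { ".pdf", "application/pdf" },
            { ".zip", "application/zip" },
            { ".gif", "image/gif" },
            { ".jpg", "image/jpeg" },
            { ".jpeg", "image/jpeg" }
        };

    // Returns the content type for an extension such as ".pdf",
    // or a generic binary type when the extension is unknown.
    public static string FromExtension(string extension)
    {
        string contentType;
        if (extension != null && Map.TryGetValue(extension, out contentType))
        {
            return contentType;
        }
        return "application/octet-stream";
    }
}
```

Because the comparer ignores case, the call to ToLower() on the extension would no longer be needed with this variant.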

Using the sample

To get the accompanying sample working:

download and decompress the Pdf2Text.zip file;
start WebMatrix 2 and choose Folder as Site from the Open Site menu;
select the Pdf2TextSite folder and choose "Upgrade to newer version" in the dialog box that follows;
download pdfbox-1.7.0-dlls.zip from http://pdfbox.lehmi.de/ and copy commons-logging.dll, fontbox-1.7.0.dll and pdfbox-1.7.0.dll from it into the bin folder of your new site;
download ikvmbin-7.2.4630.5.zip from http://sourceforge.net/projects/ikvm/files/ and copy IKVM.OpenJDK.Core.dll, IKVM.OpenJDK.SwingAWT.dll, IKVM.OpenJDK.Text.dll, IKVM.OpenJDK.Util.dll and IKVM.Runtime.dll from its bin folder into the bin folder of your site;
in WebMatrix 2, install the ASP.NET Web Helpers Library from the NuGet Gallery.
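
A missing assembly in the bin folder only shows up as a load error when the page runs, so it can help to check the folder after copying the files. The sketch below lists which of the DLLs named in the steps above are absent from a given folder; the BinCheck class and the FindMissing method are hypothetical names, not part of the sample.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the downloadable sample.
public static class BinCheck
{
    // Assembly names taken from the setup steps.
    public static readonly string[] Required =
    {
        "commons-logging.dll", "fontbox-1.7.0.dll", "pdfbox-1.7.0.dll",
        "IKVM.OpenJDK.Core.dll", "IKVM.OpenJDK.SwingAWT.dll",
        "IKVM.OpenJDK.Text.dll", "IKVM.OpenJDK.Util.dll", "IKVM.Runtime.dll"
    };

    // Returns the required file names that are not present in binFolder.
    public static List<string> FindMissing(string binFolder)
    {
        var missing = new List<string>();
        foreach (string name in Required)
        {
            if (!File.Exists(Path.Combine(binFolder, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

From a page in the site, the folder could be passed as Server.MapPath("~/bin"); an empty list means all eight files are in place.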

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
