Introduction
If you want to let visitors search your site's stored documents by content, the first task is to convert the formatted documents into plain text.
If your documents are PDF files and your site uses the ASP.NET framework, you have several options: the article Converting PDF to Text in C# lists some solutions you could try.
Following that article, from which I gathered some useful information, I opted for the PDFBox library.
Although PDFBox is a Java library, you can use it from the .NET Framework with the help of IKVM.NET, an implementation of Java for Mono and the Microsoft .NET Framework.
With IKVM it is possible to build a .NET version of PDFBox: I used an unofficial .NET release, based on the official PDFBox 1.7.0 library, that is hosted at http://pdfbox.lehmi.de/.
As the site I want to build is developed with ASP.NET Web Pages, I created a sample site in WebMatrix to try out the process.
Code description
The site is based on a PdfFile class that reads all the available properties from the PDF file's metadata and extracts the file's text with the PDFBox library:
using System;
using System.Collections.Generic;
using System.Web;
using java.util;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;
public class PdfFile
{
public string Author { get; set; }
public string Content { get; set; }
public DateTime Created { get; set; }
public string Creator { get; set; }
public string Keywords { get; set; }
public DateTime Modified { get; set; }
public string Producer { get; set; }
public string Subject { get; set; }
public string Title { get; set; }
public string Trapped { get; set; }
public static DateTime CalendarToDateTime(Calendar calendar)
{
if (calendar != null)
{
int year = calendar.get(Calendar.YEAR);
int month = calendar.get(Calendar.MONTH) + 1;
int day = calendar.get(Calendar.DAY_OF_MONTH);
int hour = calendar.get(Calendar.HOUR_OF_DAY);
int minute = calendar.get(Calendar.MINUTE);
int second = calendar.get(Calendar.SECOND);
int millis = calendar.get(Calendar.MILLISECOND);
var date = new DateTime(year, month, day, hour, minute, second, millis);
return date;
}
else {
return DateTime.MinValue;
}
}
public PdfFile(string FilePath)
{
PDDocument PdfDoc = PDDocument.load(FilePath);
try
{
// Read the metadata; a missing entry becomes an empty string
PDDocumentInformation PdfInfo = PdfDoc.getDocumentInformation();
Title = (PdfInfo.getTitle() ?? "");
Subject = (PdfInfo.getSubject() ?? "");
Author = (PdfInfo.getAuthor() ?? "");
Creator = (PdfInfo.getCreator() ?? "");
Producer = (PdfInfo.getProducer() ?? "");
Keywords = (PdfInfo.getKeywords() ?? "");
Trapped = (PdfInfo.getTrapped() ?? "");
Created = CalendarToDateTime(PdfInfo.getCreationDate());
Modified = CalendarToDateTime(PdfInfo.getModificationDate());
// Extract the text of the whole document
PDFTextStripper stripper = new PDFTextStripper();
Content = stripper.getText(PdfDoc);
}
finally
{
// Release the file even if the extraction fails
PdfDoc.close();
}
}
}
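The CalendarToDateTime method above copies the calendar fields one at a time, so the result keeps the wall-clock time of whatever time zone the PDF recorded. As a hedged alternative, the sketch below starts from the epoch milliseconds that java.util.Calendar.getTimeInMillis() returns and produces a UTC DateTime. The names JavaTime and FromEpochMillis are hypothetical and are not part of PDFBox or IKVM.

```csharp
using System;

// Hypothetical helper; not part of PDFBox or IKVM.
public static class JavaTime
{
    // Java counts milliseconds from 1970-01-01T00:00:00Z.
    private static readonly DateTime UnixEpoch =
        new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // Converts a value such as calendar.getTimeInMillis() to a UTC DateTime.
    public static DateTime FromEpochMillis(long millis)
    {
        return UnixEpoch.AddMilliseconds(millis);
    }
}
```

Inside the PdfFile constructor this could be used, after a null check on the calendar, as JavaTime.FromEpochMillis(PdfInfo.getCreationDate().getTimeInMillis()). Whether UTC or the original local time is preferable depends on how the dates are shown or indexed.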
The simple home page lets you upload a .pdf file, saves it into the UploadedFiles folder, passes it to the PdfFile class and saves its content into the Temp folder as a .txt file:
@using Microsoft.Web.Helpers;
@{
TimeSpan elapsed = TimeSpan.Zero;
var fileName = "";
var fileTitle = "";
var fileSubject = "";
var fileAuthor = "";
var fileCreator = "";
var fileProducer = "";
var fileKeywords = "";
DateTime fileCreation = DateTime.MinValue;
DateTime fileModify = DateTime.MinValue;
long fileLength = 0;
if (IsPost){
var start = DateTime.Now;
var fileSavePath = "";
var uploadedFile = Request.Files[0];
fileName = Path.GetFileName(uploadedFile.FileName);
fileSavePath = Server.MapPath("~/UploadedFiles/" + fileName);
uploadedFile.SaveAs(fileSavePath);
PdfFile file = new PdfFile(fileSavePath);
fileTitle = file.Title;
fileSubject = file.Subject;
fileAuthor = file.Author;
fileCreator = file.Creator;
fileProducer = file.Producer;
fileKeywords = file.Keywords;
fileCreation = file.Created;
fileModify = file.Modified;
fileLength = file.Content.Length;
var destFile = Server.MapPath("~/Temp/Content.txt");
using (StreamWriter sw = new StreamWriter(destFile)){
sw.WriteLine(file.Content);
}
elapsed = (DateTime.Now - start);
}
}
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>From PDF to Text</title>
<link href="~/favicon.ico" rel="shortcut icon" type="image/x-icon" />
<link href="~/Content/Style.css" rel="stylesheet" type="text/css" />
<script type="text/javascript">
function myFunction()
{
alert("Hello World!");
}
</script>
</head>
<body>
<h2>From PDF to Text</h2>
<div>
<form enctype="multipart/form-data" method="post">
<p><label for="fileUpload">PDF file</label></p>
@FileUpload.GetHtml(
initialNumberOfFiles:1,
allowMoreFilesToBeAdded:false,
includeFormTag:false,
uploadText:"")
<div>
<input type="submit" name="action" value="Upload" />
</div>
</form>
</div>
<hr>
@if(IsPost){
<div>
<h3>Uploaded file: @fileName</h3>
<p>Title: @fileTitle</p>
<p>Subject: @fileSubject</p>
<p>Author: @fileAuthor</p>
<p>Creator: @fileCreator</p>
<p>Producer: @fileProducer</p>
<p>Keywords: @fileKeywords</p>
<p>Created: @fileCreation</p>
<p>Modified: @fileModify</p>
</div>
<hr>
<div>
<h3>@fileLength characters extracted in @elapsed</h3>
@if (fileLength > 0) {
var fname = "Content.txt";
<input type="button"
onclick="location.href='download.cshtml?filename=/Temp/@fname';" value="Open">
}
</div>
}
</body>
</html>
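The page above saves whatever file the browser sends. The following is a minimal sketch of a check that could run before SaveAs, assuming you only want files that have a .pdf extension and begin with the %PDF- signature. UploadCheck and LooksLikePdf are hypothetical names, and the check is deliberately strict: a PDF with leading bytes before its header would be rejected.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for validating an upload before it is saved.
public static class UploadCheck
{
    // Returns true when the name ends in .pdf and the stream starts with "%PDF-".
    // Assumes the stream is positioned at its beginning.
    public static bool LooksLikePdf(string fileName, Stream content)
    {
        if (!string.Equals(Path.GetExtension(fileName), ".pdf",
                           StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }
        var header = new byte[5];
        int read = content.Read(header, 0, header.Length);
        if (content.CanSeek)
        {
            content.Position = 0; // rewind so the caller can still save the file
        }
        return read == header.Length
            && Encoding.ASCII.GetString(header) == "%PDF-";
    }
}
```

In the page this might be called as UploadCheck.LooksLikePdf(uploadedFile.FileName, uploadedFile.InputStream), skipping the save and the extraction when it returns false.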
At the end of the process, the user can download the text file by requesting it from a handler page (download.cshtml):
@{
if(!Request["filename"].IsEmpty()){
var filename = Request["filename"];
Functions.DownloadFile(Server.MapPath(filename));
}
}
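The handler maps whatever path arrives in the filename query string, so a crafted request could reach files outside the Temp folder. As a sketch, assuming that only files inside one known folder should be downloadable and that the site runs on Windows (hence the case-insensitive comparison), the hypothetical helper below resolves the requested name against a base folder and rejects anything that escapes it.

```csharp
using System;
using System.IO;

// Hypothetical helper; SafePath and Resolve are illustrative names.
public static class SafePath
{
    // Returns the full path of fileName inside baseFolder, or null if the
    // name is empty or resolves to a location outside baseFolder.
    // Path.GetFullPath may throw on malformed names; callers may want to catch that.
    public static string Resolve(string baseFolder, string fileName)
    {
        if (string.IsNullOrEmpty(fileName)) return null;
        string root = Path.GetFullPath(baseFolder)
            .TrimEnd(Path.DirectorySeparatorChar) + Path.DirectorySeparatorChar;
        string full = Path.GetFullPath(Path.Combine(root, fileName));
        return full.StartsWith(root, StringComparison.OrdinalIgnoreCase) ? full : null;
    }
}
```

In download.cshtml this might be called as SafePath.Resolve(Server.MapPath("~/Temp"), Request["filename"]), with the link on the home page then passing only Content.txt rather than /Temp/Content.txt.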
Another Point of Interest
Another small point of interest is the function the handler page uses to download the file: I took it from a DotNetSlackers blog, and it is a useful solution for downloading any kind of file:
@functions {
public static void DownloadFile(string filePath)
{
FileInfo file = new FileInfo(filePath);
if (file.Exists)
{
Response.ClearContent();
// Quote the name so that file names containing spaces are sent intact
Response.AddHeader("Content-Disposition", "attachment; filename=\"" + file.Name + "\"");
Response.AddHeader("Content-Length", file.Length.ToString());
Response.ContentType = ReturnExtension(file.Extension.ToLower());
Response.TransmitFile(file.FullName);
Response.End();
}
}
private static string ReturnExtension(string fileExtension)
{
switch (fileExtension)
{
case ".htm":
case ".html":
case ".log":
return "text/HTML";
case ".txt":
return "text/plain";
case ".doc":
return "application/msword";
case ".tiff":
case ".tif":
return "image/tiff";
case ".asf":
return "video/x-ms-asf";
case ".avi":
return "video/avi";
case ".zip":
return "application/zip";
case ".xls":
case ".csv":
return "application/vnd.ms-excel";
case ".gif":
return "image/gif";
case ".jpg":
case ".jpeg":
return "image/jpeg";
case ".bmp":
return "image/bmp";
case ".wav":
return "audio/wav";
case ".mp3":
return "audio/mpeg";
case ".mpg":
case ".mpeg":
return "video/mpeg";
case ".rtf":
return "application/rtf";
case ".asp":
return "text/asp";
case ".pdf":
return "application/pdf";
case ".fdf":
return "application/vnd.fdf";
case ".ppt":
return "application/vnd.ms-powerpoint";
case ".dwg":
return "image/vnd.dwg";
case ".msg":
return "application/vnd.ms-outlook";
case ".xml":
case ".sdxl":
return "application/xml";
case ".xdp":
return "application/vnd.adobe.xdp+xml";
default:
return "application/octet-stream";
}
}
}
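The switch above has to be edited each time a new extension is needed, and it relies on the caller lower-casing the extension first. A possible alternative, shown here only as a sketch with a handful of the same mappings, is a case-insensitive dictionary lookup. MimeTypes and FromExtension are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical lookup table; extend it with the extensions your site serves.
public static class MimeTypes
{
    // Case-insensitive keys, so callers need not lower-case the extension.
    private static readonly Dictionary<string, string> Map =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { ".txt", "text/plain" },
            { ".pdf", "application/pdf" },
            { ".zip", "application/zip" },
            { ".gif", "image/gif" },
            { ".jpg", "image/jpeg" },
            { ".jpeg", "image/jpeg" }
        };

    // Returns the mapped type, or a generic binary type for unknown extensions.
    public static string FromExtension(string extension)
    {
        string mime;
        return extension != null && Map.TryGetValue(extension, out mime)
            ? mime
            : "application/octet-stream";
    }
}
```

With this in place, DownloadFile could set Response.ContentType = MimeTypes.FromExtension(file.Extension) without calling ToLower.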
Using the sample
To get the accompanying sample working:
- download and decompress the Pdf2Text.zip file;
- start WebMatrix 2 and choose Folder as Site from the Open Site menu;
- select the Pdf2TextSite folder and choose "Upgrade to newer version" from the dialog box that follows;
- download pdfbox-1.7.0-dlls.zip from http://pdfbox.lehmi.de/ and copy commons-logging.dll, fontbox-1.7.0.dll and pdfbox-1.7.0.dll from this file into the bin folder of your new site;
- download ikvmbin-7.2.4630.5.zip from http://sourceforge.net/projects/ikvm/files/ and copy IKVM.OpenJDK.Core.dll, IKVM.OpenJDK.SwingAWT.dll, IKVM.OpenJDK.Text.dll, IKVM.OpenJDK.Util.dll and IKVM.Runtime.dll from the bin folder of this file into the bin folder of your site;
- in WebMatrix 2, install the ASP.NET Web Helpers Library from the NuGet Gallery.
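Because the sample depends on eight assemblies copied by hand, a missing file is an easy mistake to make. The sketch below is one way to list which of the assemblies named in the steps above are absent from a bin folder. BinCheck and Missing are hypothetical names, and the file names are simply those given in the steps.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for checking the manual setup.
public static class BinCheck
{
    // Assembly names taken from the setup steps.
    private static readonly string[] Required =
    {
        "commons-logging.dll", "fontbox-1.7.0.dll", "pdfbox-1.7.0.dll",
        "IKVM.OpenJDK.Core.dll", "IKVM.OpenJDK.SwingAWT.dll",
        "IKVM.OpenJDK.Text.dll", "IKVM.OpenJDK.Util.dll", "IKVM.Runtime.dll"
    };

    // Returns the required assemblies that are not present in binFolder.
    public static List<string> Missing(string binFolder)
    {
        var missing = new List<string>();
        foreach (string name in Required)
        {
            if (!File.Exists(Path.Combine(binFolder, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

From a page in the site this might be called as BinCheck.Missing(Server.MapPath("~/bin")); an empty list means all the files named in the steps are present.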