Introduction
While looking for examples of how to extract text from a PDF, I didn't find much. There are a few, but they cost money. I found an example written in Java and converted it to VB.NET, with some additions and different logic. The code in this application is very incomplete. It will eventually be used in an automated process that uses a file watcher to extract text from PDFs and then formats the text for insertion into a SQL Server database. I hope that someone finds this code, and any recommended changes or updates, useful.
Using the code
The code is pretty easy to use. Both test functions are stored in the class ExtractPDF. The function that extracts the text requires a PDF file name and a password. The password can be Nothing, in which case it is ignored. If the PDF file is password-protected, a valid password must be converted to a Byte array and then passed in. itextsharp.dll needs to be referenced. The source code files for itextsharp.dll are also available.
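As a minimal sketch of the password conversion described above, the helper below turns a password string into a Byte array. The module name PasswordExample and the function PasswordToBytes are hypothetical and not part of the original project, and the choice of ASCII encoding is an assumption that only holds for passwords made of plain ASCII characters.

```vbnet
Imports System
Imports System.Text

Module PasswordExample
    ' Hypothetical helper: converts a password string to a Byte array.
    ' Returns Nothing for a missing or empty password, so the result can
    ' be passed straight through for files that are not protected.
    Function PasswordToBytes(ByVal password As String) As Byte()
        If String.IsNullOrEmpty(password) Then
            Return Nothing
        End If
        ' Assumption: the password contains only ASCII characters.
        Return Encoding.ASCII.GetBytes(password)
    End Function

    Sub Main()
        Dim bytes As Byte() = PasswordToBytes("secret")
        ' One byte per ASCII character.
        Console.WriteLine(bytes.Length)
    End Sub
End Module
```

The returned array (or Nothing) can then be handed to the extract function together with the PDF file name.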
I have two nested Select Case statements in the function, so that new options, formats, or anything else that turns up in a PDF file can be read and the appropriate action taken.
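    ' Walk through every token in the content stream.
    While (Token.NextToken)
        Select Case Token.TokenType
            Case Token.TK_STRING
                ' A string token holds text that is shown on the page.
                StrBuf.Append(Token.StringValue)
            Case Token.TK_OTHER
                Select Case Token.StringValue
                    Case "ET"
                        ' ET ends a text object, so start a new line.
                        StrBuf.Append(vbCrLf)
                End Select
        End Select
    End While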
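To show how the inner Select Case can grow, here is a self-contained sketch that runs the same loop over a hand-built token list instead of a real PDF. TokenKind, PdfToken, and ExtractText are hypothetical stand-ins, not iTextSharp types. Treating the PDF operator T* (move to the start of the next line) as a line break is my own addition, shown only as an example of a second case.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text
Imports Microsoft.VisualBasic

Module TokenLoopSketch
    ' Hypothetical stand-ins for the tokenizer's token types.
    Enum TokenKind
        Str
        Other
    End Enum

    ' Hypothetical token: a kind plus its text value.
    Structure PdfToken
        Public Kind As TokenKind
        Public Value As String
        Public Sub New(ByVal k As TokenKind, ByVal v As String)
            Kind = k
            Value = v
        End Sub
    End Structure

    Function ExtractText(ByVal tokens As IEnumerable(Of PdfToken)) As String
        Dim StrBuf As New StringBuilder()
        For Each Tok As PdfToken In tokens
            Select Case Tok.Kind
                Case TokenKind.Str
                    StrBuf.Append(Tok.Value)
                Case TokenKind.Other
                    Select Case Tok.Value
                        Case "ET", "T*"
                            ' End of text object or next-line operator.
                            StrBuf.Append(vbCrLf)
                    End Select
            End Select
        Next
        Return StrBuf.ToString()
    End Function

    Sub Main()
        Dim tokens As New List(Of PdfToken)
        tokens.Add(New PdfToken(TokenKind.Other, "BT"))
        tokens.Add(New PdfToken(TokenKind.Str, "First line"))
        tokens.Add(New PdfToken(TokenKind.Other, "T*"))
        tokens.Add(New PdfToken(TokenKind.Str, "Second line"))
        tokens.Add(New PdfToken(TokenKind.Other, "ET"))
        Console.Write(ExtractText(tokens))
    End Sub
End Module
```

Operators that are not listed, such as BT here, simply fall through, so further cases can be added one at a time as new PDFs need them.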
Update
I have updated the program and figured out why I was getting the cast error. Sometimes the object is returned as an array rather than as a single object. There is probably a smarter way to handle both cases in one loop, but for now I store the streams in an ArrayList and process them later:
    Dim Stream As New ArrayList
    If objectref.IsArray Then
        ' An array of references: resolve and store each one.
        Dim Counter As Integer
        For Counter = 0 To objectref.ArrayList.Count - 1
            Stream.Add(Reader.GetPdfObject(objectref.ArrayList(Counter)))
        Next
    Else
        ' A single reference: resolve and store it directly.
        Stream.Add(Reader.GetPdfObject(objectref))
    End If
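One possible way to avoid the array check altogether is to let the reader hand back the page content directly. The sketch below is untested against this project and rests on two assumptions: that the referenced version of itextsharp.dll provides PdfReader.GetPageContent, and that PRTokeniser can be constructed from a Byte array and exposes the TK_STRING and TK_OTHER constants, as the loop earlier in this article suggests. The module and function names are hypothetical.

```vbnet
Imports System.Text
Imports Microsoft.VisualBasic
Imports iTextSharp.text.pdf

Module PageContentSketch
    ' Hypothetical helper: returns the text of one page.
    ' Requires a reference to itextsharp.dll.
    Function ExtractPageText(ByVal Reader As PdfReader, ByVal PageNumber As Integer) As String
        ' GetPageContent returns the page's content bytes whether the
        ' page stores them as one stream or as an array of streams.
        Dim Content As Byte() = Reader.GetPageContent(PageNumber)
        Dim Token As New PRTokeniser(Content)
        Dim StrBuf As New StringBuilder()
        While Token.NextToken()
            Select Case Token.TokenType
                Case PRTokeniser.TK_STRING
                    StrBuf.Append(Token.StringValue)
                Case PRTokeniser.TK_OTHER
                    If Token.StringValue = "ET" Then StrBuf.Append(vbCrLf)
            End Select
        End While
        Return StrBuf.ToString()
    End Function
End Module
```

If this holds for the library version in use, a single loop from page 1 to Reader.NumberOfPages could call the helper for each page, and no intermediate ArrayList would be needed.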
This code is far from complete, but I thought it would help some VB programmer out there, as the other examples I found were in C# (funny that itextsharp.dll is written entirely in C#). If anybody has any additions, please feel free to use the code.