Introduction
This article shows how to extract the bookmarks from a PDF file and display them as a tree.
Background
Each PDF file encapsulates a complete description of a fixed-layout flat
document, including the text, fonts, graphics, and other information
needed to display it. Bookmarks in a PDF file are stored as objects.
Using the code
This application uses the iTextSharp DLL to extract the raw bookmarks from PDF files, so you need to add a reference to the iTextSharp DLL to your project.
iTextSharp can return the bookmarks either as raw data or in XML format. The raw bookmarks are the starting point for processing, and a single call retrieves them:
IList<Dictionary<string, object>> book_mark = SimpleBookmark.GetBookmark(reader);
The book_mark variable stores all the bookmarks as a list of dictionaries. Each bookmark has its own properties, such as colour, page number, and whether it has children.
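To see which properties a particular bookmark actually carries, it can help to dump its key-value pairs. The following is a minimal sketch; BookmarkInspector and Describe are hypothetical names introduced only for illustration, and the code works on any dictionary of the shape described above, so it does not need iTextSharp itself.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of iTextSharp.
static class BookmarkInspector
{
    // Returns one line per property, for example "Title = Chapter 1".
    // A nested bookmark list is summarized by its count instead of being printed.
    public static List<string> Describe(Dictionary<string, object> bookmark)
    {
        var lines = new List<string>();
        foreach (KeyValuePair<string, object> entry in bookmark)
        {
            var children = entry.Value as IList<Dictionary<string, object>>;
            string shown = children != null
                ? "(" + children.Count + " nested bookmarks)"
                : Convert.ToString(entry.Value);
            lines.Add(entry.Key + " = " + shown);
        }
        return lines;
    }
}
```

Calling BookmarkInspector.Describe(book_mark[0]) on the first entry of the list and printing the result shows the keys that are available for that bookmark.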
Here I will discuss only how to extract each bookmark's name and its corresponding page number.
If a bookmark has children, they are stored under the Kids key of its dictionary.
The bookmark's name is stored under the Title key.
The bookmark's page number is stored under the Page key.
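So the code to read one level of bookmarks looks like this:
foreach (Dictionary<string, object> bk in book_mark)
{
    foreach (KeyValuePair<string, object> kvr in bk)
    {
        if (kvr.Key == "Kids" || kvr.Key == "kids")
        {
            // kvr.Value holds the nested bookmarks, again as a list of dictionaries.
        }
        else if (kvr.Key == "Title" || kvr.Key == "title")
        {
            // The bookmark's display name.
            string name = kvr.Value.ToString();
        }
        else if (kvr.Key == "Page" || kvr.Key == "page")
        {
            // The first run of digits in the value is taken as the page number.
            string pageNumber = Regex.Match(kvr.Value.ToString(), "[0-9]+").Value;
        }
    }
}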
Nested bookmarks are handled the same way, by running the same search recursively on the Kids list.
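The recursion can be sketched without any user interface by flattening the bookmarks into a list of entries with a depth. This is only an illustration under some assumptions: OutlineEntry and OutlineWalker are hypothetical names, the keys are looked up with TryGetValue instead of being iterated, and a Page value is assumed to contain the page number as its first run of digits, as the regular expression above implies.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper type for this sketch; not part of iTextSharp.
class OutlineEntry
{
    public int Depth;      // 0 for top-level bookmarks
    public string Title;
    public int Page;       // 0 when no page number could be read
}

static class OutlineWalker
{
    // Takes the first run of digits in a "Page" value as the page number.
    public static int ParsePage(object value)
    {
        Match m = Regex.Match(Convert.ToString(value), "[0-9]+");
        int page;
        return m.Success && int.TryParse(m.Value, out page) ? page : 0;
    }

    // Appends every bookmark in items, and then its children, to output.
    public static void Walk(IList<Dictionary<string, object>> items,
                            int depth, List<OutlineEntry> output)
    {
        if (items == null) return;
        foreach (Dictionary<string, object> item in items)
        {
            object title, page, kids;
            item.TryGetValue("Title", out title);
            item.TryGetValue("Page", out page);
            output.Add(new OutlineEntry
            {
                Depth = depth,
                Title = Convert.ToString(title),
                Page = ParsePage(page)
            });
            // Recurse into the nested bookmarks, one level deeper.
            if (item.TryGetValue("Kids", out kids))
                Walk(kids as IList<Dictionary<string, object>>, depth + 1, output);
        }
    }
}
```

Looking the keys up by name avoids depending on the order in which the dictionary enumerates them, which is one possible design choice; the key-by-key loop used in this article works as long as Title is seen before Page and Kids.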
Now the bookmark names and their corresponding page numbers have to be shown as a tree. I use the .NET TreeView control for this.
Whenever I find a bookmark's name and page number, I add a node to the TreeView, and whenever I find children, I perform a recursive search on them.
Searching the bookmarks and adding them to the TreeView is a slow process, and the user interface may freeze while it runs.
So I used a BackgroundWorker to do the work.
The whole code is given below.
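// Adds every bookmark in ilist, and its children, below the parent node tnt.
public void recursive_search(IList<Dictionary<string, object>> ilist, TreeNode tnt)
{
    foreach (Dictionary<string, object> bk in ilist)
    {
        // Node for the current bookmark. It is local so that the recursive
        // call below cannot overwrite it. The loop relies on "Title" being
        // enumerated before "Page" and "Kids".
        TreeNode tn = null;
        foreach (KeyValuePair<string, object> kvr in bk)
        {
            if (kvr.Key == "Kids" || kvr.Key == "kids")
            {
                IList<Dictionary<string, object>> child =
                    (IList<Dictionary<string, object>>)kvr.Value;
                // The children are added below the current bookmark's node.
                recursive_search(child, tn);
            }
            else if (kvr.Key == "Title" || kvr.Key == "title")
            {
                tn = new System.Windows.Forms.TreeNode(kvr.Value.ToString());
            }
            else if (kvr.Key == "Page" || kvr.Key == "page")
            {
                // The page number is kept in the tooltip of the node.
                tn.ToolTipText = Regex.Match(kvr.Value.ToString(), "[0-9]+").Value;
                tnt.Nodes.Add(tn);
            }
        }
    }
}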
// Runs on the BackgroundWorker thread. reader and treeView1 are fields of the form.
void bgw_DoWork(string reader_name)
{
    reader = new iTextSharp.text.pdf.PdfReader(reader_name);
    IList<Dictionary<string, object>> book_mark = SimpleBookmark.GetBookmark(reader);
    // Guard against a PDF without bookmarks, where no list is returned.
    if (book_mark == null) return;
    foreach (Dictionary<string, object> bk in book_mark)
    {
        // Node for the current top-level bookmark.
        TreeNode tn = null;
        foreach (KeyValuePair<string, object> kvr in bk)
        {
            if (kvr.Key == "Kids" || kvr.Key == "kids")
            {
                IList<Dictionary<string, object>> child =
                    (IList<Dictionary<string, object>>)kvr.Value;
                // The TreeView may only be changed on the UI thread, hence Invoke.
                treeView1.Invoke((MethodInvoker)(() => recursive_search(child, tn)));
            }
            else if (kvr.Key == "Title" || kvr.Key == "title")
            {
                tn = new System.Windows.Forms.TreeNode(kvr.Value.ToString());
            }
            else if (kvr.Key == "Page" || kvr.Key == "page")
            {
                tn.ToolTipText = Regex.Match(kvr.Value.ToString(), "[0-9]+").Value;
                treeView1.Invoke((MethodInvoker)(() => treeView1.Nodes.Add(tn)));
            }
        }
    }
}
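An alternative way to keep the interface responsive is to build the whole node tree on the worker thread and attach it to the TreeView in one step when the work is finished. The sketch below is not the code of this application: OutlineTreeBuilder, BuildNodes, Start and the loadBookmarks callback are hypothetical names, and the callback stands in for the PdfReader and SimpleBookmark.GetBookmark calls shown above. It assumes that TreeNode objects which are not yet attached to a control can be created off the UI thread.

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Windows.Forms;

// Hypothetical helper for illustration; not part of the original application.
static class OutlineTreeBuilder
{
    // Builds detached nodes only, so no control is touched here.
    public static TreeNode[] BuildNodes(IList<Dictionary<string, object>> items)
    {
        var nodes = new List<TreeNode>();
        if (items == null) return nodes.ToArray();
        foreach (Dictionary<string, object> item in items)
        {
            object title, page, kids;
            item.TryGetValue("Title", out title);
            var node = new TreeNode(Convert.ToString(title));
            if (item.TryGetValue("Page", out page))
                node.ToolTipText = Convert.ToString(page);
            if (item.TryGetValue("Kids", out kids))
                node.Nodes.AddRange(BuildNodes(kids as IList<Dictionary<string, object>>));
            nodes.Add(node);
        }
        return nodes.ToArray();
    }

    // Call this from the UI thread so that RunWorkerCompleted is raised there.
    public static BackgroundWorker Start(TreeView tree, string path,
        Func<string, IList<Dictionary<string, object>>> loadBookmarks)
    {
        var worker = new BackgroundWorker();
        // Worker thread: read the bookmarks and build the nodes.
        worker.DoWork += (sender, e) =>
        {
            e.Result = BuildNodes(loadBookmarks((string)e.Argument));
        };
        // Back on the UI thread: attach all nodes in one batch.
        worker.RunWorkerCompleted += (sender, e) =>
        {
            if (e.Error != null) { MessageBox.Show(e.Error.Message); return; }
            tree.BeginUpdate();
            tree.Nodes.Clear();
            tree.Nodes.AddRange((TreeNode[])e.Result);
            tree.EndUpdate();
        };
        worker.RunWorkerAsync(path);
        return worker;
    }
}
```

With this arrangement the TreeView is touched only once, inside RunWorkerCompleted, so no Invoke calls are needed while the bookmarks are being read. The trade-off is that nothing appears in the tree until the whole file has been processed.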
Points of Interest
The PDF file format is really interesting: it keeps all of its data as objects.
Extracting data from a PDF file is easy, but you have to know the file format very well.