I want to extract text that is highlighted in a box from a PDF file or an image in C#

Question

Tags: C#3.5, ASP.NET4

I want to extract specific text from a PDF file into Excel using C#.

Posted 27-Feb-15 22:21pm

Manoj Sawant

Comments

Zoltán Zörgő 28-Feb-15 4:24am

You can use OCR for that. Actually only OCR.

2 solutions

Solution 1

You will need to use a library such as iTextSharp to read the content.
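
As a rough sketch of how this might look with iTextSharp 5.x, the snippet below reads only the text chunks that intersect a rectangle on one page. The file name, page number, and box coordinates are placeholders, not values from the question. Coordinates are in PDF points with the origin at the bottom-left of the page. This only helps when the boxed content is stored as real text rather than as a scanned image.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class BoxTextExtractor
{
    // Returns the text chunks that intersect the given rectangle on one page.
    // x, y are the lower-left corner of the box in PDF points.
    static string ExtractTextInBox(string pdfPath, int pageNumber,
                                   float x, float y, float width, float height)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            System.util.RectangleJ box = new System.util.RectangleJ(x, y, width, height);
            RenderFilter[] filters = new RenderFilter[] { new RegionTextRenderFilter(box) };
            ITextExtractionStrategy strategy =
                new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filters);
            return PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
        }
        finally
        {
            reader.Close();
        }
    }

    static void Main()
    {
        // Hypothetical file and coordinates: measure the box in your own document.
        string text = ExtractTextInBox("input.pdf", 1, 50f, 600f, 300f, 100f);
        Console.WriteLine(text);
    }
}
```

The returned string can then be written to a spreadsheet cell with whatever Excel library you already use.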

Posted 27-Feb-15 22:39pm

Richard MacCutchan

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

**Zoltán Zörgő** · Accepted Answer · 2015-02-27T22:59:00

There are libraries that let you read a PDF's content as a document, but it is not as straightforward as it looks. You will not be able to read encrypted or obfuscated PDF files. Moreover, there is no guarantee that the content you are looking for is actually text rather than a raster image or vector graphics. And even if you do extract the text somehow, it will be hard to match it to your region of interest (ROI).
So, if you don't use OCR, you will end up with a solution that works only in specific situations, not a general one.
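
To illustrate the ROI-matching problem, here is a minimal sketch that assumes your PDF library can give you each piece of text together with its bounding box. `TextChunk` and `RoiMatcher` are hypothetical names invented for this example, not part of any library. A chunk is kept when its center lies inside the ROI, and the hits are joined in a simple reading order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container for one extracted piece of text and its bounding box,
// in PDF-style coordinates (origin bottom-left, y grows upward).
class TextChunk
{
    public string Text;
    public float Left, Bottom, Right, Top;

    public TextChunk(string text, float left, float bottom, float right, float top)
    {
        Text = text; Left = left; Bottom = bottom; Right = right; Top = top;
    }
}

static class RoiMatcher
{
    // Keeps the chunks whose center lies inside the ROI and joins them in reading order.
    public static string TextInsideRoi(IEnumerable<TextChunk> chunks,
                                       float roiLeft, float roiBottom,
                                       float roiRight, float roiTop)
    {
        List<TextChunk> hits = new List<TextChunk>();
        foreach (TextChunk c in chunks)
        {
            float cx = (c.Left + c.Right) / 2f;
            float cy = (c.Bottom + c.Top) / 2f;
            if (cx >= roiLeft && cx <= roiRight && cy >= roiBottom && cy <= roiTop)
                hits.Add(c);
        }

        // Reading order: higher on the page first, then left to right.
        hits.Sort(delegate(TextChunk a, TextChunk b)
        {
            int byRow = b.Top.CompareTo(a.Top);
            return byRow != 0 ? byRow : a.Left.CompareTo(b.Left);
        });

        List<string> words = new List<string>();
        foreach (TextChunk c in hits) words.Add(c.Text);
        return string.Join(" ", words.ToArray());
    }
}
```

The sort compares exact `Top` values, so chunks on the same visual line with slightly different heights may come out of order. Real documents usually need a tolerance when grouping chunks into lines, which is part of why this approach is fragile.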
There are some OCR engines you can use for free, such as Tesseract. But as it has no native PDF support, you will need some pre-processing (for example, rendering the PDF pages to images first).
So I suggest you look for a good but inexpensive commercial solution, like this one: http://www.abbyy.com/ocr_sdk_windows/ (Abbyy is really great).
Alternatively, you could try Adobe PDF IFilter with C# (see "Using IFilter in C#").
Newer Windows versions include an OCR engine that could be used, but I have no further knowledge of its capabilities.
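
For the Tesseract route, a minimal sketch of the OCR step could look like the following. It assumes the open-source Tesseract .NET wrapper (the `Tesseract` NuGet package), which the answer does not name, and it assumes the PDF page has already been rendered to an image by some other tool. The paths and the pixel coordinates of the box are placeholders.

```csharp
using System;
using Tesseract;

class BoxOcr
{
    static void Main()
    {
        // Hypothetical paths: "page1.png" is a page already rendered from the PDF,
        // "./tessdata" holds the English trained data files.
        using (TesseractEngine engine = new TesseractEngine("./tessdata", "eng", EngineMode.Default))
        using (Pix image = Pix.LoadFromFile("page1.png"))
        {
            // Box in pixels, from (left, top) to (right, bottom). Placeholder values.
            Rect box = Rect.FromCoords(100, 200, 700, 320);

            // Recognize only the region inside the box.
            using (Page page = engine.Process(image, box))
            {
                Console.WriteLine(page.GetText());
                Console.WriteLine("Mean confidence: " + page.GetMeanConfidence());
            }
        }
    }
}
```

Restricting recognition to the box avoids picking up surrounding text, but the result still depends heavily on the resolution and quality of the rendered image.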