Extract paragraphs from a PDF

Question

I need to extract paragraphs from a PDF using a free library for C# and Visual Studio. If a paragraph continues on the next page, it should be returned as one paragraph, not two. Does anyone have an example in C#? I'm trying to find out whether iText 8 does this, but I can't find the answer.

What I have tried:

(Nothing, apparently; typing the word "itext" 28 times doesn't count.)

Posted 1-Aug-24 13:34pm

Member 14890678

Updated 1-Aug-24 21:24pm

Richard Deeming

v2

2 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

OriginalGriff · Answer 1 · 2024-08-01T19:01:00

Solution 1

Try iTextSharp 5.5.13.4 from the NuGet Gallery: it's free and pretty much the standard.
It's a C# port of the Java iText library, and there is plenty of documentation to be found via Google.

Pete O'Hanlon · Answer 2 · 2024-08-01T20:54:00

When you look at a PDF, it is tempting to think that what you see on screen has the same structure as a printed page, and therefore that a PDF file understands concepts such as paragraphs. Unfortunately, this is not the case: PDFs do not support layout constructs such as paragraphs. Instead, they contain text streams that are laid out onto individual pages as a series of text runs.

One of the confusing things is that iTextSharp (and other libraries) uses the idea of a Paragraph as an abstraction when writing documents. This gives the impression that a paragraph is an actual property of PDFs. So sorry, there's no foolproof way to work out paragraphs.
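
That said, if a best guess is good enough, you can rebuild paragraphs heuristically from the positioned lines of text. The following is only a rough sketch under several assumptions, not something any PDF library does for you: TextLine and ParagraphGrouper are made-up names, the lines are assumed to arrive already in reading order, Top is assumed to be measured downwards from the top of the page (raw PDF coordinates run the other way), and getting the lines out of the PDF in the first place is left to whichever library you choose. It starts a new paragraph when the vertical gap is noticeably larger than a line, and it joins text across a page break when the last line on the page does not end a sentence. It needs C# 9 or later for the record type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical input type: one extracted line of text with its page number,
// the distance of its top edge from the top of the page, and its height.
public record TextLine(int Page, double Top, double Height, string Text);

public static class ParagraphGrouper
{
    // Groups lines (already in reading order) into paragraph strings.
    // gapFactor is a guess: a gap of more than 1.5 line heights counts as a break.
    public static List<string> Group(IEnumerable<TextLine> lines, double gapFactor = 1.5)
    {
        // Ignore empty lines so they cannot confuse the gap measurement.
        var list = lines.Where(l => !string.IsNullOrWhiteSpace(l.Text)).ToList();
        var paragraphs = new List<string>();
        var current = new List<string>();

        for (int i = 0; i < list.Count; i++)
        {
            if (i > 0 && StartsNewParagraph(list[i - 1], list[i], gapFactor))
            {
                paragraphs.Add(string.Join(" ", current));
                current.Clear();
            }
            current.Add(list[i].Text.Trim());
        }

        if (current.Count > 0)
        {
            paragraphs.Add(string.Join(" ", current));
        }
        return paragraphs;
    }

    private static bool StartsNewParagraph(TextLine previous, TextLine line, double gapFactor)
    {
        if (line.Page != previous.Page)
        {
            // Positions on different pages cannot be compared, so fall back to
            // punctuation: an unfinished sentence is assumed to continue.
            return EndsSentence(previous.Text);
        }

        // Same page: compare the distance between line tops with the line height.
        double gap = line.Top - previous.Top;
        return gap > gapFactor * previous.Height;
    }

    private static bool EndsSentence(string text)
    {
        string trimmed = text.TrimEnd();
        return trimmed.Length > 0 && ".!?:".IndexOf(trimmed[trimmed.Length - 1]) >= 0;
    }
}
```

You would call it along these lines, with the sample values here being invented purely for illustration:

```csharp
var lines = new[]
{
    new TextLine(1, 100, 12, "First paragraph, line one"),
    new TextLine(1, 114, 12, "and line two."),
    new TextLine(1, 140, 12, "Second paragraph runs off the page and"),
    new TextLine(2, 50, 12, "continues here."),
};

foreach (string paragraph in ParagraphGrouper.Group(lines))
{
    Console.WriteLine(paragraph);
}
```

Expect this to go wrong on headers and footers, multi-column layouts, tables, and paragraphs that happen to end exactly at the bottom of a page, so treat the output as a guess and tune gapFactor against your own documents.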