How to import data from a PDF file to SQL Server 2005?
|
4anusha4 wrote: import data
What do you mean by data? The text data inside the PDF file, or the whole PDF file?
|
Yes.
The PDF contains the data arranged column-wise, and I want to import all of it into SQL Server.
|
Well, if so, it will not be so easy.
First you need to find out how to extract the text from the PDF file.
A quick Google search will give you an idea of how to do so.
Here are also a few links you can go through:
1[^]
2[^]
3[^]
|
PDF files cannot be imported directly in this way; you will need to convert the PDF table to some text form that can be imported. The iTextSharp[^] library can help you.
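Something along these lines should get you the raw text out (a sketch only, assuming the iTextSharp 5.x API; the file name is just an example, and splitting the text into columns is still up to you):
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

PdfReader reader = new PdfReader(@"C:\reports\report.pdf");
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    string pageText = PdfTextExtractor.GetTextFromPage(reader, page);
    // TODO: parse pageText into rows/columns and INSERT them into SQL Server
}
reader.Close();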
Just say 'NO' to evaluated arguments for diadic functions! Ash
|
You will need to store the entire PDF in a BLOB field.
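Roughly, storing the file could look something like this (only a sketch; the Documents table, column names, file path and connection string are made up for illustration, and it assumes System.IO and System.Data.SqlClient):
byte[] pdfBytes = File.ReadAllBytes(@"C:\reports\report.pdf");
using (SqlConnection con = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand(
    "INSERT INTO Documents (Name, Pdf) VALUES (@name, @pdf)", con))
{
    cmd.Parameters.AddWithValue("@name", "report.pdf");
    cmd.Parameters.Add("@pdf", SqlDbType.VarBinary, -1).Value = pdfBytes;   // -1 means varbinary(max)
    con.Open();
    cmd.ExecuteNonQuery();
}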
The funniest thing about this particular signature is that by the time you realise it doesn't say anything it's too late to stop reading it.
My latest tip/trick
Visit the Hindi forum here.
|
How do I store the PDF in a BLOB field?
|
Could anyone please suggest a good book on
a) Network Programming
b) Security
both for beginner and advanced level?
Regards
|
I wouldn't dive right into .NET Remoting; first I would learn the basics of C#. Anyway, I personally like the Wrox books; they give you enough to get going. For network programming you can find plenty of articles right here at CodeProject and across the net.
|
Jacob D Dixon wrote: I personally like the Wrox books.
Me too. I have Wrox Professional C# 4.0.
But it doesn't go into much detail on networking or threading.
Jacob D Dixon wrote: plenty of articles right here at codeproject and across the net
Yes, I have, but they only discuss the particular part involved in that article, not the whole topic, whereas a book starts a subject from scratch and progresses bit by bit.
So, could you suggest a good book on network programming, threading, and security?
|
Go to amazon.com, read the reviews, and work it out for yourself. I could recommend a book, but I like to dive in, have a go, and then develop a deeper understanding; you may be the complete opposite. Everyone has their own learning style.
|
Robert Croll wrote: I could recommend a book
Then please do so.
I have Wrox Professional, but on threading it only gives an introduction. When it comes to thread synchronisation there is a note along the lines of:
"Threading can't be covered fully in this chapter; a whole book would be required." But it doesn't say which book.
The same applies to network programming, and there is nothing about security.
It's not that I haven't searched.
This is what I found.
But it's from back in 2005; I went through some of its online chapters, there are quite a few typos, and the code there is for 1.1.
It would be helpful if you could point me to where/how to find a good book.
|
Well, there isn't much out there in print on networking for .NET. That link was the book I was thinking about. It got mixed reviews; I guess you need to get your hands on the book and see what you think, because some readers thought it was good.
BTW, I'm guessing you know this, but threading isn't networking.
Maybe try to find some articles online regarding .NET and TCP/IP.
|
I've been playing...
A short while back I posted about writing an app which reads csv data from a folder of files I save from emailed reports, and adds each record to a SQL Server DB. Due to other work tasks, I haven't done much with it lately, but I'm back to it now. Previously I'd managed to select the files, strip off the ten lines or so of descriptive data at the beginning of each, identify and classify data lines using regexes, then write the cleaned up data to a text file. That's about where I got distracted, and it's just as well, since I was stuck on the best way to transfer the information to a SQL Server instance. The problem is duplicates...
My data source sometimes sends consolidated reports that duplicate data already received, and it's not particularly easy to tell from the emails which files contain duplicates. My code thus far doesn't have that capability, and I really don't look forward to writing it, but neither do I want to deal with SQL Server bouncing my transactions because they contain duplicate data. So I did some digging and found a gem.
I found that the Queue collection has a .Contains method which looked interesting. I dumped the output text file from my previous version, and just enqueued each cleaned up record as I processed it; that worked great. Then I added code to skip a record if OutQueue.Contains(record) returned true. Wow! It worked like a charm, and took all of ten minutes to implement.
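The relevant bit is roughly this (simplified, with illustrative names):
Queue<string> outQueue = new Queue<string>();
foreach (string record in cleanedRecords)
{
    if (!outQueue.Contains(record))   // linear scan; fine for a few hundred records
        outQueue.Enqueue(record);
}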
Whee! It's fun trying new stuff. But I'm curious, having never used a Queue before... My application will never see more than a few hundred records at a time, so efficiency isn't a big issue. But I'm thinking that, at some level of complexity - thousands, hundreds of thousands, millions perhaps - the Contains method will become less efficient than other methods of detecting duplicate items. Does anyone have any idea at what point it becomes too inefficient, and any suggestions about what to implement in its stead?
I'm somewhat tempted to build this thing using only techniques I've never used before, and rarely seen in the wild, then publish it just for the fun of it.
Will Rogers never met me.
|
Hi Roger,
there is more for you to explore!
All kinds of collections have a Contains() method; in fact it is part of the IList interface. It typically works by comparing the collection elements one by one with the given object, so yes, that slows down for huge collections.
There is one kind of collection that searches more efficiently: the Dictionary kind, which uses hashing to look entries up quickly by key. So if you (1) don't care about the order of the elements in the collection, and (2) have one field in your elements that must be unique, you can simply look that field up, like so:
class myItem {
    int unique;
    byte[] data;
    string text;
    ...
}
Dictionary<int, myItem> dict = new Dictionary<int, myItem>();
foreach(myItem mi in incomingItems) {
    if (!dict.ContainsKey(mi.unique)) dict.Add(mi.unique, mi);
}
And then there is HashSet<T>, which is a collection that simply ignores duplicates: it hashes your elements and uses their equality to decide whether an item is already present, so adding something that is already in the set has no effect. With it you could do:
HashSet<myItem> set = new HashSet<myItem>();
foreach(myItem mi in incomingItems) {
    set.Add(mi);   // Add() returns false when an equal item is already in the set
}
provided myItem defines equality appropriately (override Equals() and GetHashCode(), or pass an IEqualityComparer<myItem> to the constructor), so that duplicate records actually compare equal.
So it all depends on how large your collections grow, whether your items have a unique field, and how much work you want to put into defining equality.
|
It appears that I may have stumbled upon the best possible collection for my purpose, as there is no unique field in the data, and avoiding duplicates is paramount. How often is that likely to happen?
My input dataset contains about ten lines of garbage at the beginning, then 5 to 9 lines of csv data to be processed. If I had a full year's worth of data to process at one go, that would still only be about 35,000 records. If I tried, instead, to capture and eliminate duplicates as I INSERTed the records into SQL Server, rather than eliminating them when I enqueue them in the buffer before insertion, do you think it would make a noticeable difference? I know that's a technique I should learn, for future reference, but is it really useful for this app? It seems to me that there would be a lot of overhead, making connections and retrieving error messages, just to cull out duplicates. That seems wasteful of a scarce resource. Besides, the SQL transactions would have to be carried over a network, which is always subject to collisions and dropouts. That could badly affect reliability, though hopefully such events would be very rare.
Thanks, as always, for your valuable guidance!
Will Rogers never met me.
|
Roger Wright wrote: do you think that it would make a noticeable difference?
Hmm. As your number of new records isn't particularly high, sending all of them to the DB wouldn't be an obstacle, even if there were several duplicates. And you have to take precautions against duplicates in the DB anyway, so I'd go for one of the auxiliary-table techniques PIEBALD hinted at.
|
Luc's answer is perfect. You might want to have a look at this[^]. Algorithm complexity analysis becomes crucial sometimes.
|
Interesting stuff, but far beyond my present ability. Thirty years ago I would have lapped it up, since it was my job to write super efficient code. Maybe one day I will again - I bookmarked it.
Will Rogers never met me.
|
I don't do it that way. I prefer not to load a collection with data that I don't need for very long.
I recommend loading (with bcp) the data into tables that are designed specifically to hold the "raw" data -- these would have all text fields and allow duplicates. Then you can use SQL to move any non-duplicate rows to where they need to be -- this can be done with a trigger, but you need to enable triggers in bcp. Then clean up any left-over data.
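As a rough sketch (the table and column names here are invented for illustration, and it assumes System.Data.SqlClient), the move-the-non-duplicates step can then be a single SQL statement fired from C#:
using (SqlConnection con = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand(@"
    INSERT INTO Readings (ReadDate, Meter, Value)
    SELECT DISTINCT s.ReadDate, s.Meter, s.Value
    FROM RawStaging s
    WHERE NOT EXISTS (SELECT 1 FROM Readings r
                      WHERE r.ReadDate = s.ReadDate
                        AND r.Meter    = s.Meter
                        AND r.Value    = s.Value);
    TRUNCATE TABLE RawStaging;", con))
{
    con.Open();
    cmd.ExecuteNonQuery();   // inserts only rows not already present, then empties the staging table
}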
Another option is, after the data is in the "raw" table, to use a DataReader to read the data, copying and deleting the rows one by one and ignoring any duplicate exceptions you may receive. Unfortunately, .NET doesn't make it easy to distinguish between different types of database errors.
|
That sounds like way more complexity than this trivial exercise deserves. Besides, I have no idea what bcp is.
Will Rogers never met me.
|
bcp is the command-line Bulk CoPy utility that comes with SQL Server. You write a format file to tell bcp what to do with the file contents, and maybe some SQL, but otherwise you don't need to write any code. It can be confusing and takes a little getting used to, but I can probably send you some examples.
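For a simple comma-separated file loaded over a trusted connection, the call can be as short as this (the server, database and table names are placeholders):
bcp MyDb.dbo.RawStaging in report.csv -c -t, -S MYSERVER -T
A format file (the -f switch) only becomes necessary when the file layout and the table layout don't line up exactly.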
|
One way of improving performance when dealing with collections is to set the capacity of the collection when initialising it. If you know the size of the list when instantiating the class, use the overloaded constructor that takes an int capacity.
The reason is that, under the hood, lists are just arrays. I can't remember what the default capacity is, but say the underlying array has a capacity of 100,000: when you add the 100,001st item, the list re-dimensions to 200,000. Basically it doubles in size each time it re-dimensions, and each re-dimension means allocating a new array and copying all the existing items across, which is expensive. If you know the capacity in advance, you avoid that cost and maximise performance.
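For the queue discussed above, that just means passing the count to the constructor (a sketch; 'expectedCount' stands in for whatever estimate you can come up with):
// reserve space up front so the backing array never needs to grow
Queue<string> outQueue = new Queue<string>(expectedCount);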
|
That's a great idea! I can count the lines in each of the selected files, subtract about 10 from each for the useless header info, then use the total when I instantiate the queue. Cool!
Will Rogers never met me.