How to import data from a PDF file to SQL Server 2005?
|
4anusha4 wrote: import data
What do you mean by data? The text data inside the PDF file, or the whole PDF file?
|
Yes.
The PDF contains the data arranged column-wise, and I want to import all of it into SQL Server.
|
Well, if so, it will not be so easy.
First you need to find out how to extract the text from the PDF file.
A quick Google search will give you an idea of how to do so.
Here are also a few links you can go through:
1[^]
2[^]
3[^]
|
PDF files cannot be imported directly in this way; you will need to convert the PDF table to some text form that can be imported. The iTextSharp[^] library can help you.
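Something along these lines should get you the raw text out (a sketch only, assuming the iTextSharp 5.x API; the file name is just an example, and splitting the text into columns is still up to you):
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

PdfReader reader = new PdfReader(@"C:\reports\report.pdf");
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    string pageText = PdfTextExtractor.GetTextFromPage(reader, page);
    // TODO: parse pageText into rows/columns and INSERT them into SQL Server
}
reader.Close();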
Just say 'NO' to evaluated arguments for diadic functions! Ash
|
You will need to store the entire PDF in a BLOB field.
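Roughly, storing the file could look something like this (only a sketch; the Documents table, column names, file path and connection string are made up for illustration, and it assumes System.IO and System.Data.SqlClient):
byte[] pdfBytes = File.ReadAllBytes(@"C:\reports\report.pdf");
using (SqlConnection con = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand(
    "INSERT INTO Documents (Name, Pdf) VALUES (@name, @pdf)", con))
{
    cmd.Parameters.AddWithValue("@name", "report.pdf");
    cmd.Parameters.Add("@pdf", SqlDbType.VarBinary, -1).Value = pdfBytes;   // -1 means varbinary(max)
    con.Open();
    cmd.ExecuteNonQuery();
}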
The funniest thing about this particular signature is that by the time you realise it doesn't say anything it's too late to stop reading it.
My latest tip/trick
Visit the Hindi forum here.
|
How do I store the PDF in a BLOB field?
|
Could anyone please suggest a good book on
a) Network Programming
b) Security
both for beginner and advanced level?
Regards
|
I wouldn't dive right into .NET Remoting; first I would learn the basics of C#. Anyway, I personally like the Wrox books; they give you enough to get going. For network programming you can find plenty of articles right here at CodeProject and across the net.
|
Jacob D Dixon wrote: I personally like the Wrox books.
Me too. I have Wrox Professional C# 4.0.
But it doesn't go into much detail on networking or threading.
Jacob D Dixon wrote: plenty of articles right here at codeproject and across the net
Yes, I have, but they only discuss the particular part involved in that article, not the whole topic, whereas a book starts a subject from scratch and progresses bit by bit.
So, could you suggest a good book on network programming, threading, and security?
|
Go to amazon.com, read the reviews, and work it out for yourself. I could recommend a book, but I like to dive in, have a go, and then develop a deeper understanding; you may be the complete opposite. Everyone has their own learning style.
|
Robert Croll wrote: I could recommend a book
Then please do so.
I have Wrox Professional, but on threading it only gives an introduction. When it comes to thread synchronisation there is a note along the lines of:
"Threading can't be covered fully in this chapter; a whole book would be required." But it doesn't say which book.
The same applies to network programming, and there is nothing about security.
It's not that I haven't searched.
This is what I found.
But it's from back in 2005; I went through some of its online chapters, there are quite a few typos, and the code there is for 1.1.
It would be helpful if you could point me to where/how to find a good book.
|
Well, there isn't much out there in print on networking for .NET. That link was the book I was thinking about. It got mixed reviews; I guess you need to get your hands on the book and see what you think, because some readers thought it was good.
BTW, I'm guessing you know this, but threading isn't networking.
Maybe try to find some articles online regarding .NET and TCP/IP.
|
I've been playing...
A short while back I posted about writing an app which reads csv data from a folder of files I save from emailed reports, and adds each record to a SQL Server DB. Due to other work tasks, I haven't done much with it lately, but I'm back to it now. Previously I'd managed to select the files, strip off the ten lines or so of descriptive data at the beginning of each, identify and classify data lines using regexes, then write the cleaned up data to a text file. That's about where I got distracted, and it's just as well, since I was stuck on the best way to transfer the information to a SQL Server instance. The problem is duplicates...
My data source sometimes sends consolidated reports that duplicate data already received, and it's not particularly easy to tell from the emails which files contain duplicates. My code thus far doesn't have that capability, and I really don't look forward to writing it, but neither do I want to deal with SQL Server bouncing my transactions because they contain duplicate data. So I did some digging and found a gem.
I found that the Queue collection has a .Contains method which looked interesting. I dumped the output text file from my previous version, and just enqueued each cleaned up record as I processed it; that worked great. Then I added code to skip a record if OutQueue.Contains(record) returned true. Wow! It worked like a charm, and took all of ten minutes to implement.
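The relevant bit is roughly this (simplified, with illustrative names):
Queue<string> outQueue = new Queue<string>();
foreach (string record in cleanedRecords)
{
    if (!outQueue.Contains(record))   // linear scan; fine for a few hundred records
        outQueue.Enqueue(record);
}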
Whee! It's fun trying new stuff. But I'm curious, having never used a Queue before... My application will never see more than a few hundred records at a time, so efficiency isn't a big issue. But I'm thinking that, at some level of complexity - thousands, hundreds of thousands, millions perhaps - the Contains method will become less efficient than other methods of detecting duplicate items. Does anyone have any idea at what point it becomes too inefficient, and any suggestions about what to implement in its stead?
I'm somewhat tempted to build this thing using only techniques I've never used before, and rarely seen in the wild, then publish it just for the fun of it.
Will Rogers never met me.
|
Hi Roger,
there is more for you to explore!
All kinds of collections have a Contains() method; in fact it is part of the IList interface. It typically works by comparing the collection elements one by one with the given object, so yes, that slows down for huge collections.
There is one kind of collection that searches more efficiently: the Dictionary kind, which uses hashing to look entries up quickly by key. So if you (1) don't care about the order of the elements in the collection, and (2) have one field in your elements that must be unique, you can simply look that field up, like so:
class myItem {
    int unique;
    byte[] data;
    string text;
    ...
}
Dictionary<int, myItem> dict = new Dictionary<int, myItem>();
foreach(myItem mi in incomingItems) {
    if (!dict.ContainsKey(mi.unique)) dict.Add(mi.unique, mi);
}
And then there is HashSet<T>, which is a collection that simply ignores duplicates: it hashes your elements and uses their equality to decide whether an item is already present, so adding something that is already in the set has no effect. With it you could do:
HashSet<myItem> set = new HashSet<myItem>();
foreach(myItem mi in incomingItems) {
    set.Add(mi);   // Add() returns false when an equal item is already in the set
}
provided myItem defines equality appropriately (override Equals() and GetHashCode(), or pass an IEqualityComparer<myItem> to the constructor), so that duplicate records actually compare equal.
So it all depends on how large your collections grow, whether your items have a unique field, and how much work you want to put into defining equality.
|
It appears that I may have stumbled upon the best possible collection for my purpose, as there is no unique field in the data, and avoiding duplicates is paramount. How often is that likely to happen?
My input dataset contains about ten lines of garbage at the beginning, then 5 to 9 lines of csv data to be processed. If I had a full year's worth of data to process at one go, that would still only be about 35,000 records. If I tried, instead, to capture and eliminate duplicates as I INSERTed the records into SQL Server, rather than eliminating them when I enqueue them in the buffer before insertion, do you think it would make a noticeable difference? I know that's a technique I should learn, for future reference, but is it really useful for this app? It seems to me that there would be a lot of overhead, making connections and retrieving error messages, just to cull out duplicates. That seems wasteful of a scarce resource. Besides, the SQL transactions would have to be carried over a network, which is always subject to collisions and dropouts. That could badly affect reliability, though hopefully such events would be very rare.
Thanks, as always, for your valuable guidance!
Will Rogers never met me.
|
Roger Wright wrote: do you think that it would make a noticeable difference?
Hmm. As your number of new records isn't particularly high, sending all of them to the DB wouldn't be an obstacle, even if there were several duplicates. And you have to take precautions against duplicates in the DB anyway, so I'd go for one of the auxiliary-table techniques PIEBALD hinted at.
|
Luc's answer is perfect. You might want to have a look at this[^]. Algorithm complexity analysis becomes crucial sometimes.
|
Interesting stuff, but far beyond my present ability. Thirty years ago I would have lapped it up, since it was my job to write super efficient code. Maybe one day I will again - I bookmarked it.
Will Rogers never met me.
|
I don't do it that way. I prefer not to load a collection with data that I don't need for very long.
I recommend loading (with bcp) the data into tables that are designed specifically to hold the "raw" data -- these would have all text fields and allow duplicates. Then you can use SQL to move any non-duplicate rows to where they need to be -- this can be done with a trigger, but you need to enable triggers in bcp. Then clean up any left-over data.
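As a rough sketch (the table and column names here are invented for illustration, and it assumes System.Data.SqlClient), the move-the-non-duplicates step can then be a single SQL statement fired from C#:
using (SqlConnection con = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand(@"
    INSERT INTO Readings (ReadDate, Meter, Value)
    SELECT DISTINCT s.ReadDate, s.Meter, s.Value
    FROM RawStaging s
    WHERE NOT EXISTS (SELECT 1 FROM Readings r
                      WHERE r.ReadDate = s.ReadDate
                        AND r.Meter    = s.Meter
                        AND r.Value    = s.Value);
    TRUNCATE TABLE RawStaging;", con))
{
    con.Open();
    cmd.ExecuteNonQuery();   // inserts only rows not already present, then empties the staging table
}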
Another option is, after the data is in the "raw" table, to use a DataReader to read the data, copying and deleting the rows one by one and ignoring any duplicate exceptions you may receive. Unfortunately, .NET doesn't make it easy to distinguish between different types of database errors.
|
That sounds like way more complexity than this trivial exercise deserves. Besides, I have no idea what bcp is.
Will Rogers never met me.
|
bcp is the command-line Bulk CoPy utility that comes with SQL Server. You write a format file to tell bcp what to do with the file contents, and maybe some SQL, but otherwise you don't need to write any code. It can be confusing and takes a little getting used to, but I can probably send you some examples.
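For a simple comma-separated file loaded over a trusted connection, the call can be as short as this (the server, database and table names are placeholders):
bcp MyDb.dbo.RawStaging in report.csv -c -t, -S MYSERVER -T
A format file (the -f switch) only becomes necessary when the file layout and the table layout don't line up exactly.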
|
One way of improving performance when dealing with collections is to set the capacity of the collection when initialising it. If you know the size of the list when instantiating the class, use the overloaded constructor that takes an int capacity.
The reason is that, under the hood, lists are just arrays. I can't remember what the default capacity is, but say the underlying array has a capacity of 100,000: when you add the 100,001st item, the list re-dimensions to 200,000. Basically it doubles in size each time it re-dimensions, and each re-dimension means allocating a new array and copying all the existing items across, which is expensive. If you know the capacity in advance, you avoid that cost and maximise performance.
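For the queue discussed above, that just means passing the count to the constructor (a sketch; 'expectedCount' stands in for whatever estimate you can come up with):
// reserve space up front so the backing array never needs to grow
Queue<string> outQueue = new Queue<string>(expectedCount);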
|
That's a great idea! I can count the lines in each of the selected files, subtract about 10 from each for the useless header info, then use the total when I instantiate the queue. Cool!
Will Rogers never met me.