|
Sebastien,
Thanks for posting your work. I have been using the CsvReader for quite some time with great results. One thing I've noticed is that an unhandled exception occurs when reading a file that contains two headers with the same name. I believe the error happens when the duplicate name is added to the _fieldHeaderIndexes dictionary. Here are lines 1480-1487 of CsvReader.cs:
_fieldHeaders = new string[_fieldCount];
_fieldHeaderIndexes = new Dictionary<string, int>(_fieldCount, _fieldHeaderComparer);

for (int i = 0; i < _fields.Length; i++)
{
    _fieldHeaders[i] = _fields[i];
    _fieldHeaderIndexes.Add(_fields[i], i);
}
Ideally, this type of error would throw some kind of "DuplicateHeaderException", or better yet, no exception would be thrown until a field is actually referenced by name. It could take quite a bit of code to fix, but I thought I would bring it to your attention.
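For what it's worth, a minimal stand-alone sketch (outside the library) reproduces the underlying behavior: Dictionary<TKey, TValue>.Add throws an ArgumentException when the same key is added twice, which I assume is the unhandled exception I am seeing.

using System.Collections.Generic;

class DuplicateHeaderRepro
{
    static void Main()
    {
        // Simulates a header row containing two columns named "Name".
        string[] headers = { "Id", "Name", "Name" };
        Dictionary<string, int> indexes = new Dictionary<string, int>(headers.Length);

        for (int i = 0; i < headers.Length; i++)
            indexes.Add(headers[i], i); // throws ArgumentException on the second "Name"
    }
}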
|
|
|
|
|
I agree this can be considered a leaky abstraction, but I disagree that I should wait before throwing the exception, as I strongly believe in the fail-fast philosophy. Thanks for the suggestion; I will add a DuplicateHeaderException or something similarly named.
|
|
|
|
|
I've been using your CsvReader for quite a while now and I've stumbled across this duplicate headers problem.
I don't think that throwing an exception is the ideal solution here, because the user parsing the .csv file might not have control over the data (which is my case). Also, there is no rule saying that headers must be unique (at least not in the CSV specification; see http://en.wikipedia.org/wiki/Comma-separated_values).
The best approach would be to let the user choose how to handle it: do nothing, raise an exception, raise an event, etc., much like the missing field action (see the sketch below).
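For comparison, this is how the missing field case can already be configured today (a minimal sketch; the file name is a placeholder):

using System.IO;
using LumenWorks.Framework.IO.Csv;

class Example
{
    static void Main()
    {
        using (CsvReader csv = new CsvReader(new StreamReader("data.csv"), true))
        {
            // Missing fields are replaced by empty strings instead of causing a parse error.
            csv.MissingFieldAction = MissingFieldAction.ReplaceByEmpty;

            while (csv.ReadNextRecord())
            {
                // ... process the record ...
            }
        }
    }
}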
Vorgaad.
|
|
|
|
|
Humm, I have to disagree with you on that point. How does having multiple headers with the same name make sense? How would you distinguish between fields other than by their index? If you are really facing this situation, you can turn off the optional parsing of headers (it is a parameter in the constructor; see the sketch at the end of this message), or you can name your headers "foo1", "foo2", etc.
The whole point of using the first line as headers is to let you access fields by their name. Logically, the names must be unique for that to be possible.
By the way, while not mentioned explicitly, please note that the examples in the Wikipedia article conveniently do not use duplicate header names.
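For example, something along these lines (a minimal sketch; the file name is a placeholder):

using System;
using System.IO;
using LumenWorks.Framework.IO.Csv;

class Example
{
    static void Main()
    {
        // Pass false for hasHeaders so the first line is treated as a plain record.
        using (CsvReader csv = new CsvReader(new StreamReader("data.csv"), false))
        {
            while (csv.ReadNextRecord())
                Console.WriteLine(csv[0]); // fields are accessed by index only
        }
    }
}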
|
|
|
|
|
I agree with you that headers should be unique. Unfortunately, I have no control over how the .csv file is generated (the data comes from a third party), so the files sent to me contain headers, and some are duplicates. I do access the fields by numeric index, but the parsing fails (and will continue to fail if an exception is thrown).
Correct me if I'm wrong, but the hasHeaders constructor parameter alone won't solve my problem, because the headers are in the file, and if I don't parse them as headers, they will be treated as data (which I don't want).
Maybe I should ignore them using the hasHeaders constructor parameter and skip the first line, as in the sketch below? I'm not sure how "clean" that is, though. I just thought that handling them the same way missing fields are handled would be cleaner and more versatile.
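Something like this minimal sketch is what I have in mind (the file name is a placeholder):

using System;
using System.IO;
using LumenWorks.Framework.IO.Csv;

class Example
{
    static void Main()
    {
        // Read without header parsing, then discard the first record (the duplicate headers).
        using (CsvReader csv = new CsvReader(new StreamReader("thirdparty.csv"), false))
        {
            csv.ReadNextRecord(); // skip the header line

            while (csv.ReadNextRecord())
                Console.WriteLine(csv[0]); // remaining fields accessed by numeric index
        }
    }
}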
Vorgaad.
|
|
|
|
|
Well, I cannot handle this case the same way as, say, a missing field, because having duplicate headers does not make sense at all, and there is nothing you can do about it except skip the first record and access fields by index. Throwing a parsing error would be the same as the current behavior. The reader could report a duplicate header and then let you skip it, but then what? Some fields would be accessed by name and others by index? What a mess, really. And what more could you do if the reader raised an event, besides the options already mentioned?
Really, I understand your pain dealing with messy data, but skipping a record is about as clean as having duplicate names in the first place...
|
|
|
|
|
I have run into this exception as well, and perhaps you will find my use case a little more compelling than simply allowing duplicate headers. (Or perhaps you can give me another suggestion for a solution.)
CSV files that are edited in apps like Excel sometimes end up with runs of empty fields, e.g. ",,,,,", when items are removed. Thus, a sheet that looks fine to an end user will throw an exception when it hits the library, at the place outlined above. I was reading about trimming, but I think that only trims the contents of each field; it wouldn't trim off empty fields, I believe.
Thoughts?
|
|
|
|
|
Hi,
Thank you so much for this VERY useful reader.
I have used earlier versions before, but I cannot find the download link for the 3.5 version. The top of this page contains links to the 1.0 version only, I believe.
Thanks!
Peter.
|
|
|
|
|
Something went wrong during the update process on the CP side... Anyway, they just updated it with the files I sent again. Kudos to them for the fast fix
|
|
|
|
|
Thanx for posting this code!
Your CsvReader works like a charm, but I ran into trouble with the CachedCsvReader. I'm not sure if I'm using the correct version, so just to make sure we're talking about the same one: my copy of CachedCsvReader contains the following code fragment in the ReadNextRecord(bool, bool) method:
if (base.CurrentRecordIndex > -1)
    CopyCurrentRecordTo(record);
else
{
    MoveTo(0);
    CopyCurrentRecordTo(record);
    MoveTo(-1);
}

_records.Add(record);

if (!onlyReadHeaders)
    _currentRecordIndex++;
The following code, run on a csv file with 4 records, produces the correct output:
private void ReadCSVFile(string file)
{
    using (CachedCsvReader csv = new CachedCsvReader(
        new StreamReader(file, Encoding.GetEncoding("ISO-8859-1")), true, ';', '\'', '\\', '#', true))
    {
        foreach (string[] data2 in csv)
        {
            DoSomeOutPut(data2);
        }
    }
}
output:
Record1
Record2
Record3
Record4
When I add a call to csv.GetFieldHeaders() before the enumeration, the output looks like this:
Record1
Record1
Record2
Record3
So the first record gets enumerated twice and the last one is lost.
What's the best way to fix this?
Thanks in advance.
|
|
|
|
|
Yes, this indeed looks like a bug. I will look into it tomorrow and fix it. Thank you for reporting it!
|
|
|
|
|
If you need it, leave a comment under this thread. Thank you!
|
|
|
|
|
Hi, Sébastien,
Yes, I need it. Our code is .NET 1.1 throughout and we rely on CsvReader in some of our critical applications. I may rewrite it at some point down the road, but it would be very useful if you continued to support the 1.1 version for at least another year or so.
Thank you!
Simon
|
|
|
|
|
Well, as long as there is no bug, I am not going to change anything
|
|
|
|
|
Hi, Sébastien,
It's ok if you do not change anything in the code, but could you please place the sources back online? The references to the 1.1 versions are gone now. (I am talking about the article to which this forum thread is attached.)
Thanks!
Simon
|
|
|
|
|
Hi!
I tried to use the custom error handling scenario, but the first row/record is missing. I have the updated LumenWorks.Framework.IO.dll 3.0.0.0.
The header and the rest of the records are fine.
I need your help. Thanks!
|
|
|
|
|
The first record is usually the header. Have you tried reading the CSV without headers? It is an argument in the constructor. If that is not the cause of your problem, could you post your code?
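For reference, the custom error handling scenario is normally wired up like this (a minimal sketch; the file name and handler body are placeholders):

using System;
using System.IO;
using LumenWorks.Framework.IO.Csv;

class Example
{
    static void Main()
    {
        using (CsvReader csv = new CsvReader(new StreamReader("data.csv"), true))
        {
            csv.DefaultParseErrorAction = ParseErrorAction.RaiseEvent;
            csv.ParseError += delegate(object sender, ParseErrorEventArgs e)
            {
                Console.WriteLine(e.Error.Message);
                e.Action = ParseErrorAction.AdvanceToNextLine; // skip the faulty record
            };

            while (csv.ReadNextRecord())
            {
                // ... process the record ...
            }
        }
    }
}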
|
|
|
|
|
Made my work a lot easier
|
|
|
|
|
You're welcome! Thank you for taking the time to leave a comment
|
|
|
|
|
Hi,
Could you please clean up the ZIP file CsvReader20_src.zip?
It contains three copies of the source and three different versions of CachedCsvReader.cs (the rest of the files are the same).
I used the tool CloneSpy to clean up the source myself, first removing all bin directories. Finally, I kept the version of CachedCsvReader.cs that had the latest modification date/time. I hope that is the correct one.
|
|
|
|
|
Hum, I don't know what happened; I double-checked my files before uploading them?! Anyway, it will be fixed soon, sorry about that. Funny you are the first to report it; it seems other users are more complacent
|
|
|
|
|
Sorry about the delay, I am still waiting for the article to be updated.
|
|
|
|
|
Article has been updated.
|
|
|
|
|
Hello,
Thanks for the great parser and for keeping it up-to-date!
I have a small 80GB file that I'm working on, but the header record is wrapped in quotes, which generates an error. Here are the header record and the first record:
"Household_Member1","Address","City_State_Zip","Area_Code_and_Phone","Phone_Area_Code","Phone_Number_without_Area_Code","City","State","ZIP","ZIP_+_4","ZIP_+_4_+_DPBC","Delivery_Point_Bar_Code","Carrier_Route","County","MSA","Household_Member_First1","Household_Member_Last1","Household_Member_Gender1","Household_Member_Age1","Household_Member2","Household_Member_First2","Household_Member_Last2","Household_Member_Gender2","Household_Member_Age2","Household_Member3","Household_Member_First3","Household_Member_Last3","Household_Member_Gender3","Household_Member_Age3","Household_Member4","Household_Member_First4","Household_Member_Last4","Household_Member_Gender4","Household_Member_Age4","Household_Member5","Household_Member_First5","Household_Member_Last5","Household_Member_Gender5","Household_Member_Age5","Household_Member6","Household_Member_First6","Household_Member_Last6","Household_Member_Gender6","Household_Member_Age6","Household_Member7","Household_Member_First7","Household_Member_Last7","Household_Member_Gender7","Household_Member_Age7","Household_Member8","Household_Member_First8","Household_Member_Last8","Household_Member_Gender8","Household_Member_Age8","Household_Member9","Household_Member_First9","Household_Member_Last9","Household_Member_Gender9","Household_Member_Age9","Household_Member10","Household_Member_First10","Household_Member_Last10","Household_Member_Gender10","Household_Member_Age10","Residence_Type","Home_Age","Est_Home_Value","Est_Income","Own/Rent","Length_of_Residence","Mortgage_Age","Mortgage_Count","Mortgage_Finance_Type","Mortgage_Loan_to_Value_Ratio","Mortgage_Loan_Type","Personal_Details1","Personal_Details2","Personal_Details3","Personal_Details4","Personal_Details5","Mail_Order_Buyer1","Mail_Order_Buyer2","Mail_Order_Buyer3","Mail_Order_Buyer4","Mail_Order_Buyer5","Mail_Order_Buyer6","Mail_Order_Buyer7","Mail_Order_Buyer8","Mail_Order_Buyer9","Mail_Order_Buyer10","Mail_Order_Buyer11","Mail_Order_Buyer12","Mail_Order_Buyer13","Mail_Order_Buyer14","Mail_Order_Buyer15","Mail_Order_Buyer16","Mail_Order_Buyer17","Mail_Order_Buyer18","Mail_Order_Buyer19","Mail_Order_Buyer20","Mail_Order_Buyer21","Mail_Order_Buyer22","Mail_Order_Buyer23","Mail_Order_Buyer24","Donor_Type1","Donor_Type2","Donor_Type3","Donor_Type4","Donor_Type5","timezone","latitude","longitude"
"Rose M Modarelli",""2822 Bears Den CT","Youngstown, OH 44511-1214","","","","Youngstown","OH","44511","44511-1214","44511-1214-223","223","C090","Mahoning","Youngstown-Warren, OH","Rose","Modarelli","F","65 and older","Dina M Modarelli","Dina","Modarelli","F","25 to 34","Dominic G Modarelli","Dominic","Modarelli","M","25 to 34","Frank J Modarelli","Frank","Modarelli","M","45 to 54","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","Single family dwelling","","","","Home Owner","26 to 30 years","26 to 30 years","Unknown","Unknown","0.01 to 0.09","Unknown","Pool","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","EST",80.696899999999999,41.069099999999999
It works fine if I take the quotes off the header record. I receive a lot of these files, so if I can make the parser handle it, that would be better than prepending a new header to these files, since they are so large.
I know you are a busy guy, so if you can point me in the general direction, I'll try to make the change myself. Thanks again for your time.
Chad
|
|
|
|
|
Thanks for your comment
First of all, knowing the exact error would help me... That said, I can see there is an extra " in ""2822 Bears Den CT", which may be the cause of your error. Hope this helps!
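If you just want the reader to skip malformed records such as this one instead of stopping, something like this minimal sketch should do (the file name is a placeholder):

using System.IO;
using LumenWorks.Framework.IO.Csv;

class Example
{
    static void Main()
    {
        using (CsvReader csv = new CsvReader(new StreamReader("households.csv"), true))
        {
            // Silently skip any record that fails to parse instead of throwing.
            csv.DefaultParseErrorAction = ParseErrorAction.AdvanceToNextLine;

            while (csv.ReadNextRecord())
            {
                // ... process the record ...
            }
        }
    }
}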
|
|
|
|
|