AnyDataFileToXmlConverter Class/Utility

Chris Hambleton

4.75/5 (15 votes)

13 Apr 20073 min read

1.1K

A class/utility that converts data-files of various formats to .NET DataSet-compatible XML

Download source files - 237.8 KB

Introduction

The AnyDataFileToXmlConverter utility is a Windows application that reads and converts data-files of various formats like Access databases, Excel spreadsheets, CSV files, tabbed-delimited files, etc., to the standard DataSet-compatible XML format used in .NET.

Background

A few months ago, I had a project that required data-collection, validation, and processing of data-files from a wide variety of formats. Due to the frequency of how the validation/processing of the data-files would be changing over time, and the need for standardization of the processing format, I decided to use the .NET DataSet-compatible XML format and then later XSLT to process/re-format the data. A number of the file formats could be automatically converted to this file format anyway (like Excel spreadsheets and Microsoft Access databases), but many of the other file formats (like tab-delimited, comma-delimited, pipe-delimited text files) would need to be reprocessed into the .NET DataSet-compatible XML format. The main class and this utility is a result of that work.

Using the Application

The AnyDataFileToXmlConverter utility is quick and easy to use - just open/load the file in the utility and the file will be automatically processed to the .NET XML format. For Excel spreadsheets, the worksheets in the workbook are enumerated, and a worksheet must be selected for processing to XML. In Microsoft Access databases, a query for the specific data to convert to XML must be specified before processing.

Several sample data-files are provided with the utility, which demonstrate the various common data-file formats that can be re-processed to XML. There are additional file-formats that can be processed (like pipe-delimited files), and the class can be easily modified to handle additional formats also. XML files can be loaded into the utility, but no processing is performed since they're already in the end-result format.

Also, there is an optional data-cleanup function that automatically removes any "junk" XML-nodes that are commonly created as a result of processing an Excel spreadsheet that contains empty or cleared rows at the end of a spreadsheet.

You can also change the output/results display from XML to a grid, for easier viewing and sorting.

Data-File Processing Examples

Comma-Separated File

Tab-Delimited File

Excel Spreadsheet

Microsoft Access Database

The AnyDataFileToXmlConverter Engine - How It Works

The main engine/processor of the AnyDataFileToXmlConverter utility is the RawFileConverter class, which uses the file-extension of the loaded file to determine which processing function to use to reprocess the file. For most of the file formats (with text-files being the exception), the Microsoft.Jet.OLEDB provider is used to automatically load the file into a DataSet, and then the XML is retrieved from the DataSet.

For text files, the file contents are evaluated to find the likely character delimiter between the data columns, and then each line of the file is split into columns according to the character delimiter. A DataTable is created and then loaded with all the data in the file, with each line in the file being converted to a DataRow and each field loaded into a column in that DataRow. Once the entire file has been processed to the DataTable, the XML is retrieved from the DataTable's DataSet object.

/// <summary>
/// Converts the specified TEXT file to it's equivalent XmlDocument
/// </summary>
/// <PARAM name="sFilePath"></PARAM>
/// <returns></returns>
private static XmlDocument ConvertTextFile(string sFilePath)
{
    XmlDocument xmlRaw = null;
    StreamReader oSR = null;             

    try
    {               
        DataSet dsTextFile = new DataSet();
        DataTable dtTextFile = new DataTable();
        DataRow drRows = null;

        // check and pre-process the text file if it's a non-standard text file
        sFilePath = PreprocessNonStandardFiles(sFilePath);

        // find the correct delimiter for the file 
        // (some files have multiple delimiting chars, but only one is correct)
        char chrDelimiter = GetDelimiterCharacter(sFilePath);

        //Open the file and go to the top of the file                
        oSR = new StreamReader(sFilePath);                
        oSR.BaseStream.Seek(0, SeekOrigin.Begin);

        // read the first line 
        string sFirstLine = oSR.ReadLine();
        bool bHeaderIsDataRow = false;

        // init the columns if the file has a valid, parsible header
        string[] sColumns = sFirstLine.Split(chrDelimiter);
        if(sColumns.Length > MINIMUM_NUMBER_CSV_COLUMNS)
        {                    
            bHeaderIsDataRow = InitializeTableColumns(sColumns, 
                               ref dtTextFile, true);
            if (bHeaderIsDataRow == true)
            {
                oSR.BaseStream.Seek(0, SeekOrigin.Begin);
                oSR.Close();
                oSR = new StreamReader(sFilePath);
                oSR.BaseStream.Seek(0, SeekOrigin.Begin);
            }
        }              

        // add in the Rows for the datatable/file
        dsTextFile.DataSetName = "NewDataSet";
        dsTextFile.Tables.Clear();
        dtTextFile.TableName = "Table";
        dsTextFile.Tables.Add(dtTextFile);

        // iterate thru the file and process each line
        while (oSR.Peek() > -1)
        {                    
            int iFieldIndex = 0;
            string sLine = oSR.ReadLine();
            string sLineTrimmed = sLine.Trim();
            string[] sLineFields = sLine.Split(chrDelimiter);

            if ((sLineFields.Length <= 0) || 
                (sLineTrimmed.Length < MINIMUM_NUMBER_CSV_COLUMNS))
            {
                continue;
            }

            // if the number of fields is less that the minimum, skip the field
            if (sLineFields.Length <= MINIMUM_NUMBER_CSV_COLUMNS)
            {
                continue;
            }

            // if we suddenly have more fields than columns, 
            // we're in a header or something, so re-init the columns
            if ((sLineFields.Length > dtTextFile.Columns.Count) &&
                (sLineFields.Length > MINIMUM_NUMBER_CSV_COLUMNS))
            {
                //note: bad data?! - header/inconsistent delimiting problems?  
                if (dtTextFile.Rows.Count <= 0)
                {
                    InitializeTableColumns(sLineFields, ref dtTextFile, false);
                }
            }

            drRows = dtTextFile.NewRow();
            foreach (string strField in sLineFields)
            {
                string sField = strField.Trim();
                sField = sField.Replace("\"", "");
                sField = sField.Replace("'", "");
                sField = sField.Replace("$", "");
                sField = sField.Replace("%", "");
                sField = sField.Replace("-0-", "0");
                sField = sField.Replace("&", "and");

                // header/inconsistent file delimiting problems?  
                if (dtTextFile.Columns.Count <= iFieldIndex)
                    break;

                drRows[iFieldIndex] = sField;
                iFieldIndex = iFieldIndex + 1;
            }
            
            dtTextFile.Rows.Add(drRows);
        }        

        // load the dataset to an xmldocument                
        xmlRaw = CleanupRawXml(dsTextFile.GetXml());                
    }
    catch (Exception ex)
    {
        throw new Exception("Error: ConvertTextFile", ex);
    }
    finally
    {
        oSR.Close();
    }

    return xmlRaw;
}

Conclusion

I hope you find this article and utility useful - it has come in quite handy a number of times, and the core class was recently incorporated into a generic FileDataProvider class for use as another .NET data provider like the SqlDataProvider and OracleDataProvider providers. But that's another article for another day. Enjoy!

License

This article has no explicit license attached to it, but may contain usage terms in the article text or the download files themselves. If in doubt, please contact the author via the discussion board below.

A list of licenses authors might use can be found here.