Flexible text format support using Regular Expressions

1 Jul 2009 (CPOL)
Regular Expressions enable your application to parse text files of customized formats.

Introduction

I was writing a small database application that lets users import records from plain text files into an Access database. One of the problems I faced was that different users have different formats: some are tab delimited, some are comma delimited; some have fixed-width fields, while others don't. You can use a switch statement to deal with them, but as the number of formats increases, so does the ugliness index of your code. To make it more difficult, more often than not, you don't know what the format will be at coding time.

So I needed to support customized formats in a way that is flexible, yet easy for the application to interpret. It looked like a daunting task, until I came across the idea of Regular Expressions.

Using the Code

It's very simple to use, since there isn't much in it other than the idea of using Regular Expressions. The demo project contains two formats described in formats.xml, and two sample input files. You need to:

  1. Add flex_format.cs to your project.
  2. Load the format information stored in the XML file during initialization.
    // Requires: using System; using System.Collections; using System.IO;
    //           using System.Xml.Serialization;
    // Read the supported formats
    XmlSerializer s = new XmlSerializer(typeof(ArrayList),
                          new Type[] { typeof(flex_format) });
    using (TextReader r = new StreamReader("formats.xml"))
    {
      formats_supported = (ArrayList)s.Deserialize(r);
    }
  3. Fill the file filter with the format information when the user opens an OpenFileDialog.
    OpenFileDialog dlg = new OpenFileDialog();
    // Build the dialog's file filter from the supported formats
    foreach (flex_format format in formats_supported)
    {
      dlg.Filter += (dlg.Filter.Length > 0 ? "|" : "") +
          format.description + "|*" + format.suffix;
    }
    if (dlg.ShowDialog() == DialogResult.OK)
    {
      load_list(dlg.FileName);
    }
  4. Parse the text file using the Regular Expression specified in the format description.
    // Requires: using System.Text.RegularExpressions;
    private void load_list(string file_name)
    {
      flex_format format = null;
      // Determine the file format by file name suffix
      foreach (flex_format fmt in formats_supported)
      {
        if (file_name.EndsWith(fmt.suffix, StringComparison.OrdinalIgnoreCase))
        {
          format = fmt;
          break;
        }
      }
      // No supported format matches this suffix
      if (format == null) return;

      listView1.Clear();
      // Fill the listview columns with the field names
      foreach (string field_name in format.entries)
      {
        listView1.Columns.Add(field_name);
      }

      // Build the regular expression once, not once per line
      Regex regex = new Regex(format.pattern);

      // Now read the content of the input file
      textBox1.Text = null;
      using (StreamReader reader = new StreamReader(file_name))
      {
        while (true)
        {
          string line = reader.ReadLine();
          // Stop at the end of the file or at the first blank line
          if (string.IsNullOrEmpty(line)) break;
          textBox1.Text += line + "\r\n";

          // This is where the regular expression is used
          Match match = regex.Match(line);
          if (!match.Success) continue;   // Skip lines that do not fit the format

          // Group 1 becomes the item text, the remaining groups become sub-items
          ListViewItem item = new ListViewItem(match.Groups[1].Value);
          for (int i = 2; i < match.Groups.Count; i++)
          {
            item.SubItems.Add(match.Groups[i].Value);
          }
          listView1.Items.Add(item);
        }
      }
    }
  5. To add a new format, you can either edit the XML file manually, or use XML serialization programmatically. This is part of the XML file used in the demo project. This format lets the user choose among different speed units, which would be much harder to implement without Regular Expressions.
    <anyType xsi:type="flex_format"
       description="Car Speed Record Database"
       suffix=".spd"
       pattern="([^\t]+)\t([^\t]+)\t([\d\.]+)\s*(mph|km/h|m/s)">
      <entries>
        <entry>Make</entry>
        <entry>Model</entry>
        <entry>TopSpeed</entry>
        <entry>Unit</entry>
      </entries>
    </anyType>
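
To see what the pattern in the XML fragment captures, here is a minimal sketch that applies it to a single tab-delimited line. Only the pattern itself is taken from formats.xml; the PatternDemo class, the ParseLine helper, and the sample record are made up for illustration and are not part of the demo project.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // Pattern copied from the "Car Speed Record Database" format in formats.xml.
    public const string SpeedPattern =
        @"([^\t]+)\t([^\t]+)\t([\d\.]+)\s*(mph|km/h|m/s)";

    // Hypothetical helper: returns the captured fields (groups 1..n),
    // or null when the line does not match the pattern.
    public static string[] ParseLine(string pattern, string line)
    {
        Match match = Regex.Match(line, pattern);
        if (!match.Success) return null;

        string[] fields = new string[match.Groups.Count - 1];
        for (int i = 1; i < match.Groups.Count; i++)
        {
            fields[i - 1] = match.Groups[i].Value;
        }
        return fields;
    }

    public static void Main()
    {
        // Made-up sample record: Make <tab> Model <tab> TopSpeed Unit
        string[] fields = ParseLine(SpeedPattern, "Toyota\tCorolla\t180 km/h");
        Console.WriteLine(string.Join(" | ", fields));
        // prints "Toyota | Corolla | 180 | km/h"

        // A line with an unknown unit does not match, so it would be skipped.
        Console.WriteLine(ParseLine(SpeedPattern, "Toyota\tCorolla\t180 knots") == null);
        // prints "True"
    }
}
```

The four capture groups line up with the four entry names (Make, Model, TopSpeed, Unit), which is why load_list can map groups to listview columns by position.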

Why Regular Expressions

The flex_format class holds a description, a file suffix, a Regular Expression pattern, and an array of field names, and is serialized to an XML file. How flexible can the file format be? Just as flexible as Regular Expressions. Combined with XML serialization, this gives you a very concise way to support flexible text formats. More importantly, you can add support for a new format without changing your application.
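
The article does not list flex_format.cs, so the following is only a guess at what such a class might look like, inferred from the XML fragment shown earlier (three attributes plus an entries array). The FormatStore helper and the comma-delimited format are hypothetical. The sketch shows how a new format could be added programmatically and round-tripped through XmlSerializer.

```csharp
using System;
using System.Collections;
using System.IO;
using System.Xml.Serialization;

// Assumed shape of flex_format, inferred from the XML fragment;
// the real flex_format.cs in the demo project may differ.
public class flex_format
{
    [XmlAttribute] public string description;
    [XmlAttribute] public string suffix;
    [XmlAttribute] public string pattern;

    [XmlArray("entries"), XmlArrayItem("entry")]
    public string[] entries;
}

// Hypothetical helper, not part of the demo project.
public static class FormatStore
{
    private static XmlSerializer CreateSerializer()
    {
        // Same serializer setup as in the article's loading code.
        return new XmlSerializer(typeof(ArrayList),
                                 new Type[] { typeof(flex_format) });
    }

    // Serializes the list of formats to an XML string.
    public static string Save(ArrayList formats)
    {
        using (StringWriter writer = new StringWriter())
        {
            CreateSerializer().Serialize(writer, formats);
            return writer.ToString();
        }
    }

    // Reads a list of formats back from an XML string.
    public static ArrayList Load(string xml)
    {
        using (StringReader reader = new StringReader(xml))
        {
            return (ArrayList)CreateSerializer().Deserialize(reader);
        }
    }

    public static void Main()
    {
        // Made-up comma-delimited format with two fields.
        flex_format csv = new flex_format();
        csv.description = "Comma-delimited contacts (hypothetical)";
        csv.suffix = ".csv";
        csv.pattern = @"([^,]+),([^,]+)";
        csv.entries = new string[] { "Name", "Phone" };

        ArrayList formats = new ArrayList();
        formats.Add(csv);

        string xml = Save(formats);
        Console.WriteLine(xml);

        // Round trip: the loaded object carries the same pattern and entries.
        flex_format loaded = (flex_format)Load(xml)[0];
        Console.WriteLine(loaded.suffix + " -> " + loaded.pattern);
    }
}
```

Keeping the pattern in data rather than in code is the point of the design: supporting another format means adding one more element to the XML file, not another branch in the parser.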

Revision History

  • [July 02 2009] Minor changes to the demo project.
  • [June 20 2009] Initial version.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)