Introduction
StringPatternizer is a simple library that lets you define custom patterns (similar to DateTime patterns) and use them to parse strings.
Background
Recently I integrated with a web service that provides location information. The location is sent in XML format and represented as two string fields: Latitude and Longitude. The challenge was that the coordinates could arrive in any format, while I needed them in decimal format. Here are a few examples:
- -22.856944
- 25 15 30
- 25° 15' 30"
- -22.856944 (22 51' 25.0" S)
It turned out that the location is entered by a human and there are no restrictions on the input; it is just a text box. In this situation hardcoded parsing is not an option.
Naturally, I came up with an XML configuration file that is supposed to hold a list of format patterns, so that the end user can add missing patterns in the future. But what kind of patterns should it use?
My first idea was RegEx. But I realized that RegEx is too complex for an end user. For instance, to parse a string like "25° 15' 30"", one needs to define the following RegEx:
(?<degrees>\d*[,.]?\d*)° (?<minutes>\d*[,.]?\d*)' (?<seconds>\d*[,.]?\d*)"
Of course, that is totally unacceptable for an end user. The perfect solution would be something similar to DateTime parsing patterns. Instead of a complex RegEx, it would be nice to write a pattern like this:
d° m' s"
where 'd' is a placeholder for degrees, 'm' is a placeholder for minutes, and 's' is a placeholder for seconds. With this approach the end user can take the original coordinate string, replace the values with placeholder characters, and get a ready-to-use pattern. Much simpler than RegEx.
After some googling I did not find any library that provides such capabilities, so I wrote my own and would like to share my solution with the community.
Using the code
First of all, you need to compile the attached source code and add a reference to "StringPatternizer.dll" to your project.
The main class, "StringPatternizer", should be created first:
StringPatternizer sp = new StringPatternizer();
The next step is to define the character markers that will be used as placeholders in the pattern:
sp.Markers.Add('d', typeof(int));
sp.Markers.Add('m', typeof(int));
sp.Markers.Add('s', typeof(double));
sp.Markers.Add('D', typeof(double));
sp.Markers.Add('S', typeof(string));
UPD:
With "StringPatternizer2" it is possible to use string markers instead of characters, so the code above may look like this:
sp.Markers.Add("degrees", typeof(int));
sp.Markers.Add("minutes", typeof(int));
sp.Markers.Add("seconds", typeof(double));
sp.Markers.Add("Decimal", typeof(double));
sp.Markers.Add("Side", typeof(string));
The code above defines 5 markers with their expected types. During parsing, StringPatternizer uses the specified types to verify that the format of each extracted value is correct. Of course, you can register all markers with the 'string' type, but the parsing will be less accurate.
The next step is to define a list of patterns:
sp.Patterns.Add("d° m' s\"");
sp.Patterns.Add("d m s");
sp.Patterns.Add("d°m's\"");
UPD:
With "StringPatternizer2" a pattern may look like this:
sp.Patterns.Add("degrees° minutes' seconds\"");
Another approach is to build a list of patterns and register the entire list:
var patterns = new List<string>()
{
"D",
"d m s S",
"d m s",
"d° m' s\" So",
"d° m' s\" Se",
"d° m' s\" S",
"d° m' s\"",
"d°m's\"S",
"d°m's\"",
"d?m's\"S",
"d?m's\"",
"d? m' s\"",
"D (d m' s\" S)",
"(d m' s\" S)",
"d m' s\" S\"",
"m' d° s\"",
"d m' s'' S",
"dº m' s\" S",
"dºm's"
};
sp.Patterns.AddRange(patterns);
There are two rules for defining a pattern:
- The order of markers should reflect the order of values in the incoming string.
- You need to specify the neighboring characters: one character to the left of the expected value and one character to the right. For instance, for the coordinate "25° 15' 30"", the minutes value is surrounded by a space on the left and an apostrophe on the right, so the pattern should include them as well: "d° m' s"".
At this point initialization is complete. StringPatternizer has all the data it needs for parsing. There are two methods that perform the parsing. The first one, 'Match', finds the first matching pattern and uses it for parsing. The second one, 'MatchAll', returns all matching patterns with their parsed data. Here is an example:
PatternizationResult pResult = sp.Match("25° 15' 30\"");
...
List<PatternizationResult> pResults = sp.MatchAll("25° 15' 30\"");
The result of parsing is a PatternizationResult object. It has the following properties:
- Exception: 'null' if one of the patterns matched and parsing completed successfully; 'FormatException' if no pattern was found for the specified string value.
- Pattern: 'string.Empty' if no pattern matched; the pattern value if a pattern matched.
- Result: Dictionary<char, object>, where Key is the marker symbol and Value is the extracted value. UPD: for "StringPatternizer2" it is Dictionary<string, object>, where Key is the marker string and Value is the extracted value.
The PatternizationResult class also has the following methods:
- bool MarkerHasValue(char marker): useful for checking whether a specific marker has a value.
- TValue GetMarkerValue<TValue>(char marker): useful for extracting a specific marker's value with the desired type.
UPD: for "StringPatternizer2":
bool MarkerHasValue(string marker)
TValue GetMarkerValue<TValue>(string marker)
Here is an example of how to handle a PatternizationResult:
string location = "25° 15' 30\"";
var pResult = sp.Match(location);
if (pResult.Exception == null)
{
Log.DebugFormat("Value '{0}' matched with pattern '{1}'.", location, pResult.Pattern);
if (pResult.MarkerHasValue('D'))
{
return pResult.GetMarkerValue<double>('D');
}
else
{
var degrees = pResult.GetMarkerValue<int>('d');
var minutes = pResult.GetMarkerValue<int>('m');
var seconds = pResult.GetMarkerValue<double>('s');
return ConvertToDecimalCoordinate(degrees, minutes, seconds);
}
}
else
{
throw pResult.Exception;
}
That's it! Simple enough, I guess.
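The example above calls ConvertToDecimalCoordinate without showing it. A minimal sketch of such a helper could look like the code below. It assumes the sign of the coordinate is carried by the degrees value and that minutes and seconds are non-negative. The class name CoordinateMath is made up for illustration and is not part of the library.

```csharp
using System;

public static class CoordinateMath
{
    // Hypothetical helper (not part of StringPatternizer):
    // converts degrees/minutes/seconds to decimal degrees.
    public static double ConvertToDecimalCoordinate(int degrees, int minutes, double seconds)
    {
        // Take the sign from the degrees component only.
        double sign = degrees < 0 ? -1.0 : 1.0;

        // 1 degree = 60 minutes = 3600 seconds.
        double magnitude = Math.Abs(degrees) + minutes / 60.0 + seconds / 3600.0;

        return sign * magnitude;
    }
}
```

Note that an int cannot represent "-0", so a value such as 0° 30' S would need the hemisphere marker ('S' in the patterns above) to be applied by the caller.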
Points of Interest
Currently the library supports the following data types for parsing: int, double, decimal, float, bool, string. You can easily add a missing type in the StringPatternizer.ConvertToType method.
The library handles the localization issue in decimal values: it does not matter whether the separator is a dot or a comma.
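One common way to get separator-independent parsing is to normalize the comma to a dot and then parse with the invariant culture. The sketch below shows that idea. It is an assumption about how such handling can be done, not the library's actual code, and the names NumberParsing and TryParseDecimalValue are invented for this example.

```csharp
using System;
using System.Globalization;

public static class NumberParsing
{
    // Hypothetical helper: accepts "12.4" as well as "12,4".
    public static bool TryParseDecimalValue(string text, out double value)
    {
        value = 0;
        if (string.IsNullOrWhiteSpace(text))
        {
            return false;
        }

        // Treat a comma as a decimal separator, then parse culture-independently.
        string normalized = text.Trim().Replace(',', '.');

        // NumberStyles.Float does not allow thousands separators,
        // so "1,234.5" (normalized to "1.234.5") is rejected.
        return double.TryParse(normalized, NumberStyles.Float,
                               CultureInfo.InvariantCulture, out value);
    }
}
```

The trade-off of this approach is that inputs using a comma as a thousands separator are rejected rather than guessed at.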
Sometimes it is hard to type a non-English character in a pattern. For such cases the library supports an 'inline character code'. Let's assume we need to parse a string like this: "43�10�12,4\"". The character '�' has the code 65533. Instead of putting this character into the pattern we can use its code: "d{65533}m{65533}s\"". The library will convert the code '{65533}' into the character '�'.
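The expansion of inline character codes can be sketched with a single Regex.Replace call. This only illustrates the idea under the assumption that a code is a decimal number in curly braces. The class name InlineCodes and the method Expand are hypothetical, and the library's real implementation may differ.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InlineCodes
{
    // Matches tokens like {176} or {65533}: one to five ASCII digits in braces.
    private static readonly Regex CodeToken = new Regex(@"\{([0-9]{1,5})\}");

    // Hypothetical helper: replaces each {code} token with the character
    // that has this UTF-16 code.
    public static string Expand(string pattern)
    {
        return CodeToken.Replace(pattern, m =>
        {
            int code = int.Parse(m.Groups[1].Value);

            // Codes above char.MaxValue (65535) are left untouched.
            return code <= char.MaxValue ? ((char)code).ToString() : m.Value;
        });
    }
}
```

With this sketch, Expand("d{176}m's\"") produces the same pattern as typing the degree sign directly.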
One possible improvement would be to use Parallel.ForEach to check the patterns in parallel and increase the speed. For now I decided to keep the code simple so that even a newcomer to C# can understand it.
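To give an idea of what the parallel variant might look like, here is a sketch that checks a list of precompiled Regex objects against one input with Parallel.ForEach. It does not use the library's classes. ParallelMatching and MatchAllIndexes are hypothetical names, and the patterns are assumed to be already converted to Regex instances.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class ParallelMatching
{
    // Hypothetical helper: returns the indexes of all patterns that match
    // the input, sorted in ascending (registration) order.
    public static List<int> MatchAllIndexes(IReadOnlyList<Regex> patterns, string input)
    {
        // ConcurrentBag is safe to fill from several threads at once.
        var hits = new ConcurrentBag<int>();

        // The three-argument overload passes the element index to the body.
        Parallel.ForEach(patterns, (regex, state, index) =>
        {
            if (regex.IsMatch(input))
            {
                hits.Add((int)index);
            }
        });

        // Parallel execution does not preserve order, so sort explicitly.
        return hits.OrderBy(i => i).ToList();
    }
}
```

For a short pattern list the scheduling overhead can easily outweigh the gain, so it is worth measuring before switching to this approach.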
UPD:
Thanks to Emily Heiner's comment, I improved the algorithm by using RegEx internally for extracting the values. This made it simple to implement "markers" as strings instead of characters, and the code became much simpler. So now this library is a kind of wrapper on top of RegEx.
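To illustrate the wrapper idea, the sketch below translates a pattern into a RegEx with one named group per marker and escapes everything else as literal text. This is my assumption of how such a translation can work, not the library's actual code. PatternSketch and ToRegex are hypothetical names, the lazy ".+?" quantifier is a choice made for this example, and marker names are assumed to be non-empty and valid as RegEx group names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class PatternSketch
{
    // Hypothetical translation of a pattern such as "d° m' s\"" into a Regex.
    public static Regex ToRegex(string pattern, IEnumerable<string> markers)
    {
        // Try longer markers first so that "seconds" wins over a shorter "s".
        var ordered = markers.Where(m => m.Length > 0)
                             .OrderByDescending(m => m.Length)
                             .ToList();

        var sb = new StringBuilder("^");
        int i = 0;
        while (i < pattern.Length)
        {
            // Case-sensitive check: does a marker start at position i?
            string marker = ordered.FirstOrDefault(
                m => string.CompareOrdinal(pattern, i, m, 0, m.Length) == 0);

            if (marker != null)
            {
                // A marker becomes a named group that needs at least one character.
                sb.Append("(?<").Append(marker).Append(">.+?)");
                i += marker.Length;
            }
            else
            {
                // Any other character is matched literally.
                sb.Append(Regex.Escape(pattern[i].ToString()));
                i++;
            }
        }
        sb.Append("$");
        return new Regex(sb.ToString());
    }
}
```

For example, ToRegex("d° m' s\"", new[] { "d", "m", "s" }) applied to "25° 15' 30\"" yields the groups d = "25", m = "15" and s = "30"; converting those strings to the registered marker types is a separate step.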
History
03.10.2016 - first version
04.10.2016 - version 2
- use RegEx internally
- "marker" is a string now instead of character
05.10.2016 - fixed a bug with empty RegEx groups by replacing ".*" with ".+"