My interpretation of your question is that you don't want a complete code solution written for you, so I'll post some code fragments that I think will help you along the way.
My suggestion for a strategy here is:
1. as a first step: write code to analyze the two datasets and determine the duplicate categories. By definition, the non-duplicate categories are going to be part of your output of "differences."
2. then focus on analyzing the duplicates to determine whether their lists of sizes are the same. In this case the duplicates are:
Cherries Grapes Pears Strawberry
To tease out the non-duplicates:
1. process the strings in the datasets to eliminate unnecessary white-space:
private static readonly Regex MultipleSpaces = new Regex(@" {2,}", RegexOptions.Compiled);
private static string NormalizeWithRegex(string input)
{
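// Collapse each run of two or more spaces to ONE space; replacing with "" would glue words together.
return MultipleSpaces.Replace(input, " ");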
}
After you clean up the datasets, create the Lists you'll use to find the non-duplicates. Here we'll assume your datasets are in the strings ds1 and ds2:
// ds1 and ds2 are assumed to already hold the raw text of the two datasets.
private string ds1, ds2;
private List<string> ds1List, ds2List, duplicateCategoryList, nonDuplicateCategoryList, l1SubStrings, l2SubStrings;
private string[] splitCh1 = new string[] {"\r\n"};
private char[] splitCh2 = new char[] { ' ' };
private void MassageData()
{
ds1 = NormalizeWithRegex(ds1);
ds2 = NormalizeWithRegex(ds2);
ds1List = ds1.Split(splitCh1, StringSplitOptions.RemoveEmptyEntries).ToList<string>();
ds2List = ds2.Split(splitCh1, StringSplitOptions.RemoveEmptyEntries).ToList<string>();
ds1List.Sort();
ds2List.Sort();
}
To get to the gist of how to tease-out the duplicate categories:
1. parse (for-loop) the longer of the two Lists, ds1List, ds2List
2. use the second string split (splitCh2) on the string at the for-loop index in the longer of the two lists.
3. pull-out the first element [0] of the split list (the category name), and see if that string appears in the dataset (string) which is the source of your shorter list:
a. if it appears: you have a candidate for a duplicate match.
4. you will then have to consider that there may be a category in the shorter list that is not in the longer list: you'll need to write code to check for that, and run the check only while the for-loop index of the longer list is less than the Count of the shorter list.
5. once you have a list of candidate duplicates, you can then compare their associated category values (sizes), and take out of the duplicates list those whose lists of sizes do not match exactly.
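The steps above can also be sketched with a Dictionary lookup in place of the for-loop and substring search. Note that this is a swapped-in technique, not the loop described above, and it assumes each cleaned-up line has the form "Category Size1 Size2 ...". The names CategoryDiff, Parse, SharedCategories and UniqueCategories are hypothetical, made up for this illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CategoryDiff
{
    // Turns lines of the assumed form "Category Size1 Size2 ..." into a
    // dictionary: category name -> list of sizes.
    public static Dictionary<string, List<string>> Parse(IEnumerable<string> lines)
    {
        var result = new Dictionary<string, List<string>>();
        foreach (string line in lines)
        {
            string[] parts = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length == 0) continue;          // skip blank lines
            result[parts[0]] = parts.Skip(1).ToList(); // a repeated category overwrites the earlier entry
        }
        return result;
    }

    // Categories that appear in both datasets: the candidate duplicates.
    public static List<string> SharedCategories(
        Dictionary<string, List<string>> a, Dictionary<string, List<string>> b)
    {
        return a.Keys.Where(k => b.ContainsKey(k))
                     .OrderBy(k => k, StringComparer.Ordinal)
                     .ToList();
    }

    // Categories that appear in exactly one dataset: always part of the "differences".
    public static List<string> UniqueCategories(
        Dictionary<string, List<string>> a, Dictionary<string, List<string>> b)
    {
        return a.Keys.Except(b.Keys)
                     .Concat(b.Keys.Except(a.Keys))
                     .OrderBy(k => k, StringComparer.Ordinal)
                     .ToList();
    }
}
```

Because the lookup is by key, it does not matter which dataset is longer, so the special case in step 4 goes away. You would still compare the size lists of each shared category afterwards (step 5).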
Hope this gets you started; the key idea is to minimize the work you do in list processing, splitting strings, etc.
... edit ... in response to OP's query about what happens if two identical categories have different numbers of size parameters:
In the category Pears in your data, the two datasets have a different number of size parameters.
That's an interesting "challenge" because List<T> itself does not provide a built-in equality comparison for the contents of two Lists. Using == or .Equals will compare references to the Objects, not their contents.
.NET 3.5 offers you the LINQ SequenceEqual extension method:
http://msdn.microsoft.com/en-us/library/bb348567(v=vs.100).aspx
It compares two Lists for content equality, but it is order-dependent: that means you'd have to sort the two Lists before comparing them.
Try this:
List<string> l1 = new List<string> { "Grape", "Small", "Large" };
List<string> l2 = new List<string> { "Grape", "Large", "Small" };
List<string> l3 = new List<string> { "Grape", "Small" };
l1.Sort();
l2.Sort();
l3.Sort();
bool case1 = l1.SequenceEqual(l2);
bool case2 = l1.SequenceEqual(l3);
Set a break-point and examine the boolean results.
I'd write a function that takes two List<string> and first tests their lengths (.Count property) for equality; if the lengths are equal, I'd sort them and use SequenceEqual.
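A minimal sketch of such a function might look like this. The names ListComparer and HaveSameContents are hypothetical, and sorting copies (rather than the caller's Lists) with an ordinal comparer is my own assumption about what you'd want:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ListComparer
{
    // Returns true when both lists hold the same strings, ignoring order.
    public static bool HaveSameContents(List<string> first, List<string> second)
    {
        // Cheap test first: different lengths can never match.
        if (first.Count != second.Count) return false;

        // Sort copies so the caller's lists are not reordered.
        var a = new List<string>(first);
        var b = new List<string>(second);
        a.Sort(StringComparer.Ordinal);
        b.Sort(StringComparer.Ordinal);

        // SequenceEqual is order-dependent, hence the sort above.
        return a.SequenceEqual(b);
    }
}
```

With the three Lists from the example above, HaveSameContents(l1, l2) is true and HaveSameContents(l1, l3) is false, and in the second case the Count test returns before any sorting happens.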
... end edit ...