How to best compare two sets of multidimensional data?

Sorry, this is not a direct coding question. I'm happy to take this offline with someone or to be pointed to the "best technique". This is not a class assignment.
I'm writing a C# Console application.
I have two sets of data that I need to compare. The data in both sets is of the same type (string), but the sets differ in length, as shown in these two Input Sets.

   Input Set 1                             Input Set 2
1  Apples       Small  Medium  Large    1  Apricot     Small
2  Bananas      Small  Medium  Large    2  Blackberry         Medium  Large
3  Blueberries  Small                   3  Cherries    Small  Medium  Large
4  Cherries     Small  Medium  Large    4  Grapes      Small          Large
5  Grapes       Small          Large    5  Oranges     Small  Medium
6  Pears               Medium           6  Pears       Small  Medium
7  Strawberry   Small  Medium  Large    7  Strawberry  Small  Medium  Large
8  Watermelon                  Large

I can't figure out the most efficient way to compare these two sets. Equal data may exist but be in different rows, a data row may exist in just one Input Set, or the actual values in a row could differ. Is the best technique a looping List or Array compare, a 2D array compare, or something else? I need to compare in a way that produces two Output Sets with the same number of rows that show the differences as follows. I am displaying the data this way because it will be presented in an Excel sheet:

Diff?      Output Set 1                             Output Set 2
yes    1                                        1   Apricot     Small
yes    2   Apples       Small  Medium  Large    2
yes    3   Bananas      Small  Medium  Large    3
yes    4                                        4   Blackberry         Medium  Large
yes    5   Blueberries  Small                   5
no     6   Cherries     Small  Medium  Large    6   Cherries    Small  Medium  Large
no     7   Grapes       Small          Large    7   Grapes      Small          Large
yes    8                                        8   Oranges     Small  Medium
yes    9   Pears               Medium           9   Pears       Small  Medium
no     10  Strawberry   Small  Medium  Large    10  Strawberry  Small  Medium  Large
yes    11  Watermelon                  Large    11

Posted 11-Sep-14 9:12am

Member 9428144

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

BillWoodruff · Answer 1 · 2014-09-11T21:47:00

My interpretation of your question is that you really don't want to see a complete code solution written for you. However, I'll post some code fragments here I think will help you along the way.

My suggestion for a strategy here is:

1. As a first step, write code to analyze the two datasets to determine the duplicate categories. By definition, non-duplicate categories are going to be part of your output of "differences."

Then focus on analyzing the duplicates to determine whether their lists of sizes are the same. In this case the duplicates are:

Cherries, Grapes, Pears, Strawberry

To tease out the non-duplicates:

1. Process the strings in the datasets to eliminate unnecessary white-space:

// code by Jon Skeet to remove multiple spaces from string
// from:  http://stackoverflow.com/a/1280227/133321
// requires: using System.Text.RegularExpressions;
private static readonly Regex MultipleSpaces = new Regex(@" {2,}", RegexOptions.Compiled);

private static string NormalizeWithRegex(string input)
{
    // each run of two or more spaces becomes a single space, so that
    // splitting on ' ' later still separates the category from its sizes
    return MultipleSpaces.Replace(input, " ");
}
// end code by Jon Skeet

After you clean up the datasets, create Lists that will be used to analyze for non-duplicates. Here we'll assume your datasets are in the strings ds1 and ds2:

C#

// requires: using System; using System.Collections.Generic; using System.Linq;
private string ds1, ds2; // the raw text of the two datasets
private List<string> ds1List, ds2List, duplicateCategoryList, nonDuplicateCategoryList, l1SubStrings, l2SubStrings;

private string[] splitCh1 = new string[] {"\r\n"};
private char[] splitCh2 = new char[] { ' ' };

private void MassageData()
{
    ds1 = NormalizeWithRegex(ds1);
    ds2 = NormalizeWithRegex(ds2);

    // one list entry per non-empty line of each dataset
    ds1List = ds1.Split(splitCh1, StringSplitOptions.RemoveEmptyEntries).ToList();
    ds2List = ds2.Split(splitCh1, StringSplitOptions.RemoveEmptyEntries).ToList();

    // sorting may or may not pay off here
    ds1List.Sort();
    ds2List.Sort();
}

To get to the gist of how to tease out the duplicate categories:

1. Loop (for-loop) over the longer of the two Lists, ds1List and ds2List.

2. Use the second string split (splitCh2) on the string at the for-loop index in the longer of the two lists.

3. Pull out the first element [0] of the split result (the category name), and see if that string appears in the dataset (string) that is the source of your shorter list:

a. if it appears, you have a candidate for a duplicate match.

4. You will then have to consider that there may be a category in the shorter list that is not in the longer list. You'll need to write code to check for this, and that check only makes sense while the for-loop index of the longer list is less than the Count of the shorter list.

5. Once you have a list of candidate duplicates, you can then compare their associated category values (sizes), and remove from the duplicates list those whose lists of sizes do not match exactly.
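
The steps above can be sketched roughly as follows. This is only one possible shape, not the code the steps describe literally: it swaps the for-loop over the longer list for a Dictionary lookup keyed on the category name, and it assumes each cleaned-up row is a category name followed by space-separated sizes, with any leading row number already stripped. The names CategoryMatcher, ParseRows and Classify are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CategoryMatcher
{
    // Turns rows such as "Grapes Small Large" into category -> list of sizes.
    // Assumption: a category name appears at most once per dataset.
    public static Dictionary<string, List<string>> ParseRows(IEnumerable<string> rows)
    {
        var result = new Dictionary<string, List<string>>();
        foreach (string row in rows)
        {
            string[] parts = row.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length == 0) continue;            // skip blank rows
            result[parts[0]] = parts.Skip(1).ToList();  // [0] is the category name
        }
        return result;
    }

    // Sorts every category name into "same in both datasets" or "different".
    public static void Classify(
        Dictionary<string, List<string>> set1,
        Dictionary<string, List<string>> set2,
        out List<string> same,
        out List<string> different)
    {
        same = new List<string>();
        different = new List<string>();

        // Union also covers categories that exist in only one dataset.
        foreach (string name in set1.Keys.Union(set2.Keys).OrderBy(n => n, StringComparer.Ordinal))
        {
            List<string> sizes1, sizes2;
            // SequenceEqual is order-dependent; the sizes are assumed to be
            // listed in the same order (Small, Medium, Large) in both datasets.
            if (set1.TryGetValue(name, out sizes1) &&
                set2.TryGetValue(name, out sizes2) &&
                sizes1.SequenceEqual(sizes2))
            {
                same.Add(name);
            }
            else
            {
                different.Add(name);
            }
        }
    }
}
```

Fed the two input sets from the question (without the row numbers), same should come out as Cherries, Grapes and Strawberry, and everything else lands in different, which lines up with the "no" and "yes" rows of the desired output.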

Hope this gets you started; the key idea is to minimize the work you do in list processing, splitting strings, etc.

... edit ... in response to OP's query about what happens if two identical categories have different numbers of size parameters:

In the category Pears in your data, you have a different number of size parameters in the two datasets.

That's an interesting "challenge" because List<T> does not define content-based equality: using == or .Equals on two List<T> objects compares their references, not their contents.

.NET 3.5 and later offer the LINQ SequenceEqual extension method:

http://msdn.microsoft.com/en-us/library/bb348567(v=vs.100).aspx

It compares two Lists for content equality, but it is order-dependent: that means you'd have to sort the two Lists before comparing them.

Try this:

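// requires: using System.Collections.Generic; using System.Linq;
List<string> l1 = new List<string> { "Grape", "Small", "Large" };
List<string> l2 = new List<string> { "Grape", "Large", "Small" };
List<string> l3 = new List<string> { "Grape", "Small" };

l1.Sort();
l2.Sort();
l3.Sort();

bool case1 = l1.SequenceEqual(l2); // true: same items once sorted
bool case2 = l1.SequenceEqual(l3); // false: l3 has fewer items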

Set a break-point and examine the boolean results.

I'd write a function that takes two List<string> and first tests their lengths (the Count property) for equality; if the lengths are equal, I'd sort them and then use SequenceEqual.
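
A minimal sketch of such a function might look like this. The names ListComparer and SameContents are hypothetical, and the ordinal sort on copies (rather than sorting the callers' lists in place) is a choice made for this example only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ListComparer
{
    // Returns true when both lists hold the same strings, ignoring order.
    public static bool SameContents(List<string> a, List<string> b)
    {
        if (a == null || b == null) return a == b;   // equal only if both are null
        if (a.Count != b.Count) return false;        // cheap length test first

        // Sort copies so the callers' lists are left untouched.
        var sortedA = a.OrderBy(s => s, StringComparer.Ordinal);
        var sortedB = b.OrderBy(s => s, StringComparer.Ordinal);
        return sortedA.SequenceEqual(sortedB);
    }
}
```

Testing the Count first means the sort is skipped whenever the lists cannot match anyway, as with the two Pears rows in the question.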

... end edit ...

Afzaal Ahmad Zeeshan · Answer 2 · 2014-09-11T09:29:00

This depends on what you use to store them. If you're using arrays, you can use their indices to check them, for example:

C#

// the rows differ as soon as ANY compared element differs, hence || rather than &&
if (arr[1] != arr2[1] || arr[2] != arr2[2]) {
  // not equal
  return false;
} else {
  // equal
  return true;
}

But remember that arrays throw an IndexOutOfRangeException if you try to read an index that does not exist in the array. In your data, for example, there are many rows where the first array has an index 3 but the second array does not.

The same thing applies to List in the .NET Framework (which throws an ArgumentOutOfRangeException instead), so I guess there is no efficient way of doing this unless you know all the dimensions of the arrays. Your two arrays have different dimensions even in your own question.

You can work out the logic on a sheet of paper first. :-)
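
One way to avoid the exception is to compare the lengths before touching any index. The following is a small sketch under that assumption; RowComparer and RowsEqual are names invented for this example, and each row is assumed to be a string array.

```csharp
using System;

public static class RowComparer
{
    // Compares two rows element by element without ever reading
    // an index that one of the arrays does not have.
    public static bool RowsEqual(string[] row1, string[] row2)
    {
        if (row1.Length != row2.Length) return false;  // different lengths: not equal

        for (int i = 0; i < row1.Length; i++)
        {
            if (row1[i] != row2[i]) return false;      // first mismatch decides
        }
        return true;
    }
}
```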

Good luck,
