Comparing DataSets using LINQ

Y Kasan

4.76/5 (11 votes)

13 Oct 2008CPOL5 min read

Comparing DataSets in Visual Studio 2008 using LINQ.

Introduction

One of my recent projects included simple, and I believe, commonly done tasks. We were implementing a project utilizing the Microsoft Visual Studio Professional edition, with C# 3.0, and the task was to compare two snapshots taken from the same table within a particular time interval. The purpose of this task was to determine if any changes where done on the server side within a given time interval.

The first way that came to my mind was to use the DataSet.Merge() method. Ideally, this is how it should work:

At some point of the application initialization, we “grab” the first snapshot of data and load it into the original DataSet (let’s call it dsOriginal). Then, on the timer event, when the given interval has passed, we take a second snapshot (let’s call it dsChanged). The rest should be a fairly routine procedure:

Create an empty DataSet (we will call it dsDifferences).

DataSet dsDifferences = new Dataset();

Merge our dsOriginal with dsChanged.

dsOriginal.Merge(dsChanged);

The last step is to get the modified or added rows using the DataSet.GetChanges() method.
C#
```
dsDifferences = dsOriginal.GetChanges();
```

This is a very “stylish” and simple solution, but not so easy as in concept. We were unpleasantly surprised when we found out that merging datasets did not change the RowSatus property of the rows in the tables of the datasets. Just to make sure that we did nothing wrong, I looked at the MSDN library again, and the MSDN materials confirmed that the:

The Merge method combines rows (equivalent to a SQL Union statement) when no Primary Key is present, or updates rows in the target DataSet using the source DataSet when a Primary Key is defined.

To resolve this issue, we contacted the Microsoft Support team, and after a few days of email exchange, we got the following message:

I think the root of the issue is that Merge does not change the row state. If you create a DataSet, then manually change a row value, then the row state is tweaked and the GetChanges() behaves as expected.”

But Merge does a “dumb” merge and does not set other flags, but I am trying to find documentation to support that. If that is the case, the only option is to do is a manual compare. I will keep you posted on what I find…

As you can see from the message, the Microsoft suggestion was to go manually through both DataSets, comparing values row by row, and manually updating the settings in the resulting DataSet using the Row.SetAdded() or Row.SetModified() methods according to the situation – a solution not very elegant. With the fact that the method provided above is available for developers in Framework 2.0 and later for extra flexibility, it is hard to imagine when these methods can be used in practical terms. After a day or two of additional research and conversations with Microsoft, we received another email from the support team, where we got the clue that the issue cannot be resolved: “I have raised the issue with the appropriate personnel and I am waiting to hear back from them. I will let you know something as soon as I find out more."

The LINQ Solution

That’s when we turned our heads to LINQ. I did not have any extensive experience with this new feature, but in a few hours, found it very easy to learn and apply. There are a few ways using LINQ to achieve the same result. For our purposes, we selected the Union method. Therefore, here is the “moment of Zen”:

During the initialization of our application, we still create an initial snapshot of the data; in the sample below, we will call it dsOriginal.
On the Elapsed event of the timer, we create another snapshot of the current data state; we will call it dsChanged.
To use LINQ, we need to get the tables from both the datasets into an enumerable row collection. This can be done using the AsEnumerable() method (see the code sample below), and we are ready to use the LINQ magic.

Below is the code which completes the task:

var orig = dsOriginal.Tables[0].AsEnumerable(); 
var updated = dsChanged.Tables[0].AsEnumerable(); 
//First, getting new records if any 
var newRec = from u in dsChanged 
where !(from o in orig 
select o.Field<decimal>("PRIMARY_KEY_FIELD")) 
.Contains(u.Field<decimal>(" PRIMARY_KEY_FIELD")) 
select new 
{ 
prim_key = u.Field<decimal>("PRIMARY_KEY_FIELD"), 
field1 = u.Field<decimal>("FIELD1"), 
field2=u.Field<decimal>("FIELD2"), 
field3 = u.Field<decimal>("FIELD3"), 
field4 = u.Field<decimal>("FIELD4"), 
rec_type="A"//Added 
}; 
//Secondly, getting updated records 
var updRec = from u in updated 
join o in orig 
on u.Field<decimal>("PRIMARY_KEY_FIELD") 
equals o.Field<decimal>("PRIMARY_KEY_FIELD")
where (u.Field<decimal>("FIELD1") != 
o.Field<decimal>("FIELD1")) ||
(u.Field<decimal>("FIELD2") != 
o.Field<decimal>("FIELD2")) 
select new 
{ 
prim_key = u.Field<decimal>("PRIMARY_KEY_FIELD"), 
field1 = u.Field<decimal>("FIELD1"), 
field2=u.Field<decimal>("FIELD2"), 
field3 = u.Field<decimal>("FIELD3"), 
field4 = u.Field<decimal>("FIELD4"), 
rec_type = "M"//Mofified 
}; 
var Union = newRec.Union(updRec);

The code snippet is simple and self-explanatory. In the first select statement, we are getting new records by using the Contains() method, preceding it with the “!” operator which gives us only the newly added rows. In the second statement, we get the rows with the updated values using comparison with the where clause, and finally, we compare the results using the Union() method.

You probably noticed the variable rec_type used in both queries. When you need to know which record was modified and which is new (added), the Union operator comes in handy because it allows you to create a custom flag on the DataRow with this information, but if you only need to get the differences without knowing if the row was added or modified, you can get all the data in “one shot”, and the code snippet below demonstrates how:

var AddedAndModif = from u in updated 
where !(from o in orig 
select o.Field<decimal>("PRIMARY_KEY")) 
.Contains(u.Field<decimal>("PRIMARY_KEY")) 
|| !(from o in orig 
select o.Field<decimal>("FIELD1")) 
.Contains(u.Field<decimal>("FIELD1")) 
select new 
{ 
prim_key = u.Field<decimal>("PRIMARY_KEY"), 
field1 = u.Field<decimal>("FIELD1"), 
field2=u.Field<decimal>("FIELD2"), 
field3 = u.Field<decimal>("FIELD3"), 
field4 = u.Field<decimal>("FIELD4"), 
};

As you can see from this example, the Contains() method used twice solves the task and provides you with a collection of both, modified and newly added rows.

In our opinion, among all the other options, this solution appears to be the fastest. It is easy to read and modify as needed, and showed about 20% better performance when compared with the same result achieved by the method suggested by the MS Support professionals provided at the top of this article (comparing row by row and manually changing RowState using the Row.SetAdded() or Row.SetModified() methods).

Since the situation provided in this discussion is common, we decided to share our solution, thinking that it might make other developers' lives easier.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)