Introduction
Importing data from an external system is a common task that almost all programmers have to deal with. One of the most must frustrating aspects of this task is reconciling the structure of the foreign data with your own. very often, the data comes from a legacy system that has no notion of relations and simply dumps the data in a flat format. This article demonstrates one possible technique that can be used if you have an established system with separate business classes.
Background
When importing data, I always find myself caching certain pieces of information in order to speed up the import process and add a safety net for myself. For example, many systems not only export the active records, but ones that have been "deleted" or, more precisely, deactivated. This can lead to a large amount of data duplication and a very confusing situation for someone trying to import it. It would be nice if one record in the foreign source corresponded to one record in your database, but that will rarely be the case.
For the sake of simplicity, I will assume the external data source is a CSV file. This is the de-facto standard when all else fails and, in my experience, is a prime example of a format that is misused and often poorly formatted. I will also be using the standard Customer -> Order -> OrderLineItem example that is common in many data discussions. While using these simple standards may seem like a cop-out, I know from experience they represent not only an extremely common scenario, but one that is complex enough to validate the solution. Please note the code examples are truncated for brevity.
Step 1 - Take Care of the Basics
OK, let's start out very simply. We are going to assume that all we are doing is importing customers. Our business class looks like this:
public class Customer
{
public int CustomerID { get; set; }
public string CustomerName { get; set; }
}
And our data file looks like this:
ID,Name,City,State,Country,IsHQ,Employees
42,Acme Inc,Boston,MA,USA,0,50
28,Acme Inc,Denver,CO,USA,1,100
87,Foo Corp,Topeka,KS,USA,1,20
99,Acme Inc,Albany,NY,USA,0,50
31,Foo Corp,Newark,NJ,USA,0,10
I'm sure many of you see the problem right away. Our file contains multiple corporate branches for each company, but we only care about storing the name once. This means we can't simply import the rows on a one-to-one basis. Since there could be dozens or even hundreds of branch locations, we don't want to bother processing all the duplicate records. Because there is no way to predict how many unique companies and branches will be in the file, our only choice is to read the customer name every time. This obviously becomes the piece of data we want to cache; once "Acme" has been processed and stored, we can skip the rest of the "Acme" records. One easy solution is shown below:
List<string> customerNames = new List<string>();
IDataReader reader = ...
while (reader.Read())
{
string custName = reader.GetString(1);
if (!customerNames.Contains(custName))
{
customerNames.Add(custName);
}
}
This approach works fine when you are only importing a couple unique entities (i.e., classes). But what if we had 5, 10, or even 20 pieces of data per record? It's not really practical to keep 20 List
s around to cache all the different names. This is where generics and my DataLoadCache
class comes in.
public static class DataLoadCacheV1<T>
{
private static List<string> _names = new List<string>();
public static bool ContainsName(string name)
{
return _names.Contains(name);
}
public static void StoreName(string name)
{
_names.Add(name);
}
}
At first glance, it seems like we haven't accomplished much, but think about what kinds of "Ts" (generic types) we are going to use; yep, our existing business classes. Because we have unique classes defined and DataLoadClass
is static
, we don't have to worry about creating a unique cache instance for every business class we import. The compiler will create a unique DataLoadCache
class every time we use a new "T
". Our simple example becomes:
IDataReader reader = ...
while (reader.Read())
{
string custName = reader.GetString(1);
if (!DataLoadCacheV1<Customer>.ContainsName(custName))
{
DataLoadCacheV1<Customer>.StoreName(custName);
}
}
Notice we did not declare a List
or any local caching instances. OK, now that we have gotten the basics out of the way, we can move on to the good stuff; stay with me, I promise it will get better.
Step 2 - Adding Orders to the Mix
Keeping in mind that we are assuming to be working with a flat, legacy data format, we are now going to tackle importing the Orders. The relevant parts of our import file are now:
ID,Name,OrderNumber
42,Acme Inc,1234
87,Foo Corp,5555
31,Foo Corp,1234
28,Acme Inc,5678
As you can see, order number 1234 is repeated showing that order numbers are not unique. When we add these orders to our system, we want to make sure we have quick access to the new database IDs for the orders, not just the order number. Assume our system associates Order
s to Customer
s through a simple foreign key.
public class Order
{
public int CustomerID { get; set; }
public int OrderID { get; set; }
public string OrderNumber { get; set; }
}
Since we need to store an association, we must expand the DataLoadCache
class to handle ID lookups.
public static class DataLoadCacheV2<T>
{
private static Dictionary<string, int> _nameIDMaps =
new Dictionary<string, int>();
public static bool ContainsIDByName(string name)
{
return _nameIDMaps.ContainsKey(name);
}
public static void StoreIDByName(string name, int ID)
{
_nameIDMaps.Add(name, ID);
}
public static int GetIDByName(string name)
{
return _namesIDMaps[name];
}
}
Storing just the customer names is irrelevant now, so the methods from DataLoadCacheV1
have been replaced. We now store the actual database ID of the Customer
we added. Now we have knowledge of what customer names have been processed and what the corresponding IDs are. This comes in handy when adding the Order
s to the database.
IDataReader reader = ...
while (reader.Read())
{
string custName = reader.GetString(1);
if (!DataLoadCacheV2<Customer>.ContainsIDByName(custName))
{
int newCustID =
DataLoadCacheV2<Customer>.StoreIdByName(custName, newCustID);
}
int custID = DataLoadCacheV2<Customer>.GetIDByName(custName);
Order order = new Order()
{
CustomerID = custID,
OrderNumber = ...
};
}
Step 3 - Adding a One to Many Relationship
The last example showed why it's important to store the ID of the entity we are importing, not just the fact that it had already been processed. Since we know there are OrderLineItem
s on an Order
, we need to add a way to look up the OrderID
we just added.
public static class DataLoadCacheV3<T, U>
{
private static Dictionary<int, Dictionary<string, int>> _foreignIDsByName =
new Dictionary<int, Dictionary<string, int>>();
public static void StoreForeignIDByName(int primaryID,
string foreignName, int foreignID)
{
_foreignIDsByName[primaryID].Add(foreignName, foreignID);
}
public static bool ContainsForeignIDByName(int primaryID, string foreignName)
{
return _foreignIDsByName.ContainsKey(primaryID)
&& _foreignIDsByName[primaryID].ContainsKey(foreignName);
}
public static int GetForeignIDByName(int primaryID, string foreignName)
{
return _foreignIDsByName[primaryID][foreignName];
}
}
All we are doing here is tracking the associations between the string representations of the data we have (CustomerName
and OrderNumber
) and the new database IDs we created. The example below will make this more clear:
IDataReader reader = ...
while (reader.Read())
{
string custName = reader.GetString(1);
if (!DataLoadCacheV3<Customer>.ContainsIDByName(custName))
{
int newCustID = ...
DataLoadCacheV3<Customer>.StoreIDByName(custName, newCustID);
}
int custID = DataLoadCacheV3<Customer>.GetIDByName(custName);
string orderNumber = reader.GetString(2);
if (!DataLoadCacheV3<Customer, Order>.ContainsForeignIDByName(custID, orderNumber))
{
int newOrderID = ...
DataLoadCacheV3<Customer, Order>.StoreForeignIDByName(custID,
orderNumber, newOrderID);
}
int orderID = DataLoadCacheV3<Customer, Order>.GetForeignIDByName(custID, orderNumber);
}
By adding a Dictionary
of Dictionary
s to the DataLoadCache
, we are allowing multiple layers of mapping to take place. The DataLoadCache
class doesn't have to worry about what kinds of names and IDs we are passing, because the generic type parameters ensure we will get a unique static class for each combination of types we use.
A Quick Note
I realize that up to this point much of this has been boring and tedious. If you have gotten this far and you are confused, I encourage you to start at the beginning and try to see how using the generic types in the last example opens up a world of possibilities and takes the burden of maintaining the various cache Dictionary
instances yourself. By simply writing the line of code: DataLoadCachev3<Foo,Bar>
, you create two custom lookup mechanisms (plus many more to come). Not only that, but the code is very readable, and there can be no confusion that you are relating Foos to Bars, unlike a local variable which can have any name.
Step 4 - Tracking Sets of IDs
The previous example assumed a given OrderNumber
would only appear once per customer. When we start to consider OrderLineItem
s in our flat data format, we see this assumption is not valid. Again, only showing the relevant parts of the data file:
ID,Name,OrderNumber,LineItemNumber,Quantity
42,Acme Inc,1234,1,53
42,Acme Inc,1234,2,91
31,Foo Corp,1234,2,62
31,Foo Corp,1234,1,88
42,Acme Inc,1234,3,57
And our existing OrderLineItem
class:
public class OrderLineItem
{
public int OrderID { get; set; }
public int OrderLineItemID { get; set; }
public string LineItemNumber { get; set; }
public int Quantity { get; set; }
public double Discount {get; set; }
}
Let's assume for a moment that some sort of further processing is required on the OrderLineItem
s. It could be another import file, or just setting some post-import data. To do it right, we need to keep track of what OrderLineItem
IDs belong to which order. This is easily solved by adding the ability to handle List
s of IDs to DataLoadCache
.
public static class DataLoadCacheV4<T, U>
{
private static Dictionary<int, List<int>> _foreignIDLists =
new Dictionary<int, List<int>>();
public static void StoreForeignIDInList(int primaryID, int foreignID)
{
if (!_foreignIDLists.ContainsKey(primaryID))
{
_foreignIDLists.Add(primaryID, new List<int>());
}
_foreignIDLists[primaryID].Add(foreignID);
}
public static bool ContainsForeignIDInList(int primaryID, int foreignID)
{
return _foreignIDLists.ContainsKey(primaryID) &&
_foreignIDLists[primaryID].Contains(foreignID);
}
public static List<int> GetForeignIDsInList(int primaryID)
{
return _foreignIDLists[primaryID];
Step 5 - A Complete Example
Hopefully by now, I have done a good enough job getting the point across. This final example shows a full implementation of the DataLoadCache
class. Obviously, the methods in the class are specialized for this article, but you should see how easily you can create new methods to handle almost any relationship between your data classes.
The example uses two import sources to show how you can solve difficult situations with ease. Assume the second file contains discounts, but they are broken down by company branch which we do not track (see example 1). Also, we track discounts on a line item basis, so we will need to break up the discount amount equally among the line items. Just to make it more fun, assume they are not giving us the company name and just the branch ID.
First Data File
ID,Name,OrderNumber,LineItemNumber,Quantity
42,Acme Inc,1234,1,53
42,Acme Inc,1234,2,91
31,Foo Corp,1234,2,62
31,Foo Corp,1234,1,88
42,Acme Inc,1234,3,57
Second Data File
ID,OrderNumber,Discount
42,1234,50
99,5678,67
31,1234,31
28,1234,10
-
public static class DataLoadCacheV5<T, U>
{
private static Dictionary<string, int> _namesIDMaps =
new Dictionary<string, int>();
public static bool ContainsIDByName(string name)
{
return _namesIDMaps.ContainsKey(name);
}
public static void StoreIDByName(string name, int id)
{
_namesIDMaps.Add(name, id);
}
public static int GetIDByName(string name)
{
return _namesIDMaps[name];
}
private static Dictionary<int, int> _externalIDMaps =
new Dictionary<int, int>();
public static void StoreIDByExternalID(int externalID, int primaryID)
{
_externalIDMaps[externalID] = primaryID;
}
public static bool ContainsIDByExternalID(int externalID)
{
return _externalIDMaps.ContainsKey(externalID);
}
public static int GetIDByExternalID(int externalID)
{
return _externalIDMaps[externalID];
}
private static Dictionary<int, Dictionary<string, int>>
_foreignIdsByName = new Dictionary<int, Dictionary<string, int>>();
public static void StoreForeignIDByName(int primaryID,
string foreignName, int foreignID)
{
_foreignIdsByName[primaryID].Add(foreignName, foreignID);
}
public static bool ContainsForeignIDByName(int primaryID, string foreignName)
{
return _foreignIdsByName.ContainsKey(primaryID)
&& _foreignIdsByName[primaryID].ContainsKey(foreignName);
}
public static int GetForeignIdByName(int primaryID, string foreignName)
{
return _foreignIdsByName[primaryID][foreignName];
}
private static Dictionary<int, List<int>> _foreignIDLists =
new Dictionary<int, List<int>>();
public static void StoreForeignIDInList(int primaryID, int foreignID)
{
if (!_foreignIDLists.ContainsKey(primaryID))
{
_foreignIDLists.Add(primaryID, new List<int>());
}
_foreignIDLists[primaryID].Add(foreignID);
}
public static bool ContainsForeignIDInList(int primaryID, int foreignID)
{
return _foreignIDLists.ContainsKey(primaryID) &&
_foreignIDLists[primaryID].Contains(foreignID);
}
public static List<int> GetForeignIDsInList(int primaryID)
{
return _foreignIDLists[primaryID];
}
}
-
public void TestData5()
{
IDataReader reader = ...
while (reader.Read())
{
int branchID = reader.GetInt32(0);
string custName = reader.GetString(1);
if (!DataLoadCacheV5<Customer,Customer>.ContainsIDByName(custName))
{
int newCustID = ...
DataLoadCacheV5<Customer,Customer>.StoreIDByName(custName, newCustID);
}
int custID = DataLoadCacheV5<Customer,Customer>.GetIDByName(custName);
if (!DataLoadCacheV5<Customer,Customer>.ContainsIDByExternalID(branchID))
{
DataLoadCacheV5<Customer,Customer>.StoreIDByExternalID(branchID, custID);
}
string orderNumber = reader.GetString(5);
if (!DataLoadCacheV5<Customer, Order>.ContainsForeignIDByName(custID, orderNumber))
{
int newOrderID = ...
DataLoadCacheV5<Customer,Order>.StoreForeignIDByName(
custID, orderNumber, newOrderID);
}
int orderID = DataLoadCacheV5<Customer,Order>.GetForeignIdByName(custID, orderNumber);
string orderLineItemNumber = reader.GetString(6);
if (!DataLoadCacheV5<Order,OrderLineItem>.ContainsForeignIDByName(
orderID, orderLineItemNumber))
{
int newOrderLiID = ...
DataLoadCacheV5<Order, OrderLineItem>.StoreForeignIDByName(
orderID, orderLineItemNumber, newOrderLiID);
}
int orderLineItemID =
DataLoadCacheV5<Order, OrderLineItem>.GetForeignIdByName(
orderID, orderLineItemNumber);
DataLoadCacheV5<Order, OrderLineItem>.StoreForeignIDInList(
orderID, orderLineItemID);
}
IDataReader reader2 = ...
while (reader2.Read())
{
int branchID = reader2.GetInt32(0);
int custID = DataLoadCacheV5<Customer, Customer>.GetIDByExternalID(branchID);
string orderNumber = reader2.GetString(1);
int orderID =
DataLoadCacheV5<Customer, Order>.GetForeignIdByName(custID, orderNumber);
List<int> orderLineItemIDs = DataLoadCacheV5<Order,
OrderLineItem>.GetForeignIDsInList(orderID);
double totalDiscount = reader.GetDouble(3);
double discountPart = totalDiscount / (double)orderLineItemIDs.Count;
foreach (int orderLineItemID in orderLineItemIDs)
{
}
}
}
Conclusion
It's important to note that every time you use the DataLoadCache
class with different types, you are actually creating a completely new class. It's basically the same as if you hand typed all the specialized versions you would need (Customer
-cache, Order
-cache, etc,). If you do any kind of data importing, or even sequential processing, I hope you find this article useful.
If you are asking yourself "What's the point, why don't I just stick the data in the database and query it when I need it?", I ask you to consider if that solution scales and how well it will work on a slow network connection. My solution attempts to utilize the resources on the client machine (namely processing power and RAM) to keep the calls to the database down to a minimum.