Introduction
It's often useful to generate test data when prototyping an application so you can see how it behaves with a lot of real-looking data. That's not the same as simulating traffic nor load testing in a strict sense, but generated data is better than just racking your brains to fill in a handful of records manually as you develop.
There are a number of GUI apps out there that generate data for a target database platform -- such as Red Gate's SQL Data Generator. I tried a couple of them, but what I found missing was a feature aimed at multi-tenant systems to ensure that generated data maintains tenant isolation. This comes into play when randomly looking up foreign key values. Let's say you're designing a multi-tenant system where different tenants have their own list of -- for example -- "order types." When you generate "order" records, the "order types" must come from the order types that belong to a particular tenant. In other words, you can't mix one tenant's order types with another tenant's orders. I never saw a feature like that in the few tools I looked at, and that was a deal-breaker.
Second, I wasn't attached to the GUI experience for test data generation. In other words, this seemed like an opportunity to bring a "code-first" mentality to the problem. This opened some doors in terms of avoiding complex UI commitments, and likewise made it possible to target different back-end databases. The test data generator (TDG from now on) I came up with doesn't depend on any particular database. It doesn't even make database connections! As we'll see, it simply generates runtime objects that you decide how to persist.
In my walkthrough below, I use Postulate ORM because I'm biased, but you can actually use any ORM. If you want step-by-step using Postulate here, follow along in the next section "Setting Up." If you just want to get to the heart of the matter generating test data, skip to "Generating Data" below.
Setting Up
In this walkthrough, we're going to create a console project that demonstrates using my TDG library for a hypothetical sales order system. The system is simplistic, but demonstrates multi-tenant concepts, and highlights some issues you may run into using my TDG.
1. Create a new solution with a console app called "CodeProjectTDG." Add a class library project called "ModelClasses." Set the .NET version to 4.6.1 for both projects, and install nuget packages AoTestDataGen and Postulate.Orm on both projects.
2. Install the Postulate Schema Merge app the GitHub release page.
3. In the post build event for the ModelClasses project, paste this command:
"C:\Users\Adam\AppData\Local\Adam O'Neil Software\Postulate Schema Merge\PostulateMergeUI.exe" "$(TargetPath)"
4. Create app.config files in both projects, and add this connection string:
<connectionStrings>
<add name="default" connectionString="Data Source=(localdb)\MSSqlLocalDb;Database=TdgSample;Integrated Security=true"/>
</connectionStrings>
5. Add model classes to the ModelClasses project from the article's zip file. Your solution explorer should look like this:
6. Build the solution. You should see the Schema Merge app come up, and see a screen like this:
Execute these changes (button in upper right of Schema Merge window), and you're ready to start generating data.
Generating Data
You might surmise from the table names how they relate, but here's a little diagram to orient yourself. This is a multi-tenant design where Organization serves as the tenant root. An Organization has multiple Customers (people who make purchases) and Items (things the organization sells). Customers place Orders which consist of one or more Order Items.
We need to add a little bit of code to our console app Program class. Here I've added two static variables _tdg and _db inside the standard console app boilerplate. For brevity, I haven't pictured the using statements. I also created a reference to the ModelClasses project from the console app.
class Program
{
private static TestDataGenerator _tdg = new TestDataGenerator();
private static TdgDb _db = new TdgDb();
static void Main(string[] args)
{
}
}
Now we can really do something. We have to start with the Organization table since it has no dependencies (foreign keys). Here is our first use of the TDG library:
private static void CreateOrgs()
{
_tdg.Generate<Organization>(10, org =>
{
org.Name = _tdg.Random(Source.CompanyName) + _tdg.RandomInRange(1, 100).ToString();
org.TaxRate = _tdg.RandomInRange(0, 10, i => Convert.ToDecimal(i) * 0.01m);
}, records =>
{
_db.SaveMultiple(records);
});
}
This says, in effect, create 10 Organization instances, and for each one give it a random name followed by a random number between 1 and 100. Also, set the tax rate, to a random value between 0 and 10%. Finally, save the objects using Postulate's SaveMultiple method. If you're using Entity Framework or some other ORM, you would execute the save whatever way was appropriate. (See Generate method in GitHub repo.)
Why did I append a digit to the company name? To make sure it's unique in the table. Organization.Name is the primary key. My list of random company names is rather short in the TDG library, so appending a digit was the simplest way to ensure uniqueness.
What is the _tdg.Random method? This is something you can call when you need to generate a variety of random data when populating a record. There are several overloads of this that we'll cover, but in the usage above, this grabs a random "company name" from TDG's list of made-up company names. The list of usable Source arguments is here.
What is the _tdg.RandomInRange method? This is a way to generate a random int in a range, and -- optionally -- project it into a different format or type according to need. In this case above where I need to generate a random tax rate (of type decimal), I multiply a generated random int by .01 to make it more resemble a tax rate. This depends on the Enumerable.Range static method. Since there is no generic way to generate ranges of different types, I needed to accept a lambda to perform some kind of projection or conversion to a different type. We'll see in a bit how this works with dates.
Why is the save (_db.SaveMultiple) separate from the record generation part? The generated objects are saved in batches. By default, every 100 generated objects triggers a save. You can change the _tdg.BatchSize to change how often a save happens within the generator loop, and indeed there's a case below when we need to do this.
Let's execute this. Add the CreateOrgs() method call to the Main method like this, and run it:
static void Main(string[] args)
{
CreateOrgs();
}
Now let's look in our SQL Server table and see what came out.
Yours should look a little different, but there should be ten rows. (Note that since I marked Organization.Name as the primary key, and Postulate clusters the PK by default, they are presented sorted by name in the absence of an ORDER BY clause.) And yes, my built-in company names are pretty wacky. I welcome more contribution on the GitHub repo, but wackiness is sort of the rule.
Let's move on to the Item table. Here's the method I wrote to fill the Item table.
private static void CreateItems()
{
int[] orgIds = null;
using (var cn = _db.GetConnection())
{
cn.Open();
orgIds = cn.Query<int>("SELECT [Id] FROM [Organization]").ToArray();
}
_tdg.Generate<Item>(120, item =>
{
item.OrganizationId = _tdg.Random(orgIds);
item.Name = _tdg.Random(Source.WidgetName);
item.UnitCost = Convert.ToDecimal(_tdg.RandomInRange(10, 150)) + _tdg.RandomInRange(10, 25, i => i * 0.1m);
item.UnitPrice = item.UnitCost + Convert.ToDecimal(_tdg.RandomInRange(3, 50));
item.IsTaxable = _tdg.RandomWeighted(new TaxableWeighted[]
{
new TaxableWeighted() { IsTaxable = true, Factor = 1 },
new TaxableWeighted() { IsTaxable = false, Factor = 5 }
}, x => x.IsTaxable);
}, records =>
{
_db.SaveMultiple(records);
});
}
Let's break it down. First, I'm getting a list of all Organization.Id values into an array using Dapper's Query<T> method. (Dapper is a Postulate dependency, so it's installed when you install the Postulate.Orm package.)
Next, I do the Item generation itself. I generated 120 rows, where each one....
- has a random Organization.Id via the _tdg.Random method. This overload simply picks a random index from an array.
- has a random "widget name" again using _tdg.Random, using the TDG library's built-in source of widget names.
- has a unit cost between $10 and $150 plus another random amount between $1 and $2.50, just to make the values look a little more "jaggedly" realistic.
- has a unit price that adds a random markup to the cost between $3 and $50.
- is marked as non-taxable 5x more often than not. To implement weighted randomness, where some options are chosen more often than others, you derive a class from IWeighted and set the Factor property according to the "weight" of the option. Finally, you supply a lambda expression (in the case above x => x.IsTaxable) that indicates what exactly you're returning to the property you're setting.
Lastly, I use the same Postulate SaveMultiple method to save records to the database again in batches of 100 by default, just as before with CreateOrgs. I chose the number 120 above just to make sure it worked when I picked a random number not equal to the default TDG batch size.
Run this very much like we did CreateOrgs. Notice I just commented out CreateOrgs(), and added CreateItems():
static void Main(string[] args)
{
CreateItems();
}
I won't take a screenshot here of the SQL Server output, but you should have 120 Item records at this point with all sorts of colorful Item.Names. Let's move onto the Customer table:
private static void CreateCustomers()
{
int[] orgIds = null;
using (var cn = _db.GetConnection())
{
cn.Open();
orgIds = cn.Query<int>("SELECT [Id] FROM [Organization]").ToArray();
}
_tdg.Generate<Customer>(5000, c =>
{
c.OrganizationId = _tdg.Random(orgIds);
c.FirstName = _tdg.Random(Source.FirstName);
c.LastName = _tdg.Random(Source.LastName);
c.Address = _tdg.Random(Source.Address);
c.City = _tdg.Random(Source.City);
c.State = _tdg.Random(Source.USState);
c.ZipCode = _tdg.Random(Source.USZipCode);
c.Email = $"{c.FirstName.ToLower()}.{c.LastName.ToLower()}@{_tdg.Random(Source.DomainName)}";
}, records =>
{
_db.SaveMultiple(records);
});
}
You can probably start to see some patterns here! First, I get a list of all Organization.Ids. Then I use the Generate method, this time with 5,000 records. The record creation is pretty straightforward here I think. The only point of interest is how the email is generated. I had wanted a built-in random email source, but it didn't really work since an email address is normally a function of someone's name in some way. I didn't have an easy way to accept arguments for that on the _tdg.Random method. So, I made the email address more "manually" as a concetantion of the record's FirstName and LastName properties. I do however have a built-in list of domain names.
Run this much like before, commenting out the previous Create methods so we can have a little record of having done so.
static void Main(string[] args)
{
CreateCustomers();
}
Now let's fill the Order table. This will bring up the tenant isolation issue I brought up in the introduction. When we create random Order records, we need to make sure that the CustomerIds of those orders belong to the same Organization as the Order itself.
private static void CreateOrders()
{
int[] orgIds = null;
dynamic[] customerIds = null;
using (var cn = _db.GetConnection())
{
cn.Open();
orgIds = cn.Query<int>("SELECT [Id] FROM [Organization]").ToArray();
customerIds = cn.Query("SELECT [OrganizationId], [Id] FROM [Customer]").ToArray();
}
_tdg.Generate<Order>(7000, ord =>
{
ord.OrganizationId = _tdg.Random(orgIds);
ord.CustomerId = _tdg.Random(customerIds, item => item.Id, item => item.OrganizationId == ord.OrganizationId);
ord.Number = _tdg.RandomFormatted("AA000-A0000");
ord.Date = _tdg.RandomInRange(0, 2000, i => new DateTime(2013, 1, 1).AddDays(i));
}, records =>
{
_db.SaveMultiple(records);
});
}
First, I get a list of all the Organization.Ids into an array. Then, I get all the Customer.Ids as well, but I need to include the OrganizationId for each customer. I did that as dynamic because I was too lazy to create a type, and since it would be scoped to this method only, I really didn't need a type. Again, when I'm building a random order, I need some way to limit the choice of customer to the organization in scope when the record is created.
Then I call the Generate method for 7,000 records. (I wanted some customers to have more than one order -- hence the larger order count than customer count.) Breaking it down:
- Choose a random index from orgIds
- Choose a random index from customerIds, and in particular the Id property of customerIds, but filter the possible choices so that only those customers with an OrganizationId that was just generated for this record can be used. Implementation here.
- Generate a random alpha-numeric string that looks sort of like an order number you'd see somewhere. I created a "poor-man's regex" format for generating different types of random strings. See implementation on GitHub.)
- Generate a random date starting from 1/1/2013 extending up to 2,000 days in the future. Yes there will be future dates in here, but it didn't matter for my purposes.
Lastly, I use SaveMultiple as before, then run it like so:
static void Main(string[] args)
{
CreateOrders();
}
One thing we want to check right away is that the orders are properly isolated by organization. In other words, we want to ensure that there are no customers with orders in another organization. I wrote this query to check. I was looking for records where the order organization did not match the customer organization. There were none, so that's good!
We're down to the last table OrderItem. This one required some slightly different handling. I'll just show you the code, then explain:
private static void CreateOrderItems()
{
int[] orgIds = null;
dynamic[] items = null;
dynamic[] orders = null;
using (var cn = _db.GetConnection())
{
cn.Open();
orgIds = cn.Query<int>("SELECT [Id] FROM [Organization]").ToArray();
items = cn.Query("SELECT [OrganizationId], [Id] FROM [Item]").ToArray();
orders = cn.Query("SELECT [OrganizationId], [Id] FROM [Order]").ToArray();
_tdg.BatchSize = 1;
foreach (var order in orders)
{
_tdg.Generate<OrderItem>(1, 7, oi =>
{
do
{
oi.OrderId = order.Id;
oi.ItemId = _tdg.Random(items, item => item.Id, item => item.OrganizationId == order.OrganizationId);
oi.Quantity = _tdg.RandomInRange(1, 25).Value;
oi.UnitPrice = _db.Find<Item>(cn, oi.ItemId).UnitPrice;
oi.ExtPrice = oi.Quantity * oi.UnitPrice;
} while (cn.Exists("[OrderItem] WHERE [OrderId]=@orderId AND [ItemId]=@itemId", new { orderId = oi.OrderId, itemId = oi.ItemId }));
}, records =>
{
_db.SaveMultiple(cn, records);
});
}
}
}
As before, I get some seed data -- all the Organization.Ids, followed by items and orders. Notice that these two are dynamic[] because I was too lazy to make a type, and I didn't really need one anyway. However as before I do need to know which OrganizationIds the items and orders have, so dynamic is a decent choice to hold both the Id and OrganizationId properties for each item and order.
For the Generate part, notice I set the BatchSize to 1. This causes each OrderItem to be saved one at a time instead of in larger batches. You'll see in a moment why.
Instead of generating an absolute number of records, I looped through the orders and created a random number of items (between 1 and 7) for each order. I wanted a realistically uneven number of items per order. If I'd generated simply an absolute number of OrderItems, then they would tend to level out over the set of orders.
Notice that the OrderItem creation is inside a do....while. I did that because of the primary key on OrderItem. You can't have the same item duplicated on the same order, but TDG doesn't have any kind of rule against duplication. The best approach I found was to do it inside of a loop, checking for existence of that key combination. The do loop continues as long as the key combination is in use. I use the Postulate.Exists method, which is actually a wrapper around a Dapper method, to tell if the OrderId + ItemId combination is in use or not. Indeed, the reason I wrapped everything in a using connection block was so I could use the already-open connection to leverage the Exists method.
For that Exists check to work right, I needed each OrderItem record to be saved one at a time. Otherwise, you would end up caching some duplicates in memory before Exists has a chance to catch them at the physical table level. I considered doing something with dictionary keys and adding some kind of duplicate checking in TDG, but no simple option jumped out at me.
Notice as before that when I set the OrderItem.ItemId, I'm filtering the itemId choices according to the organizationId in scope during the creation of the record. This keeps itemIds properly isolated by organization.
oi.ItemId = _tdg.Random(items, item => item.Id, item => item.OrganizationId == order.OrganizationId);
I verified this at the end much like I did with the Order table. I needed to ensure that no items and orders from different organizations didn't exist:
Update: GenerateUnique method
When generating OrderItem records, I had to go to some extra lengths to ensure uniqueness of the record before saving it. I realized that will be a pretty common situation, so I added a new GenerateUnique method that encapsulates the logic I used in the example above -- namely by adding an exists Func argument. This allows you to check whether the record you generated already exists before trying to save it. If it does, TDG simply keeps generating records until it comes to one that doesn't exist. If after 100 attempts, a unique record couldn't be generated, an exception is thrown. This was to prevent an infinite loop, which I encountered when testing. (In my case, I had to go back and generate more Item records so that I had a bigger pool to select from.)
TModel record = new TModel();
int attempts = 0;
const int maxAttempts = 100;
do
{
attempts++;
if (attempts > maxAttempts) throw new Exception($"Couldn't generate a random {typeof(TModel).Name} record after {maxAttempts} tries. Exception was thrown to prevent infinite loop.");
create.Invoke(record);
} while (exists.Invoke(connection, record));
save.Invoke(record);
Notice the exists argument requires a connection. I declared it as IDbConnection to be as platform-neutral as possible. I considered trying to do it in a completely database-free way by keeping key values in memory some kind of way. I felt that was going to be overly complicated. Since these generated records are intended to end up in a database anyway, I felt there was nothing wrong with assuming that an IDbConnection would be available. Here's what the re-written code looks like in the CreateOrderItems method:
_tdg.GenerateUnique<OrderItem>(cn, 1, 7, oi =>
{
oi.OrderId = order.Id;
oi.ItemId = _tdg.Random(items, item => item.Id, item => item.OrganizationId == order.OrganizationId);
oi.Quantity = _tdg.RandomInRange(1, 25).Value;
oi.UnitPrice = _db.Find<Item>(cn, oi.ItemId).UnitPrice;
oi.ExtPrice = oi.Quantity * oi.UnitPrice;
}, (connection, record) =>
{
return connection.Exists(
"[OrderItem] WHERE [OrderId]=@orderId AND [ItemId]=@itemId",
new { orderId = record.OrderId, itemId = record.ItemId });
}, (record) =>
{
_db.Save(record);
});
It's really the very same logic as before, but the existence check is now baked into the method -- so you don't have to remember to set the BatchSize to 1 nor write the while loop yourself. Moreover, you also get protection from an infinite loop in case you don't have enough seed data to assure uniqueness.
Conclusion
I really enjoyed working on this, since the test data generation problem had really bugged me for some time. In particular, it was liberating to approach it as a code-first problem instead of a GUI app problem. It would've been really complicated to support lambda-like things and different kinds of loops and conditions in a GUI app that off-the-shelf test data generation tools either have to do, or never get around to doing. I hope you can find this useful also. Check out the GitHub repo and contribute or raise an issue if you have a suggestion.