ADO.NET : Some internals uncovered - Part 2

Abhishek Sur

4.91/5 (4 votes)

1 Jan 2011CPOL7 min read

13.6K

ADO.NET : Some internals uncovered

For the last 2 days, I thought of unleashing few of the hidden facts of Data Storage. It is true that the introduction of the .NET Framework without introducing Generic in the first place is one of the biggest mistakes. Because of Non-Generic data structure, a large amount of data is being unnecessarily boxed and unboxed without any reason as such. After Generics was introduced with .NET 2.0, most of these classes which made you do type conversions is either been totally depreciated or few of them rarely used as is there for backward compatibility. But for every programmer, the most common class called DataTable still exists without any change. In this post, we will try to see the actual implementation of DataTable as a whole, so that we could understand what is going on when we store data into it.

Introduction

DataTable is an application object that maps to a Database Table. It holds a collection of DataRows and DataColumns where DataRow represents the actual Data while the DataColumns holds Integration rules. Basically, we all know the basics of DataTable and often use it for our regular work.

DataSet holds a list of DataTable, and DataRelation. DataTable as being said is the storage of Data in the form of a Table with Row and Columns, the DataRelation holds the Relationship rules between two tables in a DataSet. Hence ADO.NET has been structured in such a way that you can have the flavour of the entire database from your application end.

But, one thing that you must keep in mind is that these objects are created way back in time of .NET 1.0 release or even before when there was no Generics or even the concept of State Machines. Generics allows you to pass type of an object during actual object creation and hence will allow you to create Type information based on the object which we create in real time, as a result, it eliminates unnecessary runtime boxing and unboxing. Keeping this in mind, I thought I should check how these classes are made in Framework. Let's look at each of them, one by one in this post and let's identify the pros and cons.

DataSet

DataSet is the application object that can hold the entire structure of the database schema. Basically, internally it is a serializable object which holds two sets of collection, DataTableCollection and DataRelationCollection where the former is an implementation of an ArrayList of DataTables and the later is a list of DataRelation. DataSet can be Cloned (produces another object with New tables from existing schema) or Copied (Produces another object with same tables).

Some of the important benefits of DataSet

Easily Serializable, even could be written to XML File using WriteXML or can be read to the Object.
Allows you to Merge other DataSet into it easily.
Can hold Relation and Tables into it.

Problems & best implementation of DataSet

Can only hold a collection of DataTable, and it does not allow us to define our own data structures and produce custom DataRelation among them.
No Generic support, use of ArrayList means each element has to be upcast to object when being stored and again to actual element (DataTable) when being retrieved.

DataTable

DataTable is the primary object which holds the Data. It holds a collection of DataColumn and DataRow. Each of the DataRow holds the data element while DataColumn holds the rules between each columns. The DataRowCollection internally represents an ArrayList of DataRow while DataColumnCollection is on DataColumns.

To hold the DataRows, a DataTable maintains an internal sealed class called RecordManager. Everytime you create a DataRow element inside a DataTable class, the object is stored into RecordManager. The DataRowCollection holds a list of DataRow. Here RBTree class implemented as Generic list of DataRow objects is used.

Some important benefits of DataTable

DataTable can store any Table even if the data is complex
Serializable, so can be ported in communication medium easily
Keeps track of indexed field

Problems of DataTable

Most of the objects are non-generic, and hence leading to conversion of types
Very bulky

DataRow

DataRow on the other hand holds the actual data. It has an indexer to the object which lets you get or set data to it.

Internals to ADO.NET Continued....

Now let us take a look at how the data is being stored into these data structures and later retrieved from it. Say you have opened a connection to database and used Adapter.Fill to fill data into a DataTable. Now let's examine what internally happens to store Data into the DataTable and also retrieving the same when required. Say the code I write to retrieve the Data is as below:

SqlDataAdapter adpt = new SqlDataAdapter("Select * from Table", conn);
adpt.Fill(DTable);

Now when you call Fill method, it actually invokes the external SQL server process to load the Data into memory and parses the loaded memory into the existing data structure. If you have already read my last post on internals of ADO.NET, you should be clear how the DataRows are created using DataReader and put into DataRowCollection. In this post, I will cover from there, how the internal DataRow is parsed.

The DataReader loads each DataRow into memory and takes you through with calling LoadDataRow method of SchemaMapping. If you see into Reflector, the code for LoadDataRow looks like:

Here the initial call to dataReader.GetValues is actually getting the data from memory into DataRow, which is then Loaded into DataTable and finally to the DataSet. Now if you try to look into the definition of GetValues, you need to look into the actual implementation of it in the provider specific classes. For SqlClient, the data is loaded using SqlBuffer class. The Buffer reads each individual entry and puts it as an array. So basically for a DataRow object loading Data for SQL Server, it puts the actual Data into an array of SqlBuffer. So if you can see the implementation of SqlBuffer, you would be clear about the actual data storage of DataRow.

So if you look into the implementation of SqlDataReader, you can see it basically maintains the collection of SqlBuffer. SqlBuffer is the actual unit of data storage. It maintains a structure Storage in it which has member to store almost all data basic data types The Storage stores the actual data in Type and the SqlBuffer holds the exact type information.

Here inside the SqlBuffer, you can see the member _value as Storage. StorageType on the other hand is an Enum which sets the actual type of Data associated in the DataRow. The SqlBuffer also exposes properties, each of which can get or set the value for the object. The exact type information is also held into DataColumnCollection and according to that, the Type of the Data is parsed.

Hence, this means the DataRow is actually stored internally into proper types and hence does not have any performance implication, right? No, there is.

Actually the performance implication lies above these facts when you try to retrieve the value or store it into the DataRow object. We generally use indexer to fetch data from the DataRow. The indexer to the DataRow is actually of type object.

public object this[]
{ 
     get; set;
}

This is the major problem to DataRow class. So eventually, every time you try to retrieve data from DataRow, even though proper type is maintained underneath, the indexer box to object type. Also, upon using setter, the proper type maintained from our code is also being boxed to the setter and which again is unboxed to proper data type.

Conclusion

I think these classes could be generalized more to induce better performance to the API. I am also hoping to have an enumeration of generic objects in the whole data structure of DataRow and DataColumns (may be using DataCell<t>) in future.

I hope you like the post.

I don't know if I did made things clear to you. I have to put only those which are important to know. I would suggest to use Reflector to see the implementation in detail to know more.

Thank you for reading. Looking forward to your feedback.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)