For the last 2 days, I thought of unleashing few of the hidden facts of Data Storage. It is true that the introduction of the .NET Framework without introducing Generic in the first place is one of the biggest mistakes. Because of Non-Generic data structure, a large amount of data is being unnecessarily boxed and unboxed without any reason as such. After Generics was introduced with .NET 2.0, most of these classes which made you do type conversions is either been totally depreciated or few of them rarely used as is there for backward compatibility. But for every programmer, the most common class called DataTable
still exists without any change. In this post, we will try to see the actual implementation of DataTable
as a whole, so that we could understand what is going on when we store data into it.
Introduction
DataTable
is an application object that maps to a Database
Table. It holds a collection of DataRows
and DataColumns
where DataRow
represents the actual Data while the DataColumns
holds Integration rules. Basically, we all know the basics of DataTable
and often use it for our regular work.
DataSet
holds a list of DataTable
, and DataRelation
. DataTable
as being said is the storage of Data in the form of a Table with Row and Columns, the DataRelation
holds the Relationship rules between two tables in a DataSet
. Hence ADO.NET has been structured in such a way that you can have the flavour of the entire database from your application end.
But, one thing that you must keep in mind is that these objects are created way back in time of .NET 1.0 release or even before when there was no Generics or even the concept of State Machines. Generics allows you to pass type of an object during actual object creation and hence will allow you to create Type information based on the object which we create in real time, as a result, it eliminates unnecessary runtime boxing and unboxing. Keeping this in mind, I thought I should check how these classes are made in Framework. Let's look at each of them, one by one in this post and let's identify the pros and cons.
DataSet
DataSet
is the application object that can hold the entire structure of the database schema. Basically, internally it is a serializable object which holds two sets of collection, DataTableCollection
and DataRelationCollection
where the former is an implementation of an ArrayList
of DataTables
and the later is a list of DataRelation
. DataSet
can be Cloned (produces another object with New tables from existing schema) or Copied (Produces another object with same tables).
Some of the important benefits of DataSet
- Easily Serializable, even could be written to XML File using
WriteXML
or can be read to the Object. - Allows you to Merge other
DataSet
into it easily. - Can hold Relation and Tables into it.
Problems & best implementation of DataSet
- Can only hold a collection of
DataTable
, and it does not allow us to define our own data structures and produce custom DataRelation
among them. - No Generic support, use of
ArrayList
means each element has to be upcast to object when being stored and again to actual element (DataTable
) when being retrieved.
DataTable
DataTable
is the primary object which holds the Data
. It holds a collection of DataColumn
and DataRow
. Each of the DataRow
holds the data element while DataColumn
holds the rules between each columns. The DataRowCollection
internally represents an ArrayList
of DataRow
while DataColumnCollection
is on DataColumns
.
To hold the DataRows
, a DataTable
maintains an internal sealed class called RecordManager
. Everytime you create a DataRow
element inside a DataTable
class, the object is stored into RecordManager
. The DataRowCollection
holds a list of DataRow
. Here RBTree
class implemented as Generic
list of DataRow
objects is used.
Some important benefits of DataTable
DataTable
can store any Table
even if the data is complex- Serializable, so can be ported in communication medium easily
- Keeps track of indexed field
Problems of DataTable
- Most of the objects are non-generic, and hence leading to conversion of types
- Very bulky
DataRow
DataRow
on the other hand holds the actual data. It has an indexer to the object which lets you get
or set
data to it.
Internals to ADO.NET Continued....
Now let us take a look at how the data is being stored into these data structures and later retrieved from it. Say you have opened a connection to database and used Adapter.Fill
to fill data into a DataTable
. Now let's examine what internally happens to store Data into the DataTable
and also retrieving the same when required. Say the code I write to retrieve the Data is as below:
SqlDataAdapter adpt = new SqlDataAdapter("Select * from Table", conn);
adpt.Fill(DTable);
Now when you call Fill
method, it actually invokes the external SQL server process to load the Data into memory and parses the loaded memory into the existing data structure. If you have already read my last post on internals of ADO.NET, you should be clear how the DataRows
are created using DataReader
and put into DataRowCollection
. In this post, I will cover from there, how the internal DataRow
is parsed.
The DataReader
loads each DataRow
into memory and takes you through with calling LoadDataRow
method of SchemaMapping
. If you see into Reflector, the code for LoadDataRow
looks like:
Here the initial call to dataReader.GetValues
is actually getting the data from memory into DataRow
, which is then Loaded into DataTable
and finally to the DataSet
. Now if you try to look into the definition of GetValues
, you need to look into the actual implementation of it in the provider specific classes. For SqlClient
, the data is loaded using SqlBuffer
class. The Buffer reads each individual entry and puts it as an array. So basically for a DataRow
object loading Data for SQL Server, it puts the actual Data into an array of SqlBuffer
. So if you can see the implementation of SqlBuffer
, you would be clear about the actual data storage of DataRow
.
So if you look into the implementation of SqlDataReader
, you can see it basically maintains the collection of SqlBuffer
. SqlBuffer
is the actual unit of data storage. It maintains a structure Storage in it which has member to store almost all data basic data types The Storage stores the actual data in Type
and the SqlBuffer
holds the exact type information.
Here inside the SqlBuffer
, you can see the member _value
as Storage
. StorageType
on the other hand is an Enum
which sets the actual type of Data
associated in the DataRow
. The SqlBuffer
also exposes properties, each of which can get
or set
the value for the object. The exact type information is also held into DataColumnCollection
and according to that, the Type
of the Data
is parsed.
Hence, this means the DataRow
is actually stored internally into proper types and hence does not have any performance implication, right? No, there is.
Actually the performance implication lies above these facts when you try to retrieve the value or store it into the DataRow
object. We generally use indexer to fetch data from the DataRow
. The indexer to the DataRow
is actually of type object.
public object this[]
{
get; set;
}
This is the major problem to DataRow
class. So eventually, every time you try to retrieve data from DataRow
, even though proper type is maintained underneath, the indexer box to object type. Also, upon using setter, the proper type maintained from our code is also being boxed to the setter and which again is unboxed to proper data type.
Conclusion
I think these classes could be generalized more to induce better performance to the API. I am also hoping to have an enumeration of generic objects in the whole data structure of DataRow
and DataColumns
(may be using DataCell<t>
) in future.
I hope you like the post.
I don't know if I did made things clear to you. I have to put only those which are important to know. I would suggest to use Reflector to see the implementation in detail to know more.
Thank you for reading. Looking forward to your feedback.