Choosing Data Containers for .NET, Part 1
Date: Feb 28, 2003
Article is provided courtesy of Sams.
In this first article of his series on .NET and data containers, Jimmy Nilsson explains how he used the DataReader as a baseline in his performance tests on various data containers.
In the dark ages before .NET, probably the most common choice for a data
container was to use disconnected ADO Recordsets.
With .NET, the situation regarding data containers has changed. We can't
really say that there is a clear best practice yet, but Microsoft most often
recommends ADO.NET DataSets as the way to go. I, on the other hand,
prefer to use custom classes and collections; that is, a custom
object-oriented format. Who is right, Microsoft or me?
In this new series of articles, I discuss different data containers for .NET,
including DataSet, the Generic container datatype, and custom
classes. First up in this article is the DataReader (which isn't
really a container, but it's discussed here as a baseline).
Inspiration for This Series of Articles
Early in 2002, I read the manuscript to Martin Fowler's book
Patterns of Enterprise Application Architecture
(Addison-Wesley, 2002). Fowler inspired me to reconsider the architecture I had
discussed in my book
.NET Enterprise Design with Visual Basic .NET and SQL Server 2000
(Sams, 2001). The architecture discussed in my book is pretty much
DataSet-focused, but I started to think about using a domain model rich
in functionality and built with custom classes. That was the main reason I
started to write the series of articles for VB2TheMax, called
"A Pure Object-Oriented Domain Model by a db-Guy."
Very much related to the problem of constructing the domain model is determining
what data containers to use. (The article you're reading right now stands
alone from the VB2TheMax articles, so reading them is, of course, optional.)
The other source of inspiration that prompted me to write about data
containers is that Microsoft has published some papers about architecture and
best practices, and tends to recommend DataSets as the solution for
most situations. One example of such an article is
"Application Architecture for .NET: Designing Applications and Services,"
which states:
It is recommended that you avoid designs requiring you to transfer data in a
custom object-oriented format, because doing so requires custom serialization
implementation and can create a performance overhead. Generally, you should use
a more data-centric format, such as a DataSet, to pass the data from
the data access logic components to the business layers, and then use it to
hydrate a custom business entity if you want to work with the data in an
object-oriented fashion. In many cases, though, it will be simpler just to work
with the business data in a DataSet.
This article starts investigating different options for data
containers, regarding not only performance but other factors as
well, and tries to find pros and cons for the different options.
What I Mean by "Data Containers"
Before we get started, it's important to explain what I mean by the term
"data containers."
I think of data containers as both carriers of data and an important part of
the programming model. That's still pretty abstract, right? Let's get
down to earth. By "carrier," I mean that you need a way of
transporting data between tiers in distributed applications. As an example, you
use Windows Forms at a client, and you fetch a list of customers from a service
running on an application server.
The other part of my explanation involves the phrase "part of the
programming model." By that, I mean that when you need to cache data, run
calculations, do validations, and so on, you need to keep your data disconnected
from the database. The choice of programming model, for example, affects how to
find a specific row, how to inspect columns, and so on. What is the most
suitable way of representing the data, then, to make your programming productive
and keep execution efficient?
NOTE
It's probably easier to use the same data container both as the carrier of
data between tiers and as part of the programming model, but you don't
have to do it this way. For example, you might use custom
classes for your programming model and use DataSets as the carrier. You will
have to do quite a lot of conversions, but it's possible to go that route
nonetheless.
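The conversion route mentioned in the note can be sketched briefly. Here the
DataSet is the carrier between tiers, and a custom Customer class is the
programming model; the Customer class, the "Customers" table name, and the
column names are assumptions for illustration only, not from this article.

```vb
'A hypothetical entity class for the programming model.
Public Class Customer
    Public Id As Integer
    Public Name As String
End Class

'Convert the rows of a carrier DataSet into Customer instances.
Public Function CustomersFromDataSet(ByVal ds As DataSet) As ArrayList
    Dim result As New ArrayList()
    Dim row As DataRow
    For Each row In ds.Tables("Customers").Rows
        Dim aCustomer As New Customer()
        aCustomer.Id = CInt(row("Id"))
        aCustomer.Name = CStr(row("Name"))
        result.Add(aCustomer)
    Next
    Return result
End Function
```

Every row has to be copied field by field, which is the kind of conversion
overhead referred to above.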
Best Practice for Data Containers
So, what is the best practice? Let's take a step back and think about
the time before .NET.
Yesterday
Those of us who were building n-tier systems with VB6 didn't have
many choices for data containers. The main problem was that we couldn't
write classes that were marshaled by value: all classes written in VB6
were marshaled by reference. Therefore, if we used custom object-oriented
formats for the "carrier" functionality, we were asking for bad
performance. Instead, we had to use, for example, tagged strings, arrays of
simple values, or disconnected ADO Recordsets. (Disconnected ADO
Recordsets used custom marshaling and, therefore, were marshaled by
value.)
NOTE
VB programmers weren't the only ones suffering from the problem of
marshalling by reference in COM. C++ programmers suffered, too. It wasn't
exactly trivial to write COM components that used custom marshalling.
A problem also arose with the representation of custom object-oriented formats
because all classes written with VB became COM components. For some of our
classes, that was just great; for other classes, though, we didn't need
that, and we paid with high overhead for each instance of those
classes. It was even worse if we made the mistake of configuring the
"wrong" classes in COM+ without letting them
"activate in the context of the caller." Then the overhead of each
instance was as much as 3KB. Consequently, arrays, XML DOMs, and disconnected
ADO Recordsets were often used for the representation, too.
As I said, ADO Recordsets weren't the only option, but in most
situations it was considered a best practice to use them, especially for
the carrier functionality.
Today and Tomorrow
Okay, it wasn't a single choice "yesterday," but in a way it
was easy to choose data containers because there weren't too many options.
With .NET, the situation has changed. I don't think we can say that there
is a clear best practice yet, but, as I've said already, Microsoft most
often recommends ADO.NET DataSets as the way to go. I have changed my
mind and now prefer to use custom classes and collections; that is, a custom
object-oriented format. Who is right, Microsoft or me?
Before you answer that question and call me crazy, let me make it clear that
this question is not as simple as it first sounds. There is never a single
answer to all questions, and that goes for data containers, too. Different
situations require different solutions. I have talked to Microsoft about this
because of their pretty clear recommendations, and they've said that they
are actually pretty fond of a custom object-oriented format. In their view, the
problem is that it's much more complicated and more risky to go my route.
After all, DataSets come with a lot of built-in functionality for free.
A lot of the literature (for example, Fowler's book) supports
Microsoft's view that building a custom object-oriented format requires
work and is risky.
The choice of which data container to use isn't obvious. Another
question to ask is, does it matter at all? As you probably have guessed by now,
I think it's very important; I would go so far as to say that the
choice is crucial. You and your application will live with the choice for a
long time, and the implications are enormous. This decision will affect
several qualities:
Performance. How much time does it take from request until
complete response?
Scalability. How is the performance affected when the size of the
problem increases? For example, what happens when more users are added or larger
amounts of data must be transferred?
Productivity. How much effort is required to write the initial
version of an application?
Maintainability. How much effort is required to extend and change
the application?
Interoperability. How easily can heterogeneous clients work with
the data?
As you probably know, you can't have all of these qualities at 100%;
they somewhat compete with each other, so to speak.
Which Is the Most Important Quality?
Which qualities are most important depends on the situation. But assuming
that you had to choose one that is more important in most
situations, which one would you choose?
I would probably choose maintainability. By focusing on that, you might get a
little less productivity when you write the initial version, but you can quite
easily get performance, scalability, and interoperability if you later find out
that you don't completely fulfill those qualities. The cost of system
ownership increasingly depends on maintainability. My guess is that this
won't change in the near future.
I use these qualities when comparing the different options discussed
next.
.NET Options
You can choose among many options in .NET:
DataReader. This isn't really a container,
but a way (and one that performs well) to stream data out from the
database, firehose-style.
Untyped DataSet. The untyped DataSet (or just
DataSet) is an in-memory database in which you hold disconnected data
in tabular format as DataTables. There's a lot of built-in
functionality, such as sorting and filtering.
Typed DataSet. In this case, the schema of the
DataSet is created at design time. Therefore, the code that you write
against the typed DataSet will be typesafe. (A typed DataSet
inherits from an untyped DataSet, so I won't address that as yet
another option.)
Wrapped container datatype. By wrapping a DataSet, you
can get all the nice functionality while exposing only a minimal interface so
that the DataSet is encapsulated. You could just as well wrap an XML
DOM, a Hashtable, or something else that fits your needs, but I chose a
DataSet for this article.
Generic container datatype. The Framework Class Library (FCL) has
a wealth of generic container classes, such as Array,
Hashtable, XML DOM, and so on. When I discuss this further, I will use
Hashtable as an example.
Custom classes and collections. In this case, you build up a
custom object model that maps as closely as possible to your problem domain. You
get a lot of flexibility, but it also means a lot of work.
Of course, this list is not exhaustive, but it represents a broad range of
possible solutions. It also represents solutions that I will discuss in this
series of articles.
Another option that probably will be interesting in the future is
ObjectSpaces from Microsoft. An alpha version was out, but Microsoft decided to
withdraw it several months ago and has published no new version yet. The last
thing I heard about this is that we will have to wait until V2 of the .NET
Framework.
Other options include third-party solutions such as
PragmaTier and
EntityBroker.
ObjectSpaces (at least the withdrawn alpha) and the types of third-party
solutions I'm mentioning here have custom classes at the heart of the
programming model.
The DataReader: Background
Let's start by discussing the DataReader very briefly. (For
more background information, go to
MSDN.)
As I said before, DataReader isn't really a container. It is
marshaled by reference, and I'm using it here especially as a baseline in
the performance tests.
NOTE
Let me stress that I don't think the DataReader should be used
between tiers in a well-factored architecture. In a way, DataReaders
are the reason that many systems are created as a mishmash of presentation,
business, and data code. Again, I discuss DataReader here as a
baseline, not as a best practice.
The core functionality of the DataReader is to stream data from the
database very efficiently. It's also the case that the DataSet
(or, rather, DataAdapter) uses a DataReader behind the
curtains to fill the DataSet with data.
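To make that relationship concrete, here is a minimal sketch of filling a
DataSet; the connection string and SQL statement are placeholders, not from
this article. When Fill() is called, the SqlDataAdapter opens the connection
if necessary, streams the rows through a DataReader internally, and closes
the connection again.

```vb
'Placeholder connection string and query, for illustration only.
Dim aConnection As New SqlConnection( _
    "Server=.;Database=Orders;Integrated Security=SSPI")
Dim anAdapter As New SqlDataAdapter( _
    "SELECT * FROM Orders", aConnection)
Dim aDataSet As New DataSet()

'Fill() uses a DataReader behind the curtains.
anAdapter.Fill(aDataSet, "Orders")
```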
As a matter of fact, there isn't a class called DataReader in
the .NET Framework. Instead, there are two interfaces, IDataReader and
IDataRecord, and the classes implementing those interfaces are
provider-specific. The classes are called, for example, SqlDataReader
(specific to SQL Server), OleDbDataReader (for OLE DB-supporting
databases), and OdbcDataReader (for ODBC-supporting databases). Because
I'm mostly working with SQL Server, I will use the SqlDataReader
class in this article.
Many of you might argue that it's preferable to program against the
interfaces instead so that you can easily switch among SqlDataReader,
OleDbDataReader, or another DataReader implementation. I can
agree with that. Currently, a few (small) caveats come with that approach,
but I'm sure they will be dealt with in an upcoming version of the .NET
Framework.
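Programming against the interface could look like the following sketch,
which borrows the _service object from Listing 2; the column ordinal is a
placeholder. Because IDataReader inherits IDataRecord, the typed accessors
work the same regardless of which provider created the reader.

```vb
'Declare the variable as the interface, not the concrete class.
Dim aReader As IDataReader = _
    _service.FetchOrderAndLines(_GetRandomId())
While aReader.Read()
    'IDataRecord members such as GetInt32() are provider-neutral.
    Dim productId As Integer = aReader.GetInt32(0)
End While
aReader.Close()
```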
Some Pros and Cons with the DataReader
As I have already said, the DataReader isn't really a container
(at least, not the way I see it), so the pros and cons examples are affected by
that:
Performance in the AppDomain. The performance is very good for
streaming data from the database, but only within the same AppDomain.
Performance, cross-AppDomains and cross-machines. The situation is
the opposite regarding performance if you use a DataReader from an
AppDomain other than the one in which it was created. Because the
DataReader is marshaled by reference, all method calls have to go back to
the source AppDomain.
Connected operations only. The DataReader is useful only
while it has an open connection to the database. So, even though you can
quickly get data out of the database and keep the connection occupied for
as short a time as possible, you risk doing a lot of work for each row in
the DataReader and thereby holding the connection open for a very long
time.
NOTE
In my opinion, it's really dangerous to hand out DataReaders to
clients written by other developers. How can you know that they
won't hold on to the DataReader for a very long time? What you do
know is that, if they do, it will hurt.
Occupied connection. You cannot use one and the same connection
for two different DataReaders at the same time. You must read through
the first DataReader (or close it) before using the connection for
another DataReader.
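The rule can be sketched like this, assuming two SqlCommand objects
(firstCommand and secondCommand) that share one open connection; both
command objects are assumptions for illustration.

```vb
Dim firstReader As SqlDataReader = firstCommand.ExecuteReader()
'...read through firstReader here...
'The shared connection is busy until this call:
firstReader.Close()

'Only now can the same connection serve another DataReader.
Dim secondReader As SqlDataReader = secondCommand.ExecuteReader()
```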
Interoperability. You can't send a DataReader to a
client that isn't a .NET client. It doesn't make sense for other
clients.
Exposed database. The DataReader is weak at encapsulating
the database. The schema itself is somewhat exposed, but the user of a
DataReader also has a very direct coupling to the database. This will
affect maintainability in the long run.
Only about reading. It's probably obvious and well known, but
it's worth mentioning anyway: a DataReader only reads. This matters when
you think about the programming model, because if you also need to update
data, you can't rely on the DataReader alone.
DataReader Code Examples
To get a feeling of how to work with a DataReader (or, rather,
System.Data.SqlClient.SqlDataReader), take a look at the UML diagram in
Figure 1. (Note that I have listed only a selection of the methods.)
Figure 1 DataReader in UML.
In Listing 1, you find code that fetches data from the database
with the help of a stored procedure. As you can see, you don't instantiate
an SqlDataReader with New. Instead, you create one with the
help of a command (or, rather, an SqlCommand).
Listing 1: Code for Creating a DataReader
Dim aCommand As New SqlCommand( _
    SprocOrder_FetchWithLines, _GetConnection())
aCommand.CommandType = CommandType.StoredProcedure
aCommand.Parameters.Add("@id", SqlDbType.Int).Value = id

'Close the connection when the DataReader is closed, and read
'columns sequentially for the most efficient streaming.
Return aCommand.ExecuteReader( _
    CommandBehavior.CloseConnection Or _
    CommandBehavior.SequentialAccess)
When browsing through the information in the DataReader, the code could
look like it does in Listing 2. Note that here I'm browsing through a
DataReader with two resultsets. That's the reason for the call to
NextResult().
Listing 2: Code for Browsing Through a DataReader
'Fetch a random order and its order lines (two resultsets).
Dim anOrder As SqlDataReader = _
    _service.FetchOrderAndLines(_GetRandomId())

'First resultset: the order itself.
anOrder.Read()
_id = anOrder.GetInt32(OrderColumns.Id)
_customerId = anOrder.GetInt32(OrderColumns.CustomerId)
_orderDate = anOrder.GetDateTime(OrderColumns.OrderDate)

'Second resultset: the order lines.
anOrder.NextResult()
While anOrder.Read()
    _productId = anOrder.GetInt32(OrderLineColumns.ProductId)
    _priceForEach = anOrder.GetDecimal(OrderLineColumns.PriceForEach)
    _noOfItems = anOrder.GetInt32(OrderLineColumns.NoOfItems)
    _comment = anOrder.GetString(OrderLineColumns.Comment)
End While
anOrder.Close()
Before we take a look at the results of the DataReader in the tests,
let's discuss the tests a bit.
The Tests
If you're like me, you like reading about performance tests. Therefore,
in this series of articles, I run a couple of tests to compare the different
data containers. The test cases (matching the result tables shown later) are
the following:
Read one row
Read many rows
Read one master row and many detail rows
All test cases will be run both with a single user and with five
simultaneous users. Finally, each combination will be run in these two
settings:
In an AppDomain
Cross-machines
All test cases are end-to-end; that is, the data containers are filled
with data from the database for each call (you saw an example of that in Listing
1). Then the data containers are transferred to a client, which inspects all
the values by moving them to local variables (an example of that was shown in
Listing 2).
The test cases are exposed with one service layer class per data container.
For example, you can see the DataReader service-layer class in Figure
2.
Figure 2 One example of a service-layer class.
NOTE
By "service layer," I mean the same as the business facade layer,
the application layer, or whatever term you prefer.
As you can see in Figure 2, all the service-layer classes inherit from
MarshalByRefObject. This is because they should be suitable as root
classes when used via remoting.
All service-layer classes call the same three stored procedures:
Order_Fetch. Used for the first test case. Fetches
one specific order.
Order_FetchLines. Used for the second test case.
Fetches the detail lines for a specific order.
Order_FetchWithLines. Used for the third test
case. Fetches the main information about a specific order and its detail
lines.
Finally, the stored procedures access the database shown in Figure 3.
Figure 3 Database schema.
And the Testbed
I decided to build a little utility for running my tests. (The primary reason
was that I wanted something simple and nice to show in a
presentation. As I write this, I have only a console application
implementation of the testbed, but it would take a matter of minutes to write
a Windows Forms-based testbed as well.) The design of the testbed is shown in
Figure 4.
Figure 4 Design for the testbed.
As you can see in Figure 4, there is one test-driver class (for example,
TdDataReader) for each data container to test. As I mentioned before,
there is also one service-layer class (for example,
FetchingOrderInformation__DataReader) for each data container. The
TestBed (UI) class parses (in the console application implementation,
at least) the command given at the command line, instantiates the correct test
driver, and calls the Execute() method with the current arguments.
I also wrote a small host application for remoting. In that console
application, I host the service-layer classes when I run the tests
cross-machines.
Most of you probably are less interested in the testbed and more interested
in the results, so let's move to that section now. (If you want, you can
download the testbed
and play with the code.)
Result of the Tests
As you know by now, trusting test results is a risky business. So many things
differ between my tests and test environment and your own environment
that you might get completely different results yourself. Anyway, I
find it interesting to publish my results, and you can use them as a starting
point for your own tests. That is also the case with the
code, of
course.
I said earlier that I use the DataReader as a baseline. Therefore, I
recalculated all the values so that the DataReader gets a value of 1
and the rest of the data containers get values relative to the
DataReader's. That makes it very easy to compare; the higher the value,
the better. To avoid showing you only results with the digit 1, I have also
added a sneak peek at the untyped DataSet in the tables.
Table 1: Read One Row

                 1 User,       5 Users,      1 User,         5 Users,
                 in AppDomain  in AppDomain  Cross-Machines  Cross-Machines
DataReader       1             1             1               1
Untyped DataSet  0.6           0.6           1.4             1.7
Table 2: Read Many Rows

                 1 User,       5 Users,      1 User,         5 Users,
                 in AppDomain  in AppDomain  Cross-Machines  Cross-Machines
DataReader       1             1             1               1
Untyped DataSet  0.5           0.6           6.9             9.7
Table 3: Read One Master Row and Many Detail Rows

                 1 User,       5 Users,      1 User,         5 Users,
                 in AppDomain  in AppDomain  Cross-Machines  Cross-Machines
DataReader       1             1             1               1
Untyped DataSet  0.5           0.5           6.1             8.5
To explain the results a bit with an example, take a look at
Table 3. In the single-user test, the untyped DataSet performs half as
well as the DataReader within the AppDomain, but it performs more than
six times as well cross-machines. This gives a good idea of the cost of
the DataReader's marshaling by reference across AppDomain
boundaries.
NOTE
"In AppDomain" is pretty similar to what we talked about as
"in-process" before .NET.
It's pretty early to give any grades because we have covered only the
DataReader, but this will show what you can expect from the coming
articles. In Table 4, I have given some highly subjective grades according to
the qualities discussed at the beginning of the article. A score of 5 means
great, and a score of 1 means bad.
NOTE
Even if you agree with me in Table 4, you should add some weights so that the
qualities most important for you have a higher weight than the others.
Table 4: Grades According to Qualities

DataReader:
  Performance, in AppDomain/cross-machines:  5/1
  Scalability, in AppDomain/cross-machines:  4/1
  Productivity:                              2
  Maintainability:                           1
  Interoperability:                          1
Now I'll say a few words about each quality grade.
Performance
As you saw before, the performance is great within an AppDomain, but
it's just the opposite if the DataReader has to "travel"
between AppDomains, possibly cross-machines. The reason for the bad
performance cross-AppDomains is, of course, that DataReaders are marshaled
by reference and, therefore, won't actually travel at all. All
calls to a DataReader lead back to the AppDomain where it was
created.
NOTE
Even if, against all recommendations, you decide to ship DataReaders
across AppDomain boundaries, watch out for what methods you use. For example,
GetSqlMoney() won't work because it returns a datatype that
isn't marked with the Serializable attribute.
Scalability
Scalability is pretty similar to performance when we discuss
DataReaders. One thing to consider is that you must be extremely
careful not to do long-running operations while browsing through a
DataReader; if you do, you will hurt scalability because the connection
will be occupied until the DataReader has been read through (or closed).
I think the DataReader would deserve a 5 for scalability in AppDomain as
long as it is used wisely, but the risk that you will hold on to the
DataReader and its connection for too long is large enough to hurt
scalability. That's the reason for a 4.
NOTE
It's a bit awkward to say whether a DataReader in itself is
scalable; scalability here is really about how the DataReader is used.
Productivity
Using DataReaders as the primary programming model means quite a lot
of work if you manually write all the code. You should probably write yourself a
generator that writes the code for you.
Maintainability
To be honest, I don't see DataReaders as a good choice for
simulating a data container. The main problem is maintainability: I don't
think the client should have such a direct reference into the database,
because that will make it much harder, for example, to change the database
structure in the future. In my opinion, DataReaders are useful for
dragging data out of the database as fast as possible and filling another,
disconnected data structure.
NOTE
I'm pretty sure that I will get some comments about this article saying
that no real architect would dream of using a DataReader as a data
container. I know, I know.
Interoperability
Another big problem with DataReaders is that they aren't
interoperable. Try to hand out a DataReader to a client running Linux,
for example, and wait for flaming e-mails.
Conclusion
In this first article, you didn't really get many answers, but we have
laid the groundwork for some very interesting explorations in an important and
immature field. Of course, you will find that there isn't one winning data
container for every situation; only trade-offs, as usual. On the other hand,
the choice is very important.
In this first article, I discussed the DataReader a bit, mostly to
use it as a baseline in my performance tests. The place for DataReaders
is typically in the implementations of other data containers. That is the case
for the DataSet and also most custom object-oriented formats. It might
also be interesting to use DataReaders in small applications directly
from the UI, but for most enterprise applications, it's a no-no.
I'm not 100% sure about the agenda details of the upcoming articles at
this point, but the current plan looks like this:
Part 1: DataReader (Okay, this is a pretty safe guess.)
Part 2: Untyped and typed DataSets
Part 3: Wrapped container datatype and the Generic container
datatype
Part 4: Custom classes and collections
Part 5: A summary of interesting findings and the results of additional
tests
Don't miss the second article in this series, about
DataSets.