InformIT

Choosing Data Containers for .NET, Part 1

Date: Feb 28, 2003

Article is provided courtesy of Sams.


In this first article of his series on .NET and data containers, Jimmy Nilsson explains how he used the DataReader as a baseline in his performance tests on various data containers.

In the dark ages before .NET, probably the most common choice for a data container was to use disconnected ADO Recordsets.

With .NET, the situation regarding data containers has changed. We can't really say that there is a clear best practice yet, but Microsoft most often recommends ADO.NET DataSets as the way to go. I, on the other hand, prefer to use custom classes and collections—that is, a custom object-oriented format. Who is right, Microsoft or me?

In this new series of articles, I discuss different data containers for .NET, including the DataSet, generic container datatypes, and custom classes. First up in this article is the DataReader (which isn't really a container, but it's discussed here as a baseline).

Inspiration for This Series of Articles

Early in 2002, I read the manuscript of Martin Fowler's book Patterns of Enterprise Application Architecture (Addison-Wesley, 2002). Fowler inspired me to reconsider the architecture I had discussed in my book .NET Enterprise Design with Visual Basic .NET and SQL Server 2000 (Sams, 2001). The architecture discussed in my book is pretty much DataSet-focused, but I started to think about using a domain model rich in functionality and built with custom classes. That was the main reason I started to write the series of articles for VB2TheMax, called "A Pure Object-Oriented Domain Model by a db-Guy." Very much related to the problem of constructing the domain model is determining what data containers to use. (The article you're reading right now stands alone from the VB2TheMax articles, so those are, of course, optional reading.)

The other source of inspiration that prompted me to write about data containers is that Microsoft has published some papers about architecture and best practices, and tends to recommend DataSets as the solution for most situations. One example of such an article is "Application Architecture for .NET: Designing Applications and Services," which states:

It is recommended that you avoid designs requiring you to transfer data in a custom object-oriented format, because doing so requires custom serialization implementation and can create a performance overhead. Generally, you should use a more data-centric format, such as a DataSet, to pass the data from the data access logic components to the business layers, and then use it to hydrate a custom business entity if you want to work with the data in an object-oriented fashion. In many cases, though, it will be simpler just to work with the business data in a DataSet.

This article starts investigating different options for data containers—regarding not only performance, but other factors as well—and tries to find pros and cons for different options.

What I Mean by "Data Containers"

Before we get started, it's important to explain what I mean by the term "data containers."

I think of data containers as both carriers of data and an important part of the programming model. That's still pretty abstract, right? Let's get down to earth. By "carrier," I mean that you need a way of transporting data between tiers in distributed applications. As an example, you use Windows Forms at a client, and you fetch a list of customers from a service running on an application server.

The other part of my explanation involves the phrase "part of the programming model." By that, I mean that when you need to cache data, run calculations, do validations, and so on, you need to keep your data disconnected from the database. The choice of programming model, for example, affects how to find a specific row, how to inspect columns, and so on. What is the most suitable way of representing the data, then, to make your programming productive and keep execution efficient?

NOTE

It's probably easier to use the same data container for the carrier of data between tiers and as the part of the programming model, but you don't have to do things this way. For example, you might use custom classes for your programming model and use datasets as the carrier. You will have to do quite a lot of conversions, but it's possible to go that route nonetheless.
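To make that route concrete, hydrating custom classes from a DataSet carrier could look something like the following sketch. (The Customer class, the table name, and the column names are my own assumptions, not something taken from the tests later in this article.)

    Public Class Customer
        'Hypothetical custom class used as the programming model.
        Public Id As Integer
        Public Name As String
    End Class

    'Convert the DataSet carrier into a collection of Customer
    'instances. Table and column names are assumptions.
    Public Function CustomersFromDataSet _
        (ByVal ds As DataSet) As ArrayList
        Dim result As New ArrayList()
        Dim row As DataRow
        For Each row In ds.Tables("Customers").Rows
            Dim aCustomer As New Customer()
            aCustomer.Id = CInt(row("Id"))
            aCustomer.Name = CStr(row("Name"))
            result.Add(aCustomer)
        Next
        Return result
    End Function

This is the kind of conversion code you pay for when the carrier and the programming model differ.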

Best Practice for Data Containers

So, what is the best practice? Let's take a step back and think about the time before .NET.

Yesterday

Those of us who were building n-tier systems with VB6 didn't have many choices for data containers. The main problem was that we couldn't write classes that were marshaled by value. All classes written in VB6 were marshaled by reference. Therefore, if we used custom object-oriented formats for the "carrier" functionality, we were asking for bad performance. Instead, we had to use, for example, tagged strings, arrays of simple values, or disconnected ADO Recordsets. (Disconnected ADO Recordsets used custom marshaling and, therefore, were marshaled by value.)

NOTE

VB programmers weren't the only ones suffering from the problem of marshalling by reference in COM. C++ programmers suffered, too. It wasn't exactly trivial to write COM components that used custom marshalling.

A problem arose with the representation of custom object-oriented formats as well, because all classes written in VB became COM components. For some of our classes, that was just great; for others, though, we didn't need it, and we paid with high overhead for each instance. It was even worse if we made the mistake of configuring the "wrong" classes in COM+ without letting them "activate in the context of the caller"; then the overhead of each instance was as much as 3KB. Consequently, arrays, XML DOMs, and disconnected ADO Recordsets were often used for the representation, too.

As I said, ADO Recordsets wasn't the only option, but for most situations, it was considered a best practice to use them, especially for carrier functionality.

Today and Tomorrow

Okay, it wasn't a single choice "yesterday," but in a way it was easy to choose data containers because there weren't too many options. With .NET, the situation has changed. I don't think we can say that there is a clear best practice yet, but, as I've said already, Microsoft most often recommends ADO.NET DataSets as the way to go. I have changed my mind and now prefer to use custom classes and collections—that is, a custom object-oriented format. Who is right, Microsoft or me?

Before you answer that question and call me crazy, let me make it clear that this question is not as simple as it first sounds. There is never a single answer to all questions, and that goes for data containers, too. Different situations require different solutions. I have talked to Microsoft about this because of their pretty clear recommendations, and they've said that they are actually pretty fond of a custom object-oriented format. In their view, the problem is that it's much more complicated and more risky to go my route. After all, DataSets come with a lot of built-in functionality for free. A lot of the literature (for example, Fowler's book) supports Microsoft's view that building a custom object-oriented format requires work and is risky.

The choice of which data container to use isn't obvious. Another question to ask is, does it matter at all? As you probably have guessed by now, I think it's very important; I would go so far as to say that the choice is crucial. You and your application will live with the choice for a long time, and the implications are enormous. This decision will affect several qualities:

- Performance
- Scalability
- Productivity
- Maintainability
- Interoperability

As you probably know, you can't have all of these qualities at 100%; they somewhat compete with each other, so to speak.

Which Is the Most Important Quality?

Which qualities are most important depends on the situation. But assuming that you have to choose the one that is most important in the majority of situations, which one would you choose?

I would probably choose maintainability. By focusing on that, you might get a little less productivity when writing the initial version, but you can quite easily add performance, scalability, and interoperability later if you find that those qualities aren't completely fulfilled. The cost of system ownership increasingly depends on maintainability, and my guess is that this won't change in the near future.

I use these qualities when comparing the different options discussed next.

.NET Options

You can choose among many options in .NET, including the following:

- DataSets (untyped and typed)
- Generic container datatypes
- Custom classes and collections

Of course, this list is not exhaustive, but it represents a broad range of possible solutions. It also represents solutions that I will discuss in this series of articles.

Another option that probably will be interesting in the future is ObjectSpaces from Microsoft. An alpha version was out, but Microsoft decided to withdraw it several months ago and has published no new version yet. The last thing I heard about this is that we will have to wait until V2 of the .NET Framework.

Other options are to have a look at third-party solutions such as PragmaTier and EntityBroker. ObjectSpaces (at least the withdrawn alpha) and the third-party solutions I'm mentioning here have custom classes at the heart of the programming model.

The DataReader: Background

Let's start by discussing the DataReader very briefly. (For more background information, go to MSDN.)

As I said before, DataReader isn't really a container. It is marshaled by reference, and I'm using it here especially as a baseline in the performance tests.

NOTE

Let me stress that I don't think the DataReader should be used between tiers in a well-factored architecture. In a way, DataReaders are the reason that many systems are created as a mishmash of presentation, business, and data code. Again, I discuss DataReader here as a baseline, not as a best practice.

The core functionality of the DataReader is to stream data from the database very efficiently. It's also the case that the DataSet (or, rather, DataAdapter) uses a DataReader behind the curtains to fill the DataSet with data.
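As a sketch of that, the typical way to fill a DataSet goes through a DataAdapter; behind the scenes, Fill() walks a DataReader. (The connection variable and stored procedure name below are placeholders, not the ones used in my tests.)

    'Sketch: SqlDataAdapter.Fill() uses a DataReader internally
    'to stream the rows into the DataSet.
    Dim anAdapter As New SqlDataAdapter _
        ("a_stored_procedure_name", aConnection)
    anAdapter.SelectCommand.CommandType = _
        CommandType.StoredProcedure
    Dim aDataSet As New DataSet()
    anAdapter.Fill(aDataSet)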

As a matter of fact, there isn't a class called DataReader in the .NET Framework. Instead, there are two interfaces, IDataReader and IDataRecord, and the classes implementing those interfaces are provider-specific. The classes are called, for example, SqlDataReader (specific to SQL Server), OleDbDataReader (for OLE DB–supporting databases) and OdbcDataReader (for ODBC-supporting databases). Because I'm mostly working with SQL Server, I will use the SqlDataReader class in this article.

Many of you might argue that it's preferable to program against the interfaces instead so that you can easily switch among SqlDataReader, OleDbDataReader, or another DataReader implementation. I can agree with that. Currently, a few (small) caveats come with that approach, but I'm sure these will be dealt with in an upcoming version of the .NET Framework.
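To sketch what programming against the interface looks like (the helper method here is my own invention, not part of the tests), a routine that takes an IDataReader works unchanged with SqlDataReader, OleDbDataReader, and so on:

    'Hypothetical helper coded against IDataReader instead of a
    'provider-specific class, so the provider can be swapped.
    Public Sub DumpFirstColumn(ByVal aReader As IDataReader)
        While aReader.Read()
            Console.WriteLine(aReader.GetValue(0).ToString())
        End While
        aReader.Close()
    End Sub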

Some Pros and Cons with the DataReader

As I have already said, the DataReader isn't really a container (at least, not the way I see it), so the pros and cons discussed here are colored by that fact.

NOTE

In my opinion, it's really dangerous to hand out DataReaders to clients written by other developers. How can you know that they won't hold on to the DataReader for a very long time? What you do know is that if they do, it will hurt.

DataReader Code Examples

To get a feeling for how to work with a DataReader (or, rather, System.Data.SqlClient.SqlDataReader), take a look at the UML diagram in Figure 1. (Note that I have listed only a selection of the methods there.)

Figure 1 DataReader in UML.

In Listing 1, you find some code for fetching data from the database with the help of a stored procedure. As you can see, you don't instantiate a SqlDataReader with New. Instead, you create one with the help of a command (or, rather, a SqlCommand).

Listing 1: Code for Creating a DataReader

    Dim aCommand As New SqlCommand( _
        SprocOrder_FetchWithLines, _GetConnection())

    aCommand.CommandType = CommandType.StoredProcedure
    aCommand.Parameters.Add("@id", SqlDbType.Int).Value = id

    Return aCommand.ExecuteReader( _
        CommandBehavior.CloseConnection Or _
        CommandBehavior.SequentialAccess)

Browsing through the information in the DataReader could look like the code in Listing 2. Note that here I'm browsing through a DataReader with two resultsets; that's the reason for the call to NextResult().

Listing 2: Code for Browsing Through a DataReader

    Dim anOrder As SqlDataReader = _
        _service.FetchOrderAndLines(_GetRandomId())

    anOrder.Read()
    _id = anOrder.GetInt32(OrderColumns.Id)
    _customerId = anOrder.GetInt32(OrderColumns.CustomerId)
    _orderDate = anOrder.GetDateTime(OrderColumns.OrderDate)

    anOrder.NextResult()
    While anOrder.Read()
        _productId = anOrder.GetInt32(OrderLineColumns.ProductId)
        _priceForEach = anOrder.GetDecimal(OrderLineColumns.PriceForEach)
        _noOfItems = anOrder.GetInt32(OrderLineColumns.NoOfItems)
        _comment = anOrder.GetString(OrderLineColumns.Comment)
    End While

    anOrder.Close()

Before we take a look at the results of the DataReader in the tests, let's discuss the tests a bit.

The Tests

If you're like me, you like reading about performance tests. Therefore, in this series of articles, I run a couple of tests to compare the different data containers. The test cases used are the following:

- Read one row
- Read many rows
- Read one master row and many detail rows

The test cases will all be run under the following circumstances:

- With a single user
- With five simultaneous users

Finally, all test cases, both with a single user and with five users, will be run in these cases:

- In the same AppDomain
- Cross-machines

All test cases are end-to-end—that is, the data containers are filled with data from the database for each call (you saw an example of that in Listing 1). Then the data containers are transferred to a client, which inspects all values by moving them to local variables (an example of that was shown in Listing 2).

The test cases are exposed with one service layer class per data container. For example, you can see the DataReader service-layer class in Figure 2.

Figure 2 One example of a service-layer class.

NOTE

By "service layer," I mean the same as the business facade layer, the application layer, or whatever other term you prefer.

As you can see in Figure 2, all the service-layer classes inherit from MarshalByRefObject. This is because they should be suitable as root classes when used via remoting.
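A minimal sketch of such a class could look like this. (The class and method names follow Figure 2 and Listing 2, but the elided body and everything else here are my assumptions.)

    'Sketch: inheriting MarshalByRefObject makes the class
    'usable as a root object via remoting.
    Public Class FetchingOrderInformation__DataReader
        Inherits MarshalByRefObject

        Public Function FetchOrderAndLines _
            (ByVal id As Integer) As SqlDataReader
            'Create and execute the command and return the
            'DataReader, as shown in Listing 1...
        End Function
    End Class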

All service-layer classes call the same three stored procedures, one for each of the three test cases.

Finally, the stored procedures access the database shown in Figure 3.

Figure 3 Database schema.

And the Testbed

I decided to build a little utility for running my tests. (The primary reason was that I wanted something simple and nice to show in a presentation. As I write this, I have only a console application implementation of the testbed, but it would be a matter of minutes to write a Windows Forms–based testbed as well.) The design of the testbed is shown in Figure 4.

Figure 4 Design for the testbed.

As you can see in Figure 4, there is one test-driver class (for example, TdDataReader) for each data container to test. As I mentioned before, there is also one service-layer class (for example, FetchingOrderInformation__DataReader) for each data container. The TestBed (UI) class parses (in the console application implementation, at least) the command given at the command line, instantiates the correct test driver, and calls the Execute() method with the current arguments.

I also wrote a small host application for remoting. In that console application, I host the service-layer classes when I run the tests cross-machines.
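A minimal remoting host could be sketched like this. (The port number, endpoint name, channel choice, and activation mode are my assumptions; the actual host used in the tests may differ.)

    'Sketch of a console host registering a service-layer class
    'for remoting. All configuration values are assumptions.
    Imports System.Runtime.Remoting
    Imports System.Runtime.Remoting.Channels
    Imports System.Runtime.Remoting.Channels.Tcp

    Module RemotingHost
        Sub Main()
            ChannelServices.RegisterChannel(New TcpChannel(8085))
            RemotingConfiguration.RegisterWellKnownServiceType( _
                GetType(FetchingOrderInformation__DataReader), _
                "FetchingOrderInformation", _
                WellKnownObjectMode.SingleCall)
            Console.WriteLine("Host running. Press Enter to stop.")
            Console.ReadLine()
        End Sub
    End Module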

Most of you probably are less interested in the testbed and more interested in the results, so let's move to that section now. (If you want, you can download the testbed and play with the code.)

Result of the Tests

As you know by now, trusting test results is a risky business. So many things differ between my tests and test environment and your own environment that you might get completely different results yourself. Anyway, I find it interesting to publish my results, and you can use them as a starting point for your own tests. That is also the case with the code, of course.

I said earlier that I used the DataReader as a baseline. Therefore, I recalculated all the values so that the DataReader gets a value of 1 and the rest of the data containers get values relative to it. That makes the containers very easy to compare; the higher the value, the better. To avoid showing you only columns of 1s, I have also added a sneak peek at the untyped DataSet in the tables.
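As a small worked example of the normalization (the throughput numbers here are invented, used only to show the arithmetic):

    'Each container's throughput is divided by the DataReader
    'baseline, so the DataReader becomes 1 and higher is better.
    Dim dataReaderRequestsPerSecond As Double = 100
    Dim dataSetRequestsPerSecond As Double = 60
    Dim relativeValue As Double = _
        dataSetRequestsPerSecond / dataReaderRequestsPerSecond
    'relativeValue is 0.6, the kind of figure shown in Table 1.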

Table 1: Read One Row

                   1 User,        5 Users,       1 User,          5 Users,
                   in AppDomain   in AppDomain   Cross-Machines   Cross-Machines
DataReader         1              1              1                1
Untyped DataSet    0.6            0.6            1.4              1.7


Table 2: Read Many Rows

                   1 User,        5 Users,       1 User,          5 Users,
                   in AppDomain   in AppDomain   Cross-Machines   Cross-Machines
DataReader         1              1              1                1
Untyped DataSet    0.5            0.6            6.9              9.7


Table 3: Read One Master Row and Many Detail Rows

                   1 User,        5 Users,       1 User,          5 Users,
                   in AppDomain   in AppDomain   Cross-Machines   Cross-Machines
DataReader         1              1              1                1
Untyped DataSet    0.5            0.5            6.1              8.5


To explain the results a bit with an example, take a look at Table 3. In the single-user test, the untyped DataSet performs half as well as the DataReader within the AppDomain, but more than six times as well cross-machines. This gives a good idea of the cost of marshalling the DataReader by reference across AppDomain boundaries.

NOTE

"In AppDomain" is pretty similar to what we talked about as "in-process" before .NET.

It's pretty early to give any grades because we have covered only the DataReader, but this will show what you can expect from the coming articles. In Table 4, I have given some highly subjective grades according to the qualities discussed at the beginning of the article. A score of 5 means great, and a score of 1 means bad.

NOTE

Even if you agree with the grades in Table 4, you should add weights so that the qualities most important to you carry more weight than the others.

Table 4: Grades According to Qualities

DataReader:
  Performance in AppDomain/Cross-Machines   5/1
  Scalability in AppDomain/Cross-Machines   4/1
  Productivity                              2
  Maintainability                           1
  Interoperability                          1


Now I'll say a few words about each quality grade.

Performance

As you saw before, the performance is great within an AppDomain, but it's just the opposite if the DataReader has to "travel" between AppDomains, possibly cross-machines. The reason for the bad performance cross-AppDomains is, of course, that DataReaders are marshalled by reference and, therefore, actually won't travel at all. That means that all calls to a DataReader will lead back to the AppDomain where it was created.

NOTE

Even if, against all recommendations, you decide to ship DataReaders across AppDomain boundaries, watch out for which methods you use. For example, GetSqlMoney() won't work because it returns a datatype that isn't marked with the Serializable attribute.

Scalability

Scalability is pretty similar to performance when we discuss DataReaders. One thing to consider is that you must be extremely careful not to do long-running operations while browsing through a DataReader. If you do, you will hurt scalability because the connection stays occupied until the DataReader has been read through (or closed). I think the DataReader deserves a 5 for scalability in AppDomain, as long as it is used wisely. However, the risk of holding on to the DataReader and its connection too long, and thereby hurting scalability, is large enough that I settled for a 4.
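One way to stay on the safe side is to read the rows into a disconnected structure first, close the DataReader (which also releases the connection when CommandBehavior.CloseConnection was specified), and only then do the slow work. A sketch, where the variable and method names are my assumptions:

    'Read quickly, close, then process; the connection is not
    'held during the lengthy work.
    Dim ids As New ArrayList()
    While aReader.Read()
        ids.Add(aReader.GetInt32(0))
    End While
    aReader.Close() 'Connection released here.

    Dim id As Integer
    For Each id In ids
        DoLengthyCalculation(id) 'Hypothetical slow operation.
    Next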

NOTE

It's a bit awkward to say whether a DataReader in itself is scalable; scalability is entirely a matter of how the DataReader is used.

Productivity

Using DataReaders as the primary programming model means quite a lot of work if you write all the code manually. You should probably write yourself a generator that produces the code for you.

Maintainability

To be honest, I don't see DataReaders as a good choice for simulating a data container. The main problem is maintainability. I don't think the client should have such a direct reference into the database; it will make it much harder, for example, to change the database structure in the future. In my opinion, DataReaders are useful for dragging data from the database as fast as possible and filling another, disconnected data structure.

NOTE

I'm pretty sure that I will get some comments about this article saying that no real architect would dream of using a DataReader as a data container. I know, I know.

Interoperability

Another big problem with DataReaders is that they aren't interoperable. Try to hand out a DataReader to a client running Linux, for example, and wait for flaming e-mails.

Conclusion

In this first article, you didn't really get many answers. But we have laid the groundwork for some very interesting explorations in an important and immature field. Of course, you will find that there isn't one winning data container for every situation—only trade-offs, as usual. On the other hand, the choice is very important.

In this first article, I discussed the DataReader a bit, mostly to use it as a baseline in my performance tests. The place for DataReaders is typically in the implementations of other data containers. That is the case for the DataSet and also most custom object-oriented formats. It might also be interesting to use DataReaders in small applications directly from the UI, but for most enterprise applications, it's a no-no.

I'm not 100% sure about the details of the upcoming articles at this point, but the current plan is to cover DataSets, generic container datatypes, and custom classes and collections.

Don't miss the second article in this series, about DataSets.
