Choosing Data Containers for .NET, Part 2
Date: Mar 21, 2003
Article is provided courtesy of Sams.
In the first installment of this series of articles, I discussed the DataReader. It's actually not what I call a data container, but it's a very good baseline for the tests because the other options use DataReader behind the scenes. In this second article, I discuss DataSets, both untyped and typed.
NOTE
Supplemental code for this article can be found at http://www.jnsk.se/informit/container1.htm.
Keep the Fire Burning
In the first article, I posed the question of whether Microsoft or I have determined the best data container for most situations. I vote for custom classes, and Microsoft seems to vote for DataSets. My friend J.C. Oberholzer sent me some feedback:
I tend to think you are right. Microsoft writes ADO.NET and updates it as they go along. If you combine different components, you may end up supporting different versions of ADO.NET and DataSets when Microsoft decides to upgrade ADO.NET for any one of their reasons. When working with multiple components to create a system and maintaining different versions of these components, I would rather maintain my own than rely on Microsoft to make the decisions. The DataSet has a lot of built-in functionality that this guy needs, some that another guy needs, and so on. My custom goodie has the functionality that I need and that I can debug.
That's an interesting viewpoint, in my opinion. To balance it a bit, though, I've also heard that because Microsoft is pushing DataSets so much, they will see to it that we have a smooth upgrade path in the future.
Okay, let's get started with the topic of today: DataSets. (But be sure to read the first article, if you haven't already!)
Background on the DataSet
When I worked with ADO and VB6, I often used disconnected ADO Recordsets. They were nice, especially regarding marshaling, because they used custom marshaling and could "travel" between processes. But there were caveats, of course. One was that you needed to send several resultsets, for example, when you had used several SELECT statements within one stored procedure; you couldn't disconnect a Recordset with several resultsets. The solution I most often used was to move the resultsets from the first Recordset into an array of Recordsets. It was kind of a hack, but it was useful.
Another important occurrence in the dark ages before .NET was that Microsoft showed the in-memory database (IMDB) in the beta versions of COM+ 1.0. The idea was to have an in-memory cache of data to work with so that the ordinary database server wasn't touched as often and, consequently, performance and scalability would increase. For different reasons, IMDB was withdrawn before the final release of COM+ 1.0.
Rumors say that IMDB was withdrawn for one or more of the following reasons:
In tests, IMDB-based solutions gave worse performance than when SQL Server was used as a cache server.
Beta testers were disappointed when they found that the IMDB didn't understand SQL but instead had a more ISAM-ish programming model.
According to Microsoft, beta testers showed very little interest.
With the release of ADO.NET, we got the DataSet class, which addresses both the need for a disconnected Recordset with several resultsets and the need for an in-memory cache. In Figure 1, you can see the different classes from which the DataSet is aggregated.
Figure 1 Class model of the DataSet.
A DataSet can have one or more DataTables. (A DataTable can be thought of as a resultset.) Each DataTable can have DataRows, DataColumns, and Constraints. The DataSet also can have DataRelations between the different DataTables. With this model, you can build a complete representation of a database, but in memory.
When I discussed the DataReader, I said that concrete DataReaders (for example, SqlDataReader and OleDbDataReader) are used. That is not the case with the DataSet; it is independent of providers and data sources. You can create and fill a DataSet completely with simple code if you want to, without touching a database at all.
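To make that point concrete, here is a minimal sketch (the table, column, and relation names are my own, chosen to match the order example used later, not code from this article) that builds a master-detail DataSet entirely in memory, without any database:

```vb
Imports System.Data

Module InMemoryDataSetDemo
    Sub Main()
        Dim ds As New DataSet("OrderData")

        ' One DataTable per "resultset".
        Dim orders As DataTable = ds.Tables.Add("Orders")
        orders.Columns.Add("Id", GetType(Integer))
        orders.Columns.Add("CustomerId", GetType(Integer))

        Dim lines As DataTable = ds.Tables.Add("OrderLines")
        lines.Columns.Add("OrderId", GetType(Integer))
        lines.Columns.Add("ProductId", GetType(Integer))

        ' A DataRelation ties the detail rows to their master row.
        ds.Relations.Add("OrderToLines", _
            orders.Columns("Id"), lines.Columns("OrderId"))

        orders.Rows.Add(1, 42)
        lines.Rows.Add(1, 7)
        lines.Rows.Add(1, 8)

        ' Navigate from master to details via the relation.
        Dim master As DataRow = orders.Rows(0)
        Console.WriteLine(master.GetChildRows("OrderToLines").Length)
    End Sub
End Module
```

The DataSet never notices that no database was involved; a DataAdapter is just one of several ways to get rows into it.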
As a matter of fact, there is much to say about the DataSet, probably enough to fill complete books. Before moving to the next section, I'd like to summarize some of the other key built-in features of the DataSet:
Support for sorting, filtering, and searching: The DataSet comes out of the package with a wealth of built-in functionality that will boost your productivity. It will take you some time to get used to all the features of the DataSet, but it's not that hard. Not everything works exactly the way you want it to, of course, but not having everything exactly your way is the price you pay for reusing built-in functionality.
Control of what commands to use: If you worked with disconnected Recordsets in ADO, you probably remember that you had little control over the UpdateBatch() method. With the DataSet (or, rather, the DataAdapter), you can decide what command to use for updates, inserts, deletes, and selects. That's a huge improvement.
Support for concurrency control: The DataSet has built-in support for optimistic concurrency control, too. Because the DataSet is about working with disconnected data, it automatically adds conditions to the WHERE clause for DELETE and UPDATE statements so that conflicting changes to the specific row(s) between the SELECT and the UPDATE/DELETE are detected. This default model often works well enough.
Support for events when the DataSet is affected: Thanks to a wealth of events that fire when the DataSet is affected somehow, you have good control over the DataSet.
XML integration: One of the benefits of the DataSet's strong XML integration is that you can work with the data in a DataSet both relationally, as tables, and hierarchically, as XML.
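The first item in that list is easy to demonstrate. The following small sketch (the table and column names are made up for the example) shows the built-in searching, filtering, and sorting; nothing here touches a database:

```vb
Imports System.Data

Module FilterSortDemo
    Sub Main()
        Dim products As New DataTable("Products")
        products.Columns.Add("Name", GetType(String))
        products.Columns.Add("Price", GetType(Decimal))
        products.Rows.Add("Hammer", 12D)
        products.Rows.Add("Saw", 25D)
        products.Rows.Add("Nail", 1D)

        ' Select() searches and sorts the in-memory rows.
        Dim expensive() As DataRow = _
            products.Select("Price > 10", "Price DESC")
        Console.WriteLine(expensive.Length)

        ' A DataView gives a live, bindable filtered and sorted view.
        Dim view As New DataView(products, "Price > 10", "Name ASC", _
            DataViewRowState.CurrentRows)
        Console.WriteLine(view.Count)
    End Sub
End Module
```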
Background on the Typed DataSet
So far, I've been talking about the untyped DataSet. You can also instruct Visual Studio .NET to create a typed DataSet for you. You do that by describing the schema in XML or graphically, as shown in Figure 2.
Figure 2 A schema for a typed DataSet.
The schema is used for generating a class in your project with help from the XSD utility. (That is done automatically for you in Visual Studio .NET.) The generated class inherits from DataSet and somewhat encapsulates the untyped DataSet.
NOTE
I use the phrase "somewhat encapsulates" for how the typed DataSet hides information about the untyped DataSet because you can always access the underlying untyped DataSet instead. That is because the typed DataSet inherits from the untyped DataSet. In my opinion, this is a major weakness.
When you are using a typed DataSet, the schema of the DataSet is known at design time, so many of the tools in Visual Studio .NET can be used to increase productivity (for example, when you create the user interface). You also benefit by working with the DataSet in a type-safe manner when you write your code. Sure, you might still get runtime type exceptions if your typed DataSet and the database aren't in sync, but many of us really like to have IntelliSense.
NOTE
The first time I saw IntelliSense in VB, my reaction was to ask where I could disable it. But I quickly got accustomed to using it, and I'm now very sure that it increases my productivity a lot. The most irritating thing about IntelliSense is that it doesn't work everywhere, such as in Notepad.
All readers, in a chorus: "Show us some code! Please!"
Okay, I'll do that, but first let's go over a quick summary of pros and cons.
Pros and Cons of the DataSet
The DataSet is certainly a powerful thing. Let's take a look at some advantages it offers (some are more or less the same as those previously mentioned as key functionality of the DataSet):
The DataSet is well known, with loads of code examples available.
It has built-in support for sorting, searching, filtering, and concurrency control.
It has native XML integration.
Of course, some drawbacks are associated with DataSets:
It involves high overhead for a single row.
It wastes bandwidth and memory.
It's untyped.
It's not completely in your control (for example, RowState is read-only).
Pros and Cons of the Typed DataSet
The typed DataSet inherits from the untyped DataSet, so the pros and cons are largely the same. The typed DataSet also has these benefits:
Type-safe operations, with all their advantages
Data validation (for example, set up at design time)
Data binding at design time
These disadvantages are associated with typed DataSets:
It has higher overhead than that of an untyped DataSet.
It's not exactly what you want in many situations. (Nothing concrete, just a feeling I get when I work with a typed DataSet.)
Those of you who have read Martin Fowler's Patterns of Enterprise Application Architecture (Addison-Wesley, 2002) will recognize that the patterns typically used when working with DataSets are the Table Module pattern and the Transaction Script pattern.
DataSet Code Examples
Now let's take a look at some code samples, both from the server side and from the client side. First we'll address the server side. In Listing 1, you find some code for fetching data from the database with help from a stored procedure.
Listing 1: Code for Filling a DataSet
Dim aCommand As New SqlCommand _
    (SprocOrder_FetchWithLines, _GetClosedConnection)
aCommand.CommandType = CommandType.StoredProcedure
aCommand.Parameters.Add _
    ("@id", SqlDbType.Int).Value = id
Dim anAdapter As New SqlDataAdapter(aCommand)
anAdapter.TableMappings.Add("Table", "Orders")
anAdapter.TableMappings.Add("Table1", "OrderLines")
anAdapter.TableMappings _
    (OrderTables.Orders).ColumnMappings.Add _
    ("Customer_Id", "CustomerId")
anAdapter.TableMappings _
    (OrderTables.OrderLines).ColumnMappings.Add _
    ("Orders_Id", "OrderId")
anAdapter.TableMappings _
    (OrderTables.OrderLines).ColumnMappings.Add _
    ("Product_Id", "ProductId")
anAdapter.Fill(dataSet)
Note the basic pattern shown in Listing 1. First a Command is set up. Then comes a DataAdapter, and, finally, Fill is called on the DataAdapter.
NOTE
The code in Listing 1 doesn't show how the DataSet is instantiated. This is because the code in Listing 1 is from a utility method that can be used for filling both typed and untyped DataSets. For that to work, the DataSet is instantiated as a DataSet or an OrderDataSet, for example, outside of the utility method and is sent as a parameter.
Quite a lot of the code in Listing 1 relates to mappings. First is some mapping code for giving the first DataTable the Orders name and then for giving the second DataTable the OrderLines name. For typed DataSets, this is important: Without this, you will end up with four DataTables in the DataSet instead of two. For untyped DataSets, this is important only for creating meaningful names for the DataTables.
The second mapping section is for changing some of the column names used in the stored procedure. Again, for the typed DataSet, this is important, but for the untyped DataSet, this is merely for convenience.
Now let's look at some code from the client side. To browse the information in the DataSet, we could use the code in Listing 2. Note that here I'm browsing a DataSet with two resultsets (or, rather, DataTables).
Listing 2: Code for Browsing a DataSet
Dim anOrderDS As DataSet = _
    _service.FetchOrderAndLines(_GetRandomId())
Dim anOrder As DataRow = _
    anOrderDS.Tables(OrderTables.Orders).Rows(0)
_id = DirectCast(anOrder(OrderColumns.Id), Integer)
_customerId = DirectCast _
    (anOrder(OrderColumns.CustomerId), Integer)
_orderDate = DirectCast _
    (anOrder(OrderColumns.OrderDate), Date)
Dim anOrderLine As DataRow
For Each anOrderLine In anOrderDS.Tables _
        (OrderTables.OrderLines).Rows
    _productId = DirectCast(anOrderLine _
        (OrderLineColumns.ProductId), Integer)
    _priceForEach = CType(anOrderLine _
        (OrderLineColumns.PriceForEach), Decimal)
    _noOfItems = DirectCast(anOrderLine _
        (OrderLineColumns.NoOfItems), Integer)
    _comment = DirectCast(anOrderLine _
        (OrderLineColumns.Comment), String)
Next
NOTE
You might wonder about the idea of running a loop for the order lines and then just pushing the value of each column of each order line to a private variable, such as _productId. I do this so that the test runs end to end, all the way from the database to variables in the client. Therefore, I want to touch all columns in all rows of the data container.
Note in Listing 2 that I am referring to DataTables and DataColumns with enumerations. This is to make the code more readable than when magic integers are used and more efficient than when strings are used.
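The article doesn't show these enumerations, but they could look like the following sketch (the member order is an assumption based on how the listings use them). Note that with Option Strict On, you would need CInt() when passing an enum member to an Integer indexer:

```vb
' Indexes of the DataTables within the order DataSet.
Public Enum OrderTables
    Orders = 0
    OrderLines = 1
End Enum

' Column indexes within the Orders DataTable.
Public Enum OrderColumns
    Id = 0
    CustomerId = 1
    OrderDate = 2
End Enum

' Column indexes within the OrderLines DataTable.
Public Enum OrderLineColumns
    ProductId = 0
    PriceForEach = 1
    NoOfItems = 2
    Comment = 3
End Enum
```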
Let's compare the browse code for an untyped DataSet (just shown) with similar code for a typed DataSet. The version for the typed DataSet is found in Listing 3.
Listing 3: Code for Browsing a Typed DataSet
Dim anOrderDs As OrderDs = _
    _service.FetchOrderAndLines(_GetRandomId())
Dim anOrder As OrderDs.OrdersRow = _
    anOrderDs.Orders(0)
_id = anOrder.Id
_customerId = anOrder.CustomerId
_orderDate = anOrder.OrderDate
Dim anOrderLine As OrderDs.OrderLinesRow
For Each anOrderLine In anOrderDs.OrderLines
    _productId = anOrderLine.ProductId
    _priceForEach = anOrderLine.PriceForEach
    _noOfItems = anOrderLine.NoOfItems
    _comment = anOrderLine.Comment
Next
The code in Listing 3 is clearer and much shorter than the "same" code in Listing 2. This is because the schema is created at compile time, so you don't have to describe it over and over again in your code. Instead of referring to, for example, the generic DataRow class in Listing 2, I'm programming against specific types. I also can skip all the casting and conversions because all columns are in the "correct" data type already. That's definitely a way of reducing code bloat.
DataSet Tests
Time to discuss the test results. As with all the other test cases, there is a service-layer class for each test case. The service-layer classes for the DataSet test cases are shown in Figure 3.
Figure 3 One example of a service-layer class.
The service-layer classes inherit, as usual, from MarshalByRefObject. They should be suitable as root classes when used via remoting.
NOTE
Note that the second method in the class for the typed DataSet returns OrderDs2. That typed DataSet class has only an OrderLines DataTable. Otherwise, I would have had to use a workaround to avoid getting a constraint error when fetching only OrderLines from the database.
You might think that it would be more appropriate to send just a DataTable instead of a complete DataSet in this case. I will discuss that further in Part 5 of this series.
Result of the Tests
In the first part of the article series, I gave you a sneak peek at the throughput test results of the untyped DataSet. Now it's time to show you the results for all test cases discussed so far.
Once again, I will use the DataReader as a baseline. Therefore, I have recalculated all the values so that the DataReader gets the value 1; the other data containers get values relative to the DataReader, for easy comparison. The higher the value, the better.
Table 1: Results for the First Test Case: Reading One Row
|                 | 1 User, in AppDomain | 5 Users, in AppDomain | 1 User, Cross-Machines | 5 Users, Cross-Machines |
|-----------------|----------------------|-----------------------|------------------------|-------------------------|
| DataReader      | 1                    | 1                     | 1                      | 1                       |
| Untyped DataSet | 0.6                  | 0.6                   | 1.4                    | 1.7                     |
| Typed DataSet   | 0.4                  | 0.5                   | 1                      | 1.1                     |
Table 2: Results for the Second Test Case: Reading Many Rows
|                 | 1 User, in AppDomain | 5 Users, in AppDomain | 1 User, Cross-Machines | 5 Users, Cross-Machines |
|-----------------|----------------------|-----------------------|------------------------|-------------------------|
| DataReader      | 1                    | 1                     | 1                      | 1                       |
| Untyped DataSet | 0.6                  | 0.6                   | 6.9                    | 9.7                     |
| Typed DataSet   | 0.5                  | 0.5                   | 6                      | 8.6                     |
Table 3: Results for the Third Test Case: Reading One Master Row and Many Detail Rows
|                 | 1 User, in AppDomain | 5 Users, in AppDomain | 1 User, Cross-Machines | 5 Users, Cross-Machines |
|-----------------|----------------------|-----------------------|------------------------|-------------------------|
| DataReader      | 1                    | 1                     | 1                      | 1                       |
| Untyped DataSet | 0.5                  | 0.5                   | 6.1                    | 8.5                     |
| Typed DataSet   | 0.4                  | 0.4                   | 5.1                    | 6.9                     |
As you might guess, the five-users test uses 100% of the CPU because I'm not using any think time. That goes for both the AppDomain test and the cross-machines test.
In the cross-machines test, I really should use several client machines, but I haven't done that yet. Perhaps I will rerun the tests in Part 5. On the other hand, the server in the five-users, cross-machines test runs at approximately 80% CPU, so the server would soon be the bottleneck anyway.
This reminds me that I need to mention the test equipment. Because my company is a small one (it's just me), I don't have a full-blown lab. Therefore, I have used three ordinary machines:
1.8GHz, 512MB RAM. This serves as everything except the database server in the AppDomain tests. It's the client for the cross-machines tests.
1.7GHz, 512MB RAM. This is the server for the cross-machines tests.
750MHz, 256MB RAM. This is the database server for all the tests.
As you learned earlier, both the untyped DataSet and the typed DataSet have more overhead than the DataReader in the AppDomain. On the other hand, they perform better than the DataReader in the cross-machines test, especially when several rows are fetched. This is just as expected. It's also expected that the typed DataSet carries more overhead than the untyped DataSet.
But some forthcoming results aren't as you might expect. I'll whet your appetite a bit by telling you that with custom classes for the third test (1 user, cross-machines), I get 16! (That is, custom classes are 16 times more efficient than a DataReader for that specific test.) That is probably not what you'd expect from all the talk about how efficient DataSets are. The untyped DataSet performs almost three times worse than custom classes when serialized across machines because DataSets are serialized as XML, even with a binary formatter. Run the code snippet in Listing 4, and open the resulting file in Notepad to see for yourself.
Listing 4: Code for Serializing a DataSet to a File
Dim fs As New IO.FileStream _
    ("c:\temp\ds.txt", IO.FileMode.Create)
Dim bf As New System.Runtime.Serialization. _
    Formatters.Binary.BinaryFormatter _
    (Nothing, New Runtime.Serialization.StreamingContext _
    (Runtime.Serialization.StreamingContextStates.Remoting))
bf.Serialize(fs, anOrderDS)
fs.Close()
NOTE
You can read more about serialization aspects of DataSets in Dino Esposito's article "Binary Serialization of ADO.NET Objects" and in his book Applied XML Programming for Microsoft .NET (Microsoft Press, 2002). There Dino also discusses some workarounds to this problem. I will discuss the test result involved when using a workaround in Part 5 of this series.
Highly Subjective Results
It's time to add some grades for untyped and typed DataSets to my list of "highly subjective results." In Table 4, you will find that I have assigned some grades according to the qualities discussed at the beginning of the article. A score of 5 is excellent, and a score of 1 is poor.
Table 4: Grades According to Qualities
|               | Performance in AppDomain/Cross-Machines | Scalability in AppDomain/Cross-Machines | Productivity | Maintainability | Interoperability |
|---------------|-----------------------------------------|-----------------------------------------|--------------|-----------------|------------------|
| DataReader    | 5/1                                     | 4/1                                     | 2            | 1               | 1                |
| DataSet       | 3/3                                     | 3/3                                     | 4            | 3               | 4                |
| Typed DataSet | 2/2                                     | 2/2                                     | 5            | 4               | 5                |
I'd like to say a few words about each quality grade next.
Performance
Unlike the DataReader, both types of DataSets are marshalled by value. Therefore, performance is okay cross-machines, too.
Scalability
As I said last time, in this specific test I think performance and scalability go hand in hand, as those qualities were defined for this series of articles. It's important to note that DataSets won't hold open connections against the database, so using them entails less risk of killing scalability from holding on to connections too long.
Productivity
DataSets are great for productivity because you get a lot of functionality built in, debugged, and ready to use. Productivity is especially good for typed DataSets because there is a lot of design-time support for them in Visual Studio .NET.
In my opinion, the DataSet is very much about rapid application development (RAD) and does a good job regarding that.
Maintainability
I believe that maintainability will be pretty good for both types of DataSets. It's especially good for the typed DataSet because you have a strong contract for the code accessing it. On the other hand, I really like the idea of keeping behavior together with data, as in classic object-oriented solutions, which makes a very high degree of encapsulation possible. DataSets instead suit a more data-centric or document-centric approach, in which behavior acts on the data in the DataSets from the outside. This works very well, of course, but, in my opinion, long-term maintainability suffers in many situations.
Also worth mentioning is the loosely coupled model that typed DataSets use. That is, with an event-based model, you can use a specific typed DataSet in many situations, using different rules for each situation. You put the rules in event procedures in other classes instead of within the typed DataSet itself.
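The following sketch shows what I mean by the event-based model (the table name, column name, and rule are my own hypothetical example). The rule lives in an ordinary handler outside the DataSet, so different situations can hook up different rules:

```vb
Imports System.Data

Module EventValidationDemo
    Sub Main()
        Dim orderLines As New DataTable("OrderLines")
        orderLines.Columns.Add("NoOfItems", GetType(Integer))

        ' Hook up a situation-specific rule from the outside.
        AddHandler orderLines.ColumnChanging, AddressOf CheckNoOfItems

        Dim row As DataRow = orderLines.Rows.Add(3)
        row("NoOfItems") = 5      ' Passes the rule.
        Try
            row("NoOfItems") = 0  ' Violates the rule.
        Catch ex As ArgumentException
            Console.WriteLine(ex.Message)
        End Try
    End Sub

    Private Sub CheckNoOfItems(ByVal sender As Object, _
            ByVal e As DataColumnChangeEventArgs)
        ' The rule is kept out of the DataSet (or typed DataSet) itself.
        If e.Column.ColumnName = "NoOfItems" _
                AndAlso CInt(e.ProposedValue) < 1 Then
            Throw New ArgumentException("NoOfItems must be at least 1.")
        End If
    End Sub
End Module
```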
Interoperability
Finally, interoperability is pretty good for both types of DataSets. They serialize themselves to XML, and the DataSet also has a built-in WriteXml() method that can be used to get a format other than the diffgram format you get from the ordinary serialization of DataSets.
I decided to give the typed DataSet a score of 5 instead of 4 for interoperability because the XSD means a stronger contract with the client. In my opinion, that is desirable when it comes to interoperability.
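The diffgram-versus-plain-XML distinction can be seen with a few lines of code. In this sketch (the DataSet contents and file paths are just examples), the XmlWriteMode argument decides which format you get:

```vb
Imports System.Data

Module WriteXmlDemo
    Sub Main()
        Dim ds As New DataSet("OrderData")
        Dim orders As DataTable = ds.Tables.Add("Orders")
        orders.Columns.Add("Id", GetType(Integer))
        orders.Rows.Add(1)

        ' Plain XML, with the schema left out.
        ds.WriteXml("c:\temp\orders.xml", XmlWriteMode.IgnoreSchema)

        ' The diffgram format, as used by ordinary serialization.
        ds.WriteXml("c:\temp\orders-diffgram.xml", XmlWriteMode.DiffGram)
    End Sub
End Module
```

Comparing the two files in Notepad makes it obvious how much extra change-tracking markup the diffgram carries.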
Conclusion
In the first article in this series, I discussed the DataReader and concluded that it isn't meant to be used as a data container. That's hardly surprising. In this article, I discussed untyped and typed DataSets, which are very nice data containers. DataSets are especially good thanks to all their built-in functionality. If you can benefit from that, DataSets are a winning choice. But I also raised the point that DataSets aren't great regarding performance and maintainability; more about that in an upcoming article.
Don't miss the third article in this series, about wrapped containers and generic containers.