InformIT

The Cost of GUIDs as Primary Keys

Date: Mar 8, 2002

Article is provided courtesy of Sams.

Return to the article

In this article, Jimmy Nilsson presents the pros and cons of using globally unique identifiers (GUIDs) as the datatype for primary keys in SQL Server 2000. In doing so, he shows you test results that hint of performance characteristics and introduces you to a special type of GUID that he invented, called COMBs, that solves what otherwise might give you a big throughput problem.

In this article, I will be discussing the pros and cons of using globally unique identifiers (GUIDs) as the datatype for primary keys in SQL Server 2000. You will see concrete test results that give a hint of the performance characteristics. Along the way, I will also discuss a special type of GUID that I "invented," called COMBs. Before we start discussing GUIDs, though, I'd like to say a few words about natural and surrogate keys.

Natural or Surrogate Keys

When you do the physical design of a relational database, it's very important to decide upon which style to use for the primary keys. Some people prefer to use natural keys—that is, keys that are found in the domain that the database models. Others prefer to use surrogate keys, which are constructed keys with no other purpose than to be just keys (and which are not found in the domain). An example of a natural key is a Social Security number. A value incrementing by 1 for each row is a typical example of a surrogate key.

Using natural keys is the traditional approach, in line with Codd's original relational model. When you use them, you have only natural data that means something to users. This is good if users will ask ad hoc queries directly to the database in raw SQL. You can also often reduce the numbers of joins when using natural keys because you don't have to go to a lookup table to convert an ID to a description. One more advantage is that you get the minimum number of constraints because you don't have to protect the uniqueness of the natural keys separately. You already did this when you used them as primary keys.

Surrogate keys can be seen as a newer approach. This approach does not conflict with the relational model, but, in a way, it is a step closer to a more object-based approach in which each object has an ID and the structure of all IDs is of the same type. When you use surrogate keys, you often get smaller foreign keys, which reduces the size of the database. There is no risk of users changing the values of the primary keys, and the programming can be more consistent because all keys are of the same format.

NOTE

With cascading updates/deletes in SQL Server 2000, the problem of users changing the values of primary keys is not so great anymore because you don't have to program the UPDATE of dependent rows manually.

That was a brief description of the different kinds of keys. Now let's assume that we choose to use surrogate keys when we design a new database.

A New Database

We are about to build a new database for order processing. A greatly simplified version of the database will look as shown in Figure 1.

Figure 1 Database schema for a simplified database for order processing.

As you can see in Figure 1, ordermaster.id and orderdetail.id are primary keys. There are also several foreign keys: namely, ordermaster.customer_id, orderdetail.ordermaster_id, and orderdetail.product_id.

NOTE

In this simplified model, I skipped adding the product and the customer tables, so ordermaster.customer_id and orderdetail.product_id aren't implemented as physical foreign keys.

We should take into consideration all of these primary and foreign keys when deciding on the datatype to use.

Which Datatype Should Be Used for Surrogate Keys?

I'm going to focus on two different datatypes for surrogate keys, INTEGER and GUID. Let's start with INTEGER.

NOTE

In the specific example shown in Figure 1, the users need a column in the ordermaster table to identify each order uniquely. The most correct solution to this is add an orderno column, which is given a value separately from the ID column. In the case of INTEGER+IDENTITY for the primary key, it would be fairly common to skip the orderno column, but I prefer not to let the user know the real key value.

To avoid affecting the test too much, I decided to skip the user-known column here. Otherwise, I would have had to maintain the orderno value manually for when the IDENTITY has already been used for the ID. When GUID is used for the ID, I could have let IDENTITY handle the orderno column. I did not do that, though, to avoid testing too many apples and oranges (or apples and pears, as we say in Sweden).

INTEGER as the Datatype

Even though we have decided to use surrogate keys, we are not yet finished—we still have to choose a specific datatype. The most typical one to use is INTEGER. An INTEGER is 4 bytes. You can declare INTEGER columns as IDENTITY and thereby let the database provide new row values automatically. Listing 1 illustrates an example of a new row being INSERTed into the ordermaster table and the new ID being fetched for later use when INSERTing into the orderdetail table.

Listing 1: Stored Procedure for Inserting a Row into ordermaster, INT Version

CREATE PROCEDURE a_OrderINT_InsertMaster
(@customerId INT, @orderId INT OUTPUT)
AS
 DECLARE @anError INT

 INSERT INTO orderINTmaster
 (customer_id, orderdate, status)
 VALUES
 (@customerId, GETDATE(), 1)

 SET @anError = @@ERROR
 SET @orderId = @@IDENTITY

 RETURN @anError

Before we continue, I'd like to say a couple of things about the code in Listing 1 and the other code in this article:

UNIQUEIDENTIFIER as the Datatype

A GUID is called UNIQUEIDENTIFIER in SQL Server. It has been supported as a native datatype in SQL Server since version 7. A GUID is 16 bytes long (four times the size of an INTEGER).

In Listing 2, you can see an example of a new row being INSERTed into the ordermaster table when GUIDs are used for the keys. In this case, the ID is generated before the INSERT statement, with the help of NEWID().

NOTE

For reasons of simplicity, there are two sets of tables and stored procedures: one set with "INT" in the name and one set with "GUID" in the name.

Listing 2: Stored Procedure for Inserting a Row into ordermaster, GUID Version

CREATE PROCEDURE a_OrderGUID_InsertMaster
(@customerId UNIQUEIDENTIFIER
, @orderId UNIQUEIDENTIFIER OUTPUT)
AS
 DECLARE @anError INT

 SET @orderId = NEWID()

 INSERT INTO orderGUIDmaster
 (id, customer_id
 , orderdate, status)
 VALUES
 (@orderId, @customerId
 , GETDATE(), 1)

 SET @anError = @@ERROR

 RETURN @anError

Pros and Cons of GUIDs as Primary Keys

Let's say that we plan to use GUIDs for the primary keys. What are the advantages and disadvantages when compared to INTEGERs? We should always start with the positive, so here goes:

I'm not a salesman, so here are a few drawbacks:

So, the overhead related to size is the main problem. Most developers seem to have an idea of how big the overhead is, but few have tested it. It's time that this was changed, so let's take a look at a few performance tests and get an idea of how large the overhead is.

The Tests

Now I'll discuss the test cases and their results, as well as the test environments and other factors.

The Test Cases

The test cases are as follows:

  1. Inserting an order with 10 detail rows (repeated 500 times, measured when having 1,000,000 orders)

  2. Fetching 100 specific orders out of 1,000,000, one by one

  3. The same as B, but this time with a JOIN with the orderdetail table

The Metrics

The metric is the time that the test case takes to execute.

NOTE

You can easily compare the size overhead of the database, if you want to. Just run some INSERTs for the INT case in one new database and run the same number of INSERTs for the GUID case in another new database. Then compare the file sizes afterward.

Because the test database is so simplified, the size metric is a bit inaccurate because the number of foreign keys compared to ordinary columns is pretty big. Keep this in mind if you evaluate the size metric. This also applies to the time difference, but there are definitely situations in which the database consists of almost only foreign keys.

All tests will be executed by a single user. The tests will be executed directly at the database server—I'm testing only at the data tier in this article. For example, the database will create the GUIDs with NEWID(). You could quite easily expand the test to run from several users or from an application server instead. A very relevant test would be to let the application server create the GUIDs instead, but that is beyond the scope of this article.

The Results

Now, I'm sure you are curious about the results. I promise to show them, but first I'd like to say a few things about test results in general and the environment that I used.

Disclaimer

It's extremely important to be skeptical when reading test results. To come to the right conclusion, you have to translate the tests to your environment and to your situation, and then re-execute the tests.

Environment

The equipment that I used for the tests is not real production equipment; I used only a development machine (1.7GHz, 512MB RAM). The test machine was a Windows XP machine with SP2 for SQL Server 2000.

In a new production environment, you will typically have RAID 1 (mirroring) for the log and RAID 10 (mirroring and striping) with at least four disks for the data. You will also have more CPU cache than 256KB, probably more than one CPU, and also at least 1GB of memory.

With that out of the way, let's take a look at the test results.

Test A: INSERT of Orders

In Table 1, you can find the results of the first test, in which I INSERT orders and details for those orders. The time values are recalculated so that the first result has a value of 1 (a baseline value) and all other time values shows a percentage compared to the first test result value.

As you can see in Table 1, the overhead for the GUID solution is actually huge. What is especially worrisome is the sign of bad scalability that I saw when I compared results from executions when I had different numbers of orders.

Table 1 Result of Test A: INSERT of 500 Orders

Test A

Time (with 1,000,000 Orders)

INTEGER

1

GUID

10


This big difference occurred with "only" 1,000,000 orders (and 10,000,000 detail rows). The difference was very small with a small amount of orders, so this is probably a really bad problem with 100,000,000 orders!

Something is wrong—very wrong!

Test A: INSERT of Orders, Revised

I did quick tests before comparing GUIDs and IDENTITY+INT without seeing the huge overhead that I saw when I scaled up the test a bit. (I also did the tests on NT4, and the problem is probably smaller there, as you soon will see.) This certainly worried me a lot! I tried different tricks, different indexing strategies, different fill factors, a clustered IDENTITY+INT, and so on—everything without luck. My guess was that the reason for the problem was the lack of order for the GUIDs generated at Windows 2000 (and beyond). In Listing 3, you can see a series of GUIDs generated with a couple of calls to NEWID() on SQL Server 7 on NT4. Compare that series with the one in Listing 4, which is generated with a couple of calls to NEWID() on SQL Server 2000 on Windows 2000.

Listing 3: Some GUIDs Generated with SQL Server 7 on NT4

B3BFC6B1-05A2-11D6-9FBA-00C04FF317DF
B3BFC6B2-05A2-11D6-9FBA-00C04FF317DF
B3BFC6B3-05A2-11D6-9FBA-00C04FF317DF
B3BFC6B4-05A2-11D6-9FBA-00C04FF317DF
B3BFC6B5-05A2-11D6-9FBA-00C04FF317DF

Listing 4: Some GUIDs Generated with SQL Server 2000 on Windows 2000

C87FC84A-EE47-47EE-842C-29E969AC5131
2A734AE4-E0EF-4D77-9F84-51A8365AC5A0
70E2E8DE-500E-4630-B3CB-166131D35C21
15ED815C-921C-4011-8667-7158982951EA
56B3BF2D-BDB0-4AFE-A26B-C8F59C4B5103

As you saw in Listing 3, only the eighth half byte is changed between calls. On the other hand, in Listing 4, only the 13th half byte is constant between calls. In Windows 2000, the MAC address isn't used any longer for when GUIDs are generated. Instead, the GUID is only a 16-byte random number. Well, that isn't totally true. The 13th half byte is constant, so only 15.5 bytes are random. Adam Nathan at Microsoft explained to me that the 13th half byte is the value that will point out the source of the GUID, and 4 means Microsoft. Thanks, Adam!

The guess that I had about the problem with the INSERT overhead for GUIDs was that the lack of order in the Windows 2000–generated GUIDs was giving SQL Server a hard time administering indexes under massive INSERT periods. Therefore, I tried to create an order for the GUIDs instead. I tried to CAST the current DATETIME to a BINARY(8) and put that first in the GUID. Unfortunately, that had no effect. When I investigated it further, I found that when I had a BINARY(16) value and CASTed it to a UNIQUEIDENTIFIER, some bytes were scrambled. What to do? You guessed it. I tried to compensate for the scrambling to see if that had any positive effect on throughput, but no effect occurred. Then I found out that it wasn't the first (high) byte that was important for the new ordering, but the last (low) bytes. I also learned that I didn't have to use BINARY(8) for the current DATETIME. BINARY(6) is enough for the next 77 years, so I decided to occupy only the last (low) 6 bytes with the current DATETIME. In Listing 5, you can see the result of PRINT CAST(GETDATE() AS BINARY(8)). The first (high) bytes are 0.

Listing 5: GETDATE() as BINARY(8)

0x0000919001862D20

The algorithm that I used to generate my new GUIDs is shown in Listing 6; GUIDs of that type (generated in SQL Server 2000 on Windows 2000) are shown in Listing 7. I call this type of GUIDs COMBs (for COMBined, abbreviated).

Listing 6: An Algorithm to Generate Ordered GUIDs (COMBs)

DECLARE @aGuid UNIQUEIDENTIFIER

SET @aGuid = CAST(CAST(NEWID() AS BINARY(10)) 
+ CAST(GETDATE() AS BINARY(6)) AS UNIQUEIDENTIFIER)

Listing 7: A Set of GUIDs (COMBs) Generated with the New Algorithm

E25AFE33-DB2D-4502-9BF0-919001862D20
83E689D3-8549-4094-B223-919001862D20
CC22A56D-0CD5-43C5-990E-919001862D20
D5149998-1718-468C-B1AD-919001862D20
CBD0182D-4A0E-40AC-9A4C-919001862D20

As you saw in Listing 7, those GUIDs (COMBs) have the same last 6 bytes. Those represent the DATETIME of when those five GUIDs were generated. A DATETIME has the "precision" of 1/300th of a second, and it's no problem for SQL Server to generate five GUIDs in 3ms with this algorithm, even though it is slower than using only NEWID(). As you can see, ordering the GUIDs isn't perfect because I will have several GUIDs being generated with the same DATETIME—but that doesn't really matter here.

Finally, I found something that helped with my INSERT overhead. When I used this new solution, I got the throughput table shown in Table 2.

Table 2 Result of Test A: INSERT of 500 Orders, Revised

Test A

Time (with 1,000,000 Orders)

INTEGER

1

GUID (COMB)

1.05


This was great news, and the overhead of the GUIDs was actually less than expected. Perhaps you wonder whether my algorithm to generate GUIDs in itself isn't very slow. Often when you do conversions, you see CPU cycles fly away. On the other hand, compared to writing to a database, those conversions are pretty cheap.

You also might think that this algorithm must be risky because there is no guarantee of unique GUIDs. That's correct, but the built-in algorithm makes no guarantee, either. It's just a random number. (Not even the NT4 algorithm was guaranteed, as I said before.) Okay, a 15.5-byte random number has low predictability of being duplicated. (Most of us think that the chance of winning millions in the lottery is almost zero. Still, the chance is a number that is less than 4 bytes!)

With my algorithm, the random number's size decreases from 15.5 bytes to 9.5 bytes because the last 6 bytes are occupied with the current DATETIME. Of course, the risk of generating duplicated random numbers with my algorithm is larger than with NEWID(), but it's not a problem if the numbers don't occur within the same 1/300 of a second. In my opinion, the risk of duplicates is small enough that it's acceptable. And when you have UNIQUE CONSTRAINTs, it's okay even if you happen to see two duplicated GUIDs being generated once a decennium or so.

NOTE

To judge the likelihood of generating two duplicate COMBs, I ran a quick test. I used 6 static bytes for the low bytes and then created a million COMBs without duplicates. It won't be possible for you to INSERT close to that many rows in 1/300 of a second.

I also discussed the risk with a friend who is a mathematician. He pointed out that we don't know the algorithm being used for the random number. Is it "perfect"? He also pointed out that it's very dangerous to make assumptions and conclusions about likelihood and random numbers. Even so, he said that there seemed to be a very low risk of creating duplicate COMBs.

To add some more fuel to the fire, I'd like to point out that the differences in real-life apps are probably bigger than the one I showed in Table 1 when ordinary GUIDs are used. This is because I skipped test results when CHECKPOINTing was going on—CHECKPOINTing was much more meaningful to the GUID test. When I used my custom (COMB) algorithm, CHECKPOINTing was much less frequent and was used for short intervals. That is obvious when I compare the time for INSERTing 500,000 orders when there are 500,000 orders present at the start. You can see the throughput shown in Table 3.

Table 3 Result of Test A: INSERT of 500,000 Orders

Test A

Time (with 500,000 Orders)

INTEGER

1

GUID

30

GUID (COMB)

1.1


When did you last get a wage increase of 30 times your salary? Think about it, and you'll get a feeling for what a huge difference this is!

Finally, the memory consumption of the tests was very different. The IDENTITY+INT and my new GUID (COMB) algorithm grabbed perhaps 1MB of RAM for each 500 orders INSERTed. The ordinary GUID algorithm grabbed all memory given to SQL Server (350MB in my tests) when 500 or 1,000 orders were INSERTed!

NOTE

One property of ordinary GUIDs isn't maintained with COMBs: The order of creation is not known with GUIDs. With COMBs, it is known. For some applications, this can be a problem that keeps COMBs from being used.

Is it common to see this type of raw INSERTs only? No, not that common. Most applications also need some reading of data. Now on to the second test.

Test B: Fetch of Single Order

In the second test, I fetch the customer ID from some of the orders, one by one. To get a correct set of IDs to work with, I first ran a script that loops through the table to grab 100 random IDs and PRINTs a complete stored procedure call to the stored procedure a_Order_Fetch, which is shown in GUID version in Listing 8.

Listing 8: GUID Version of a_Order_Fetch

CREATE PROCEDURE a_OrderGUID_Fetch
(@orderId UNIQUEIDENTIFIER
, @customerId UNIQUEIDENTIFIER OUTPUT)
AS
 DECLARE @anError INT

 SET @aCustomerId = 
 (SELECT om.customer_id
 FROM orderGUIDmaster om
 WHERE om.id = @orderId)

 SET @anError = @@ERROR

 RETURN @anError

Then I copied the 1,000 PRINTed rows and pasted them to another script that EXECUTEs the calls. To skew the test results as little as possible here, I first executed a DBCC DROPCLEANBUFFERS in the beginning of the script that calls the stored procedure.

Let's take a look at the results for the second test. As usual, the first value is recalculated to 1, to be used as a baseline.

Table 4 Result of Test B: SELECT of Orders

Test B

Time for Fetching 100 Orders

INTEGER

1 (baseline)

GUID (COMB)

1.1


This is fine, isn't it? But what about the dreaded 16-byte JOIN?

Test C: A Small JOIN

The most common argument for not using GUIDs for primary keys is that it's so expensive for JOINs. Perhaps you have heard this: "Use a 16-byte column to join? Are you crazy?"

I reused Test B but changed the stored procedures slightly so that the orderdetail table was also used in a JOIN, as shown in Listing 9.

Listing 9: COMB Version of a_Order_FetchWithDetail

CREATE PROCEDURE a_OrderCOMB_Fetch
(@orderId UNIQUEIDENTIFIER, @noOfRows INT OUTPUT)
AS
 DECLARE @anError INT

 SET @noOfRows = 
 (SELECT COUNT(*)
 FROM orderCOMBmaster om
 INNER JOIN orderCOMBdetail od
 ON om.id = od.ordermaster_id
 WHERE om.id = @orderId)

 SET @anError = @@ERROR

 RETURN @anError

The results of the third test case are shown in Table 5.

Table 5 Result of Test C: SELECT of orders and JOIN with Details

Test C

Time for Fetching 100 Orders

INTEGER

1 (baseline)

GUID (COMB)

1.1


Worth mentioning is that the foreign keys are clustered both in the INTEGER case and in the COMB case. As a matter of fact, it was faster not using clustered foreign keys for this specific test, but I thought that it was most natural (and probably more effective in other tests) to use clustered foreign keys.

A Few Tweaks

The test set is pretty untuned, but I'd like to mention a few small tweaks that proved to work well:

I did a few other things between tests to get as fair tests as possible. I ran a backup of the database after the massive INSERTs to all the tables before I collected my metrics. That way, I emptied the transaction log. I also stopped SQL Server between each test case. And, of course, I ran each test case several times and skipped extreme values before finding typical values.

Other Approaches

Of course, there are other solutions than INTs and GUIDs. Another typical solution is to go for the BIGINT datatype (64 bits), which is new for SQL Server 2000. You can use IDENTITY for BIGINTs, too.

Conclusion

Here is my conclusion regarding the overhead of using ordinary GUIDs for the primary keys:

With my COMB algorithm, the three problems mentioned disappear. The problem with the large overhead in size is there both with my COMBs and with ordinary GUIDs. (After all, COMBs have the format of GUIDs!)

But GUIDs have a lot of advantages, too. (These are sometimes so big that you really don't have a choice!) In my opinion, using GUIDs as primary keys is often a very good idea (especially if you go for the COMBs instead). Is the overhead or the advantages more important when it comes to your applications?

Special thanks to Vladimir Suponitsky and Tibor Karaszi!

800 East 96th Street, Indianapolis, Indiana 46240