Home > Articles > Data > SQL Server

SQL Server Reference Guide

Hosted by

Toggle Open Guide Table of ContentsGuide Contents

Close Table of ContentsGuide Contents

Close Table of Contents

Querying Multiple Data Sources from a Single Location (Distributed Queries)

Last updated Mar 28, 2003.

It's rare that a single data store (such as an Oracle or SQL Server database) holds all of the data a particular application might need. In many cases a single system has data that could be correlated with another system, so users are left to collate the information themselves, such as bringing up two applications and possibly even copying and pasting into a spreadsheet or third data tool to combine the data they need.

Add to this that companies are using not only on-premise applications but adding in applications on the web, and users are left with even more disconnected data sources. As a data professional, from the Database Administrator to the Database Developer, you will be called on to stich these systems together to ensure that your users have access to what they need.

You may also be asked to work with a Hybrid Cloud Solution such as Windows and SQL Azure. To allow the power and flexibility of these solutions but migrate slowly or perhaps to keep private and secure data locally, you'll need a way to connect to two or more data sources at once within a single user interface. In this tutorial I'll show you a few options on how to do that.

Like everything else in computing, there is no single solution to a set of requirements. It's important when you think about any of these options that they fit the particular situation at hand. Evaluate the needs first, and then layer the solutions and find the one that fits best. It's not uncommon or incorrect to use any or all of a mix of these solutions, especially across the entire application set you have to work with. You'll find one solution works best for this application, a different one for another application.

As I explain each of these options, remember that within your application requirements, there are other vectors to think about when connecting to multiple data sources: security, performance, and data types to name the more important ones. Each system has different ways of handling these items, and you need to understand what those impacts are, especially in the security area. Many code break-ins are a result of a "weak link" of combining more than one data source.

There are two main ways to access data on multiple servers: using a Data Access Layer (DAL) or using Distributed Queries originated from SQL Server. In this tutorial I'll show you each method, which have their own strengths and weaknesses.

Whether you're a DBA or a Developer, read through both of these sections to understand how each works, and then you can work with your administrator or developer counterpart to come up with the best solution.

For this tutorial, I'll use the standard Purchase Order example to illustrate. Purchase orders have a "header" set of information that doesn't change very frequently, and a "details" area that holds all of the information for a particular order. These are usually in at least two sets of tables in a Relational Database Management System (RDMBS) and are tied together with a Primary Key. In this simplified example, I'll just use a company "number" as the key that ties these together.

Using A Data Access Layer or "Middle Tier"

This process is called a Data Tier or more commonly a Data Access Layer (DAL). The basic idea in any of these is that you replace the direct connection to a database from a client to a central program that accesses data for the client. The presentation client then accesses the DAL to return data back to the screen. It basically moves the query from the client to another program.

When you use this approach, normally you're converting a data structure to a programming Class. This is an Object-Oriented view of data. Classes are Objects that can hold properties (in many ways similar to a column in a database table) or perform actions such as set (write) and get (read).

To bring several datasets together into a single object, the general process is to set up a Class that holds the data structure — this is what is returned to the calling program. The class is made up of a connection to the various datasets, joined on a common key between them. So basically you set up something called "A," connect to "B" and "C," and put the data they hold in "A." The client program then works from "A" as if it were a single object. This is similar to what happens in a standard SQL Join, with the exception that it isn't done in the SQL language and the data sources can be virtually anything. 

You might wonder why you wouldn't just solve every data integration problem this way — and some developers do. There are some issues with this approach, most notably Data Integrity, Performance, and Data Type mismatches.

Just because two datasets share a distinct value may not indicate that you can join them together. Using the simple example of the Purchase Order, it's not always true that just because a dataset that has a company number as a field in a table, that particular number is always the right one when you perform a programmatic join to another dataset, perhaps in an XML document, that has a company number in a Details line items set of data. So you can see that the first danger — and it is a big one — is that data might be joined without regard to ensuring that no "false connections" are made. This issue gets more problematic the more data sets you add in. Just as in electronics, just because you can plug something in doesn't mean you won't end up with a nasty shock.

Assuming that you have ensured the data sets maintain integrity between them, you also need to consider performance. You might have one of the most extremely tuned database systems around, but if that is joined in a program to a huge text file accessed over an Internet connection, the data is returned at the combination rate of both of them. There are ways to mitigate this issue, but it is something to consider.

Data type mismatches are a real issue with this method. It's almost certain that data will be returned from different systems using different data types, and even a small difference can be catastrophic. You can mitigate this issue by carefully examining how each data source handles a data type and casting or converting it in the code to a common, consistent format and numeric handling.

Although definitely not the last thing you should consider but the most important, security is a key consideration when accessing data from a DAL. The DAL should not allow more data access than is completely necessary for a given operation. Once again, understand the data security in each source, and ensure that your code does not allow access to more than absolutely necessary. Also, perform a full risk analysis on your code to ensure that it is secure from any attacks that might allow an unauthorized access to that authorized data.

Reading a combined dataset is relatively simple. However, you have to handle adding, editing or deleting data separately. Normally you capture the changes on the form the user edits, and then perform a set operation back through each dataset individually. In essence your code handles the transaction handling normally done by the RDBMS engine. That's not a problem, it just needs to be thought through carefully. There are constructs, especially within .NET Framework Libraries, to help you handle this.

With all of those caveats, you can access data from multiple sources in code quite simply. I'll stick with the .NET constructs in this tutorial as an example, although there are several examples here at InformIT and elsewhere for Java, C++ and other languages.

Using a .NET Data Adaptor DataSet

The first and most simple method is to use a .NET DataSet object. You set up two (or more) connections, query the data between them, set up a joining operation to combine them into a third object that acts as the aggregated single source. Here's a C# example of that process:

//Set up two connections, more are possible
SqlConnection companyConnection = new SqlConnection("Data Source=FirstServer;Integrated Security=SSPI;Initial Catalog=PurchaseOrders");
SqlConnection lineItemConnection = new SqlConnection("Data Source=SecondServer;Integrated Security=SSPI;Initial Catalog=PurchaseOrders");

SqlDataAdapter headerAdapter = new SqlDataAdapter("SELECT * FROM Company", companyConnection);
//Note: if this is to be a different source such as Oracle or text files, set up a different
//connection information above and possibly a different connection below
SqlDataAdapter lineitemAdapter = new SqlDataAdapter("SELECT * FROM LineItem", lineItemConnection);

// This is the combined dataset
DataSet companyLineItems = new DataSet();

// Go get data from each dataset from above
headerAdapter.Fill(companyLineItems, "companies");
lineitemAdapter.Fill(companyLineItems, "lineItems");


// Join them
DataRelation relation = companyLineItems.Relations.Add("CustLineItems",
  companyLineItems.Tables["companies"].Columns["companyID"],
  companyLineItems.Tables["lineItems"].Columns["companyID"]);

  // Display the combined data
foreach (DataRow pRow in companyLineItems.Tables["companies"].Rows)
{
  Console.WriteLine(pRow["companyID"]);
	foreach (DataRow cRow in pRow.GetChildRows(relation))
	Console.WriteLine("\t" + cRow["LineItem"]);
}

Using Language Integrated Query (LINQ)

Another method is to use Language Integrated Query (LINQ). LINQ is a single data query language used in .NET programming languages. It's completely integrated into the .NET framework, so developers use this construct quite often. I have an entire article on this technology here, so I won't repeat that information in this article.

Here's a snippet of an example of two text files in C# (the SQL Server examples are longer, to be more clear I'm just using text files here):

class PurchaseOrder
{
    public string CompanyName { get; set; }
    public string CompanyAddress { get; set; }
    public int ID { get; set; }
    public List<int> LineItems { get; set; }
}

class PopulateCollection
{
    static void Main()
    {
		 string[] POHeaders = System.IO.File.ReadAllLines(@"../../../POHeaders.csv");
         // File example, Bucksoft,1234 Mockingbird Lane,1001

        string[] LineItems = System.IO.File.ReadAllLines(@"../../../LineItems.csv");
        // File example, 1001, Product 234, Product 237, Product 341, Product 795

        // Merge the data sources - This is the main part of the work
        IEnumerable<PurchaseOrder> queryNamesLineItems =
            from nameLine in POHeaders
            let splitName = nameLine.Split(',')
            from itemLine in LineItems
            let splitLineItemLine = itemLine.Split(',')
            where splitName[2] == splitLineItemLine[0]
            select new PurchaseOrder()
            {
                CompanyName = splitName[0],
                CompanyAddress = splitName[1],
                ID = Convert.ToInt32(splitName[2]),
                LineItems = (from itemAsText in splitLineItemLine.Skip(1) // Do not show the key again
                              select Convert.ToInt32(itemAsText)).
                              ToList()
            };

        List<PurchaseOrder> companys = queryNamesLineItems.ToList();

        // Display each company's name and items
        foreach (var company in companys)
        {
            Console.WriteLine("The items are {0} {1} {2} {3} {4}", company.CompanyName, company.CompanyAddress, company.LineItems);
        }

        //Keep console window open in debug mode
        Console.WriteLine("Press any key to exit.");
        Console.ReadKey();
    }
}

This example is similar to one in Microsoft documentation on LINQ, and there is a full explanation of that code here: http://msdn.microsoft.com/en-us/library/bb513866.aspx

Using Entity Framework

Entity Framework (EF) is a new form of working from a model-first development mechanism, and I've described it further here. I won't show a full example of this kind of query, primarily because it requires not just code. You'll modify control files for the application to add the various providers to be able to call them in code. Because it's more a more complex configuration, and since I'm trying to focus more on a data professional audience than a developer one, I'll point you to this article which has more information on this process in specific, and to these articles for more general EF information. There's also an excellent book here to learn more as well.

Using Distributed Queries

Instead of (or even in combination with) using a Data Access Layer, you can have SQL Server connect to other datasets to bring data together in a single query or Stored Procedure, Function, or View. Microsoft calls this a Distributed Query, and you can learn more about how they view this process here: http://msdn.microsoft.com/en-us/library/ms188721.aspx

Using this method you're able to handle some of the performance and data type issues, but even so there are issues to consider. There's still the security angle, and the simple fact is that querying across multiple data sources takes more time than from one single source. It also gets more difficult to create dissimilar joins using the SQL Server when you add more and more sources into the mix.

Four Part Queries

The simplest way to access more than one data set is to qualify the name in four parts in the query:

server.database.schema.object

While it's simple, there are quite a few caveats. First, if you want to use the server name part, you'll have to set up a linked server — which I'll talk about in a moment. So I'll drop the server name for now —

database.schema.object

Now you can see that the other data source actually has to be another database on the same server, which implies of course that your other data source is on SQL Server, and most specifically that it is on the same Instance of SQL Server. Also, from the security standpoint you have to allow "Cross Database Ownership Chaining," a very dangerous thing to do.

With all of those caveats, I set my test instance of SQL Server to allow Cross Database Ownership Chaining and ran this code:

/* Super uber-brain-dead example - cross database ownership 
must be enabled which can be evil */
USE master;
GO

CREATE DATABASE TestPurchaseOrderDatabaseA;
GO
CREATE DATABASE TestPurchaseOrderDatabaseB;
GO

/* Set up the first database to hold the company data */
Use TestPurchaseOrderDatabaseA;
GO
CREATE TABLE PurchaseOrderHeader 
(CompanyID int PRIMARY KEY
,CompanyName Varchar(50)); 
GO

INSERT INTO PurchaseOrderHeader 
VALUES (1, 'Awesome Examples RUs');
GO
INSERT INTO PurchaseOrderHeader 
VALUES (2, 'Security Is Overrated, Inc.');
GO


/* Set up the second database to hold the details data */
Use TestPurchaseOrderDatabaseB;
GO
CREATE TABLE PurchaseOrderDetail 
(LineID int PRIMARY KEY
, CompanyID int
,LineItem Varchar(50)); 
GO
INSERT INTO PurchaseOrderDetail
VALUES (1,1, 'Distributed Query Example 1');
GO
INSERT INTO PurchaseOrderDetail
VALUES (2,1, 'Distributed Query Example 2');
GO
INSERT INTO PurchaseOrderDetail
VALUES (3,2, 'Let Me In');
GO
INSERT INTO PurchaseOrderDetail
VALUES (4,2, 'My Password is 1234');
GO

/* Now put it all together */
SELECT A.CompanyName, B.LineItem
FROM TestPurchaseOrderDatabaseA.dbo.PurchaseOrderHeader as A
INNER JOIN TestPurchaseOrderDatabaseB.dbo.PurchaseOrderDetail as B
ON A.CompanyID = B.CompanyID
 
/* Clean Up, Clean Up, everybody do your share */
Use master;
GO
DROP DATABASE TestPurchaseOrderDatabaseA;
GO
DROP DATABASE TestPurchaseOrderDatabaseB;
GO

Once again, only under rare circumstances is this a good idea.

OPENROWSET

If you need to access data from a different database or even a different server and server type than SQL Server, you can use the OPENROWSET call in a SQL Server query. You will need to supply the type of connection you want to make, send the query, and once again join the results. Here's an example where I'll use the same table layout, but this time I'll put the details in a text file. The instructions if you want to try this are in the comments.

/* Use OPENROWSET to connect a text file to a database.

First, you'll need a text file called LineItems.txt in c:\temp 
that has this data in it:

1	Distributed Query Example 1
1	Distributed Query Example 2
2	Let Me In
2	My Password is 1234

Now create another text file in c:\temp called LineItems.fmt 
that has this data:
9.0
2
1  SQLCHAR  0  10 "\t"        1  ID                SQL_Latin1_General_Cp437_BIN
2  SQLCHAR  0  40 "\r\n"      2  Description        SQL_Latin1_General_Cp437_BIN
*/

USE master;
GO
CREATE DATABASE TestPurchaseOrderDatabase;
GO

/* Set up the database to hold the company data */
Use TestPurchaseOrderDatabase;
GO
CREATE TABLE PurchaseOrderHeader 
(CompanyID int PRIMARY KEY
,CompanyName Varchar(50)); 
GO

INSERT INTO PurchaseOrderHeader 
VALUES (1, 'Awesome Examples RUs');
GO
INSERT INTO PurchaseOrderHeader 
VALUES (2, 'Security Is Overrated, Inc.');
GO

/* Now put it together */
SELECT A.CompanyName, B.Description
FROM PurchaseOrderHeader as A
INNER JOIN OPENROWSET( BULK 'c:\temp\LineItems.txt', 
   FORMATFILE = 'c:\temp\LineItems.fmt') AS B
ON A.CompanyID = B.ID;
GO

This method works well for well-defined OLE DB data sources, and there are quite a few of those. You can learn more about your options here: http://msdn.microsoft.com/en-gb/library/ms190312.aspx

But a simple examination of the Execution Plan for this query shows that it's quite a bit of time to "spin up" that remote query provider to access the data. While it works, it can be a bit slow, especially if you use it over and over. For that, you can use a Linked Server. 

Linked Servers and OPENQUERY

A "Linked Server" is essentially a definition of another data source within SQL Server. The engine opens a channel to the other data source and allows you to query it.

To create a Linked Server, you can use graphical tools or the sp_addlinkedserver  Stored Procedure. Once you connect to that server you can implement the four-part query I mentioned earlier. You can reference this link to locate the exact parameters you need, based on the type of data source you want to connect to. For instance, here is an example of connecting to an Oracle server called by the SQL*Net name of PurchasingServer, and setting that name to TestPurchaseOrderDatabaseB in SQL Server:

EXEC sp_addlinkedserver
   @server = N' TestPurchaseOrderDatabaseB',
   @srvproduct = N'Oracle',
   @provider = N'MSDAORA',
   @datasrc = N'PurchasingServer';
GO

Now you can reference the server in a SQL Server by the four part name like this:

SELECT * FROM
TestPurchaseOrderDatabaseB.dbo.PurchaseOrderDetails;
GO

And use it just as I showed earlier for the three-part name. While Linked Servers simplify connections, they can be problematic. Connection parameters (like having the proper Oracle Client installed on the server) and so on can grow quickly. Also, performance can suffer because of the methods that the engine chooses to query the other server.

To help with the second issue, you can create a "pass-through" query. This construct sends the query directly to the other data system, which will theoretically process it faster than SQL Server will, and return a data set quicker. To create a pass-through query after you create your Linked Server you can use the OPENQUERY command.

Essentially you treat this command as a sub-SELECT operation, like this:

SELECT *
FROM OPENQUERY(TestPrucahseOrderDatabaseB, 'SELECT ID, Description  FROM PruchaseOrderGroup.LineItems') ;
GO

This can be an interesting exercise when you want to edit or delete the data — read the full description of that process here: http://msdn.microsoft.com/en-us/library/ms188427.aspx

While Linked Servers are very simple to implement, I find them quite "brittle," meaning that they malfunction often. Also, it is quite frustrating to learn that the query you've been battling with for a while is on a Linked Server — look for those four-part-names if you see a misbehaving query.

Security can also be an issue. Once the server is linked, it carries the security of the code, not necessarily the user.

OPENDATASOURCE

If you do need the power and convenience of a Linked Server, it's often a better choice to use the command. This command allows you to use a four-part-name in a query without having to create a Linked Server first. It's most useful with another SQL Server, because you can send along the Windows Authentication that the SQL Server Service is using. If you use another provider, like Oracle, you need to protect this code because you may have to send along your name and password — something you don't want in visible code. I have also used this with "security-less" systems like Excel spreadsheets, like this:

SELECT * FROM OPENDATASOURCE('Microsoft.Jet.OLEDB.4.0'
,'Data Source=C:\temp\POLineItems.xls;Extended Properties=EXCEL  5.0')...[Sheet1$] ;

As you can see, you have quite a few options for connecting to multiple sources of data. My personal take is that a programmatic method is often the most secure and best performing option, but that assumes that the developers have done their homework on making that true. Each situation is different, and knowing more about what options you have is the key.