InformIT

Introduction to Database Replication

Date: Mar 5, 2004

Return to the article

Replication involves much more than setup, just as marriage involves much more than a honeymoon. And there aren't many sources of information for replication implementation and troubleshooting. In this first in a series on replication, Baya Pavliashvili fills in some of the gaps for you by detailing replication types and giving you a general overview.

As it relates to SQL Server, replication is a way of keeping data synchronized in multiple databases. Implementing and maintaining replication might not be a simple proposition: If you have numerous database servers that need to be involved in various types of replication, a simple task can quickly become complex. Implementing replication can also be complicated by the application architecture.

Microsoft SQL Server has supported replication since version 6.0, and setting up replication has become significantly easier over the years (in fact, 99 percent of replication setup can be accomplished by clicking through replication wizards). However, replication involves much more than setup, and unfortunately there aren't many sources of information for implementing and troubleshooting it. The only way to learn replication is to dig through the knowledge base articles and figure things out on your own.

In this series of articles, I hope to fill in some of these gaps in information. This first article introduces you to replication types and concepts. Future articles examine setup, maintenance, and troubleshooting steps for transactional replication.

NOTE

This series of articles does not discuss merge or snapshot replication.

Replication Terminology

SQL Server replication is commonly described by using the publisher/subscriber metaphor. A database server that makes data available for replication (source server) is referred to as the publisher; a collection of one or more database objects that are enabled for replication is called a publication. SQL Server supports replicating tables, views, stored procedures, and user-defined functions.

One or more servers that get data and/or transactions from the publisher are called subscribers. Replication is managed by the system database, which by default is called distribution. A distribution database—which can reside on the publisher, subscriber, or on a separate server—is created when you configure replication.

The server that hosts the distribution database is referred to as the distribution server or distributor.

TIP

It is recommended that you always use a server that is dedicated to distributing transactions. Thus, the distribution server should be used for nothing but replication.

Each database server can act as a publisher and subscriber at the same time. Each publisher can have multiple subscribers, and each subscriber can receive transactions from multiple publishers.

You should also become familiar with replication agents, which are implemented as SQL Server jobs that perform a particular task according to their schedule.

SQL Server 2000 supports three distinct types of replication: snapshot, transactional, and merge, each of which has its own purpose.

Snapshot Replication

Snapshot replication simply takes a "snapshot" of the data on one server and moves that data to another server (or another database on the same server). After the initial synchronization snapshot, replication can refresh data in published tables periodically—based on the schedule you specify. Although snapshot replication is the easiest type to set up and maintain, it requires copying all data each time a table is refreshed.

Between scheduled refreshes, data on the publisher might be very different from the data on subscriber. In short, snapshot replication isn't very different from emptying out the destination table(s) and using a DTS package to import data from the source.

Transactional Replication

Transactional replication involves copying data from the publisher to the subscriber(s) once and then delivering transactions to the subscriber(s) as they occur on the publisher. The initial copy of the data is transported by using the same mechanism as with snapshot replication: SQL Server takes a snapshot of data on the publisher and moves it to the subscriber(s). As database users insert, update, or delete records on the publisher, transactions are forwarded to the subscriber(s).

To make sure that SQL Server synchronizes your transactions as quickly as possible, you can make a simple configuration change: Tell it to deliver transactions continuously. Alternatively, you can run synchronization tasks periodically. Transactional replication is most useful in environments that have a dependable dedicated network line between database servers participating in replication. Typically, database servers subscribing to transactional publications do not modify data; they use data strictly for read-only purposes. However, SQL Server does support transactional replication that allows data changes on subscribers as well.

Merge Replication

Merge replication combines data from multiple sources into a single central database. Much like transactional replication, merge replication uses initial synchronization by taking the snapshot of data on the publisher and moving it to subscribers. Unlike transactional replication, merge replication allows changes of the same data on publishers and subscribers, even when subscribers are not connected to the network. When subscribers connect to the network, replication will detect and combine changes from all subscribers and change data on the publisher accordingly. Merge replication is useful when you have a need to modify data on remote computers and when subscribers are not guaranteed to have a continuous connection to the network.

Replication can be used effectively for many different purposes, as discussed in the following sections.

Separating Data Entry and Reporting

If you have worked in an environment in which the same database is used for data entry and reporting, you probably know that things aren't always rosy. Constantly reading and modifying data in the same set of tables just doesn't work very well if you care about data integrity. Transactions that run against a set of tables prevent reading the locked data rows and pages, or perhaps prevent even an entire table from being read by a report. In such an environment, you are bound to see blocking locks. Although there are ways to avoid blocking (please see my earlier article about transactions and locking), it is best to separate data entry and reporting databases. Transactional replication works well by delivering data changes from the data entry server to the reporting server.

Distributing Load Across Servers

As your organization grows, you might find yourself in a situation in which a single database server is utilized by too many users. If CPU utilization on your database servers is constantly over 80 percent and you have tuned database design and queries appropriately, chances are you could benefit by spreading the user base over multiple servers. For instance, a server named South could serve all employees working in the southern United States, and a server called North could serve all Northerners. If you need to combine all data for reporting, you could use replication to move transactions from North and South to a server named Central_Reporting.

Providing High Availability

Occasionally, you might consider using replication for high availability; that is, to replicate transactions from the main server to a standby server. If the main server fails, you can then point your data sources to the standby server. Be aware that using replication for high availability takes careful planning and testing. Replication does not provide any sort of automatic fail-over. SQL Server supports other methods of providing high availability, such as clustering and log-shipping, which might be more appropriate for your environment.

Transporting Data

Another common use for replication is to simply move data changes from publishers to subscribers. This method is particularly useful for moving transactional data to a data warehousing server, in which it is transformed and aggregated for OLAP reporting. SQL Server provides other ways of transporting data: DTS, BCP, BULK INSERT statements, and others. Be sure to carefully consider the alternatives before implementing replication because other solutions might be cheaper or even faster than replication.

Replication needs to be planned carefully. Setting things up is easy, but there is no magic UNDO button that will reverse all your actions. Therefore, be sure to test your plan thoroughly before implementing a replication solution. The following sections discuss some of the planning steps necessary for transactional replication.

Defining Server Roles

The first step is defining server roles using the publisher/subscriber/distributor model discussed earlier. Identifying publishers and subscribers is relatively simple: Publisher is the server in which data changes occur; subscriber is where changes need to be delivered.

Be careful when defining the distributor, however. It is often tempting to use the publishing server as a distributor because it can translate into saving the cost of purchasing a separate server. However, if you're implementing replication to reduce the load on the server or to separate read/write activity, you should strongly consider using a separate server dedicated to distributing replicated transactions. Bear in mind that every single transaction that occurs on your server must be recorded in the distribution database prior to being delivered to the subscribers. Some shops opt for using one of the subscriber servers as the distributor, which can work if the subscriber is not heavily taxed with the existing load. Using the subscriber as the distributor is also common in test and quality assurance environments.

Determining What Data Needs to be Replicated

You might now be thinking: "Oh, that's easy...just replicate all database objects, right?" Not so fast! If you have a single database server with 20 users and a 120MB database, you can safely publish all data and subscribe to it from another server. But in the real world, you should replicate only data that you must have on the subscribers. SQL Server replication is a mature technology that can move many millions of transactions each day given sufficient hardware resources.

But I'd hesitate putting my production servers to such a test. To accomplish the best possible performance without bringing your servers to their knees, consider carefully which tables, views, stored procedures and user-defined functions you must replicate.

Replication Agents

Transactional replication involves three agents: snapshot, log reader, and distribution. The snapshot agent takes a snapshot of records on the publisher and copies the data out to the snapshot folder. The snapshot agent also generates scripts for database schema and includes CREATE TABLE and CREATE INDEX scripts.

The snapshot agent doesn't have to run continuously for transactional replication. If you get your replication working properly the first time, you might never have to run the snapshot agent again after the initial synchronization. However, if you do have problems with the subscriber servers missing data, the snapshot agent is there to help.

The log reader agent reads the transaction log on the published databases. This agent moves transactions that are marked for replication to the distribution database. The distribution agent delivers transactions from the distribution database to the subscribers. Log reader and distribution agents have to run continuously (or at scheduled intervals) to keep replication working.

In addition to snapshot, log reader, and distribution agents, replication also uses a few other jobs (agents) to keep things organized. The history cleanup agent, for example, is used to delete transactions that have already been delivered from the distribution database. Indeed, if this agent did not work, the distribution database would grow very large.

Agent Profile Settings

Replication agents are implemented as SQL Server jobs that call executable files with certain parameters. You should be aware that clicking through the replication wizards configures agents to run with default parameters. If you need to tweak agent parameters for troubleshooting or for performance reasons, you'll have to modify the replication agent's profile. (I'll discuss replication agents' parameters in the next article.)

Security Considerations

Replication agents must have appropriate security permissions to read data on the publisher, move transactions to the distributor and apply the data and transactions to the subscribers. You can allow replication agents to run using security credentials of SQL Server Agent service; alternatively, you can define a login that has a more limited set of permissions. Security is not a joking matter: Allow your replication agents too much freedom, and a hacker can destroy your data on publishers as well as subscribers. On the other hand, not granting sufficient permissions to the agents prevents replication from working properly.

Identity Columns

Identity columns present a special challenge for some replication scenarios. For example, consider what happens if you have multiple publishers sending transactions to a single subscriber. You want to split the user base among two servers, but many tables in your database have identity columns. How do you prevent identity values from colliding on two servers? You must manage the identity ranges on your servers somehow to avoid duplicate identity values on the subscriber. One solution is to allow only positive IDENTITY values on Server A and only negative IDENTITY values on Server B.

Collation

Collation is a combination of case-sensitivity and sort order used for character data. Earlier releases of SQL Server limited defining of collation to a particular server or database. With SQL Server 2000, you can define collation for each character column. Such freedom comes with the price of an administrative nightmare, however, if you have numerous character columns using different collations.

As far as replication is concerned, published and subscribed tables must have the same collation for corresponding columns. Having different collations on the publisher and subscriber might result in failure to replicate transactions. Furthermore, SELECT statements executed against the same table on the publisher and subscriber might return different data if tables have different collation.

Triggers and Constraints

Yet another consideration is whether triggers and constraints should be enforced on the subscribing servers. Triggers are often used to maintain data integrity by executing INSERT, UPDATE, or DELETE statements when a corresponding data modification occurs in related tables. Such statements might already be replicated from the publisher, however; therefore, if you're not careful, you might end up running the same transaction twice—once through replicating the transaction from the publisher and then through a trigger that resides on the subscriber. On the other hand, if you allow modifying data on subscribers, you might want to enforce data integrity through triggers. Constraints check the existence of the related record prior to allowing the modification of a row in a table. However, depending on your replication scheme, the related table(s) might not be replicated.

The saving grace with triggers and constraints is the NOT FOR REPLICATION option, which allows SQL Server to disregard triggers and constraints for replicated transactions (but check them for all other transactions). If your subscriber databases are used for read-only purposes, you might wish to drop triggers and constraints from subscribing databases altogether.

TEXT and IMAGE Data

TEXT, NTEXT, and IMAGE are special data types for holding a large number of characters or images stored in binary format. Such data types are typically modified using the WRITETEXT and UPDATETEXT commands, which are not logged. You can also modify such data using INSERT/UPDATE statements, which are logged. Be sure to examine your application carefully to ensure that your text and image data changes are being logged. In addition, be aware that although you can replicate columns with the TEXT data type, SQL Server has a limit of characters that will be replicated for such data.

Immediate Updating and Queued Updating

When you want to use updatable subscribers, you need to consider whether transactions need to be delivered back to the publisher immediately or whether some delay is acceptable. Typically, you should opt for queued updating if continuous network connectivity is not guaranteed.

This article introduced you to high-level concepts of replication with SQL Server 2000. I also gave you some advice for the appropriate planning of transactional replication. This introduction is not meant to scare you off; replication is not rocket science. However, it does require careful planning and much patience to get it right. The upcoming articles in this series will dive into details of setting up, monitoring, and troubleshooting transactional replication.

800 East 96th Street, Indianapolis, Indiana 46240