1.2 System Architectures
1.2.1 Synchronous Systems
The basic synchronous digital transmission system uses a central clock that is distributed to both the transmitter (TX) and the receiver (RX) (Figure 1-1). On every clock edge, the transmitter latches the incoming data, which then travels down the transmission line toward the receiver. The receiver samples the data on the next clock edge. Short-distance synchronous systems, for example between a processor and its memory, are often parallel: Multiple data lines are clocked together.
Figure 1-1 Block diagram of a synchronous system with a common clock source
Figure 1-2 shows the timing diagram for such a synchronous system, with all the relevant delays: the propagation delay of the clock signal from the clock source to the TX latch (TClkTX) and the RX latch (TClkRX), the time it takes the TX to latch the data (TClkDataOut), and the propagation delay of the data path (TDataTXRX). For the sake of simplicity, we will not include timing uncertainties in our analysis. From these delays, we can calculate TData, the time when the data arrives at the receiver latch:
$$T_{\mathrm{Data}} = T_{\mathrm{ClkTX}} + T_{\mathrm{ClkDataOut}} + T_{\mathrm{DataTXRX}} \tag{1-1}$$
Figure 1-2 Timing diagram for a synchronous system with a common clock
and also TSample, the time when the receiver latch will sample the data:
$$T_{\mathrm{Sample}} = T_{\mathrm{Cycle}} + T_{\mathrm{ClkRX}} \tag{1-2}$$
An additional requirement of the receiver latch is that the incoming data is stable for some time before and after the sampling clock edge; it requires a positive setup time (TSetup) and hold time (THold). How long the data at the receiver latch is stable before sampling is equal to the time difference between TSample and TData. In order to maintain the setup time requirement, this value has to be larger than the setup time:
$$T_{\mathrm{Sample}} - T_{\mathrm{Data}} = T_{\mathrm{Cycle}} + T_{\mathrm{ClkRX}} - T_{\mathrm{ClkTX}} - T_{\mathrm{ClkDataOut}} - T_{\mathrm{DataTXRX}} > T_{\mathrm{Setup}} \tag{1-3}$$
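Then the setup time margin is
$$T_{\mathrm{SetupMargin}} = T_{\mathrm{Cycle}} + T_{\mathrm{ClkRX}} - T_{\mathrm{ClkTX}} - T_{\mathrm{ClkDataOut}} - T_{\mathrm{DataTXRX}} - T_{\mathrm{Setup}} \tag{1-4}$$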
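From this, we can calculate the minimum cycle time, by setting the setup margin to zero:
$$T_{\mathrm{Cycle,min}} = T_{\mathrm{ClkTX}} + T_{\mathrm{ClkDataOut}} + T_{\mathrm{DataTXRX}} + T_{\mathrm{Setup}} - T_{\mathrm{ClkRX}} \tag{1-5}$$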
If a system has insufficient setup margin, we can increase either the cycle time (make the system slower) or the RX clock propagation delay, or we can decrease either the TX clock propagation delay or the data propagation delay.
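How long the data at the receiver latch is stable after sampling is equal to TData plus one cycle, minus TSample. This value has to be larger than the hold time:
$$T_{\mathrm{Data}} + T_{\mathrm{Cycle}} - T_{\mathrm{Sample}} = T_{\mathrm{ClkTX}} + T_{\mathrm{ClkDataOut}} + T_{\mathrm{DataTXRX}} - T_{\mathrm{ClkRX}} > T_{\mathrm{Hold}} \tag{1-6}$$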
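Then the hold time margin is
$$T_{\mathrm{HoldMargin}} = T_{\mathrm{ClkTX}} + T_{\mathrm{ClkDataOut}} + T_{\mathrm{DataTXRX}} - T_{\mathrm{ClkRX}} - T_{\mathrm{Hold}} \tag{1-7}$$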
Note that the hold time margin is independent of the cycle time; the hold time requirement doesn't relax if the system runs at a slower speed. In order to gain hold time margin, we can decrease the RX clock propagation delay, or we can increase either the TX clock propagation delay or the data propagation delay.
Let's consider an example: an 8-bit parallel synchronous system with 74ACT646 registered transceivers (Figure 1-3). The distance between the transmitter and receiver is 6 inches, which is equivalent to a propagation delay of approximately 1.0 ns on an FR4 printed circuit board. For simplicity, we assume that the clock source is exactly in the middle between transmitter and receiver, so that both TClkTX and TClkRX are 0.5 ns. The 74ACT646 has a specified worst-case setup time of 5.0 ns, a hold time of 0.0 ns, and a clock-to-data-out time of 12.0 ns. From Equation 1-5, we calculate the minimum cycle time as 18.0 ns, which gives us a maximum operating frequency of 55.55 MHz; the hold time margin for this setup is 13.0 ns. If we place the clock source at the transmitter (so that TClkTX equals 0.0 ns and TClkRX equals 1.0 ns), the minimum cycle time is only 17.0 ns, so we can operate at frequencies up to 58.82 MHz, still without violating the hold time requirement: The hold time margin for this configuration is 12.0 ns.
Figure 1-3 Example of a parallel synchronous system with 74ACT646 octal registered transceivers. Both devices are set to latch data from port A (A0–A7) to port B (B0–B7) on a positive edge on the AB clock pulse input (CPAB).
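The 74ACT646 example can be checked with a few lines of Python. This is a minimal sketch: the function name `sync_timing` and its parameter names are illustrative, and the formulas are the minimum cycle time and hold margin relations as described in the text above (all values in nanoseconds).

```python
def sync_timing(t_clk_tx, t_clk_rx, t_clk_data_out, t_data_txrx, t_setup, t_hold):
    """Return (minimum cycle time, hold margin) in ns for a common-clock link."""
    # Time at which the data arrives at the receiver latch.
    t_data = t_clk_tx + t_clk_data_out + t_data_txrx
    # Setup margin is zero when the cycle time equals this value.
    min_cycle = t_data + t_setup - t_clk_rx
    # The hold margin does not depend on the cycle time.
    hold_margin = t_data - t_clk_rx - t_hold
    return min_cycle, hold_margin


# Clock source halfway between transmitter and receiver.
centered = sync_timing(0.5, 0.5, 12.0, 1.0, 5.0, 0.0)
# Clock source placed at the transmitter.
at_tx = sync_timing(0.0, 1.0, 12.0, 1.0, 5.0, 0.0)

for label, (min_cycle, hold_margin) in (("centered", centered), ("at TX", at_tx)):
    f_max_mhz = 1000.0 / min_cycle  # 1/ns expressed in MHz
    print(f"{label}: min cycle {min_cycle} ns, "
          f"f_max {f_max_mhz:.2f} MHz, hold margin {hold_margin} ns")
```

The two calls reproduce the numbers quoted in the example: 18.0 ns and 13.0 ns for the centered clock source, 17.0 ns and 12.0 ns with the clock source at the transmitter.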
One of the most widely used parallel synchronous systems is the Peripheral Component Interconnect (PCI) bus, designed to attach peripherals to computers. In a multidrop configuration, multiple receivers are connected to the same transmitter; in a multipoint configuration such as PCI, bidirectional transceivers are attached to a common data bus (Figure 1-4). A bus master is responsible for maintaining bus integrity; for example, it ensures that no two systems send data at the same time. Setup and hold time requirements need to be fulfilled for all combinations of send and receive, which is a further limitation on the speed that is achievable with such a configuration. Different variants of PCI run at 33 MHz and 66 MHz, with 32 or 64 parallel data lines. PCI-X increased the signaling rate even further (to 133 MHz, 266 MHz, and even 533 MHz) but never became widely used in consumer products because of costs associated with the complicated signal routing.
Figure 1-4 Block diagram of a bidirectional multipoint parallel bus (e.g., PCI)
1.2.2 Source Synchronous Systems
In source synchronous systems, the sampling clock is sent along with the data by the transmitter, rather than being distributed by a central clock source as in synchronous systems. Figure 1-5 shows a generic block diagram. The transmitter has its own clock source, which generates edges for the TX data latch and the clock that is sent to the receiver for sampling. The delay element in the TX clock path ensures that the clock edge arrives at the receiver later than the data, which is required for correct sampling.
Figure 1-5 Block diagram of a source synchronous system
Source synchronous systems can operate at significantly higher speeds than synchronous systems. The reason for this becomes clear when we look at the timing diagram for the system (Figure 1-6). The relevant delays are the propagation delay of the data (TDataTXRX) and clock (TClkTXRX) and the delay between data and clock at the transmitter (TDataClkTX). The two latches at the transmitter do have clock-to-data-out times, but we simply included them in the propagation delays; we've done the same for the clock distribution within the transmitter. The time when the data becomes valid at the receiver is
$$T_{\mathrm{Data}} = T_{\mathrm{DataTXRX}} \tag{1-8}$$
Figure 1-6 Timing diagram for a source synchronous system
and the sample time is
$$T_{\mathrm{Sample}} = T_{\mathrm{DataClkTX}} + T_{\mathrm{ClkTXRX}} \tag{1-9}$$
Note that the cycle time disappeared from the two equations; source synchronous systems don't have a theoretical frequency limit. Setup and hold time requirements still need to be satisfied, however. From the setup time and hold time requirements (Equations 1-3 and 1-6), we calculate the setup time margin:
$$T_{\mathrm{SetupMargin}} = T_{\mathrm{DataClkTX}} + T_{\mathrm{ClkTXRX}} - T_{\mathrm{DataTXRX}} - T_{\mathrm{Setup}} \tag{1-10}$$
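and the hold time margin:
$$T_{\mathrm{HoldMargin}} = T_{\mathrm{DataTXRX}} + T_{\mathrm{Cycle}} - T_{\mathrm{DataClkTX}} - T_{\mathrm{ClkTXRX}} - T_{\mathrm{Hold}} \tag{1-11}$$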
Because the data-to-clock delay at the transmitter is controlled by the delay element, we can make every source synchronous system work, provided that the sum of the setup and hold times doesn't exceed the cycle time. At very high speeds, however, it becomes increasingly difficult to control the skew between the data path and the clock path, especially if data is transmitted in parallel. Practical source synchronous systems operate at data rates up to 1 GHz.
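To make the role of the delay element concrete, the following sketch computes both margins for a source synchronous link and the transmitter delay that splits the available window evenly. The function names and all numeric values are hypothetical and chosen only for illustration; the formulas follow from applying the setup and hold requirements to the data arrival and sample times described above.

```python
def source_sync_margins(t_data_clk_tx, t_clk_txrx, t_data_txrx,
                        t_cycle, t_setup, t_hold):
    """Return (setup margin, hold margin) for a source synchronous link."""
    t_data = t_data_txrx                    # data valid at the receiver
    t_sample = t_data_clk_tx + t_clk_txrx   # forwarded clock edge at the receiver
    setup_margin = t_sample - t_data - t_setup
    hold_margin = t_data + t_cycle - t_sample - t_hold
    return setup_margin, hold_margin


def balanced_delay(t_clk_txrx, t_data_txrx, t_cycle, t_setup, t_hold):
    """Transmitter data-to-clock delay that makes both margins equal."""
    return t_data_txrx - t_clk_txrx + (t_cycle + t_setup - t_hold) / 2.0


# Hypothetical link: 2.0 ns cycle, matched 1.0 ns data and clock paths.
link = dict(t_clk_txrx=1.0, t_data_txrx=1.0, t_cycle=2.0, t_setup=0.3, t_hold=0.2)
delay = balanced_delay(**link)
setup_margin, hold_margin = source_sync_margins(delay, **link)
print(f"delay {delay:.2f} ns, setup margin {setup_margin:.2f} ns, "
      f"hold margin {hold_margin:.2f} ns")
```

Whatever delay is chosen, the two margins always add up to the cycle time minus the setup and hold times, which is why the link can be made to work only if that difference is not negative.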
1.2.3 Source Synchronous Systems with Double Data Rate
A variant of source synchronous transmission uses a half-rate clock and latches data at both the rising and the falling edges. Because the data is transmitted at double the speed relative to a normal clock, this variant is called double data rate (DDR) signaling. Figure 1-7 shows an example of a timing diagram for such a system. The timing relationships are almost the same as before, with one difference: Both positive and negative clock edges can be used as sampling references.
Figure 1-7 Timing diagram for a source synchronous system with a double data rate clock
Because of the reduced clock speed, signal routing is simplified, and lower bandwidth connectors can be used. This is one of the reasons why DDR is used, for example, in high-speed memory interfaces such as DDR-2 SDRAM, with clock speeds up to 400 MHz and corresponding data transfer rates up to 800 Mbit/s per data line.
1.2.4 Forwarded Clock Systems
Forwarded clock systems are very similar to source synchronous systems (Figure 1-8). The main idea of the forwarded clock is that the clock path from the transmitter to the receiver experiences the exact same noise and jitter as the data path.
Figure 1-8 Block diagram of a forwarded clock system
The first major difference of the forwarded clock architecture compared to the source synchronous architecture is that the delay element in the clock path resides in the receiver rather than in the transmitter. This is to make sure that any jitter that occurs during the transmission impacts both data and clock and hence cancels out. The delay element on the receiver is a delay locked loop (DLL) in most cases, which increases the flexibility of the system because it automatically adjusts the delay between the data path and the clock path.
The second major difference is that there are now two latches on the transmitter side: one for the data and one for the clock. The exact same type of latch and driver are used for the clock and the data. This is to ensure that any negative effects that the driver may have on the signal (e.g., thermal drift) affect both the data path and the clock path and cancel out.
1.2.5 Embedded Clock Serial Systems
Embedded clock systems transmit only the serial data stream, and the receiver extracts the sampling clock automatically from the data (Figure 1-9). The main advantage of this architecture is that propagation delays and skew are nonissues. The clock data recovery (CDR) circuit at the receiver takes care of the correct phase alignment between data and clock. This enables serial data signaling at very high rates, up to 10 Gbit/s and beyond. Also, the CDR circuit can track some variations in clock speed and other low-frequency timing variations, which makes embedded clock systems very robust.
Figure 1-9 Block diagram of an embedded clock system
Because the link between the transmitter and the receiver consists of only one transmission line, the possible routing density is greatly increased over parallel source synchronous systems. Serial data can be easily transmitted over thin and flexible cables. Long-distance optical communications systems use this clocking scheme almost exclusively.
However, the price to pay for the flexibility and the high data rates is increased complexity of the transmitter and especially the receiver. The main building blocks of a serial embedded clock system are the parallel-to-serial conversion at the transmitter, the serial-to-parallel conversion at the receiver, the reference clock generation, and the clock data recovery. Integrated circuit designs are commonly available, though, so designs of this type can be inexpensive and straightforward. We will look at these building blocks in more detail in the following subsections.
1.2.5.1 Serializer and Deserializer
In most cases, the data within both the transmitter and the receiver is kept parallel and converted to and from serial format only for the data transmission over the serial link. The components that perform this conversion from parallel to serial and back are called serializer and deserializer (SERDES) components. SERDES components often integrate the TX and RX latches shown in Figure 1-9.
In Figure 1-10, we show an example implementation of a 4:1 parallel-to-serial converter. The heart of the serializer is a shift register, consisting of the latches L0 to L3. The shift register is clocked by the serial clock, at the serial data rate. The inputs into the latches are multiplexed, and the control input for the multiplexers selects either the shift register chain or the parallel data bits D0 to D3 from the parallel input latch. The clock for the parallel latch is the serial clock divided by four. The control signal needs to select the parallel input for one cycle of the serial clock, and the shift register chain for the next three cycles.
Figure 1-10 Implementation example for a 4:1 shift register serializer
Figure 1-11 shows the corresponding deserializer. The serial data is clocked through the shift register (L0 to L3) and latched into the parallel output latch (D0 to D3) every four cycles of the serial clock.
Figure 1-11 Implementation example for a 1:4 shift register deserializer
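The behavior of the two shift registers can be modeled in a few lines of Python. This is a behavioral sketch, not a description of the circuits in the figures: the function names are illustrative, and the bit order (D0 leaves the serializer first) is an assumption, since the text does not specify it.

```python
def serialize(words, width=4):
    """Model a width:1 shift register serializer; returns the serial bit list."""
    serial = []
    register = [0] * width
    for word in words:
        if len(word) != width:
            raise ValueError("every parallel word must have `width` bits")
        for cycle in range(width):
            if cycle == 0:
                register = list(word)            # control selects the parallel input
            else:
                register = register[1:] + [0]    # control selects the shift chain
            serial.append(register[0])           # bit at the output end (assumed D0 first)
    return serial


def deserialize(bits, width=4):
    """Model a 1:width deserializer; latches a word every `width` serial cycles."""
    words = []
    register = [0] * width
    for count, bit in enumerate(bits, start=1):
        register = register[1:] + [bit]          # clock the bit through the shift register
        if count % width == 0:
            words.append(tuple(register))        # divided clock latches the parallel word
    return words


parallel_in = [(1, 0, 1, 1), (0, 0, 1, 0)]
serial_stream = serialize(parallel_in)
parallel_out = deserialize(serial_stream)
print(serial_stream, parallel_out)
```

In this model the deserializer's divided clock happens to start in step with the serializer, so the words come back unchanged.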
The drawback of this rather simplistic SERDES design is that the phase of the incoming data is not known. If the serializer and deserializer operate back to back, the parallel data is not guaranteed to be recovered with the same phase. Figure 1-12 shows an example of this behavior: The parallel data at the output is rotated by one bit relative to the parallel input data. More advanced SERDES architectures provide word synchronization features that ensure that the parallel data phase is correct.
Figure 1-12 Serializer and deserializer (represented by the trapezoids) in back-to-back mode. Deserializer is out of phase.
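The phase problem is easy to reproduce in a small model. In the sketch below, `offset` stands for the number of serial clock cycles by which the deserializer's divided clock is misaligned. The search for a known alignment word in `find_offset` is one possible word synchronization approach, shown here as an assumption rather than as the method of any particular SERDES device.

```python
def deserialize(bits, width=4, offset=0):
    """Group a serial stream into words, starting `offset` bits into the stream."""
    usable = bits[offset:]
    return [tuple(usable[i:i + width])
            for i in range(0, len(usable) - width + 1, width)]


def find_offset(bits, sync_word):
    """Return the first offset (0..width-1) at which `sync_word` appears, else None."""
    width = len(sync_word)
    for offset in range(width):
        if tuple(bits[offset:offset + width]) == tuple(sync_word):
            return offset
    return None


# Three 4-bit words, labeled so that the rotation is visible.
stream = ["A0", "A1", "A2", "A3", "B0", "B1", "B2", "B3", "C0", "C1", "C2", "C3"]

rotated = deserialize(stream, offset=1)      # deserializer one bit out of phase
print(rotated[0])

# Receiver starts listening three bits late and hunts for the word B0..B3.
received = stream[3:]
offset = find_offset(received, ("B0", "B1", "B2", "B3"))
aligned = deserialize(received, offset=offset)
print(offset, aligned)
```

With a one-bit phase error, every output word mixes three bits of one input word with the first bit of the next; once the offset is known, the remaining words are recovered intact.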
1.2.5.2 Reference Clock Generation
Both the serializer and deserializer require an at-speed clock signal. At the receiver, this clock is supplied by the clock data recovery circuit, but to avoid the routing of a high-speed clock signal across the system, the transmitter needs to create its own high-speed reference. It is usually generated from a lower-speed system reference clock with the help of a multiplying phase locked loop (PLL) circuit.
1.2.5.3 Clock Data Recovery
The CDR circuit extracts the sampling clock from the serial data stream, adjusting both the phase and frequency in the process to ensure proper sampling. There are two types of CDR circuits: analog and digital.
Analog CDR circuits (Figure 1-13) consist of a phase detector, a loop filter, and a voltage-controlled oscillator (VCO). The phase detector compares the phase of the serial data with the phase of the VCO output. The phase detector output is then low-pass filtered and passed on to the control input of the VCO, whose output therefore tracks the incoming data. The dynamic properties of an analog CDR circuit depend on all three components, but the loop filter certainly has the largest impact; its characteristics determine how fast the CDR circuit locks on the data at start-up (lock time), how much frequency and phase variation can be tracked (tracking range), and how quickly the CDR circuit responds to frequency and phase changes at the input (loop bandwidth). The quality of the output clock depends mainly on the properties of the VCO and its support circuitry (e.g., the power supply): The lower the phase noise of the VCO, the cleaner the clock.
Figure 1-13 Block diagram of an analog CDR circuit
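The influence of the loop filter on lock time can be illustrated with a simplified discrete-time model of the loop. This is only a sketch under strong assumptions: the phase detector is treated as ideal and linear, the loop filter is modeled as a proportional-plus-integral stage with hypothetical gains `kp` and `ki`, and all phases are expressed in unit intervals relative to the nominal clock.

```python
def simulate_cdr_loop(freq_offset, kp, ki, steps=2000, initial_phase_error=0.25):
    """Return the phase error (in UI) at each step of a linearized CDR loop."""
    phase_in = initial_phase_error   # phase of the incoming data
    phase_vco = 0.0                  # phase of the recovered clock
    integrator = 0.0                 # integral path of the loop filter
    errors = []
    for _ in range(steps):
        error = phase_in - phase_vco         # phase detector
        integrator += ki * error             # loop filter, integral path
        control = kp * error + integrator    # loop filter output drives the VCO
        phase_vco += control                 # VCO advances its phase by the control value
        phase_in += freq_offset              # data drifts because of the frequency offset
        errors.append(error)
    return errors


def lock_time(errors, threshold=0.01):
    """First step after which the phase error stays below `threshold`."""
    last_bad = max((i for i, e in enumerate(errors) if abs(e) >= threshold), default=-1)
    return last_bad + 1


fast = simulate_cdr_loop(freq_offset=0.001, kp=0.1, ki=0.005)     # wider loop bandwidth
slow = simulate_cdr_loop(freq_offset=0.001, kp=0.02, ki=0.0002)   # narrower loop bandwidth
print(lock_time(fast), lock_time(slow))
```

Both loops eventually drive the phase error toward zero despite the constant frequency offset, but the loop with the smaller gains needs noticeably more steps to get there.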
Digital CDR circuits (Figure 1-14) don't have their own oscillator and therefore require a reference clock. The high-speed sample clock is generated from the lower-speed reference clock with a multiplying PLL, using the exact same circuit used in the transmitter. The phase interpolator adjusts the relative phase between the serial data and the clock such that the deserializer can properly sample the data. Because there is no VCO needed, digital CDR circuits are relatively cheap and therefore preferred in many applications. Once the distance between transmitter and receiver is too long, however, the effort for the distribution of the reference clock exceeds the cost of the VCO.
Figure 1-14 Block diagram of a digital CDR circuit
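The phase adjustment performed by the phase interpolator can be pictured as a digital code that is stepped until the sample clock lines up with the data. The sketch below assumes a simple early/late decision that moves the code by one step per update, which is one common way to drive such a loop but is not stated in the text; the number of interpolator steps and the function names are hypothetical.

```python
def wrap_ui(phase):
    """Wrap a phase given in unit intervals into the range [-0.5, 0.5)."""
    return (phase + 0.5) % 1.0 - 0.5


def track_phase(data_phase, steps=64, iterations=200, code=0):
    """Step a phase interpolator code toward the data phase; returns the final code."""
    for _ in range(iterations):
        error = wrap_ui(data_phase - code / steps)   # early/late decision input
        if error > 0:
            code = (code + 1) % steps                # clock is early: advance the phase
        else:
            code = (code - 1) % steps                # clock is late: retard the phase
    return code


final_code = track_phase(data_phase=0.37)
residual = wrap_ui(0.37 - final_code / 64)
print(final_code, residual)
```

Once the code has reached the target, it keeps toggling between the two settings on either side of the data phase, so the residual error stays within one interpolator step.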
1.2.5.4 Special Topics in Embedded Clock Systems
Both analog and digital clock data recoveries require a minimum number of transitions in the incoming data stream, or they will lose frequency and phase lock. Since random binary data can contain long streams of consecutive one or zero bits, the data has to be altered to guarantee the minimum transition density. But too many transitions can be problematic, too, because the loop filter characteristic and thus the CDR bandwidth can change. For this reason, data for embedded clock transmission usually is encoded. We will discuss the most important coding schemes in Section 1.3.
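Two simple figures of merit for a bit stream are its longest run of identical bits and its transition density. The helper functions below are an illustrative sketch for measuring both on a list of bits; the names are hypothetical, and no particular limit is implied, since acceptable values depend on the CDR circuit.

```python
from itertools import groupby


def max_run_length(bits):
    """Length of the longest run of consecutive identical bits."""
    return max((len(list(group)) for _, group in groupby(bits)), default=0)


def transition_density(bits):
    """Fraction of bit boundaries at which the value changes."""
    if len(bits) < 2:
        return 0.0
    transitions = sum(a != b for a, b in zip(bits, bits[1:]))
    return transitions / (len(bits) - 1)


sample = [0, 0, 0, 1, 1, 0]
print(max_run_length(sample), transition_density(sample))
```

A stream of alternating bits has a transition density of 1.0, while a constant stream has a density of 0.0 and gives the CDR circuit nothing to lock onto.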
Since both the transmitter and receiver in an embedded clock system create their own high-speed clocks, the two ends of the transmission system will often run with a slight frequency offset. If the transmitter is faster than the receiver, data can get lost; if the transmitter is slower, the receiver runs out of data to pass to the parallel side. There are several methods to introduce elasticity to compensate for these frequency offsets. One option is to use first-in-first-out type buffers on both ends; however, that's possible only if both TX and RX operate on the exact same average frequency, for example, because they derive their high-speed clocks from the same reference. An alternative is to add comma characters to the data; these are special short bit sequences that do not carry any payload, and can therefore be discarded or inserted as required by the frequency offset.
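A quick calculation shows how much elasticity a given frequency offset demands. The sketch below expresses the offset in parts per million; the 200 ppm figure (two clocks that each deviate by 100 ppm in opposite directions) and the function names are hypothetical and not taken from any standard.

```python
import math


def bits_per_slip(offset_ppm):
    """Number of bits after which the two clocks have drifted apart by one full bit."""
    return 1e6 / offset_ppm


def fill_symbols_needed(payload_symbols, offset_ppm):
    """Spare symbols required to absorb the drift over a block of payload symbols."""
    return math.ceil(payload_symbols * offset_ppm / 1e6)


worst_case_ppm = 200.0
print(bits_per_slip(worst_case_ppm))
print(fill_symbols_needed(10_000, worst_case_ppm))
```

Under these assumptions the receiver gains or loses one bit every 5000 bits, so the stream has to offer an opportunity to insert or discard a symbol at least that often.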
1.2.6 Spread Spectrum Clocking
Most digital transmission systems operate at frequencies that are regulated, for example, because they are used for TV and radio broadcasting, mobile phone systems, or other radio frequency applications. In order to limit interference, agencies such as the Federal Communications Commission in the United States have put strict limits on the energy that a device may emit.
Shielding a digital communications system is often not practical, either because it is too difficult mechanically or because of cost considerations. Many systems therefore use a spread spectrum PLL, which adds a small amount of low-frequency modulation to the central clock source. The modulation reduces the peak emissions by spreading the emitted energy over a wider frequency band; however, it does not reduce the total emissions of the system.
If the modulation parameters are chosen carefully such that all parts of the system can track the spread spectrum clock, the system performance is not affected; typical values are a frequency deviation of 0.5% and a modulation frequency of 30 kHz. Many systems (e.g., PCI Express) use an asymmetric approach: The frequency is modulated only downward, in order to keep the maximum frequency below the design limit.
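The effect of a downward-only modulation on the clock frequency can be sketched as follows. The triangular modulation profile and the 100 MHz nominal frequency are assumptions made for illustration; the 0.5% deviation and 30 kHz modulation frequency are the typical values mentioned above.

```python
def downspread_frequency(t, f_nominal=100e6, spread=0.005, f_mod=30e3):
    """Instantaneous clock frequency in Hz at time t (s), triangular down-spread."""
    position = (t * f_mod) % 1.0   # position within one modulation period, 0..1
    # Triangle wave that rises from 0 to 1 and falls back to 0 within one period.
    triangle = 2.0 * position if position < 0.5 else 2.0 * (1.0 - position)
    return f_nominal * (1.0 - spread * triangle)


period = 1.0 / 30e3
samples = [downspread_frequency(i * period / 100.0) for i in range(100)]
print(min(samples) / 1e6, max(samples) / 1e6)
```

The clock frequency sweeps between the nominal value and a value 0.5% below it and never exceeds the nominal frequency, which is the point of the asymmetric approach.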