QoS Capable Devices
This section describes the internals of QoS capable devices. One of the difficulties of describing QoS implementations is the number of different perspectives from which the features can be described. The scope here is limited to the priority based model and the functional components that implement it. The priority based model is in fact the most common implementation approach because of its scalability advantage.
There are two completely different approaches to implementing a QoS capable IP Switch or Server. These approaches are:
The Reservation Model, also known as Integrated Services/Resource Reservation Protocol (RSVP) or ATM, is the original approach, requiring applications to signal their traffic handling requirements. After signalling, each switch in the path from source to destination reserves resources, such as bandwidth and buffer space, that either guarantee the desired QoS or assure that the desired service is provided. This model is not widely deployed because of scalability limitations: each switch has to keep track of this information for every flow, and as the number of flows increases, so do the memory and processing requirements.
The Precedence Priority Model, also known as Differentiated Services, IP Precedence/Type of Service (TOS), or IEEE 802.1p/Q, takes aggregated traffic, segregates the flows into classes, and provides preferential treatment by class. It is only during episodes of congestion that noticeable differentiated services effects are realized. Packets are marked or tagged according to priority; switches then read these markings and treat the packets according to their priority. The interpretation of the markings must be consistent within the autonomous domain. The Differentiated Services model defines, from highest precedence to lowest, an Expedited Forwarding (EF) class, four Assured Forwarding classes (AF1-AF4), and a Best Effort (BE) class. Within each AF class there are three drop precedences, which indicate to the switch which packets within that class should be discarded first under congestion. These markings are carried in the six-bit Differentiated Services Code Point (DSCP) field, which allows 2^6 = 64 distinct code points.
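As a concrete illustration of the code point layout, the sketch below computes standard DSCP values and shows how a DSCP sits in the upper six bits of the TOS byte. The function names are illustrative; the AFxy formula and the EF value follow the DiffServ RFCs (2474, 2597, 3246).

```python
# Standard DSCP values (RFC 2474/2597/3246); the AFxy code point is 8*x + 2*y.
def af_dscp(cls, drop_prec):
    """DSCP for Assured Forwarding class `cls` (1-4), drop precedence (1-3)."""
    assert 1 <= cls <= 4 and 1 <= drop_prec <= 3
    return 8 * cls + 2 * drop_prec

EF = 46   # Expedited Forwarding
BE = 0    # Best Effort (default per-hop behavior)

def dscp_to_tos_byte(dscp):
    """The DSCP occupies the upper 6 bits of the IP TOS/Traffic Class byte."""
    return dscp << 2

print(af_dscp(1, 1))          # AF11 -> 10
print(dscp_to_tos_byte(EF))   # 184 (0xB8)
```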
This article focuses on the Precedence Priority model due to the increased scalability and current market acceptance of this approach. The next section details how QoS is actually implemented.
Functional Components: High-Level Overview
In the section "Implementation Functions," the three high-level QoS components (packet classification, packet scheduling, and traffic shaping and limiting) are described. This section describes these QoS components in further detail.
A QoS capable device consists of the following functions:
Admission Control accepts or rejects access to a shared resource. This is a key component of Integrated Services and ATM networks. Admission control ensures that resources are not oversubscribed; because it must track state for every flow, it is also more expensive and less scalable.
Congestion Management prioritizes and queues traffic access to a shared resource during congestion periods.
Congestion Avoidance heads off congestion early, using preventive measures. Algorithms such as Weighted Random Early Detect (WRED) exploit TCP's congestion avoidance behavior to reduce the traffic injected into the network, preventing congestion.
Traffic Shaping reduces the burstiness of egress network traffic by smoothing the traffic before forwarding it out the egress link.
Traffic Rate Limiting controls the ingress traffic by dropping packets that exceed burst thresholds, thereby reducing device resource consumption such as buffer memory.
Packet Scheduling schedules packets out the egress port so that differentiated services are effectively achieved.
In the next section, the modules that implement these high level functions are described in more detail.
The QoS Profile contains information, entered by the network/systems administrator, defining classes of traffic flows and how these flows should be treated in terms of QoS. For example, a QoS profile might specify that web traffic from the CEO should be given the AF1 DiffServ marking, a Committed Information Rate (CIR) of 1 Mbit/sec, a Peak Information Rate (PIR) of 5 Mbit/sec, an Excess Burst Size (EBS) of 100 Kbytes, and a Committed Burst Size (CBS) of 50 Kbytes. This profile defines the flow and the QoS that the CEO's web traffic should receive, and it is compared against the actual measured traffic flow. Depending on how the actual flow compares against this profile, the TOS field of the IP header is re-marked or an internal tag is attached to the packet header, which controls how the packet is handled inside the device.
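Such a profile amounts to a simple record of a marking plus rate and burst parameters. The sketch below is purely illustrative (the class and field names are assumptions, not any product's API), mirroring the CEO web traffic example:

```python
from dataclasses import dataclass

@dataclass
class QosProfile:
    """Hypothetical QoS profile record; fields follow the parameters in the text."""
    name: str
    dscp_marking: str   # e.g. "AF1"
    cir_bps: int        # Committed Information Rate, bits/sec
    pir_bps: int        # Peak Information Rate, bits/sec
    cbs_bytes: int      # Committed Burst Size
    ebs_bytes: int      # Excess Burst Size

# The example profile from the text: CEO web traffic.
ceo_web = QosProfile(name="ceo-web", dscp_marking="AF1",
                     cir_bps=1_000_000, pir_bps=5_000_000,
                     cbs_bytes=50_000, ebs_bytes=100_000)
print(ceo_web)
```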
The profile defines the grade of QoS that a flow should receive, such as Platinum, Gold, Silver, or Bronze. The actual amount of traffic that a flow injects could exceed what was allocated, in which case the traffic may be capped once it exceeds certain thresholds. However, if the switch is not busy and no one else is using the resources, the switch may allow the flow to exceed its thresholds, since it makes no sense to waste unused resources. If the switch is busy and becomes congested, it enforces the flow's thresholds and limits the amount of traffic according to the profile. It is like an airline: if the first class seats would otherwise go unused, it makes sense to give them away to coach customers; but if first class is fully booked, the airline is very strict about seating assignments.
Functional Components: Detailed Modules
FIGURE 3 shows the main functional components involved in delivering prioritized differentiated services; these apply to a switch or a server. They include the packet classification engine, the metering function, the marker function, policing/shaping, the IP forwarding module, queuing, congestion control and management, and the packet scheduling function.
FIGURE 3 QoS Functional Components
Deployment of Data and Control Planes
Typically, if the example in FIGURE 3 were deployed on a network switch, there would be an ingress board and an egress board, connected together via a backplane. If it were deployed on a server, these functions would be implemented in the network protocol stack, either in the IP module, adjacent to the IP module, or possibly on the network interface card, where an Application Specific Integrated Circuit (ASIC) or Field Programmable Gate Array (FPGA) implementation offers superior performance.
There are two planes:
Data Plane operates the functional components that actually read/write the IP header.
Control Plane operates the functional components that configure how the data plane units behave, using information supplied by the network administrator, directly or indirectly.
The Packet Classifier is the functional component responsible for identifying a flow and matching it with a filter. A filter is composed of the source and destination IP addresses, the source and destination ports, the protocol, and the TOS field, taken from the IP and transport headers. Each filter is also associated with information that describes the treatment of matching packets. Aggregate ingress traffic flows are compared against these filters. Once a packet header is matched with a filter, the associated QoS profile is used by the meter, marker, and policing/shaping functions. Packet classification performance is critical, and much research has been published on it. One algorithm to note is the Recursive Flow Classification (RFC) algorithm. The basic idea behind RFC is that the fields of the packet header are projected onto a finite natural number space and divided into equivalence sets. The rules are parsed ahead of time and indices are created. When a packet header is looked up, a hierarchy of indexes is consulted using logarithmic (base 2) searches. The algorithm achieves a good balance between reasonable memory requirements and lookup speed.
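The matching step itself can be sketched as a naive linear scan over filters with wildcard fields; production devices use algorithms such as RFC precisely because a linear scan does not scale. All names below are illustrative:

```python
# Fields a filter may match on; a value of None acts as a wildcard.
FIELDS = ("src_ip", "dst_ip", "src_port", "dst_port", "proto", "tos")

def matches(filt, pkt):
    """True if every non-wildcard field of the filter equals the packet's field."""
    return all(filt.get(f) is None or filt[f] == pkt[f] for f in FIELDS)

def classify(filters, pkt):
    """Return the QoS treatment of the first matching filter, else best effort."""
    for filt, treatment in filters:
        if matches(filt, pkt):
            return treatment
    return "best-effort"

# Example: web traffic from one host gets the AF1 treatment.
filters = [({"src_ip": "10.0.0.5", "dst_ip": None, "src_port": None,
             "dst_port": 80, "proto": "tcp", "tos": None}, "AF1")]
pkt = {"src_ip": "10.0.0.5", "dst_ip": "192.0.2.1",
       "src_port": 43512, "dst_port": 80, "proto": "tcp", "tos": 0}
print(classify(filters, pkt))  # AF1
```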
The metering function compares the actual traffic flow against the QoS profile definition. FIGURE 4 illustrates the different measurement points. The input traffic arrives, on average, at 100 Kbytes/sec. However, for a short period, the switch or server allows the input rate to reach 200 Kbytes/sec for 1 second, which works out to 200 Kbytes of buffered data. From t=3 to t=5, the buffer drains at 50 Kbytes/sec while input packets arrive at 50 Kbytes/sec, keeping the output constant. Another, more aggressive burst arrives at 400 Kbytes/sec for 0.5 seconds, filling up the 200 Kbyte buffer. From t=5.0 to t=5.5, however, 50 Kbytes are drained, leaving 150 Kbytes at t=5.5 seconds. This buffer then drains for 1.5 seconds at a rate of 100 Kbytes/sec. The example is simplified; the real figures would need to be adjusted for the fact that the buffer is not completely full at t=5.5 seconds because of the concurrent draining. Notice that the area under the graph (the integral) approximately represents the number of bytes in the buffer, and that bursts appear as steeply sloped segments above the green dotted line, which represents the average rate, or CIR.
FIGURE 4 Traffic Burst Graphic
Marking is tied to metering: when the metering function compares the actual measured traffic against the agreed QoS profile, the traffic is handled accordingly. The meter measures the actual burst rate and the number of packets in the buffer against the CIR, PIR, CBS, and EBS. The Two Rate Three Color Marker (trTCM) is a common algorithm that marks packets green if the actual traffic is within the agreed CIR, yellow if it is above CIR but below PIR, and red if the metered traffic is at or above PIR. The device then uses these markings in the policing/shaping functions to determine how the packets are treated, for example, whether they should be dropped, shaped, or queued in a lower priority queue.
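A simplified, color-blind version of this marker can be sketched with two token buckets, one refilled at CIR and one at PIR. The class and parameter names are illustrative; see RFC 2698 for the full algorithm, which also defines a color-aware mode.

```python
class TrTCM:
    """Simplified color-blind two-rate three-color marker (cf. RFC 2698).

    tc/tp are token counts in bytes for the committed and peak buckets,
    refilled at CIR and PIR (bytes/sec) up to CBS and PBS respectively."""
    def __init__(self, cir, pir, cbs, pbs):
        self.cir, self.pir, self.cbs, self.pbs = cir, pir, cbs, pbs
        self.tc, self.tp = cbs, pbs   # buckets start full
        self.last = 0.0

    def mark(self, size, now):
        # Refill both buckets for the elapsed time, capped at the burst sizes.
        dt = now - self.last
        self.last = now
        self.tc = min(self.cbs, self.tc + self.cir * dt)
        self.tp = min(self.pbs, self.tp + self.pir * dt)
        if self.tp < size:
            return "red"       # exceeds PIR
        self.tp -= size
        if self.tc < size:
            return "yellow"    # exceeds CIR but not PIR
        self.tc -= size
        return "green"         # conforms to CIR

meter = TrTCM(cir=100_000, pir=200_000, cbs=10_000, pbs=20_000)
print(meter.mark(5_000, now=0.0))  # green: within the committed burst
```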
The policing functional component uses the metering information to determine whether the ingress traffic should be buffered or dropped. Shaping buffers packets and transmits them at a constant rate, smoothing the egress stream. The Token Bucket algorithm is commonly used here, both to shape egress traffic and to police ingress traffic.
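The shaping side can be illustrated with a minimal leaky-bucket sketch that queues a burst and drains it at a constant rate; the names and structure are illustrative, not a real device interface:

```python
import collections

class LeakyBucketShaper:
    """Sketch of shaping: queue a burst, drain it at a constant rate (bytes/sec)."""
    def __init__(self, rate):
        self.rate = rate
        self.queue = collections.deque()

    def enqueue(self, size):
        self.queue.append(size)

    def departure_times(self, start=0.0):
        """Time at which each queued packet finishes sending at the shaped rate."""
        t, out = start, []
        while self.queue:
            t += self.queue.popleft() / self.rate
            out.append(t)
        return out

shaper = LeakyBucketShaper(rate=100_000)
for size in (50_000, 50_000, 50_000):   # a 150 Kbyte burst arriving at once
    shaper.enqueue(size)
print(shaper.departure_times())  # [0.5, 1.0, 1.5] -- smoothed over 1.5 sec
```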
IP Forwarding Module
The IP forwarding module inspects the destination IP address and determines the next hop using the Forwarding Information Base. The forwarding information base is a set of tables populated by routing protocols and/or static routes. The packet is then forwarded internally to the egress board, which places the packet in the appropriate queue.
Queuing encompasses two dimensions or functions. The first is congestion control, which limits the number of packets queued in a particular queue (see the next section). The second is differentiated services: the queues are serviced by the packet scheduler in a manner that provides preferential treatment to pre-selected flows, by servicing packets in certain queues more often than others.
There is a finite amount of buffer space or memory, so the number of packets that can be buffered within a queue must be controlled. The switch or server forwards packets at line rate; however, when a burst occurs, or when the switch is oversubscribed and congestion occurs, packets are buffered. There are several packet discard algorithms. The simplest is Tail Drop: once the queue fills up, any new packets are dropped. This works well enough for UDP traffic, but it has severe disadvantages for TCP. Tail drop causes established TCP flows to enter congestion avoidance together and sharply reduce the rate at which they send packets. The resulting problem is called global synchronization: all TCP flows simultaneously decrease and then increase their rates during the same periods. What is needed is for some of the flows to slow down so that the other flows can take advantage of the freed-up buffer space. Random Early Detection (RED) is an active queue management algorithm that drops randomly selected packets before the buffers fill up, reducing global synchronization.
FIGURE 5 describes the RED algorithm. Looking at line C on the far right, when the average queue occupancy is anywhere from empty up to 75% full, no packets are dropped. As the queue grows past 75%, the probability that a randomly selected packet is discarded increases quickly, until the queue is full and discard becomes certain. WRED takes RED one step further by giving different packets different thresholds at which random discard starts. As illustrated in FIGURE 5, line A begins having random packets dropped at only 25% average queue occupancy, making room for the higher priority flows B and C.
FIGURE 5 Congestion Control: RED, WRED Packet Discard Algorithms
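The RED drop curve can be sketched as a simple function of average queue occupancy. The thresholds below mirror lines A and C in FIGURE 5; the `max_p` ceiling on the ramp is an assumed parameter (classic RED ramps to a maximum probability at the upper threshold, then drops everything beyond it):

```python
def red_drop_probability(avg_q, min_th, max_th, max_p=0.1):
    """RED sketch: no drops below min_th, a linear ramp up to max_p at max_th,
    and certain discard at or beyond max_th. Occupancies are fractions (0-1)."""
    if avg_q < min_th:
        return 0.0
    if avg_q >= max_th:
        return 1.0
    return max_p * (avg_q - min_th) / (max_th - min_th)

# WRED assigns per-priority thresholds, as with lines A and C in FIGURE 5.
line_a = dict(min_th=0.25, max_th=1.0)   # low priority: drops start at 25%
line_c = dict(min_th=0.75, max_th=1.0)   # high priority: drops start at 75%

print(red_drop_probability(0.5, **line_a))  # low priority already being dropped
print(red_drop_probability(0.5, **line_c))  # high priority untouched: 0.0
```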
The packet scheduler is one of the most important QoS functional components. It pulls packets from the queues and sends them out the egress port, or forwards them to the adjacent STREAMS module, depending on the implementation. Several packet scheduling algorithms service the queues in different ways. Weighted Round Robin (WRR) scans each queue and, depending on the weight assigned to that queue, allows a certain number of packets to be pulled from it and sent out. The weights represent a percentage of the bandwidth. In actual practice, unpredictable delays are still experienced, since a large packet at the front of a queue can hold up the smaller packets behind it. Weighted Fair Queuing (WFQ) is a more sophisticated scheduling algorithm that computes the time each packet arrived and the time needed to send out the entire packet. WFQ can therefore handle varying packet sizes and select packets for scheduling more fairly. WFQ is work-conserving, meaning that no packets wait idle while the scheduler is free. WFQ can also put a bound on delay, as long as the input flows are policed and the queue lengths are bounded. In Class Based Queuing (CBQ), used in many commercial products, each queue is associated with a class, where higher classes are assigned a higher weight, translating to relatively more service time from the scheduler than the lower priority queues receive.
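The WRR service discipline described above can be sketched in a few lines; the queue names and weights are illustrative. Each pass pulls up to `weight` packets from each queue, so over time each queue receives a share of the bandwidth proportional to its weight.

```python
import collections

def weighted_round_robin(queues, weights, rounds=1):
    """WRR sketch: per round, pull up to `weight` packets from each queue."""
    sent = []
    for _ in range(rounds):
        for name, weight in weights.items():
            for _ in range(weight):
                if queues[name]:               # skip empty queues
                    sent.append(queues[name].popleft())
    return sent

queues = {
    "gold":   collections.deque(["g1", "g2", "g3", "g4"]),
    "silver": collections.deque(["s1", "s2"]),
    "bronze": collections.deque(["b1", "b2"]),
}
weights = {"gold": 3, "silver": 2, "bronze": 1}   # roughly a 3:2:1 share
print(weighted_round_robin(queues, weights, rounds=2))
```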