3.3 Video Packet Categorization Based on the Relative Priority Index (RPI)
3.3.1 Desired Characteristics for Prioritization
In our approach, the importance to an end-user of obtaining a low packet loss rate or reduced delay is determined based on the following criteria. First, the decision focuses on relative and fine-grained priorities inside an application, which rely on a linkage to an absolute metric related to the application’s contents. When initially marked in the DS field with RLI/RDI, video application packets at the user end have limited knowledge of the dynamic status of the network and of competing applications within the same boundary. The assigned relative priority for each packet is supposed to be interpreted at the DiffServ-aware node. Next, because more assured (but not guaranteed) service is to be provided, the video application needs to be able to cope with possible packet losses and delays. The assigned RLI/RDI parameters basically assume that the video stream has already been generated with error-resilience features so that the loss or delay of each packet can be tolerated to a certain degree. Finally, the resulting prioritization should exhibit some kind of clustering behavior in the RLI/RDI space so that it can keep its distinction when mapped to the limited DS byte.
As discussed previously, the degree of a video stream’s importance for receiving low-delay network service depends heavily on the application’s context. If we consider different degrees of importance for receiving low delay for different packets within a stream, we find that varying demands for delay quality are usually connected with the layered coding of video compression. For example, the I, P, and B frames of ISO/IEC MPEG-1/2/4 frames have varying demands with regard to delay as well as loss. The situation is similar for the spatial-scalable, SNR-scalable, and data-partitioned layers of MPEG or H.261/H.263, with the exception of the temporal-scalable layer. However, as a video application becomes network-aware, the trend seems to move gradually toward delay variation, even within a stream. A good example is the asynchronous media transmission scenario, where flexible delay margins can be exploited throughout transmission. This idea has been denoted as delay-cognizant video coding in , which applied an extended form of region-based scalability to H.263 video. Multiple-region layers within a stream were constructed in , where varying delay requirements were associated by perceiving a motion event such as nodding, gesturing, or maintaining lip synchronization. To be more specific, cognizant video coding assigns the lowest delay to the most visually significant region, and vice versa. Since packets assigned a longer delay may arrive late, this presents the opportunity for the network service to route packets through less congested but longer routes and charge lower prices. Another example is packets of the MPEG-4 system that encompass multiple elementary streams with different demands under one umbrella. This kind of integrated stream can justify the demand for inter-media RDI/RLI differentiation. Thus, we propose to distinguish a packet based on the loss rate and delay tuple, leaving flexibility to subsequent stages. Note, however, that only fixed RDI for a flow is employed currently.
3.3.2 Macroblock-Level Corruption Model
Investigation of Macroblock Error Propagation
Most state-of-the-art video compression techniques, including H.263+ and MPEG-1, 2, and 4 are based on motion-compensated prediction (MCP). The video codec employs inter-frame prediction to remove temporal redundancy and transform coding to reduce spatial redundancy. MCP is performed at the macroblock (MB) level of 16 × 16 luminance pixels. For each MB, the encoder searches the previously reconstructed frame for the MB that best matches the target MB, being encoded. To increase estimation accuracy, sub-pixel accuracy is used for the motion vector representation, and interpolation is used to build the reference block. Residual errors are encoded by the discrete cosine transform (DCT) and quantization. Finally, all information is coded with either a fixed-length code or variable-length code (VLC).
Basically, two different types of MBs can be selected adaptively for each MB. One is the inter-mode coding that includes motion compensation and residual error encoding. The other is intra-mode coding, in which only DCT and quantization are used for the original pixels. In the initial anchor frame, or when a new object appears in a frame to be encoded, the sum of absolute differences (SAD) of the most closely matching MB can be larger than the sum of the original pixels. For these cases, the intra-mode MB is used. Otherwise, the inter-mode MB, which usually costs less due to temporal redundancy elimination, is used.
In the Internet environment, packets may be discarded due to buffer overflow at intermediate nodes of the network, or they may be considered lost due to long queuing delays. VLC is very vulnerable to even one bit of error. If error occurs in the bit-stream once, the VLC decoder cannot decode the next codes until it finds the next synchronization point. Thus, a small error can result in catastrophic distortion in the reconstructed video sequence.
Packet loss results in the loss of encoded MBs as well as the loss of synchronization. It affects MBs in subsequent packets until the decoder is re-synchronized (or refreshed). When packet loss occurs, an error recovery action (i.e., error concealment), which attempts to identify the best alternative for the lost portion, is performed at the decoder.
Usually, no normative error concealment method is defined for most MCP video compression standards, but various error concealment schemes have been proposed , . The temporal concealment scheme exploits the temporal correlation in video signals by replacing a damaged MB with the spatially corresponding MB in the previous frame. This straightforward scheme, however, can produce adverse visual artifacts in the presence of large motion, and the motion-compensated version of temporal concealment is usually employed with estimated motion vectors from surrounding MBs. Even after sophisticated error concealment, there remains residual error, which is propagated due to the recursive prediction structure. This temporal error propagation is typical for hybrid video coding that relies on MCP in the inter-frame mode. The number of times a lost MB is referenced in the future depends on the coding mode and motion vectors of MBs in subsequent frames. This tells us the importance of each MB from the viewpoint of error propagation.
While propagating temporally and spatially, the residual error after error concealment decays over time due to leakage in the prediction loop. Leaky prediction is a well-known technique to increase the robustness of differential pulse code modulation (DPCM) by attenuating the energy of the prediction signal. For hybrid video coding, leakage is introduced by spatial filtering operations that are performed during encoding. Spatial filtering can either be introduced explicitly by an explicit loop filter or implicitly as a side-effect of half-pixel motion compensation (i.e., with bilinear interpolation). This spatial filtering effect in the decoder was analyzed by Farber et al. . In their work, the loop filter is approximated as a Gaussian-shaped filter in the spatial frequency domain. It is given by
where is determined by the filter shape and indicates the strength of the loop filter. As shown in Eq. (3.1), the loop filter behaves like a low-pass filter, and its bandwidth is determined by time t and filter strength .
By further assuming that the power spectral density (PSD) of error signal u(x, y) is a zero-mean, stationary, random process, we obtain
(i.e., a separable Gaussian PSD with variance that can be interpreted as the average energy). The shape of the PSD of the error signal is characterized as . Thus, the pair of parameters determines the energy and shape of the error signal’s PSD and can be used to match Eq. (3.2) with the true PSD. With these approximations, the variance of the error propagation random process v (x, y) can be derived as
This analytical model, as given in Eqs. (3.1, 3.2, 3.3) has been verified in experimental results , . While the statistical propagation behavior is analyzed with this model, it is difficult to estimate the loss effect for each packet composed of several MBs [i.e., the group of block (GOB) unit, in general]. Thus, we extend the propagation behavior by incorporating additional coding parameters such as error concealment schemes and the encoding mode so that it is possible to track the error propagation effect better with moderate computational complexity in the following section. The purpose of the corruption model is to estimate the total impact of packet loss. When one or more packets are lost, errors are introduced and propagated. The impact of errors is defined as the difference between reconstructed frames with and without packet loss as measured by mean square errors (MSEs).
Derivation of MB-Level Corruption Model
Figure 3.3 shows the error propagation of a lost MB, in which the initial error of the corrupted MB is denoted by u(x, y) and its energy is measured in terms of the error variance . The propagation error v(x, y) in consecutive frames has energy in impaired MB j of frame n + m.
Figure 3.3. Error propagation example of a lost MB.
The initial error due to packet loss is dependent on the error concealment scheme adopted by the decoder. The amount of initial error can be calculated at the encoder if the error concealment scheme used in the decoder is known a priori. For simplicity, we assume that the decoder uses the TCON error concealment scheme as specified in H.263+ Test Model 10 . Also, at this stage, we will confine the corruption model analysis to the case of isolated packet loss. Under a low loss rate, this corruption model exhibits reliable accuracy. The multiple-packet loss case will be discussed in a later section, which will address the scenario involving a higher loss rate. Also, only an MB unit (i.e., 16 × 16) estimation is analyzed, excluding the advanced prediction option, at present. This simplifies an MB as the unit of the coding mode with a unique motion vector (for inter-frame modes). From an MB with 256 pixels, we can extract the PSD of the signal by analyzing its frequency component. In addition, an MB-based calculation costs much less than a pixel-based computation. The MB-level corruption model will also later be extended to the GOB level, with some restrictions.
Let us analyze the energy transition through an error propagation trajectory. Typical propagation trajectories are illustrated in Figure 3.3. A trajectory is composed of two basic trajectory elements: a parallel trajectory and a cascaded trajectory. Before the analysis, some assumptions are required to reduce the computational complexity. Let the decoder with the DPCM loop and spatial filter be a linear system. Also, for each frame to be predicted, the time difference is set to 1, that is, t = 1 in Eq. (3.3).
Figure 3.4(a) shows the parallel propagation trajectory. The error in reference frame n is characterized by , which determines the PSD of the error. In the parallel trajectory, an error can propagate to more than two different areas in subsequent frames. For each path, a different motion vector and a different spatial filtering can be applied. Thus, a different filter strength can be applied, as depicted in the equivalent linear system of Figure 3.4(b).
Figure 3.4. Cascade propagation: (a) the trajectory and (b) the equivalent linear system.
Ha(ω) and Hb(ω) may have a different filter function, but both functions are the Gaussian approximation of the spatial filter. The PSD of the error frame is transferred to the predicted frame via
and the corresponding error energy is derived as
In the cascaded propagation trajectory of Figure 3.4(a), the initial error energy of U of frame n is referenced in A of frame n + 1 and then it is transferred to B in frame n + 2 again. The equivalent linear system is shown in Figure 3.4(b). For each transition, the loop filter function is characterized by and respectively. Then, the PSD of the propagation error in frame n + 2 is given by
and its energy can be derived as
The propagation error energy from U to B is given in Eq. (3.7). It is the same as Eq. (3.3) except for the loop filter efficiency γ. For the cascaded propagation, the loop filter efficiency γ can be derived from the equivalent loop filter strength , which is the sum of filter strengths σfa and σfb. As a result, the equivalent filter strength for the cascaded propagation is the sum of filter strengths along the propagation path. Because of the motion vector (even in sub-pixel accuracy), a portion of one MB is usually referenced by the predicted frames. In this case, all errors of an impaired MB do not propagate to subsequent frames. Thus, we must consider the portion of an MB that contributes to the next predicted frames as a reference, a quantity that is denoted as the dependency weight. If we denote the ith MB of frame n as MBn,i and a portion of MBn,i contributes to MBn+m,j, then the dependency weight wn,i(m, j) is defined as the normalized number of pixels that are transferred from MBn,i to MBn+m,j. If no portion of an MB is referenced by the jth MB of the (n + m)th frame, wn,i(m, j) is zero. Otherwise, it has a value between zero and one.
The dependency weight can be calculated recursively with stored motion vectors and MB types as shown in Figure 3.5. However, since motion compensation is not a linear operation, we have to assume that the motion vectors of neighboring MBs are the same (or at least very similar), so that the target MB is transferred without severe break, as depicted in Figure 3.5. Another assumption is that the error in an MB is uniformly distributed in the spatial domain while also having a Gaussian PSD. Then, the transferred error energy from MBn,i to MBn+m,j can be calculated with the loop-filtering effect and dependency weight. Finally, to evaluate the total impact of the loss of MBn,i, the weighted error variances for MBs of subsequent frames should be summed. Because the initial error can sustain over a number of frames without converging to zero, we must limit frames to be evaluated under an acceptable computational complexity. Fortunately, when the intra-MB refresh technique is used at the encoder, the propagated error energy converges to zero within a fixed number of frames. Thus, in general, a pre-defined number of frames is sufficient to estimate the total impact of the MB loss. This defines an estimation window for the corruption model.
Figure 3.5. Recursive weight calculation
The appropriate estimation window might be determined based on the strength of the intra-MB refresh. As a result, the total energy of errors due to an MB loss in a sequence can be written as
where M is the size of the estimation window and N is the total number of MBs in a frame, respectively. Also, we have
The corruption model derived above can be viewed as an MB-level extension of the statistical error propagation model given in .
3.3.3 Simplified RLI Association and Categorization
The MB-level corruption model described in the previous section indicates the level of importance, for each MB, of not losing the MB. Thus, RLI can be defined on the basis of such a corruption model. However, determining the importance of not losing a particular MB requires an estimation window as data history, which induces some delay before marking RLI per packet. In this section, a simple and practical version is presented for online RPI generation. This RLI is calculated from video factors such as initial error, motion vector, and MB encoding type. Such a simplified MB-based corruption model provides RPI association to approximate the actual loss impact in MSE.
Under packetized video transmission, the RLI assignment for a packet is best when it can precisely represent its error propagation impact on the receiving video quality. However, the specific method for RLI prioritization is totally application- (and furthermore, video compression scheme-) dependent. Thus, we have chosen ITU-T H.263+ video  as the evaluation codec, considering its wide acceptance for low-latency video conferencing and its potential for video streaming. Note that with regard to this RLI association, there is not much difference among the motion-compensated prediction codecs, which include both MPEGs and H.261/H.263.
Given frame size and target bit rate, each packet has a different loss effect on end-to-end video quality due to inter-frame prediction. The impact of packet loss may spread within a frame up to the next re-synchronization (e.g., the picture or GOB headers) due to differential coding, run-length coding, and variable-length coding. This is referred to as spatial error propagation, and it can damage any type of frame. For temporal error propagation, damaged MBs affect the non-intra-coded MBs in subsequent frames, which use corrupted MBs as references. Recent research on the corruption model  has attempted to model this loss effect. The corruption model is a tool used to estimate the packet loss impact on the overall received video quality. However, most modelling efforts have focused on the statistical side of the loss effect, while a dynamic solution was sought in our approach. Instead of computationally complex options, we devised ways to associate RLI to H.263 packets with a simple and online calculation.
Basically, we use the error-resilient H.263+ stream, compressed at a target rate of 384 kilobits per second (kbps), for a common intermediate format (CIF) test sequence with 10 frames per second (fps). Several error resilience and compression efficiency options (“Annexes D, F, I, J, and T” with random intra-refresh) are used in the generation of the so-called “Anchor” (i.e., GOB) mode stream. The random intra-refresh rate is set to 5% to cover the network packet loss. It is then packetized by one or more packets per GOB, which is dependent on the maximum transfer unit (MTU) size of the IP network. Thus, we propose a simple yet effective RLI association scheme for this H.263+ video stream, which is calculated for each GOB packet.
Various video factors are taken into consideration for the proposed RLI association. First, the magnitude and direction of the motion vector for each MB are included to reflect the loss strength and temporal loss propagation due to motion. Then, the encoding types (intra, intra-refreshed, inter, etc.) are considered. In other words, the refreshing effect is added by counting the number of intra-coded MBs in a packet. Finally, the initial error due to packet loss is considered, assuming the normative error concealment adopted at the decoder. To better explain this, sample distributions of these three video factors are depicted in Figure 3.6. The RLI for a packet may take into account all of these video factors by summarizing the normalized video factors with an appropriate weighting as
where NVF is the total number of video factors considered, NVFn stands for the normalized video factors, and Wn stands for the corresponding weight factors. The normalization is done with an online version such as through updating the sampling mean at an ith update time.
Figure 3.6. Video factors (VFs) for RLI with “Foreman” sequence: (a) motion vector magnitude; (b) number of I-coded MBs in each packet; and (c) initial error due to packet loss.
Figure 3.7 shows an example of the resulting RLI assignment for a “Foreman” sequence. This model provides a much more accurate estimated MSE, but it needs forehead information in a particular window range to allow calculation of the propagation effect. For our purpose of QoS mapping, we do not need an accurate MSE amount because the relative priority order and relative quantity are enough. Eq. (3.10) is simple and can be generated easily as an online version with the sampling mean. A more accurately estimated RLI based on a corruption model is proposed in , but it needs the history information of several subsequent frames to make such an estimate, which is somewhat of a trade off. As shown in Figure 3.8(d), using 20 frame history windows to calculate RLI has more correlation with actual measured MSE than the process shown in Figure 3.8(c).
Figure 3.7. RLI for “Foreman” sequence obtained by applying Wn’s (0.15, 0.15, and 1.75) for the three video factors of Figure .
Figure 3.8. A comparison of: (a) the actually measured MSE coming from loss of each GOB packet; (b) the proposed RLI pattern for a “Foreman” sequence; (c) the correlation distribution between actual MSE and proposed RLI of the same GOB packet number; and (d) the correlation distribution between actual MSE and corruption model-based RLI in .
To show that proposed RLI approximates actual loss propagation, the actually measured MSE from each GOB packet is calculated and compared with the proposed RLI pattern in Figure 3.8. The simple RLI shows a pattern similar to the actual MSE, and there is general correlation between two. This is enough for our goal of differentiating video packets in a stream for prioritization marking according to their importance since the QoS mapping is needed from only several source categories and the network DS levels are few in number.
Finally, RLIs are categorized into K DS traffic categories to enable mapping to limited DS field space (or eventually to be ready for mapping to more limited network DS levels). In our approach, simple non-uniform quantization of RLI is performed for this categorization. That is, as shown in Figure 3.9, categorization is done with gradually increasing steps as category k increases. Another possible approach is to categorize packets into an equal number (i.e., uniform distribution). After categorization, all packets belonging to category k may be represented by their average RLI value, .
Figure 3.9. Packet distribution and average RLI of each video DS category for a “Foreman” sequence.
Higher RLI thus represents more potential damage to visual quality. To verify the clustering behavior of RLI association, we observed RLI distribution for several video sequences. As shown in Figure 3.10, the RLI distribution varied according to a scene’s characteristics. Since it affects the QoS mapping, we will explore the impact of RLI distribution patterns in Section 3.4.2. Finally, RLIs are categorized to enable mapping to limited DS field space (or eventually to be ready for mapping to more limited network DS levels), which is represented by source category k among K categories. In our approach, simple, uniform quantization of RLI is performed for this categorization. After categorization, all packets belonging to category k may be represented by their average RLI value, .
Figure 3.10. Different RLI (cumulative) distributions for several video sequences.