Ethernet Performance Troubleshooting

Ethernet performance troubleshooting is device specific because not all devices have the same architectural capabilities. Troubleshooting performance issues is therefore discussed on a per-device basis.

The following Solaris™ tools aid in the analysis of performance issues:

  • kstat to view device-specific statistics

  • mpstat to view system utilization information

  • lockstat to show areas of contention

You can use the information from these tools to tune specific parameters. The tuning examples that follow describe where this information is most useful.
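Because most of the diagnosis in this article depends on whether a counter is incrementing, it helps to compare two snapshots of the statistics. The following is a minimal sketch, assuming the snapshots were captured in the parseable form printed by kstat -p (one module:instance:name:statistic key and a value per line). The helper names and the sample numbers are made up for illustration.

```python
def parse_kstat_p(text):
    """Parse `kstat -p` style lines into {statistic_name: int_value}."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        key, value = parts
        if not value.isdigit():
            continue  # keep only plain integer statistics
        # key is assumed to look like module:instance:name:statistic
        stats[key.rsplit(":", 1)[-1]] = int(value)
    return stats


def incrementing(before, after):
    """Return the counters that grew between two snapshots, with their deltas."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }


# Made-up sample snapshots, taken some seconds apart.
SNAPSHOT_1 = """\
ge:0:ge0:rx_overflow\t10
ge:0:ge0:no_free_rx_desc\t0
ge:0:ge0:nocanput\t3
"""
SNAPSHOT_2 = """\
ge:0:ge0:rx_overflow\t250
ge:0:ge0:no_free_rx_desc\t0
ge:0:ge0:nocanput\t3
"""

print(incrementing(parse_kstat_p(SNAPSHOT_1), parse_kstat_p(SNAPSHOT_2)))
```

In this sample only rx_overflow grows, which is the pattern the ge discussion later associates with an I/O bus limitation.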

You have two options for tuning: using the /etc/system file or the ndd utility.

Using the /etc/system file to modify the initial value of the driver variables requires a system reboot for the changes to take effect.

If you use the ndd utility for tuning, the changes take effect immediately. However, any modifications you make using the ndd utility will be lost when the system goes down. If you want the ndd tuning properties to persist through a reboot, add these properties to the respective driver.conf file.
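The two tuning routes can be sketched as small helpers that produce the text you would place in /etc/system or type at a shell. This is only an illustration: the helper names are hypothetical, the /etc/system form assumes the usual set module:variable=value syntax, and the ndd form assumes the usual ndd -set device parameter value syntax. The device path and any instance-selection step should be checked against the driver documentation.

```python
def etc_system_line(driver, variable, value):
    """Format one /etc/system assignment for a driver variable (takes effect after reboot)."""
    if not isinstance(value, int) or value < 0:
        raise ValueError("value must be a non-negative integer")
    return f"set {driver}:{variable}={value}"


def ndd_command(device, parameter, value):
    """Format an ndd command line (takes effect immediately, lost at reboot)."""
    return f"ndd -set {device} {parameter} {value}"


print(etc_system_line("ge", "ge_intr_mode", 1))
print(ndd_command("/dev/ce", "infinite-burst", 1))
```

The parameter names used here are taken from the tables later in this article.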

Parameters that have kernel statistics but have no capability to tune for improvement are omitted from this discussion because no troubleshooting capability is provided in those cases.

ge Gigabit Ethernet

The ge interface provides the following tuning parameters that assist in performance troubleshooting.

TABLE 3 ge Performance Tunable Parameters

Parameter

Values

Description

ge_intr_mode

0-1

Enables the ge driver to send packets directly to the upper communication layers rather than queueing them.

0 = Packets are not passed in the interrupt service routine but are placed in a streams service queue and passed to the protocol stack later, when the streams service routine runs.

1 = Packets are passed directly to the protocol stack in the interrupt context.

Default: 0 (queue packets to upper layers)

ge_dmaburst_mode

0-1

Enables infinite burst mode for PCI DMA transactions rather than using cache-line size PCI DMA transfers. This feature is supported only on Sun platforms with the UltraSPARC® III CPU.

0 = Disabled (default)

1 = Enabled

ge_nos_tmd

32-8192

Number of transmit descriptors used by the driver.

Default = 512

ge_put_cfg

0-1

An enumerated type that can have a value of 0 or 1.

0 = receive processing occurs in the worker threads.

1 = receive processing occurs in the streams service queue routine.

Default = 1


The ge interface provides some statistics you can use to measure the performance bottlenecks in the driver at the transmit or receive end of the link. The kstats allow you to decide what corrective tuning can be applied, based on the tuning parameters previously described. The useful statistics are shown in TABLE 4.

TABLE 4 List of ge Specific Interface Statistics

kstat name

Type

Description

rx_overflow

counter

Number of times the hardware is unable to receive a packet due to the internal FIFOs being full.

no_free_rx_desc

counter

Number of times the hardware is unable to post a packet because there are no more Rx descriptors available.

no_tmds

counter

Number of times transmit packets are posted on the driver streams queue for processing later by the queue's service routine.

nocanput

counter

Number of times a packet is simply dropped by the driver because the module above the driver cannot accept the packet.

pci_bus_speed

value

The PCI bus speed that drives the card.


When rx_overflow is incrementing, packet processing is not keeping up with the packet arrival rate. If rx_overflow is incrementing and no_free_rx_desc is not, the PCI bus or SBus is impeding the flow of packets through the device. This could be because the ge card is plugged into a slower I/O bus. You can confirm the bus speed by looking at the pci_bus_speed statistic. An SBus speed of 40 MHz or a PCI bus speed of 33 MHz might not be sufficient to sustain full bidirectional one-gigabit Ethernet traffic.

Another scenario that can lead to rx_overflow incrementing on its own is sharing the I/O bus with another device that has similar bandwidth requirements to those of the ge card.

These scenarios are hardware limitations. There is no solution for SBus. For PCI bus, a first step in addressing them is to enable infinite burst capability on the PCI bus. You can achieve that by using the /etc/system tuning parameter ge_dmaburst_mode.

Alternatively, you can reorganize the system to give the ge interface a 66-MHz PCI slot, or separate devices that contend for a shared bus segment by giving each of them a bus segment.

It is unlikely that rx_overflow incrementing is the only problem. Typically, Sun systems have a fast PCI bus and memory subsystem, so delays are seldom induced at that level. It is more likely that the protocol stack software falls behind, exhausting the Rx descriptor ring of free elements with which to receive more packets. If this happens, the kstat no_free_rx_desc begins to increment, which on a single-CPU system means the CPU cannot absorb the incoming packets. If more than one CPU is available, it is still possible to overwhelm a single CPU. However, because Rx processing can be split using the alternative Rx data delivery models provided by ge, it might be possible to distribute the processing of incoming packets across more than one CPU. To do this, first ensure that ge_intr_mode is not set to 1, and then tune ge_put_cfg to select the load-balancing worker threads or the streams service routine.

Another possible scenario is where the ge device is adequately handling the rate of incoming packets, but the upper layer is unable to deal with the packets at that rate. In this case, the nocanput kstat will be incrementing. The tuning for this condition is applied in the upper-layer protocols. If you're running the Solaris 8 operating system or earlier, upgrading to the Solaris 9 version might also reduce nocanput errors, owing to its improved multithreading and IP scalability.

Although the Tx side is also subject to an overwhelmed condition, this is less likely than any Rx-side condition. If the Tx side is overwhelmed, the no_tmds counter begins to increment. In that case, you can increase the Tx descriptor ring size using the /etc/system tunable parameter ge_nos_tmd.
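The ge reasoning above can be condensed into a small decision helper. The sketch below is not part of the driver or of any Solaris tool. It is a hypothetical function that takes the growth of each ge counter over some interval and returns the hints this section gives for that pattern.

```python
def diagnose_ge(delta):
    """Map ge kstat deltas (counter growth over an interval) to tuning hints."""
    hints = []
    overflow = delta.get("rx_overflow", 0) > 0
    no_desc = delta.get("no_free_rx_desc", 0) > 0

    if overflow and not no_desc:
        # FIFO overflow without descriptor exhaustion points at the I/O bus.
        hints.append("I/O bus bottleneck: check pci_bus_speed, bus sharing, "
                     "and ge_dmaburst_mode")
    if no_desc:
        # The protocol stack is not returning Rx descriptors fast enough.
        hints.append("protocol stack falling behind: keep ge_intr_mode at 0 "
                     "and tune ge_put_cfg")
    if delta.get("nocanput", 0) > 0:
        hints.append("upper layer cannot accept packets: tune the upper-layer "
                     "protocols")
    if delta.get("no_tmds", 0) > 0:
        hints.append("Tx descriptors exhausted: consider raising ge_nos_tmd")
    return hints


for hint in diagnose_ge({"rx_overflow": 240, "no_free_rx_desc": 0}):
    print(hint)
```

A delta dictionary such as the one passed here would typically come from two kstat samples taken a few seconds apart.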

ce Gigabit Ethernet

The ce interface provides the following tunable parameters that assist in performance troubleshooting. Note that these are ndd parameters.

TABLE 5 ce Performance Parameters Tunable Using ndd

Parameter

Values

Description

tx-dma-weight

0-3

Determines the multiplication factor for granting credit to the Tx side during a weighted round robin arbitration.

Values are 0 to 3.

Zero means no extra weighting. Each of the other values applies a power-of-2 extra weighting to that traffic.

For example, if tx-dma-weight = 0 and rx-dma-weight = 3, then as long as Rx traffic is continuously arriving, its priority to access the PCI bus will be eight times greater than that of Tx.

(Default = 0)

rx-dma-weight

0-3

Determines the multiplication factor for granting credit to the Rx side during a weighted round-robin arbitration.

Values are 0 to 3.

(Default = 0)

infinite-burst

0-1

Allows the infinite burst capability to be utilized. When this is in effect and the system supports infinite burst, the adapter will not free the bus until complete packets are transferred across the bus.

Values are 0 or 1.

(Default = 0)

red-dv4to6k

0 to 255

Random early detection and packet drop vectors for when FIFO threshold is greater than 4096 bytes and less than 6144 bytes. Probability of drop can be programmed on a 12.5 percent granularity. For example, if bit 0 is set, the first packet out of every eight will be dropped in this region.

(Default = 0)

red-dv6to8k

0 to 255

Random early detection and packet drop vectors for when FIFO threshold is greater than 6144 bytes and less than 8192 bytes. Probability of drop can be programmed on a 12.5 percent granularity. For example, if bit 0 is set, the first packet out of every eight will be dropped in this region. (Default = 0)

red-dv8to10k

0 to 255

Random early detection and packet drop vectors for when FIFO threshold is greater than 8192 bytes and less than 10,240 bytes. Probability of drop can be programmed on a 12.5 percent granularity. For example, if bits 1 and 6 are set, the second and seventh packets out of every eight will be dropped in this region. (Default = 0)

red-dv10to12k

0 to 255

Random early detection and packet drop vectors for when FIFO threshold is greater than 10,240 bytes and less than 12,288 bytes. Probability of drop can be programmed on a 12.5 percent granularity. If bits 2, 4, and 6 are set, then the third, fifth, and seventh packets out of every eight will be dropped in this region. (Default = 0)
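The DMA weights in TABLE 5 are described as powers of 2, so their effect can be worked out with simple arithmetic. The sketch below assumes that each weight w contributes a multiplier of 2 to the power w and that the relative priority of Rx over Tx is the ratio of the two multipliers. The table only confirms the case tx-dma-weight = 0 and rx-dma-weight = 3, so the behavior for other combinations is an extrapolation, and the function names are hypothetical.

```python
def arbitration_multiplier(weight):
    """Credit multiplier for a DMA weight setting in the range 0 to 3."""
    if weight not in (0, 1, 2, 3):
        raise ValueError("weight must be 0, 1, 2, or 3")
    return 2 ** weight


def rx_to_tx_priority(rx_weight, tx_weight):
    """Assumed priority of Rx relative to Tx under continuous traffic."""
    return arbitration_multiplier(rx_weight) / arbitration_multiplier(tx_weight)


# The case given in TABLE 5: rx-dma-weight = 3, tx-dma-weight = 0.
print(rx_to_tx_priority(3, 0))
```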


TABLE 6 lists the /etc/system tunable parameters that assist in performance troubleshooting.

TABLE 6 ce Performance Parameters Tunable Using /etc/system

Parameter

Values

Description

ce_ring_size

32-8192

The size of the Rx buffer ring, a ring of buffer descriptors for Rx.

One buffer = 8K. This value must be a power of 2. Maximum value is 8192 buffers of 8K each.

Default = 256.

ce_comp_ring_size

0-8192

The size of each Rx completion descriptor ring. This value must also be a power of 2.

Default = 2048

ce_inst_taskqs

0-64

Controls the number of taskqs set up per ce device instance. This value is only meaningful if ce_taskq_disable is false.

Any value less than 64 is meaningful.

Default = 4.

ce_srv_fifo_depth

30-100000

Gives the size of the service FIFO, in number of elements. This variable can be any integer value.

Default = 2048

ce_cpu_threshold

1-1000

Gives the threshold for the number of CPUs required in the system and online before the taskqs are utilized to Rx packets.

Default = 4

ce_taskq_disable

0-1

Disables the use of Task queues and forces all packets to go up to Layer 3 in the interrupt context.

Default depends on whether the number of CPUs in the system exceeds the ce_cpu_threshold

ce_start_cfg

0-1

An enumerated type that can have a value of 0 or 1.

0 = ce transmit algorithm does not do serialization

1 = ce transmit algorithm does serialization.

Default = 0

ce_tx_ring_size

0-8192

The size of each Tx descriptor ring. This value must also be a power of 2.

Default = 2048

ce_no_tx_lb

0-1

Disables the Tx load balancing and forces all transmission to be posted to a single descriptor ring.

0 = Tx Load balancing is enabled.

1 = Tx Load Balancing is disabled.

Default = 1

ce_bcopy_thresh

0-8192

The mblk size threshold used to decide when to copy a mblk into a pre-mapped buffer, as opposed to using DMA or other methods.

Default = 256

ce_dvma_thresh

0-8192

The mblk size threshold used to decide when to use the fast path DVMA interface to transmit mblk.

Default = 1024

ce_dma_stream_thresh

0-8192

This global variable splits the ddi_dma mapping method further by providing Consistent mapping and Streaming mapping. In the Tx direction, for larger transmissions, Streaming is better than Consistent mappings. If the mblk size is greater than 256 bytes but less than 1024 bytes, then mblk fragments will be transmitted using ddi_dma methods.

Default = 512


The ce interface provides a far more extensive list of kstats that can be used to measure the performance bottlenecks in the driver in the Tx or the Rx. The kstats allow you to decide what corrective tuning can be applied, based on the tuning parameters described previously. The useful statistics are shown in TABLE 7.

TABLE 7 List of ce Specific Interface Statistics

kstat name

Type

Description

rx_ov_flow

counter

Number of times the hardware is unable to receive a packet due to the internal FIFOs being full.

rx_no_buf

counter

Number of times the hardware is unable to receive a packet due to Rx buffers being unavailable.

rx_no_comp_wb

counter

Number of times the hardware is unable to receive a packet due to no space in the completion ring to post Received packet descriptor.

ipackets_cpuXX

counter

Number of packets being directed to load-balancing thread XX.

mdt_pkts

counter

Number of packets sent using multidata interface.

rx_hdr_pkts

counter

Number of packets arriving which are less than 252 bytes in length.

rx_mtu_pkts

counter

Number of packets arriving which are greater than 252 bytes in length.

rx_jumbo_pkts

counter

Number of packets arriving which are greater than 1522 bytes in length.

rx_nocanput

counter

Number of times a packet is simply dropped by the driver because the module above the driver cannot accept the packet.

rx_pkts_dropped

counter

Number of packets dropped due to Service FIFO queue being full.

tx_hdr_pkts

counter

Number of packets hitting the small packet transmission method, copy packet into a pre-mapped DMA buffer.

tx_ddi_pkts

counter

Number of packets hitting the mid range DDI DMA transmission method.

tx_dvma_pkts

counter

Number of packets hitting the top range DVMA fast path DMA transmission method.

tx_jumbo_pkts

counter

Number of packets being sent which are greater than 1522 bytes in length.

tx_max_pend

counter

Measure of the maximum number of packets which was ever queued on a Tx ring.

tx_no_desc

counter

Number of times a packet transmit was attempted and Tx descriptor elements were not available. The packet is postponed until later.

tx_queueX

counter

Number of packets transmitted on a particular queue.

mac_mtu

value

The maximum packet allowed past the MAC.

pci_bus_speed

value

The PCI bus speed that is driving the card.


When rx_ov_flow is incrementing, packet processing is not keeping up with the packet arrival rate. If rx_ov_flow is incrementing while rx_no_buf and rx_no_comp_wb are not, the PCI bus is impeding the flow of packets through the device. This could be because the ce card is plugged into a slower PCI bus. You can confirm the bus speed by looking at the pci_bus_speed statistic. A bus speed of 33 MHz might not be sufficient to sustain full bidirectional one-gigabit Ethernet traffic.

Another scenario that can lead to rx_ov_flow incrementing on its own is sharing the PCI bus with another device that has bandwidth requirements similar to those of the ce card.

These scenarios are hardware limitations. A first step in addressing them is to enable the infinite burst capability on the PCI bus. Use the ndd tuning parameter infinite-burst to achieve that.

Infinite burst will help give ce more bandwidth, but the Tx and Rx of the ce device will still be competing for that PCI bandwidth. Therefore, if the traffic profile shows a bias toward Rx traffic and this condition is leading to rx_ov_flow, you can adjust the bias of PCI transactions in favor of the Rx DMA channel relative to the Tx DMA channel, using the ndd parameters rx-dma-weight and tx-dma-weight.

Alternatively, you can reorganize the system by giving the ce interface a 66-MHz PCI slot, or separate devices that contend for a shared bus segment by giving each of them a bus segment.

If this does not reduce the problem much, consider using Random Early Detection (RED) to minimize the impact of dropped packets, keeping alive connections that would otherwise be terminated by regular overflow. The following ndd parameters enable RED: red-dv4to6k, red-dv6to8k, red-dv8to10k, and red-dv10to12k.
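Each RED parameter is an 8-bit drop vector in which every set bit drops one packet out of each group of eight, giving the 12.5 percent granularity described in TABLE 5. The following sketch decodes a vector into the packet positions that are dropped and the resulting drop probability. The function names are hypothetical, and the sketch models only the bit arithmetic described in the table, not the adapter itself.

```python
def vector_from_bits(*bits):
    """Build a drop vector (0-255) from the bit numbers that should be set."""
    vector = 0
    for bit in bits:
        if not 0 <= bit <= 7:
            raise ValueError("bit numbers must be 0 through 7")
        vector |= 1 << bit
    return vector


def dropped_positions(vector):
    """Return which packets (1-based) out of every eight are dropped."""
    if not 0 <= vector <= 255:
        raise ValueError("drop vector must be in the range 0 to 255")
    return [bit + 1 for bit in range(8) if vector & (1 << bit)]


def drop_probability(vector):
    """Fraction of packets dropped while the FIFO is in this region."""
    return len(dropped_positions(vector)) / 8


# Bits 1 and 6 set, as in the red-dv8to10k example in TABLE 5.
v = vector_from_bits(1, 6)
print(v, dropped_positions(v), drop_probability(v))
```

With bits 1 and 6 set, the vector value is 66, the second and seventh packets of every eight are dropped, and the drop probability is 25 percent.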

It is unlikely that rx_ov_flow incrementing is the only problem. Typically, Sun systems have a fast PCI bus and memory subsystem, so delays are seldom induced at that level. It is more likely that the protocol stack software falls behind, exhausting the Rx buffers or the completion descriptor ring of free elements with which to receive more packets. If this happens, the kstats rx_no_buf and rx_no_comp_wb will begin to increment. This can mean that there is not enough CPU power to absorb the packets, but it can also be due to a poor balance between the buffer ring size and the completion ring size, which leads to rx_no_comp_wb incrementing without rx_no_buf incrementing. The default configuration is one buffer to four completion elements. This works well provided that the arriving packets are larger than 256 bytes. If they are not, and that traffic dominates, then 32 packets will be packed into a buffer, which makes a configuration imbalance more likely. In that case, more completion elements need to be made available. You can address this with the /etc/system tunables ce_ring_size, which adjusts the number of available Rx buffers, and ce_comp_ring_size, which adjusts the number of Rx packet completion elements. To understand the Rx traffic profile so that you can tune these parameters, use kstat to look at the distribution of Rx packets across rx_hdr_pkts and rx_mtu_pkts.
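The balance between the buffer ring and the completion ring can be estimated with simple arithmetic. The sketch below is a rough worst-case model, not a description of the hardware. It assumes every 8K Rx buffer is filled with small packets at the 32-per-buffer density mentioned above (8192 divided by 256), and that each packet consumes one completion element. The function names are hypothetical.

```python
RX_BUFFER_BYTES = 8192    # "One buffer = 8K" in TABLE 6
SMALL_PACKET_SLOT = 256   # assumed packing granularity for small packets


def packets_per_buffer(slot_bytes=SMALL_PACKET_SLOT):
    """How many small packets fit into one Rx buffer under this model."""
    return RX_BUFFER_BYTES // slot_bytes


def completion_entries_needed(ce_ring_size, per_buffer):
    """Worst-case completion elements if every buffer is fully packed."""
    return ce_ring_size * per_buffer


def next_power_of_two(n):
    """Round up to a power of 2, since the ring sizes must be powers of 2."""
    power = 1
    while power < n:
        power *= 2
    return power


per_buffer = packets_per_buffer()
needed = completion_entries_needed(256, per_buffer)  # ce_ring_size default
print(per_buffer, needed, next_power_of_two(needed))
```

Under these assumptions, a workload dominated by small packets with the default ce_ring_size of 256 could need up to 8192 completion elements, which is well above the default ce_comp_ring_size of 2048. In practice the rx_hdr_pkts and rx_mtu_pkts mix determines how close a system comes to this worst case.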

If ce is being run on a single CPU system and rx_no_buf and rx_no_comp_wb are incrementing, you will have to resort again to RED, or enable Ethernet flow control.

If more than one CPU is available, it is still possible to overwhelm a single CPU. Given that the Rx processing can be split using the alternative Rx data delivery models provided by ce, it might be possible to distribute the processing of incoming packets to more than one CPU, described earlier as Rx load balancing. This will happen by default if the system has four or more CPUs, and it will enable four load-balancing worker threads. The threshold of CPUs in the system and the number of load-balancing worker threads enabled can be managed using the /etc/system tunables ce_cpu_threshold and ce_inst_taskqs.

The number of load-balancing worker threads, and how evenly the Rx load is distributed among them, can be viewed with the ipackets_cpuXX kstats. The highest value of XX tells you how many load-balancing worker threads are running, while the values of these counters show the spread of the work across the instantiated worker threads. This, in turn, indicates whether load balancing is yielding a benefit. For example, if all ipackets_cpuXX counters show approximately the same number of packets, then the load balancing is optimal. On the other hand, if only one is incrementing and the others are not, then the benefit of Rx load balancing is nullified.

It is also possible to measure whether the system is experiencing an even spread of CPU activity using mpstat. In the ideal case, if you experience good load balancing as shown in the ipackets_cpuXX kstats, it should also be visible in mpstat that the workload is evenly distributed across multiple CPUs.
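One way to judge the spread is to convert the per-worker packet counters into shares of the total and compare each share with an even split. The following is a minimal sketch with hypothetical function names. The 50 percent tolerance is an arbitrary assumption chosen for illustration, not a documented threshold, and the sample counts are made up.

```python
def worker_shares(ipackets):
    """Turn {"ipackets_cpu00": count, ...} into each worker's share of the total."""
    total = sum(ipackets.values())
    if total == 0:
        return {name: 0.0 for name in ipackets}
    return {name: count / total for name, count in ipackets.items()}


def is_balanced(ipackets, tolerance=0.5):
    """True when every share is within `tolerance` (relative) of an even split."""
    if not ipackets:
        return False
    even = 1 / len(ipackets)
    shares = worker_shares(ipackets)
    return all(abs(share - even) <= tolerance * even for share in shares.values())


# Made-up counter values for four load-balancing worker threads.
even_load = {"ipackets_cpu00": 1000, "ipackets_cpu01": 980,
             "ipackets_cpu02": 1010, "ipackets_cpu03": 1005}
skewed_load = {"ipackets_cpu00": 4000, "ipackets_cpu01": 0,
               "ipackets_cpu02": 0, "ipackets_cpu03": 0}

print(is_balanced(even_load), is_balanced(skewed_load))
```

The skewed case corresponds to the situation where only one counter is incrementing and Rx load balancing yields no benefit.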

If none of this benefit is visible, then disable the load balancing capability completely, using the /etc/system variable ce_taskq_disable.

The Rx load balancing provides packet queues, also known as service FIFOs, between the interrupt threads, which fan out the workload, and the service FIFO worker threads, which drain the service FIFOs and complete the workload. These service FIFOs are of fixed size, controlled by the /etc/system variable ce_srv_fifo_depth. The service FIFOs can also overflow and drop packets when the rate of packet arrival exceeds the rate at which the draining thread can complete the post-processing. These dropped packets are counted by the rx_pkts_dropped kstat. If this counter is incrementing, you can increase the size of the service FIFOs, or you can increase the number of service FIFOs to allow more Rx load balancing. In some cases, it may be possible to eliminate increments in rx_pkts_dropped only to have the problem move to rx_nocanput, which is generally addressable only by tuning the upper-layer protocols. If you're running the Solaris 8 operating system or earlier, upgrading to the Solaris 9 version might also reduce nocanput errors, owing to its improved multithreading and IP scalability.

There is a difficulty in maximizing the Rx load balancing, and it is contingent on the Tx ring processing. This is measurable using the lockstat command, which will show the ce_start routine at the top as the most contended driver function. This contention cannot be eliminated, but it is possible to employ a new Tx method known as Transmit serialization, which keeps contention to a minimum while forcing the Tx processing onto a fixed set of CPUs. Keeping the Tx processing on a fixed CPU reduces the risk of CPUs spinning while they wait for other CPUs to complete their Tx activity, so CPUs are kept busy doing useful work. This transmission method can be enabled by setting the /etc/system variable ce_start_cfg to 1. When you enable Transmit serialization, you trade some Transmit latency for avoiding the mutex spins induced by contention.

The Tx side is also subject to an overwhelmed condition, which occurs when the CPU speed exceeds the Ethernet line rate, although this is less likely than any Rx-side condition. When the Tx side becomes overwhelmed, the tx_max_pend value matches the value of the /etc/system variable ce_tx_ring_size. If this occurs, you know that packets are being postponed because Tx descriptors are being exhausted. Therefore, ce_tx_ring_size should be increased.
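The Tx ring check can be expressed as a small helper. The sketch below is hypothetical: it doubles the ring size when the maximum pending count has reached the configured size, which keeps the value a power of 2, and caps it at the 8192 maximum listed in TABLE 6. Doubling is an assumption chosen for simplicity, not a documented recommendation.

```python
def suggest_tx_ring_size(tx_max_pend, ce_tx_ring_size, maximum=8192):
    """Suggest a new ce_tx_ring_size when the Tx ring has been filled."""
    if tx_max_pend < ce_tx_ring_size:
        return ce_tx_ring_size  # ring never filled: no change suggested
    # Doubling preserves the power-of-2 requirement on the ring size.
    return min(ce_tx_ring_size * 2, maximum)


print(suggest_tx_ring_size(2048, 2048))
```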

The tx_hdr_pkts, tx_ddi_pkts, and tx_dvma_pkts statistics are useful for establishing the traffic profile of an application and matching that profile with the capabilities of a system. The parameters ce_bcopy_thresh, ce_dvma_thresh, and ce_dma_stream_thresh are used for adjusting the transmission method applied to an outgoing packet. These parameters are described in TABLE 6 in terms of mblks, which is the mechanism used to transmit packets in the Solaris operating system. The following rules show how these parameters relate to each other:

mblk size < ce_bcopy_thresh: driver will copy into pre-mapped buffer

mblk size > ce_dvma_thresh: driver uses fast path DVMA interface

ce_dma_stream_thresh < mblk size < ce_dvma_thresh:
           driver uses streaming DMA method

Otherwise: driver uses consistent DMA method.
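The rules above can be written as a small classifier. This is only a sketch: the order in which the checks are applied when the thresholds overlap is an assumption, the returned labels are made up for readability, and the default arguments are the defaults listed in TABLE 6.

```python
def tx_method(mblk_size, bcopy_thresh=256, dvma_thresh=1024, dma_stream_thresh=512):
    """Classify which Tx path an mblk of the given size would take."""
    if mblk_size < bcopy_thresh:
        return "bcopy"        # copy into a pre-mapped buffer
    if mblk_size > dvma_thresh:
        return "dvma"         # fast path DVMA interface
    if dma_stream_thresh < mblk_size < dvma_thresh:
        return "streaming"    # streaming DMA mapping
    return "consistent"       # consistent DMA mapping


for size in (100, 300, 600, 1500):
    print(size, tx_method(size))
```

With the defaults, the four sample sizes fall into the bcopy, consistent, streaming, and dvma paths respectively. If ce_bcopy_thresh is set just above ce_dvma_thresh, for example 97 and 96, every size lands in either the bcopy or the dvma path under this model, so the two ddi_dma methods are never selected.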

How to set these parameters is again system dependent and application dependent. The system dependency is associated with memory latency. As a rule of thumb, the more CPUs a system has, the larger its memory latency tends to be.

On systems with larger memory latency, it's best to avoid moving data from one memory location to another, so using the premapped buffer for DMA will be more expensive than setting up and tearing down a DMA mapping on a per-packet basis.

Furthermore, if tx_hdr_pkts appears to be incrementing at a higher rate than tx_dvma_pkts, you have an application with a traffic profile that uses a lot of small packets. Therefore, you should adjust ce_dvma_thresh and ce_bcopy_thresh so that most of the packets hit the tx_dvma_pkts path in the driver and avoid copies. The following may be reasonable parameters for such a system:

ce_bcopy_thresh = 97
ce_dvma_thresh = 96
ce_dma_stream_thresh = <don't care>

Alternatively, in low memory latency systems, the inverse is true and you would need to adjust ce_dvma_thresh and ce_bcopy_thresh so that most packets take the bcopy route.

ce_bcopy_thresh = 256
ce_dvma_thresh = 255
ce_dma_stream_thresh = <don't care>

The Streaming DMA and Consistent DMA methods are provided as the fallback path and tend to provide little improvement over the Fast DVMA method or the copy into a premapped buffer. They can be tuned out most of the time, as shown in the previous examples.

You can adjust the DMA thresholds of ce_bcopy_thresh, ce_dvma_thresh, and ce_dma_stream_thresh, using the /etc/system file to push more packets into the preprogrammed DMA versus the per-packet programming. Once the tuning is complete, the statistics can be viewed again to see if the tuning took effect.

The tx_queueX statistic gives a good indication of whether Tx load balancing is happening. As on the Rx side, if no load balancing is visible, meaning all the packets appear to be counted by only one tx_queue, then you should switch this feature off using the ce_no_tx_lb variable.

The mac_mtu gives an indication of the maximum size of packet that will make it through the ce device. It is useful for determining whether jumbo frames are enabled at the DLPI layer below TCP/IP. If jumbo frames are enabled, then the MTU indicated by mac_mtu will be 9216.

This is helpful because it shows whether there is a mismatch between the DLPI layer MTU and the IP layer MTU, allowing troubleshooting to proceed in a layered manner.

Once jumbo frames are successfully configured at the driver layer and the TCP/IP layer, use the rx_jumbo_pkts and tx_jumbo_pkts statistics to verify that jumbo frames are being received and transmitted correctly.
