Bulk Transfer Traffic Performance
This section investigates the performance of bulk transfer traffic using the Sun GigaSwift Ethernet MMF adapters hardware (driver name ce). The goal is to achieve maximum throughput and reduce the overhead associated with data transfer.
Before the data is discussed, let's look at the hardware and software environment for this study. The system under test (SUT) is a domain of a Sun Fire 6800 midframe server. It is configured as follows:
CPU Power: Eight 900 MHz UltraSPARC III+ processors, eight gigabytes of memory
Operating System: Solaris 8 OE Release 02/02
Network Interface: Sun GigaSwift Ethernet MMF adapters hardware using the 66 MHz PCI interface
Five client machines equipped as follows drive the workload:
Client 1: A second domain of the Sun Fire 6800 server with eight 900 MHz UltraSPARC III+ processors using Sun GigaSwift Ethernet MMF adapters hardware
Client 2: A Sun Enterprise 6500 server with twelve 400 MHz UltraSPARC II processors using Sun Gigabit Ethernet adapters P2.0 hardware
Client 3: A Sun Enterprise 4500 server with eight 400 MHz UltraSPARC II processors using Sun Gigabit Ethernet adapters P2.0 hardware
Client 4: A Sun Enterprise 4500 server with eight 400 MHz UltraSPARC II processors using Sun Gigabit Ethernet adapters P2.0 hardware
Client 5: A Sun Enterprise 450 server with four 400 MHz UltraSPARC II processors using Sun Gigabit Ethernet adapters P2.0 hardware
All of the gigabit Ethernet cards on the client machines use the 66 MHz PCI bus interface. Solaris 8 OE Release 02/02 is running on all of the client machines. The clients are connected to the server through a gigabit switch. MC-Netperf v0.6.1, developed internally, is used to generate the workload. MC-Netperf extends the publicly available Netperf to handle synchronous multiconnection measurements using multicast-based synchronization. Two types of experiments were conducted:
Single-connection test. The second domain of Sun Fire 6800 (Client 1) is used as the client.
10-connection test. Each of the five clients drives two connections.
Runs of 10 minutes are carried out to obtain the numbers discussed in the following section.
Getting an Appropriate Window Size
As described in "Bulk Transfer Traffic Performance," in a switch-based LAN the size of the TCP receive window determines the amount of unacknowledged data the pipe can hold at any given time. The bandwidth-delay product determines the pipe size. Since the bandwidth is known to be one Gbps, and the minimal latency is calculated to be 70 to 150 microseconds in "Gigabit Ethernet Latency on a Sun Fire 6800 Server," a pipe must hold at least 1,000,000,000 * 0.000150 / 8 = 18,750 bytes in transit to achieve one gigabit per second (or 964 Mbps of payload, excluding Ethernet, IP, and TCP headers, for packets with a 1,460-byte payload). However, whether this 964 Mbps figure can be achieved depends on many other factors, and the receive window size is one of the first that needs to be addressed.
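The arithmetic above can be checked with a short sketch. The function names are illustrative and not part of any tool used in this study. The 54-byte header figure is an assumption (a 14-byte Ethernet header plus 20-byte IP and 20-byte TCP headers without options); it is chosen because it reproduces the 964 Mbps figure quoted above.

```python
def pipe_size_bytes(bandwidth_bps, latency_us):
    """Bandwidth-delay product in bytes (integer math avoids rounding noise)."""
    return bandwidth_bps * latency_us // (8 * 1_000_000)


def payload_rate_mbps(link_mbps, payload_bytes, header_bytes):
    """Share of the link rate left for payload after per-packet headers."""
    return link_mbps * payload_bytes / (payload_bytes + header_bytes)


# Upper and lower ends of the 70 to 150 microsecond latency range.
print(pipe_size_bytes(1_000_000_000, 150))        # 18750
print(pipe_size_bytes(1_000_000_000, 70))         # 8750

# Assumed headers: 14 (Ethernet) + 20 (IP) + 20 (TCP) = 54 bytes.
print(round(payload_rate_mbps(1000, 1460, 54)))   # 964
```

Under these assumptions, even the worst-case latency calls for a window of less than 20 kilobytes.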
TCP Receive Window Tuning Parameters
When a TCP connection is established, both parties advertise their maximum receive window. The current receive window, which is the actual receive window in the middle of a transmission, is adjusted dynamically based on the receiver's capability. The current send window, although initially small to avoid congestion, ramps up to match the current receive window after the slow-start process, provided the system is tuned properly. A few parameters should be considered in the tuning process. These parameters are considered for the transmit side:
tcp_xmit_hiwat: This parameter is the high watermark for transmission flow control, tunable by using the ndd command. When the amount of unsent data reaches this level, no more data from the application layer is accepted until the amount drops below tcp_xmit_lowat (also tunable by using the ndd command). The default value for this parameter is 24,576 in Solaris 8 OE.
Sending socket buffer size: The sending socket buffer is where the application puts data for the kernel to transmit. The size of the socket buffer determines the maximum amount of data the kernel and the application can exchange in each attempt.
For the reception side, these parameters are considered:
tcp_recv_hiwat: This parameter is the high watermark for reception flow control, tunable by using the ndd command. When the application starts to lag behind in reading the data, data starts accumulating in the streamhead. When TCP detects this situation, it starts reducing the TCP receive window by the amount of incoming data on each incoming TCP segment. This process continues until the amount of accumulated data drops below tcp_xmit_lowat (also tunable by using the ndd command). The default value for this parameter is 24,576 in Solaris 8 OE.
Receiving socket buffer size: Similar to the sending socket buffer on the transmission side, the receiving socket buffer is where the kernel puts data for the application to read.
The parameter tcp_xmit_hiwat determines the default size of the sending socket buffer, and tcp_recv_hiwat does the same for the receiving socket buffer. However, applications can override these defaults by requesting socket buffers of different sizes through the socket library. Essentially, tuning the receive window is equivalent to selecting the socket buffer size.
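As an illustration of the per-application override, the following minimal sketch requests matching send and receive buffers through the standard SO_SNDBUF and SO_RCVBUF socket options. The helper name make_tuned_socket and the 64-kilobyte default are assumptions for this example, not part of the measurement tooling. The operating system may round or otherwise adjust the requested size, so the sketch reads the values back rather than assuming them.

```python
import socket


def make_tuned_socket(buf_bytes=64 * 1024):
    """Create a TCP socket and request equal send and receive buffer sizes."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set the sizes before connect() or listen() so they can influence
    # the window advertised when the connection is established.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buf_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf_bytes)
    return sock


sock = make_tuned_socket()
# The kernel reports the sizes it actually granted.
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
sock.close()
```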
At first glance, it appears that the size of socket buffers should be set to the largest possible value. However, larger socket buffers mean that more resources are allocated for each connection. As discussed earlier, a window of 18,750 bytes may be sufficient to achieve 964 Mbps. Setting socket buffers beyond a certain size will not produce any more benefit. Furthermore, socket buffer sizes determine only the maximum window size. In the middle of a TCP data transmission, the current receive window size is the available buffer space in the receiver, and the current send window size is equal to MIN (receive window, send socket buffer). Hence, the send socket buffer should be no smaller than the receive socket buffer, so that the send window can match the receive window. In the experiments, the sizes of the send socket buffer and the receive socket buffer are equal.
The window size only applies to each individual connection. For multiple simultaneous connections, the pressure on window size may not be as large as for a single connection. But what exactly is the value needed for the gigabit Ethernet on Sun Fire servers?
Impact of Socket Buffer Size on Single and 10-Connection Throughput
Socket buffers from 16 kilobytes to one megabyte were investigated using the ce interface card (the size of the send socket buffer is always matched with that of the receive socket buffer). Throughput numbers for both one TCP connection and 10 TCP connections were measured. As FIGURE 1 shows, for a 10-connection sending operation, socket buffers of 48 kilobytes to one megabyte have a significant throughput advantage (up to 20 percent improvement) over smaller socket buffers. However, for a 10-connection receiving operation, only the 24-kilobyte socket buffer is an underperformer. Although 16 kilobytes is quite small, the deficiency in individual connections is more than covered by the existence of multiple connections. For the single-connection case, it appears that 48-kilobyte or larger buffers are needed, but 128-kilobyte or larger buffers do not seem to bring additional benefit. In summary, socket buffers of 64 kilobytes appear to be the best compromise to accommodate both single-connection and 10-connection traffic. The remaining experiments use 64-kilobyte socket buffers.
FIGURE 1 Impact of Socket Buffer Size On the Throughput of a ce Card
Reducing Transmission Overhead
Assuming the reception side can receive as fast as the sender can transmit, the sender must minimize the overhead it takes to transmit packets. The associated overhead is mostly in moving data between the socket buffer and the kernel modules, and the number of acknowledgment packets the sender needs to process for smooth pumping of data.
Reducing Overhead to Move Data
Since the Solaris OE must copy the data from the application area to the kernel area for transmission, there is an overhead related to the copy operation. The ndd-tunable parameter tcp_maxpsz_multiplier helps to control the amount of data each copy operation can move (FIGURE 2). Since the TCP module in the kernel needs to process all the pending data before it passes the data down to the IP module, this amount should not be so large that it prevents the packets from arriving at the hardware in a continuous flow. The default value for this parameter is two in the Solaris 8 OE, and its unit is the TCP MSS. But what exactly is the best value for this parameter to support bulk transfer traffic?
FIGURE 2 Effect of tcp_maxpsz_multiplier on Sending Side Performance
To answer this question, the values 1, 2, 4, 8, 10, 12, 16, 22, and 44 were tried for this parameter using 64-kilobyte socket buffers. Although the maximum allowed value for this parameter is 100, the tests stopped at 44 because 64 kilobytes (the socket buffer) divided by 1,460 (the MSS) yields roughly 44.88. As shown in FIGURE 2, this parameter has almost no effect when there are ten connections. For the single-connection case, values of eight or larger deliver about 25 percent higher performance than values of 1, 2, and 4. The best throughput is reached when this parameter is set to 10, so this value is recommended.
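The upper bound used in the tests, and the command to apply the recommended value, can be sketched as follows. The helper names are hypothetical. The command string uses the usual ndd -set form against /dev/tcp; treat it as an illustration and verify it on the target system before use.

```python
def max_useful_multiplier(socket_buffer_bytes, mss_bytes):
    """Largest whole number of MSS-sized segments that fit in the socket buffer."""
    return socket_buffer_bytes // mss_bytes


def ndd_set_command(parameter, value, device="/dev/tcp"):
    """Format an ndd command line; this only builds the string, it runs nothing."""
    return f"ndd -set {device} {parameter} {value}"


# 64-kilobyte socket buffer and 1,460-byte MSS, as in the experiments.
print(max_useful_multiplier(64 * 1024, 1460))        # 44

# Recommended value from the single-connection results.
print(ndd_set_command("tcp_maxpsz_multiplier", 10))
# ndd -set /dev/tcp tcp_maxpsz_multiplier 10
```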
Potential Impact of ACK Packets
Another important factor that affects the transmission overhead is the number of acknowledgment (ACK) packets the sender receives for every data packet sent. Each ACK packet is handled no differently from a regular data packet until it reaches the TCP module. Hence, a system must invest a considerable amount of resources to process the ACK packets. The larger the number of ACK packets, the higher the overhead per data packet.
The ratio of data packets to ACK packets is used to measure this overhead. In Solaris 8 OE, the parameter tcp_deferred_acks_max controls the initial maximum amount of data the receiver (in the same local subnet as the sender) can hold before it must emit an ACK packet. Although the unit of this parameter is the TCP MSS, it is equivalent to the number of packets in bulk transfer traffic. Hence, setting tcp_deferred_acks_max (using ndd) to eight means the receiver can send one ACK packet for every eight data packets it receives, provided that the timer set by tcp_deferred_acks_interval does not expire. The effect of tcp_deferred_acks_max also depends on the link quality and the status of the network. If ACK packets get lost for some reason, the sender will eventually retransmit data. When the receiver sees this, it adapts by sending ACK packets more frequently, reducing by one MSS the amount of data it can hold without ACKing. This process is triggered for every retransmitted segment. Even in the worst case, however, the receiver sends an ACK packet for every other data packet it receives, as suggested by RFC 1122.
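The adaptation described above can be modeled with a small sketch. This is a simplified illustration of the described behavior, not the Solaris implementation: it assumes the limit drops by exactly one MSS per retransmitted segment and never falls below two segments per ACK (the "every other data packet" case). The function name and the floor parameter are hypothetical.

```python
def deferred_ack_limit(initial_mss, retransmissions, floor_mss=2):
    """Segments the receiver may hold before ACKing, after some retransmissions.

    Simplified model: one MSS is removed per retransmitted segment, and the
    limit never drops below floor_mss.
    """
    return max(floor_mss, initial_mss - retransmissions)


# Start from eight MSS and watch the data-to-ACK ratio shrink.
for retransmitted in range(8):
    print(retransmitted, deferred_ack_limit(8, retransmitted))
# 0 8
# 1 7
# 2 6
# 3 5
# 4 4
# 5 3
# 6 2
# 7 2
```

In this model, six retransmitted segments are enough to take a connection from one ACK per eight data packets down to one ACK per two data packets.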
FIGURE 3 shows how the data packet rate changes as the number of ACK packets rises in the first 38 seconds of a connection with poor link quality. This experiment uses the default value of eight (MSS) for the parameter tcp_deferred_acks_max and 100 milliseconds for the parameter tcp_deferred_acks_interval. The packet rate is highest when the connection first starts (at a resolution of one second, the behavior of the slow-start phase cannot be observed). Even though eight data packets per ACK packet are desirable, the figure starts with a ratio of four. This ratio is also an effect of the low resolution of the graph: the packet rate reported is the average packet rate during the past second. Using the snoop utility, four retransmitted segments can be observed, which force the receiver's Solaris 8 OE to adjust, four times, the amount of data it can hold without ACKing. Two more jumps in the ACK packet rate can be observed in the 6th and 7th seconds. The ratio of data packets to ACK packets remains about 2.0 for the rest of the connection.
FIGURE 3 Impact of ACK Packets on Data Packet Rate
Data transfer is only a part of any running user application. To have an application running fast, the CPU time spent in data transfer must be minimized. Hence, knowing how much CPU time is dedicated to data transfer helps to plan the capacity requirement.
The CPU utilization for ce cards when the number of CPUs goes from one to eight was evaluated (FIGURE 4 through FIGURE 7). The utilization numbers shown in these figures are reported as the percentage of time that all available CPUs are engaged. A single number can have different meanings when the underlying number of CPUs is different. For instance, 50 percent utilization for a system with four CPUs means two CPUs are busy on average. Fifty percent utilization for a system with eight CPUs means four CPUs are busy on average.
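The conversion from an overall utilization figure to an equivalent number of busy CPUs is a single multiplication. The helper name below is illustrative only.

```python
def busy_cpus(cpu_count, utilization_percent):
    """Average number of fully busy CPUs implied by an overall utilization figure."""
    return cpu_count * utilization_percent / 100


print(busy_cpus(4, 50))   # 2.0
print(busy_cpus(8, 50))   # 4.0
```

Comparing configurations by busy CPUs rather than by raw percentages keeps the comparison fair when the CPU count differs.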
FIGURE 4 Throughput and Amount of Kernel Mode CPU Time Required To Support One TCP Sending Operation By One ce Card
FIGURE 5 Throughput and Amount of Kernel Mode CPU Time Required To Support One TCP Receiving Operation By One ce Card
FIGURE 6 Throughput and Amount of Kernel Mode CPU Time Required To Support 10 Simultaneous Sending Operations By One ce Card
FIGURE 7 Throughput and Amount of Kernel Mode CPU Time Required To Support 10 Simultaneous Receiving Operations By One ce Card
One-Card, One-Connection CPU Utilization
In the one-connection case (FIGURE 4 and FIGURE 5), two CPUs are sufficient to drive one ce card to its maximum capacity for send-only operations. For reception, three CPUs are necessary to have one ce card deliver 600 Mbps, and five CPUs are needed to deliver 700 Mbps. The best reception performance is achieved with six CPUs (736 Mbps); seven or eight CPUs do not seem to bring additional performance benefit. Even though the overall CPU utilization drops to 15 percent when there are eight CPUs (FIGURE 4), it still takes about 8 * 0.15 = 1.2 CPUs to handle network traffic. When there are five CPUs, the number of CPUs needed to handle network traffic is 5 * 0.26 = 1.30, not far from the 1.2 CPUs needed in the 8-CPU case. In summary, two CPUs are necessary to obtain good performance for one ce card. Diminishing returns are observed for three or more CPUs.
One-Card, Ten-Connection CPU Utilization
For the 10-connection scenario (FIGURE 6 and FIGURE 7), each additional CPU brings higher sending performance, with eight CPUs achieving 830 Mbps. About 8 * 35% = 2.80 CPUs are dedicated to network traffic. For receiving, ce can reach close to line speed (920 Mbps) with six CPUs, utilizing the power of 6 * 70% = 4.2 CPUs. Diminishing returns are observed when four or more CPUs are added, indicating a recommendation of three CPUs for one ce card.
Note that the preceding numbers were measured using ce driver version 1.115. The Sun engineering team devotes continuous effort to improving both throughput and utilization. You can expect to observe better results than those reported here in your future experiments.