Home > Articles > Operating Systems, Server > Solaris

  • Print
  • + Share This
This chapter is from the book

Case Studies

In this section, we apply the scripts and methods discussed in this chapter to observe and measure applications with DTrace.

Firefox idle

This case study examines the Mozilla Firefox Web browser version 3, running on Oracle Solaris.

The Problem

Firefox is 8.9 percent on-CPU yet has not been used for hours. What is costing 8.9 percent CPU?

# prstat
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 27060 brendan   856M  668M sleep   59    0   7:30:44 8.9% firefox-bin/17
 27035 brendan   150M  136M sleep   59    0   0:20:51 0.4% opera/3
 18722 brendan   164M   38M sleep   59    0   0:57:53 0.1% java/18
  1748 brendan  6396K 4936K sleep   59    0   0:03:13 0.1% screen-4.0.2/1
 17303 brendan   305M  247M sleep   59    0  34:16:57 0.1% Xorg/1
 27754 brendan  9564K 3772K sleep   59    0   0:00:00 0.0% sshd/1
 19998 brendan    68M 7008K sleep   59    0   2:41:34 0.0% gnome-netstatus/1
 27871 root     3360K 2792K cpu0    49    0   0:00:00 0.0% prstat/1
 29805 brendan    54M   46M sleep   59    0   1:53:23 0.0% elinks/1
[...]

Profiling User Stacks

The uoncpu.d script (from the "Scripts" section) was run for ten seconds:

# uoncpu.d firefox-bin
dtrace: script 'uoncpu.d' matched 1 probe
^C
[...output truncated...]

  on-cpu (count @1001hz):
              libmozjs.so`js_FlushPropertyCacheForScript+0xe6
              libmozjs.so`js_DestroyScript+0xc1
              libmozjs.so`JS_EvaluateUCScriptForPrincipals+0x87
              libxul.so`__1cLnsJSContextOEvaluateString6MrknSnsAString_internal_pvpnMn
sIPrincip8
              libxul.so`__1cOnsGlobalWindowKRunTimeout6MpnJnsTimeout__v_+0x59c
              libxul.so`__1cOnsGlobalWindowNTimerCallback6FpnInsITimer_pv_v_+0x2e
              libxul.so`__1cLnsTimerImplEFire6M_v_+0x144
              libxul.so`__1cMnsTimerEventDRun6M_I_+0x51
              libxul.so`__1cInsThreadQProcessNextEvent6Mipi_I_+0x143
              libxul.so`__1cVNS_ProcessNextEvent_P6FpnJnsIThread_i_i_+0x44
              libxul.so`__1cOnsBaseAppShellDRun6M_I_+0x3a
              libxul.so`__1cMnsAppStartupDRun6M_I_+0x34
              libxul.so`XRE_main+0x35e3
              firefox-bin`main+0x223
              firefox-bin`_start+0x7d
               42

The output was many pages long and includes C++ signatures as function names (they can be passed through c++filt to improve readability). The hottest stack is in libmozjs, which is SpiderMonkey—the Firefox JavaScript engine. However, the count for this hot stack is only 42, which, when the other counts from the numerous truncated pages are tallied, is likely to represent only a fraction of the CPU cycles. (uoncpu.d can be enhanced to print a total sample count and the end to make this ratio calculation easy to do.)

Profiling User Modules

Perhaps an easier way to find the origin of the CPU usage is to not aggregate on the entire user stack track but just the top-level user module. This won't be as accurate—a user module may be consuming CPU by calling functions from a generic library such as libc—but it is worth a try:

# dtrace -n 'profile-1001 /execname == "firefox-bin"/ { @[umod(arg1)] = count(); }
tick-60sec { exit(0); }'
dtrace: description 'profile-1001 ' matched 2 probes
CPU     ID                    FUNCTION:NAME
  1  63284                      :tick-60sec 

  libsqlite3.so                                                      1
  0xf0800000                                                         2
  libplds4.so                                                        2
  libORBit-2.so.0.0.0                                                5
  0xf1600000                                                         8
  libgthread-2.0.so.0.1400.4                                        10
  libgdk-x11-2.0.so.0.1200.3                                        14
  libplc4.so                                                        16
  libm.so.2                                                         19
  libX11.so.4                                                       50
  libnspr4.so                                                      314
  libglib-2.0.so.0.1400.4                                          527
  0x0                                                              533
  libflashplayer.so                                               1143
  libc.so.1                                                       1444
  libmozjs.so                                                     2671
  libxul.so                                                       4143

The hottest module was libxul, which is the core Firefox library. The next was libmozjs (JavaScript) and then libc (generic system library). It is possible that libmozjs is responsible for the CPU time in both libc and libxul, by calling functions from them. We'll investigate libmozjs (JavaScript) first; if this turns out to be a dead end, we'll return to libxul.

Function Counts and Stacks

To investigate JavaScript, the DTrace JavaScript provider can be used (see Chapter 8). For the purposes of this case study, let's assume that such a convenient provider is not available. To understand what the libmosjs library is doing, we'll first frequency count function calls:

# dtrace -n 'pid$target:libmozjs::entry { @[probefunc] = count(); }' -p `pgrep firefox-bin`
dtrace: description 'pid$target:libmozjs::entry ' matched 1617 probes
^C

  CloseNativeIterators                                               1
  DestroyGCArenas                                                    1
  JS_CompareValues                                                   1
  JS_DefineElement                                                   1
  JS_FloorLog2                                                       1
  JS_GC                                                              1
[...]
  JS_free                                                        90312
  js_IsAboutToBeFinalized                                        92414
  js_GetToken                                                    99666
  JS_DHashTableOperate                                          102908
  GetChar                                                       109323
  fun_trace                                                     132924
  JS_GetPrivate                                                 197322
  js_TraceObject                                                213983
  JS_TraceChildren                                              228323
  js_SearchScope                                                267826
  js_TraceScopeProperty                                         505450
  JS_CallTracer                                                1923784

The most frequent function called was JS_CallTracer(), which was called almost two million times during the ten seconds that this one-liner was tracing. To see what it does, the source code could be examined; but before we do that, we can get more information from DTrace including frequency counting the user stack trace to see who is calling this function:

# dtrace -n 'pid$target:libmozjs:JS_CallTracer:entry { @[ustack()] =
count(); }' -p `pgrep firefox-bin`
[...]

              libmozjs.so`JS_CallTracer
              libmozjs.so`js_TraceScopeProperty+0x54
              libmozjs.so`js_TraceObject+0xd5
              libmozjs.so`JS_TraceChildren+0x351
              libxul.so`__1cLnsXPConnectITraverse6MpvrnbInsCycleCollectionTraversalCal
lback__I_+0xc7
              libxul.so`__1cQnsCycleCollectorJMarkRoots6MrnOGCGraphBuilder__v_+0x96
              libxul.so`__1cQnsCycleCollectorPBeginCollection6M_i_+0xf1
              libxul.so`__1cbGnsCycleCollector_beginCollection6F_i_+0x26
              libxul.so`__1cZXPCCycleCollectGCCallback6FpnJJSContext_nKJSGCStatus__i_+0xd8
              libmozjs.so`js_GC+0x5ef
              libmozjs.so`JS_GC+0x4e
              libxul.so`__1cLnsXPConnectHCollect6M_i_+0xaf
              libxul.so`__1cQnsCycleCollectorHCollect6MI_I_+0xee
              libxul.so`__1cYnsCycleCollector_collect6F_I_+0x28
              libxul.so`__1cLnsJSContextGNotify6MpnInsITimer__I_+0x375
              libxul.so`__1cLnsTimerImplEFire6M_v_+0x12d
              libxul.so`__1cMnsTimerEventDRun6M_I_+0x51
              libxul.so`__1cInsThreadQProcessNextEvent6Mipi_I_+0x143
              libxul.so`__1cVNS_ProcessNextEvent_P6FpnJnsIThread_i_i_+0x44
              libxul.so`__1cOnsBaseAppShellDRun6M_I_+0x3a
            40190

The stack trace here has been truncated (increase the ustackframes tunable to see all); however, enough has been seen for this and the truncated stack traces to see that they originate from JS_GC()—a quick look at the code confirms that this is JavaScript Garbage Collect.

Function CPU Time

Given the name of the garbage collect function, a script can be quickly written to check the CPU time spent in it (named jsgc.d):

1  #!/usr/sbin/dtrace -s
2
3  #pragma D option quiet
4
5  pid$target::JS_GC:entry
6  {
7          self->vstart = vtimestamp;
8  }
9
10  pid$target::JS_GC:return
11  /self->vstart/
12  {
13          this->oncpu = (vtimestamp - self->vstart) / 1000000;
14          printf("%Y GC: %d CPU ms\n", walltimestamp, this->oncpu);
15          self->vstart = 0;
16  }

Script jsgc.d

               

This specifically measures the elapsed CPU time (vtimestamp) for JS_GC(). (Another approach would be to use the profile provider and count stack traces that included JS_GC().)

Here we execute jsgc.d:

# jsgc.d -p `pgrep firefox-bin`
2010 Jul  4 01:06:57 GC: 331 CPU ms
2010 Jul  4 01:07:38 GC: 316 CPU ms
2010 Jul  4 01:08:18 GC: 315 CPU ms
^C

So, although GC is on-CPU for a significant time, more than 300 ms per call, it's not happening frequently enough to explain the 9 percent CPU average of Firefox. This may be a problem, but it's not the problem. (This is included here for completeness; this is the exact approach used to study this issue.)

Another frequently called function was js_SearchScope(). Checking its stack trace is also worth a look:

# dtrace -n 'pid$target:libmozjs:js_SearchScope:entry { @[ustack()] =
count(); }' -p `pgrep firefox-bin`
dtrace: description 'pid$target:libmozjs:js_SearchScope:entry ' matched 1 probe
^C
[...output truncated...]

              libmozjs.so`js_SearchScope
              libmozjs.so`js_DefineNativeProperty+0x2f1
              libmozjs.so`call_resolve+0x1e7
              libmozjs.so`js_LookupProperty+0x3d3
              libmozjs.so`js_PutCallObject+0x164
              libmozjs.so`js_Interpret+0x9cd4
              libmozjs.so`js_Execute+0x3b4
              libmozjs.so`JS_EvaluateUCScriptForPrincipals+0x58
              libxul.so`__1cLnsJSContextOEvaluateString6MrknSnsAString_internal_pvpnMn
sIPrincipal_pkcIIp1pi_I_+0x2e8
              libxul.so`__1cOnsGlobalWindowKRunTimeout6MpnJnsTimeout__v_+0x59c
              libxul.so`__1cOnsGlobalWindowNTimerCallback6FpnInsITimer_pv_v_+0x2e
              libxul.so`__1cLnsTimerImplEFire6M_v_+0x144
              libxul.so`__1cMnsTimerEventDRun6M_I_+0x51
              libxul.so`__1cInsThreadQProcessNextEvent6Mipi_I_+0x143
              libxul.so`__1cVNS_ProcessNextEvent_P6FpnJnsIThread_i_i_+0x44
              libxul.so`__1cOnsBaseAppShellDRun6M_I_+0x3a
              libxul.so`__1cMnsAppStartupDRun6M_I_+0x34
              libxul.so`XRE_main+0x35e3
              firefox-bin`main+0x223
              firefox-bin`_start+0x7d
             9287

This time, the function is being called by js_Execute(), the entry point for JavaScript code execution (and itself was called by JS_EvaluateUCScriptFor-Principals()). Here we are modifying the earlier script to examine on-CPU time (now jsexecute.d):

1  #!/usr/sbin/dtrace -s
2
3  pid$target::js_Execute:entry
4  {
5          self->vstart = vtimestamp;
6  }
7
8  pid$target::js_Execute:return
9  /self->vstart/
10  {
11          this->oncpu = vtimestamp - self->vstart;
12          @["js_Execute Total(ns):"] = sum(this->oncpu);
13          self->vstart = 0;
14  }

Script jsexecute.d

               

Here we run it for ten seconds:

# jsexecute.d -p `pgrep firefox-bin` -n 'tick-10sec { exit(0); }'
dtrace: script 'jsexecute.d' matched 2 probes
dtrace: description 'tick-10sec ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  64907                      :tick-10sec 


  js_Execute Total(ns):                                     427936779

This shows 428 ms of time in js_Execute() during those ten seconds, and so this CPU cost can explain about half of the Firefox CPU time (this is a single-CPU system; therefore, there is 10,000 ms of available CPU time every 10 seconds, so this is about 4.3 percent of CPU).

The JavaScript functions could be further examined with DTrace to find out why this JavaScript program is hot on-CPU, in other words, what exactly it is doing (the DTrace JavaScript provider would help here, or a Firefox add-on could be tried).

Fetching Context

Here we will find what is being executed: preferably the URL. Examining the earlier stack trace along with the Firefox source (which is publically available) showed the JavaScript filename is the sixth argument to the JS_EvaluateUCScriptFor-Principals() function. Here we are pulling this in and frequency counting:

# dtrace -n 'pid$target::*EvaluateUCScriptForPrincipals*:entry { @[copyinstr(arg5)] =
 count(); } tick-10sec { exit(0); }' -p `pgrep firefox-bin`
dtrace: description 'pid$target::*EvaluateUCScriptForPrincipals*:entry ' matched 2 probes
CPU     ID                    FUNCTION:NAME
  1  64907                      :tick-10sec 

  http://www.example.com/js/st188.js                                         7056

The name of the URL has been modified in this output (to avoid embarrassing anyone); it pointed to a site that I didn't think I was using, yet their script was getting executed more than 700 times per second anyway, which is consuming (wasting!) at least 4 percent of the CPU on this system.

The Fix

An add-on was already available that could help at this point: SaveMemory, which allows browser tabs to be paused. The DTrace one-liner was modified to print continual one-second summaries, while all tabs were paused as an experiment:

# dtrace -n 'pid$target::*EvaluateUCScriptForPrincipals*:entry { @[copyinstr(arg5)] =
count(); } tick-1sec { printa(@); trunc(@); }' -p `pgrep firefox-bin`
[...]
  1  63140                       :tick-1sec
  http://www.example.com/js/st188.js                                     697

  1  63140                       :tick-1sec
  http://www.example.com/js/st188.js                                     703

  1  63140                       :tick-1sec
file:///export/home/brendan/.mozilla/firefox/3c8k4kh0.default/extensions/%7Be4a8a97b-f
2ed-450b-b12d-ee082ba24781%7D/components/greasemonkey.js                1
  http://www.example.com/js/st188.js                                     126

  1  63140                       :tick-1sec

  1  63140                       :tick-1sec

The execution count for the JavaScript program begins at around 700 executions per second and then vanishes when pausing all tabs. (The output has also caught the execution of greasemonkey.js, executed as the add-on was used.)

prstat(1M) shows the CPU problem is no longer there (shown after waiting a few minutes for the %CPU decayed average to settle):

# prstat
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 27035 brendan   150M  136M sleep   49    0   0:27:15 0.2% opera/4
 27060 brendan   407M  304M sleep   59    0   7:35:12 0.1% firefox-bin/17
 28424 root     3392K 2824K cpu1    49    0   0:00:00 0.0% prstat/1
[...]

Next, the browser tabs were unpaused one by one to identify the culprit, while still running the DTrace one-liner to track JavaScript execution by file. This showed that there were seven tabs open on the same Web site that was running the JavaScript program—each of them executing it about 100 times per second. The Web site is a popular blogging platform, and the JavaScript was being executed by what appears to be an inert icon that links to a different Web site (but as we found out—it is not inert).7 The exact operation of that JavaScript program can now be investigated using the DTrace JavaScript provider or a Firefox add-on debugger.

Conclusion

A large component of this issue turned out to be a rogue JavaScript program, an issue that could also have been identified with Firefox add-ons. The advantage of using DTrace is that if there is an issue, the root cause can be identified—no matter where it lives in the software stack. As an example of this,8 about a year ago a performance issue was identified in Firefox and investigated in the same way—and found to be a bug in a kernel frame buffer driver (video driver); this would be extremely difficult to have identified from the application layer alone.

Xvnc

Xvnc is a Virtual Network Computing (VNC) server that allows remote access to X server–based desktops. This case study represents examining an Xvnc process that is CPU-bound and demonstrates using the syscall and profile providers.

When performing a routine check of running processes on a Solaris system by using prstat(1), it was discovered that an Xvnc process was the top CPU consumer. Looking just at that process yields the following:

solaris# prstat -c -Lmp 5459
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
  5459 nobody    86  14 0.0 0.0 0.0 0.0 0.0 0.0   0  36 .2M 166 Xvnc/1

We can see the Xvnc process is spending most of its time executing in user mode (USR, 86 percent) and some of its time in the kernel (SYS, 14 percent). Also worth noting is it is executing about 200,000 system calls per second (SCL value of .2M).

syscall Provider

Let's start by checking what those system calls are. This one-liner uses the syscall provider to frequency count system calls for this process and prints a summary every second:

solaris# dtrace -qn 'syscall:::entry /pid == 5459/ { @[probefunc] =
count(); } tick-1sec { printa(@); trunc(@); }'

  read                                                               4
  lwp_sigmask                                                       34
  setcontext                                                        34
  setitimer                                                         68
  accept                                                         48439
  gtime                                                          48439
  pollsys                                                        48440
  write                                                          97382
  read                                                               4
  lwp_sigmask                                                       33
  setcontext                                                        33
  setitimer                                                         66
  gtime                                                          48307
  pollsys                                                        48307
  accept                                                         48308
  write                                                          97117

Because the rate of system calls was relatively high, as reported by prstat(1M), we opted to display per-second rates with DTrace. The output shows more than 97,000 write() system calls per second and just more than 48,000 accept(), poll(), and gtime() calls.

Let's take a look at the target of all the writes and the requested number of bytes to write:

solaris# dtrace -qn 'syscall::write:entry /pid == 5459/ { @[fds[arg0].fi_pathname,
arg2] = count(); }'
^C

/var/adm/X2msgs                                              26                 8
/devices/pseudo/mm@0:null                                  8192              3752
/var/adm/X2msgs                                              82            361594
/var/adm/X2msgs                                              35            361595

The vast majority of the writes are to a file, /var/adm/X2msgs. The number of bytes to write was 82 bytes and 35 bytes for the most part (more than 361,000 times each). Checking that file yields the following:

solaris# ls -l /var/adm/X2msgs
-rw-r--r--   1 root     nobody   2147483647 Aug 13 15:05 /var/adm/X2msgs
solaris# tail /var/adm/X2msgs
               connection: Invalid argument (22)
  XserverDesktop: XserverDesktop::wakeupHandler: unable to accept new
               connection: Invalid argument (22)
  XserverDesktop: XserverDesktop::wakeupHandler: unable to accept new
               connection: Invalid argument (22)
  XserverDesktop: XserverDesktop::wakeupHandler: unable to accept new
               connection: Invalid argument (22)
  XserverDesktop: XserverDesktop::wakeupHandler: unable to accept new
               connection: Invalid argument (22)

Looking at the file Xvnc is writing to, we can see it is getting very large (more than 2GB), and the messages themselves appear to be error messages. We will explore that more closely in just a minute.

Given the rate of 97,000 writes per second, we can already extrapolate that each write is taking much less than 1 ms (1/97000 = 0.000010), so we know the data is probably being written to main memory (since the file resides on a file system and the writes are not synchronous, they are being satisfied by the in-memory file system cache). We can of course time these writes with DTrace:

solaris# dtrace -qn 'syscall::write:entry /pid == 5459/
{ @[fds[arg0].fi_fs] = count(); }'                    
^C

  specfs                                                         2766
  zfs                                                          533090

solaris# cat -n w.d
1   #!/usr/sbin/dtrace -qs 
2   
3    syscall::write:entry 
4   /pid == 5459 && fds[arg0].fi_fs == "zfs"/ 
5    { 
6           self->st = timestamp; 
7    } 
8    syscall::write:return
9    /self->st/
10  {
11          @ = quantize(timestamp - self->st);
12           self->st = 0;
13  }

solaris#  ./w.d
^C

           value  ------------- Distribution ------------- count    
             256 |                                         0        
             512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1477349  
            1024 |                                         2312     
            2048 |                                         3100     
            4096 |                                         250      
            8192 |                                         233      
           16384 |                                         145      
           32768 |                                         90       
           65536 |                                         0

Before measuring the write time, we wanted to be sure we knew the target file system type of the file being written, which was ZFS. We used that in the predicate in the w.d script to measure write system calls for this process (along with the process PID test). The output of w.d is a quantize aggregation that displays wall clock time for all the write calls executed to a ZFS file system from that process during the sampling period. We see that most of the writes fall in the 512-nanosecond to 1024-nanosecond range, so these are most certainly writes to memory.

We can determine the user code path leading up to the writes by aggregating on the user stack when the write system call is called:

solaris# dtrace -qn 'syscall::write:entry /pid == 5459 && fds[arg0].fi_fs ==
"zfs"/ { @[ustack()] = count(); }'
^C
[...]
               libc.so.1`_write+0x7
               libc.so.1`_ndoprnt+0x2816
               libc.so.1`fprintf+0x99
               Xvnc`_ZN3rfb11Logger_File5writeEiPKcS2_+0x1a5
               Xvnc`_ZN3rfb6Logger5writeEiPKcS2_Pc+0x36
               Xvnc`_ZN3rfb9LogWriter5errorEPKcz+0x2d
               Xvnc`_ZN14XserverDesktop13wakeupHandlerEP6fd_seti+0x28b
               Xvnc`vncWakeupHandler+0x3d
               Xvnc`WakeupHandler+0x36
               Xvnc`WaitForSomething+0x28d
               Xvnc`Dispatch+0x76
               Xvnc`main+0x3e5
               Xvnc`_start+0x80
            430879

               libc.so.1`_write+0x7
               libc.so.1`_ndoprnt+0x2816
               libc.so.1`fprintf+0x99
               Xvnc`_ZN3rfb11Logger_File5writeEiPKcS2_+0x1eb
               Xvnc`_ZN3rfb6Logger5writeEiPKcS2_Pc+0x36
               Xvnc`_ZN3rfb9LogWriter5errorEPKcz+0x2d
               Xvnc`_ZN14XserverDesktop13wakeupHandlerEP6fd_seti+0x28b
               Xvnc`vncWakeupHandler+0x3d
               Xvnc`WakeupHandler+0x36
               Xvnc`WaitForSomething+0x28d
               Xvnc`Dispatch+0x76
               Xvnc`main+0x3e5
               Xvnc`_start+0x80
            430879

We see two very similar stack frames, indicating a log event is causing the Xvnc process to write to its log file.

We can even use DTrace to observe what is being written to the file, by examining the contents of the buffer pointer from the write(2) system call. It is passed to the copyinstr() function, both to copy the data from user-land into the kernel address space and to treat it as a string:

solaris# dtrace -n 'syscall::write:entry /pid == 5459/ { @[copyinstr(arg1)] =
count(); }'
dtrace: description 'syscall::write:entry ' matched 1 probe
^C

Sun Aug 22 00:09:05 2010
ent (22)
keupHandler: unable to accept new
             st!
Ltd.
See http://www.realvnc.com for information on VNC.
                1
Sun Aug 22 00:09:06 2010
ent (22)
keupHandler: unable to accept new
             st!
                2
[...]
upHandler: unable to accept new connection: Invalid argument (22)XserverDesktop::wakeu
pHandler: unable to accept new connection: Invalid argument (22)XserverDesktop::wakeup
Handler: unable to accept new connection: Invalid argument (22)XserverDesktop::wake
                59
valid argument (22)XserverDesktop::wakeupHandler: unable to accept new connection: I
nvalid argument (22)XserverDesktop::wakeupHandler: unable to accept new connection: In
valid argument (22)XserverDesktop::wakeupHandler: unable to accept new connection: In
                59

This shows the text being written to the log file, which largely contains errors describing invalid arguments used for new connections. Remember that our initial one-liner discovered more than 48,000 accept() system calls per-second—it would appear that these are failing because of invalid arguments, which is being written as an error message to the /var/adm/X2msgs log.

DTrace can confirm that the accept() system calls are failing in this way, by examining the error number (errno) on syscall return:

solaris# dtrace -n 'syscall::accept:return /pid == 5459/ { @[errno] = count(); }'
dtrace: description 'syscall::accept:return ' matched 1 probe
^C
       22           566135

solaris# grep 22 /usr/include/sys/errno.h
#define     EINVAL      22    /* Invalid argument                 */

All the accept() system calls are returning with errno 22, EINVAL (Invalid argument). The reason for this can be investigated by examining the arguments to the accept() system call.

solaris# dtrace -n 'syscall::accept:entry /execname == "Xvnc"/ { @[arg0, arg1,
arg2] = count(); }'
dtrace: description 'syscall::accept:entry ' matched 1 probe
^C

                3                0                0           150059

We see the first argument to accept is 3, which is the file descriptor for the socket. The second two arguments are both NULL, which may be the cause of the EINVAL error return from accept. It is possible it is valid to call accept with the second and third arguments as NULL values,9 in which case the Xvnc code is not handling the error return properly. In either case, the next step would be to look at the Xvnc source code and find the problem. The code is burning a lot of CPU with calls to accept(2) that are returning an error and each time generating a log file write.

While still using the syscall provider, the user code path for another of the other hot system calls can be examined:

solaris# dtrace -n 'syscall::gtime:entry /pid == 5459/ { @[ustack()] = count(); }'
dtrace: description 'syscall::gtime:entry ' matched 1 probe
^C

              libc.so.1`__time+0x7
              Xvnc`_ZN3rfb11Logger_File5writeEiPKcS2_+0xce
              Xvnc`_ZN3rfb6Logger5writeEiPKcS2_Pc+0x36
              Xvnc`_ZN3rfb9LogWriter5errorEPKcz+0x2d
              Xvnc`_ZN14XserverDesktop13wakeupHandlerEP6fd_seti+0x28b
              Xvnc`vncWakeupHandler+0x3d
              Xvnc`WakeupHandler+0x36
              Xvnc`WaitForSomething+0x28d
              Xvnc`Dispatch+0x76
              Xvnc`main+0x3e5
              Xvnc`_start+0x80
           370156

This shows that calls to gtime(2) are part of the log file writes in the application, based on the user function names we see in the stack frames.

profile Provider

To further understand the performance of this process, we will sample the on-CPU code at a certain frequency, using the profile provider.

solaris# dtrace -n 'profile-997hz /arg1 && pid == 5459/ { @[ufunc(arg1)] = count(); }'
dtrace: description 'profile-997hz ' matched 1 probe
^C
[...]
  libc.so.1`memcpy                                                 905
  Xvnc`_ZN14XserverDesktop12blockHandlerEP6fd_set                  957
  libgcc_s.so.1`uw_update_context_1                               1155
  Xvnc`_ZN3rdr15SystemExceptionC2EPKci                            1205
  libgcc_s.so.1`execute_cfa_program                               1278
  libc.so.1`strncat                                               1418
  libc.so.1`pselect                                               1686
  libstdc++.so.6.0.3`_Z12read_uleb128PKhPj                        1700
  libstdc++.so.6.0.3`_Z28read_encoded_value_with_basehjPKhPj      2198
  libstdc++.so.6.0.3`__gxx_personality_v0                         2445
  libc.so.1`_ndoprnt                                              3918

This one-liner shows which user functions were on-CPU most frequently. It tests for user mode (arg1) and the process of interest and uses the ufunc() function to convert the user-mode on-CPU program counter (arg1) into the user function name. The most frequent is a libc function, _ndoprnt(), followed by several functions from the standard C++ library.

For a detailed look of the user-land code path that is responsible for consuming CPU cycles, aggregate on the user stack:

solaris# dtrace -n 'profile-997hz /arg1 && pid == 5459/ { @[ustack()] =
count(); } tick-10sec { trunc(@, 20); exit(0); }'
^c
[...]
              libstdc++.so.6.0.3`__gxx_personality_v0+0x29f
              libgcc_s.so.1`_Unwind_RaiseException+0x88
              libstdc++.so.6.0.3`__cxa_throw+0x64
              Xvnc`_ZN7network11TcpListener6acceptEv+0xb3
              Xvnc`_ZN14XserverDesktop13wakeupHandlerEP6fd_seti+0x13d
              Xvnc`vncWakeupHandler+0x3d
              Xvnc`WakeupHandler+0x36
              Xvnc`WaitForSomething+0x28d
              Xvnc`Dispatch+0x76
              Xvnc`main+0x3e5
              Xvnc`_start+0x80
              125

              libc.so.1`memset+0x10c
              libgcc_s.so.1`_Unwind_RaiseException+0xb7
              libstdc++.so.6.0.3`__cxa_throw+0x64
              Xvnc`_ZN7network11TcpListener6acceptEv+0xb3
              Xvnc`_ZN14XserverDesktop13wakeupHandlerEP6fd_seti+0x13d
              Xvnc`vncWakeupHandler+0x3d
              Xvnc`WakeupHandler+0x36
              Xvnc`WaitForSomething+0x28d
              Xvnc`Dispatch+0x76
              Xvnc`main+0x3e5
              Xvnc`_start+0x80
              213

Note that only the two most frequent stack frames are shown here. We see the event loop in the Xvnc code and visually decoding the mangled function names; we can see a function with network TCPListener accept in the function name. This makes sense for an application like Xvnc, which would be listening on a network socket for incoming requests and data. And we know that there's an issue with the issued accept(2) calls inducing a lot of looping around with the error returns.

We can also take a look at the kernel component of the CPU cycles consumed by this process, again using the profile provider and aggregating on kernel stacks:

solaris# dtrace -n 'profile-997hz /pid == 5459 && arg0/ { @[stack()] = count(); }'
^c
[...]
              unix`mutex_enter+0x10
              genunix`pcache_poll+0x1a5
              genunix`poll_common+0x27f
              genunix`pollsys+0xbe
              unix`sys_syscall32+0x101
               31

              unix`tsc_read+0x3
              genunix`gethrtime+0xa
              unix`pc_gethrestime+0x31
              genunix`gethrestime+0xa
              unix`gethrestime_sec+0x11
              genunix`gtime+0x9
              unix`sys_syscall32+0x101
               41

              unix`tsc_read+0x3
              genunix`gethrtime_unscaled+0xa
              genunix`syscall_mstate+0x4f
              unix`sys_syscall32+0x11d
              111

              unix`lock_try+0x8
              genunix`post_syscall+0x3b6
              genunix`syscall_exit+0x59
              unix`sys_syscall32+0x1a0
              229

The kernel stack is consistent with previously observed data. We see system call processing (remember, this process is doing 200,000 system calls per second), we see the gtime system call stack in the kernel, as well as the poll system call kernel stack. We could measure this to get more detail, but the process profile was only 14 percent kernel time, and given the rate and type of system calls being executed by this process, there is minimal additional value in terms of understanding the CPU consumption by this process in measuring kernel functions.

For a more connected view, we can trace code flow from user mode through the kernel by aggregating on both stacks:

solaris# dtrace -n 'profile-997hz /pid == 5459/ { @[stack(), ustack()] =
count(); } tick-10sec { trunc(@, 2); exit(0); }'
dtrace: description 'profile-997hz ' matched 2 probes
CPU     ID                    FUNCTION:NAME
  1 122538                      :tick-10sec
              unix`lock_try+0x8
              genunix`post_syscall+0x3b6
              genunix`syscall_exit+0x59
              unix`sys_syscall32+0x1a0

              libc.so.1`_write+0x7
              libc.so.1`_ndoprnt+0x2816
              libc.so.1`fprintf+0x99
              Xvnc`_ZN3rfb11Logger_File5writeEiPKcS2_+0x1eb
              Xvnc`_ZN3rfb6Logger5writeEiPKcS2_Pc+0x36
              Xvnc`_ZN3rfb9LogWriter5errorEPKcz+0x2d
              Xvnc`_ZN14XserverDesktop13wakeupHandlerEP6fd_seti+0x28b
              Xvnc`vncWakeupHandler+0x3d
              Xvnc`WakeupHandler+0x36
              Xvnc`WaitForSomething+0x28d
              Xvnc`Dispatch+0x76
              Xvnc`main+0x3e5
              Xvnc`_start+0x80
              211

              unix`lock_try+0x8
              genunix`post_syscall+0x3b6
              genunix`syscall_exit+0x59
              unix`sys_syscall32+0x1a0

              libc.so.1`_so_accept+0x7
              Xvnc`_ZN7network11TcpListener6acceptEv+0x18
              Xvnc`_ZN14XserverDesktop13wakeupHandlerEP6fd_seti+0x13d
              Xvnc`vncWakeupHandler+0x3d
              Xvnc`WakeupHandler+0x36
              Xvnc`WaitForSomething+0x28d
              Xvnc`Dispatch+0x76
              Xvnc`main+0x3e5
              Xvnc`_start+0x80
              493

Here we see the event loop calling into the accept(3S) interface in libc and entering the system call entry point in the kernel. The second set of stack frames shows the log write path. One of the stacks has also caught _ndoprnt, which we know from earlier to be the hottest on-CPU function, calling write() as part of Xvnc logging.

Conclusions

The initial analysis with standard operating system tools showed that the single-threaded Xvnc process was CPU bound, spending most of its CPU cycles in user-mode and performing more than 200,000 system calls per second. DTrace was used to discover that the application was continually encountering new connection failures because of invalid arguments (accept(2)) and was writing this message to a log file, thousands of times per second.

  • + Share This
  • 🔖 Save To Your Account