This chapter is from the book

2.5 Code Spelunking

Fools ignore complexity; pragmatists suffer it; experts avoid it; geniuses remove it.

Alan Perlis

A career in software means a career spent reading and trying to understand other koders’ kode. The term code spelunking, which I coined in this article, tries to give a feel for what this process is like, to wit, crawling around a very large, dangerous space in the dark with primitive tools and a very small lamp. Many years have passed since I wrote this piece, and yet the tools used to spelunk code have not really kept pace with the amount of code any one person might need to dig through; in fact, the amount of code one would need to read, and the number of programming-language boundaries one has to cross, have multiplied significantly.

A tool that is now sorely needed but that seems to elude us is one that provides useful visualization of a large code base. Most of the current crop of code spelunking tools, be they standalone or part of an IDE, allow a programmer to skip through the code in order to dive down, or up, a call chain. Systems like Doxygen have a rudimentary ability to produce a visual call graph, but these are static, hard to navigate, and quickly fail at scale. If there is one tool that most koders would use every day, it is indeed this sort of tool, because what most of us do, day to day, is dive into a morass of multiplying functions and try not to drown. The following piece is a tour of tools that might keep you afloat.

Try to remember your first day at your first software job. Do you recall what you were asked to do, after the human resources people were done with you? Were you asked to write a piece of fresh code? Probably not. It is far more likely that you were asked to fix a bug, or several, and to try to understand a large, poorly documented collection of source code.

Of course, this doesn’t just happen to new graduates; it happens to all of us whenever we start a new job or look at a new piece of code. With experience we all develop a set of techniques for working with large, unfamiliar source bases. This is what I call code spelunking.

Code spelunking is very different from other engineering practices because it is done long after the initial design and implementation of a system. It is a set of forensic techniques used after the crime has been committed.

There are several questions that code spelunkers need to ask, and tools are available to help them answer these questions. I will look at some of these tools, addressing their shortcomings and pointing out possible improvements.

Source bases are already large, and getting larger all the time. At the time of this writing, the Linux kernel, with support for 17 different processor architectures, is made up of 642 directories, 12,417 files, and more than 5 million lines of code. A complex network server such as Apache is made up of 28 directories and 471 files—encompassing over 158,000 lines of code—whereas an application such as the nvi editor contains 29 directories, 265 files, and over 77,000 lines of code. I believe that these examples are fairly honest representations of what people are confronted with when they start a job.

Of course, even larger systems exist for scientific, military, and financial applications, but those discussed here are more familiar and should help convey an instinctive feeling for the complexity involved in systems that we all come in contact with every day.

Unfortunately, the tools we have to work with often lag behind the code we are trying to explore. There are several reasons for this, but the primary one is that very few companies have been able to build a thriving business on tools. It is much easier to make money selling software that appeals to a broad audience, which software tools do not.

Static vs. dynamic. We can use two scales to compare code spelunking tools and techniques. The first one ranges from static analysis at one end to dynamic analysis at the other. In static analysis you’re not observing a running program but are examining only the source code. A clear example is using the tools find, grep, and wc to give you a better idea of the size of the source code. A code review is an example of a static technique. You print the code, sit down at a table with it, and begin to read.
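
That first sizing pass needs nothing more than the standard Unix tools. A sketch, run here against a throwaway two-file tree standing in for a real source base:

```shell
# Static, brute-force sizing of a source tree with find, wc, and friends.
# The tree under ./src is a throwaway stand-in for a real code base.
mkdir -p src/sub
printf 'int main(void) { return 0; }\n' > src/main.c
printf 'int helper(void) { return 1; }\n' > src/sub/helper.c

find src -type d | wc -l                              # how many directories
find src -name '*.c' | wc -l                          # how many C files
find src -name '*.c' -print0 | xargs -0 cat | wc -l   # total lines of code
```

Pointed at the Linux kernel or Apache instead of ./src, the same three pipelines produce the kinds of directory, file, and line counts quoted earlier.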

On the dynamic end of the scale are tools such as debuggers. Here you run the code on real data, using one program (the debugger) to examine another as it executes. Another dynamic tool is a software oscilloscope, which displays a multithreaded program as a series of horizontal time lines—like those on a real oscilloscope—to find deadlocks, priority inversions, and other bugs common to multithreaded programs. Oscilloscopes are used mostly in embedded systems.

Brute force vs. subtlety. The second type of scale measures the level of finesse that is applied to spelunking. At one extreme is a brute-force approach, which often consumes a lot of CPU time or generates a large amount of data. An example of brute force is attempting to find a bug by using grep to locate a message that prints out near where the error occurs.
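
Concretely, that brute-force search is a one-liner. The error string and the whizzo/ tree below are made up for illustration:

```shell
# Brute force: search the whole tree for the message printed near the failure.
# Both the message text and the whizzo/ tree are hypothetical.
mkdir -p whizzo/src
printf 'fprintf(stderr, "unable to open config");\n' > whizzo/src/conf.c

grep -rn 'unable to open config' whizzo/
```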

At the other end of the finesse scale is subtlety. A subtle approach to finding a string within a program involves a tool that builds a database of all the interesting components of a code base (e.g., function name, structure definitions, and the like). You then use that tool to generate a fresh database each time you update your source from the code repository. You would also use it when you want to know something about the source, as it already has the information you need at its virtual fingertips.

Plotting your method. You can use both of these scales to create a two-dimensional graph in which the x-axis represents finesse and the y-axis runs from static to dynamic. You can then plot tools and techniques on this graph so that they can be compared. None of the terms used implies value judgments on the techniques. Sometimes a static, brute-force approach is just what you want. If it would take just as long (or longer) to set up a tool to do some subtle analysis, a brute-force approach makes sense. You don’t often get points for style when code spelunking; people just want results.

One thing to keep in mind when code spelunking is the Scout motto, “Be prepared.” A little effort spent up front getting your tools in place will always save time in the long run. After a while, all engineers develop a stable of tools that they use to solve problems. I have a few tools that I always keep on hand when code spelunking, and I make sure to prepare the ground with them whenever I start a project. These tools include Cscope and global, which I’ll discuss in more detail later.

A very common scenario in code spelunking is attempting to fix a bug in an unfamiliar piece of code. Debugging is a highly focused task: You have a program, it runs, but not correctly. You must find out why it does this, where it does this, and then repair it. What’s wrong with the program is usually your only known quantity. Finding the needle buried in the haystack is your job, so the first question must be, “Where does the program make a mistake?”

You can approach the problem in several ways. The approach you choose depends on the situation. If the program is a single file, you could probably find the bug through inspection, but, as the numbers above demonstrate, any truly useful application is much larger than a single file.

Let’s take a theoretical example. Jack takes a job with the Whizzo Company, which produces WhizzoWatcher, a media player application that can play and decode many types of entertainment content. On his first day at work (after he has signed up for health insurance, the stock plan, and the 401(k)), Jack’s boss e-mails him two bug reports to deal with.

The two bugs that Jack has just been assigned are as follows:

Bug 1: When WhizzoWatcher opens a file of type X, it immediately crashes with no output except a core file.

Bug 2: While watching a long movie on DVD (The Lord of the Rings: The Two Towers), the audio sync is lost after about two hours. This does not happen at any particular frame; it varies.

WhizzoWatcher 1.0 is a typically organic piece of software. Originally conceived as a “prototype,” it wowed the VPs and investors and was immediately rushed into production over the objections of the engineers who had written it. It has little or no design documentation, and what exists is generally inaccurate and out of date. The only real source of information on the program is the code itself. Because this was a prototype, several pieces of open source software were integrated into the larger whole and were expected to be replaced when the “real system” was funded. The total system now consists of about 500 files spread over 15 directories, some of which were written in-house and some of which were integrated.

Bug 1. This is the easiest bug for Jack to work on; the program crashes when it starts. He can run it in the debugger, and the offending line will be found at the next crash. In terms of code spelunking, he doesn’t have to look at much of the code at first, although becoming generally familiar with the code base will help him in the long run.

Unfortunately, Jack finds that although the reason for the immediate crash is obvious, what caused it is not. A common cause of crashes in C code is dereferencing a null pointer. At this point in the debugger, Jack doesn’t have the previous states of the program, only the state at the moment it crashed, which is actually a very small amount of data. A common technique is to visually inspect the code while stepping up the stack trace to see if some caller stomped on the pointer.

A debugger that could step backward, even over a small number of instructions, would be a boon in this situation. On entry into a function the debugger would create enough storage for all of the function’s local variables and the application’s global variables so that they could be recorded, statement by statement, throughout the function. When the debugger stopped (or the program crashed), it would be possible to step backward up to the start of the function to find out which statement caused the error. For now, Jack is left to read the code and hope to stumble across the real cause of the error.
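
Since this piece was written, gdb has in fact grown such a facility: process record and replay. A sketch of how Jack might use it, where the ./whizzo binary and the variable ptr are hypothetical names, and availability depends on the platform and gdb build:

```shell
# gdb's process-record facility allows stepping backward from a crash.
# ./whizzo and the variable ptr are hypothetical.
if command -v gdb >/dev/null 2>&1; then
  gdb --version | head -1
  # A session would look like:
  #   (gdb) start                # run to main and stop
  #   (gdb) record               # begin recording execution
  #   (gdb) continue             # run forward until the crash
  #   (gdb) reverse-step         # walk backward, statement by statement,
  #   (gdb) print ptr            #   to find where the pointer was clobbered
else
  echo "gdb not installed; commands shown for reference only"
fi
```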

Bug 2. Attacking Bug 2, where the code doesn’t crash but only produces incorrect results, is more difficult because there isn’t a trivial way to get a debugger to show you when to stop the program and inspect it. If Jack’s debugger supports conditional breakpoints and watchpoints, then these are his next line of defense. A watchpoint or a conditional breakpoint tells the debugger to stop when something untoward happens to a variable, and allows Jack to inspect the code at a point closest to where the problem occurs.
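
A sketch of that technique, where the toy program and the variable name audio_skew are hypothetical stand-ins for whatever value drifts in Bug 2:

```shell
# Watchpoints in gdb: stop the moment a variable changes, optionally only
# when a condition holds. audio_skew is a hypothetical stand-in.
cat > sync.c <<'EOF'
int audio_skew = 0;                 /* the value that mysteriously drifts */
int main(void) { for (int i = 0; i < 5; i++) audio_skew += i; return 0; }
EOF
if command -v gcc >/dev/null 2>&1 && command -v gdb >/dev/null 2>&1; then
  gcc -g -O0 -o sync sync.c
  gdb -batch -ex 'watch audio_skew' \
      -ex 'condition 1 audio_skew > 3' \
      -ex 'run' -ex 'print audio_skew' ./sync \
    || echo "debugger blocked in this environment"
else
  echo "gcc/gdb not installed; commands shown for reference only"
fi
```

The condition keeps the debugger quiet until the variable actually goes wrong, so Jack stops as close as possible to the failure rather than on every update.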

Once Jack has found the problem, it’s time to fix it. The key is to fix the bug without breaking anything else in the system. A thorough round of testing is one way to address this problem, but he’ll feel more comfortable making a fix if he can find out more about what it affects in the system. Jack’s next question ought to be, “Which routines call the routine I want to fix?”

Attempting to answer this question with the debugger will not work, because Jack cannot tell from the debugger all the sources that will call the offending routine. This is where a subtle, static approach bears fruit. The tool Cscope builds a database out of a chunk of code and allows him to do the following:

1. Find a C symbol

2. Find a global definition

3. Find functions called by a function

4. Find functions calling a function

5. Find a text string

6. Change a text string

7. Find an egrep pattern

8. Find a file

9. Find all files that #include this file

You will note that item 4, “Find functions calling a function,” answers his question. If the routine he’s about to fix modifies nonlocal variables or structures, he must next answer the question, “With which functions does my function share data?” This would never come up in a “properly designed program” (i.e., one written from scratch). Jack would never use a horde of global variables to control his program, because he knows what a nightmare it would be to maintain. Right?
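
Cscope can answer that caller question from a script as well as from its full-screen interface. A sketch in line-oriented mode, where the whizzo/ tree and the function names are hypothetical:

```shell
# Ask cscope Jack's question: which functions call decode_frame()?
# The whizzo/ tree and the function names are hypothetical.
mkdir -p whizzo
cat > whizzo/player.c <<'EOF'
void decode_frame(void) {}
void play(void) { decode_frame(); }
void seek(void) { decode_frame(); }
EOF
if command -v cscope >/dev/null 2>&1; then
  (cd whizzo && cscope -b -R)                    # build the cscope.out database
  (cd whizzo && cscope -d -L -3 decode_frame)    # -L -3: find callers
else
  echo "cscope not installed; commands shown for reference only"
fi
```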

Of course, this is code spelunking, so Jack is already past that point. Using a subtle tool such as Cscope again is tempting, but this turns out to be a case where brute force is his best hope. Generating a list of all the global references in a program (filename and line number) is certainly something a compiler could do, but none does. Although this option would hardly be used in the creation of new code, it would make the process of debugging old code far easier. Jack will have to make do with a combination of find and grep to figure out where all these variables reside in the program.
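
A sketch of that brute-force sweep, where frame_count is a hypothetical global shared across two files:

```shell
# Brute force: list every file and line that touches a shared global.
# frame_count and the two files under globals/ are hypothetical.
mkdir -p globals
printf 'extern int frame_count;\nvoid tick(void) { frame_count++; }\n' > globals/a.c
printf 'int frame_count;\n' > globals/b.c

find globals -name '*.[ch]' -print0 | xargs -0 grep -n 'frame_count'
```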

Code spelunking isn’t something you do only when debugging; it’s how you perform a good code review, reverse-engineer a design document, and conduct a security audit.

During an age when many people use computers for financial and personal transactions, auditing code for security holes has (or should have) become commonplace. In order to close these security holes, you have to know what the common attacks are, as well as which sections of the code are vulnerable to attack. Although the attacks are updated almost daily on the Bugtraq mailing list (http://www.securityfocus.com), what we’re concerned with here is finding the holes themselves.

Consider the following example. Jill takes a job with a large bank that serves most of its customers electronically, over the Internet. On her first day her boss tells her that the bank is doing a security audit of its entire system, which is implemented on several different types of hardware and operating systems. One set of Unix servers handles incoming web traffic and then makes requests to a mainframe backend that actually manages the real money.

One possible security hole occurs when a program stores a user’s private data in a way that makes it available to anyone with access to the machine. An example of this is storing a password as plain text in a file or registry key. If Jill had written the software or had access to the person who did, she could quickly find out where the program stores passwords simply by asking. Unfortunately, the person who wrote the login scripts was laid off when the bank moved its headquarters six months ago. Jill will have to find out how this is done without the author’s help.

Unlike in the debugging case, few tools are available to tell Jill something as specific as, “Where does the program store data X?” Depending on the structure of the program, this could happen anywhere—and often does. Jill can tackle this problem with either a brute-force or a subtle approach. To do the former she would litter the code with debugging statements (i.e., printfs or language equivalent) to find out where the program stores the password. She probably has a few guesses to narrow the field from the entire program down to a few or a dozen locations, but this is still a good deal of work.

A subtler approach would be to run the program in the debugger, attempt to stop the program right after the user enters a password, and then follow the execution until the program performs the storage operation.

A typical way to attack a program is to give it bad input. To find these vulnerabilities, Jill must ask the question, “Where does the program read data from an untrustworthy source?” Most people would immediately think that any code dealing with network data would be vulnerable, and they would be right—in part. With the advent of networked file systems, the fact that the code is executing a read() statement does not imply that it is reading from a local (i.e., trustable) source.

Whereas Jack’s debugging in my earlier example attempted to zero in on a problem, Jill’s code audit is more of a “fan-out.” She wants to see as much of the code as possible and understand how it interacts both with its own modules (“How is data passed around?”) and with external entities (“Where do we read and write data?”). This can be an even more daunting task than finding a needle in a haystack and may be more like labeling all the individual straws. In this case, the most important thing for Jill to do is to find the most likely places that will cause problems—that is, the places that are executed most often.

There is a tool that, although not originally intended for this purpose, can help to focus Jill’s efforts: a code profiler. For example, gprof was originally written to tell an engineer which routines within the program are using up all the CPU time and are, therefore, candidates for optimization. The program is run with a workload (let a user bang on it or have it service requests from the network), and then the output is analyzed. A profiler will tell Jill which routines are being called most often. These are obviously the routines to check first. There is no reason for Jill to pore over the large portion of the code that is called infrequently, while the routines called most often may have gaping holes.
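
A sketch of that profiling pass on a toy program, where hot.c and its busy() loop stand in for whatever the real workload exercises:

```shell
# Profile with gprof: compile with -pg, run a workload, read the flat
# profile. hot.c is a hypothetical stand-in for the real server.
cat > hot.c <<'EOF'
volatile long sink;
static void busy(void) { for (long i = 0; i < 1000000; i++) sink += i; }
int main(void) { for (int i = 0; i < 100; i++) busy(); return 0; }
EOF
if command -v gcc >/dev/null 2>&1 && command -v gprof >/dev/null 2>&1; then
  gcc -pg -O0 -o hot hot.c
  ./hot                              # execution writes gmon.out
  gprof -b hot gmon.out | head -8    # flat profile: hottest routines first
else
  echo "gcc/gprof not installed; commands shown for reference only"
fi
```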

Routines that pass bad arguments to system calls are another common security problem. This is most often exploited against network servers in an effort to gain control over a remote machine. Several commercial tools attempt to discover these kinds of problems. A quick and dirty approach is to use a system call tracer, such as ktrace or truss, to make a record of what system calls are executed and what their arguments are. This can give you a good idea of where possible errors lie.
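
A sketch of that quick-and-dirty trace. The exact tool varies by system (ktrace/kdump on the BSDs, truss on Solaris, strace on Linux), and tracing may be blocked in restricted environments:

```shell
# Record the system calls (and their arguments) a program makes.
if command -v strace >/dev/null 2>&1; then
  strace -e trace=openat,read,write -o trace.log cat /dev/null \
    && head -5 trace.log || echo "tracing blocked in this environment"
elif command -v ktrace >/dev/null 2>&1; then
  ktrace -f trace.out cat /dev/null && kdump -f trace.out | head -5
else
  echo "no syscall tracer installed; commands shown for reference only"
fi
```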

Code spelunking is about asking questions. The challenge is to get your fingers around the right parts of the code and find the answers without having to look at every line (which is a near impossibility anyway). There is one tool I haven’t yet mentioned in this article, and that’s the one sitting inside your head. You can begin applying good engineering practices even though they weren’t used to create the code you’re spelunking.

Keeping a journal of notes on what you’ve found and how things work will help you to create, and keep in mind, a picture of how the software you’re exploring works. Drawing pictures is also a great help; these should be stored electronically along with your notes. Good experiment design of the type you may have learned—and then promptly forgotten—in physics class is also helpful. Just beating on a piece of code until it works is not an efficient way of figuring it out. Setting up an experiment to see why the code behaves a certain way may take a lot of thought, but usually very little code.

Finally, one of my favorite tools is the “stupid programmer trick.” You get a colleague to look at the code, and then you attempt to explain to him or her what the code does. Nine times out of ten your colleague won’t even need to say anything. Fairly quickly you realize what’s going on, smack yourself on the forehead, say thank you, and go back to work. Through the process of systematically explaining the code out loud, you have figured out the problem.

There is no one tool that will make understanding large code bases easier, but I hope that I’ve shown you a few ways to approach this task. I suspect you’ll find more on your own.

Tool Resources

Global

http://www.gnu.org/software/global/

This is the tool I apply to every source base that I can. Global is really a pair of tools: gtags and htags. The first, gtags, builds a database of interesting connections based on a source tree in C, C++, Java, or YACC. Once the database is built, you can use your editor (both Emacs and vi are supported) to jump around within the source. Want to know where the function you’re calling is defined? Just jump to it. The second tool is htags, which takes the source code and the database generated by gtags and creates an HTML-browsable version of the source code in a subdirectory. This means that, even if you don’t use Emacs or vi, you can easily jump around the source code finding interesting connections. Building the database is relatively speedy, even for large code bases, and should be done each time you update your sources from your source-code control system.
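
A sketch of preparing the ground with Global, where the proj/ tree and setup() are hypothetical:

```shell
# Index a tree with gtags, then query it with global.
# proj/ and setup() are hypothetical stand-ins.
mkdir -p proj
printf 'int setup(void) { return 0; }\nint main(void) { return setup(); }\n' > proj/main.c
if command -v gtags >/dev/null 2>&1; then
  (cd proj && gtags)                # build the GTAGS/GRTAGS/GPATH databases
  (cd proj && global -x setup)      # where is setup() defined?
  (cd proj && global -rx setup)     # which lines reference it?
else
  echo "GNU Global not installed; commands shown for reference only"
fi
```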

Exuberant Ctags

http://ctags.sourceforge.net

Works on dozens of languages: Ant, Asm, Asp, Awk, Basic, BETA, C, C++, C#, Cobol, DosBatch, Eiffel, Erlang, Flex, Fortran, HTML, Java, JavaScript, Lisp, Lua, Make, MatLab, OCaml, Pascal, Perl, PHP, Python, REXX, Ruby, Scheme, Sh, SLang, SML, SQL, Tcl, Tex, Vera, Verilog, VHDL, Vim, YACC. Very useful in a multilanguage environment.

Cscope

http://cscope.sourceforge.net/

Cscope was originally written at AT&T Bell Labs in the 1970s. It answers many questions, such as: Where is this symbol? Where is something globally defined? Where are all the functions called by this function? Where are all the functions that call this function? Like Global, it first builds a database from your source code. You can then use a command-line tool, Emacs, or one of a few GUIs written to work with the system to get your questions answered.

gprof

https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_mono/gprof.html

This is the standard profiling tool on most open source Unix systems. Its output is a list of routines ordered by how often they were called and how much of the program’s CPU time was spent executing them. This can be useful in figuring out where to start looking for security holes in a program.

ktrace

This is a standard tool on open source operating systems. The name stands for “kernel trace.” It will give you a listing of all the system calls made by a program. The output includes the arguments and return values given to and received from the calls. You use ktrace by running a program under it and then dumping the output with kdump. It is available on all open source Unix operating systems.

DTrace

First developed on Solaris, but now available on FreeBSD, Linux, and Windows. The best reference on this tool is still the book by Brendan Gregg and Jim Mauro.1

Valgrind

https://valgrind.org

Useful for finding all kinds of memory leaks in C and C++ programs.
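
A minimal sketch of a leak check, run against a deliberately leaky toy program:

```shell
# Valgrind's memcheck reports heap blocks that were never freed.
# leak.c is a deliberately broken toy program.
cat > leak.c <<'EOF'
#include <stdlib.h>
int main(void) { malloc(32); return 0; }   /* allocated, never freed */
EOF
if command -v gcc >/dev/null 2>&1 && command -v valgrind >/dev/null 2>&1; then
  gcc -g -O0 -o leak leak.c
  valgrind --leak-check=full ./leak 2>&1 | grep 'definitely lost'
else
  echo "valgrind not installed; commands shown for reference only"
fi
```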
