Home > Articles

This chapter is from the book

6.2 Sorting and Searching Functions

Sorting and searching are two fundamental operations for which a need arises continually in many applications. The C library provides a number of standard interfaces for performing these tasks. They are general purpose in nature, suitable for working with any kind of data, as opposed to being tuned for particular applications.

All the routines share a common theme: data is managed through void * pointers, and user-provided functions supply ordering. Note also that these APIs apply to in-memory data. Sorting and searching structures in files is considerably more involved and beyond the scope of an introductory text such as this one. (However, the sort command works well for text files; see the sort(1) manpage. Sorting binary files requires that a special-purpose program be written.)

Because no one algorithm works well for all applications, there are several different sets of library routines for maintaining searchable collections of data. This chapter covers only one simple interface for searching. A more advanced interface is described in Section 16.4, “Advanced Searching with Binary Trees,” page 575. Furthermore, we purposely don’t explain the underlying algorithms, since this is a book on APIs, not algorithms and data structures. What’s important to understand is that you can treat the APIs as “black boxes” that do a particular job without needing to understand the details of how they do the job.

6.2.1 Sorting: qsort()

Sorting is accomplished with qsort():

#include <stdlib.h>                                                ISO C

void qsort(void *base, size_t nmemb, size_t size,
           int (*compare)(const void *, const void *));

The name qsort() comes from C. A. R. Hoare’s Quicksort algorithm, which was used in the initial Unix implementation. (Nothing in the POSIX standard dictates the use of this algorithm for qsort(). The GLIBC implementation uses a highly optimized combination of Quicksort and Insertion Sort.)

qsort() sorts arrays of arbitrary objects. It works by shuffling opaque chunks of memory from one spot within an array to another and relies on you, the programmer, to provide a comparison function that allows it to determine the ordering of one array element relative to another. The arguments are as follows:

  • void *base

    The address of the beginning of the array.

  • size_t nmemb

    The total number of elements in the array.

  • size_t size

    The size of each element in the array. The best way to obtain this value is with the C sizeof operator.

  • int (*compare)(const void *, const void *)

    A possibly scary declaration for a function pointer. It says that “compare points to a function that takes two ‘const void *’ parameters, and returns an int.”

Most of the work is in writing a proper comparison function. The return value should mimic that of strcmp(): less than zero if the first value is “less than” the second, zero if they are equal, and greater than zero if the first value is “greater than” the second. It is the comparison function that defines the meaning of “less than” and “greater than” for whatever it is you’re sorting. For example, to compare two double values, we could use this function:

int
dcomp(const void *d1p, const void *d2p)
{
    const double *d1, *d2;

    d1 = (const double *) d1p;                Cast pointers to right type
    d2 = (const double *) d2p;

    if (*d1 < *d2)                            Compare and return right value
        return -1;
    else if (*d1 > *d2)
        return 1;
    else if (*d1 == *d2)
        return 0
    else
        return -1;     /* NaN sorts before real numbers */
}

This shows the general boilerplate for a comparison function: convert the arguments from void * to pointers to the type being compared and then return a comparison value.

For floating-point values, a simple subtraction such as ‘return *d1 - *d2’ doesn’t work, particularly if one value is very small or if one or both values are special “not a number” or “infinity” values. Thus we have to do the comparison manually, including taking into account the not-a-number value (which doesn’t even compare equal to itself!).

6.2.1.1 Example: Sorting Employees

For more complicated structures, a more involved function is necessary. For example, consider the following (rather trivial) struct employee:

struct employee {
    char lastname[30];
    char firstname[30];
    long emp_id;
    time_t start_date;
};

We might write a function to sort employees by last name, first name, and ID number:

int
emp_name_id_compare(const void *e1p, const void *e2p)
{
    const struct employee *e1, *e2;
    int last, first;

    e1 = (const struct employee *) e1p;                          Convert pointers
    e2 = (const struct employee *) e2p;

    if ((last = strcmp(e1->lastname, e2->lastname)) != 0)        Compare last names
        return last;                                             Last names differ

    /* same last name, check first name */
    if ((first = strcmp(e1->firstname, e2->firstname)) != 0)     Compare first names
        return first;                                            First names differ

    /* same first name, check ID numbers */
    if (e1->emp_id < e2->emp_id)                                 Compare employee ID
        return -1;
    else if (e1->emp_id == e2->emp_id)
        return 0;
    else
        return 1;
}

The logic here is straightforward, initially comparing on last names, then first names, and then using the employee ID number if the two names are the same. By using strcmp() on strings, we automatically get the right kind of negative/zero/positive value to return.

The employee ID comparison can’t just use subtraction: suppose long is 64 bits and int is 32 bits, and the two values differ only in the upper 32 bits (say the lower 32 bits are zero). In such a case, the subtraction result would automatically be cast to int, throwing away the upper 32 bits and returning an incorrect value.

Simply by using a different function, we can sort employees by seniority:

int
emp_seniority_compare(const void *e1p, const void *e2p)
{
    const struct employee *e1, *e2;
    double diff;

    e1 = (const struct employee *) e1p;                Cast pointers to correct type
    e2 = (const struct employee *) e2p;

    diff = difftime(e1->start_date, e2->start_date);   Compare times
    if (diff < 0)
        return -1;
    else if (diff > 0)
        return 1;
    else
        return 0;
}

For maximum portability we have used difftime(), which returns the difference in seconds between two time_t values. For this specific case, a cast such as

return (int) difftime(e1->start_date, e2->start_date);

should do the trick, since time_t values are within reasonable ranges. Nevertheless, we instead use a full three-way if statement, just to be safe.

Here is a sample data file, listing nine U.S. presidents:

$ cat presdata.txt
Trump Donald 47 1737396000                    Last name, First name, President number, Inauguration
Biden Joseph 46 1611165600
Trump Donald 45 1484935200
Obama Barack 44 1232474400
Bush George 43 980013600
Clinton William 42 727552800
Bush George 41 601322400
Reagan Ronald 40 348861600
Carter James 39 222631200

ch-general1-sortemp.c shows a simple program that reads this file into a struct employee array and then sorts it, using the two different comparison functions just presented:

  1  /* ch-general1-sortemp.c --- Demonstrate qsort() with two comparison functions. */
  2
  3  #include <stdio.h>
  4  #include <stdlib.h>
  5  #include <string.h>
  6  #include <time.h>
  7
  8  struct employee {
  9      char lastname[30];
 10       char firstname[30];
 11       long emp_id;
 12       time_t start_date;
 13  };
 14
 15  /* emp_name_id_compare --- compare by name, then by ID */
 16
 17  int
 18  emp_name_id_compare(const void *e1p, const void *e2p)
 19  {
     … as shown previously, omitted to save space …
 41  }
 42
 43  /* emp_seniority_compare --- compare by seniority */
 44
 45  int
 46  emp_seniority_compare(const void *e1p, const void *e2p)
 47  {
     … as shown previously, omitted to save space …
 61  }
 62
 63  /* main --- demonstrate sorting */
 64
 65  int
 66  main(void)
 67  {
 68  #define NPRES 20
 69      struct employee presidents[NPRES];
 70      int i, npres;
 71      char buf[BUFSIZ];
 72
 73      /* Very simple code to read data: */
 74      for (npres = 0; npres < NPRES && fgets(buf, BUFSIZ, stdin) != NULL;
 75                      npres++) {
 76          sscanf(buf, "%s %s %ld %ld\n",
 77                  presidents[npres].lastname,
 78                  presidents[npres].firstname,
 79                  & presidents[npres].emp_id,
 80                  & presidents[npres].start_date);
 81      }
 82
 83      /* npres is now number of actual lines read. */
 84
 85      /* First, sort by name */
 86      qsort(presidents, npres, sizeof(struct employee), emp_name_id_compare);
 87
 88      /* Print output */
 89      printf("Sorted by name:\n");
 90      for (i = 0; i < npres; i++)
 91          printf("\t%s %s\t%ld\t%s",
 92                  presidents[i].lastname,
 93                  presidents[i].firstname,
 94                  presidents[i].emp_id,
 95                  ctime(& presidents[i].start_date));
 96
 97      /* Now, sort by seniority */
 98      qsort(presidents, npres, sizeof(struct employee), emp_seniority_compare);
 99
100      /* And print again */
101      printf("Sorted by seniority:\n");
102      for (i = 0; i < npres; i++)
103          printf("\t%s %s\t%ld\t%s",
104                  presidents[i].lastname,
105                  presidents[i].firstname,
106                  presidents[i].emp_id,
107                  ctime(& presidents[i].start_date));
108  }

Lines 74–81 read in the data. Note that any use of scanf() requires “well behaved” input data. If, for example, any name is more than 29 characters, there’s a problem. In this case, we’re safe, but production code must be considerably more careful.

Line 86 sorts the data by name and employee ID, and then lines 89–95 print the sorted data. Similarly, line 98 re-sorts the data, this time by seniority, with lines 101–107 printing the results. When compiled and run, the program produces the following results:

$ ch-general1-sortemp < presdata.txt
Sorted by name:
        Biden Joseph     46       Wed Jan 20 13:00:00 2021
        Bush George      41       Fri Jan 20 13:00:00 1989
        Bush George      43       Sat Jan 20 13:00:00 2001
        Carter James     39       Thu Jan 20 13:00:00 1977
        Clinton William  42       Wed Jan 20 13:00:00 1993
        Obama Barack     44       Tue Jan 20 13:00:00 2009
        Reagan Ronald    40       Tue Jan 20 13:00:00 1981
        Trump Donald     45       Fri Jan 20 13:00:00 2017
        Trump Donald     47       Mon Jan 20 13:00:00 2025
Sorted by seniority:
        Carter James     39       Thu Jan 20 13:00:00 1977
        Reagan Ronald    40       Tue Jan 20 13:00:00 1981
        Bush George      41       Fri Jan 20 13:00:00 1989
        Clinton William  42       Wed Jan 20 13:00:00 1993
        Bush George      43       Sat Jan 20 13:00:00 2001
        Obama Barack     44       Tue Jan 20 13:00:00 2009
        Trump Donald     45       Fri Jan 20 13:00:00 2017
        Biden Joseph     46       Wed Jan 20 13:00:00 2021
        Trump Donald     47       Mon Jan 20 13:00:00 2025

(We’ve used 1:00 p.m. as an approximation for the time when each president started working.)4

One point is worth mentioning: qsort() rearranges the data in the array. If each array element is a large structure, a lot of data will be copied back and forth as the array is sorted. It may pay instead to set up a separate array of pointers, each of which points at one element of the array, and then use qsort() to sort the pointer array, accessing the unsorted data through the sorted pointers.

The price paid is the extra memory to hold the pointers and modification of the comparison function to use an extra pointer indirection when comparing the structures. The benefit returned can be a considerable speedup, since only a four- or eight-byte pointer is moved around at each step, instead of a large structure. (Our struct employee is 68 bytes in size on a 32-bit system and 80 bytes in size on a 64-bit one. Swapping four-byte pointers moves 17 times less data than does swapping structures. Even swapping eight-byte pointers moves 10 times less data than does swapping structures.) For millions of in-memory structures, the difference can be significant, and this may be even more important if you’re developing for an embedded system with very limited memory.

6.2.1.2 Example: Sorting Directory Contents

In Section 5.3, “Reading Directories,” page 123, we demonstrated that directory entries are returned in physical directory order. Most of the time, it’s much more useful to have directory contents sorted in some fashion, such as by name or by modification time. Several routines make it easy to do this, using qsort() as the underlying sorting agent:

#include <dirent.h>                                                    POSIX

int scandir(const char *dir, struct dirent ***namelist,
            int (*select)(const struct dirent *),
            int (*compare)(const struct dirent **, const struct dirent **));
int alphasort(const struct dirent **a, const struct dirent **b);

int versionsort(const struct dirent **a, const struct dirent **b);     GLIBC

The scandir() and alphasort() functions were first made available in 4.2 BSD. They have always been widely supported, and are now standardized by POSIX. versionsort() is a GNU extension.

scandir() reads the directory named by dir, creates an array of struct dirent pointers by using malloc(), and sets *namelist to point to the beginning of that array. Both the array of pointers and the pointed-to struct dirent structures are allocated with malloc(); it is up to the calling code to use free() to avoid memory leaks.

Use the select function pointer to choose entries of interest. When this value is NULL, all valid directory entries are included in the final array. Otherwise, (*select)() is called for each entry, and those entries for which it returns nonzero (true) are included in the array.

The compare function pointer compares two directory entries. It is passed to qsort() for use in sorting.

alphasort() compares file names lexicographically. It uses the strcoll() function for comparison. strcoll() is similar to strcmp() but takes locale-related sorting rules into consideration (see Section 15.2.3, “String Collation: strcoll() and strxfrm(),” page 512).

versionsort() is a GNU extension, that uses the GNU strverscmp() function to compare file names (see strverscmp(3)). To make a long story short, this function understands common file name versioning conventions and compares appropriately.

ch-general1-sortdir.c shows a program similar to ch-fileinfo-catdir.c. However, it uses scandir() and alphasort() to do the work:

 1  /* ch-general1-sortdir.c --- Demonstrate scandir(), alphasort(). */
 2
 3  #include <stdio.h>              /* for printf() etc. */
 4  #include <errno.h>              /* for errno */
 5  #include <stdlib.h>             /* for free() */
 6  #include <string.h>             /* for strerror() */
 7  #include <sys/types.h>          /* for system types */
 8  #include <dirent.h>             /* for directory functions */
 9
10  char *myname;
11  int process(const char *dir);
12
13  /* main --- loop over directory arguments */
14
15  int
16  main(int argc, char **argv)
17  {
18      int i;
19      int errs = 0;
20
21      myname = argv[0];
22
23      if (argc == 1)
24          errs = process(".");    /* default to current directory */
25      else
26          for (i = 1; i < argc; i++)
27              errs += process(argv[i]);
28
29      return (errs != 0);
30  }
31
32  /* nodots --- ignore dot files, for use by scandir() */
33
34  int
35  nodots(const struct dirent *dp)
36  {
37      return (dp->d_name[0] != '.');
38  }
39
40  /*
41   * process --- do something with the directory, in this case,
42   *             print inode/name pairs on standard output.
43   *             Return 0 if all OK, 1 otherwise.
44   */
45
46  int
47  process(const char *dir)
48  {
49      DIR *dp;
50      struct dirent **entries;
51      int nents, i;
52
53      nents = scandir(dir, & entries, nodots, alphasort);
54      if (nents < 0) {
55          fprintf(stderr, "%s: scandir failed: %s\n", myname,
56                           strerror(errno));
57          return 1;
58      }
59
60      for (i = 0; i < nents; i++) {
61          printf("%8ld %s\n", entries[i]->d_ino, entries[i]->d_name);
62          free(entries[i]);
63      }
64
65      free(entries);
66
67      return 0;
68  }

The main() program (lines 1–30) follows the standard boilerplate we’ve used before. The nodots() function (lines 34–38) acts as the select parameter, choosing only file names that don’t begin with a period.

The process() function (lines 46–68) is quite simple, with scandir() doing most of the work. Note how each element is released separately with free() (line 62) and how the entire array is also released (line 65).

When run, the directory contents do indeed come out in sorted order, without . and ..:

$ ch-general1-sortdir                    Default action displays current directory
 9207429 00-preface.texi
 9208690 01-intro.texi
 9205291 02-cmdline.texi
 9207424 03-memory.texi
 9207611 04-fileio.texi
 9207616 05-fileinfo.texi
 9208272 06-general1.texi
...

6.2.2 Binary Searching: bsearch()

A linear search is pretty much what it sounds like: you start at the beginning, and check each element of the array being searched until you find what you need. For something simple like finding integers, this usually takes the form of a for loop. Consider this function:

/* ifind --- linear search, return index if found or -1 if not */

int
ifind(int x, const int array[], size_t nelems)
{
    size_t i;

    for (i = 0; i < nelems; i++)
        if (array[i] == x) /* found it */
            return i;

    return -1;
}

The advantage to linear searching is that it’s simple; it’s easy to write the code correctly the first time. Furthermore, it always works. Even if elements are added to the end of the array or removed from the array, there’s no need to sort the array.

The disadvantage to linear searching is that it’s slow. On average, for an array containing nelems elements, a linear search for a random element does ‘nelems / 2’ comparisons before finding the desired element. This becomes prohibitively expensive, even on modern high-performance systems, as nelems becomes large. Thus, you should only use linear searching on small arrays.

Unlike a linear search, binary searching requires that the input array already be sorted. The disadvantage here is that if elements are added, the array must be re-sorted before it can be searched. (When elements are removed, the rest of the array contents must still be shuffled down. This is not as expensive as re-sorting, but it can still involve a lot of data motion.)

The advantage to binary searching—and it’s a significant one—is that binary searching is blindingly fast, requiring at most ⌊log2(N)⌋ + 1 comparisons, where N is the number of elements in the array. The bsearch() function is declared as follows:

#include <stdlib.h>                                               ISO C

void *bsearch(const void *key, const void *base, size_t nmemb,
              size_t size, int (*compare)(const void *, const void *));

The parameters and their purposes are similar to those of qsort():

  • const void *key

    The object being searched for in the array.

  • const void *base

    The start of the array.

  • size_t nmemb

    The number of elements in the array.

  • size_t size

    The size of each element, obtained with sizeof.

  • int (*compare)(const void *, const void *)

    The comparison function. It must work the same way as the qsort() comparison function, returning negative/zero/positive according to whether the first parameter is less than/equal to/greater than the second one.

bsearch() returns NULL if the object is not found. Otherwise, it returns a pointer to the found object. If more than one array element matches key, it is unspecified which one is returned. Thus, as with qsort(), make sure that the comparison function accounts for all relevant parts of the searched data structure.

ch-general1-searchemp.c shows bsearch() in practice, extending the struct employee example used previously:

 1  /* ch-general1-searchemp.c --- Demonstrate bsearch(). */
 2
 3  #include <stdio.h>
 4  #include <errno.h>
 5  #include <string.h>
 6  #include <stdlib.h>
 7  #include <time.h>
 8
 9  struct employee {
10       char lastname[30];
11       char firstname[30];
12       long emp_id;
13       time_t start_date;
14  };
15
16  /* emp_id_compare --- compare by ID */
17
18  int
19  emp_id_compare(const void *e1p, const void *e2p)
20  {
21      const struct employee *e1, *e2;
22
23      e1 = (const struct employee *) e1p;
24      e2 = (const struct employee *) e2p;
25
26      if (e1->emp_id < e2->emp_id)
27          return -1;
28      else if (e1->emp_id == e2->emp_id)
29          return 0;
30      else
31          return 1;
32  }
33
34  /* print_employee --- print an employee structure */
35
36  void
37  print_employee(const struct employee *emp)
38  {
39      printf("%s %s\t%ld\t%s", emp->lastname, emp->firstname,
40              emp->emp_id, ctime(& emp->start_date));
41  }

Lines 9–14 define the struct employee; it’s the same as before. Lines 18–32 serve as the comparison function, for both qsort() and bsearch(). It compares on employee ID number only. Lines 36–41 define print_employee(), which is a convenience function for printing the structure, since this is done from multiple places. The code continues:

 43  /* main --- demonstrate sorting */
 44
 45  int
 46  main(int argc, char **argv)
 47  {
 48  #define NPRES 20
 49      struct employee presidents[NPRES];
 50      int i, npres;
 51      char buf[BUFSIZ];
 52      struct employee *the_pres;
 53      struct employee key;
 54      int id;
 55      FILE *fp;
 56
 57      if (argc != 2) {
 58          fprintf(stderr, "usage: %s datafile\n", argv[0]);
 59          exit(1);
 60      }
 61
 62      if ((fp = fopen(argv[1], "r")) == NULL) {
 63          fprintf(stderr, "%s: %s: could not open: %s\n", argv[0],
 64                          argv[1], strerror(errno));
 65          exit(1);
 66      }
 67
 68      /* Very simple code to read data: */
 69      for (npres = 0; npres < NPRES && fgets(buf, BUFSIZ, fp) != NULL;
 70                      npres++) {
 71          sscanf(buf, "%s %s %ld %ld",
 72                  presidents[npres].lastname,
 73                  presidents[npres].firstname,
 74                  & presidents[npres].emp_id,
 75                  & presidents[npres].start_date);
 76      }
 77      fclose(fp);
 78
 79      /* npres is now number of actual lines read. */
 80
 81      /* First, sort by id */
 82      qsort(presidents, npres, sizeof(struct employee), emp_id_compare);
 83
 84      /* Print output */
 85      printf("Sorted by ID:\n");
 86      for (i = 0; i < npres; i++) {
 87          putchar('\t');
 88          print_employee(& presidents[i]);
 89      }
 90
 91      for (;;) {
 92          printf("Enter ID number: ");
 93          if (fgets(buf, BUFSIZ, stdin) == NULL)
 94                  break;
 95
 96          sscanf(buf, "%d\n", & id);
 97          key.emp_id = id;
 98          the_pres = (struct employee *) bsearch(& key, presidents, npres,
 99                          sizeof(struct employee), emp_id_compare);
100
101          if (the_pres != NULL) {
102                  printf("Found: ");
103                  print_employee(the_pres);
104          } else
105                  printf("Employee with ID %d not found!\n", id);
106      }
107
108      putchar('\n');  /* Print a newline on EOF. */
109
110      exit(0);
111  }

The main() function starts with argument checking (lines 57–60). It then reads the data from the named file (lines 68–77). Standard input cannot be used for the employee data, since that is reserved for prompting the user for the employee ID to search for.

Lines 82–89 sort and print the employees. The program then goes into a loop, starting on line 91. It prompts for an employee ID number, exiting the loop upon end-of-file. To search the array, we use the struct employee named key. It’s enough to set just its emp_id field to the entered ID number; none of the other fields are used in the comparison (line 97).

If an entry is found with the matching key, bsearch() returns a pointer to it. Otherwise it returns NULL. The return is tested on line 101, and appropriate action is then taken. Finally, line 108 prints a newline character so that the system prompt will come out on a fresh line. Here’s a transcript of what happens when the program is compiled and run:

$ ch-general1-searchemp presdata.txt                         Run the program
Sorted by ID:
        Carter James    39      Thu Jan 20 13:00:00 1977
        Reagan Ronald   40      Tue Jan 20 13:00:00 1981
        Bush George     41      Fri Jan 20 13:00:00 1989
        Clinton William 42      Wed Jan 20 13:00:00 1993
        Bush George     43      Sat Jan 20 13:00:00 2001
        Obama Barack    44      Tue Jan 20 13:00:00 2009
        Trump Donald    45      Fri Jan 20 13:00:00 2017
        Biden Joseph    46      Wed Jan 20 13:00:00 2021
        Trump Donald    47      Mon Jan 20 13:00:00 2025
Enter ID number:  45                                         Enter a valid number
Found: Trump Donald     45      Fri Jan 20 13:00:00 2017     It's found
Enter ID number:  29                                         Enter an invalid number
Employee with ID 29 not found!                               It's not found
Enter ID number:  40                                         Try another good one
Found: Reagan Ronald    40      Tue Jan 20 13:00:00 1981     This one is found too
Enter ID number:  ^D                                         CTRL-D entered for EOF
$                                                            Ready for next command

Additional, more advanced APIs for searching data collections are described in Section 16.4, “Advanced Searching with Binary Trees,” page 575.

6.2.2.1 Example: Sorting and Searching Together

Sorting data and searching it go together. We next present an example from real-world code, drawn this time from the current version of the original Unix awk program.5 The code comes from that portion of awk that matches regular expressions.

By way of introduction, regular expression matching in awk uses a technique based on automata theory. In particular, it uses a Deterministic Finite Automaton, or DFA. The DFA matcher starts out in an initial state. Based on the input characters, it moves from one state to the next until it arrives at the terminating state or determines that it cannot do so. In the former case, the regular expression it represents has matched the input text; in the latter, it has not. Figure 6.2 shows the transitions among states for the regular expression ‘(a|b)c’.

FIGURE 6.2

Figure 6.2: A very simple DFA

State 1 is the initial state, and State 4 is the final state. The figure is very simplistic. From any given state, it may be possible to move to multiple following states, based on the current input character. This is termed a state transition.

In the early days of computing, when characters were always a single byte in size, the representation of a state transition was very simple. One could use a two-dimensional table, where the row is indexed by the current state and the column is indexed by the current input character. The value stored in the table would indicate the next state to move to:

next_state = transition[current_state][current_char];

In such a representation, each row took up a maximum of 256 entries. And indeed, awk used such a representation:

...
if ((ns = f->gototab[s][*p]) != 0)
    s = ns;
...

Here, f is the DFA, ns is the next state, s is the current state, and *p is the current input character. gototab is short for “goto table”: a table indicating which state to go to next.

The problem is that in today’s world, characters are no longer just a single byte in size. In particular, the Unicode character set has code points (character values) that are normally represented with at least four bytes, and are encoded in files using one to four bytes. (There’s more on this later in the book; see Section 15.4.2, “Multibyte Character Encodings,” page 545.) There are well over a million valid Unicode code points—1,114,112 to be exact—thus it’s not practical for each row in a transition table to have over a million entries.

To solve this problem, Brian Kernighan moved the goto table into a different structure, accessed via functions. The data structure for a gototab consisted of a “goto table entry”:

typedef struct gtt { /* gototab entry */
    unsigned int ch;                    The character of interest
    unsigned int state;                 The state to move to
} gtt;

Initially, he left the size of each row at (approximately) 256 entries, and he used a linear search within the row to find the current input character, and from there return the next state. The function is named get_gototab():

static int get_gototab(fa *f, int state, int ch) /* hide gototab implementation */
{
    int i;
    for (i = 0; i < f->gototab_len; i++) {
        if (f->gototab[state][i].ch == 0)
            break;
        if (f->gototab[state][i].ch == ch)
            return f->gototab[state][i].state;
    }
    return 0;
}

During the matching, if input characters are not found in the gototab, they and their next state are added to it.6 This happens in set_gototab():

static int set_gototab(fa *f, int state, int ch, int val) /* hide gototab implementation */
{
    int i;
    for (i = 0; i < f->gototab_len; i++) {
        if (f->gototab[state][i].ch == 0 || f->gototab[state][i].ch == ch) {
            f->gototab[state][i].ch = ch;
            f->gototab[state][i].state = val;
            return val;
        }
    }
    overflo(__func__);
    return val; /* not used anywhere at the moment */
}

The new implementation worked, but if an input file contained many Unicode characters, awk’s matching became unacceptably slow, and it was possible to exceed the fixed limit on the number of entries.

Your author decided to see if these problems could be fixed without too much disruption in the code. This involved three changes:

  • Dynamically growing the size of the gototab as needed.

  • Looking up the entries in the gototab with a binary search.

  • Keeping the entries sorted so that binary search would work.

First off, the structures got rearranged a little, in order to accommodate the dynamic sizing:

typedef struct gtte { /* gototab entry */
    unsigned int ch;
    unsigned int state;
} gtte;

typedef struct gtt {  /* gototab */
    size_t allocated;
    size_t inuse;
    gtte *entries;
} gtt;

For both sorting and searching, we need a comparison function:

static int entry_cmp(const void *l, const void *r)
{
    const gtte *left, *right;

    left = (const gtte *) l;
    right = (const gtte *) r;

    return left->ch - right->ch;
}

Finding an entry uses a classic call to bsearch():

static int get_gototab(fa *f, int state, int ch) /* hide gototab implementation */
{
    gtte key;
    gtte *item;

    key.ch = ch;
    key.state = 0;  /* irrelevant */
    item = bsearch(& key, f->gototab[state].entries,
            f->gototab[state].inuse, sizeof(gtte),
            entry_cmp);

    if (item == NULL)
        return 0;
    else
        return item->state;
}

Adding a new character and state is more complicated, because the list must be kept sorted.7 In pseudocode, though, it’s rather straightforward:

if the array is empty:
    add the character and state
    increment the count of entries in use
    return
else if the current character is greater than the last one in the array:
    // since it's greater, the array stays sorted
    grow the array if necessary
    add the character and state
    increment the count of entries in use
    return
else:
    binary search the list to see if we already have the current character
    if so, update the state and return

// At this point, the current character isn't in the list, so it must be added:
grow the array if necessary
add the character and state
increment the count of entries in use
sort the array
return

With that introduction, here is the new version of set_gototab():

static int set_gototab(fa *f, int state, int ch, int val) /* hide gototab implementation */
{
    if (f->gototab[state].inuse == 0) {
        f->gototab[state].entries[0].ch = ch;
        f->gototab[state].entries[0].state = val;
        f->gototab[state].inuse++;
        return val;
    } else if ((unsigned)ch > f->gototab[state].entries[f->gototab[state].inuse-1].ch) {
        // not seen yet, insert and return
        gtt *tab = & f->gototab[state];
        if (tab->inuse + 1 >= tab->allocated)
            resize_gototab(f, state);

        f->gototab[state].entries[f->gototab[state].inuse].ch = ch;
        f->gototab[state].entries[f->gototab[state].inuse].state = val;
        f->gototab[state].inuse++;
        return val;
    } else {
        // maybe we have it, maybe we don't
        gtte key;
        gtte *item;

        key.ch = ch;
        key.state = 0;  /* irrelevant */
        item = (gtte *) bsearch(& key, f->gototab[state].entries,
                f->gototab[state].inuse, sizeof(gtte),
                entry_cmp);

        if (item != NULL) {
            // we have it, update state and return
            item->state = val;
            return item->state;
        }
        // otherwise, fall through to insert and reallocate.
    }

    gtt *tab = & f->gototab[state];
    if (tab->inuse + 1 >= tab->allocated)
        resize_gototab(f, state);
    f->gototab[state].entries[tab->inuse].ch = ch;
    f->gototab[state].entries[tab->inuse].state = val;
    ++tab->inuse;
    qsort(f->gototab[state].entries,
        f->gototab[state].inuse, sizeof(gtte), entry_cmp);

    return val; /* not used anywhere at the moment */
}

The end result is that awk’s regular expression matching is now both free of the previous fixed limit, and acceptably performant. Binary search saves the day!

We return to this suite of functions in Section 17.6.1.1, “Valgrind Example: gototab in the One True Awk,” page 642.

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.