Windows Parallelism, Fast File Searching, and Speculative Processing
Sequential file processing (SFP) and data stream processing, while basic, is an essential computational task that's used for searching, transformation, archiving, encryption, and countless other tasks. Windows CMD and POSIX (Linux, UNIX, and Cygwin) both support commands to perform SFP operations.
Nonetheless, many SFP implementations (including the Windows and POSIX commands) are much slower than necessary, as they fail to exploit two basic techniques: parallelism and memory-mapped files. These techniques are enabled by three advances in commodity computer systems, including laptops:
- Multi-core CPU chips and multi-chip systems enable parallelism; that is, independent tasks running concurrently on separate cores. The programming challenge consists of determining how to subdivide the problem into parallel tasks; it's easy in some cases and very difficult in others.
- Large physical memories such as 8GB are common even on inexpensive desktops, and much larger memories are available on high-end systems.
- 64-bit systems allow for mapping large files easily and efficiently, although memory-mapped files are effective for smaller files on 32-bit systems. In most cases, processing the mapped memory is faster than conventional file I/O.
Example programs in my book Windows System Programming, Fourth Edition (WSP4)  show how to speed up sequential file processing significantly in two specific SFP patterns:
- File transformation, in which an input file is transformed (example: encryption) and written to an output file.
- File processing, in which multiple input files are processed concurrently to obtain information such as counts of lines, words, and characters (specifically, the POSIX wc command).
In a recent paper, "Sequential File Processing: A Fast Path from Native Code to .NET with Visits to Parallelism and Memory Mapped Files," (SFP-Fast) , I've shown how the same techniques can be extended from Windows native code to C#/.NET managed code.
These two examples are cases of Toub's  "delightful parallelism," in which the file processing task can easily be divided into smaller "stripes" (collections of regularly spaced, fixed-size blocks) to be processed concurrently by separate tasks. Not all SFP problems are so delightfully parallel, however, and now I'll show how to solve a harder problem that requires "speculative processing" (Toub, p. 94). The performance results using native Windows code are excellent, relative to the Windows commands. The paper will show performance graphs, show the solution's limitations, and describe the solution strategy. Complete code is in the WSP4 Examples file, and spreadsheets with raw performance data are downloadable from the author's website.
A Brief Parallelism Review
Parallelism is a hot topic, widely discussed (see the "References" section). Nearly all writers, including this one, agree that parallelism is a difficult topic and that, in general, it's challenging to find and exploit the parallelism in many computational problems. It's even more difficult to convert legacy code to exploit parallelism. Parallel programmers face challenges including but not limited to shared data, locking, deadlocks, race conditions, false sharing, and identifying computational tasks that can run concurrently.
However, this complexity shouldn't stop us from exploiting parallelism when it's relatively easy to achieve in practical situations, such as the file processing and file transformation tasks mentioned earlier. Furthermore, the experience and confidence gained from solving simpler problems may help to solve some harder problems in the future.
A Harder Problem: File Comparison
The benchmark problem solved in this article involves speculative processing:
- We search data streams for a pattern or specific condition.
- We need to identify the first occurrence(s) of the condition.
- Once an occurrence has been detected, other search tasks terminate as soon as possible in order to maximize performance.
More generally, Toub describes speculative processing this way:
Speculation is the pattern of doing something that may not be needed in case it actually is needed. This is increasingly relevant to parallel computing, where we can take advantage of multiple cores to do more things in advance of their actually being needed. Speculation trades off throughput for reduced latency, by utilizing resources to do more work in case that extra work could pay dividends.
This article's specific benchmark problem is to implement a fast version of the Windows COMP command (called compMP), which compares two files, byte by byte, and displays the mismatches in order of occurrence. compMP differs from the COMP command in several ways that are useful but challenging to implement:
- The search terminates after identifying eight mismatches (an arbitrary limit).
- Optional command-line parameters specify a block size (default: 4,096) and the number of concurrent threads (default: the number of processors).
- If the number of threads is negative, the program uses conventional file processing with ReadFile Windows calls, reading block size blocks for each read operation. This allows for comparison with both the fast memory-mapped/parallel implementation and with the real Windows COMP command.
The implementation challenges occur because mismatches can be identified at any time in different tasks (threads) and may not be found in the order in which they occur in the two files. Therefore, the solution needs to keep track of the earliest mismatches and, when the maximum mismatch count (8) is reached, other search tasks should terminate if they're searching past the last mismatch location.
Some Performance Results
This section gives performance results to show that the effort is worthwhile. Following sections will discuss the key implementation considerations.
First, in order to become familiar with the command syntax, Figure 1 shows results on two short files (3,968 bytes) which differ in two locations. The program name is compMP (memory-mapped, parallel compare), and there is also a run of Windows COMP. Notice the command-line parameters to specify block size and thread count (which is negative in one case to force conventional file processing).
Figure 2 shows performance results using timep (a time utility in WSP4) to measure the elapsed (real), user, and system time on a four-core 64-bit system. Results, arranged from fastest to slowest, are shown for four threads, two threads, one thread, conventional file access (ReadFile), and Windows COMP.
The compared files are identical and 640,000,000 bytes long. This size was selected as follows:
- The size is sufficiently large to consume measurablebut not excessivetime.
- A large file can stress a system with less memory or a 32-bit CPU, revealing scalability issues.
- The size (640,000,000 bytes) is not a power of two and hence not a multiple of the block sizes, which is important for thorough testing of some implementations.
- Many applications, including emerging cloud applications, must process large data sets (this size is small by such standards), as described, for example, by Gannon and Reed .
Summarizing these results:
- The fast implementation, using memory-mapping and four tasks, consumed 0.562 seconds real (elapsed) time. The user time is larger, as it's the sum of times on four processors. The default number of tasks (four on the test system) is nearly optimal (eight threads doesn't produce better results) and is significantly faster than the one-thread and two-thread cases.
- The conventional file processing real-time is 2.353 seconds (more than three times the four-thread result and about double the one-thread case). The block size (65,536) is important; smaller block sizes are slower, as there are more ReadFile system calls. Examples: 512-byte blocks: 15.1s; 2,048-byte blocks: 5.7s; 8,192-byte blocks: 2.9s.
- Windows COMP required more than 13 seconds (there was a small delay to respond to the prompt).
Figure 3 is a graph showing compMP and Windows COMP results for different file sizes, using both conventional file processing (65,536 byte blocks) and memory-mapped, parallel file processing. As Figure 3 shows, compMP is not efficient for very large files (discussed in the next section).
Overcoming Large File Limitations
compMP, with a thread for every processor (the red line), is significantly faster than other solutions for files of "moderate" size. Efficiency requires sufficient physical memory to hold each file, and the test system RAM size is 8GB, not enough for two 5GB files.
The four threads cause page thrashing, as the threads can be accessing the file at different locations; and the elapsed time, while highly variable, usually exceeds 500 seconds. Two threads provide slightly better performance (about 400 seconds), but one thread (the green line) is best.
The elapsed time is off the charts for the 5,120GB files for all solutions, but here are the numerical results for the three single-threaded solutions:
- ReadFile (conventional file processing) requires about 60 seconds, and is the fastest of the three single-threaded solutions.
- compMP with one thread (the green line) requires about 185 seconds.
- Windows COMP (light blue line) requires about 185 seconds, and is the slowest solution in all cases.
These results suggest a simple adaptive strategy (not implemented) for compMP, which would appear to beat Windows COMP in all cases:
- Test the file size. If two files don't fit conveniently in the available physical memory, use the ReadFile code. Experiment to find the Windows overhead to determine the file-size threshold.
- If enough physical memory exists, use the memory-mapped file, multithread code.
Results on older 32-bit systems are similar for smaller files. The 320MB file test worked well, but the 640MB files couldn't be mapped due to virtual address space limitations. compMP should extend its adaptive strategy for 32-bit operation.
About the Solution
The source code is available at the WSP4 support site. A ReadMe.txt file is available, and you should open the compMP solution in the Projects2008_64 folder, which contains Visual Studio 2008 64-bit projects. The source file is CHAPTR14\compMP.c, and the project settings are consistent with other WSP4 examples, as described in ReadMe.txt.
Source-code comments explain the important code features, but here's a summary of important points:
- The memory-mapping and thread management are standard, corresponding to patterns that WSP4 (Chapters 5 and 710) describes and uses extensively.
- The file-comparison threads used "shared state," which maintains an array of up to eight mismatch entries, with each mismatch entry containing file position and byte values in the two files.
- The shared state contains a lock; in this case, a "Slim Reader/Writer" (SRW) lock (not available with Windows XP and Server 2003). A conventional mutex or CRITICAL_SECTION would also work, but neither is as efficient as the SRW lock.
- The mismatch table is maintained in order, sorted by the file location. See the LogMismatch function for the table update logic and the exclusive SRW locking. The implementation is simple and adequate given the small table size, and this logic is invoked only when there's a mismatch, which will never occur in the normal case of identical files.
- The comparison is performed in eight-byte units as Microsoft C++ __int64 data items. If there's a mismatch, the program identifies the one or more bytes that mismatch. This is much more efficient (by a factor of three or more, in some cases) than comparing byte by byte. Other code optimizations are possible but haven't been attempted.
- The thread shutdown logic in the ReadCompare function is passive; there's no need for any type of thread cancellation. Each thread periodically (at the start of each block) examines the thread state to determine whether the mismatch table is full, and whether the thread's file location is greater than the maximum location in the table. A thread might run a bit longer than necessary, but locking the shared state and test before each byte comparison would be inefficient. Furthermore, no cancellation technique assures immediate response.
- Locking before all shared-state access erects memory barriers, ensuring that the latest values are available.
A Harder Search Problem
File comparison, which in this case searches a file pair for a limited number of mismatches, lends itself to speculative processing because there's no state as we scan the file, other than the shared mismatch data. Two bytes are either the same or different. If they're different, we log the mismatch. In either case, we can then compare the next byte pair. There's no problem when scanning an individual block, as there's no concern about the state when starting to scan the block.
Searching a file for a regular expression is an instance of a more complex problem, as there's no way to know the state when starting to process a file block. You need to process the file sequentially from beginning to end. If the search pattern is very simple, such as a constant string, a little ingenuity can overcome this problem, by allowing a thread to continue past the end of a block into the next block.
The general regular expression doesn't have an easy, apparent solution using speculative or any other form of parallel processing, but "whole file parallelism" works well. That is, if several files are specified on the command line, create a distinct thread to process each file. Use shared state to accumulate and order the pattern matches. WSP4 has a wc (word count) solution using whole-file parallelism.
Linux and UNIX Results
It's straightforward to convert the code to POSIX, although the Windows-specific OS calls and data types should be replaced carefully. I haven't performed this conversion, but earlier results with whole-file parallelism with the wc command indicate that we can expect similar results for compMP.
A future article will extend the solution from native code to C#/.NET code and exploit new .NET 4.0 features, thereby extending the solution in SFP-Fast. This article will also explore other speculative processing patterns.
 Asanovic, K., et al. "A View of the Parallel Computing Landscape." Commun. ACM 52, 10 (Oct. 2009), 5667.
 Butenhof, D. R. Programming with POSIX Threads. Addison-Wesley, 1997.
 Cormen, T. H., et al. Introduction to Algorithms, Third Edition. The MIT Press, 2009. (Chapter 27.)
 Duffy, J. Concurrent Programming on Windows. Addison-Wesley, 2008.
 Gannon, G. and D. Reed. "Parallelism and the Cloud." Dr. Dobb's Journal, October 16, 2009.
 Hart, J. M. [SFP-Fast] "Sequential File Processing: A Fast Path from Native Code to .NET with Visits to Parallelism and Memory Mapped Files." June 14, 2010. (To appear in Dr. Dobb's Journal.)
 Toub, S. "Patterns for Parallel Programming: Understanding and Applying Parallel Patterns with the .NET Framework 4." February 16, 2010.
 Wilson, G. V. "Deja Parallel All Over Again." Dr. Dobb's Journal, June 19, 2005.
Johnson (John) M. Hart is a consultant specializing in Microsoft Windows and .NET application development, open systems computing, and technical training and writing. He has over 25 years' experience as a software engineer, manager, engineering director, and architect at Cilk Arts, Inc. (now part of Intel); ArrAy, Inc. (now part of Sierra Atlantic, Inc.); HP; and Apollo Computer. He served as a computer science professor at the University of Kentucky for nine years, and authored all four editions of Windows System Programming. John still enjoys pushing lots of bits around as fast as possible.