Searching Files

It takes about 21 characters and the Regex class to determine whether a string contains a sequence of characters shaped like a Social Security number. Writing C# code to perform the same check from scratch would take considerably more effort, and the more complex the search, the more code you would have to write. Regular expressions are terse, but they grow only gradually more complex for searches that would require significantly more custom C# code.
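As a rough illustration of how short such a check can be, the sketch below tests a string for Social Security number-shaped text. The pattern, the class name SsnCheck, and the method name ContainsSsnLikeText are assumptions made for this example, not the author's code; the pattern checks only the shape (three digits, two digits, four digits, separated by hyphens), not whether the number is valid.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SsnCheck
{
    // Assumed pattern: ddd-dd-dddd, bounded so it is not part of a longer run of digits.
    public const string Pattern = @"\b\d{3}-\d{2}-\d{4}\b";

    // Returns true when the input contains text shaped like a Social Security number.
    public static bool ContainsSsnLikeText(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }

    public static void Main()
    {
        // Made-up sample values, for illustration only.
        Console.WriteLine(ContainsSsnLikeText("SSN on file: 123-45-6789")); // True
        Console.WriteLine(ContainsSsnLikeText("Phone: 555-1234"));          // False
    }
}
```

A hand-written equivalent would need to loop over the characters, track position, and check digits and hyphens explicitly, which is the extra effort the paragraph above refers to.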

For example, suppose you want to extract all of the email addresses from a web page. You could write a lot of code to work through the various kinds of markup found in the average HTML file, but with just a modicum of regular expression code you can perform the search quickly. Listing 1 contains code that searches a saved copy of the U.S. Senate mail listing for the email addresses of U.S. senators.

Listing 1—Using Regular Expressions to Search a Web Page for Email Addresses

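[STAThread]
static void Main(string[] args)
{
 // Needs System, System.IO, System.Text, and System.Text.RegularExpressions.
 // Read the saved HTML page into a byte array; using closes the file.
 byte[] text;
 using (FileStream stream = File.OpenRead("index.txt"))
 {
  text = new byte[stream.Length];
  stream.Read(text, 0, text.Length);
 }

 // Convert the raw bytes to an ASCII string.
 ASCIIEncoding encoding = new ASCIIEncoding();
 string content = encoding.GetString(text);

 // Escape the dots so they match literal periods, not any character.
 MatchCollection matches = Regex.Matches(content,
  @"mailto:\w+@\w+\.senate\.gov");

 foreach( Match match in matches )
 {
  Console.WriteLine(match.Value);
 }

 // Keep the console window open until Enter is pressed.
 Console.ReadLine();
}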

Listing 1 opens the saved HTML file using a FileStream. A byte array is allocated based on the length of the stream, and the content is read from the file into the array in one statement. The System.Text.ASCIIEncoding class is used to convert the byte array into an ASCII string. From this point, we have an input string that we can search with a regular expression describing the email addresses of U.S. senators.
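On .NET Framework 2.0 and later, the open, allocate, read, and decode steps described above can be collapsed into a single call to File.ReadAllText. The following is a minimal sketch of that alternative, not the approach used in Listing 1; the class name FileText, the method name ReadAscii, and the sample markup are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class FileText
{
    // Reads the whole file and decodes it as ASCII in one call.
    public static string ReadAscii(string path)
    {
        return File.ReadAllText(path, Encoding.ASCII);
    }

    public static void Main()
    {
        // Write a small, made-up HTML fragment to a temporary file, then read it back.
        string path = Path.GetTempFileName();
        File.WriteAllText(path,
            "<a href=\"mailto:someone@example.senate.gov\">Contact</a>",
            Encoding.ASCII);

        Console.WriteLine(ReadAscii(path));
        File.Delete(path);
    }
}
```

The returned string can be passed to Regex.Matches in the same way as the string built from the byte array.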

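The regular expression looks for the literal text mailto:, one or more word characters (\w+), an @ sign, one or more word characters for the subdomain, and then the senate.gov domain; it returns all of the senatorial email addresses found in the page. Because an unescaped . matches any character, each dot in the domain should be written as \. so that it matches only a literal period. From here it isn't much of a stretch to create an automated mailing list. (This kind of code can be valuable for legitimate purposes, which don't include spamming Carl Levin, Joe Lieberman, and Ted Kennedy.)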
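To see why the dots in a domain pattern matter, the sketch below runs the same input through a pattern with unescaped dots and one with escaped dots. The class name DotEscapeDemo, the method name CountMatches, and both sample addresses are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotEscapeDemo
{
    // A bare . matches any single character.
    public const string Unescaped = @"mailto:\w+@\w+.senate.gov";

    // \. matches only a literal period.
    public const string Escaped = @"mailto:\w+@\w+\.senate\.gov";

    // Counts how many times the pattern matches in the input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        // Hypothetical input: one well-formed address and one look-alike with no periods.
        string sample = "mailto:jane@example.senate.gov mailto:john@exampleXsenateYgov";

        Console.WriteLine(CountMatches(sample, Unescaped)); // 2
        Console.WriteLine(CountMatches(sample, Escaped));   // 1
    }
}
```

The unescaped pattern accepts the look-alike because X and Y each satisfy the any-character dot; the escaped pattern rejects it.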

CAUTION

If you use automated mailing lists to spam the U.S. Senate, you might get to check into Kevin Mitnick's old room at Lompoc federal prison. Spam is annoying; don't send it.
