Mining the Deep Web with Mashups
Enterprise software is like a matryoshka nesting doll—you know, those Russian dolls that stack inside one another, each progressively smaller, until you reach the last, indivisible one. With applications, we begin with the user interface and can work backward through the servers, their hosting architecture, the systems and protocols involved—all the way back to the specific developers who wrote the code. But depending on our place in an IT or business organization, most of us never bother to peel this onion (and yes, the analogy is fitting, because I fully expect many of you would cry if you did).
What does this have to do with mashups? Read on.
Most user-created mashups integrate applications "at the glass"—that is, at the user-interface level. At best, they may leverage a handful of public APIs that represent the functionality a site has chosen to expose. Yet behind almost any site lies a treasure trove of information ripe for the taking. This data is generally referred to as the Deep Web. One study indicates that it may be more than 500 times larger than the public veneer crawled by search engines like Google or Yahoo!.
This is where mashups enter the picture. A handful of tools provide an easy way for a developer or power user to impose his or her own API on a site. (In my book Mashup Patterns: Designs and Examples for the Modern Enterprise, I call this the "API Enabler pattern.") Two popular tools for this purpose:
- JackBe's Presto (registration or login required for access) provides this capability through a partnership with Dapper.
- openkapow, provided for free by commercial vendor Kapow Technologies, is another site where you can experiment with this functionality.
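To make the idea concrete, here's a minimal sketch of what "imposing your own API" on a site amounts to. The markup, CSS class names, and product data below are all invented for illustration; a real tool would fetch live HTML (say, with urllib.request) and cope with far messier pages.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a retailer's product page; in practice
# this HTML would be fetched from the live site.
SAMPLE_PAGE = """
<html><body>
  <h1 class="product">USB Microscope</h1>
  <span class="price">$79.99</span>
</body></html>
"""

class ProductParser(HTMLParser):
    """Pulls the product name and price out of raw page markup."""
    def __init__(self):
        super().__init__()
        self._field = None   # which field the next text chunk belongs to
        self.data = {}

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if "product" in classes:
            self._field = "name"
        elif "price" in classes:
            self._field = "price"

    def handle_data(self, text):
        if self._field and text.strip():
            self.data[self._field] = text.strip()
            self._field = None

def get_product(html):
    """The 'imposed API': a clean function call wrapped around markup
    the site never intended as a programmatic interface."""
    parser = ProductParser()
    parser.feed(html)
    return parser.data

print(get_product(SAMPLE_PAGE))   # {'name': 'USB Microscope', 'price': '$79.99'}
```

Once the page is wrapped this way, anything downstream can treat the site as if it had published a real API—which is exactly the leverage the API Enabler pattern provides.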
How Mashups Can Help Your Business
When you're shopping online for yourself, you probably use a price-comparison mashup like Google Shopping to find the lowest price for the item you want. That data comes from the "surface" web. But how many times have you clicked through on the best price, only to find ridiculous shipping charges? Then it's back to the list to find the second-lowest price and begin the checkout process again.
Behind that simple advertised price is additional information on shipping, ratings, rebates, and availability that the shallow web doesn't expose. A really valuable price guide would take these variables into consideration. Using the top five results and your general location (obtained from your IP address), it would mimic a purchase at the underlying site all the way up through checkout—without paying, of course. Ideally, a search for rebates and coupons would be mashed in, too. You'd wind up with results that were truly useful, like the one shown in Figure 1.
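The re-ranking logic itself is trivial once the Deep Web data is in hand. Here's a sketch using made-up sample figures; a real mashup would pull each price, shipping charge, and rebate from the retailer's own checkout flow as described above.

```python
# Invented sample data: three retailers' offers for the same item.
offers = [
    {"store": "StoreA", "price": 89.99, "shipping": 0.00,  "rebate": 0.00},
    {"store": "StoreB", "price": 84.99, "shipping": 12.50, "rebate": 0.00},
    {"store": "StoreC", "price": 87.49, "shipping": 4.99,  "rebate": 10.00},
]

def landed_cost(offer):
    """What you actually pay: price plus shipping, minus any rebate."""
    return offer["price"] + offer["shipping"] - offer["rebate"]

# Re-rank the results by what the shallow web never shows you.
ranked = sorted(offers, key=landed_cost)
for o in ranked:
    print(f"{o['store']}: ${landed_cost(o):.2f}")
```

Note that StoreB "wins" on advertised price but finishes last once shipping is counted—precisely the trap the surface-web comparison sites lead you into.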
This is a simple consumer-facing example, of course. What kinds of hidden information could a company grab that might give it a competitive advantage? I can think of several right off the bat:
- Repeatedly order a product to see rival companies' stock levels and estimated delivery times.
- Book rooms at competing hotels to make sure that your travel department isn't just getting a good price, but one that includes the most free services (Internet access, gym passes, and so on).
- Try to reserve equipment for a future date. If your competition can't meet your fictitious "needs," target special advertising for that period to steal customers that your rivals have turned away.
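The first idea on that list—probing a rival's stock level—doesn't even require many requests. If the order form rejects quantities it can't fulfill, a binary search pins down the exact figure in a handful of tries. The `can_fulfill` callback below is a stand-in for driving the competitor's order form through checkout:

```python
def probe_stock(can_fulfill, upper_bound=10_000):
    """Binary-search for the largest order quantity a site will accept.

    `can_fulfill(qty)` is a hypothetical stand-in for mimicking an order
    of `qty` units and checking whether checkout would succeed.
    """
    lo, hi = 0, upper_bound
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if can_fulfill(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

# Simulated rival with 137 units on hand; a real probe would hit the
# competitor's order form instead of calling a local function.
ACTUAL_STOCK = 137
print(probe_stock(lambda qty: qty <= ACTUAL_STOCK))   # 137
```

About fourteen simulated orders suffice to bracket any stock level up to 10,000—cheap reconnaissance by any standard.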
Traditional search engines ignore the Deep Web, which is one of the reasons that MIT researchers have been working on Morpheus, a series of data transformations intended to peek past web forms and into the public databases behind them. That brings up an important point: The Deep Web isn't hidden—most of its data is freely accessible. To see what lies beneath, you (or a simulated you—the mashup acting on your behalf) only have to ask.
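"Asking" a Deep Web database means submitting the same form a person would. The sketch below packages a form submission the way a browser POSTs it; the endpoint URL and field names are hypothetical—in practice you'd lift them from the target page's form markup.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_form_query(endpoint, fields):
    """Package a form submission exactly as a browser would POST it."""
    body = urlencode(fields).encode("ascii")
    return Request(endpoint, data=body,
                   headers={"Content-Type": "application/x-www-form-urlencoded"})

# Hypothetical search form on a site whose catalog never appears in Google.
req = build_form_query("http://example.com/search",
                       {"author": "Verne", "format": "paperback"})
print(req.get_method(), req.full_url)   # POST http://example.com/search
print(req.data)                         # b'author=Verne&format=paperback'
# urllib.request.urlopen(req) would then submit the form and return the
# results page, ready for the kind of extraction shown earlier.
```

That's the whole trick: the mashup is simply a "simulated you," typing into the form on your behalf.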
Just as there's a public Deep Web, there's an arguably more valuable "Deep Intranet." You might think that employees within a firm wouldn't have any problems accessing internal data, but anyone who has worked for a large corporation knows otherwise. I'm not talking about circumventing security procedures or accessing privileged content; sometimes the teams that own data are simply too overburdened to offer a public interface. Or maybe bureaucratic processes make obtaining new feeds unnecessarily painful.
If there's any type of existing interface on the system, you're all set. It doesn't even have to be web-based. The products I've already mentioned (and others, like Denodo Data Mashup, Serena, and IBM Mashup Center) are also capable of reaching into databases and other repositories.
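A Deep Intranet mashup can be as simple as joining an internal table against figures pulled from a site. Here's a sketch using an in-memory SQLite database; the table, its rows, and the "scraped" competitor prices are all invented sample data standing in for a real repository and a real extraction step.

```python
import sqlite3

# Stand-in for an internal system of record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT, on_hand INTEGER)")
conn.executemany("INSERT INTO inventory VALUES (?, ?)",
                 [("A100", 40), ("B200", 5)])

# Competitor prices as a hypothetical scraper might return them.
scraped_prices = {"A100": 19.99, "B200": 4.49}

# The mashup: one combined view no single source could provide.
report = [
    (sku, on_hand, scraped_prices.get(sku))
    for sku, on_hand in conn.execute("SELECT sku, on_hand FROM inventory")
]
print(report)   # [('A100', 40, 19.99), ('B200', 5, 4.49)]
```

Commercial tools wrap this pattern in visual designers and connectors, but the underlying move is the same: treat every interface—web, database, or otherwise—as raw material for recombination.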