We start this section by reviewing the changes that were made to add the Visit fact and customer information to the existing data warehouse, and then give a detailed description of the data mining sections of the solution.
Adding Visit Information to the Data Warehouse
To add the Visit fact table and associated dimensions to the database, we need to supplement the existing ETL procedures to load data from the e-commerce application's tables. As always, when adding a new data source to the warehouse, you need to perform data mapping to match any existing dimensions. We have already discussed that for the Customer dimension we will be using the store loyalty card number to map Internet profiles to customer records. Date and Time of Day keys are usually simple to map; however, because time stamps in Web server logs are either in the local time of the server or in UTC (coordinated universal time), we need to check this before implementing the ETL.
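The time zone check can be sketched outside the ETL tool. The following is a minimal illustration in Python, not the solution's actual ETL; the function name, the key formats (a yyyymmdd date key and a minute-of-day time key), and the fixed five-hour offset are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the warehouse keeps visits in the store's local time, taken here
# as a fixed five hours behind UTC. A real ETL would use a time zone database.
WAREHOUSE_TZ = timezone(timedelta(hours=-5))


def to_warehouse_keys(log_timestamp, logged_in_utc=True):
    """Turn a log time stamp into hypothetical (date_key, time_of_day_key) values."""
    parsed = datetime.strptime(log_timestamp, "%Y-%m-%d %H:%M:%S")
    if logged_in_utc:
        # Attach UTC, then shift to the warehouse time zone.
        parsed = parsed.replace(tzinfo=timezone.utc).astimezone(WAREHOUSE_TZ)
    date_key = int(parsed.strftime("%Y%m%d"))           # yyyymmdd
    time_of_day_key = parsed.hour * 60 + parsed.minute  # minute of the day
    return date_key, time_of_day_key
```

Note that a visit logged shortly after midnight UTC lands on the previous local date, which is why the check matters for the Date key as well as the Time of Day key.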
The Referrer Domain dimension will be sourced from the equivalent table in the e-commerce database, but if you are implementing ETL to extract this information from log files (see the sidebar "Extracting Information from IIS Logs"), you need to parse the URL of the referring page to extract the domain name. The Browser Platform attributes such as OperatingSystem and BrowserVersion also need to be extracted from the User Agent field in the log files.
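As a rough sketch of the URL parsing step, the following Python function pulls the domain name out of a referring URL. The function name and the "Unknown" placeholder member are assumptions for illustration; parsing the User Agent field is more heuristic and is not attempted here.

```python
from urllib.parse import urlparse


def referrer_domain(referrer_url):
    """Return the lower-cased host name of a referring URL, or 'Unknown'."""
    # Log files record "-" when the browser sent no referrer.
    if not referrer_url or referrer_url == "-":
        return "Unknown"
    host = urlparse(referrer_url).hostname
    return host.lower() if host else "Unknown"
```

Mapping missing or unparseable referrers to a single placeholder keeps every Visit fact row joined to some Referrer Domain member.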
Customer Dimension Changes
The new customer attributes can easily be added to the cube definition by refreshing the data source view (DSV) in BI Development Studio to pick up the new columns, and then adding these as attributes to the Customer dimension. However, they may not be in the best format for analysis purposes—having the specific date that a user first visited the site is not very illuminating for users of the cube. In fact, they would probably be better served by being able to select customers based on groups that show how long they have been Internet site users (for example, "3–6 months").
We can add this information into the DSV as a named calculation on Customer or add it to the underlying view in the database. You can implement the MonthsSinceFirstInternetVisit named calculation by using the DateDiff function to work out the number of months between the date of the first visit and the current system date:
DateDiff(m, DateFirstInternetVisit, GetDate())
Instead of showing the user a long list of numbers, it would be better to group the numbers together into ranges, such as 1–3 months, 3–6 months, and so on. Although we could do this manually using a lookup table of month ranges, we can take advantage of the Analysis Services discretization feature to do this for us. After adding the MonthsSinceFirstInternetVisit attribute to the Customer dimension, change the DiscretizationMethod property of the attribute to Automatic to allow Analysis Services to decide on the best method of grouping these time periods. If you want to specify the approximate number of groups (or "buckets") that are created, set the DiscretizationBucketCount property, too.
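To make the grouping idea concrete, here is a small Python sketch of the manual alternative mentioned above: a month count followed by a lookup of month ranges. It does not reproduce how Analysis Services chooses buckets automatically, and the range boundaries and labels are assumptions chosen for the example.

```python
from datetime import date


def months_between(first_visit, today):
    """Count month boundaries crossed, which is how T-SQL DateDiff(m, ...) counts."""
    return (today.year - first_visit.year) * 12 + (today.month - first_visit.month)


# Hypothetical manual ranges: (inclusive lower bound, exclusive upper bound, label).
MONTH_RANGES = [
    (0, 3, "0-3 months"),
    (3, 6, "3-6 months"),
    (6, 12, "6-12 months"),
]


def month_bucket(months):
    """Map a month count to the label of the range that contains it."""
    for low, high, label in MONTH_RANGES:
        if low <= months < high:
            return label
    return "12+ months"
```

Making each range inclusive at the bottom and exclusive at the top ensures that a value such as 3 falls into exactly one group.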
Visit Measure Group
Because we want to be able to look at both sales and visit information together, we can add the Visit fact table to the existing cube as a new measure group. Dimensions that are not used in the Visit measure group (such as Product and Promotion) will be grayed out in the Dimension Usage table of the cube editor.
One measure to be careful of is the Duration measure. Although this measure is additive across time (for example, we could determine the total duration that a group of customers spent on the site in the month of January), summing up the facts in this way does not make a lot of business sense. The Duration measure is there to provide an indication of how long people spent on the site, so we can change the AggregateFunction property of this measure to AverageOfChildren to display this information in the way that users will expect.
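The difference between the two aggregations is easy to see with a few numbers. This Python sketch uses made-up durations and a plain arithmetic mean; it is only an illustration and does not reproduce how Analysis Services evaluates AverageOfChildren.

```python
from statistics import mean

# Hypothetical visit durations in seconds for one customer group in January.
durations = [120, 300, 60, 720]

total_duration = sum(durations)     # additive, but hard to interpret
average_duration = mean(durations)  # closer to what a user expects to see
```

A total of 1,200 seconds says little on its own, whereas an average of 300 seconds per visit is directly comparable between customer groups of different sizes.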
How We Will Be Using Data Mining
As discussed in the section "High-Level Architecture," we chose the Microsoft Clustering and Microsoft Association algorithms for our solution. Knowing which algorithm is appropriate for your business problem will take some experimentation and research in the documentation. In fact, in a lot of cases, there is no obvious candidate at the outset, and you will need to try different algorithms against the same underlying data to see which is most appropriate. The data mining designer also includes a Mining Accuracy Chart tab that you can use to compare algorithms.
The first decision we need to make is where the data will come from. Analysis Services can use either the relational tables in your data source view or the actual cube itself as the source of data for the models. Because data mining is even more sensitive to flawed data than most applications, it is important to ensure that you perform as much data cleansing as possible against the source data prior to processing your models; so, at the very least, you should probably be using the tables in your data warehouse rather than directly using source systems.
However, using the cube as the source for data mining has a number of benefits, so we will be using that approach for this solution. The cube data has already been supplemented with additional attributes and calculated measures that the data mining algorithms can take advantage of. Also, the load process for data mining models can take some time, so using the cube as the source means that the aggregates will be used if applicable, potentially speeding up the processing time.
Approaching the Customer-Segmentation Problem
Because users can slice and dice information by all the attributes in a dimension rather than just predefined drilldown hierarchies, analysts could use the new Internet-related attributes that we added to drill down through the data and start to understand how customers' online activities affect measures such as total sales or profitability. For example, they can learn that frequent visitors to the site often have high sales amounts, but that this isn't always the case—some frequent visitors are "just looking."
To really do a good job of targeting the DVD marketing campaign to customers likely to act on the information, analysts need to perform a segmentation exercise where all customers that have similar attributes are categorized into groups. Because the list of customers is huge and there is a large number of attributes, we can start this categorization process by using a data mining algorithm to search through the customers and group them into clusters.
The Microsoft Clustering algorithm is a great tool for segmentation and works by looking for relationships in the data and generating a list of clusters, as shown in Figure 10-4, and then gradually moving clusters around until they are a good representation of the data.
Figure 10-4 Clusters of data
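The idea of placing clusters and then gradually moving them can be illustrated with a plain k-means loop. This is a stand-in written for the example, not the Analysis Services implementation of Microsoft Clustering, and the customer values and starting positions are invented.

```python
from math import dist


def k_means(points, centroids, max_rounds=20):
    """Plain k-means: assign points to the nearest centroid, then move centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_rounds):
        # Assign every point to its closest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        moved = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                moved.append(old)  # keep an empty cluster where it is
        if moved == centroids:
            break  # nothing moved, so the clusters are stable
        centroids = moved
    return centroids, assignment


# Hypothetical customers as (visits per month, sales amount in hundreds).
customers = [(1, 1), (2, 1), (1, 2), (9, 8), (8, 9), (9, 9)]
centers, labels = k_means(customers, centroids=[(0, 0), (10, 10)])
```

In this toy data set the loop settles after one move, with the three low-activity customers in one cluster and the three high-activity customers in the other.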
Getting Started with Data Mining
We start the data mining process by creating a new mining model for the customer segmentation exercise, using an existing Analysis Services project that contains the cubes and dimensions with both the data warehouse information (such as in-store sales) and the new Internet information described earlier.
In Analysis Services data mining, we define a mining structure that describes the underlying data that will be used for data mining. Each mining structure can contain multiple mining models, such as a clustering model and an association model, that all use the same underlying data but in different ways.
The wizard will create the mining structure and model and open the structure in the data mining designer. The underlying data that you selected is shown on the Mining Structure tab, and the Customer Internet Segmentation model is the only model in the list on the Mining Models tab.
Before working with the model, you need to deploy the solution and process the mining model. During processing, Analysis Services applies the algorithm you selected (Microsoft Clustering) to the data from the cube to allocate all the customers to their appropriate clusters—your next task in data mining is to understand the information that has been produced and relate it to the real world.
Looking at the Clusters Created
The Mining Model Viewer tab in the model designer enables you to view the model that has been processed. Each algorithm produces a different type of model, so there are specific viewers for each model. The initial view for clusters is the Cluster Diagram, which shows all the clusters in the model with lines connecting them. Each cluster is positioned closer to other similar clusters, and the darkness of the line connecting two clusters shows the level of similarity. The shading of each cluster by default is related to the population of the cluster (that is, how many customers it contains—the darker clusters have the most customers).
For our Customer Internet Segmentation model, we can see ten clusters named Cluster 1 through Cluster 10. Each cluster represents a group of customers with similar attributes, such as customers who are fairly new to our Internet site and have not made a lot of purchases yet. Our task at this stage is to understand the kinds of customers in each cluster and hopefully come up with some more meaningful names for the clusters.
We can start by using the Cluster Diagram's shading variable and state parameters to look at each attribute and see which clusters contain the most customers with the selected attribute. For example, if we select Sales Amount > 1275, we can see that Cluster 5 and Cluster 8 contain the most customers who have total sales of more than $1,275, as shown in Figure 10-6.
Figure 10-6 Cluster diagram
You can use the cluster diagram to help you comprehend each cluster by looking at one variable at a time. To really understand and compare the composition of clusters (that is, what types of customers are in each group), you need to use the Cluster Profiles and Cluster Discrimination views. We can see in the diagram that Cluster 1 contains a fairly high percentage of customers with high sales and is arranged near to Cluster 2 and Cluster 6, but we need more complete information to be able to assign a meaningful name to these clusters.
Understanding the Composition of Clusters
The Cluster Profiles view shows all the clusters that were identified as columns, and each attribute that you selected for your model as the rows, as shown in Figure 10-7. Looking first at Cluster 1, we can see that all the customers in the group have an Internet Purchaser attribute of False, as well as an Internet Visitor of False. So, the mining algorithm has grouped customers together who have never visited the site or purchased anything online—all their purchases have been at a physical store. Note that we can come to this rather useful conclusion only because we understand the underlying business, which is a key point about data mining.
Figure 10-7 Cluster profiles
To give Cluster 1 the more sensible name of Store-Only Buyers, right-click the cluster name and select Rename Cluster. So, we now have a single cluster identified; what about the others? If you look at the next column, you can see that Cluster 2 differs from Store-Only Buyers in that all the customers in the cluster have actually visited the site, but they just haven't made any purchases online yet. We can call this cluster Browsers because they are customers who are (so far) using the site for information gathering only.
Cluster 6 contains visitors who have also made a purchase, but if we look closely at the Months Internet Purchaser and Months Internet User attributes, we learn that they are all relative newcomers to our site—all of them have been visitors and purchasers for between zero and three months (they are "Newbies"). We can continue the process of looking at each cluster, but the rest of the clusters are not quite so clear-cut, so we need a better tool for differentiating between them.
Discriminating Between Similar Clusters
If you look at the profiles of Clusters 8 and 9 in BI Development Studio, you will notice that they both have multiple values for the number of months that customers have been Internet visitors and Internet purchasers. This illustrates an important point about the clusters that the algorithm identifies: Not every customer in the group has to have exactly the same value for every attribute. This is somewhat confusing when you start working with clusters; for example, you might have named a cluster Urban Professionals and then discover that it also contains a customer who lives in the countryside.
The reason for this is that the customer, when you look at all of his or her attributes together, is most similar to the customers who live in urban areas and have professional occupations. So naming a cluster Urban Professionals does not necessarily imply that it contains absolutely no manual laborers who live in the suburbs, but rather gives a high-level shorthand for the predominant combination of attributes in that cluster.
Because the clusters identified are therefore sometimes ambiguous, we need a way of discriminating between similar clusters to find out what exactly makes them different. We can use the Cluster Discrimination view, as shown in Figure 10-8, to select the two clusters we are interested in comparing and get an idea of what the most important differences are.
Figure 10-8 Cluster discrimination
We can see in the discrimination view that although the cluster profiles of 8 and 9 look similar, in fact Cluster 9 contains mostly customers who have made a visit and purchase in the past few months, are fairly frequent visitors, and have spent a lot of money with us—we could call this group Frequent Visitors. Cluster 8, on the other hand, contains mostly customers who have not visited the site for many months, although in the past they have spent some money with us. This cluster is probably one that we want to pay careful attention to, because they may now be doing their shopping with a competitor. That is, they may be Defectors.
With the cluster profile and discrimination views, we can understand the clusters well enough to give them meaningful names, so we can now turn our attention to providing this information back to users to enable them to perform analyses on the data using the clusters.
Analyzing with Data Mining Information
Analysis Services allows you to create a special type of dimension called a Data Mining dimension, which is based on a data mining model and can be included in a cube just like an ordinary dimension. The Data Mining dimension includes all the clusters that were identified by the algorithm, including any specific names that you assigned to them.
Adding a Data Mining Dimension to a Cube
We will use the data mining model we created in the previous Quick Start exercise and create a new dimension called Customer Internet Segmentation, as well as a new cube that includes this dimension. The existing Visit and Sales measure groups from the e-commerce cube will be linked into the new cube to be analyzed by the new dimension.
To create the dimension, open the data mining structure and go to the Mining Structure tab. Select Create a Data Mining Dimension on the Mining Model menu. Specify the dimension and cube names and click OK. Before you can use the new objects, you must deploy the solution and process the new dimension and cube.
Using the Data Mining Dimension
Because the dimension has been added to the cube, marketing database analysts can use the new segmentation to understand measures such as profitability or total sales for each of the clusters and refine the set of customers who will be targeted by the direct mail campaign to publicize the site's new DVD products. The list of customers can be provided either from a drillthrough action in a BI client tool or by building a Reporting Services customer list report that allows the user to select the cluster and other attributes.
Creating a Model for Product Recommendations
Our approach for product recommendations is based on the idea that we can use a mining model to look at every customer and the DVDs that they have bought, and then look for patterns of DVDs that often occur together. The Association Rules mining model is often used for this kind of analysis (sometimes called market basket analysis) and produces a set of rules that answer questions such as this: If a customer is buying a DVD of The Godfather, which other movies have other buyers of The Godfather purchased?
Each of these rules has a probability associated with it. For example, many customers may also have bought similar films, such as The Godfather Part II or Goodfellas, so the rules that relate The Godfather to these DVDs would have a high probability. If only a single customer bought It's a Wonderful Life and The Godfather, this rule would have a low probability. In data mining terminology, the number of times that a set of items occurs together is called the support, so the example of It's a Wonderful Life and The Godfather appearing together would have a support of 1.
We can use these rules to make a prediction: For a specific DVD, give me a list of the most probable DVDs that a customer might also enjoy.
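The relationship between support, probability, and prediction can be sketched in a few lines of Python. The purchase histories below are invented, and the probability is a simple share of buyers; the figures that Analysis Services calculates for its rules may be derived differently.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase histories: one set of DVD titles per customer.
histories = [
    {"The Godfather", "The Godfather Part II", "Goodfellas"},
    {"The Godfather", "The Godfather Part II"},
    {"The Godfather", "Goodfellas"},
    {"The Godfather", "It's a Wonderful Life"},
]

item_count = Counter(title for h in histories for title in h)
# Support of an ordered pair (a, b): the number of customers who bought both.
pair_support = Counter(pair for h in histories for pair in permutations(sorted(h), 2))


def recommend(title, top_n=5):
    """Rank other titles by the share of `title` buyers who also bought them."""
    rules = [
        (other, support, support / item_count[title])
        for (first, other), support in pair_support.items()
        if first == title
    ]
    # Highest probability first; the title breaks ties so the order is stable.
    rules.sort(key=lambda rule: (-rule[2], rule[0]))
    return rules[:top_n]
```

Calling recommend("The Godfather") on this data ranks Goodfellas and The Godfather Part II ahead of It's a Wonderful Life, which has a support of only 1.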
Asking the Right Question
The best way to successfully set up a sensible data mining model is to be precise about the question you ask. Because we are looking for DVDs that sell well together, is the question we are asking "Which other DVDs have been bought during the same shopping trip?" or rather "Which other DVDs did customers also buy at some point?" If you were doing product recommendations on groceries, the first question would probably be the more sensible. The reason is that if someone is buying beer and wine today, we can probably recommend ice and potato chips because those are often sold in the same transaction.
However, in our case, we are trying to determine the likes and dislikes of consumers, which have a longer duration than just a single transaction. We are really trying to understand what kind of movies customers enjoy, so the second question is more appropriate for this business solution. To set up the model, we need to look at each customer and determine the list of DVDs that they have purchased. The data we are looking for looks something like Table 10-1. In data mining terminology, the customer would be the case, and the list of products for each customer would be a nested table.
Table 10-1. Customer DVD Purchase History
The Godfather Part II
The Godfather Part II
It's a Wonderful Life
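The case and nested table shape can be illustrated by grouping flat purchase rows by customer. This Python sketch uses invented customer keys and is only meant to show the structure, not how Analysis Services reads the cube.

```python
from collections import defaultdict

# Hypothetical flat purchase rows: (customer key, DVD title).
purchase_rows = [
    (101, "The Godfather"),
    (101, "The Godfather Part II"),
    (102, "The Godfather Part II"),
    (103, "It's a Wonderful Life"),
    (101, "The Godfather"),  # repeat purchase, listed once per customer
]


def build_cases(rows):
    """Group rows into one case per customer with a nested list of titles."""
    nested = defaultdict(list)
    for customer_key, title in rows:
        if title not in nested[customer_key]:
            nested[customer_key].append(title)
    return dict(nested)


cases = build_cases(purchase_rows)
```

Each key in the result corresponds to a case, and each list corresponds to that case's nested table of products.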
Understanding the Product Recommendation Rules
Once again, you need to deploy and process the model before you can view the results. The algorithm produces a list of products related to each other for a customer, and you can view these in the Itemsets tab of the mining model viewer. Each itemset also shows the support or number of times that the set occurred together. If you click the Rules tab, you can see the main feature of this mining model (see Figure 10-10): A set of rules that can be used to calculate the probability that a new DVD will be appropriate based on the existing DVDs in the customer's shopping basket.
Figure 10-10 Product recommendation rules
For the Association Rules algorithm, the settings that you choose have a big impact on the set of rules created. You can change these settings by right-clicking the model in the Mining Models tab and selecting Set Algorithm Parameters. If you end up with long processing times and too many rules, you could increase the minimum probability parameter, which would discard rules with a low probability, or you could increase the minimum support, which would discard rules that do not occur very often in the data. If, on the other hand, you end up with too few rules, decrease the minimum probability and support.
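The effect of the two thresholds can be sketched as a simple filter. In Analysis Services the parameters are applied while the model is trained, so this after-the-fact Python filter is only an illustration; the rules and their numbers are invented.

```python
# Hypothetical rules: (antecedent, consequent, support, probability).
rules = [
    ("The Godfather", "The Godfather Part II", 40, 0.62),
    ("The Godfather", "Goodfellas", 25, 0.38),
    ("The Godfather", "It's a Wonderful Life", 1, 0.02),
]


def keep_rules(all_rules, minimum_support, minimum_probability):
    """Drop rules that fall below either threshold."""
    return [
        rule
        for rule in all_rules
        if rule[2] >= minimum_support and rule[3] >= minimum_probability
    ]


strict = keep_rules(rules, minimum_support=30, minimum_probability=0.5)
loose = keep_rules(rules, minimum_support=1, minimum_probability=0.01)
```

With the strict settings only the strongest rule survives, whereas the loose settings keep all three, including the rule that rests on a single customer.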
When we have finished creating and training the mining model, we can move on to using the mining model to make predictions for our Web application.
Add Data Mining Intelligence into a Web Application
Web applications that include dynamic content typically access relational databases to provide information to users, usually by using a data access library such as ADO.NET to execute an SQL query against the database, then looping through the resulting rows to create a Web page. The process for adding data mining information to a Web application is similar. A programming library called ADOMD.NET provides classes for querying Analysis Services, and the Data Mining eXtensions (DMX) query language is used to request information from the mining model.
Querying the Mining Model Using DMX
The DMX language is similar to standard SQL, but there are enough differences that you will need to spend some time working with the language before it becomes natural. As with all new query languages, it is often better to start out using graphical designers, such as the prediction query builder in SQL Server Management Studio, before moving on to hand-coding queries. There is also a thriving user community for Analysis Services data mining, and there are lots of samples available to get you started at www.sqlserverdatamining.com.
The prediction DMX query that we have built returns a long list of products, some of them with fairly low probabilities. What we actually need for the Web site is the top five or so best recommendations, so we can add a numeric parameter to the query to specify the maximum number of results to return. Our final DMX query looks like the following:
SELECT Predict ([Product], INCLUDE_STATISTICS, 5)
FROM [Product Recommendations]
NATURAL PREDICTION JOIN
(SELECT
    (SELECT 'The Godfather' AS [Product Name]
     UNION
     SELECT 'Battlestar Galactica Season I' AS [Product Name]) AS [Product]) AS a
Executing DMX Queries from a Web Application
Our Web application is an ASP.NET application, so we can take advantage of the ADOMD.NET classes to execute the DMX query. The DMX query that we designed earlier returns the information in a hierarchical format, and we would need to navigate through it to extract the product names. We can simplify the code by using the SELECT FLATTENED syntax, which returns the information as a simple list of products that we can add to the Web page.
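To show what flattening buys us, the following Python sketch mimics a hierarchical result and turns it into a simple list. The column names and values are assumptions for the example, and the code does not use ADOMD.NET; it only illustrates the difference in shape between the nested and the flattened result.

```python
# Hypothetical shape of an unflattened result: one row whose single column
# holds a nested table of predicted products.
nested_result = [
    {
        "Recommendations": [
            {"Product": "The Godfather Part II", "Probability": 0.62},
            {"Product": "Goodfellas", "Probability": 0.38},
        ]
    }
]


def flatten(rows, nested_column):
    """Yield one flat row per nested row, similar in spirit to SELECT FLATTENED."""
    for row in rows:
        outer = {k: v for k, v in row.items() if k != nested_column}
        for inner in row[nested_column]:
            yield {**outer, **inner}


product_names = [r["Product"] for r in flatten(nested_result, "Recommendations")]
```

With the flat rows, the page code needs only a single loop over the results to build the list of recommended products.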