Home > Articles > Data > SQL Server

This chapter is from the book

Technical Solution

We start this section by reviewing the changes that were made to add the Visit fact and customer information to the existing data warehouse, and then give a detailed description of the data mining sections of the solution.

Adding Visit Information to the Data Warehouse

To add the Visit fact table and associated dimensions to the database, we need to supplement the existing ETL procedures to load data from the e-commerce application's tables. As always, when adding a new data source to the warehouse, you need to perform data mapping to match any existing dimensions. We have already discussed that for the Customer dimension we will be using the store loyalty card number to map Internet profiles to customer records. Date and Time of Day keys are usually simple to map; however, because time stamps in Web server logs are either in the local time of the server or in UTC (coordinated universal time), we need to check this before implementing the ETL.

The Referrer Domain dimension will be sourced from the equivalent table in the e-commerce database, but if you are implementing ETL to extract this information from log files (see the sidebar "Extracting Information from IIS Logs"), you need to parse the URL of the referring page to extract the domain name. The Browser Platform attributes such as OperatingSystem and BrowserVersion also need to be extracted from the User Agent field in the log files.

Customer Dimension Changes

The new customer attributes can easily be added to the cube definition by refreshing the data source view (DSV) in BI Development Studio to pick up the new columns, and then adding these as attributes to the Customer dimension. However, they may not be in the best format for analysis purposes—having the specific date that a user first visited the site is not very illuminating for users of the cube. In fact, they would probably be better served by being able to select customers based on groups that show how long they have been Internet site users (for example, "3–6 months").

We can add this information into the DSV as a named calculation on Customer or add it to the underlying view in the database. You can implement the MonthsSinceFirstInternetVisit named calculation by using the DateDiff function to work out the number of months between the date of the first visit and the current system date:

DateDiff(m, DateFirstInternetVisit, GetDate())

Instead of showing the user a long list of numbers, it would be better to group the numbers together into ranges, such as 1–3 months, 3–6 months, and so on. Although we could do this manually using a lookup table of month ranges, we can take advantage of the Analysis Services discretization feature to do this for us. After adding the MonthsSinceFirstInternetVisit attribute to the Customer dimension, change the DiscretizationMethod property of the attribute to Automatic to allow Analysis Services to decide on the best method of grouping these time periods. If you want to specify the approximate number of groups (or "buckets") that are created, set the DiscretizationBucketCount property, too.

Visit Measure Group

Because we want to be able to look at both sales and visit information together, we can add the Visit fact table to the existing cube as a new measure group. Dimensions that are not used in the Visit measure group (such as Product and Promotion) will be grayed out in the Dimension Usage table of the cube editor.

One measure to be careful of is the Duration measure. Although this measure is additive across time (for example, we could determine the total duration that a group of customers spent on the site in the month of January), using the information by summing up the facts in this way does not make a lot of business sense. The Duration measure is there to provide an indication of how long people spent on the site; and so, we can change the AggregateFunction property of this measure to AverageOfChildren to display this information in the way that users will expect.

How We Will Be Using Data Mining

As discussed in the section "High-Level Architecture," we chose the Microsoft Clustering and Microsoft Association algorithms for our solution. Knowing which algorithm is appropriate for your business problem will take some experimentation and research in the documentation. In fact, in a lot of cases, there is no obvious candidate at the outset, and you will need to try different algorithms against the same underlying data to see which is most appropriate. The data mining designer also includes a Mining Accuracy Chart tab that you can use to compare algorithms.

The first decision we need to make is where the data will come from. Analysis Services can use either the relational tables in your data source view or the actual cube itself as the source of data for the models. Because data mining is even more sensitive to flawed data than most applications, it is important to ensure that you perform as much data cleansing as possible against the source data prior to processing your models; so, at the very least, you should probably be using the tables in your data warehouse rather than directly using source systems.

However, using the cube as the source for data mining has a number of benefits, so we will be using that approach for this solution. The cube data has already been supplemented with additional attributes and calculated measures that the data mining algorithms can take advantage of. Also, the load process for data mining models can take some time, so using the cube as the source means that the aggregates will be used if applicable, potentially speeding up the processing time.

Approaching the Customer-Segmentation Problem

Because users can slice and dice information by all the attributes in a dimension rather than just predefined drilldown hierarchies, analysts could use the new Internet-related attributes that we added to drill down through the data and start to understand how customers' online activities affect measures such as total sales or profitability. For example, they can learn that frequent visitors to the site often have high sales amounts, but that this isn't always the case—some frequent visitors are "just looking."

To really do a good job of targeting the DVD marketing campaign to customers likely to act on the information, analysts need to perform a segmentation exercise where all customers that have similar attributes are categorized into groups. Because the list of customers is huge and there is a large number of attributes, we can start this categorization process by using a data mining algorithm to search through the customers and group them into clusters.

The Microsoft Clustering algorithm is a great tool for segmentation and works by looking for relationships in the data and generating a list of clusters, as shown in Figure 10-4, and then gradually moving clusters around until they are a good representation of the data.

Figure 10-4

Figure 10-4 Clusters of data

Getting Started with Data Mining

We start the data mining process by creating a new mining model for the customer segmentation exercise, using an existing Analysis Services project that contains the cubes and dimensions with both the data warehouse information (such as in-store sales) and the new Internet information described earlier.

In Analysis Services data mining, we define a mining structure that describes the underlying data that will be used for data mining. Each mining structure can contain multiple mining models, such as a clustering model and an association model, that all use the same underlying data but in different ways.

The wizard will create the mining structure and model and open the structure in the data mining designer. The underlying data that you selected is shown on the Mining Structure tab, and the Customer Internet Segmentation model is the only model in the list on the Mining Models tab.

Before working with the model, you need to deploy the solution and process the mining model. During processing, Analysis Services applies the algorithm you selected (Microsoft Clustering) to the data from the cube to allocate all the customers to their appropriate clusters—your next task in data mining is to understand the information that has been produced and relate it to the real world.

Looking at the Clusters Created

The Mining Model Viewer tab in the model designer enables you to view the model that has been processed. Each algorithm produces a different type of model, so there are specific viewers for each model. The initial view for clusters is the Cluster Diagram, which shows all the clusters in the model with lines connecting them. Each cluster is positioned closer to other similar clusters, and the darkness of the line connecting two clusters shows the level of similarity. The shading of each cluster by default is related to the population of the cluster (that is, how many customers it contains—the darker clusters have the most customers).

For our Customer Internet Segmentation model, we can see ten clusters named Cluster 1 through Cluster 10. Each cluster represents a group of customers with similar attributes, such as customers who are fairly new to our Internet site and have not made a lot of purchases yet. Our task at this stage is to understand the kinds of customers in each cluster and hopefully come up with some more meaningful names for the clusters.

We can start by using the Cluster Diagram's shading variable and state parameters to look at each attribute and see which clusters contain the most customers with the selected attribute. For example, if I select Sales Amount > 1275 in Figure 10-6, I can see that Cluster 5 and Cluster 8 contain the most customers who have total sales of more than $1,275, as shown in Figure 10-6.

Figure 10-6

Figure 10-6 Cluster diagram

You can use the cluster diagram to help you comprehend each cluster by looking at one variable at a time. To really understand and compare the composition of clusters (that is, what types of customers are in each group), you need to use the Cluster Profiles and Cluster Discrimination views. We can see in the diagram that Cluster 1 contains a fairly high percentage of customers with high sales and is arranged near to Cluster 2 and Cluster 6, but we need more complete information to be able to assign a meaningful name to these clusters.

Understanding the Composition of Clusters

The Cluster Profiles view shows all the clusters that were identified as columns, and each attribute that you selected for your model as the rows, as shown in Figure 10-7. Looking first at Cluster 1, we can see that all the customers in the group have an Internet Purchaser attribute of False, as well as an Internet Visitor of False. So, the mining algorithm has grouped customers together who have never visited the site or purchased anything online—all their purchases have been at a physical store. Note that we can come to this rather useful conclusion only because we understand the underlying business, which is a key point about data mining.

Figure 10-7

Figure 10-7 Cluster profiles

To give Cluster 1 the more sensible name of Store-Only Buyers, right-click the cluster name and select Rename Cluster. So, we now have a single cluster identified; what about the others? If you look at the next column, you can see that Cluster 2 differs from Store-Only Buyers in that all the customers in the cluster have actually visited the site, but they just haven't made any purchases online yet. We can call this cluster Browsers because they are customers who are (so far) using the site for information gathering only.

Cluster 6 contains visitors who have also made a purchase, but if we look closely at the Months Internet Purchaser and Months Internet User attributes, we learn that they are all relative newcomers to our site—all of them have been visitors and purchasers for between zero and three months (they are "Newbies"). We can continue the process of looking at each cluster, but the rest of the clusters are not quite so clear-cut, so we need a better tool for differentiating between them.

Discriminating Between Similar Clusters

If you look at the profiles of Clusters 8 and 9 in BI Development Studio, you will notice that they both have multiple values for the number of months that customers have been Internet visitors and Internet purchasers. This illustrates an important point about the clusters that the algorithm identifies: Every customer in the group does not have to have exactly the same value for every attribute. This is somewhat confusing when you start working with clusters; for example, you might have named a cluster Urban Professionals and then discover that it also contains a customer who lives in the countryside.

The reason for this is that the customer, when you look at all of his or her attributes together, is most similar to the customers who live in urban areas and have professional occupations. So naming a cluster Urban Professionals does not necessarily imply that it contains absolutely no manual laborers who live in the suburbs, but rather gives a high-level shorthand for the predominant combination of attributes in that cluster.

Because the clusters identified are therefore sometimes ambiguous, we need a way of discriminating between similar clusters to find out what exactly makes them different. We can use the Cluster Discrimination view, as shown in Figure 10-8, to select the two clusters we are interested in comparing and get an idea of what the most important differences are.

Figure 10-8

Figure 10-8 Cluster discrimination

We can see in the discrimination view that although the cluster profiles of 8 and 9 look similar, in fact Cluster 9 contains mostly customers who have made a visit and purchase in the past few months, are fairly frequent visitors, and have spent a lot of money with us—we could call this group Frequent Visitors. Cluster 8, on the other hand, contains mostly customers who have not visited the site for many months, although in the past they have spent some money with us. This cluster is probably one that we want to pay careful attention to, because they may now be doing their shopping with a competitor. That is, they may be Defectors.

With the cluster profile and discrimination views, we can understand the clusters well enough to give them meaningful names, so we can now turn our attention to providing this information back to users to enable them to perform analyses on the data using the clusters.

Analyzing with Data Mining Information

Analysis Services allows you to create a special type of dimension called a Data Mining dimension, which is based on a data mining model and can be included in a cube just like an ordinary dimension. The Data Mining dimension includes all the clusters that were identified by the algorithm, including any specific names that you assigned to them.

Adding a Data Mining Dimension to a Cube

We will use the data mining model we created in the previous Quick Start exercise and create a new dimension called Customer Internet Segmentation, as well as a new cube that includes this dimension. The existing Visit and Sales measure groups from the e-commerce cube will be linked into the new cube to be analyzed by the new dimension.

To create the dimension, open the data mining structure and go to the Mining Structure tab. Select Create a Data Mining Dimension on the Mining Model menu. Specify the dimension and cube names and click OK. Before you can use the new objects, you must deploy the solution and process the new dimension and cube.

Using the Data Mining Dimension

Because the dimension has been added to the cube, marketing database analysts can use the new segmentation to understand measures such as profitability or total sales for each of the clusters and refine the set of customers who will be targeted by the direct mail campaign to publicize the site's new DVD products. The list of customers can be provided either from a drillthrough action in a BI client tool or by building a Reporting Services customer list report that allows the user to select the cluster and other attributes.

Creating a Model for Product Recommendations

Our approach for product recommendations is based on the idea that we can use a mining model to look at every customer and the DVDs that they have bought, and then look for patterns of DVDs that often occur together. The Association Rules mining model is often used for this kind of analysis (sometimes called market basket analysis) and produces a set of rules that say, for example, if the customer is buying a DVD of The Godfather, what are the other movies that other buyers of The Godfather have purchased?

Each of these rules has a probability associated with them. For example, many customers may also have bought similar films, such as The Godfather Part II or Goodfellas, so the rules that relate The Godfather to these DVDs would have a high probability. If only a single customer bought It's a Wonderful Life and The Godfather, this rule would have a low probability. In data mining terminology, the number of times that a set of items occurs together is called the support, so the example of It's a Wonderful Life and The Godfather appearing together would have a support of 1.

We can use these rules to make a prediction: For a specific DVD, give me a list of the most probable DVDs that a customer might also enjoy.

Asking the Right Question

The best way to successfully set up a sensible data mining model is to be precise about the question you ask. Because we are looking for DVDs that sell well together, is the question we are asking "Which other DVDs have been bought during the same shopping trip?" or rather "Which other DVDs did customers also buy at some point?." If you were doing product recommendations on groceries, the first question would probably be the most sensible. The reason is that if someone is buying beer and wine today, we can probably recommend ice and potato chips because those are often sold in the same transaction.

However, in our case, we are trying to determine the likes and dislikes of consumers, which have a longer duration than just a single transaction. We are really trying to understand what kind of movies customers enjoy, so the second question is more appropriate for this business solution. To set up the model, we need to look at each customer and determine the list of DVDs that they have purchased. The data we are looking for looks something like Table 10-1. In data mining terminology, the customer would be the case, and the list of products for each customer would be a nested table.

Table 10-1. Customer DVD Purchase History

Customer

DVD

Customer 3283

The Godfather

The Godfather Part II

Dark City

Customer 3981

The Godfather Part II

Goodfellas

Customer 5488

The Godfather

It's a Wonderful Life

...

...

Understanding the Product Recommendation Rules

Once again, you need to deploy and process the model before you can view the results. The algorithm produces a list of products related to each other for a customer, and you can view these in the Itemsets tab of the mining model viewer. Each itemset also shows the support or number of times that the set occurred together. If you click the Rules tab, you can see the main feature of this mining model (see Figure 10-10): A set of rules that can be used to calculate the probability that a new DVD will be appropriate based on the existing DVDs in the customer's shopping basket.

Figure 10-10

Figure 10-10 Product recommendation rules

For the Association Rules algorithm, the settings that you choose have a big impact on the set of rules created. You can change these settings by right-clicking the model in the Mining Models tab and selecting Set Algorithm Parameters. If you end up with long processing times and too many rules, you could increase the minimum probability parameter, which would discard rules with a low probability, or you could increase the minimum support, which would discard rules that do not occur very often in the data. If, on the other hand, you end up with too few rules, decrease the minimum probability and support.

When we have finished creating and training the mining model, we can move on to using the mining model to make predictions for our Web application.

Add Data Mining Intelligence into a Web Application

Web applications that include dynamic content typically access relational databases to provide information to users, usually by using a data access library such as ADO.NET to execute an SQL query against the database, then looping through the resulting rows to create a Web page. The process for adding data mining information to a Web application is similar. A programming library called ADOMD.NET provides classes for querying Analysis Services, and the Data Mining eXtensions (DMX) query language is used to request information from the mining model.

Querying the Mining Model Using DMX

The DMX language is similar to standard SQL, but there are enough differences that you will need to spend some time working with the language before it becomes natural. As with all new query languages, it is often better to start out using graphical designers, such as the prediction query builder in SQL Server Management Studio, before moving on to hand-coding queries. There is also a thriving user community for Analysis Services data mining, and there are lots of samples available to get you started at www.sqlserverdatamining.com.

The prediction DMX query that we have built returns a long list of products, some of them with fairly low probabilities. What we actually need for the Web site is the top five or so best recommendations, so we can add a numeric parameter to the query to specify the maximum number of results to return. Our final DMX query looks like the following:

SELECT Predict ([Product], INCLUDE_STATISTICS, 5)
FROM [Product Recommendations]
NATURAL PREDICTION JOIN
  (SELECT
     (SELECT 'The Godfather' AS [Product Name]
       UNION SELECT 'Battlestar Galactica Season I' AS [Product Name])
 AS [Product]) AS a

Executing DMX Queries from a Web Application

Our Web application is an ASP.NET application, so we can take advantage of the ADOMD.NET classes to execute the DMX query. The DMX query that we designed earlier returns the information in a hierarchical format, and we would need to navigate through it to extract the product names. We can simplify the code by using the SELECT FLATTENED syntax, which returns the information as a simple list of products that we can add to the Web page.

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Overview


Pearson Education, Inc., 221 River Street, Hoboken, New Jersey 07030, (Pearson) presents this site to provide information about products and services that can be purchased through this site.

This privacy notice provides an overview of our commitment to privacy and describes how we collect, protect, use and share personal information collected through this site. Please note that other Pearson websites and online products and services have their own separate privacy policies.

Collection and Use of Information


To conduct business and deliver products and services, Pearson collects and uses personal information in several ways in connection with this site, including:

Questions and Inquiries

For inquiries and questions, we collect the inquiry or question, together with name, contact details (email address, phone number and mailing address) and any other additional information voluntarily submitted to us through a Contact Us form or an email. We use this information to address the inquiry and respond to the question.

Online Store

For orders and purchases placed through our online store on this site, we collect order details, name, institution name and address (if applicable), email address, phone number, shipping and billing addresses, credit/debit card information, shipping options and any instructions. We use this information to complete transactions, fulfill orders, communicate with individuals placing orders or visiting the online store, and for related purposes.

Surveys

Pearson may offer opportunities to provide feedback or participate in surveys, including surveys evaluating Pearson products, services or sites. Participation is voluntary. Pearson collects information requested in the survey questions and uses the information to evaluate, support, maintain and improve products, services or sites, develop new products and services, conduct educational research and for other purposes specified in the survey.

Contests and Drawings

Occasionally, we may sponsor a contest or drawing. Participation is optional. Pearson collects name, contact information and other information specified on the entry form for the contest or drawing to conduct the contest or drawing. Pearson may collect additional personal information from the winners of a contest or drawing in order to award the prize and for tax reporting purposes, as required by law.

Newsletters

If you have elected to receive email newsletters or promotional mailings and special offers but want to unsubscribe, simply email information@informit.com.

Service Announcements

On rare occasions it is necessary to send out a strictly service related announcement. For instance, if our service is temporarily suspended for maintenance we might send users an email. Generally, users may not opt-out of these communications, though they can deactivate their account information. However, these communications are not promotional in nature.

Customer Service

We communicate with users on a regular basis to provide requested services and in regard to issues relating to their account we reply via email or phone in accordance with the users' wishes when a user submits their information through our Contact Us form.

Other Collection and Use of Information


Application and System Logs

Pearson automatically collects log data to help ensure the delivery, availability and security of this site. Log data may include technical information about how a user or visitor connected to this site, such as browser type, type of computer/device, operating system, internet service provider and IP address. We use this information for support purposes and to monitor the health of the site, identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents and appropriately scale computing resources.

Web Analytics

Pearson may use third party web trend analytical services, including Google Analytics, to collect visitor information, such as IP addresses, browser types, referring pages, pages visited and time spent on a particular site. While these analytical services collect and report information on an anonymous basis, they may use cookies to gather web trend information. The information gathered may enable Pearson (but not the third party web trend services) to link information with application and system log data. Pearson uses this information for system administration and to identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents, appropriately scale computing resources and otherwise support and deliver this site and its services.

Cookies and Related Technologies

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

Do Not Track

This site currently does not respond to Do Not Track signals.

Security


Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

Children


This site is not directed to children under the age of 13.

Marketing


Pearson may send or direct marketing communications to users, provided that

  • Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
  • Such marketing is consistent with applicable law and Pearson's legal obligations.
  • Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
  • Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

Correcting/Updating Personal Information


If a user's personally identifiable information changes (such as your postal address or email address), we provide a way to correct or update that user's personal data provided to us. This can be done on the Account page. If a user no longer desires our service and desires to delete his or her account, please contact us at customer-service@informit.com and we will process the deletion of a user's account.

Choice/Opt-out


Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s) simply visit the following page and uncheck any communication you no longer want to receive: www.informit.com/u.aspx.

Sale of Personal Information


Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

Supplemental Privacy Statement for California Residents


California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

Sharing and Disclosure


Pearson may disclose personal information, as follows:

  • As required by law.
  • With the consent of the individual (or their parent, if the individual is a minor)
  • In response to a subpoena, court order or legal process, to the extent permitted or required by law
  • To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
  • In connection the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
  • To investigate or address actual or suspected fraud or other illegal activities
  • To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
  • To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
  • To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

Links


This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.

Requests and Contact


Please contact us about this Privacy Notice or if you have any requests or questions relating to the privacy of your personal information.

Changes to this Privacy Notice


We may revise this Privacy Notice through an updated posting. We will identify the effective date of the revision in the posting. Often, updates are made to provide greater clarity or to comply with changes in regulatory requirements. If the updates involve material changes to the collection, protection, use or disclosure of Personal Information, Pearson will provide notice of the change through a conspicuous notice on this site or other appropriate way. Continued use of the site after the effective date of a posted revision evidences acceptance. Please contact us if you have questions or concerns about the Privacy Notice or any objection to any revisions.

Last Update: November 17, 2020