Home > Articles > Programming > Windows Programming

This chapter is from the book

Working with Mining Models

Building a Mining Model

Now that the database has been created and populated, the next step is to create a mining model. Mining models can be created with the Mining Model Editor in Analysis Manager or programmatically with Decision Support Objects (DSO). Using DSO to create mining models is useful when you need to programmatically automate the mining-model process. For the most part, you will use Analysis Manager to create mining models.

Analysis Manager (see Figure 5.3) allows you to create and manage databases used by Analysis Services. A database node in Analysis Manager does not represent the physical storage of a large database. Instead, it represents the database that Analysis Services will use to hold the mining-model definitions and the results of processing these mining models. It will, however, reference a physical data source.

Figure 5.3

Figure 5.3 Screenshot of Analysis Manager, the utility used to create and manage mining models with Analysis Services. The Data Link Properties dialog box, used to specify the data source, is visible.

Each database in Analysis Manager is associated with one or more data sources. These data sources can be either relational databases or data warehouses. Data sources are created by right-clicking the Data Sources node and selecting New Data Source. From the Data Link Properties dialog, a data provider is selected along with the connection information. Analysis Services supports SQL Server, Access, and Oracle databases as data sources.

Mining models are the blueprint for how the data should be analyzed or processed. Each model represents a case set, or a set of cases (see Table 5.1). The mining-model object stores the information necessary to process the model—for instance, what queries are needed to get the data fields, what data fields are input columns or predictable columns, and what relationship each column has with other columns. Input columns are attributes whose values will be used to generate results for the predictable columns. In some cases, the attribute may serve both as an input column and a predictable column.

Once the model is processed, the data associated with the mining model represents what was learned from the data. The actual data from the training dataset is not stored in the Analysis Services database. The results of analyzing that data, however, are saved.

To quickly demonstrate the process of creating these models, we will walk through the process of creating a mining model using Analysis Manager.

To begin, open Analysis Manager from the Analysis Services menu item. You will then need to create a new database and specify the data source by executing the following steps:

  1. Right-click the server name in the left-hand pane and select New Database...

  2. Specify ‘SavingsMart’ as the Database Name and click OK

  3. Expand the newly added SavingsMart node, right-click Data Sources, and select New Data Source...

  4. From the Data Link Properties dialog box, select Microsoft OLE DB Provider for SQL Server and click Next

  5. Enter the SQL connection information for your SQL Server and test the connection before closing the Data Link Properties dialog

The next thing to do is create the mining model using the mining-model wizard. To do so, execute the following steps:

  1. Right-click Mining Models in the left-hand pane and select New Mining Model...

  2. Click Next on the Welcome Dialog

  3. Click Next on the Select Source Dialog because we will be using Relational Data

  4. Select the vw_Shipments view from the Available Tables list box in the Select Case Tables dialog and click Next. It would have been possible to select multiple tables, but utilizing the view allows access to a calculated field indicating the number of days between shipments.

  5. Click Next to accept the default of Microsoft Decision Trees as the data-mining technique.

  6. Click Next to accept ShipmentID as the default Case Key Column.

  7. Select the Finish the mining model in the editor checkbox and click Next.

  8. Name the model "Analyze Shipments" and click Finish.

  9. From the Relational Mining Model Editor, as seen in Figure 5.4, click Insert and Column... and then select the column named DaysSinceLastShipped. Once added, change the usage to Input and Predictable (note that a diamond icon now appears next to the column). Then go to the Advanced Properties and enter DISCRETIZED(CLUSTERS) as the content type.

  10. Figure 5.4

    Figure 5.4 Screenshot of the Schema tab in the Relational Mining Model Editor after the columns have been added for the Analyze Shipments mining model.

  11. Click Insert and Column... and then select the column named StoreID. Once added, change the usage to Input and Predictable.

  12. Click Insert and Column... and then select the column named Quantity. Once added, change the usage to Predictable, and from the Advanced Properties tab, enter DISCRETIZED(CLUSTERS) as the content type.

  13. Click Insert and Column... and then select the column named VendorName.

  14. Click Insert and Column... and then select the column named ProductType.

  15. Click Tools and Process Mining Model... Click OK when asked to save the mining model. Then click OK to start a full process of the mining model. This process will take several minutes to run if you loaded data for all five stores. When complete, the message "Processing Complete" will appear in green text.

Training the Mining Model

Training the mining model is accomplished by processing the results of a mining model using Analysis Manager. Alternatively, the same thing could be accomplished using a scripting language known as Data Definition Language (DDL) and a connection to the Analysis Server. We can see what DDL commands are used to train the model through the Process dialog box, as shown in Figure 5.5.

Figure 5.5

Figure 5.5 Screenshot of the dialog that appears when full process is initiated for a mining model. Note the DDL syntax used to create the model and then train it by populating it with historical data.

DDL is useful in cases when you want to programmatically process a mining model. The language can be executed through a connection to the Analysis Server. It is also useful for demonstrating how Analysis Manager processes a mining model.

A mining model is created using the CREATE MINING MODEL syntax. The syntax is similar to Transact SQL and should be instantly familiar to SQL developers. The CREATE statement for this mining model is as follows:

CREATE MINING MODEL [Analyze Shipments]( 
    [Shipment Id] LONG KEY, 
    [Days Since Last Shipped] LONG DISCRETIZED(CLUSTERS) PREDICT, 
    [Store Id] LONG DISCRETE, 
    [Quantity] LONG DISCRETIZED(CLUSTERS) PREDICT_ONLY, 
    [Vendor Name] TEXT DISCRETE, 
    [Product Type] TEXT DISCRETE) 
USING Microsoft_Decision_Trees

With the preceding statement, we are creating a new mining model named Analyze Shipments. The model utilizes Shipment ID as the case key. Days Since Last Shipped, and Quantity are each defined as predictable columns, but Days Since Last Shipped also functions as an input column. The remaining columns, Store ID, Vendor Name, and Product Type, are input columns only. Mining-model columns are defined as either input, predictable, or input and predictable.

The process of training a model involves the insertion of data into the mining model using the INSERT INTO syntax, as follows:

INSERT INTO [Analyze Shipments]
   (SKIP,[Days Since Last Shipped], [Store Id], [Quantity],
   [Vendor Name], [Product Type]) 
OPENROWSET('SQLOLEDB.1','Provider=SQLOLEDB.1;Integrated
   Security=SSPI;Persist Security Info=False;Initial
   Catalog=SavingsMart;Data Source=(local)', 
   'SELECT DISTINCT "dbo"."vw_Shipments"."ShipmentID" 
   AS "Shipment Id", "dbo"."vw_Shipments"."DaysSinceLastShipped"
   AS "Days Since Last Shipped", "dbo"."vw_Shipments"."StoreID"
   AS "Store Id", "dbo"."vw_Shipments"."Quantity" AS "Quantity",
   "dbo"."vw_Shipments"."VendorName" AS "Vendor Name",
   "dbo"."vw_Shipments"."ProductType" AS "Product Type" 
   FROM "dbo"."vw_Shipments"')

The mining model will not store the actual data, but will store the prediction results instead once the mining algorithm is processed. In the preceding statement, the OPENROWSET keyword was used to specify the location of the physical data source.

Interpreting the Results

To examine the results from processing the model, select the Content tab. Figure 5.6 is a screenshot of the content detail when analyzing DaysSinceLastShipped. This screen indicates that VendorName was the most significant factor affecting DaysSinceLastShipped. We know this because it is the first split on the tree. For nodes that have additional branches, two lines will follow the node. To view the additional branches, double-click that node and the detail page will drill down to the next level.

Figure 5.6

Figure 5.6 Screenshot of the content detail after processing the Analyze Shipments mining model. These results indicate the highest prediction level for the DaysSinceLastShipped column.

The Content Navigator box—seen in the top-right corner—offers an easy way to see all the mining-model results and drill down into a certain path. The Attributes box shows the totals associated with each node, grouped according to a clustering algorithm. In Figure 5.6, the cursor is selecting the outermost node labeled All. In this example, the attributes are shown for all the cases analyzed.

If you attached your database using the file provided, your processing results should be the same as the ones we are about to interpret. The first thing to notice is that the darkest-shaded node is the one where the Vendor Name is Store Brand. Nodes that resulted in a higher data density, or more cases analyzed, will be shaded in a darker color. This result is not surprising, because 127 of the 500 products available, or 25 percent, are represented by the Store Brand. This can be confirmed in Query Analyzer with the following query:

SELECT v.VendorName,(COUNT(ProductID)/500.0) AS 'Percent'
FROM Products p
LEFT JOIN Vendors v ON p.VendorID = v.VendorID
GROUP BY v.VendorName
ORDER BY 'Percent' DESC

If the Store Brand node is double-clicked, the detail pane will show the next branching of the tree (see Figure 5.7). For the Store Brand Node, the first branching distinguishes between the different stores. If we click on the node Store ID = 2 and look at the attributes, the value with the highest probability is 119.33. This indicates that for all products where the Vendor name is Store Brand and the Store ID is 2, it is highly probable that there should be 119 days between shipments.

Figure 5.7

Figure 5.7 Screenshot of the Content Detail Editor as it displays the predictions for days since last shipped. In this example, the node path is where Store ID is 2 and the vendor name is Store Brand.

If we examine the attributes for the remaining nodes, we will see that predictions can be made for all the stores. For Store 1, there is one additional branching that distinguishes between a product type of Snack Foods versus all other product types. When the Store ID is 1, vendor name is Store Brand, and product type is snack foods, there is a 58 percent probability that there will be 60 days between shipments. When we examine the attributes where product type is not snack foods, there is a 43 percent probability that there will be 119 days between shipments and a 53 percent probability that there will be 85 days between shipments. In this case, we could say that the 53 percent probability wins the toss, but that might not always be the best decision. This will be discussed further in the next chapter.

If you use the drop-down box above the Content Detail pane named Prediction Tree to select the quantity column, you will see that the main factor affecting quantity is the days since last shipped (see Figure 5.8). This is possible because the column DaysSinceLastShipped was defined as an input and a predictable column.

Figure 5.8

Figure 5.8 Screenshot of the Content Detail Editor as it displays the predictions for Quantity. In this example, the node path examined is where Days Since Last Shipped is less than or equal to 48 and Store ID is not equal to 1 and vendor name is not equal to Kamp.

The next factor affecting quantity is the vendor name. In the case where the vendor name is NOT Kamp, Store ID is an additional factor. In Figure 5.8 we can see that when the days since last shipped is less than or equal to 48 and the Store ID is NOT 1 and the vendor name is NOT Kamp, there is a 98 percent probability that the quantity should be 200. When the Store ID is equal to 1, the prediction drops to a 72 percent probability that the quantity will be 200.

The next chapter will involve interpreting the results from the mining model and then applying the predictions to a new shipment strategy. The goal of the new shipment strategy will be to reduce Savings Mart’s operational costs by reducing the total number of shipments.

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Overview


Pearson Education, Inc., 221 River Street, Hoboken, New Jersey 07030, (Pearson) presents this site to provide information about products and services that can be purchased through this site.

This privacy notice provides an overview of our commitment to privacy and describes how we collect, protect, use and share personal information collected through this site. Please note that other Pearson websites and online products and services have their own separate privacy policies.

Collection and Use of Information


To conduct business and deliver products and services, Pearson collects and uses personal information in several ways in connection with this site, including:

Questions and Inquiries

For inquiries and questions, we collect the inquiry or question, together with name, contact details (email address, phone number and mailing address) and any other additional information voluntarily submitted to us through a Contact Us form or an email. We use this information to address the inquiry and respond to the question.

Online Store

For orders and purchases placed through our online store on this site, we collect order details, name, institution name and address (if applicable), email address, phone number, shipping and billing addresses, credit/debit card information, shipping options and any instructions. We use this information to complete transactions, fulfill orders, communicate with individuals placing orders or visiting the online store, and for related purposes.

Surveys

Pearson may offer opportunities to provide feedback or participate in surveys, including surveys evaluating Pearson products, services or sites. Participation is voluntary. Pearson collects information requested in the survey questions and uses the information to evaluate, support, maintain and improve products, services or sites, develop new products and services, conduct educational research and for other purposes specified in the survey.

Contests and Drawings

Occasionally, we may sponsor a contest or drawing. Participation is optional. Pearson collects name, contact information and other information specified on the entry form for the contest or drawing to conduct the contest or drawing. Pearson may collect additional personal information from the winners of a contest or drawing in order to award the prize and for tax reporting purposes, as required by law.

Newsletters

If you have elected to receive email newsletters or promotional mailings and special offers but want to unsubscribe, simply email information@informit.com.

Service Announcements

On rare occasions it is necessary to send out a strictly service related announcement. For instance, if our service is temporarily suspended for maintenance we might send users an email. Generally, users may not opt-out of these communications, though they can deactivate their account information. However, these communications are not promotional in nature.

Customer Service

We communicate with users on a regular basis to provide requested services and in regard to issues relating to their account we reply via email or phone in accordance with the users' wishes when a user submits their information through our Contact Us form.

Other Collection and Use of Information


Application and System Logs

Pearson automatically collects log data to help ensure the delivery, availability and security of this site. Log data may include technical information about how a user or visitor connected to this site, such as browser type, type of computer/device, operating system, internet service provider and IP address. We use this information for support purposes and to monitor the health of the site, identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents and appropriately scale computing resources.

Web Analytics

Pearson may use third party web trend analytical services, including Google Analytics, to collect visitor information, such as IP addresses, browser types, referring pages, pages visited and time spent on a particular site. While these analytical services collect and report information on an anonymous basis, they may use cookies to gather web trend information. The information gathered may enable Pearson (but not the third party web trend services) to link information with application and system log data. Pearson uses this information for system administration and to identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents, appropriately scale computing resources and otherwise support and deliver this site and its services.

Cookies and Related Technologies

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

Do Not Track

This site currently does not respond to Do Not Track signals.

Security


Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

Children


This site is not directed to children under the age of 13.

Marketing


Pearson may send or direct marketing communications to users, provided that

  • Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
  • Such marketing is consistent with applicable law and Pearson's legal obligations.
  • Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
  • Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

Correcting/Updating Personal Information


If a user's personally identifiable information changes (such as your postal address or email address), we provide a way to correct or update that user's personal data provided to us. This can be done on the Account page. If a user no longer desires our service and desires to delete his or her account, please contact us at customer-service@informit.com and we will process the deletion of a user's account.

Choice/Opt-out


Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s) simply visit the following page and uncheck any communication you no longer want to receive: www.informit.com/u.aspx.

Sale of Personal Information


Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

Supplemental Privacy Statement for California Residents


California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

Sharing and Disclosure


Pearson may disclose personal information, as follows:

  • As required by law.
  • With the consent of the individual (or their parent, if the individual is a minor)
  • In response to a subpoena, court order or legal process, to the extent permitted or required by law
  • To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
  • In connection the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
  • To investigate or address actual or suspected fraud or other illegal activities
  • To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
  • To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
  • To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

Links


This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.

Requests and Contact


Please contact us about this Privacy Notice or if you have any requests or questions relating to the privacy of your personal information.

Changes to this Privacy Notice


We may revise this Privacy Notice through an updated posting. We will identify the effective date of the revision in the posting. Often, updates are made to provide greater clarity or to comply with changes in regulatory requirements. If the updates involve material changes to the collection, protection, use or disclosure of Personal Information, Pearson will provide notice of the change through a conspicuous notice on this site or other appropriate way. Continued use of the site after the effective date of a posted revision evidences acceptance. Please contact us if you have questions or concerns about the Privacy Notice or any objection to any revisions.

Last Update: November 17, 2020