Home > Articles > Home & Office Computing > Microsoft Applications

This chapter is from the book

Planning the Workbook Structure

Your workbook needs three types of worksheets: one type to collect the data from your web queries, one type to bring the query data together in a single place, and one type to run whatever analyses you decide are needed. (You also need a VBA module to hold your code, but that is covered in a later section of this chapter.)

Those three types of worksheet are discussed in the next three sections.

Query Sheets

If you haven’t used Excel to retrieve data from the Web, you might be surprised at how easy it is. I’m not speaking here of using your browser to get to a web page, selecting and copying some or part of its contents, and then pasting back into the workbook. I’m speaking of queries that can execute automatically and on a predetermined schedule, thus enabling you to walk away and let the computer gather data without you micromanaging it.

Suppose that you want to retrieve data from Amazon about a book entitled Statistical Analysis with Excel. You open a new workbook and rename an unused worksheet to something such as “Stats.”

Next, start your browser if necessary and navigate to Amazon’s page for that book. When you’re there, copy the page’s full address from the browser’s address box—drag across the address to highlight it and either press Ctrl + C or use the browser’s Edit menu to choose the Copy command.

Switch back to Excel and click the Ribbon’s Data tab. Click the From Web button in the Get External Data group. The New Web Query window displays as shown in Figure 1.1.

Figure 1.1

Figure 1.1. What you see at first depends on your brows-er’s home page.

Drag through the address that appears in the Address box and press Ctrl + V to paste the address you copied. When you click the Go button, the query window opens the Amazon page (see Figure 1.2).

Figure 1.2

Figure 1.2. Web pages normally consist of several tables, rectangular areas that divide the page into different segments.

Typically you see one or more square icons (a black arrow on a yellow background) that identify the locations of different parts of the web page, called tables. You can select them by clicking them. When you click one of these icons to select it, the icon turns green to indicate that you want to download the data in that table.

When you position your mouse pointer over a yellow icon, a heavy border appears around the table that’s associated with the icon. This helps you select one or more tables to download.

Nevertheless, I recommend that you select the entire page by clicking the icon in the upper-left corner of the window. If you select only a table or tables that contain the data you’re interested in, it could easily happen that a day, week, or month from now the page might be changed, so that the data you want appears in a different table. Then your query will miss the data.

But if you select the entire web page instead of just a table or tables, the page’s owner would have to remove the data you’re interested in completely for you to miss it, and in that case it wouldn’t matter how much of the page you selected to download.

The individual table icons are useful mainly when you want to do a one-time-only download from a web page. Then you might want to avoid downloading a lot of extraneous stuff that would just make it harder to find the data you’re after. In the type of case I’m describing here, though, you’ll let Excel do the finding for you.

Furthermore, you don’t save much time or bandwidth by selecting just a subset of the web page. In most cases you’re picking up a few thousand bytes of text at most in an entire page.

Speed of Execution

Nevertheless, you should be aware of a speed versus version tradeoff. I have learned that using Excel 2007 and 2010, web queries can take significantly more time to complete than in earlier versions of Excel. Among the changes made to Excel 2007 was the addition of much more thorough checks of the data returned from web pages for malicious content.

I’ve found that when using Excel 2002, it takes about 30 seconds to execute eight web queries in the way I’m describing here. Using Excel 2010, it takes nearly three times as long.

The basic intent of the project I’m describing here is to automatically and regularly update your downloaded data, so it probably seems unimportant to worry about whether the process takes half a minute or a minute and a half. Occasionally, though, for one reason or another I want to get an immediate update of the information and so I force the process to run manually. On those occasions I’d rather not wait.

Perhaps I shouldn’t, but I trust the results of my Amazon queries to be free of malicious content, so I run my app on the more quick-footed Excel 2002. It’s safer, though, to give yourself the added protections afforded by Excel 2007 or 2010, and if you can stand the extra minute or so of query execution time then by all means you should use the slower, safer way.

Bringing the Data Back

After you have clicked a yellow icon to turn it green (using, I hope, the one in the upper-left corner of the New Web Query window so that you get the entire page), click the Import button at the bottom of the New Web Query window. After a few seconds, the Import Data window appears as shown in Figure 1.3.

Figure 1.3

Figure 1.3. You can import immediately, but it’s a good idea to check the property settings first.

Accept the default destination of cell A1 and click OK. (The reason to use cell A1 becomes apparent when it’s time to refresh the query using VBA, later in this chapter.) There are some useful properties to set, so I recommend that you click the Properties button before you complete the import. The Properties window that displays is shown in Figure 1.4.

Figure 1.4

Figure 1.4. By default the Save Query Definition checkbox is filled, but you should verify the setting when you create the query.

Be sure that the Save Query Definition box is checked. That way you can repeatedly run the query without having to define it all over again.

The Enable Background Refresh checkbox requires a little explanation. If it is filled, any VBA procedure that is running continues running as the query executes, and other direct actions by the user can proceed normally. Sometimes that can cause problems if a procedure depends on the results of the query: If the procedure expects to find data that isn’t available yet, you might get a run-time error or a meaningless result. Therefore, I usually clear the Enable Background Refresh checkbox.

The various options in the Properties dialog box listed under Data Formatting and Layout are subtly different but can be important. I spend three pages detailing the differences in another Que book, Managing Data with Excel, and I don’t propose to do it again here. For present purposes, you might just as well accept the default values.

Click OK when you have made any changes you want and then click OK in the Import Data window. Excel completes the query, usually within a few seconds, and writes the results to the worksheet (see Figure 1.5).

Figure 1.5

Figure 1.5. When the query has finished executing, you wind up with a haystack of text and numbers. The next step is to find the needle.

Finding the Data

After the data has been retrieved from the web page, the next task is to locate the piece or pieces of information you’re looking for. I want to stress, though, that you need do this once only for each product you’re tracking—and quite possibly just once for all the products. It depends on how the web page administrator is managing the data.

What you need to look for is a string of text that’s normally a constant: one that doesn’t change from hour to hour or day to day. Figure 1.6 shows a portion of the results of a query.

Figure 1.6

Figure 1.6. In this case, the string Sellers Rank in cell A199 uniquely locates the product ranking.

If you use Excel’s Find feature to scan a worksheet for the string Sellers Rank, you can locate the worksheet cell that also contains the ranking for the product. With just a little more work, which you can easily automate and which I describe in the next section about VBA code, you can isolate the actual ranking from the surrounding text; it’s that ranking that you’re after.

Why not just note the cell address where the ranking is found after the query is finished? That would work fine if you could depend on the web page’s layout remaining static. But the website administrator has only to add an extra line, or remove one, above the data’s current location, and that will throw off the location of the cell with the data you’re after. No, you have to look for it each time, and the Find operation occurs very fast anyway.

Summary Sheets

After you’ve acquired the data from a web page and isolated the figure you’re looking for, you need a place to put that figure plus other relevant information such as date and time. That place is normally a separate worksheet. You normally expect to be querying the same web page repeatedly, as hours and days elapse. Therefore, you’ll want to store information that you’ve already retrieved somewhere that won’t get overwritten the next time the query runs.

So, establish an unused worksheet and name it something appropriate such as Summary or Synthesis or All Products. There are a few structural rules covered in the next section that you’ll find helpful to follow. But you can include some other useful analyses on the summary sheet, as long as they don’t interfere with the basic structure.

Structuring the Summary Sheet

Figure 1.7 shows the structures that I put on my summary sheet.

Figure 1.7

Figure 1.7. You can put snapshot analyses supported by worksheet functions on the summary sheet.

In Figure 1.7, the first few columns are reserved for the rankings that I have obtained via web queries from the appropriate Amazon pages. I also store the date and time the queries finished executing in column A. That time data provides my basis for longitudinal summaries: a baseline for the forecasting analyses that I discuss in Chapters 3, 4, 5, and 9.

It’s at this point that you have a decision to make. It’s nice to be able to retrieve data about sales rankings for products such as books. If you’ve written a good one, it’s gratifying to see the rankings drop as time goes by. (Remember, high rankings are better: A rank of 1 is a best seller.) But you likely never got paid a royalty or a commission, or had to fill a reorder, strictly on the basis of a sales ranking. It’s the sales themselves that you’re ultimately seeking: Granted that intermediate objectives such as clicks and conversions and rankings are important indicators, they don’t directly represent revenue.

Identifying Sales

So how do you translate sales rankings into a count of sales? I started by tracking sales rankings on Amazon for about a week and noticed some points of interest.

Telling a Sale from No Sale

A jump from a lower ranking to a higher ranking probably means the sale of at least one item. If the item has no sales during a given period, its ranking declines as other items do sell and move up.

Ranking Sales

How do you rank sales? You can’t do it strictly on the number sold. A book, for example, might have sold 200 copies over a two-year period. Another book might have sold 100 copies since it was published last week. The second book is clearly performing better than the first, so you have to combine elapsed time somehow with number sold. Amazon doesn’t say, but my guess would be that the rankings are based in part on the ratio of sales to days elapsed since publication—in other words, sales per day.

Improved Rankings Without Sales

There are periods when an item’s ranking improves very gradually over a period of hours. There’s no reason to believe that an improvement from a ranking of, say, 20,000 to 19,999 indicates a sale. More likely it is a result of another day passing and the rankings recalculating accordingly. That means that before you conclude a sale took place, you need a minimum criterion.

Deciding on a Criterion

The criterion should be a rate, not a constant number. If a book jumps from a ranking of 200,101 to 200,001, that 100-place increase is considerably different from a book that jumps from a ranking of 101 to 1. I decided to conclude that a sale had taken place if an increase in rankings equaled or exceeded ten percent of the prior ranking. So, if a book ranked 15,000 at 3:00 p.m. and 13,000 at 4:00 p.m.:

  • (15000 − 13000)/15000 = 0.13 or 13%

I conclude that a sale took place.

Structuring the Formula

Suppose that I have two rankings for a given product, one taken at 3:00 p.m. in cell C18 and one taken at 4:00 p.m. in cell C19. If I want to test whether a sale took place between 3:00 p.m. and 4:00 p.m., I can enter this formula in, say, L19:

  • =IF((C18-C19)/C18>0.1,1,0)

The formula returns a 1 if the difference between the two rankings is positive (for example, an increase from a ranking of 1,000 to 900 is positive) and exceeds 10% of the earlier ranking. The formula returns a zero otherwise. After the formula is established on the work-sheet, I use the same VBA code that re-executes the queries to copy the formula to the next available row.

Snapshot Formulas

I also like to watch two other statistics that don’t depend on an ordered baseline of data the way that sales estimates do. These are the total sales per book and the minimum (that is, the highest) sales ranking attained by each book since I started tracking the rankings.

I use Excel’s MIN() and SUM() functions to get those analyses. I put them at the top of the columns so that they won’t interfere with the results that come back from the web pages as the queries execute over time.

Figure 1.8 shows what those analyses look like.

Figure 1.8

Figure 1.8. You can put snapshot analyses supported by worksheet functions on the summary sheet.

So, for example, cell J2 might contain this formula:

  • =SUM(J5:J1000000)

It sums the values from the fifth to the millionth row in column J, which contains a 1 for every assumed sale, and a 0 otherwise. The result tells me the number of copies of this book that I assume have been sold by Amazon, judging by the changes in sales rankings.

To get the minimum, best sales ranking for the same book, I use this formula in cell T2:

  • =MIN(B4:B1000000)

The formula returns the smallest numeric value for the fourth to the millionth row in column B.

Notice that the range address in these formulas uses a constant, 1,000,000. There are more elegant ways of making sure that you capture all relevant cells (such as dynamic range names and tables), but this one is simple, pragmatic, and doesn’t slow down processing.

More Complicated Breakdowns

Figure 1.9 shows a table I use for a quick monthly tabulation of sales of certain books. I get similar tables for daily breakdowns, but they are pivot tables and can get a little cumbersome when you have as many as a couple of hundred days to summarize. The table in Figure 1.9 is driven by worksheet functions and is a quick monthly overview instead of a more detailed daily analysis.

Figure 1.9

Figure 1.9. This table counts sales per month and projects sales for a full month.

Column S in Figure 1.9 simply contains the numbers of the months that I’m interested in: May through December. When I get to May 2012, it will be necessary to add a column with the year, to distinguish May 2011 from May 2012.

Columns T and U contain array formulas. I describe the formulas in column T here; the formulas in column U work the same way but use different columns as the data sources.

The array formulas in column T check the month number in column S against the month implied by the date in column A. If the two month indicators are equal, the formula sums the values in columns J and K. I go into the topic of array formulas, what they require, and why they’re sometimes needed, in Chapter 5, “Forecasting a Time Series: Regression.” For now, though, it’s enough to be aware that you need to enter these formulas with a special keyboard sequence. Instead of simply typing the formula and pressing Enter, you type it and then press Ctrl + Shift + Enter. This procedure is called array-entering a formula. Excel notices that you have array-entered a formula, and in that case it surrounds the formula as shown in the formula box with curly braces.

It’s useful to take the formula apart to see what’s going on inside it. Here’s the full formula for the August figures:

  • =SUM(IF(MONTH($A$3:$A$1000000)=$S5238,J$3:K$1000000,0))

Working from the inside out, consider this fragment:

  • MONTH($A$3:$A$1000000)

The MONTH() function returns the month of a date, and column A consists of dates following the second row. When you apply the MONTH function to an array of dates, as is done here, you get an array of month numbers. (That’s one reason the formula must be entered as an array formula using Ctrl + Shift + Enter: The MONTH() function normally expects to evaluate a single date value, not a whole array of them. Array-entering the formula alerts Excel to the possibility that a function that normally takes a single value will be taking multiple values.)

That MONTH() fragment is surrounded by an IF() function. The IF() function tests whether the month numbers returned by MONTH() equal the number in cell S5238. In Figure 1.9, cell S5238 contains the number 8—the eighth month, or August. So the IF() function begins by converting all those month numbers from MONTH() to TRUE and FALSE values, depending on whether MONTH() returns an 8.

The next argument to the IF() function in this case is the range J3:K1000000. That range contains the 1’s and 0’s calculated from the changes in the sales rankings. Those 1’s and 0’s are retained by the IF() given that the logical test, month equals 8, is true. The IF() function retains any row where the month of the date in column A is 8, and in that case it holds on to the values found in columns J and K.

What if a date in column A does not fall in the month of August? Then the logical test posed by the IF() function fails, and the IF()’s third argument, 0, is used.

Here’s the formula as I’ve discussed it so far:

  • IF(MONTH($A$3:$A$1000000)=$S5238,J$3:K$1000000,0)

More briefly: If the dates in column A are in the month shown in S5238 (which is 8), use the associated values in columns J and K. Otherwise, use the number 0.

Finally, surround the fragment with the SUM() function:

  • =SUM(IF(MONTH($A$3:$A$1000000)=$S5238,J$3:K$1000000,0))

That is, take the sum of the values in columns J and K if they were observed during August, using zero instead if they were observed during some other month.

The same formula, adjusted to total two different columns, is in cell U5238:

  • =SUM(IF(MONTH($A$3:$A$1000000)=$S5238,L$3:M$1000000,0))

Here, we total the values found in columns L and M if the associated date is from August. Columns L and M contain sales estimates for a different book than columns J and K. (You probably have noticed that a book gets two columns because there is a paperback edition as well as a Kindle edition of each.)

Column V simply totals the results from columns T and U:

  • =T5238+U5238

And column W calculates the average number of books sold during the month:

  • =V5238/31

That calculation is a little tricky during the current month. Cell W5242 uses this formula while the current date is still within December:

  • =V5242/(TODAY()-DATEVALUE(“11/30/2011”))

I don’t want to divide by 31 before we reach the end of the month, so I need to calculate the number of days that have elapsed during the current month. That number is the difference between the value returned by the TODAY() function (which gives today’s date) minus the value of the last day in the prior month. So if today is December 20, the result of the subtraction is 20.

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Overview


Pearson Education, Inc., 221 River Street, Hoboken, New Jersey 07030, (Pearson) presents this site to provide information about products and services that can be purchased through this site.

This privacy notice provides an overview of our commitment to privacy and describes how we collect, protect, use and share personal information collected through this site. Please note that other Pearson websites and online products and services have their own separate privacy policies.

Collection and Use of Information


To conduct business and deliver products and services, Pearson collects and uses personal information in several ways in connection with this site, including:

Questions and Inquiries

For inquiries and questions, we collect the inquiry or question, together with name, contact details (email address, phone number and mailing address) and any other additional information voluntarily submitted to us through a Contact Us form or an email. We use this information to address the inquiry and respond to the question.

Online Store

For orders and purchases placed through our online store on this site, we collect order details, name, institution name and address (if applicable), email address, phone number, shipping and billing addresses, credit/debit card information, shipping options and any instructions. We use this information to complete transactions, fulfill orders, communicate with individuals placing orders or visiting the online store, and for related purposes.

Surveys

Pearson may offer opportunities to provide feedback or participate in surveys, including surveys evaluating Pearson products, services or sites. Participation is voluntary. Pearson collects information requested in the survey questions and uses the information to evaluate, support, maintain and improve products, services or sites, develop new products and services, conduct educational research and for other purposes specified in the survey.

Contests and Drawings

Occasionally, we may sponsor a contest or drawing. Participation is optional. Pearson collects name, contact information and other information specified on the entry form for the contest or drawing to conduct the contest or drawing. Pearson may collect additional personal information from the winners of a contest or drawing in order to award the prize and for tax reporting purposes, as required by law.

Newsletters

If you have elected to receive email newsletters or promotional mailings and special offers but want to unsubscribe, simply email information@informit.com.

Service Announcements

On rare occasions it is necessary to send out a strictly service related announcement. For instance, if our service is temporarily suspended for maintenance we might send users an email. Generally, users may not opt-out of these communications, though they can deactivate their account information. However, these communications are not promotional in nature.

Customer Service

We communicate with users on a regular basis to provide requested services and in regard to issues relating to their account we reply via email or phone in accordance with the users' wishes when a user submits their information through our Contact Us form.

Other Collection and Use of Information


Application and System Logs

Pearson automatically collects log data to help ensure the delivery, availability and security of this site. Log data may include technical information about how a user or visitor connected to this site, such as browser type, type of computer/device, operating system, internet service provider and IP address. We use this information for support purposes and to monitor the health of the site, identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents and appropriately scale computing resources.

Web Analytics

Pearson may use third party web trend analytical services, including Google Analytics, to collect visitor information, such as IP addresses, browser types, referring pages, pages visited and time spent on a particular site. While these analytical services collect and report information on an anonymous basis, they may use cookies to gather web trend information. The information gathered may enable Pearson (but not the third party web trend services) to link information with application and system log data. Pearson uses this information for system administration and to identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents, appropriately scale computing resources and otherwise support and deliver this site and its services.

Cookies and Related Technologies

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

Do Not Track

This site currently does not respond to Do Not Track signals.

Security


Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

Children


This site is not directed to children under the age of 13.

Marketing


Pearson may send or direct marketing communications to users, provided that

  • Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
  • Such marketing is consistent with applicable law and Pearson's legal obligations.
  • Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
  • Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

Correcting/Updating Personal Information


If a user's personally identifiable information changes (such as your postal address or email address), we provide a way to correct or update that user's personal data provided to us. This can be done on the Account page. If a user no longer desires our service and desires to delete his or her account, please contact us at customer-service@informit.com and we will process the deletion of a user's account.

Choice/Opt-out


Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s) simply visit the following page and uncheck any communication you no longer want to receive: www.informit.com/u.aspx.

Sale of Personal Information


Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

Supplemental Privacy Statement for California Residents


California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

Sharing and Disclosure


Pearson may disclose personal information, as follows:

  • As required by law.
  • With the consent of the individual (or their parent, if the individual is a minor)
  • In response to a subpoena, court order or legal process, to the extent permitted or required by law
  • To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
  • In connection the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
  • To investigate or address actual or suspected fraud or other illegal activities
  • To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
  • To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
  • To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

Links


This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.

Requests and Contact


Please contact us about this Privacy Notice or if you have any requests or questions relating to the privacy of your personal information.

Changes to this Privacy Notice


We may revise this Privacy Notice through an updated posting. We will identify the effective date of the revision in the posting. Often, updates are made to provide greater clarity or to comply with changes in regulatory requirements. If the updates involve material changes to the collection, protection, use or disclosure of Personal Information, Pearson will provide notice of the change through a conspicuous notice on this site or other appropriate way. Continued use of the site after the effective date of a posted revision evidences acceptance. Please contact us if you have questions or concerns about the Privacy Notice or any objection to any revisions.

Last Update: November 17, 2020