Planning the Workbook Structure
Your workbook needs three types of worksheets: one type to collect the data from your web queries, one type to bring the query data together in a single place, and one type to run whatever analyses you decide are needed. (You also need a VBA module to hold your code, but that is covered in a later section of this chapter.)
Those three types of worksheet are discussed in the next three sections.
If you haven’t used Excel to retrieve data from the Web, you might be surprised at how easy it is. I’m not speaking here of using your browser to get to a web page, selecting and copying some or part of its contents, and then pasting back into the workbook. I’m speaking of queries that can execute automatically and on a predetermined schedule, thus enabling you to walk away and let the computer gather data without you micromanaging it.
Suppose that you want to retrieve data from Amazon about a book entitled Statistical Analysis with Excel. You open a new workbook and rename an unused worksheet to something such as “Stats.”
Next, start your browser if necessary and navigate to Amazon’s page for that book. When you’re there, copy the page’s full address from the browser’s address box—drag across the address to highlight it and either press Ctrl + C or use the browser’s Edit menu to choose the Copy command.
Switch back to Excel and click the Ribbon’s Data tab. Click the From Web button in the Get External Data group. The New Web Query window displays as shown in Figure 1.1.
Figure 1.1. What you see at first depends on your brows-er’s home page.
Drag through the address that appears in the Address box and press Ctrl + V to paste the address you copied. When you click the Go button, the query window opens the Amazon page (see Figure 1.2).
Figure 1.2. Web pages normally consist of several tables, rectangular areas that divide the page into different segments.
Typically you see one or more square icons (a black arrow on a yellow background) that identify the locations of different parts of the web page, called tables. You can select them by clicking them. When you click one of these icons to select it, the icon turns green to indicate that you want to download the data in that table.
When you position your mouse pointer over a yellow icon, a heavy border appears around the table that’s associated with the icon. This helps you select one or more tables to download.
Nevertheless, I recommend that you select the entire page by clicking the icon in the upper-left corner of the window. If you select only a table or tables that contain the data you’re interested in, it could easily happen that a day, week, or month from now the page might be changed, so that the data you want appears in a different table. Then your query will miss the data.
But if you select the entire web page instead of just a table or tables, the page’s owner would have to remove the data you’re interested in completely for you to miss it, and in that case it wouldn’t matter how much of the page you selected to download.
The individual table icons are useful mainly when you want to do a one-time-only download from a web page. Then you might want to avoid downloading a lot of extraneous stuff that would just make it harder to find the data you’re after. In the type of case I’m describing here, though, you’ll let Excel do the finding for you.
Furthermore, you don’t save much time or bandwidth by selecting just a subset of the web page. In most cases you’re picking up a few thousand bytes of text at most in an entire page.
Speed of Execution
Nevertheless, you should be aware of a speed versus version tradeoff. I have learned that using Excel 2007 and 2010, web queries can take significantly more time to complete than in earlier versions of Excel. Among the changes made to Excel 2007 was the addition of much more thorough checks of the data returned from web pages for malicious content.
I’ve found that when using Excel 2002, it takes about 30 seconds to execute eight web queries in the way I’m describing here. Using Excel 2010, it takes nearly three times as long.
The basic intent of the project I’m describing here is to automatically and regularly update your downloaded data, so it probably seems unimportant to worry about whether the process takes half a minute or a minute and a half. Occasionally, though, for one reason or another I want to get an immediate update of the information and so I force the process to run manually. On those occasions I’d rather not wait.
Perhaps I shouldn’t, but I trust the results of my Amazon queries to be free of malicious content, so I run my app on the more quick-footed Excel 2002. It’s safer, though, to give yourself the added protections afforded by Excel 2007 or 2010, and if you can stand the extra minute or so of query execution time then by all means you should use the slower, safer way.
Bringing the Data Back
After you have clicked a yellow icon to turn it green (using, I hope, the one in the upper-left corner of the New Web Query window so that you get the entire page), click the Import button at the bottom of the New Web Query window. After a few seconds, the Import Data window appears as shown in Figure 1.3.
Figure 1.3. You can import immediately, but it’s a good idea to check the property settings first.
Accept the default destination of cell A1 and click OK. (The reason to use cell A1 becomes apparent when it’s time to refresh the query using VBA, later in this chapter.) There are some useful properties to set, so I recommend that you click the Properties button before you complete the import. The Properties window that displays is shown in Figure 1.4.
Figure 1.4. By default the Save Query Definition checkbox is filled, but you should verify the setting when you create the query.
Be sure that the Save Query Definition box is checked. That way you can repeatedly run the query without having to define it all over again.
The Enable Background Refresh checkbox requires a little explanation. If it is filled, any VBA procedure that is running continues running as the query executes, and other direct actions by the user can proceed normally. Sometimes that can cause problems if a procedure depends on the results of the query: If the procedure expects to find data that isn’t available yet, you might get a run-time error or a meaningless result. Therefore, I usually clear the Enable Background Refresh checkbox.
The various options in the Properties dialog box listed under Data Formatting and Layout are subtly different but can be important. I spend three pages detailing the differences in another Que book, Managing Data with Excel, and I don’t propose to do it again here. For present purposes, you might just as well accept the default values.
Click OK when you have made any changes you want and then click OK in the Import Data window. Excel completes the query, usually within a few seconds, and writes the results to the worksheet (see Figure 1.5).
Figure 1.5. When the query has finished executing, you wind up with a haystack of text and numbers. The next step is to find the needle.
Finding the Data
After the data has been retrieved from the web page, the next task is to locate the piece or pieces of information you’re looking for. I want to stress, though, that you need do this once only for each product you’re tracking—and quite possibly just once for all the products. It depends on how the web page administrator is managing the data.
What you need to look for is a string of text that’s normally a constant: one that doesn’t change from hour to hour or day to day. Figure 1.6 shows a portion of the results of a query.
Figure 1.6. In this case, the string Sellers Rank in cell A199 uniquely locates the product ranking.
If you use Excel’s Find feature to scan a worksheet for the string Sellers Rank, you can locate the worksheet cell that also contains the ranking for the product. With just a little more work, which you can easily automate and which I describe in the next section about VBA code, you can isolate the actual ranking from the surrounding text; it’s that ranking that you’re after.
Why not just note the cell address where the ranking is found after the query is finished? That would work fine if you could depend on the web page’s layout remaining static. But the website administrator has only to add an extra line, or remove one, above the data’s current location, and that will throw off the location of the cell with the data you’re after. No, you have to look for it each time, and the Find operation occurs very fast anyway.
After you’ve acquired the data from a web page and isolated the figure you’re looking for, you need a place to put that figure plus other relevant information such as date and time. That place is normally a separate worksheet. You normally expect to be querying the same web page repeatedly, as hours and days elapse. Therefore, you’ll want to store information that you’ve already retrieved somewhere that won’t get overwritten the next time the query runs.
So, establish an unused worksheet and name it something appropriate such as Summary or Synthesis or All Products. There are a few structural rules covered in the next section that you’ll find helpful to follow. But you can include some other useful analyses on the summary sheet, as long as they don’t interfere with the basic structure.
Structuring the Summary Sheet
Figure 1.7 shows the structures that I put on my summary sheet.
Figure 1.7. You can put snapshot analyses supported by worksheet functions on the summary sheet.
In Figure 1.7, the first few columns are reserved for the rankings that I have obtained via web queries from the appropriate Amazon pages. I also store the date and time the queries finished executing in column A. That time data provides my basis for longitudinal summaries: a baseline for the forecasting analyses that I discuss in Chapters 3, 4, 5, and 9.
It’s at this point that you have a decision to make. It’s nice to be able to retrieve data about sales rankings for products such as books. If you’ve written a good one, it’s gratifying to see the rankings drop as time goes by. (Remember, high rankings are better: A rank of 1 is a best seller.) But you likely never got paid a royalty or a commission, or had to fill a reorder, strictly on the basis of a sales ranking. It’s the sales themselves that you’re ultimately seeking: Granted that intermediate objectives such as clicks and conversions and rankings are important indicators, they don’t directly represent revenue.
So how do you translate sales rankings into a count of sales? I started by tracking sales rankings on Amazon for about a week and noticed some points of interest.
Telling a Sale from No Sale
A jump from a lower ranking to a higher ranking probably means the sale of at least one item. If the item has no sales during a given period, its ranking declines as other items do sell and move up.
How do you rank sales? You can’t do it strictly on the number sold. A book, for example, might have sold 200 copies over a two-year period. Another book might have sold 100 copies since it was published last week. The second book is clearly performing better than the first, so you have to combine elapsed time somehow with number sold. Amazon doesn’t say, but my guess would be that the rankings are based in part on the ratio of sales to days elapsed since publication—in other words, sales per day.
Improved Rankings Without Sales
There are periods when an item’s ranking improves very gradually over a period of hours. There’s no reason to believe that an improvement from a ranking of, say, 20,000 to 19,999 indicates a sale. More likely it is a result of another day passing and the rankings recalculating accordingly. That means that before you conclude a sale took place, you need a minimum criterion.
Deciding on a Criterion
The criterion should be a rate, not a constant number. If a book jumps from a ranking of 200,101 to 200,001, that 100-place increase is considerably different from a book that jumps from a ranking of 101 to 1. I decided to conclude that a sale had taken place if an increase in rankings equaled or exceeded ten percent of the prior ranking. So, if a book ranked 15,000 at 3:00 p.m. and 13,000 at 4:00 p.m.:
- (15000 − 13000)/15000 = 0.13 or 13%
I conclude that a sale took place.
Structuring the Formula
Suppose that I have two rankings for a given product, one taken at 3:00 p.m. in cell C18 and one taken at 4:00 p.m. in cell C19. If I want to test whether a sale took place between 3:00 p.m. and 4:00 p.m., I can enter this formula in, say, L19:
The formula returns a 1 if the difference between the two rankings is positive (for example, an increase from a ranking of 1,000 to 900 is positive) and exceeds 10% of the earlier ranking. The formula returns a zero otherwise. After the formula is established on the work-sheet, I use the same VBA code that re-executes the queries to copy the formula to the next available row.
I also like to watch two other statistics that don’t depend on an ordered baseline of data the way that sales estimates do. These are the total sales per book and the minimum (that is, the highest) sales ranking attained by each book since I started tracking the rankings.
I use Excel’s MIN() and SUM() functions to get those analyses. I put them at the top of the columns so that they won’t interfere with the results that come back from the web pages as the queries execute over time.
Figure 1.8 shows what those analyses look like.
Figure 1.8. You can put snapshot analyses supported by worksheet functions on the summary sheet.
So, for example, cell J2 might contain this formula:
It sums the values from the fifth to the millionth row in column J, which contains a 1 for every assumed sale, and a 0 otherwise. The result tells me the number of copies of this book that I assume have been sold by Amazon, judging by the changes in sales rankings.
To get the minimum, best sales ranking for the same book, I use this formula in cell T2:
The formula returns the smallest numeric value for the fourth to the millionth row in column B.
Notice that the range address in these formulas uses a constant, 1,000,000. There are more elegant ways of making sure that you capture all relevant cells (such as dynamic range names and tables), but this one is simple, pragmatic, and doesn’t slow down processing.
More Complicated Breakdowns
Figure 1.9 shows a table I use for a quick monthly tabulation of sales of certain books. I get similar tables for daily breakdowns, but they are pivot tables and can get a little cumbersome when you have as many as a couple of hundred days to summarize. The table in Figure 1.9 is driven by worksheet functions and is a quick monthly overview instead of a more detailed daily analysis.
Figure 1.9. This table counts sales per month and projects sales for a full month.
Column S in Figure 1.9 simply contains the numbers of the months that I’m interested in: May through December. When I get to May 2012, it will be necessary to add a column with the year, to distinguish May 2011 from May 2012.
Columns T and U contain array formulas. I describe the formulas in column T here; the formulas in column U work the same way but use different columns as the data sources.
The array formulas in column T check the month number in column S against the month implied by the date in column A. If the two month indicators are equal, the formula sums the values in columns J and K. I go into the topic of array formulas, what they require, and why they’re sometimes needed, in Chapter 5, “Forecasting a Time Series: Regression.” For now, though, it’s enough to be aware that you need to enter these formulas with a special keyboard sequence. Instead of simply typing the formula and pressing Enter, you type it and then press Ctrl + Shift + Enter. This procedure is called array-entering a formula. Excel notices that you have array-entered a formula, and in that case it surrounds the formula as shown in the formula box with curly braces.
It’s useful to take the formula apart to see what’s going on inside it. Here’s the full formula for the August figures:
Working from the inside out, consider this fragment:
The MONTH() function returns the month of a date, and column A consists of dates following the second row. When you apply the MONTH function to an array of dates, as is done here, you get an array of month numbers. (That’s one reason the formula must be entered as an array formula using Ctrl + Shift + Enter: The MONTH() function normally expects to evaluate a single date value, not a whole array of them. Array-entering the formula alerts Excel to the possibility that a function that normally takes a single value will be taking multiple values.)
That MONTH() fragment is surrounded by an IF() function. The IF() function tests whether the month numbers returned by MONTH() equal the number in cell S5238. In Figure 1.9, cell S5238 contains the number 8—the eighth month, or August. So the IF() function begins by converting all those month numbers from MONTH() to TRUE and FALSE values, depending on whether MONTH() returns an 8.
The next argument to the IF() function in this case is the range J3:K1000000. That range contains the 1’s and 0’s calculated from the changes in the sales rankings. Those 1’s and 0’s are retained by the IF() given that the logical test, month equals 8, is true. The IF() function retains any row where the month of the date in column A is 8, and in that case it holds on to the values found in columns J and K.
What if a date in column A does not fall in the month of August? Then the logical test posed by the IF() function fails, and the IF()’s third argument, 0, is used.
Here’s the formula as I’ve discussed it so far:
More briefly: If the dates in column A are in the month shown in S5238 (which is 8), use the associated values in columns J and K. Otherwise, use the number 0.
Finally, surround the fragment with the SUM() function:
That is, take the sum of the values in columns J and K if they were observed during August, using zero instead if they were observed during some other month.
The same formula, adjusted to total two different columns, is in cell U5238:
Here, we total the values found in columns L and M if the associated date is from August. Columns L and M contain sales estimates for a different book than columns J and K. (You probably have noticed that a book gets two columns because there is a paperback edition as well as a Kindle edition of each.)
Column V simply totals the results from columns T and U:
And column W calculates the average number of books sold during the month:
That calculation is a little tricky during the current month. Cell W5242 uses this formula while the current date is still within December:
I don’t want to divide by 31 before we reach the end of the month, so I need to calculate the number of days that have elapsed during the current month. That number is the difference between the value returned by the TODAY() function (which gives today’s date) minus the value of the last day in the prior month. So if today is December 20, the result of the subtraction is 20.