Home > Articles

This chapter is from the book

Planning the Workbook Structure

Your workbook needs three types of worksheets: one type to collect the data from your web queries, one type to bring the query data together in a single place, and one type to run whatever analyses you decide are needed. (You also need a VBA module to hold your code, but that is covered in a later section of this chapter.)

Those three types of worksheet are discussed in the next three sections.

Query Sheets

If you haven’t used Excel to retrieve data from the Web, you might be surprised at how easy it is. I’m not speaking here of using your browser to get to a web page, selecting and copying some or part of its contents, and then pasting back into the workbook. I’m speaking of queries that can execute automatically and on a predetermined schedule, thus enabling you to walk away and let the computer gather data without you micromanaging it.

Suppose that you want to retrieve data from Amazon about a book entitled Statistical Analysis with Excel. You open a new workbook and rename an unused worksheet to something such as “Stats.”

Next, start your browser if necessary and navigate to Amazon’s page for that book. When you’re there, copy the page’s full address from the browser’s address box—drag across the address to highlight it and either press Ctrl + C or use the browser’s Edit menu to choose the Copy command.

Switch back to Excel and click the Ribbon’s Data tab. Click the From Web button in the Get External Data group. The New Web Query window displays as shown in Figure 1.1.

01fig01.jpg

Figure 1.1 What you see at first depends on your browser’s home page.

Drag through the address that appears in the Address box and press Ctrl + V to paste the address you copied. When you click the Go button, the query window opens the Amazon page (see Figure 1.2).

01fig02.jpg

Figure 1.2 Web pages sometimes consist of several tables, rectangular areas that divide the page into different segments.

Typically you see one, and sometimes more than one, square icon (a red arrow on a yellow background) that identifies the location of part of the web page, called a table. You can select a table by clicking its icon. When you click one of these icons to select it, the icon turns blue to indicate that you want to download the data in that table.

I recommend that you select the entire page by clicking the icon in the upper-left corner of the window. If you select only a table or tables that contain the data you’re interested in, it could easily happen that a day, week, or month from now the page might be changed, so that the data you want appears on the same page but in a different table. Then your query will miss the data.

But if you select the entire web page instead of just a table or tables, the page’s owner would have to remove the data you’re interested in completely for you to miss it, and in that case it wouldn’t matter how much of the page you selected to download.

The individual table icons are useful mainly when you want to do a one-time-only download from a web page. Then you might want to avoid downloading a lot of extraneous stuff that would just make it harder to find the data you’re after. In the type of case I’m describing here, though, you’ll let Excel do the finding for you.

Furthermore, you don’t save much time or bandwidth by selecting just a subset of the web page. In most cases you’re picking up a couple of megabytes of text at most in an entire page.

Speed of Execution

Nevertheless, you should be aware of a speed versus version tradeoff. I have learned that using Excel 2007 and more recent versions, web queries can take significantly more time to complete than in earlier versions of Excel. Among the changes made to Excel 2007 was the addition of much more thorough checks of the data returned from web pages for malicious content.

I’ve found that when using an earlier version, it takes about 30 seconds to execute eight web queries in the way I’m describing here. Using Excel 2016, it takes about three times as long.

The basic intent of the project I’m describing here is to automatically and regularly update your downloaded data, so it probably seems unimportant to worry about whether the process takes half a minute or a minute and a half. Occasionally, though, for one reason or another I want to get an immediate update of the information and so I force the process to run manually. On those occasions I’d rather not wait.

Bringing the Data Back

After you have clicked a red-and-yellow icon to turn it blue (using, I hope, the icon in the upper-left corner of the New Web Query window so that you get the entire page), click the Import button at the bottom of the New Web Query window. After a few seconds, the Import Data window appears as shown in Figure 1.3.

01fig03.jpg

Figure 1.3 You can import immediately, but it’s a good idea to check the property settings first.

Accept the default destination of cell A1 and click OK. (The reason to use cell A1 becomes apparent when it’s time to refresh the query using VBA, later in this chapter.) There are some useful properties to set, so I recommend that you click the Properties button before you complete the import. The Properties window that displays is shown in Figure 1.4.

Be sure that the Save Query Definition box is checked. That way you can repeatedly and automatically run the query without having to define it all over again.

The Enable Background Refresh checkbox requires a little explanation. If it is filled, any VBA procedure that is running continues running as the query executes, and other direct actions by the user can proceed normally. Sometimes that can cause problems if a procedure depends on the results of the query: If the procedure expects to find data that isn’t available yet, you might get a runtime error or a meaningless result. Therefore, I usually clear the Enable Background Refresh checkbox.

01fig04.jpg

Figure 1.4 By default the Save Query Definition checkbox is filled, but you should verify the setting when you create the query.

The various options in the Properties dialog box listed under Data Formatting and Layout are subtly different but can be important. I spend three pages detailing the differences in another Que book, Managing Data with Excel, and I don’t propose to do it again here. For present purposes, you might just as well accept the default values.

Click OK when you have made any changes you want and then click OK in the Import Data window. Excel completes the query, usually within a few seconds, and writes the results to the worksheet (see Figure 1.5).

01fig05.jpg

Figure 1.5 When the query has finished executing, you wind up with a haystack of text and numbers. The next step is to find the needle.

Finding the Data

After the data has been retrieved from the web page, the next task is to locate the piece or pieces of information you’re looking for. I want to stress, though, that you need do this once only for each product you’re tracking—and quite possibly just once for all the products. It depends on how the web page administrator is managing the data.

What you need to look for is a string of text that’s normally a constant: one that doesn’t change from hour to hour or day to day. That text string should be near—ideally, adjacent to—the data you want to retrieve. Figure 1.6 shows a portion of the results of a query.

01fig06.jpg

Figure 1.6 In this case, the string Sellers Rank: in cell A296 uniquely locates the product ranking.

If you use Excel’s Find feature to scan a worksheet for the string Sellers Rank:, you can locate the worksheet cell that also contains the ranking for the product. With just a little more work, which you can easily automate and which I describe in the next section about VBA code, you can isolate the actual ranking from the surrounding text; it’s that ranking that you’re after.

Why not just note the cell address where the ranking is found after the query is finished? That would work fine if you could depend on the web page’s layout remaining static. But the website administrator has only to add an extra line, or remove one, above the data’s current location, and that will throw off the location of the cell with the data you’re after. No, you have to look for it each time, and the Find operation takes place very quickly anyway.

Summary Sheets

After you’ve acquired the data from a web page and isolated the figure you’re looking for, you need a place to put that figure, plus other relevant information such as date and time. That place is normally a separate worksheet. You expect to be querying the same web page repeatedly, as hours and days elapse. Therefore, you’ll want to store information that you’ve already retrieved somewhere that won’t get overwritten the next time the query runs.

So, establish an unused worksheet and name it something appropriate such as Summary or Synthesis or All Products. There are a few structural rules covered in the next section that you’ll find helpful to follow. But you can include some other useful analyses on the summary sheet, as long as they don’t interfere with the basic structure.

Structuring the Summary Sheet

Figure 1.7 shows some of the structures that I put on my summary sheet.

01fig07.jpg

Figure 1.7 You can put snapshot analyses supported by worksheet functions on the summary sheet.

In Figure 1.7, the first few columns are reserved for the rankings of several books that I have obtained via web queries from the appropriate Amazon pages. I also store the date and time the queries finished executing in column A. That time data provides my basis for longitudinal summaries: a baseline for the forecasting analyses that I discuss in Chapters 3 through 6 and 11.

It’s at this point that you have a decision to make. It’s nice to be able to retrieve data about sales rankings for products such as books. If you’ve written a good one, it’s gratifying to see the rankings drop as time goes by. (Remember, numerically lower rankings are better: A rank of 1 is a best seller.) But you likely never got paid a royalty or a commission, or had to fill a re-order, strictly on the basis of a sales ranking. It’s the sales themselves that you’re ultimately seeking: Granted that intermediate objectives such as clicks and conversions and rankings are key performance indicators, they don’t directly represent revenue.

From Rankings to Sales

So how do you translate sales rankings into a count of sales? I started by tracking sales rankings on Amazon for about a week and noticed some points of interest.

Telling a Sale from No Sale

A jump from a poorer ranking to a better ranking probably means the sale of at least one item. If the product has no sales during a given period, its ranking declines as other products do sell and move up. (It is of course possible that a book might experience sales even as its ranking declines, if its competition experiences greater sales.)

Ranking Sales

How do you rank sales? You can’t do it strictly on the number sold. A book, for example, might have sold 200 copies over a two-year period. Another book might have sold 100 copies since it was published last week. The second book is clearly performing better than the first, so you somehow have to combine elapsed time with number sold. Amazon doesn’t say, but my guess would be that the rankings are based in part on the ratio of sales to days elapsed since publication—in other words, sales per day.

Improved Rankings Without Sales

There are periods when an item’s ranking improves very gradually over a period of hours. There’s no reason to believe that an improvement from a ranking of, say, 20,000 to 19,999 indicates a sale. More likely it is a result of another day passing and the rankings recalculating accordingly. That means that before you conclude a sale took place, you need a credible minimum criterion.

Deciding on a Criterion

The criterion should be a rate, not a constant number. If a book jumps from a ranking of 200,101 to 200,001, that 100-place increase is considerably different from a book that jumps from a ranking of 101 to 1. I decided to conclude that a sale had taken place if an increase in rankings equaled or exceeded ten percent of the prior ranking. So, if a book ranked 15,000 at 3:00 p.m. and 13,000 at 4:00 p.m.:

(15000 − 13000)/15000 = 0.13 or 13%

I conclude that a sale took place.

Structuring the Formula

Suppose that I have two rankings for a given product, one taken at 3:00 p.m. in cell C18 and one taken at 4:00 p.m. in cell C19. If I want to test whether a sale took place between 3:00 p.m. and 4:00 p.m., I can enter this formula in, say, L19:

=IF((C18-C19)/C18>0.1,1,0)

The formula returns a 1 if the difference between the two rankings is positive (for example, an increase from a ranking of 1,000 to 900 is positive) and exceeds 10% of the earlier ranking. The formula returns a zero otherwise. After the formula is established on the worksheet, I use the same VBA code that re-executes the queries to copy the formula to the next available row.

Snapshot Formulas

I also like to watch one statistic that doesn’t depend on an ordered baseline of data the way that sales estimates do. That is the minimum (that is, the highest) sales ranking attained by each book since I started tracking the rankings.

I use Excel’s MIN( ) function to get that analysis for each book (in hard copy format). I put the functions at the top of the columns so that they won’t interfere with the results that come back from the web pages as the queries execute over time.

Figure 1.8 shows what those analyses look like. To get the minimum, best sales ranking for the same book, I use this formula in cell M2:

= MIN(OFFSET(B4,0,0,COUNTA($A:$A),1))

01fig08.jpg

Figure 1.8 You can put snapshot analyses supported by worksheet functions on the summary sheet.

The formula returns the smallest numeric value for the fourth to the final populated row in column B.

Customizing Your Formulas

As shown in the final section of this chapter, I keep a monthly total of sales by book title in a pivot table that immediately follows the final and most current row of downloaded data. But because I often find it useful to know the total sales on the current day, I make a special provision for that. I keep it next to the pivot table that records monthly sales. It totals the number of books sold on the current day, regardless of the book title.

The formula that returns the current day’s total uses two named ranges, ranges that relocate themselves as each hour passes. The ranges are named RecentDays and RecentSales. Both range names take advantage of Excel’s mixed addressing feature.

For the formula to work properly, it needs to locate the rows that contain the data that was downloaded on the current day. Suppose that I select cell N2500 and define the range name RecentDays as referring to this range on the worksheet named Summary:

Summary!$A2468:$A2497

The name I have defined exists 14 columns to the left of cell N2500, and it occupies the rows from 33 to 4 rows above cell N2500—that is, the range encompasses 30 rows.

But the range name involves a mixed reference: a combination of a fixed and a relative reference. In this case, the column is fixed (by the dollar signs) to column A. We could enter a formula that uses the defined name in, say, Column B, and then copy and paste the that formula into column C. The formula would still refer to cells in column A because the dollar signs fix it there.

But if we copied and pasted the formula into cell N2501—that is, one row below where it was originally entered—the relative portion of the range’s address would increment by one row. As used in cell N2501, the mixed reference in RecentDays would point to $A2469:$A2498. The reference is relative to the location of whatever cell uses it.

I use the same approach with the range name RecentSales, which also uses a mixed reference but captures more than just one column. Here’s what RecentSales refers to, again assuming that cell N2500 is active when I define the name:

=Summary!$J2469:$Q2498

On the Summary sheet of the Collector workbook, column A contains the date and time when the rankings in columns B through I were downloaded. Columns J through Q contain a 1 (if the change in rankings indicate that a particular book was sold) or a 0 (otherwise).

Suppose that the most recent download wrote the ranking in row 2241. Suppose further that we array enter this formula in cell N2244:

=SUM(IF(DAY(RecentDays)=DAY(NOW(  )),RecentSales,0))

When the range names RecentDays and RecentSales are used in a cell in row 2244, the relative portions of the mixed references cause the names to refer to values in rows 2212 through 2241. So the array formula does the following:

  • It checks to see if the value that identifies the current day (returned by the function DAY) equals the days in the range RecentDays (that’s the range A2212:A2241 as the name is used in row 2244). If the values are equal, then the record was downloaded today.

  • If a record in rows 2212 through 2241 was downloaded today, the array formula includes it in the sum of the values in the range named RecentSales. As used in the array formula, that is the range of 1s and 0s in J2212 through Q2241.

And that results in the number of inferred sales made today. Because the named ranges include 30 rows, the formula does not have to look all the way back to the beginning of the data in row 5. Because the named ranges include 30 rows, the formula is pretty sure to include all the hourly downloads that took place on the current day. The IF function ensures that the SUM function will ignore records downloaded on any prior day.

As you’ll see in this chapter’s section on VBA, when a new record is captured from a website and written to the worksheet, a new row is inserted at the bottom of the existing set of records. So doing pushes the cell with the array formula one row further down the worksheet, and the defined names now refer to ranges one row further down than before the insertion took place. So the array formula always points to a range of cells from 33 to 4 rows above the array formula itself.

You will doubtless have many questions that you’d like to ask of your data. The little summary formula that this section has described is idiosyncratic and of limited value beyond the specific data set that the Collector project downloads. I’ve spent this much time on it principally to demonstrate that Excel’s worksheet tools, formulas, functions and names are powerful enough to accommodate very specialized requirements.

It’s time to look closely at the VBA code needed to automate the project.

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Overview


Pearson Education, Inc., 221 River Street, Hoboken, New Jersey 07030, (Pearson) presents this site to provide information about products and services that can be purchased through this site.

This privacy notice provides an overview of our commitment to privacy and describes how we collect, protect, use and share personal information collected through this site. Please note that other Pearson websites and online products and services have their own separate privacy policies.

Collection and Use of Information


To conduct business and deliver products and services, Pearson collects and uses personal information in several ways in connection with this site, including:

Questions and Inquiries

For inquiries and questions, we collect the inquiry or question, together with name, contact details (email address, phone number and mailing address) and any other additional information voluntarily submitted to us through a Contact Us form or an email. We use this information to address the inquiry and respond to the question.

Online Store

For orders and purchases placed through our online store on this site, we collect order details, name, institution name and address (if applicable), email address, phone number, shipping and billing addresses, credit/debit card information, shipping options and any instructions. We use this information to complete transactions, fulfill orders, communicate with individuals placing orders or visiting the online store, and for related purposes.

Surveys

Pearson may offer opportunities to provide feedback or participate in surveys, including surveys evaluating Pearson products, services or sites. Participation is voluntary. Pearson collects information requested in the survey questions and uses the information to evaluate, support, maintain and improve products, services or sites, develop new products and services, conduct educational research and for other purposes specified in the survey.

Contests and Drawings

Occasionally, we may sponsor a contest or drawing. Participation is optional. Pearson collects name, contact information and other information specified on the entry form for the contest or drawing to conduct the contest or drawing. Pearson may collect additional personal information from the winners of a contest or drawing in order to award the prize and for tax reporting purposes, as required by law.

Newsletters

If you have elected to receive email newsletters or promotional mailings and special offers but want to unsubscribe, simply email information@informit.com.

Service Announcements

On rare occasions it is necessary to send out a strictly service related announcement. For instance, if our service is temporarily suspended for maintenance we might send users an email. Generally, users may not opt-out of these communications, though they can deactivate their account information. However, these communications are not promotional in nature.

Customer Service

We communicate with users on a regular basis to provide requested services and in regard to issues relating to their account we reply via email or phone in accordance with the users' wishes when a user submits their information through our Contact Us form.

Other Collection and Use of Information


Application and System Logs

Pearson automatically collects log data to help ensure the delivery, availability and security of this site. Log data may include technical information about how a user or visitor connected to this site, such as browser type, type of computer/device, operating system, internet service provider and IP address. We use this information for support purposes and to monitor the health of the site, identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents and appropriately scale computing resources.

Web Analytics

Pearson may use third party web trend analytical services, including Google Analytics, to collect visitor information, such as IP addresses, browser types, referring pages, pages visited and time spent on a particular site. While these analytical services collect and report information on an anonymous basis, they may use cookies to gather web trend information. The information gathered may enable Pearson (but not the third party web trend services) to link information with application and system log data. Pearson uses this information for system administration and to identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents, appropriately scale computing resources and otherwise support and deliver this site and its services.

Cookies and Related Technologies

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

Do Not Track

This site currently does not respond to Do Not Track signals.

Security


Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

Children


This site is not directed to children under the age of 13.

Marketing


Pearson may send or direct marketing communications to users, provided that

  • Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
  • Such marketing is consistent with applicable law and Pearson's legal obligations.
  • Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
  • Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

Correcting/Updating Personal Information


If a user's personally identifiable information changes (such as your postal address or email address), we provide a way to correct or update that user's personal data provided to us. This can be done on the Account page. If a user no longer desires our service and desires to delete his or her account, please contact us at customer-service@informit.com and we will process the deletion of a user's account.

Choice/Opt-out


Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s) simply visit the following page and uncheck any communication you no longer want to receive: www.informit.com/u.aspx.

Sale of Personal Information


Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

Supplemental Privacy Statement for California Residents


California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

Sharing and Disclosure


Pearson may disclose personal information, as follows:

  • As required by law.
  • With the consent of the individual (or their parent, if the individual is a minor)
  • In response to a subpoena, court order or legal process, to the extent permitted or required by law
  • To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
  • In connection the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
  • To investigate or address actual or suspected fraud or other illegal activities
  • To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
  • To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
  • To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

Links


This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.

Requests and Contact


Please contact us about this Privacy Notice or if you have any requests or questions relating to the privacy of your personal information.

Changes to this Privacy Notice


We may revise this Privacy Notice through an updated posting. We will identify the effective date of the revision in the posting. Often, updates are made to provide greater clarity or to comply with changes in regulatory requirements. If the updates involve material changes to the collection, protection, use or disclosure of Personal Information, Pearson will provide notice of the change through a conspicuous notice on this site or other appropriate way. Continued use of the site after the effective date of a posted revision evidences acceptance. Please contact us if you have questions or concerns about the Privacy Notice or any objection to any revisions.

Last Update: November 17, 2020