Home > Articles

  • Print
  • + Share This
This chapter is from the book

3.7 Simplistic Evaluation of Classifiers

We have everything lined up for the fireworks! We have data, we have methods, and we have an evaluation scheme. As the Italians say, “Andiamo!” Let’s go!

3.7.1 Learning Performance

Shortly, we’ll see a simple Python program to compare our two learners: k-NN and NB. Instead of using the names imported by our setup statement from mlwpy import * at the start of the chapter, it has its imports written out. This code is what you would write in a stand-alone script or in a notebook that doesn’t import our convenience setup. You’ll notice that we rewrote the train_test_split call and we also made the test set size significantly bigger. Why? Training on less data makes it a harder problem. You’ll also notice that I sent an extra argument to train_test_split: random_state=42 hacks the randomness of the train-test split and gives us a repeatable result. Without it, every run of the cell would result in different evaluations. Normally we want that, but here I want to be able to talk about the results knowing what they are.

In [11]:

# stand-alone code
from sklearn import (datasets, metrics,
                     model_selection as skms,
                     naive_bayes, neighbors)

# we set random_state so the results are reproducible
# otherwise, we get different training and testing sets
# more details in Chapter 5
iris = datasets.load_iris()
(iris_train_ftrs, iris_test_ftrs,
 iris_train_tgt, iris_test_tgt) = skms.train_test_split(iris.data,

models = {'kNN': neighbors.KNeighborsClassifier(n_neighbors=3),
          'NB' : naive_bayes.GaussianNB()}

for name, model in models.items():
    fit = model.fit(iris_train_ftrs, iris_train_tgt)
    predictions = fit.predict(iris_test_ftrs)

    score = metrics.accuracy_score(iris_test_tgt, predictions)
    print("{:>3s}: {:0.2f}".format(name,score))
kNN: 0.96
 NB: 0.81

With a test set size of 90% of the data, k-NN does fairly well and NB does a bit meh on this train-test split. If you rerun this code many times without random_state set and you use a more moderate amount of testing data, we get upwards of 97+% accuracy on both methods for many repeated runs. So, from a learning performance perspective, iris is a fairly easy problem. It is reasonably easy to distinguish the different types of flowers, based on the measurements we have, using very simple classifiers.

3.7.2 Resource Utilization in Classification

Everything we do on a computer comes with a cost in terms of processing time and memory. Often, computer scientists will talk about memory as storage space or, simply, space. Thus, we talk about the time and space usage of a program or an algorithm. It may seem a bit old-fashioned to worry about resource usage on a computer; today’s computer are orders of magnitude faster and larger in processing and storage capabilities than their ancestors of even a few years ago—let alone the behemoth machines of the 1960s and 1970s. So why are we going down a potentially diverting rabbit hole? There are two major reasons: extrapolation and the limits of theoretical analysis. Extrapolation

Today, much of data science and machine learning is driven by big data. The very nature of big data is that it pushes the limits of our computational resources. Big data is a relative term: what’s big for you might not be too big for someone with the skills and budget to compute on a large cluster of machines with GPUs (graphics processing units). One possible breaking point after which I don’t have small data is when the problem is so large that I can’t solve it on my laptop in a “reasonable” amount of time.

If I’m doing my prototyping and development on my laptop—so I can sip a mojito under a palm tree in the Caribbean while I’m working—how can I know what sort of resources I will need when I scale up to the full-sized problem? Well, I can take measurements of smaller problems of increasing sizes and make some educated guesses about what will happen with the full dataset. To do that, I need to quantify what’s happening with the smaller data in time and space. In fairness, it is only an estimate, and adding computational horsepower doesn’t always get a one-to-one payback. Doubling my available memory won’t always double the size of the dataset I can process. Limits of Theory

Some of you might be aware of a subfield of computer science called algorithm analysis whose job is to develop equations that relate the time and memory use of a computing task to the size of that task’s input. For example, we might say that the new learning method Foo will take 2n + 27 steps on n input examples. (That’s a drastic simplification: we almost certainly care about how many features there are in these examples.)

So, if there is a theoretical way to know the resources needed by an algorithm, why do we care about measuring them? I’m glad you asked. Algorithm analysis typically abstracts away certain mathematical details, like constant factors and terms, that can be practically relevant to real-world run times. Algorithm analysis also (1) makes certain strong or mathematically convenient assumptions, particularly regarding the average case analysis, (2) can ignore implementation details like system architecture, and (3) often uses algorithmic idealizations, devoid of real-world practicalities and necessities, to reach its conclusions.

In short, the only way to know how a real-world computational system is going to consume resources, short of some specialized cases that don’t apply here, is to run it and measure it. Now, it is just as possible to screw this up: you could run and measure under idealized or nonrealistic conditions. We don’t want to throw out algorithmic analysis altogether. My critiques are not failures of algorithm analysis; it’s simply open-eyed understanding its limits. Algorithm analysis will always tell us some fundamental truths about how different algorithms compare and how they behave on bigger-and-bigger inputs.

I’d like to show off a few methods of comparing the resource utilization of our two classifiers. A few caveats: quantifying program behavior can be very difficult. Everything occurring on your system can potentially have a significant impact on your learning system’s resource utilization. Every difference in your input can affect your system’s behavior: more examples, more features, different types of features (numerical versus symbolic), and different hyperparameters can all make the same learning algorithm behave differently and consume different resources. Units of Measure

We need to make one small digression. We’re going to be measuring the resources used by computer programs. Time is measured in seconds, and space is measured in bytes. One byte is eight bits: it can hold the answers to eight yes/no questions. Eight bits can distinguish between 256 different values—so far, so good. However, we’ll be dealing with values that are significantly larger or smaller than our normal experience. I want you to be able to connect with these values.

We need to deal with SI prefixes. SI is short for the International Standard of scientific abbreviations—but, coming from a Romance language, the adjective is after the noun, so the IS is swapped. The prefixes that are important for us are in Table 3.2. Remember that the exponent is the x in 10x; it’s also the number of “padded zeros” on the right. That is, kilo means 103 = 1000 and 1000 has three zeros on the right. The examples are distances that would be reasonable to measure, using that prefix, applied to meters.

Table 3.2 SI prefixes and length scale examples.




Example Distance




orbit of Neptune around the Sun




orbit of the Moon around the Earth




diameter of the Moon




a nice walk




1 meter 1 step













There is another complicating factor. Computers typically work with base-2 amounts of storage, not base-10. So, instead of 10x we deal with 2x. Strictly speaking—and scientists are nothing if not strict—we need to account for this difference. For memory, we have some additional prefixes (Table 3.3) that you’ll see in use soon.

Table 3.3 SI base-two prefixes and memory scale examples.


Verbal Prefix

Number of Bytes





a list of about 1000 numbers




a short song as an MP3




a feature-length movie




a family archive of photos and movies

So, 2 MiB is two mebi-bytes equal to 220 bytes. You’ll notice that the base-2 prefixes are also pronounced differently. Ugh. You might wonder why these step up by 10s, not by 3s as in the base-10 values. Since 210 = 1024 1000 = 103, multiplying by ten 2s is fairly close to multiplying by three 10s. Unfortunately, these binary prefixes, defined by large standards bodies, haven’t necessarily trickled down to daily conversational use. The good news is that within one measuring system, you’ll probably only see MiB or MB, not both. When you see MiB, just know that it isn’t quite MB. Time

In a Jupyter notebook, we have some nice tools to measure execution times. These are great for measuring the time use of small snippets of code. If we have two different ways of coding a solution to a problem and want to compare their speed, or just want to measure how long a snippet of code takes, we can use Python’s timeit module. The Jupyter cell magic %timeit gives us a convenient interface to time a line of code:

In [12]:

%timeit -r1 datasets.load_iris()
1000 loops, best of 1: 1.4 ms per loop

The -r1 tells timeit to measure the timing of the snippet once. If we give a higher r, for repeats, the code will be run multiple times and we will get statistics. Recent versions of Jupyter default to calculating the mean and standard deviation of the results. Fortunately, for a single result we just get that single value. If you are concerned about the 1000 loops, check out my note on it at the end of the chapter.

%%timeit—the two-percents make it a cell magic—applies the same strategy to the entire block of code in a cell:

In [13]:

%%timeit -r1 -n1
(iris_train_ftrs, iris_test_ftrs,
 iris_train_tgt,  iris_test_tgt) = skms.train_test_split(iris.data,
1 loop, best of 1: 638 μs per loop

And now let’s point our chronometer (timeit) at our learning workflow:

In [14]:

%%timeit -r1

nb    = naive_bayes.GaussianNB()
fit   = nb.fit(iris_train_ftrs, iris_train_tgt)
preds = fit.predict(iris_test_ftrs)

metrics.accuracy_score(iris_test_tgt, preds)
1000 loops, best of 1: 1.07 ms per loop

In [15]:

%%timeit -r1

knn   = neighbors.KNeighborsClassifier(n_neighbors=3)
fit   = knn.fit(iris_train_ftrs, iris_train_tgt)
preds = fit.predict(iris_test_ftrs)

metrics.accuracy_score(iris_test_tgt, preds)
1000 loops, best of 1: 1.3 ms per loop

If we just want to time one line in a cell—for example, we only want to see how long it takes to fit the models—we can use a single-percent version, called a line magic, of timeit:

In [16]:

# fitting
nb = naive_bayes.GaussianNB()
%timeit -r1 fit   = nb.fit(iris_train_ftrs, iris_train_tgt)

knn = neighbors.KNeighborsClassifier(n_neighbors=3)
%timeit -r1 fit = knn.fit(iris_train_ftrs, iris_train_tgt)
1000 loops, best of 1: 708 μs per loop
1000 loops, best of 1: 425 μs per loop

In [17]:

# predicting
nb    = naive_bayes.GaussianNB()
fit   = nb.fit(iris_train_ftrs, iris_train_tgt)
%timeit -r1 preds = fit.predict(iris_test_ftrs)

knn   = neighbors.KNeighborsClassifier(n_neighbors=3)
fit   = knn.fit(iris_train_ftrs, iris_train_tgt)
%timeit -r1 preds = fit.predict(iris_test_ftrs)
1000 loops, best of 1: 244 μs per loop
1000 loops, best of 1: 644 μs per loop

There seems to be a bit of a tradeoff. k-NN is faster to fit, but is slower to predict. Conversely, NB takes a bit of time to fit, but is faster predicting. If you’re wondering why I didn’t reuse the knn and nb from the prior cell, it’s because when you %timeit, variable assignment are trapped inside the timeit magic and don’t leak back out to our main code. For example, trying to use preds as “normal” code in the prior cell will results in a NameError. Memory

We can also do a very similar sequence of steps for quick-and-dirty measurements of memory use. However, two issues raise their ugly heads: (1) our tool isn’t built into Jupyter, so we need to install it and (2) there are technical details—err, opportunities?—that we’ll get to in a moment. As far as installation goes, install the memory_profiler module with pip or conda at your terminal command line:

pip install memory_profiler
conda install memory_profiler

Then, in your notebook you will be able to use %load_ext. This is Jupyter’s command to load a Jupyter extension module—sort of like Python’s import. For memory_profiler, we use it like this:

%load_ext memory_profiler

Here it goes:

In [18]:

%load_ext memory_profiler

Use it is just like %%timeit. Here’s the cell magic version for Naive Bayes:

In [19]:

nb    = naive_bayes.GaussianNB()
fit   = nb.fit(iris_train_ftrs, iris_train_tgt)
preds = fit.predict(iris_test_ftrs)
peak memory: 144.79 MiB, increment: 0.05 MiB

And for Nearest Neighbors:

In [20]:

knn   = neighbors.KNeighborsClassifier(n_neighbors=3)
fit   = knn.fit(iris_train_ftrs, iris_train_tgt)
preds = fit.predict(iris_test_ftrs)
peak memory: 144.79 MiB, increment: 0.00 MiB Complicating Factors

You may never have considered what happens with memory on your computer. In the late 2010s, you might have 4 or 8GB of system memory, RAM, on your laptop. I have 32GB on my workhorse powerstation—or workstation powerhorse, if you prefer. Regardless, that system memory is shared by each and every running program on your computer. It is the job of the operating system—Windows, OSX, Linux are common culprits—to manage that memory and respond to applications’ requests to use it. The OS has to be a bit of a playground supervisor to enforce sharing between the different programs.

Our small Python programs, too, are playing on that playground. We have to share with others. As we request resources like memory—or time on the playground swing—the OS will respond and give us a block of memory to use. We might actually get more memory than we request (more on that in a second). Likewise, when we are done with a block of memory—and being the polite playground children that we are—we will return it to the playground monitor. In both our request for memory and our return of the memory, the process incurs management overhead. Two ways that OSes simplify the process and reduce the overhead are (1) by granting memory in blocks that might be more than we need and (2) by possibly letting us keep using memory, after we’ve said we’re done with it, until someone else actively needs it. The net result of this is that determining the actual amount of memory that we are using—versus the amount the operating system has walled off for us—can be very tricky. Measuring additional requests within a running program is even more difficult.

Another issue further complicates matters. Python is a memory-managed language: it has its own memory management facilities on top of the OS. If you were to rerun the above cells in a Jupyter notebook, you might see a memory increment of 0.00 MiB and wonder what circuits just got fried. In that case, the old memory we used was released by us—and the operating system never shuffled it off to someone else. So, when we needed more memory, we were able to reuse the old memory and didn’t need any new memory from the OS. It is almost as if the memory was released and reclaimed by us so quickly that it was never actually gone! Now, whether or not we see an increment is also dependent on (1) what the notebook cell is doing, (2) what other memory our program has claimed and is using, (3) every other program that is running on the computer, and (4) the exact details of the operating system’s memory manager. To learn more, check out a course or textbook on operating systems.

3.7.3 Stand-Alone Resource Evaluation

To minimize these concerns and to reduce confounding variables, it is extremely useful to write small, stand-alone programs when testing memory use. We can make the script general enough to be useful for stand-alone timing, as well.

In [21]:

!cat scripts/knn_memtest.py
import memory_profiler, sys
from mlwpy import *


def knn_memtest(train, train_tgt, test):
    knn   = neighbors.KNeighborsClassifier(n_neighbors=3)
    fit   = knn.fit(train, train_tgt)
    preds = fit.predict(test)

if __name__ == "__main__":
    iris = datasets.load_iris()
    tts = skms.train_test_split(iris.data,
    (iris_train_ftrs, iris_test_ftrs,
     iris_train_tgt,  iris_test_tgt) = tts
    tup = (iris_train_ftrs, iris_train_tgt, iris_test_ftrs)

There are a few ways to use memory_profiler. We’ve seen the line and cell magics in the previous section. In knn_memtest.py, we use the @memory_profiler.profile decorator. That extra line of Python tells the memory profiler to track the memory usage of knn_memtest on a line-by-line basis. When we run the script, we see memory-related output for each line of knn_memtest:

In [22]:

!python scripts/knn_memtest.py
Filename: scripts/knn_memtest.py
# output modified for formatting purposes

Line #    Mem usage    Increment     Line Contents
     4 120.5430 MiB 120.5430 MiB     @memory_profiler.profile(precision=4)
     5                               def knn_memtest(train, train_tgt, test):
     6 120.5430 MiB   0.0000 MiB         knn   = neighbors.
     7 120.7188 MiB   0.1758 MiB         fit   = knn.fit(train, train_tgt)
     8 120.8125 MiB   0.0938 MiB         preds = fit.predict(test)

Here’s another stand-alone script to measure the memory usage of Naive Bayes:

In [23]:

import functools as ft
import memory_profiler
from mlwpy import *

def nb_go(train_ftrs, test_ftrs, train_tgt):
    nb    = naive_bayes.GaussianNB()
    fit   = nb.fit(train_ftrs, train_tgt)
    preds = fit.predict(test_ftrs)

def split_data(dataset):
    split = skms.train_test_split(dataset.data,
    return split[:-1] # don't need test tgt

def msr_mem(go, args):
    base = memory_profiler.memory_usage()[0]
    mu = memory_profiler.memory_usage((go, args),
    print("{:<3}: ~{:.4f} MiB".format(go.__name__, mu-base))

if __name__ == "__main__":
    msr = msr_mem
    go = nb_go

    sd = split_data(datasets.load_iris())
    msr(go, sd)
nb_go: ~0.0078 MiB

nb_go has the model-fit-predict pattern we saw above. split_data just wraps train_test_split in a convenient way to use with nb_go. The new piece is setting up the timing wrapper in msr_mem. Essentially, we ask what memory is used now, run nb_go, and then see the maximum memory used along the way. Then, we take that max, subtract what we were using before, max-baseline, and that’s the peak memory used by nb_go. nb_go gets passed in to msr_mem as go and then finds its way to memory_usage.

We can write a similar msr_time driver to evaluate time, and we can write a similar knn_go to kick off a k-NN classifier for measuring time and memory. Here are all four pieces in a single script:

In [24]:

!cat scripts/perf_01.py
import timeit, sys
import functools as ft
import memory_profiler
from mlwpy import *

def knn_go(train_ftrs, test_ftrs, train_tgt):
    knn   = neighbors.KNeighborsClassifier(n_neighbors=3)
    fit   = knn.fit(train_ftrs, train_tgt)
    preds = fit.predict(test_ftrs)

def nb_go(train_ftrs, test_ftrs, train_tgt):
    nb    = naive_bayes.GaussianNB()
    fit   = nb.fit(train_ftrs, train_tgt)
    preds = fit.predict(test_ftrs)

def split_data(dataset):
    split = skms.train_test_split(dataset.data,
    return split[:-1] # don't need test tgt

def msr_time(go, args):
    call = ft.partial(go, *args)
    tu = min(timeit.Timer(call).repeat(repeat=3, number=100))
    print("{:<6}: ~{:.4f} sec".format(go.__name__, tu))

def msr_mem(go, args):
    base = memory_profiler.memory_usage()[0]
    mu = memory_profiler.memory_usage((go, args),
    print("{:<3}: ~{:.4f} MiB".format(go.__name__, mu-base))

if __name__ == "__main__":
    which_msr = sys.argv[1]
    which_go = sys.argv[2]

    msr = {'time': msr_time, 'mem':msr_mem}[which_msr]
    go = {'nb' : nb_go, 'knn': knn_go}[which_go]

    sd = split_data(datasets.load_iris())
    msr(go, sd)

With all this excitement, let’s see where we end up using Naive Bayes:

In [25]:

!python scripts/perf_01.py mem nb
!python scripts/perf_01.py time nb
nb_go: ~0.1445 MiB
nb_go : ~0.1004 sec

And with k-NN:

In [26]:

!python scripts/perf_01.py mem knn
!python scripts/perf_01.py time knn
knn_go: ~0.3906 MiB
knn_go: ~0.1035 sec

In summary, our learning and resource performance metrics look like this (the numbers may vary a bit):




~Memory (MiB)









Don’t read too much into the accuracy scores! I’ll tell you why in a minute.

  • + Share This
  • 🔖 Save To Your Account