Data

What will we cover?
The definition of data, the different types of data, from simple characters and numbers through to collections, and how to define your own data types.

Data is one of those terms that everyone uses but few really understand. My dictionary defines it as:

Data: "facts or figures from which conclusions can be inferred; information"

That's not much better but at least gives a starting point.

Data is the stuff that your program manipulates. Without data a program cannot usefully exist. Programs manipulate data in many ways, often depending on the type of the data. And it comes in many types:

Data Types

Character Strings

We've already seen these. They are literally any string or sequence of characters that can be printed on your screen. (In fact there can even be non-printable control characters too).

In Python, strings can be represented in several ways:

With single quotes:

'Here is a string'

With double quotes:

"Here is a very similar string"

With triple double quotes:

""" Here is a very long string that can
    if we wish span several lines and Python will
    preserve the lines as we type them..."""

One special use of the last form is to build in documentation for Python functions that we create ourselves - we'll see this later.

You can access the individual characters in a string by treating it as an array of characters (see Collections below). There are also usually some operations provided by the programming language to help you manipulate strings - find a substring, join two strings, copy one to another etc.
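As a rough sketch of those string operations in Python (the variable names are just made up for illustration, and I'm assuming a current version of Python where print is written with parentheses):

```python
# Illustrative only: the names below are invented for this sketch.
greeting = "Here is a string"

first = greeting[0]                  # indexing starts at 0, so this is 'H'
position = greeting.find("is")       # index where the substring "is" first appears
joined = greeting + " and another"   # join two strings with +
copy = greeting                      # a second name for the same string

print(first)       # H
print(position)    # 5
print(joined)      # Here is a string and another
```

Other languages offer similar facilities, though the names and syntax will differ.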

Integers

Whole numbers from a large negative value through to a large positive value. The largest positive value is known as MAXINT and depends on the number of bits used on your computer to represent a number. On a 32 bit computer MAXINT is around 2 billion.

You can also get unsigned integers, which are restricted to zero and positive values. Thus there is a bigger maximum number of around 2 * MAXINT or 4 billion on a 32 bit computer.

Because integers are restricted in size to MAXINT, adding two integers together where the total is greater than MAXINT causes the total to be wrong. On some systems/languages the wrong value is just returned as is (usually with some kind of secret flag raised that you can test if you think it might have been set). Normally an error condition is raised and either your program can handle the error or the program will exit. Early versions of Python adopt this latter approach (current versions avoid the problem by switching automatically to integers of unlimited size), while Tcl adopts the former. BASIC throws an error but provides no way to catch it (at least I don't know how!)
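To put some numbers on that, here is a small sketch assuming a 32 bit integer. Current versions of Python let integers grow as large as they need to, so the wrap_signed function below is a hypothetical helper I've written purely to imitate what a fixed-size integer would do:

```python
BITS = 32                        # assumed word size for this sketch
MAXINT = 2**(BITS - 1) - 1       # largest signed value
MAX_UNSIGNED = 2**BITS - 1       # largest unsigned value

def wrap_signed(value, bits=BITS):
    """Hypothetical helper: squeeze value into a signed integer of the given width."""
    value &= 2**bits - 1                 # keep only the lowest 'bits' bits
    if value > 2**(bits - 1) - 1:        # too big for a signed number...
        value -= 2**bits                 # ...so it reappears as a negative one
    return value

print(MAXINT)                    # 2147483647
print(MAX_UNSIGNED)              # 4294967295
print(MAXINT + 1)                # Python just keeps counting: 2147483648
print(wrap_signed(MAXINT + 1))   # a fixed 32 bit integer wraps to -2147483648
```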

Real Numbers

These are numbers with a fractional part. They can represent very large numbers, much bigger than MAXINT, but with less precision. That is to say that two real numbers which should be identical may not seem to be when compared by the computer. This is because the computer only approximates the lowest digits. Thus a result that should be 4.0 could be represented by the computer as 3.9999999.... or 4.000000....01. These approximations are close enough for most purposes but occasionally they become important! If you get a funny result when using real numbers, bear this in mind.
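A quick way to see the approximation at work in Python is the sketch below. The use of math.isclose to compare "near enough" values is my suggestion rather than the only way to do it:

```python
import math

total = 0.1 + 0.2

# The two values should be identical but differ in the lowest digits.
print(total == 0.3)               # False
print(math.isclose(total, 0.3))   # True: equal to within a small tolerance

# Real numbers can be far bigger than a 32 bit MAXINT.
big = 1.5e300
print(big > 2**31 - 1)            # True
```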

Complex or Imaginary Numbers

If you have a scientific or mathematical background you may be wondering about complex numbers. If you don't, you may not even have heard of them! Anyhow, some programming languages, notably Fortran, provide builtin support for the complex type but most provide a library of functions which can operate on complex numbers. (Python does both: it has a builtin complex type and a library of supporting functions.) And before you ask, matrices are usually handled by libraries too.
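For the curious, here is a minimal sketch of complex numbers in a current version of Python, where the imaginary part is written with a j suffix and the cmath module supplies the mathematical functions:

```python
import cmath

z = (1 + 2j) * (3 - 1j)     # ordinary arithmetic works on complex values
print(z)                    # (5+5j)
print(z.real, z.imag)       # 5.0 5.0

root = cmath.sqrt(-1)       # the square root of -1 is the imaginary unit
print(root)                 # 1j
```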

Boolean Values - True and False

Like the heading says, this type has only 2 values - either true or false. Some languages support boolean values directly, others use a convention whereby some numeric value (often 0) represents false and another (often 1 or -1) represents true.

Boolean values are sometimes known as "truth values" because they are used to test whether something is true or not. For example if you write a program to back up all the files in a directory you might back up each file then ask the operating system for the name of the next file. If there are no more files to save it will return an empty string. You can then test to see if the name is an empty string and store the result as a boolean value (true if it is empty). You'll see how we would use that result later on in the course.
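The empty string test might be sketched in Python like this. The file names are invented and a simple list stands in for the operating system, so treat it as an illustration of storing a truth value rather than a real backup program:

```python
# Made-up names; the empty string marks "no more files".
file_names = ["notes.txt", "data.csv", ""]

name = file_names[0]
no_more_files = (name == "")    # store the result of the test
print(no_more_files)            # False

name = file_names[-1]
no_more_files = (name == "")
print(no_more_files)            # True
```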

Collections

Computer science has built a whole discipline around studying collections and their various behaviours. Some of the names you might see are:

Array or Vector
A list of items which are indexed for easy and fast retrieval. Usually you have to say up front how many items you want to store. Let's say I have an array called A, then I can extract the 3rd item in A by writing A[3]. (Actually arrays usually start at 0 so the 3rd item would be A[2].) Arrays are fundamental in BASIC, in fact they are the only built in collection type. In Python arrays are simulated using lists (see below) and Tcl arrays are implemented using dictionaries (see below).
List
A list is a sequence of items. What makes it different from an array is that it can keep on growing - you just add another item. But it's not usually indexed so you have to find the item you need by stepping through the list from front to back checking each item to see if it's the item you want. Both Python and Tcl have lists built into the language. In BASIC it's harder and we have to do some tricky programming to simulate them. BASIC programmers usually just create very big arrays instead. Python also allows you to index its lists - in fact it doesn't have arrays as such but combines array indexing with its lists' ability to grow. As we will see this is a very useful feature.
Stack
Think of a stack of trays in a restaurant. A member of staff puts a pile of clean trays on top and these are removed one by one by customers. The trays at the bottom of the stack get used last (and least!). Data stacks work the same way: you push an item onto the stack or pop one off. The item popped is always the last one pushed. This property of stacks is sometimes called First In Last Out or FILO. One useful property of stacks is that you can reverse a list of items by pushing the list onto the stack then popping it off again. The result will be the reverse of the starting list. Stacks are not built in to Python, Tcl or BASIC. You have to write some program code to implement the behaviour. Lists are usually the best starting point since like stacks they can grow as needed.
Bag
A bag is a collection of items with no specified order and it can contain duplicates. Bags usually have operators to enable you to add, find and remove items. In Python and Tcl bags are just lists. In BASIC you must build the bag from a large array.
Set
A set has the property of only storing one of each item. You can usually test to see if an item is in a set (membership), add, remove and retrieve items, and join two sets together in various ways corresponding to set theory in math (e.g. union, intersection etc). Current versions of Python provide a built in set type. In Tcl, and in older versions of Python, sets can be easily implemented by using the built in dictionary type.
Queue
A queue is rather like a stack except that the first item into a queue is also the first item out. This is known as First In First Out or FIFO behaviour.
Dictionary or Hash
A dictionary combines the properties of lists, sets and arrays. You can keep on adding items (like lists) but you can access the item by a key provided at the point of insertion (like arrays). Because access is via the key you can only put in items with unique keys (like sets). Dictionaries are immensely useful structures and are provided as a built in type in both Python and Tcl. They are quite difficult to implement efficiently so are rarely used by BASIC programmers.

We can use dictionaries in lots of ways and we'll see plenty of examples later, but for now, here's how to create a dictionary in Python, fill it with some entries and read one back:

>>> glossary = {}
>>> glossary['boolean'] = "A data item whose value can be either true or false"
>>> glossary['integer'] = "A whole number"
>>> print(glossary['boolean'])
A data item whose value can be either true or false

Easy eh?

There's a whole bunch of others but these are the main ones that we deal with. (In fact we'll only be dealing with some of these!)
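A few of the collections described above can be sketched in Python as follows. Using a plain list as the stack follows the suggestion made earlier. Using collections.deque for the queue and the built in set type are my own choices, and assume a reasonably current version of Python:

```python
from collections import deque

items = ["a", "b", "c"]
third = items[2]                 # indexing starts at 0, so [2] is the 3rd item

# Stack: push everything on, pop everything off, and the order is reversed.
stack = []
for item in items:
    stack.append(item)           # push
reversed_items = []
while stack:
    reversed_items.append(stack.pop())   # pop returns the last item pushed

# Queue: the first item in is the first item out.
queue = deque(items)
first_out = queue.popleft()

# Set: duplicates are stored only once and membership is easy to test.
unique = set(["a", "b", "a"])

print(third)               # c
print(reversed_items)      # ['c', 'b', 'a']
print(first_out)           # a
print("a" in unique)       # True
print(len(unique))         # 2
```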

Files

As a computer user you know all about files - the very basis of nearly everything we do with computers. It should be no surprise then, to discover that most programming languages provide a special file type of data. However files and the processing of them are so important that I will defer discussing them till later when they get a whole section to themselves.

Dates and Times

Dates and times are often given dedicated types in programming. At other times they are simply represented as a large number (typically the number of seconds from some arbitrary date/time!). In other cases the data type is what is known as a complex type as described in the next section. This usually makes it easier to extract the month, day, hour etc.
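As an illustration of both styles, Python's datetime module can turn a count of seconds into a type with separate fields. The start point assumed here is midnight on 1 January 1970 (UTC), which is the one Python and many operating systems use:

```python
from datetime import datetime, timezone

# 365 days' worth of seconds after the start point (1970 was not a leap year).
seconds = 86400 * 365

# Convert the raw number into a type with named fields.
moment = datetime.fromtimestamp(seconds, tz=timezone.utc)

print(moment.year, moment.month, moment.day)   # 1971 1 1
print(moment.hour)                             # 0
```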

Complex/User Defined

Sometimes the basic types described above are inadequate even when combined in collections. Sometimes what we want to do is group several bits of data together then treat it as a single item. An example might be the description of an address: a house number, a street, a town and a post code or zip code. Most languages allow us to group such information together in a record or structure.

In BASIC such a record definition looks like:

Type Address
     Hs_Number AS INTEGER
     Street AS STRING
     Town AS STRING
     Zip_Code AS STRING
End Type

In Python it's a little different:

class Address:
    def __init__(self, Hs, St, Town, Zip):
        self.Hs_Number = Hs
        self.Street = St
        self.Town = Town
        self.Zip_Code = Zip

That may look a little arcane but don't worry it will make sense soon.

We'll look at how to use these structures in the next section on Variables.

Variables

Data is stored in the memory of your computer. You can liken this to the big wall full of boxes used in mail rooms to sort the mail. You can put a letter in any box but unless the boxes are labelled with the destination address it's pretty meaningless. Variables are the labels on the boxes in your computer's memory.

Knowing what data is is all very well, but what can we do with it? In programming terms we can create instances of data objects and assign them to variables. A variable is a reference to a specific area somewhere in the computer's memory. These areas hold the data. In some computer languages a variable must match the type of data that it points to, e.g. in BASIC we declare a string variable by putting a $ at the end of the name:

DIM MYSTRING$
MYSTRING$ = "Here is a string"

Here DIM MYSTRING$ creates the label and specifies that it will hold a string (because of the $ sign). The MYSTRING$ = "Here..." line creates the actual data and puts it in the bit of memory labelled MYSTRING$.

Similarly we declare an integer by putting a % at the end:

DIM MYINT%
MYINT% = 7

In Python and Tcl a variable takes the type of the data assigned to it. It will keep that type, and Python or Tcl will complain if you try to mix data in strange ways - like trying to add a string to a number. (Recall the example error message earlier?) We can change the type of data that a Python variable points to by reassigning the variable:

>>> q = 7
>>> print(2*q)
14
>>> q = "Seven"
>>> print(2*q)
SevenSeven

Note that q was set to point to 7 initially. It maintained that value until we made it point at "Seven". Thus, Python variables maintain the type of whatever they point to, but we can change what they point to simply by reassigning the variable. At that point the original data is 'lost' and Python will erase it from memory (unless another variable points at it too). This is known as garbage collection. (Garbage collection can be likened to the mailroom clerk who comes round once in a while and removes any packets that are in boxes with no labels. If he can't find an owner or address on the packets he throws them in the garbage!)
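Two of those points, the complaint when types are mixed and data surviving while another variable still points at it, can be sketched like this. The variable names are invented for the illustration and the exact wording of the error message varies between Python versions:

```python
q = 7

# Mixing types in a strange way: adding a string to a number is refused.
try:
    q + "Seven"
except TypeError as error:
    message = str(error)
    print("Python complained:", message)

# Two labels on the same box.
first_label = ["a", "packet"]
second_label = first_label

# Re-labelling first_label does not lose the list, because
# second_label still points at it.
first_label = "Seven"
print(second_label)     # ['a', 'packet']
```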

BASIC will not allow you to do this. If a variable is a string variable (terminated with a $) you cannot ever assign a number to it. Similarly, if it is an integer variable (ends in %) you cannot assign a string to it. BASIC does allow 'anonymous variables' that don't end in anything. These can only store numbers however, either real or integer numbers but only numbers.

One final gotcha with integer variables in BASIC:

i% = 7
PRINT 2 * i%
i% = 4.5
PRINT 2 * i%

Notice that the assignment of 4.5 to i% seemed to work but only the integer part was actually assigned. This is reminiscent of the way Python dealt with division of integers. All programming languages have their own little idiosyncrasies like this!
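For comparison, here is roughly how the same ideas look in a current version of Python (an assumption on my part, since older versions treated / differently for integers). Dropping the fractional part has to be asked for explicitly:

```python
i = int(4.5)       # explicitly keep only the integer part
print(2 * i)       # 8

print(7 // 2)      # 3   (integer division discards the remainder)
print(7 / 2)       # 3.5 (ordinary division keeps the fraction)
```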

Accessing Complex Types

We can assign a complex data type to a variable too, but to access the individual fields of the type we must use some special access mechanism (which will be defined by the language). Usually this is a dot.

Taking the address type we defined above, we would do this in BASIC:

DIM Add AS Address
Add.Hs_Number = 7
Add.Street = "High St"
Add.Town = "Anytown"
Add.Zip_Code = "123 456"
PRINT Add.Hs_Number," ",Add.Street

And in Python:

Add = Address(7, "High St", "Anytown", "123 456")
print(Add.Hs_Number, Add.Street)

Let's see what we can do with variables now that we know what they are and how to create them.

Points to remember
  • Data comes in many types and the operations you can successfully perform will depend on the type of data you are using.
  • Simple data types include character strings, numbers, Boolean or 'truth' values.
  • Complex data types include collections, files, dates and user defined data types.

If you have any questions or feedback on this page send me mail at: alan_gauld@xoommail.com