InformIT

An Introduction to SQL

Date: Mar 15, 2002

Sample Chapter is provided courtesy of Sams.


Structured Query Language (SQL) is a database query language that was adopted as an industry standard in 1986. A major revision of the SQL standard, SQL2, was released in 1992. Its successor, SQL3, also contains object-oriented components. Currently, ANSI SQL92 is the most important standard.

This language enables you to pose complex questions to a database and also provides a means of modifying databases. SQL is widely supported: if you learn how to use it, you can apply this knowledge to MS SQL Server, DB2, Oracle, PostgreSQL, and countless other databases. SQL works with relational databases, which store data in objects; an object can be a table, for example. A database is a collection of tables and functions. A table consists of a list of records; each record (row) in a table has the same structure, with a fixed number of fields (columns) of a given type. SQL is used to communicate with a database and its components.

Every database implements a slightly different dialect of SQL, so it can sometimes be very hard to port applications from one database to another. For that reason, ANSI SQL92 was developed: a standard that should be understood by every database supporting SQL. Unfortunately, many commercial database vendors don't fully implement ANSI SQL92-compatible SQL engines in their databases.

PostgreSQL developers are working hard to make PostgreSQL 100% ANSI SQL–compatible. Therefore, PostgreSQL applications are portable and easy to understand.

For further information about the ANSI SQL92 standard or any other ANSI standard, check out http://www.ansi.org.

Relational Databases and Their Components

The relational database model was conceived by E. F. Codd in 1969. The model is based on branches of mathematics called set theory and predicate logic. The main idea is that a database consists of various unordered tables called relations that can be modified by the user. Relational databases were a major improvement to traditional database systems, which were not as flexible and sometimes hardware-dependent.

Relational databases consist of various components. This chapter provides basic insight for those who have not dealt with relational databases; it is not a tutorial about the theory behind relational databases.

Tables and Keys

Tables are the key components of relational databases. A relational database consists of one or more tables used to store information. A table consists of rows. Every row is divided into fields (columns) that have a certain datatype.

Assume you have a table used to store names and salaries (see Figure 3.1). A row containing this information consists of two fields: one for the name and one for the salary. If information from various tables has to be collected by a query, a join is performed by the database. Joins are covered extensively later in this chapter (see the section "Joining Tables").

Primary Keys

Every table should have a primary key. In this case, the name would be a useful primary key if the names are unique. Primary keys have to be fields that contain unique values—a primary key is the identifier of a record (row).

Keys have a significant impact on performance, but are also needed to guarantee data integrity.

Figure 3.1 A simple table with four columns.

Foreign Keys

Foreign keys are keys "taken" from a different table. Imagine a database with two tables. In one table, we store information about companies, such as the name and the location of each company. In the second table, we store information about the employees of the companies stored in the first table. We use a foreign key to make sure that the second table cannot contain information about employees who do not work for one of the companies listed in the first table. The behavior of PostgreSQL when dealing with foreign keys can be defined for every table. It can be defined, for instance, that all employees in the second table are removed when a company is removed from the first table. Rules defining PostgreSQL's behavior are called integrity constraints.

Foreign keys are extremely useful when working with complex data models and are usually used to protect data integrity. See Figure 3.2 for an example of two tables connected with a foreign key.
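A sketch of the two tables just described (table and column names are invented for the example):

CREATE TABLE company (
    id int4 PRIMARY KEY,
    name varchar(50),
    location varchar(50)
);

CREATE TABLE employee (
    id int4 PRIMARY KEY,
    name varchar(50),
    company_id int4 REFERENCES company (id) ON DELETE CASCADE
);

The REFERENCES clause makes company_id a foreign key, and ON DELETE CASCADE tells PostgreSQL to remove all employees of a company when the company itself is removed.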

Datatypes

Every column in a table must have a datatype. The user's job is to find the best datatype for storing a certain piece of information in the database. Let's assume that we want to store the name of a person. Names are character strings of undefined length. A suitable datatype for names would be varchar(50). In this example, 50 is the maximum length of the field. A varchar is stored efficiently in the database, because only the actual length of the field—and not the maximum length of the varchar—is used to store the text.

Figure 3.2 Connecting tables using a foreign key.

PostgreSQL offers a variety of datatypes. Here is an overview of all datatypes available in PostgreSQL 7.0.3:

xy=# \dT
         List of types

    Type    |                            Description
------------+-------------------------------------------------------------------
 SET        | set of tuples
 abstime    | absolute, limited-range date and time (Unix system time)
 aclitem    | access control list
 bit        | fixed-length bit string
 bool       | boolean, 'true'/'false'
 box        | geometric box '(lower left,upper right)'
 bpchar     | char(length), blank-padded string, fixed storage length
 bytea      | variable-length string, binary values escaped
 char       | single character
 cid        | command identifier type, sequence in transaction id
 cidr       | network IP address/netmask, network address
 circle     | geometric circle '(center,radius)'
 date       | ANSI SQL date
 filename   | filename used in system tables
 float4     | single-precision floating point number, 4-byte storage
 float8     | double-precision floating point number, 8-byte storage
 inet       | IP address/netmask, host address, netmask optional
 int2       | -32 thousand to 32 thousand, 2-byte storage
 int2vector | array of 16 int2 integers, used in system tables
 int4       | -2 billion to 2 billion integer, 4-byte storage
 int8       | ~18 digit integer, 8-byte storage
 interval   | @ <number> <units>, time interval
 line       | geometric line '(pt1,pt2)'
 lseg       | geometric line segment '(pt1,pt2)'
 lztext     | variable-length string, stored compressed
 macaddr    | XX:XX:XX:XX:XX:XX, MAC address
 money      | $d,ddd.cc, money
 name       | 31-character type for storing system identifiers
 numeric    | numeric(precision, decimal), arbitrary precision number
 oid        | object identifier(oid), maximum 4 billion
 oidvector  | array of 16 oids, used in system tables
 path       | geometric path '(pt1,...)'
 point      | geometric point '(x, y)'
 polygon    | geometric polygon '(pt1,...)'
 regproc    | registered procedure
 reltime    | relative, limited-range time interval (Unix delta time)
 smgr       | storage manager
 text       | variable-length string, no limit specified
 tid        | (Block, offset), physical location of tuple
 time       | hh:mm:ss, ANSI SQL time
 timestamp  | date and time
 timetz     | hh:mm:ss, ANSI SQL time
 tinterval  | (abstime,abstime), time interval
 unknown    |
 varbit     | fixed-length bit string
 varchar    | varchar(length), non-blank-padded string, variable storage length
 xid        | transaction id
(47 rows)

You can see that PostgreSQL offers powerful datatypes for nearly any purpose you can imagine. Thanks to PostgreSQL's modularity, new datatypes can easily be added to this list. The CREATE TYPE command can be used to add a datatype.

The most important datatypes are covered extensively later in this chapter. You will learn how to use these datatypes efficiently in real-life scenarios.

Indices

Indices are used to speed up searching. Let's assume you have a telephone directory containing 1,000,000 records consisting of two fields. The first field contains the name of a person, and the second field contains the phone number. If someone wants to know the phone number of a certain person, the database runs a sequential scan, which means that every record is scanned for the requested name. On average, a query such as that needs 500,000 (1,000,000 divided by 2) steps to find the result. If tables are large, the performance of the database system decreases significantly.

In this case, an index can be defined on a column. An index is, in most cases, a tree, and the leaves of the tree point to a data object.

Before you look at PostgreSQL's implementations of indices, let's explore the basic idea of indexing using B-trees.

B-trees are an efficient data structure for retrieving values from tables. A tree keeps the data sorted so that values can be accessed much faster. In the simplest case, a binary search tree, the tree consists of nodes, each with up to two children; a child is itself a node that can have up to two children of its own. Nodes hold values and are the parents of the values below them. A node that has no children is called a leaf. The data structure looks like a tree, but it's upside down.

B-trees are used to search efficiently for a value in a data structure. If the number of values stored in a tree doubles, the time to search for a value doesn't double—it takes only one additional step. If the number of values stored in a tree is 1,024 times higher, it takes only 10 additional steps to find a value, because 1,024 = 2^10.

Imagine 1,048,576 (unique) datasets. It would take 20 steps to find the right value (the binary logarithm, or logarithmus dualis: ld 1,048,576 = 20). In this example, you can see how an index can speed up your query; if no index is used to find the right value out of 1,048,576, the database needs 524,288 steps (1,048,576 divided by 2) on average to find the result.

Note

This works only as long as the B-tree is 100% balanced (see Figure 3.3).

In databases, B+ trees are usually used instead of B-trees, because B+ trees guarantee higher performance in real-world scenarios. The Reiser Filesystem (a Linux Filesystem) is also based on balanced B+ trees.

PostgreSQL supports three types of indices: B-trees, R-trees, and hash access methods. Each is covered in one of the following sections.

Figure 3.3 A balanced B-tree.

B-Trees

As mentioned in the last section, one way of indexing a column in PostgreSQL is to use a B-tree. PostgreSQL doesn't use "ordinary" B-trees for indexing because some additional features are required that can't be implemented with ordinary B-trees. One of the problems has to do with index locking. Assume that one user adds data to an index while a second user performs an index scan. The second user needs a fixed and consistent "image" of the index while performing the query. This problem can be solved with the help of a Lehman-Yao high-concurrency B-tree. This kind of tree offers very high concurrency at the expense of a little extra complexity in the data structure: every node is extended with a link to its right sibling and a "high key" bounding the values the node may contain.

This ensures that no locking is required for reading and that lock coupling for writes is rare.

R-Trees

R-trees use Guttman's quadratic split algorithm and are a dynamic index structure for spatial searching. Traditional indexing algorithms are not suitable for computer-aided design or geo-applications. Because PostgreSQL offers many datatypes that can be used for spatial calculations, R-trees can be a powerful method of speeding up your applications.

To understand spatial queries, imagine a situation where you want to find all countries that have land within 100 miles of a specific location. R-trees can be used to solve the problem for the database efficiently. R-trees are highly balanced trees and can be compared with B-trees. PostgreSQL offers a variety of operators for working with geo-data. In most cases, the user does not need to care about the details.

Hash Access Methods

The linear hashing algorithm used by PostgreSQL was developed by W. Litwin for disk-based systems with one processor. Linear hashing allows dynamic reorganization of a hashed database as records are inserted or updated, while preserving the property that a record can be retrieved with a single bucket access. Linear hashing enables the hashing function to be changed while the database is in use; only a small part of the database is affected when the hash function is changed.

Concurrent linear hashing adds a locking protocol and allows simultaneous access.

Sequences

Sequences are a convenient method for building consecutively numbered lists. A sequence can be used in the entire database (if all users have access to it). Every time a user accesses the sequence, its value is incremented, and it is guaranteed that a given number is used only once. Sequences can therefore be used to create unique numbers. The user does not have to worry about transactions when dealing with sequences, because the database internally makes sure that every value is handed out only once.
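A minimal sketch of how a sequence is used (the sequence name is invented for this example):

CREATE SEQUENCE serial_seq START 1;
SELECT nextval('serial_seq');   -- returns 1
SELECT nextval('serial_seq');   -- returns 2

Every call to nextval() hands out the next value and increments the sequence, even across concurrent sessions.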

Triggers

Powerful and convenient applications can be built with the help of triggers, which are used to start certain functions after certain events. Triggers are defined for tables and have to be associated with an event such as INSERT or UPDATE.

In real-world scenarios, triggers are used to perform operations automatically, but triggers are also used for many purposes by the database internally.

You will explore triggers extensively in Chapter 5, "Understanding Transactions," which is about PL/pgSQL.

Objects

Object relational databases consist of objects. Object orientation is an extension to the relational database model and is, in the case of PostgreSQL, a very powerful feature. Objects offer important core features, as explained in the following sections.

Classes

A class is a named collection of object instances. Each instance has a unique object identifier (OID).

Note

Each OID is unique in the entire system.

Classes can be created using the CREATE command. The individual members of a class are called instances; in the case of an object relational database, an instance can be a row in a table.

Inheritance

Inheritance means that a class can inherit functions or attributes from a class that is "higher" in the hierarchy. If a new class is derived from one of those upper classes, it inherits all information from the upper class. It is now possible to implement additional features for the new class.

Note

Features defined for a derived class are not visible in the parent class.

Here is an example of how to make the inheritance process clearer:

Imagine a table containing information about cars. We define a class that stores all information about a car that is common for cars, such as the color or the year the car was built. Now we define a class for a specific type of car that is used to store additional information, such as technical data about the air conditioning. The class defined for the specific type of car inherits all information from the parent type storing information about ordinary cars.
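A sketch of this inheritance hierarchy in SQL (table and column names are invented for the example):

CREATE TABLE car (
    color varchar(20),
    built int4
);

CREATE TABLE sportscar (
    aircondition varchar(30)
) INHERITS (car);

The table sportscar inherits the columns color and built from car and adds a column of its own.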

You learn how to query derived tables later in the book.

Function Overloading

Function overloading is a key feature of object-oriented systems. In function overloading, many versions of a function can exist; the versions differ in the number (or the types) of the parameters that can be passed to them. Assume a function called sum() used to sum numbers. Summing can be useful for 2, 3, or more values. With function overloading, you can implement a version of the function for each of these cases.

PL/pgSQL supports function overloading, and you will soon recognize it as a powerful and easy-to-use feature. Function overloading can also lead to dangerous bugs that are sometimes very hard to track down, because the programmer has to find the correct version of the function that was used before looking for the real error in the source code.
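A sketch of two overloaded versions of such a function, written as simple SQL functions (the name mysum is invented to avoid clashing with the built-in sum aggregate):

CREATE FUNCTION mysum(int4, int4) RETURNS int4
    AS 'SELECT $1 + $2;' LANGUAGE 'sql';

CREATE FUNCTION mysum(int4, int4, int4) RETURNS int4
    AS 'SELECT $1 + $2 + $3;' LANGUAGE 'sql';

The database picks the correct version by looking at the number and types of the arguments.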

Views

If you want to look at your data from a broader perspective, views might be a good choice. A view is a virtual table that contains information from other tables; it is simply the result of a SELECT statement presented as a virtual table by the database system. Views can be used to simplify SQL statements.
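A minimal sketch (the view name and the condition are invented; emp is the employee table created later in this chapter):

CREATE VIEW highpaid AS
    SELECT empname, sal FROM emp WHERE sal > 100000;

The view highpaid can now be queried like an ordinary table; the underlying SELECT is executed behind the scenes.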

Procedures

Procedures are functions that are stored directly within the database. Many database systems offer embedded languages. Oracle databases, for instance, offer a language called PL/SQL. PostgreSQL offers a language called PL/pgSQL, which is similar to PL/SQL and also very powerful. PostgreSQL offers even more programming interfaces, but PL/Tcl and PL/Perl are the most important ones to mention here. Writing procedures will be a major part of the chapter about PL/pgSQL.

Aggregate Functions and Aggregate Expressions

The capability of performing aggregations is an important feature of SQL. It enables the user to perform tasks on more than just one record.

Aggregate Functions

Aggregate functions are used to perform data calculations, such as maximum, minimum, or average. Aggregate functions can easily be added to PostgreSQL by using the CREATE AGGREGATE command. Many functions are already included in the base distribution, but it can be extremely useful to add your own features.

Aggregate Expressions

Aggregate expressions are used to perform operations with multiple lines returned by a SELECT statement. DISTINCT is a good example of an aggregate expression. If multiple rows contain the same data, DISTINCT returns the duplicate entries only once. Assume a query where you want to retrieve all names from a table and you want each name to be returned only once; DISTINCT is the solution.
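A minimal sketch (emp is the employee table created later in this chapter):

SELECT DISTINCT empname FROM emp;

Each name stored in emp is returned exactly once, no matter how often it occurs in the table.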

Building a Database

In this section, you will learn to build simple PostgreSQL databases. PostgreSQL supports a lot of functions and features to make the definition of a data structure easy. In addition, data can easily be inserted into tables using simple SQL commands.

Building Simple Data Structures

The core component of every data structure is a table. Tables are used to store the data, and most database operations are based on tables. Defining and removing tables correctly are essential steps when working with databases.

Creating Tables

In this section, you learn how to create tables with the help of SQL commands. If you want to create a table, you can use the CREATE TABLE command.

If you want to create a table called emp for storing names and salaries, you use the following command:

name=# CREATE TABLE emp(id serial, empname varchar(50), sal numeric(9,2));

The display is the following:

NOTICE: CREATE TABLE will create implicit sequence 'emp_id_seq' for SERIAL column 'emp.id'
NOTICE: CREATE TABLE/UNIQUE will create implicit index 'emp_id_key' for table 'emp'
CREATE

You can check to see whether the table has been created by using the \d command—in this case, \d emp:

name=# \d emp
          Table "emp"
 Attribute |     Type     |                   Modifier
-----------+--------------+----------------------------------------------
 id        | integer      | not null default nextval('emp_id_seq'::text)
 empname   | varchar(50)  |
 sal       | numeric(9,2) |
Index: emp_id_key

The table has successfully been created and contains three columns. The first column is used as a sequence; every record will have a unique id. If the datatype serial is used for a column, a sequence is implicitly created by the database. The second column will be used to store the name of the person. Because a name has no fixed length, you use the datatype varchar() with a maximum length of 50 characters. The third column will be used to store the salary. Salaries are usually decimal values. In this case, the salary can have up to 7 digits before the decimal point and 2 digits after it. PostgreSQL automatically creates an index on the first column if no special primary key is defined.

Here is a second, slightly different CREATE TABLE command:

CREATE TABLE emp2 (id serial, 
    empname varchar(50) UNIQUE, 
    sal numeric(9,2) NOT NULL, 
    currency varchar(4) DEFAULT 'USD');

The following is displayed:

NOTICE: CREATE TABLE will create implicit sequence 'emp2_id_seq' for SERIAL column 'emp2.id'
NOTICE: CREATE TABLE/UNIQUE will create implicit index 'emp2_id_key' for table 'emp2'
NOTICE: CREATE TABLE/UNIQUE will create implicit index 'emp2_empname_key' for table 'emp2'
CREATE

We have created a table called emp2, but the column named empname can be used to store only unique names. This can be useful because it is extremely confusing to have two people with exactly the same name but with different salaries in the database. If someone tries to insert a name twice, the database will report an error. NOT NULL means that a correct salary has to be available. If no salary for a particular row is inserted into the table, an error will be displayed. The fourth column is used to store the currency of the salary—the default value is set to USD. The default value will be used if no value is inserted into the field.

If you don't know the syntax of the command by heart, you can simply use the \h CREATE TABLE command. The following lines will be displayed:

name=# \h CREATE TABLE 
Command: CREATE TABLE 
Description: Creates a new table 

Syntax: CREATE [ TEMPORARY | TEMP ] TABLE table ( column type [ NULL | NOT NULL ] [ UNIQUE ]
[ DEFAULT value ] [ column_constraint_clause | PRIMARY KEY ] [ ... ] [, ... ]
[, PRIMARY KEY ( column [, ...] ) ] [, CHECK ( condition ) ] [, table_constraint_clause ] )
[ INHERITS ( inherited_table [, ...] ) ]

Using Temporary Tables

Temporary tables are an extremely powerful feature of PostgreSQL. They can be used to store session-specific information in a convenient way. Temporary tables can be created with the CREATE TEMPORARY TABLE command, which works like the CREATE TABLE command.

The difference between ordinary and temporary tables is that temporary tables are visible only while the session is active. If users quit the session, all temporary tables they have created will be deleted by the database.

If two users log in simultaneously, both users can create temporary tables with the same name because the temporary tables of one user can't be seen by the other user. This feature makes temporary tables easy and safe to use.
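A minimal sketch (the table name and columns are invented for the example):

CREATE TEMPORARY TABLE tmp_result (id int4, val numeric(9,2));

tmp_result is visible only within the current session and is dropped automatically when the session ends.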

The DROP TABLE Command

The DROP TABLE command can be used to delete tables. Here is an example:

DROP TABLE emp2;

The table will be dropped immediately and no ROLLBACK can be performed. You will learn about ROLLBACK and transactions in Chapter 4, "PL/PGSQL."

Be careful when using DROP TABLE. DROP TABLE can also be used to drop multiple tables at once: if you want to drop more than one table, add a comma and the name of each additional table to the statement, and all listed tables will be deleted (as long as you have the permission and the tables exist).

DROP TABLE does not automatically drop sequences that are implicitly created when you use serials in a table. This is extremely important when you want to create a table with the same name and structure as one you have just dropped, because the sequence won't be overwritten. To solve the problem, drop the sequence manually with the DROP SEQUENCE command.
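A sketch, using the emp2 table created earlier in this chapter:

DROP TABLE emp2;
DROP SEQUENCE emp2_id_seq;

Dropping the implicitly created sequence by hand makes it possible to re-create the table with the same name and a fresh serial column.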

The ALTER TABLE Command

The ALTER TABLE command can be used to perform multiple operations. ALTER TABLE can be used to add columns to a table. The following example adds a column called currency to table emp (the one we created in the CREATE TABLE section):

ALTER TABLE emp ADD COLUMN currency varchar(4);

The new column is a varchar and can be up to 4 characters long. If we want to set the default value of that column to USD, we have to write a second SQL statement:

ALTER TABLE emp ALTER COLUMN currency SET DEFAULT 'USD';

The default value is set to USD, but ALTER TABLE can also be used to rename tables. Here is an example where the table emp is renamed to oldemp:

ALTER TABLE emp RENAME TO oldemp;

Columns can also be renamed:

ALTER TABLE emp RENAME COLUMN sal TO salary;

In the example, the column sal is renamed to salary. The ALTER TABLE command is indeed powerful and useful. Here is the complete definition of the command's syntax (try \h ALTER TABLE):

Command:   ALTER TABLE
Description: Modifies table properties
Syntax:
ALTER TABLE table [ * ]
  ADD [ COLUMN ] column type
ALTER TABLE table [ * ]
  ALTER [ COLUMN ] column { SET DEFAULT value | DROP DEFAULT }
ALTER TABLE table [ * ]
  RENAME [ COLUMN ] column TO newcolumn
ALTER TABLE table
  RENAME TO newtable
ALTER TABLE table
  ADD table constraint definition

You can see that setting or dropping default values and adding constraints can also be done easily.

Note

Renaming tables can sometimes be very tricky and can lead to confusion. We have renamed the table emp to oldemp, but what about the sequence used in the first column? It is still called emp_id_seq. As the number of tables and sequences increases, this can become a complicated and confusing issue.

Creating and Dropping Indices

Indices are used to speed up queries. PostgreSQL normally uses B-trees for indexing a column. It is also possible to create a single index for multiple columns, but let's look at the syntax of CREATE INDEX first:

persons=# \h CREATE INDEX
Command:   CREATE INDEX
Description: Constructs a secondary index
Syntax:
CREATE [ UNIQUE ] INDEX index_name ON table
  [ USING acc_name ] ( column [ ops_name ] [, ...] )
CREATE [ UNIQUE ] INDEX index_name ON table 
  [ USING acc_name ] ( func_name( column [, ... ]) [ ops_name ] )

Let's create an index:

CREATE UNIQUE INDEX idx_oldemp_empname ON oldemp (empname);

In this example, we create an index called idx_oldemp_empname on the table oldemp. The index is used to store unique values, which means that a name can appear only once in the table. The index is defined for the column empname. PostgreSQL uses B-trees for indices by default. If you want to use R-trees or hashes, the SQL command has to be modified slightly:

CREATE INDEX idx_oldemp_empname ON oldemp USING HASH (empname);

Note

Hashes and R-trees cannot be created in combination with the UNIQUE constraint. It is also not possible to use R-trees for certain datatypes. If you try, the following error messages are displayed:

name=# CREATE UNIQUE INDEX idx_oldemp_empname ON oldemp USING HASH (empname);

ERROR: DefineIndex: unique indices are only available with the btree access method

name=# CREATE UNIQUE INDEX idx_oldemp_empname ON oldemp USING RTREE (empname);

ERROR: DefineIndex: unique indices are only available with the btree access method

name=# CREATE INDEX idx_oldemp_empname ON oldemp USING RTREE (empname);

ERROR: DefineIndex: opclass "varchar_ops" not supported by access method "rtree"

Indices can also be defined for multiple columns; here is an example where one index is used for two columns:

CREATE INDEX idx_oldemp_salcur ON oldemp (salary,currency);

If you want to drop the index, use the following command:

DROP INDEX idx_oldemp_salcur;

You should use idx as a prefix or postfix in the name of your index (it helps you keep track of indices in larger projects).

Inserting Data

Inserting data into your table is usually done with the INSERT command. Let's assume that you want to insert an employee named Alan Cop with a salary of 20000 USD into the table called oldemp:

INSERT INTO oldemp (empname,salary,currency) VALUES ('Alan Cop',20000,'USD');

The first thing to be specified is the name of the table where you want to insert the data. The names of the columns to which data will be added have to be listed in parentheses. In our example, we don't add a value to the column called id, because this column will get its value from the serial used as the primary key. All fields that are not listed in the parentheses will contain NULL or their default values after the INSERT.

Importing Large Datafiles and Using Pipes on UNIX Systems

If large amounts of data have to be imported into a PostgreSQL server, INSERT statements are far too slow for that purpose.

In this case, other methods, such as files and pipes, are far more effective. You can use the COPY command to import data directly from files. PostgreSQL supports reading both binary and ASCII data, which can be quite useful because working with binary data can sometimes be much faster.

Suppose you want to import the following tab-separated file into table oldemp:

1    Paul Stallman    18000  USD
2    Steven Neundorf   4000  DM
3    Josef Russell   20000  USD
4    Alan Torvalds   22000  USD

Here is the required COPY command:

COPY oldemp FROM '/tmp/people.data' USING DELIMITERS '\t';

You import the file people.data in /tmp into the table called oldemp, using tabs (\t) as delimiters. But let's try to import the file again:

name=# COPY oldemp FROM '/tmp/people.data' USING DELIMITERS '\t';
ERROR: copy: line 1, Cannot insert a duplicate key into unique index emp_id_key

It can't be done again because primary keys have to be unique. Importing the data a second time leads to double entries.

In this example, we manually inserted data into a column that uses a serial as its primary key. What happened to the sequence the serial uses to generate the row ids? You use a SELECT statement to retrieve the value (you learn about SELECT in Chapter 4):

name=# SELECT sequence_name, last_value FROM emp_id_seq;
 sequence_name | last_value
---------------+------------
 emp_id_seq    |          1
(1 row)

You can see that the value of the sequence is still 1, although we have already inserted four values into the table. This can lead to real difficulties when you start to insert values into the database using INSERT statements:

INSERT INTO oldemp (empname, salary, currency) 
    VALUES ('John Blundell',19000,'USD');
ERROR: Cannot insert a duplicate key into unique index emp_id_key

Because we want the serial to generate the primary key, we don't add the name and the value of the field id to the INSERT statement. During the INSERT, the current value of the emp_id_seq sequence is selected, and the value of the sequence is incremented by 1 if the insert succeeds. Unfortunately, the sequence's value is still 1, and 1 is already used in the table. Because primary keys have to be unique, the INSERT fails. Temporary tables can be used to solve the problem: if you want to use sequences for your primary key, create a temporary table and import the data into that table. Then perform a SELECT INTO operation to copy the data to the final table. An advantage of that approach is that you don't have to remove trash when the import fails; removing trash from tables can be tricky in production environments.
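Another way out is to adjust the sequence after the import. A sketch, assuming the four rows imported above (setval() sets a sequence to a given value):

SELECT setval('emp_id_seq', (SELECT max(id) FROM oldemp));

The sequence now matches the highest id already in use, so the next INSERT gets an unused value.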

Pipes are a powerful feature on UNIX systems and can be used to redirect the standard output of a process directly to a PostgreSQL server. Using them is simple:

yourprocess.pl | psql yourdatabase

The output of yourprocess.pl must be formatted properly so that psql can use it as standard input. A good choice of a properly formatted standard output is an SQL statement.

Note

The standard error of the yourprocess.pl process is not redirected; only its standard output is piped to psql.

If you have processes that run for hours or produce a lot of standard output or standard errors, you might want to store the result of the process in a file. In this case, the UNIX command nohup might be useful. If you start a background process with the help of nohup, you can log out while the process is working. This is not possible when starting a command with & only. Here is an example of a nohup command:

nohup ./myprocess.pl &

myprocess.pl might be a Perl script that performs complex PostgreSQL commands. If you have logged in to a remote machine using your favorite terminal software and your network is not a reliable one, nohup can save you a lot of time (imagine a process being killed after 26 hours because of a network interruption).

Retrieving Data

Let's look at a sample database called shop. The database is a prototype of a bookshop, and we will use it in most parts of this chapter. It might look a bit confusing at first, but we'll work through every detail of the SQL code and the database structure. The following is the output of pg_dump, an interface for dumping databases into ASCII files:

CREATE SEQUENCE "products_id_seq" start 5 increment 
    1 maxvalue 2147483647 minvalue 1 cache 1 ;
SELECT nextval ('"products_id_seq"');
CREATE SEQUENCE "prodtext_id_seq" start 8 increment 
    1 maxvalue 2147483647 minvalue 1 cache 1 ;
SELECT nextval ('"prodtext_id_seq"');
CREATE SEQUENCE "prodcat_id_seq" start 9 increment 
    1 maxvalue 2147483647 minvalue 1 cache 1 ;
SELECT nextval ('"prodcat_id_seq"');
CREATE SEQUENCE "customer_id_seq" start 3 increment 
    1 maxvalue 2147483647 minvalue 1 cache 1 ;
SELECT nextval ('"customer_id_seq"');
CREATE SEQUENCE "sales_id_seq" start 4 increment 
    1 maxvalue 2147483647 minvalue 1 cache 1 ;
SELECT nextval ('"sales_id_seq"');

CREATE TABLE "products" (
    "id" int4 DEFAULT 
        nextval('products_id_seq'::text) NOT NULL,
    "prodid" int8 NOT NULL,
    "price" numeric(9,2) NOT NULL,
    "stored" numeric(9,2) DEFAULT 0,
    PRIMARY KEY ("prodid")
);

CREATE TABLE "prodtext" (
    "id" int4 DEFAULT 
        nextval('prodtext_id_seq'::text) NOT NULL,
    "prodid" int8,
    "lang" text,
    "proddesc" text
);

CREATE TABLE "prodcat" (
    "id" int4 DEFAULT 
        nextval('prodcat_id_seq'::text) NOT NULL,
    "prodid" int8 NOT NULL,
    "prodcat" character varying(50) NOT NULL,
    PRIMARY KEY ("id")
); 

CREATE TABLE "customer" (
    "id" int4 DEFAULT 
        nextval('customer_id_seq'::text) NOT NULL,
    "name" character varying(50) NOT NULL,
    "email" character varying(50),
    "state" character varying(50) NOT NULL,
    PRIMARY KEY ("name")
);

CREATE TABLE "sales" (
    "id" int4 DEFAULT 
        nextval('sales_id_seq'::text) NOT NULL,
    "salestime" timestamp,
    "prodid" int8,
    "prodprice" numeric(9,2),
    "amount" int4,
    "name" character varying(50),
    "email" character varying(50),
    "state" character varying(50),
    PRIMARY KEY ("id")
);

COPY "products" FROM stdin;
1    385343 44.99 20.00
2    385342 49.99 64.00
3    394568 39.99 18.00
4    106666 39.99 120.00
5    765354 29.99 98.00
\.

COPY "prodtext" FROM stdin;
1    385343 english Python Developer's Handbook
2    385343 german Python Entwickler Handbuch
3    385342 english Linux Socket Programming
4    394568 english Linux Hardware Handbook
5    394568 german Das Linux Hardware Buch
6    106666 english Linux: Networking for your Office
7    106666 german Linuxnetzwerke fürs Büro 
8    765354 english Palm Programming
\.

COPY "prodcat" FROM stdin;
1    385343 Python
2    385343 Developer
3    385343 Handbook
4    385342 Linux
5    394568 Linux
6    394568 Handbook
7    106666 Networking
8    106666 Linux
9    765354 Palm
\.

COPY "customer" FROM stdin;
1    John  john@no.any  Florida
2    Robert rober@no.any  Georgia
3    Peter peter@no.any  Alaska
\.

COPY "sales" FROM stdin;
1    2001-04-03 15:01:27+02 106666 39.99  1   John  john@no.any  Florida
2    2001-05-12 16:22:09+02 394568 39.99  1   John  john@no.any  Florida
3    2001-05-12 17:19:04+02 385343 49.99  2   Robert robert@no.any Florida
4    2001-06-12 17:19:04+02 385343 44.99  1   Robert robert@no.any Florida
\.

CREATE INDEX "products_id_key" on "products" 
    using btree ( "id" "int4_ops" );

CREATE INDEX "prodtext_id_key" on "prodtext" 
    using btree ( "id" "int4_ops" );

CREATE INDEX "prodcat_id_key" on "prodcat" 
    using btree ( "id" "int4_ops" );

CREATE INDEX "customer_id_key" on "customer" 
    using btree("id" "int4_ops" );

CREATE INDEX "customer_email_key" on "customer" 
    using btree ( "email" "varchar_ops" );

CREATE INDEX "sales_email_key" on "sales" 
    using btree ( "email" "varchar_ops" );

CREATE CONSTRAINT TRIGGER "<unnamed>" AFTER DELETE 
    ON "products" NOT DEFERRABLE INITIALLY IMMEDIATE 
    FOR EACH ROW EXECUTE PROCEDURE 
    "RI_FKey_noaction_del" ('<unnamed>', 'prodtext',
    'products', 'UNSPECIFIED', 'prodid', 'prodid');

CREATE CONSTRAINT TRIGGER "<unnamed>" AFTER UPDATE 
    ON "products" NOT DEFERRABLE INITIALLY 
    IMMEDIATE FOR EACH ROW EXECUTE PROCEDURE
    "RI_FKey_noaction_upd" ('<unnamed>', 'prodtext', 
    'products', 'UNSPECIFIED', 'prodid', 'prodid');

CREATE CONSTRAINT TRIGGER "<unnamed>" AFTER 
    INSERT OR UPDATE ON "prodtext" 
    NOT DEFERRABLE INITIALLY IMMEDIATE FOR EACH ROW 
    EXECUTE PROCEDURE "RI_FKey_check_ins" ('<unnamed>',
    'prodtext', 'products', 'UNSPECIFIED', 
    'prodid', 'prodid');

Let's take a closer look at this code. First, sequences are created. Every table has a column called id, and the sequences assign a unique id to every row in the database. Then we create a table called products to store all information about the products in the shop. The column called prodid will be a unique product id. prodid is also the primary key of the table. price will contain the current price of the product, and stored will contain the number of available units of the product.

prodcat contains the product categories. One book might have multiple entries. The table is referred to by the product id. prodtext stores the title of a book in multiple languages. The customer table stores information about all registered users, and the sales table stores all information about sales, such as the number of books a user purchased or the price of a product.

The COPY commands are used to read data directly from the input file. Then, indices are created on the id column of all tables. At this point, it is useful to add additional indices for certain columns to speed up queries. Which columns are worth indexing depends on the type of queries the user wants to perform.

Note

Creating indices is not always a good idea, because every index consumes disk space. Wasted disk space is not the only drawback: Indices can also slow the database down, especially when performing INSERT statements. This happens because the database must update both the index and the data, which is slower than updating the data alone.

Finally, triggers are added to guarantee the integrity of the database. It is not useful to have a product in table prodtext that doesn't exist in table products anymore—the triggers help the user to do some data cleaning.

If you want to create a database with the data shown previously, create an empty database (use createdb shop) and write a simple text file containing the SQL code. Then type

psql shop < yourtextfile.sql

As soon as the import is ready, you can connect to your database with psql.
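To connect to the sample database, for example, you simply type

psql shop

and you can start entering queries at the psql prompt.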

Simple Queries

Retrieving data is probably the most important activity when dealing with databases. Storing data is useful only if you can find the right piece of information any time it is needed. In relational databases, you use the SELECT command to retrieve data. SELECT statements can be very simple but also very tricky. This section gives you a short overview of how to use SELECT for simple queries. Let's start with an example. We want to retrieve all records where the product id is 385343:

SELECT * FROM products WHERE prodid='385343';

The database executes the query and if everything is configured correctly, displays this result:

 id | prodid | price | stored
----+--------+-------+--------
 1 | 385343 | 44.99 | 20.00
(1 row)

The database has returned all requested records and all columns of the table. If you want to have only the last three columns of the table and only products with a prodid that is higher than 385343, you use the following query:

SELECT prodid,price,stored FROM products WHERE prodid > '385343';

The result is

 prodid | price | stored
--------+-------+--------
 394568 | 39.99 | 18.00
 765354 | 29.99 | 98.00
(2 rows)

Sometimes it is necessary to build more complex WHERE clauses. Imagine a query where we want to retrieve all records where the product id is higher than 385343 and the price is lower than 35$. The query looks like this:

SELECT * FROM products WHERE prodid > '385343' AND price < 35;

The result contains only one value, because the price of product number 394568 is higher than 35$. The AND token tells the database to return only values where both conditions are fulfilled.

Very often, AND and OR tokens have to be combined to find the right result. The following query retrieves all records in table products where the product id is higher than 385343 and the price is lower than 35$. Additionally, all records are displayed where the price is exactly 44.99$. With the help of the OR token, the database returns all records where the left expression (prodid > '385343' AND price < 35) or the right expression (price='44.99') returns true:

SELECT * FROM products WHERE (prodid > '385343' 
    AND price < 35) OR price='44.99';

Here is the result for this query:

 id | prodid | price | stored
----+--------+-------+--------
 1 | 385343 | 44.99 | 20.00
 5 | 765354 | 29.99 | 98.00  

The product with id 385343 is returned because its price matches the right expression exactly. The product with number 765354 is returned because its product id is higher than 385343 and its price, 29.99$, is lower than 35$.

You can see in these first few examples that SQL offers an easy syntax to perform simple queries quickly and without writing huge amounts of code.

Let's look at another example, finding all categories in table prodcat. Because some categories might be listed twice, you can use the DISTINCT operator to make sure that each word appears only once in the result:

SELECT DISTINCT prodcat FROM prodcat;

The result is

 prodcat
------------
 Developer
 Handbook
 Linux
 Networking
 Palm
 Python
(6 rows) 

Only 6 rows are returned, although 9 rows are in the database, because the words Handbook and Linux appear more than once in the table. Use the following if you want the database to return only the first 3 rows in alphabetical order:

SELECT DISTINCT prodcat FROM prodcat 
    ORDER BY prodcat ASC LIMIT 3;

The result is

 prodcat
-----------
 Developer
 Handbook
 Linux
(3 rows) 

You use the ORDER BY clause to sort the result. ASC means that the result is sorted in ascending order; if you want the database to return the result in descending order, you use the DESC token. The LIMIT clause saves time when performing SELECT * operations on extremely huge tables. You can also display all but the first few records: With the help of OFFSET, you can skip the first n records returned.
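For example, to skip the first three categories and display all the others, you can use OFFSET on its own:

SELECT DISTINCT prodcat FROM prodcat 
    ORDER BY prodcat ASC OFFSET 3;

This returns the remaining three categories: Networking, Palm, and Python.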

You can use OFFSET and LIMIT together. If you want to get only the fourth record returned by the query, you use the following:

SELECT DISTINCT prodcat FROM prodcat 
    ORDER BY prodcat ASC LIMIT 1 OFFSET 3;

This query returns Networking. The explanation for the result is simple: OFFSET 3 makes the database skip the first 3 rows. LIMIT 1 takes the first line and omits the rest.

Performing Simple Calculations

In many cases, it is useful to perform simple calculations directly within an SQL statement. Assume a query where you want to calculate the total amount of money a user has to pay:

SELECT prodid, prodprice, amount, prodprice*amount, name 
    FROM sales; 

You take the price of one unit of the product and multiply it with the amount of units the user bought. The result is

 prodid | prodprice | amount | ?column? | name
--------+-----------+--------+----------+--------
 106666 |     39.99 |      1 |    39.99 | John
 394568 |     39.99 |      1 |    39.99 | John
 385343 |     49.99 |      2 |    99.98 | Robert
 385343 |     44.99 |      1 |    44.99 | Robert

You also can write more complex queries. In all cases, SQL uses the "infix" syntax.
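For instance, the following computes a hypothetical gross price by adding 20 percent tax to every sale; the tax rate and the column alias are invented for the example:

SELECT prodid, prodprice*amount*1.2 AS gross 
    FROM sales;

Arithmetic operators such as +, -, * and / can be combined and nested with parentheses just as in ordinary infix notation.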

Joining Tables

Joining tables is an extremely important issue when talking about relational and object relational databases. Joins are used to select and combine information from multiple columns. The tables you want to join must have at least one column in common.

Suppose you want to create a list with all product descriptions, their price, and the amount of books stored. The amount of books stored and the price can be found in the products table. The description of the products can be found in the prodtext table. Both tables have a column for the product id, so you can join the tables with the help of the product id:

SELECT prodtext.lang, prodtext.proddesc, 
    products.price, products.stored 
    FROM products, prodtext 
    WHERE products.prodid=prodtext.prodid;

The result is the following list of eight records containing the German and English titles of the books:

  lang   |             proddesc              | price | stored
---------+-----------------------------------+-------+--------
 english | Linux: Networking for your Office | 39.99 | 120.00
 german  | Linuxnetzwerke fürs Büro          | 39.99 | 120.00
 english | Linux Socket Programming          | 49.99 |  64.00
 english | Python Developer's Handbook       | 44.99 |  20.00
 german  | Python Entwickler Handbuch        | 44.99 |  20.00
 english | Linux Hardware Handbook           | 39.99 |  18.00
 german  | Das Linux Hardware Buch           | 39.99 |  18.00
 english | Palm Programming                  | 29.99 |  98.00

If you now want to retrieve all books (English title only) where more than 80 books are stored, you use the following query:

SELECT prodtext.lang, prodtext.proddesc, 
    products.price, products.stored 
    FROM products, prodtext 
    WHERE products.prodid=prodtext.prodid
        AND prodtext.lang='english' 
        AND products.stored>80;

The result is

  lang   |             proddesc              | price | stored
---------+-----------------------------------+-------+--------
 english | Linux: Networking for your Office | 39.99 | 120.00
 english | Palm Programming                  | 29.99 |  98.00

Only two books are left. In the example, you can see that combining data using AND tokens is extremely easy. Remember the most important point when joining tables: The tables must have at least one column in common. This example contains the column called prodid. Don't forget to perform the actual join (products.prodid=prodtext.prodid) in your query. If you do forget, you will get the Cartesian product of the two tables as the result. The Cartesian product is a join without a WHERE clause. Try the following:

SELECT * FROM products, prodtext;

to find out about the outcome. The result will be 40 lines long, and most of the lines won't have anything in common.

In the next example, we want to generate a complete list of all sales including the product description:

SELECT sales.salestime, prodtext.proddesc, sales.amount 
    FROM sales, prodtext 
    WHERE prodtext.prodid=sales.prodid 
        AND prodtext.lang='english'; 

This joins the table prodtext and the table sales via the product id:

       salestime        |             proddesc              | amount
------------------------+-----------------------------------+--------
 2001-05-12 17:19:04+02 | Python Developer's Handbook       |      2
 2001-06-12 17:19:04+02 | Python Developer's Handbook       |      1
 2001-05-12 16:22:09+02 | Linux Hardware Handbook           |      1
 2001-04-03 15:01:27+02 | Linux: Networking for your Office |      1

Self-Joins and Aliases

This section gives you insight into aliasing and self-joins. Aliases can be used to assign an additional name to a column. On the one hand, it is possible to write shorter and easier-to-read queries (aliasing can speed up writing a query significantly, especially when the query string is more complex). On the other hand, aliases can be used to perform self-joins. With a self-join, a table performs a join with itself. You will see an example of this later. First let's look at a simple example using aliases:

SELECT * FROM products AS a where a.stored=20;

This query selects all records from table products where the amount of books stored is exactly 20. In this case, a is an alias for products. The query can also be written the following way:

SELECT * FROM products a where a.stored=20;

It is not necessary to use the AS token, but AS can sometimes make the query easier to read.

Let's look at a shop where each product can be assigned to multiple product categories. Because the number of categories of a product should not be limited, we won't define a separate column for every category. Instead, we use a table with just one column for the product category. The following table is called prodcat in the sample database (use "SELECT * FROM prodcat;"):

 id | prodid | prodcat
----+--------+------------
 1 | 385343 | Python
 2 | 385343 | Developer
 3 | 385343 | Handbook
 4 | 385342 | Linux
 5 | 394568 | Linux
 6 | 394568 | Handbook
 7 | 106666 | Networking
 8 | 106666 | Linux
 9 | 765354 | Palm    

First you write a query to find all products that are assigned to the category Handbook:

SELECT * FROM prodcat WHERE prodcat='Handbook'; 

Two records are returned:

 id | prodid | prodcat
----+--------+----------
 3 | 385343 | Handbook
 6 | 394568 | Handbook

Then we want to find all products that are listed in categories Handbook and Developer. We use a self-join:

SELECT * 
    FROM prodcat AS a, prodcat AS b 
    WHERE a.prodcat='Handbook'
        AND b.prodcat='Developer' 
        AND a.prodid=b.prodid;

The result is

 id | prodid | prodcat | id | prodid | prodcat
----+--------+----------+----+--------+-----------
 3 | 385343 | Handbook | 2 | 385343 | Developer

The database performs the join as if we had two identical tables. The result is only one record because only this record matches both categories. It is also possible to look for all records that match three categories.

Note

When working with multiple tables, you should use the * operator with care, because * displays all columns from all tables listed in the FROM clause.

In the next example, we look for all records that are assigned to Handbook, Developer, and Python:

SELECT * 
    FROM prodcat AS a, prodcat AS b, prodcat AS c 
    WHERE a.prodcat='Handbook' 
        AND b.prodcat='Developer' 
        AND c.prodcat='Python'
        AND a.prodid=b.prodid 
        AND a.prodid=c.prodid 
        AND b.prodid=c.prodid;

The result is again one record:

id | prodid | prodcat | id | prodid | prodcat | id | prodid | prodcat 
----+--------+----------+----+--------+-----------+----+--------+---------
 3 | 385343 | Handbook | 2 | 385343 | Developer | 1 | 385343 | Python

You can see in these examples that self-joins can be done rather easily. The only crucial point is that you must not forget the actual joins, or you won't get the right result. If you try your queries on small testing tables, make sure that the tables contain every possible difficulty. If the result of a query seems to be right, go through the code again and check whether all necessary join operations are in it. If you don't test thoroughly, you might have bad luck when performing the query on large tables.

The output of the previous query is not a beautiful one. Try the following query to make the result easier to understand:

SELECT a.id AS rowid, a.prodid AS "product id" 
    FROM prodcat AS a, prodcat AS b, prodcat AS c 
    WHERE a.prodcat='Handbook' 
        AND b.prodcat='Developer' 
        AND c.prodcat='Python' 
        AND a.prodid=b.prodid 
        AND a.prodid=c.prodid 
        AND b.prodid=c.prodid; 

The result is much nicer than the previous one:

 rowid | product id
-------+------------
   3 |  385343  

In this example, you see how you can change the column headings of a result with the help of aliases. If a new heading contains whitespace characters, such as blanks, it has to be quoted in the SQL query; if not, it can be written without quotes. The table is now easier to read.

Aggregation Functions

Aggregates are a powerful feature of relational databases. They can be used to perform operations with multiple lines involved. Let's look at an example where we want to count the lines in a table with the help of the count() function:

SELECT COUNT(*) FROM products;

The result is

 count
-------
   5
(1 row)

The count() function can also be combined with a WHERE clause. The following example shows you how to count all records where the amount of stored books is equal to or higher than 64:

SELECT COUNT(*) FROM products WHERE stored >= 64;

The result of the query is 3. As you might have noticed in the previous example, we queried and displayed only one column; now we also want the product id to be displayed:

SELECT prodid, COUNT(*) FROM products; 

This query leads to an error:

ERROR: Attribute products.prodid must be GROUPed or used in an aggregate function     

The count() function doesn't know what to count anymore, and the database won't perform the query. The problem can be solved by adding a GROUP BY clause to the query:

SELECT prodid, COUNT(*) FROM products GROUP BY prodid;

The result is a list with all product ids and the frequency they appear in the table:

 prodid | count
--------+-------
 106666 |   1
 385342 |   1
 385343 |   1
 394568 |   1
 765354 |   1
(5 rows)    

In the example, you can see that every product id appears only once in the table.

Now we want to write a more complex query. We want to get a list with the sum of all sales of a certain book, including the product description. We have to perform a join of table sales and table prodtext. We want only the English titles to be displayed. Here is the query:

SELECT a.proddesc, SUM(b.amount) 
    FROM prodtext AS a, sales AS b 
    WHERE a.prodid=b.prodid 
        AND a.lang='english' 
    GROUP BY b.prodid, a.proddesc; 

The result is

       proddesc       | sum
-----------------------------------+-----
 Linux: Networking for your Office |  1
 Python Developer's Handbook    |  3
 Linux Hardware Handbook      |  1
(3 rows)    

We use the sum() function to get the sum of all sales. In the example, we list two columns in the GROUP BY clause. This is necessary because every nonaggregated column in the SELECT list (here a.proddesc) must appear in the GROUP BY clause. We have also used aliasing to make the actual code of the query shorter. Don't forget the actual join operation in the query; in this example, the join is performed by the a.prodid=b.prodid expression.

In some cases, it can be useful to get only a small piece of the result of a query using aggregate functions. Imagine an example where we want to get all products that have been sold more than once:

SELECT prodid, SUM(amount) AS "total sum" 
    FROM sales 
    GROUP BY prodid 
    HAVING SUM(amount) > 1;

The result is very short:

 prodid | total sum
--------+-----------
 385343 |     3
(1 row)  

Only product number 385343, of which more than one unit has been sold, appears in the list. The query contains a HAVING clause. HAVING is used to perform condition tests on aggregate functions. In our case, we check whether the sum calculated for each group is higher than 1. A query can also have AND or OR tokens in the HAVING clause. The following example adds an AND token to the query we used earlier:

SELECT prodid, SUM(amount) AS "total sum" 
    FROM sales 
    GROUP BY prodid 
    HAVING SUM(amount) > 1
        AND SUM(amount) < 10;  

The result is the same as the result of the query used before, because no product was sold more often than nine times.

Now we want to calculate the average price of the products in table products:

SELECT AVG(price) FROM products; 

The database performs the calculation and returns the following:

   avg
---------------
 40.9900000000
(1 row)  

The value returned is precise but in many cases not suitable. We want the result to be displayed like a currency, which means only two digits after the decimal point. For that purpose, we use the round() function, which needs two arguments: The first argument is the actual value that has to be processed, and the second argument defines the precision. In our case, the query looks like this:

SELECT ROUND(AVG(price),2) FROM products; 

The result is

 round
-------
 40.99
(1 row) 

This result is what we expected. Now we want the result to be displayed in full dollars, without decimal places:

SELECT ROUND(AVG(price),0) FROM products; 

The result of the query is now 41, because 41 is the closest integer value. The second parameter of the function can also be negative:

SELECT ROUND(AVG(price),-2) FROM products; 

As you might have thought, the result is 0. You can see in these examples that PostgreSQL supports nested functions.

The next example shows how to calculate the minimum or the maximum of a column:

SELECT MIN(price) FROM products;

The result is

 min
-------
 29.99
(1 row)   
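MAX() works the same way. To find the most expensive product, you can use

SELECT MAX(price) FROM products;

which returns 49.99, the highest price in the table.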

Note

Check out the docs of your PostgreSQL database to determine whether your version can also perform MAX() and MIN() with the help of indices. Otherwise, your query can be very slow on large tables.

Subqueries

Subqueries are used to calculate values that are needed within the query that is executed. To make it clearer, here is an example: Imagine you want to find all persons in a table whose income is higher than the average income of the persons stored in the table. You need the average income to find the persons with a higher income. One possibility to solve the problem is to perform two separate statements, but what do you do when half of the persons stored in the tables are removed by another user while processing the two SELECT statements? The only way to solve the problem properly is to use subqueries. A subquery is actually a query within a query.

Here is an example where we select all products with a higher price than the average price:

SELECT * 
    FROM products 
    WHERE price > 
        (SELECT AVG(price) FROM products); 

First, the result of the subquery is calculated. After that, the result is used to perform the main query. Let's try it:

SELECT * 
    FROM products 
    WHERE price > 
        (SELECT * FROM products);

The database displays the following error:

ERROR: Subselect must have only one field

It is obvious that the database can't perform this query: A price can be compared with only a single value, but the subselect returns every column of the table. When performing subqueries of this kind, you have to make sure that the subquery returns exactly one field.

One query can contain many subselects. The following is a query with two subselects for selecting all products that are more expensive than the average price, but less expensive than the average price plus 5$:

SELECT *
    FROM products 
    WHERE price > 
        (SELECT AVG(price) FROM products) 
    AND price < 
        (SELECT AVG(price)+5 FROM products);  

There will be only one record left in the result:

 id | prodid | price | stored
----+--------+-------+--------
 1 | 385343 | 44.99 | 20.00
(1 row)           

Not only can a query contain many subselects; subselects can also contain subselects. This feature is powerful and useful, especially when working with complex queries and data. Just insert a subquery into your SQL code wherever you need a value that has to be calculated at runtime.
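Here is a sketch of a subselect inside a subselect, using the sample database: We select all products that are more expensive than the average price of the products assigned to the category Linux (the query is an illustration of the nesting, not part of the dump):

SELECT * 
    FROM products 
    WHERE price > 
        (SELECT AVG(price) FROM products 
            WHERE prodid IN 
                (SELECT prodid FROM prodcat 
                    WHERE prodcat='Linux'));

The innermost query collects the product ids of the Linux books, the middle query averages their prices, and the outer query compares every product against that average.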

Using SELECT and INSERT

INSERT statements can easily be combined with SELECT statements. These are usually done when a large amount of data has to be copied from one table into another or when the results of complex queries are stored in temporary tables to achieve better performance. In the following example, we create a temporary table and insert some values into it. Here is the CREATE TABLE command:

CREATE TEMPORARY TABLE temptable 
    (id integer, prodid int8, price numeric(9,2));

Note

The temporary table won't be listed by a \d command, but the table can be treated like a nontemporary table.

Now we insert some data from table products into table temptable:

INSERT INTO temptable(prodid, price) (SELECT prodid, price FROM products);   

Let's see whether the operations have been performed correctly:

SELECT * FROM temptable;

The test result is

 id | prodid | price
----+--------+-------
  | 385343 | 44.99
  | 385342 | 49.99
  | 394568 | 39.99
  | 106666 | 39.99
  | 765354 | 29.99
(5 rows)     

You can see that the database has done all INSERT operations properly.
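If the target table doesn't exist yet, you can also let the database create it on the fly with SELECT INTO; the table name temptable2 is chosen only for this sketch:

SELECT prodid, price 
    INTO TEMPORARY TABLE temptable2 
    FROM products;

This creates temptable2 with the column types taken from the query and fills it in a single step.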

Pattern Matching and Regular Expressions in SQL

A regular expression is a formula for matching strings that follow some pattern. Many people are afraid to use regular expressions, but when you get used to them, you can't live without them. Perl hackers especially seem to be addicted to regular expressions, because they are a fast and precise way of performing substring searching.

Regular expressions are implemented in all major databases and can therefore also be used in PostgreSQL. Although the syntax of regular expressions differs slightly from the syntax programming languages such as Perl, it is easy to learn.

Regular expressions consist of characters and metacharacters. Metacharacters are used to tell the database that a pattern has to occur more than once or something like that. In the simplest case, a regular expression can be a search string. Let's start with an example: We want to retrieve all people in table customer whose name is John:

SELECT * FROM customer WHERE name LIKE 'John';

The result is

 id | name |  email  | state
----+------+-------------+---------
 1 | John | john@no.any | Florida
(1 row)

Note

Without wildcard characters, LIKE matches the whole string: LIKE 'John' returns only records where the name is exactly John. To perform a substring search with LIKE, you have to use wildcards, as in LIKE '%John%'.

The LIKE token can be used to perform simple pattern matching, but it is also possible to use the ~ operator, which matches regular expressions, instead.
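Unlike LIKE without wildcards, the ~ operator performs a substring search, so the following query also finds John, because the name contains the pattern:

SELECT * FROM customer WHERE name ~ 'ohn';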

Let's write a query where all names starting with R are displayed:

SELECT * FROM customer WHERE name ~ '^R';

The ^ operator is used to match the beginning of a string. The R must be the first character after the beginning.

 id | name |  email   | state
----+--------+--------------+---------
 2 | Robert | rober@no.any | Georgia
(1 row)   

What happens if the name starts with r instead of R? The ~ operator performs case-sensitive searching. We can use the ~* operator to perform case-insensitive queries:

SELECT * FROM customer WHERE name ~* '^r';

The result is the same as the previous one. Now let's query all names that do not start with r:

SELECT * FROM customer WHERE name !~ '^R';  

The result contains the remaining records:

 id | name |  email   | state
----+-------+--------------+---------
 1 | John | john@no.any | Florida
 3 | Peter | peter@no.any | Alaska
(2 rows)  

The case-insensitive version of the query is

SELECT * FROM customer WHERE name !~* '^r';

In some cases, it is necessary to search for a sequence of characters in a string. Here is an example where we search for all names that have er somewhere in them:

SELECT * FROM customer WHERE name ~* 'er';

The query returns the records of Robert and Peter. It is also possible to see whether certain characters appear in a string:

SELECT * FROM customer WHERE name ~ '[A-P]'; 

This query returns all names that contain at least one of the letters from A to P. If you want to match any character that is not in that range instead, write [^A-P]. The same kind of query can be done with numbers.

Suppose we want to look for all records that do not have the numbers 2 or 3 in the id and contain the letters from A to P:

SELECT * FROM customer WHERE id ~ '[^23]' AND name ~ '[A-P]'; 

The result is

 id | name |  email  | state
----+------+-------------+---------
 1 | John | john@no.any | Florida
(1 row)

The $ operator symbolizes the end of a string. If we want to query all records that end with an n, we can use the following SQL command:

SELECT * FROM customer WHERE name ~ 'n$';

The only name ending with an n is John, so that is the only record returned. What happens if we want to query records containing a $ or any other metacharacter? To show the solution for such a problem, let's create a temporary table and insert one record:

CREATE TEMPORARY TABLE demo(demo text);
INSERT INTO demo VALUES ('demo$');

Now we want to write a query that retrieves the records:

SELECT * FROM demo WHERE demo ~ '\$';

The metacharacter has to be escaped using a backslash (\). The result of the query is

 demo
-------
 demo$
(1 row) 

In this example, you can see how powerful regular expressions are, but the real power of regular expressions can be seen when you have to perform complex queries. Assume a situation where someone tells you that he has bought a book on "the third." You don't know the month but you know that it must have been in the year 2001. You can write a query that selects all possible records from table sales:

SELECT id, salestime 
    FROM sales 
    WHERE salestime ~ '2001-.{2}-03';

The query selects all records whose salestime contains 2001, a -, and exactly 2 characters before -03. The two undefined characters are the month when the sale took place. A period (.) matches any character exactly once. Curly braces can be used to tell the database how many undefined characters must be found before the rest. If you don't know whether the month is stored with one or two digits, you can use the following:

SELECT id, salestime 
    FROM sales 
    WHERE salestime ~ '2001-.{1,2}-03';   

The result of the query contains the requested record:

 id |    salestime
----+------------------------
 1 | 2001-04-03 15:01:27+02
(1 row)

Assume that you knew that the sale took place in March or April; you can use this query:

SELECT id, salestime 
    FROM sales 
    WHERE salestime ~ '2001-.3?4?-03'; 

A question mark (?) symbolizes that a certain character may occur zero times or once. In this example, we look for March (3) or April (4). If one of the characters is found, the right result will be displayed.

Two additional operators are + and *. + means that the previous character must match once or more often; * means that the previous character must match zero or more times. Imagine a query where you are looking for a certain product id. The only thing you know is that it starts with 10 and continues with an unknown number of 6s. Here is the solution for your problem:

SELECT * FROM products WHERE prodid ~ '106*';

In this example, 6 has to occur after 10 zero or more times; the result will contain the requested product id:

 id | prodid | price | stored
----+--------+-------+--------
 4 | 106666 | 39.99 | 120.00
(1 row)     

Tables 3.1 and 3.2 present an overview of the most important operators and metacharacters, respectively, of regular expressions.

Table 3.1 Operators of Regular Expressions

 Operator   Description
 ---------  --------------------------------
 ~          Case-sensitive matching
 ~*         Case-insensitive matching
 !~         Does not match, case-sensitive
 !~*        Does not match, case-insensitive


Table 3.2 Metacharacters of Regular Expressions

 Metacharacter   Description
 --------------  --------------------------------------------------------
 ^               Matches the beginning of a string.
 $               Matches the end of a string.
 .               Matches any character exactly once.
 +               Matches the previous character once or more often.
 *               Matches the previous character zero times or more often.
 ?               Matches the previous character zero times or once.
 [abc]           Matches one of the characters in the list.
 [^abc]          Matches any character that is not in the list.
 [a-c]           Matches one character in the range a to c.
 [^a-c]          Matches any character that is not in the range a to c.
 {x}             The previous character must occur exactly x times.
 {x,y}           The previous character must occur x to y times.
 ( )             Parentheses are used to group expressions.


A group is treated as a single unit by the operator that follows it (for example, a * after the closing parenthesis means that the whole expression in parentheses has to match zero times or more often).
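For example, the following query (a sketch against the customer table used earlier) returns all names in which the sequence er occurs one or more times:

SELECT * FROM customer WHERE name ~ '(er)+';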

The examples of regular expressions might seem a bit strange. You might wonder who is looking for persons or for products with a stupid sequence of characters in it. Don't be surprised, but most of the examples are taken from real-world situations. Sometimes a customer wants to know something, but he can't say precisely what. He might say, "It must be something like that." In these cases, you can use regular expressions. Regular expressions can save you a lot of time and lots of code. You can solve complex problems within a few seconds and with only a few characters or lines of code.

Those people who are already addicted to regular expressions—like me—know what I am talking about.

Using SELECT INTO Statements

You use SELECT INTO to redirect the result of a query into a table. This is sometimes useful when debugging code, or when you want to write two simple SQL statements rather than one complex one:

SELECT prodid, price 
    INTO TABLE sel 
    FROM products;

The SELECT INTO statement creates a new table called sel. Let's look at whether the table has been created and filled correctly:

shop=# SELECT * FROM sel;
 prodid | price
--------+-------
 385343 | 44.99
 385342 | 49.99
 394568 | 39.99
 106666 | 39.99
 765354 | 29.99
(5 rows)    

sel has two columns called prodid and price, and contains the complete result of the SELECT statement. sel is not a temporary table, so it can be seen by every user, not only by the user who created it. If you want sel to be a temporary table, try the following command:

DROP TABLE sel; 
SELECT prodid, price 
    INTO TEMPORARY TABLE sel 
    FROM products;   

The data you want to insert into a table created by SELECT INTO can be the result of a very complex SQL query.

Note

If you want to add data to a table without creating a new table, use INSERT in combination with SELECT statements.
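Assuming the table sel from the previous example still exists, appending additional rows could look like this:

INSERT INTO sel 
    SELECT prodid, price 
        FROM products 
        WHERE price > 40;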

Formatting the Output and Casting

When performing SELECT statements, psql creates a header for the tables. In many cases, this header is not what you want to see. Let's see how you can make the output look right. We start with a simple SELECT statement:

SELECT AVG(price) FROM products; 

The result of the query does not look right:

   avg
---------------
 40.9900000000
(1 row)

We want to change the header to average price, which can be done with the following query:

shop=# SELECT AVG(price) AS "average price" FROM products;
 average price
---------------
 40.9900000000
(1 row)

Now we want to display the price and cast it to integer:

SELECT CAST(AVG(price) AS int4) AS "average price" 
    FROM products; 

The result of the query looks like this:

 average price
---------------
      41
(1 row)

In this example, we perform an explicit cast. First the average price is calculated, and then the cast to integer is done. We can use the CAST() function to perform the operation. Casts can also be done without the help of the CAST() function. Simply add two colons (::) and the datatype you want to receive to the column:

SELECT AVG(price)::int4 AS "average price" 
    FROM products;  

Finally we want the price to be displayed with a $ at the end of the number, so we use the following code:

SELECT '$' || CAST(AVG(price) AS int4) AS "average price"
  FROM products;

The || operator concatenates values in SQL. In our case, we add the $ symbol to the result:

 average price
---------------
 $41
(1 row)

Note

Not all types of casts can be performed by the database. PostgreSQL 7.0.3, for instance, does not support casts from numeric to text; we had to use integer in the example, but to_char() is also possible.
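A sketch using to_char() instead of a cast (the format mask '999.99' is an assumption; adjust it to the precision you need):

SELECT to_char(AVG(price), '999.99') AS "average price" 
    FROM products;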

Now we want to write a query where multiple columns are treated as one:

SELECT name || ' bought ' || amount || ' units' AS "result" 
    FROM sales; 

We select and concatenate multiple columns and assign the name result to the column. The table is now displayed as one column:

    result
-----------------------
 John bought 1 units
 John bought 1 units
 Robert bought 2 units
 Robert bought 1 unit
(4 rows)

In the next example, we add a WHERE clause to the query that selects only those records that contain John:

SELECT name || ' bought ' || amount || ' units' AS result 
    FROM sales 
    WHERE name || ' bought ' || amount || ' units' 
        ~ 'John'; 

The result contains two records:

    result
---------------------
 John bought 1 units
 John bought 1 units
(2 rows)

We can't use the alias of the column in the WHERE clause because this would lead to an error in PostgreSQL 7.0.3. We have to use the full name of the column as it is shown in the previous example:

shop=# SELECT name || ' bought ' || amount || ' units' 
    AS result 
    FROM sales 
    WHERE result ~ 'John';
ERROR: Attribute 'result' not found

Generating HTML

HTML is the most important language when dealing with Web applications. psql offers a flag that tells the database to generate HTML code instead of ordinary ASCII text. Unfortunately, this feature is supported only by psql and not by other interfaces of PostgreSQL. Because most Web interfaces won't use psql in combination with pipes, the flag is not used very often.

Use psql with the -H flag to enable HTML output. From now on, all results will be returned as HTML code. The most common way of performing a query that returns HTML code is to use your favorite UNIX shell.

Here is an example where we select all products that are more expensive than 40 dollars:

echo "SELECT * FROM products WHERE price > 40;" | psql -H shop

The result will be a table including all HTML tags:

<table border=1>
 <tr>
  <th align=center>id</th>
  <th align=center>prodid</th>
  <th align=center>price</th>
  <th align=center>stored</th>
 </tr>
 <tr valign=top>
  <td align=right>1</td>
  <td align=right>385343</td>
  <td align=right>44.99</td>
  <td align=right>20.00</td>
 </tr>
 <tr valign=top>
  <td align=right>2</td>
  <td align=right>385342</td>
  <td align=right>49.99</td>
  <td align=right>64.00</td>
 </tr>
</table>
(2 rows)<br>

Updating and Deleting Data

Data can not only be inserted into and selected from a table; it can also be changed and removed. UPDATE and DELETE are two essential commands for modifying and removing data from a table. The syntax of UPDATE and DELETE is similar to that of SELECT statements. In this section, you will learn to use those two commands efficiently.

Writing UPDATE Queries

SQL supports UPDATE queries. Updating means that the database searches for records that fulfill certain conditions and changes the values of some fields of those records. The syntax of UPDATE queries is simple:

Command:   UPDATE
Description: Replaces values of columns in a table
Syntax:
UPDATE table SET col = expression [, ...]
  [ FROM fromlist ]
  [ WHERE condition ]  

First you define the table. After the SET token, you specify a list of columns that have to be updated. With the help of the FROM clause, a table can be updated by taking data from other tables into consideration. The WHERE clause can be used like the WHERE clause of SELECT statements; the syntax is the same.

Let's look at some examples. We will write a query that updates the sales table by changing the amount of books bought by John from 2 to 3:

UPDATE sales SET amount=3 WHERE id=3;

You see, the query is indeed very simple. Use the following to determine whether the query has been performed correctly:

SELECT id, amount FROM sales WHERE id=3;

If no error has been displayed, the result is

 id | amount
----+--------
 3 |   3
(1 row)

The update has been completed successfully. Now we want to lower the price of all books in the database by $1. This example also shows that an UPDATE statement does not always need a WHERE clause:

UPDATE products SET price=price-1; 

The query assigns the price of the product minus 1 dollar to the products. UPDATE queries can also use subqueries to calculate the result of a value that has to be updated. Here is an example of assigning the price from table sales to the price in table products:

UPDATE products 
    SET price=
        (SELECT MAX(prodprice) 
             FROM sales 
             WHERE prodid='385343') 
    WHERE prodid='385343'; 

The update has been done for exactly one record, which is correct. The next example uses UPDATE queries with multiple columns. We will update the prices in table products with the value from table sales:

UPDATE products 
    SET price=sales.prodprice 
    FROM sales 
    WHERE products.prodid=sales.prodid;

Update queries can also contain self-joins. The next example changes all entries from Linux to UNIX for all books that are assigned to the categories Handbook and Linux:

UPDATE prodcat 
    SET prodcat='UNIX' 
    FROM prodcat AS a 
    WHERE prodcat.prodid=a.prodid 
        AND prodcat.prodcat='Linux' 
        AND a.prodcat='Handbook';

The database performs an update for product number 394568 and now has the following entries:

shop=# SELECT * FROM prodcat WHERE prodid='394568';
 id | prodid | prodcat
----+--------+----------
 6 | 394568 | Handbook
 5 | 394568 | UNIX
(2 rows)  

Using DELETE

The DELETE command is used to remove records that match certain criteria from a table. Here is an overview of the DELETE command's syntax:

Command:   DELETE
Description: Removes rows from a table
Syntax:
DELETE FROM table [ WHERE condition ] 

We create a table called temp and insert some values into it:

SELECT prodid, price, stored INTO temp FROM products; 

The table now contains the following:

 prodid | price | stored
--------+-------+--------
 385342 | 48.99 | 64.00
 765354 | 28.99 | 98.00
 106666 | 39.99 | 120.00
 385343 | 44.99 | 20.00
 394568 | 39.99 | 18.00
(5 rows)

The next query deletes all books that are more expensive than 40 dollars:

DELETE FROM temp WHERE price > 40;

The database deletes two records. If you want to delete all records from the table, you write the following:

DELETE FROM temp;

The table is now empty. The DELETE command is simple. The only mistake people commonly make is to write DELETE * FROM table instead of DELETE FROM table. This leads to a syntax error. Keep in mind that DELETE always removes a record completely; it is not possible to remove half of a record. If you only want to get rid of some values, update the columns in question to NULL instead of deleting the row.
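If, for example, you only want to clear the stored column of a record rather than remove the whole row, an UPDATE to NULL does the job (a sketch against the temp table created earlier):

UPDATE temp SET stored=NULL WHERE prodid='106666';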

Writing Advanced SQL Code

For performing complex operations, PostgreSQL supports a pool of features that can make the life of a database developer much easier. Features such as arrays and datatypes for storing network information are often essential for building sophisticated database applications.

Another important topic is masquerading. In the next section, you take a closer look at the idea behind masquerading, and you will learn in which situations characters have to be masqueraded.

Masquerading

Masquerading characters (more commonly known as escaping) is extremely important and can save you a lot of time.

The purpose of character masquerading can be explained in a few words. Imagine a situation where you have to insert a string into a field that contains special characters. In SQL, characters such as single quotes (') have to be masqueraded because the database must distinguish between characters that will be inserted in a field and characters that are syntax components.

Assume that we want to add a French title to table prodtext. The title of the book in our example is Qu'est-ce que Python, which includes a single quote in the title. We try the following SQL command:

shop=# INSERT INTO prodtext(prodid, lang, proddesc) 
VALUES (385343,'french','Qu'est-ce que Python');
shop'#  

The database did not perform the query and is still prompting for some input. If you look closer at the value in the column called proddesc, you can see that the database has problems distinguishing between the single quotes used for the SQL command and the single quote in the text that has to be added. There are two possibilities to solve the problem: You can use a backslash or use a second single quote to make the INSERT command work properly; here is the code for the two methods:

shop=# INSERT INTO prodtext(prodid, lang, proddesc) VALUES
(385343,'french','Qu''est-ce que Python');
INSERT 51554 1

shop=# INSERT INTO prodtext(prodid, lang, proddesc) VALUES
(385343,'french','Qu\'est-ce que Python');
INSERT 51555 1  

Both methods work the same way. The following shows how the values are stored in the database:

shop=# SELECT * FROM prodtext WHERE prodid='385343' AND lang='french';
 id | prodid | lang |    proddesc
----+--------+--------+----------------------
 11 | 385343 | french | Qu'est-ce que Python
 12 | 385343 | french | Qu'est-ce que Python
(2 rows)

The same result can be reached with the following query. We look for all records that contain a single quote. Because we have inserted only two records, the result will be the same as before:

SELECT * FROM prodtext 
    WHERE lang='french' AND proddesc ~ '\'';

Not only single quotes have to be treated in a special way when dealing with regular expressions. Other characters—such as $ and ^, for instance—have to be masqueraded. This also can be done with the help of a backslash.

Deciding which characters to masquerade is easy: Do it for all characters that have a special meaning in regular expressions and try it with a backslash.

Working with Arrays

Arrays are data structures that enable the user to store multiple values of a certain datatype in one field. Arrays are so-called non-atomic values, which means that they consist of components. The size of an array can be limited or unlimited, depending on the user's needs.

The next example creates a table that contains a one-dimensional array for storing comments. The array is of undefined size, which means unlimited:

CREATE TABLE comments (id serial, prodid int8, comment text[]); 

We will now insert a dataset into the table:

INSERT INTO comments(prodid, comment) 
    VALUES ('385343',
        '{"language","programming","Python"}');

The array is used to store three values in one column. The syntax of the INSERT command is not too difficult. The values are passed to the database in single quotes. The curly braces within the single quotes contain the three values of the array, each in double quotes.

Let's see how the values are returned when selecting the record from the table:

 id | prodid |        comment
----+--------+-------------------------------------
 1 | 385343 | {"language","programming","Python"}
(1 row)

If you want to retrieve values from the array, you use a SELECT statement. The WHERE clause has to be modified slightly by adding the index of the array to the name of the column. You can see in the next example that it is not enough to use the name of the column only:

shop=# SELECT * FROM comments WHERE comment='programming';
ERROR: array_in: Need to specify dimension 

Here is the correct version of the query:

shop=# SELECT * FROM comments WHERE comment[1]='programming';
 id | prodid | comment
----+--------+---------
(0 rows)
 
shop=# SELECT * FROM comments WHERE comment[2]='programming';
 id | prodid |        comment
----+--------+-------------------------------------
 1 | 385343 | {"language","programming","Python"}
(1 row)

The examples show that a record can be retrieved from an array only when the correct index of the array is specified in the query. This can sometimes be very tricky; because every array might have a different size, the user might not know at which index the required value is stored.

Arrays also can be defined with a maximum size, by adding the size between the brackets in the CREATE TABLE statement:

CREATE TABLE comments (id serial, prodid int8, comment text[5]);

The example creates a table with an array with a maximum of 5 entries.

Arrays are not limited to one dimension. In many cases, you can use multidimensional data structures. Here is an example where we create a table with a two-dimensional data structure:

CREATE TABLE comments (id serial, comments text[][]);  

The table is used to store comments. We use the array to store the category and the text of the comment. Now let's insert some data into the table:

INSERT INTO comments(comments) 
    VALUES ('{{"language"}, {"Python"}}');

The table contains one record now. The category is set to language; the text of the category is Python. We can easily retrieve the data from the table:

shop=# SELECT * FROM comments WHERE comments[1][1] ~ '.';
 id |     comments
----+---------------------------
 2 | {{"language"},{"Python"}}
(1 row)
 
shop=# SELECT * FROM comments WHERE comments[1][2] ~ '.';
 id | comments
----+----------
(0 rows)

In this example, the columns of the array have to be indexed correctly to receive a result from the database. A two-dimensional array can be compared with a system of coordinates. Every point in the system can be assigned to a value. Multidimensional arrays can be useful for scientific databases. Assume a query where you want to know all employees in a certain area, their income, their profession, and their gender. The output of that query can be a three-dimensional array. The number of employees is assigned to every combination of income, profession, and gender. The required result (for example, income = 2000$, gender = female, profession = lawyer) can be found by indexing the array containing all results. The algorithm just described is used in a real-world situation because this is (possibly) the only way of keeping the load of the database low even when generating a lot of data.

Arrays and the COPY Command

Inserting data cannot always be done with INSERT commands, because this would be too slow when treating huge amounts of data. The next example shows how simple it is to use the COPY command to insert data into an array. The database uses nearly the same syntax for the array as the INSERT command:

COPY "comments" FROM stdin;
2    {{"language"},{"Python"}}  
\.

Working with BLOBs

Many advanced databases, including PostgreSQL, support binary large objects (BLOBs). Storing a BLOB means storing a file. It does not matter what's in that file; it can be binary data, ASCII, EBCDIC, and so forth. BLOBs have to be treated differently than "ordinary" data, because it would be complicated to use a plain 500MB file within an INSERT command like ordinary text. BLOBs have to be loaded into the database.

Note

BLOBs don't store the position of the file in the file system. BLOBs are stored directly in the database.

The file upload can be done with the lo_import() function. If you want to export the file, the lo_export() function can be used to export the data in the database to a file in the file system.

Let's create a table where we want to store the product id and the image of a product directly within the database:

CREATE TABLE prodimages (id serial, prodid int8, image OID);

Recall that OID is the acronym for object identifier. In Chapter 7, "Backup and Recovery," you will see what that means when you want to back up your complete database.

For now, let's insert a picture in the database:

shop=# INSERT INTO prodimages(prodid, image) VALUES ('385343', lo_import('/tmp/385343.jpg'));
INSERT 51759 1  

The database generates a unique object id for the file and loads it into the database where it is stored. The unique id of our file in the example is 51745:

shop=# SELECT * FROM prodimages;
 id | prodid | image
----+--------+-------
 1 | 385343 | 51745
(1 row)

You can see in the example that a SELECT statement returns only the object id of the file we have just uploaded. Internally, PostgreSQL creates separate files for every file uploaded:

bash-2.04$ ls -l *51745*
-rw-------  1 postgres postgres   8192 Apr 7 15:18 xinv51745
-rw-------  1 postgres postgres  16384 Apr 7 15:18 xinx51745   

Now we want to export the file stored in the record to a file. This can be done with the help of the lo_export() function. The syntax is similar to the syntax of the lo_import() function:

SELECT lo_export(prodimages.image,'/tmp/export.jpg') 
    FROM prodimages 
    WHERE prodid='385343';

The database exports the image to the file /tmp/export.jpg. If the files in the database become very big, this might take some time, so watch out when writing Web applications. If a user wants to upload large files into the database, this can lead to trouble. Make sure that uploads are limited to a sensible maximum size.

Now we want to delete and reinsert the records. You must keep one issue in mind when deleting BLOBs. Never perform DELETE commands on tables containing BLOBs before you have deleted the actual image in the database. If you delete the records, the files in the database containing the image won't be deleted—the database deletes only the object id—but the data will become something like a zombie. So make sure that you use the lo_unlink() function to drop the image first:

SELECT lo_unlink(prodimages.image) FROM prodimages;

The file has now been deleted from the database; let's see whether the files are still there:

bash-2.04$ ls -l *51745* 2> /dev/null | wc -l
   0

The file has been found zero times; everything is okay so far, but what about the record in the table?

shop=# SELECT * FROM prodimages;
 id | prodid | image
----+--------+-------
 1 | 385343 | 51745
(1 row)

You can see that the record is still in the table. You can perform a DELETE operation now to safely eliminate the entry.

If you are working on high-availability systems, make sure that no one can access a record whose file has already been deleted. Implement some kind of locking mechanism, such as a locking flag for the datasets.
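A minimal sketch of such a locking flag (the column name locked is an assumption, not part of the shop schema):

ALTER TABLE prodimages ADD COLUMN locked bool;
-- mark the record before unlinking the BLOB
UPDATE prodimages SET locked='t' WHERE prodid='385343';
-- readers should only touch unlocked records
SELECT * FROM prodimages WHERE locked IS NULL OR locked='f';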

Using Network-Specific Datatypes

PostgreSQL has special datatypes for storing IP and MAC addresses. These datatypes offer input checking and some special functions and operators. IPs and MAC addresses can also be stored in ASCII text fields, but we strongly recommend using PostgreSQL's functions for network datatypes. Table 3.3 lists PostgreSQL network datatypes (for version 7.0.3).

Table 3.3 PostgreSQL Network Datatypes

 Name      Storage    Range
 --------  ---------  --------------------------------------------------------
 cidr      12 bytes   Stores valid IPv4 network addresses
 inet      12 bytes   Stores IPv4 hosts or network addresses
 macaddr   6 bytes    Stores MAC addresses (Ethernet card hardware addresses)


The next example creates a table to store logging information in our shop database and shows you how to treat network addresses in PostgreSQL:

CREATE TABLE shoplog(id serial, fieldone cidr, 
    fieldtwo inet, mac macaddr);

The table contains one field of each datatype. Let's see how we can insert valid data into it:

INSERT INTO shoplog(fieldone, fieldtwo, mac) 
    VALUES ('192.168.1.0/24', '192.168.1.1/25',
        '00:50:FC:1E:7C:26');

The most important difference between cidr and inet is that cidr does not support data with nonzero bits to the right of the netmask. As you can see in the previous example, the netmask has to be written as one number instead of something like 255.255.255.0; otherwise, the system will report a parse error:

shop=# INSERT INTO shoplog(fieldone, fieldtwo, mac) 
     VALUES ('192.168.1.0/255.255.255.0', '192.168.1.1/25', 
         '00:50:FC:1E:7C:26');
ERROR: could not parse "192.168.1.0/255.255.255.0" 

If we now select all data from the table, we get this result:

shop=# SELECT * FROM shoplog;
 id |  fieldone  |  fieldtwo  |    mac
----+--------------+----------------+-------------------
 2 | 192.168.1/24 | 192.168.1.1/25 | 00:50:fc:1e:7c:26
(1 row)

Every column has a valid IP address. In fieldone, the zeros on the right edge of the IP address are omitted because this is redundant information (the database has the netmask of the IP). If we insert an IP address without explicitly adding the netmask, the database takes a default netmask of 32 for the cidr datatype; the inet datatype won't have a netmask:

INSERT INTO shoplog(fieldone, fieldtwo, mac) 
    VALUES ('192.168.1.0', '192.168.1.1/25', '00:50:FC:1E:7C:26');
INSERT INTO shoplog(fieldone, fieldtwo, mac) 
    VALUES ('192.168.1.1', '192.168.1.1/25', '00:50:FC:1E:7C:26');
INSERT INTO shoplog(fieldone, fieldtwo, mac) 
VALUES ('0.0.0.0','0.0.0.0','00:50:FC:1E:7C:26');
shop=# SELECT * FROM shoplog;
 id |  fieldone  |  fieldtwo  |    mac
----+----------------+----------------+-------------------
 2 | 192.168.1/24  | 192.168.1.1/25 | 00:50:fc:1e:7c:26
 3 | 192.168.1.0/32 | 192.168.1.1/25 | 00:50:fc:1e:7c:26
 4 | 192.168.1.1/32 | 192.168.1.1/25 | 00:50:fc:1e:7c:26
 5 | 0.0.0.0/32   | 0.0.0.0    | 00:50:fc:1e:7c:26
(4 rows)

In the last three lines of the result, you see what happens when performing the three INSERT statements.

PostgreSQL supports a number of equivalent input formats for the MAC address (for PostgreSQL 7.0.3).
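For instance, the following notations are all accepted as input for macaddr and represent the same address (formats as documented for PostgreSQL; treat this as a sketch rather than an exhaustive list):

INSERT INTO shoplog(mac) VALUES ('00:50:fc:1e:7c:26');
INSERT INTO shoplog(mac) VALUES ('00-50-fc-1e-7c-26');
INSERT INTO shoplog(mac) VALUES ('0050fc:1e7c26');
INSERT INTO shoplog(mac) VALUES ('0050fc-1e7c26');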

Working with Geo-Data

PostgreSQL offers datatypes for geometric operations. PostgreSQL can therefore be used to build powerful solutions to fit scientific demands that can hardly be done with commercial mainstream databases. This section gives a brief insight into the wonderful world of modeling geographic data with PostgreSQL.

Geometric objects can be stored with the help of the datatypes described in the following sections.

point

A point is, in a way, the fundamental data structure when dealing with geometric objects, and it can be handled easily and efficiently by the user. A point consists of two values: The first value is the x-axis coordinate; the second value is the y-axis coordinate. Both values are internally stored as floating-point numbers.

The syntax of a point is (x, y).

A point needs 16 bytes of storage:

CREATE TABLE temppoint(fieldname point);
INSERT INTO temppoint(fieldname) VALUES ('1,2');
INSERT INTO temppoint(fieldname) VALUES ('(1,3)'); 

We have created a table called temppoint with one column, and we have inserted two records into the table. One very important detail when dealing with a point is that it has to be written within single quotes. Otherwise, an error will be displayed:

shop=# INSERT INTO temppoint(fieldname) VALUES (1,2);
ERROR: Attribute 'fieldname' is of type 'point' but expression is of type 'int4'
    You will need to rewrite or cast the expression

Let's see how the two points are returned by a query:

shop=# SELECT * FROM temppoint;
 fieldname
-----------
 (1,2)
 (1,3)
(2 rows)

line

A line is a pair of points. One point marks the beginning of the line; the other point marks the end. A line can be defined as two points: (x1, y1), (x2, y2) or, to make the whole thing a little clearer, ( (x1, y1), (x2, y2) ).

A line can also be specified by using x1, y1, x2, y2, but we strongly recommend using one of the versions with parentheses to make the code clearer and easier to understand.

A line needs 32 bytes of storage (twice the storage of a point). lseg (line segment) is an alternative datatype to line.

Let's create a table using line:

CREATE TABLE templine(fieldname line);

The table can be created, but PostgreSQL 7.0.3 doesn't support INSERT statements on line columns yet:

shop=# INSERT INTO templine(fieldname) VALUES ('(1,3), (4,12)');
ERROR: line not yet implemented

At the time this book is written, full support of line is not available. You can use path instead of line.

box

A box is used to store a rectangle. A rectangle is defined by two points that are situated opposite each other on the rectangle. The syntax is therefore the same as the syntax of lines. When entering a box into the database, the lower-left corner is supposed to be determined from the input and stored first, so the user does not have to care about the order of the input; however, this reordering is not implemented in PostgreSQL 7.0.3 yet.

A box needs 32 bytes of storage.

Here is some code:

CREATE TABLE tempbox(fieldname box); 
INSERT INTO tempbox(fieldname) VALUES ('(8,9), (1,3)'); 
INSERT INTO tempbox(fieldname) VALUES ('(1,3), (4,12)'); 

shop=# SELECT * FROM tempbox;
 fieldname
--------------
 (8,9),(1,3)
 (4,12),(1,3)
(2 rows)   

path

A path is a sequence of points and can either be open or closed. Closed means that the last point of the path "returns" to the beginning. The length of a path is dynamic. PostgreSQL offers special functions to make sure that a path is either open or closed. popen() and pclose() can be used to force a path to be open or closed. In queries, isopen() and isclosed() can be used to check whether the path is open or closed.

A path needs 4 bytes plus 32 bytes for every node to be stored.

Here are some examples of how you can create a table and insert some values. You can see that a path does not have a fixed number of points.

CREATE TABLE temppath(fieldname path);  
INSERT INTO temppath(fieldname) VALUES ('(1,3), (4,12)'); 
INSERT INTO temppath(fieldname) VALUES ('(3,1), (2,8), (10,4)');

shop=# SELECT fieldname FROM temppath;
   fieldname
----------------------
 ((1,3),(4,12))
 ((3,1),(2,8),(10,4))
(2 rows)
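The isopen(), isclosed(), popen(), and pclose() functions mentioned above can be used directly in queries. A short sketch against the temppath table just created (function names as in PostgreSQL 7.0.3):

SELECT fieldname, isopen(fieldname), isclosed(fieldname) 
    FROM temppath;
SELECT pclose(fieldname) FROM temppath;

The first query reports the state of each path; the second returns each path forced into its closed form.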

polygon

A polygon is actually a closed path, but it is stored differently and therefore has its own datatype.

A polygon needs 4 bytes plus 32 bytes for every node to be stored.

Here is some sample code to create a table and to insert some data:

CREATE TABLE temppoly(fieldname polygon); 
INSERT INTO temppoly(fieldname) VALUES ('(1,3), (4,12), (2,4)');
INSERT INTO temppoly(fieldname) VALUES ('(2,0), (-5,12), (2,0)');
shop=# SELECT * FROM temppoly;
    fieldname
-----------------------
 ((1,3),(4,12),(2,4))
 ((2,0),(-5,12),(2,0))
(2 rows)

circle

A circle consists of a point and the radius, and needs 24 bytes of storage.

The following creates a table and inserts two values:

shop=# CREATE TABLE tempcircle(fieldname circle);
CREATE
shop=# INSERT INTO tempcircle(fieldname) VALUES ('10, 4, 10');
INSERT 51916 1
shop=# INSERT INTO tempcircle(fieldname) VALUES ('10, 4, -10');
ERROR: Bad circle external representation '10, 4, -10'
shop=# INSERT INTO tempcircle(fieldname) VALUES ('10.3 , 2, 4');
INSERT 51917 1 

You can see from the second INSERT statement that the radius of a circle must not be negative; the database makes sure that no invalid data is inserted into the table.

Retrieving Geo-Data and Operators

Retrieving data from fields that contain geographic data is a little more complex than retrieving "ordinary" data.

Remember that the ~= operator has to be used instead of the = operator. In the following example, we select a specific point from a table:

SELECT fieldname FROM temppoint WHERE fieldname ~= '(1,2)';

PostgreSQL's geographic datatypes offer much more than just retrieving. The next piece of code calculates the distance of two points:

SELECT '(1,1)'::point <-> '(2,2)'::point; 

The result is
  ?column?
-----------------
 1.4142135623731
(1 row)

which is exactly the square root of 2 (for further information see "Pythagoras' Theorem" at http://www.geocities.com/researchtriangle/system/8956/problems/pyth.htm or http://www.cut-the-knot.com/pythagoras/index.html). In the previous example, the <-> operator is used to calculate the distance.
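You can verify the result outside the database; here is a quick Python check of the same distance (purely illustrative):

```python
import math

# Distance between (1,1) and (2,2), as computed by the <-> operator
distance = math.hypot(2 - 1, 2 - 1)
print(distance)  # 1.4142135623730951, the square root of 2
```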

You can also add two points by using the + operator, just as in ordinary addition:

shop=# SELECT '(1,1)'::point + '2,2'::point;
 ?column?
----------
 (3,3)
(1 row)

The operators -, *, and / are also supported for performing operations on points:

shop=# SELECT '(1,1)'::point - '2,2'::point;
 ?column?
----------
 (-1,-1)
(1 row)
 
shop=# SELECT '(1,1)'::point * '2,2'::point;
 ?column?
----------
 (0,4)
(1 row)
 
shop=# SELECT '(1,1)'::point / '2,2'::point;
 ?column?
----------
 (0.5,0)
(1 row)

The ?- operator is used to check whether two points are horizontally aligned; the ?-| operator checks whether two objects are perpendicular:

shop=# SELECT '(0,0)'::point ?- '-1,0'::point;
 ?column?
----------
 t
(1 row)

PostgreSQL offers four operators to check the position of a point relative to another point. The << operator checks whether a point is located left of another point, <^ checks whether a point is below another, >> checks whether it is to the right, and >^ checks whether it is above another point. In the next example, you can see how such a check can be performed:

shop=# SELECT '(0,0)'::point << '-1,0'::point;
 ?column?
----------
 f
(1 row)

PostgreSQL is also able to calculate intersections. The following example calculates the intersection of two line segments:

shop=# SELECT '((0,0),(10,0))'::lseg # '((0,0),(0,10))'::lseg;
 ?column?
----------
 (0,0)
(1 row)

Operators do not return only true or false; some return a computed geometric value, as when evaluating the point of closest proximity:

shop=# SELECT '(99,99)'::point ## '((3,2),(12,4))'::lseg;
 ?column?
----------
 (12,4)
(1 row)

In the example, (12, 4) is nearer to (99, 99) than any other point on the line segment.
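The geometry behind this operator can be sketched in Python (a hypothetical re-implementation for illustration only): project the point onto the line through the segment and clamp the projection to the segment's endpoints.

```python
def closest_point_on_segment(p, a, b):
    """Return the point on segment a-b that is closest to point p."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    # Parameter of the orthogonal projection of p onto the line through a-b
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp so the result stays on the segment
    return (ax + t * dx, ay + t * dy)

# (99,99) projects beyond the end of the segment (3,2)-(12,4),
# so the clamped result is the endpoint (12,4):
print(closest_point_on_segment((99, 99), (3, 2), (12, 4)))  # (12.0, 4.0)
```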

You can see in these examples that PostgreSQL offers powerful operators you can use to build complex and sophisticated applications. Many more operators beyond those discussed in this section are available in PostgreSQL.

If you want to get a complete list of all operators in PostgreSQL—or if you simply want to know which datatypes the operator you need is defined for—check out the online help of PostgreSQL by typing \do in your psql-shell.

Combining Queries

One problem encountered when writing SQL statements is that queries might have to be combined because the result of one SQL query might not lead to the final result you want to achieve. Combining queries means that more than one SELECT statement is used in a query. The result of the SELECT statements is put together by using one of the following keywords: UNION, INTERSECT, or EXCEPT.

UNION is used to add the results of SELECT statements, INTERSECT returns only the data the SELECT statements have in common, and EXCEPT returns all records that are not included in the second SELECT statement. For some real-world scenarios, let's go back to the shop database.
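The behavior of the three keywords can also be tried outside PostgreSQL; here is a minimal sketch using SQLite through Python, with two throwaway one-column tables standing in for customer and sales:

```python
import sqlite3

# Throwaway tables standing in for customer and sales (made-up data)
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer(name TEXT);
    CREATE TABLE sales(name TEXT);
    INSERT INTO customer VALUES ('John'), ('Robert'), ('Peter');
    INSERT INTO sales VALUES ('John'), ('Robert');
""")

# EXCEPT: names in customer that do not appear in sales
except_rows = con.execute(
    "SELECT name FROM customer EXCEPT SELECT name FROM sales").fetchall()
print(except_rows)      # [('Peter',)]

# INTERSECT: names present in both tables
intersect_rows = con.execute(
    "SELECT name FROM customer INTERSECT SELECT name FROM sales"
    " ORDER BY name").fetchall()
print(intersect_rows)   # [('John',), ('Robert',)]

# UNION: all distinct names from both tables
union_rows = con.execute(
    "SELECT name FROM customer UNION SELECT name FROM sales"
    " ORDER BY name").fetchall()
print(union_rows)       # [('John',), ('Peter',), ('Robert',)]
```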

Think of a query where we want to retrieve all persons from table customer that are not in table sales. We try the following query:

SELECT * FROM customer EXCEPT SELECT * FROM sales; 

Let's see what the data structure of those two tables looks like (use \d to display the data structures):

shop=# \d customer
               Table "customer"
 Attribute |  Type   |           Modifier
-----------+-------------+---------------------------------------------------
 id    | integer   | not null default nextval('customer_id_seq'::text)
 name   | varchar(50) | not null
 email   | varchar(50) |
 state   | varchar(50) | not null
Indices: customer_email_key,
     customer_id_key,
     customer_pkey
 
shop=# \d sales
                Table "sales"
 Attribute |   Type   |          Modifier
-----------+--------------+------------------------------------------------
 id    | integer   | not null default nextval('sales_id_seq'::text)
 salestime | timestamp  |
 prodid  | bigint    |
 prodprice | numeric(9,2) |
 amount  | integer   |
 name   | varchar(50) |
 email   | varchar(50) |
 state   | varchar(50) |
Indices: sales_email_key,
     sales_pkey   

You can see that those two tables have only a few columns in common. Try to imagine how the database could perform the EXCEPT operation: you run into trouble as soon as you come to a column that is contained in only one of the two tables:

ERROR: Each UNION | EXCEPT | INTERSECT query must 
    have the same number of columns. 

Every SELECT in such an operation must return the same number of columns. If we perform the query like this:

SELECT name 
    FROM customer 
EXCEPT SELECT name 
    FROM sales; 

we receive a correct result:

 name
-------
 Peter
(1 row)

Peter is the only person registered as a customer who has never bought products. We use the following to retrieve all persons that appear in both table sales and table customer:

shop=# SELECT name 
        FROM customer 
    INTERSECT SELECT name 
        FROM sales;
 name
--------
 John
 John
 Robert
 Robert
(4 rows)

Note

This can also be done with a join.

In the result, the database returns four records. If we want to get only one entry for every name, we can use DISTINCT:

SELECT DISTINCT name 
    FROM customer 
INTERSECT SELECT DISTINCT name 
    FROM sales;

In the example, we have to use DISTINCT in both SELECT statements; otherwise, an error is displayed:

shop=# SELECT DISTINCT name 
        FROM customer 
    INTERSECT SELECT name 
        FROM sales;
ERROR: get_sortgroupclause_tle: ORDER/GROUP BY 
    expression not found in targetlist

Now we create a temporary table and insert values into it:

CREATE TEMPORARY TABLE person(name varchar(50), 
    email varchar(50));
INSERT INTO person(name, email) VALUES ('Theodore Ts\'o',
    'no@none.any'); 

If we want to write a query that returns a list containing all names in table sales and table person, we can use UNION:

shop=# SELECT name FROM person UNION SELECT name FROM sales;
   name
---------------
 John
 Robert
 Theodore Ts'o
(3 rows)

You are not limited to combining two SELECT statements; set operations can also be nested. Here is an example where we combine three tables:

shop=# SELECT name 
    FROM person UNION 
        (SELECT name FROM customer 
        EXCEPT SELECT name FROM sales);

The result contains two records:

   name
---------------
 Peter
 Theodore Ts'o
(2 rows)

Building Complex Data Structures and Modeling Techniques

Choosing the right data structure is essential for building sophisticated applications, because the way data is organized in a database is the basis of success for the entire application. In this section, you learn how complex data structures can be built efficiently with PostgreSQL.

Creating and Dropping Views

It has been said that eyes are the window to the soul. This might be true. Definitely true is that views can be used to build more complex applications, and they can help you obtain a broader perspective of your data and the data structure. A view can be seen as a virtual table that is a kind of preprocessor for your data. In reality, a view is the result of a SELECT statement that looks like a table.

Views are also used in the system tables; one example of a view in system tables is the pg_indexes view. The following is the definition of that view:

shop=# \d pg_indexes
   View "pg_indexes"
 Attribute | Type | Modifier
-----------+------+----------
 tablename | name |
 indexname | name |
 indexdef | text |
View definition: SELECT c.relname AS tablename, i.relname AS indexname, pg_get_indexdef
(x.indexrelid) AS indexdef FROM pg_index x, pg_class c, pg_class i 
WHERE ((c.oid = x.indrelid) AND (i.oid = x.indexrelid));   

The definition of the view contains only a SELECT statement.

Recall the data structure of the sales table:

                Table "sales"
 Attribute |   Type   |          Modifier
-----------+--------------+------------------------------------------------
 id    | integer   | not null default nextval('sales_id_seq'::text)
 salestime | timestamp  |
 prodid  | bigint    |
 prodprice | numeric(9,2) |
 amount  | integer   |
 name   | varchar(50) |
 email   | varchar(50) |
 state   | varchar(50) |
Indices: sales_email_key,
     sales_pkey     

Let's create a view that contains only the total price a user has to pay. You calculate this price by multiplying prodprice and amount. Because this operation is a fundamental one, you can use a view to avoid bugs and make your SQL statements shorter and easier to understand:

CREATE VIEW sales_price_view AS 
    SELECT id, salestime, prodid, 
        prodprice*amount AS totalprice,
        name, email, state 
    FROM sales; 

The view is now listed as a relation in the database. It is labeled as a view, but it can be used like any other table:

shop=# \d
      List of relations
    Name    |  Type  | Owner
------------------+----------+----------
 customer     | table  | postgres
 customer_id_seq | sequence | postgres
 prodcat     | table  | postgres
 prodcat_id_seq  | sequence | postgres
 prodtext     | table  | postgres
 prodtext_id_seq | sequence | postgres
 products     | table  | postgres
 products_id_seq | sequence | postgres
 sales      | table  | postgres
 sales_id_seq   | sequence | postgres
 sales_price_view | view   | postgres
(11 rows)         

We will now perform a SELECT statement. The result contains all records we store in table sales. The view calculates the required results for the prices implicitly:

shop=# SELECT id, totalprice FROM sales_price_view;
 id | totalprice
----+------------
 1 |   39.99
 2 |   39.99
 4 |   44.99
 3 |   149.97
(4 rows)
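Views are part of standard SQL, so the same idea can be sketched with SQLite through Python (a stand-in for the PostgreSQL session above; the data is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales(id INTEGER PRIMARY KEY,
                       prodprice NUMERIC, amount INTEGER);
    INSERT INTO sales(prodprice, amount) VALUES (39.99, 1), (50, 3);

    -- the derived column is computed in one place only
    CREATE VIEW sales_price_view AS
        SELECT id, prodprice * amount AS totalprice FROM sales;
""")

rows = con.execute("SELECT id, totalprice FROM sales_price_view").fetchall()
print(rows)  # [(1, 39.99), (2, 150)]
```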

Views can also be created on top of other views. The following example shows the creation of a view consisting of components from table prodtext and view sales_price_view. We will use it to add the title of the book to the result:

CREATE VIEW sales_prodtext AS 
    SELECT prodtext.proddesc, sales_price_view.prodid,
        sales_price_view.totalprice 
    FROM prodtext, sales_price_view 
    WHERE sales_price_view.prodid=prodtext.prodid 
        AND prodtext.lang='english';  

If we perform a full table scan on the view, we receive the following result:

shop=# SELECT * FROM sales_prodtext;
       proddesc       | prodid | totalprice
-----------------------------------+--------+------------
 Python Developer's Handbook    | 385343 |   44.99
 Python Developer's Handbook    | 385343 |   149.97
 Linux Hardware Handbook      | 394568 |   39.99
 Linux: Networking for your Office | 106666 |   39.99
(4 rows)

Like any other table in the system, a view can also be joined easily with other tables. If your hierarchy of views gets complicated, you have to take care of system performance. If a view contains a slow query, all views on top of that view will also become slow. If you are working on a high-availability system with a lot of data and complex queries, you can run into trouble easily. Make sure that you have tested your data structure under real-world load.

Data Integrity with Constraints

Data integrity can be a very tricky thing. But what does integrity mean? Imagine two tables that store information about a product. The first table stores the product id and the price of a product; the second table stores further details about the product and may have multiple entries for one product. Picture a situation where a product id is changed in table one. This gets you into trouble when performing joins, because the product id in table two stays unchanged. Taking care of issues like these in production environments, changing multiple keys in countless tables by hand, is a real problem. In such cases, foreign keys are defined to make sure that data integrity is guaranteed. Here is an example where foreign keys would be useful:

shop=# SELECT DISTINCT name, email from sales WHERE name='Robert';
 name |   email
--------+---------------
 Robert | robert@no.any
(1 row)
shop=# SELECT name, email from customer WHERE name='Robert';
 name |  email
--------+--------------
 Robert | rober@no.any
(1 row)

If you compare the results of the two queries, you might notice that the email address of Robert seems to be wrong in table customer (a t is missing). This is a tricky bug because it isn't likely to be found at first sight. It would be useful to have something that checks the data when the INSERT statement is performed. Foreign keys are usually used to perform this job.

Let's re-create the tables we have used in the sample database (at least a short version of the tables) and add some foreign keys to it (don't forget to drop the old tables before using the CREATE TABLE command):

CREATE TABLE "customer" (
    "id" serial NOT NULL,
    "name" character varying(50) NOT NULL,
    "email" character varying(50),
    "state" character varying(50) NOT NULL,
    PRIMARY KEY ("name", "email")
);
 
CREATE TABLE "sales" (
    "id" serial NOT NULL,
    "prodid" int8,
    "prodprice" numeric(9,2),
    "amount" int4,
    "name" character varying(50),
    "email" character varying(50),
    PRIMARY KEY ("id"),
    FOREIGN KEY (name, email) REFERENCES customer
);
 
COPY "customer" FROM stdin;
1    John  john@no.any   Florida
2    Robert rober@no.any  Georgia
3    Peter  peter@no.any  Alaska
\.             

First we create the customer table. Note that we have defined a multicolumn primary key to ensure that the combination of name and email address is unique. Furthermore, this primary key will be referenced by the sales table. Then we create table sales and add the foreign key. Because we reference a multicolumn primary key, we have to use the table-level FOREIGN KEY constraint.

Let's try to insert some values into the database:

INSERT INTO sales(prodid, prodprice, amount, name, email) 
    VALUES ('385343','20','3','John','john@no.any');

Inserting this record works perfectly well because the name John with the correct email address can be found in the master table. What happens if we perform the operation with a slightly different email address?

shop=# INSERT INTO sales(prodid, prodprice, amount, name, email) 
VALUES ('385343','20','3','John','johnathan@no.any');
ERROR: <unnamed> referential integrity violation - 
     key referenced from sales not found in customer

You can see that the INSERT failed because the required email address is not found in table customer. If we want to update the name of the record we have already inserted into table sales, we also receive an error, because this leads to a violation of the referential integrity:

shop=# UPDATE sales SET name='Alex';
ERROR: <unnamed> referential integrity violation - 
key referenced from sales not found in customer
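The same referential check can be sketched with SQLite through Python (a stand-in for the PostgreSQL tables above; note that SQLite enforces foreign keys only after PRAGMA foreign_keys = ON):

```python
import sqlite3

# Cut-down versions of the customer and sales tables from the example
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
con.executescript("""
    CREATE TABLE customer(name TEXT, email TEXT,
                          PRIMARY KEY (name, email));
    CREATE TABLE sales(id INTEGER PRIMARY KEY, name TEXT, email TEXT,
                       FOREIGN KEY (name, email)
                           REFERENCES customer(name, email));
    INSERT INTO customer VALUES ('John', 'john@no.any');
""")

# A matching record exists in the master table: accepted
con.execute("INSERT INTO sales(name, email) VALUES ('John', 'john@no.any')")

# No matching record in the master table: rejected
try:
    con.execute(
        "INSERT INTO sales(name, email) VALUES ('John', 'johnathan@no.any')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
sales_count = con.execute("SELECT count(*) FROM sales").fetchone()[0]
print(sales_count)  # 1
```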

What if we are planning to change the values of the keys we have in our tables? The next example defines the integrity rules of a column precisely.

CREATE TABLE "customer" (
    "id" serial NOT NULL,
    "name" character varying(50) NOT NULL,
    "email" character varying(50),
    "state" character varying(50) NOT NULL,
    PRIMARY KEY ("name")
);
 
CREATE TABLE "sales" (
    "id" serial NOT NULL,
    "prodid" int8,
    "prodprice" numeric(9,2),
    "amount" int4,
    "name" character varying(50)
        REFERENCES customer
            ON UPDATE CASCADE,
    "email" character varying(50),
    PRIMARY KEY ("id")
);          
COPY "customer" FROM stdin;
1    John  john@no.any   Florida
2    Robert rober@no.any  Georgia
3    Peter  peter@no.any  Alaska
\.

We have modified the data structure slightly. The primary key for table customer has only one column now. In table sales, we have defined a foreign key on column name. This key references the customer table, as in the earlier example. The difference is that we use ON UPDATE, which defines the action to take when the primary key of the master table is updated. Here we combine ON UPDATE with the CASCADE keyword. Using CASCADE means that an update of the primary key in the master table leads to an update of all foreign key columns that refer to it.

PostgreSQL supports further actions as well. We can also define database behavior with the help of ON DELETE, which is triggered whenever an entry in the master table is deleted. The following is a short overview of all actions that can be combined with ON UPDATE and ON DELETE:

 Command     | Action
-------------+-------------------------------------------------------------
 NO ACTION   | This is the default; UPDATEs and DELETEs that would violate
             | referential integrity are rejected.
 CASCADE     | All foreign keys are updated when the primary key changes
             | (ON UPDATE). All foreign key records are deleted when the
             | primary key record is deleted (ON DELETE).
 SET NULL    | The foreign key is set to NULL when the primary key is
             | updated or deleted.
 SET DEFAULT | The foreign key is set to the default value of the column
             | when the primary key is deleted or updated.

Here is an example of how ON UPDATE CASCADE works:

INSERT INTO sales(prodid, prodprice, amount, name, email) 
    VALUES ('385343','20','3','John','john@no.any'); 

The INSERT command works because John is a valid name in table customer. Let's update the name to Paul:

UPDATE customer SET name='Paul' WHERE name='John'; 

The update has been performed without any trouble; the following shows what happened inside the two tables:

shop=# SELECT * FROM sales;
 id | prodid | prodprice | amount | name |  email
----+--------+-----------+--------+------+-------------
 1 | 385343 |   20.00 |   3 | Paul | john@no.any
(1 row)
 
shop=# SELECT * FROM customer WHERE name='Paul';
 id | name |  email  | state
----+------+-------------+---------
 1 | Paul | john@no.any | Florida
(1 row)

You can see that both tables have been updated to Paul as we expected it to be.
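The cascading update can be reproduced outside PostgreSQL as well; here is a minimal sketch with SQLite through Python (cut-down tables, made-up data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # needed for FK enforcement in SQLite
con.executescript("""
    CREATE TABLE customer(name TEXT PRIMARY KEY, state TEXT);
    CREATE TABLE sales(id INTEGER PRIMARY KEY,
                       name TEXT REFERENCES customer ON UPDATE CASCADE);
    INSERT INTO customer VALUES ('John', 'Florida');
    INSERT INTO sales(name) VALUES ('John');
""")

# Renaming the customer cascades into the referencing sales row
con.execute("UPDATE customer SET name = 'Paul' WHERE name = 'John'")
name_in_sales = con.execute("SELECT name FROM sales").fetchone()[0]
print(name_in_sales)  # Paul
```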

Note

ON UPDATE and ON DELETE can both be defined for one column; you need not decide whether ON UPDATE or ON DELETE is more important to you—simply use both for one column.

The CHECK Constraint

Some applications demand some sort of input restriction. This can be done with the CHECK constraint, which checks whether a list of conditions is fulfilled before an INSERT command is processed. In the next example, we want to accept only orders where the total price is higher than 100 dollars; otherwise, the INSERT command should fail:

CREATE TABLE "sales" (
    "id" serial NOT NULL,
    "prodid" int8,
    "prodprice" numeric(9,2),
    "amount" int4,
    "name" character varying(50),
    "email" character varying(50),
    PRIMARY KEY ("id"),
    CHECK (amount*prodprice > 100)
);                    

We try to insert an order:

shop=# INSERT INTO sales(prodid, prodprice, amount, name, email) 
VALUES('385343', '39.99', '1', 'John', 'john@no.any');
ERROR: ExecAppend: rejected due to CHECK constraint $1

This fails because one book costs only 39.99 dollars, which is below the 100-dollar minimum, so the database reports an error. Now we try to order three books at 39.99 each. The total price is higher than 100 dollars, so the INSERT statement is processed successfully:

shop=# INSERT INTO sales(prodid, prodprice, amount, name, email) 
VALUES('385343', '39.99', '3', 'John', 'john@no.any');
INSERT 53120 1 
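The same constraint can be sketched with SQLite through Python (a cut-down table with the same CHECK condition as in the example):

```python
import sqlite3

# A cut-down sales table with the same CHECK condition as in the example
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE sales(prodprice NUMERIC, amount INTEGER,
                       CHECK (amount * prodprice > 100))
""")

# Total price 39.99: below the minimum, rejected by the CHECK constraint
try:
    con.execute("INSERT INTO sales VALUES (39.99, 1)")
    small_order_accepted = True
except sqlite3.IntegrityError:
    small_order_accepted = False
print(small_order_accepted)  # False

# Total price 119.97: above the minimum, accepted
con.execute("INSERT INTO sales VALUES (39.99, 3)")
row_count = con.execute("SELECT count(*) FROM sales").fetchone()[0]
print(row_count)  # 1
```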

In many cases, one condition might not be enough for an application. PostgreSQL's CHECK constraint can be used with multiple conditions, as in the next example. We simply have to connect the conditions with the help of AND or OR tokens:

CREATE TABLE "sales" (
    "id" serial NOT NULL,
    "prodid" int8,
    "prodprice" numeric(9,2),
    "amount" int4,
    "name" character varying(50),
    "email" character varying(50),
    PRIMARY KEY ("id"),
    CHECK (amount*prodprice > 100 
        AND amount*prodprice < 1000)
);

Now we can insert only values that are higher than 100 but lower than 1000; all other attempts to insert data will fail:

shop=# INSERT INTO sales(prodid, prodprice, 
        amount, name, email) 
    VALUES('385343', '39.99', '300', 
        'John', 'john@no.any');
NOTICE: sales_id_seq.nextval: sequence was re-created
ERROR: ExecAppend: rejected due to CHECK constraint $1

Using Inheritance

Inheritance is one of the key features of object-relational databases. Objects have certain properties that can be inherited by other classes; in PostgreSQL, tables can inherit from other tables.

Tables can be in a parent-child relationship, which means that all information a parent has is also available in the child (see Figure 3.4).

Figure 3.4 A simple parent-child example.

Imagine a class person that stores all information persons normally have (height, gender, and so on).

CREATE TABLE persons(id serial, name varchar(50), gender char(1), height int4);
INSERT INTO persons(name, gender, height) VALUES ('George','m','178');

Now we create a class of persons where we have some additional information about this group of persons, such as profession and income.

CREATE TABLE worker(prof varchar(50), income int4) INHERITS (persons);

We have additional information about workers, which means that we also have all information about "ordinary" persons. The following is the data structure of table worker:

persons=# \d worker
                Table "worker"
 Attribute |  Type   |           Modifier
-----------+-------------+--------------------------------------------------
 id    | integer   | not null default nextval('persons_id_seq'::text)
 name   | varchar(50) |
 gender  | char(1)   |
 height  | integer   |
 prof   | varchar(50) |
 income  | integer   | 

Let's add a worker:

INSERT INTO worker (name, gender, height) VALUES ('Carl','m','182'); 

Derived classes—like parent classes—can have children. In the following example, we create a class called cybertec that inherits all information from worker and adds more information, such as photo and email address:

CREATE TABLE cybertec(photo varchar(50), email varchar(50)) INHERITS (worker);

Of course we will add some data:

INSERT INTO cybertec(name, gender, height, prof, income, photo, email) 
    VALUES ('Ewald Geschwinde','m','182', 'technical director', '1',
        'http://www.cybertec.at/epi/ps2.jpg','eg@cybertec.at');
INSERT INTO cybertec(name, gender, height, prof, income, photo, email) 
    VALUES ('Hans-Juergen Schoenig','m','178', 'director of marketing', '1', 
        'http://www.cybertec.at/hans/1.jpg','hs@cybertec.at'); 

If we want to write a query to retrieve all persons from table persons, this can easily be done:

persons=# SELECT * FROM persons;
 id | name | gender | height
----+--------+--------+--------
 1 | George | m   |  178
(1 row)

One record has been returned by the database, but let's modify the SQL command slightly:

persons=# SELECT * FROM persons*;
 id |     name     | gender | height
----+-----------------------+--------+--------
 1 | George        | m   |  178
 2 | Carl         | m   |  182
 3 | Ewald Geschwinde   | m   |  182
 4 | Hans-Juergen Schoenig | m   |  178
(4 rows)  

We have added an asterisk (*) to the name of the table; now the database returns all records from table persons, including those from all tables derived from persons.

Note

Only the columns that are available in persons are displayed.

If we try the same thing with one of the other tables, we will receive more columns.

The next example shows how you can query all records from table cybertec and table worker. First we select all records available and exclude those that are in table persons:

persons=# SELECT name FROM persons* EXCEPT SELECT name FROM persons;
     name
-----------------------
 Carl
 Ewald Geschwinde
 Hans-Juergen Schoenig
(3 rows)

A table can inherit not only from one table; it is also possible to define multiple parents for one table. The next example shows how son inherits from mother and father:

CREATE TABLE father(name text);
CREATE TABLE mother(address text);
CREATE TABLE son(gender char(1)) INHERITS (father, mother);

The table data structure of son now looks like this:

persons=# \d son
     Table "son"
 Attribute | Type  | Modifier
-----------+---------+----------
 name   | text  |
 address  | text  |
 gender  | char(1) |   

After we have successfully created the tables, we try to delete one of them:

persons=# DROP TABLE father;
ERROR: Relation '53451' inherits 'father'

As you might have thought, no table can be dropped that is the parent of another table. If you want to delete tables in a hierarchy, you have to do it "bottom up," which means children first:

persons=# DROP TABLE son;
DROP
persons=# DROP TABLE father, mother;
DROP

The previous examples apply to PostgreSQL releases earlier than 7.1. In 7.1 and later, inherited tables are included automatically, so the asterisk is no longer needed. If you want to exclude inherited tables from a query, you can use the ONLY keyword.

Modeling Techniques

This section is dedicated to those who want a brief insight into modeling techniques. You learn how to build entity relationship models and explore the normalization rules for databases.

The Entity Relationship Model

In almost any application, a database is used to build a model of reality. Our reality consists of objects that have certain properties and certain relations with other objects. We will call these objects entities.

The entity relationship model is designed to represent real-world objects and processes in order to build data structures. For every entity, the relations to other entities are shown. Three types of relations are recognized: one-to-one, one-to-many, and many-to-many relations.

An entity relationship model tries to model these issues in a graphical way. Every entity is represented as a box, and the relation between the boxes is shown as arrows or lines.

Entity relationship models are often used in real-world scenarios, because it is easier to understand a graphical overview of complex data structures than a huge amount of code.

Normalization

When designing a database from scratch, you normally end up with a set of large tables with everything packed into them. This can soon lead to confusing data structures and a lot of redundancy in the tables. Redundancy means that values are stored more often than necessary. To get rid of these problems, normalization can be performed. With normalization, the data model's anomalies, redundancies, and inconsistencies are reduced at every step of the process.

Normalization is a fundamental issue in database theory, and many modeling techniques are based on E.F. Codd's rules for normalization. In this section, you explore normalization through a short and easy-to-understand example.

The following is a table that has not been normalized:

 Name      | Address                     | Training Course
-----------+-----------------------------+----------------------------------
 John Gore | Sesamestreet 1; 1010 Vienna | German, Chemistry, Sports
 Lisa Bush | John Street 33; 1150 Vienna | Maths, Computer Science, English


In this example, if a teacher had more than three courses, we would soon run into trouble because there is no space left in the field (let's assume that it has a fixed length).

According to Codd's theories, a table is in first normal form when every value is stored separately and no value consists of further values. Here is the same table, but now in first normal form (1NF):

 pid | first name | surname | street          | town        | cid | training course
-----+------------+---------+-----------------+-------------+-----+------------------
  15 | John       | Gore    | Sesame Street 1 | 1010 Vienna |  44 | German
  15 | John       | Gore    | Sesame Street 1 | 1010 Vienna |  45 | Chemistry
  15 | John       | Gore    | Sesame Street 1 | 1010 Vienna |  46 | Sports
   6 | Lisa       | Bush    | John Street 33  | 1150 Vienna |  47 | Maths
   6 | Lisa       | Bush    | John Street 33  | 1150 Vienna |  48 | Computer Science
   6 | Lisa       | Bush    | John Street 33  | 1150 Vienna |  49 | English


We have added ids in order to distinguish the records (pid for persons and cid for the training courses). One problem with the first normal form is that if a teacher changes address, we have to change it in three records. Another problem is that a new teacher can be added only when assigned to a training course. To put it another way, we would have to remove teachers when they have no more training courses. This might not fit reality; for that reason, we should transform our first normal form into the second normal form.

All components that have a separate key are now treated as separate tables. In our case, we create two tables—one for the teachers and one for the training courses.

Here is the table for the teachers:

 pid | firstname | surname | street          | town
-----+-----------+---------+-----------------+-------------
  15 | John      | Gore    | Sesame Street 1 | 1010 Vienna
   6 | Lisa      | Bush    | John Street 33  | 1150 Vienna


Here is the table for the courses:

 pid | cid | training course
-----+-----+------------------
  15 |  44 | German
  15 |  45 | Chemistry
  15 |  46 | Sports
   6 |  47 | Maths
   6 |  48 | Computer Science
   6 |  49 | English


We can now connect those two tables without storing the name of the teacher redundantly. According to Codd's theories, such redundancy must be avoided because it can lead to the anomalies we saw with the first normal form.
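The benefit of the split can be demonstrated by joining the two tables back together; here is a minimal sketch with SQLite through Python using the data from the tables above:

```python
import sqlite3

# The teachers and courses tables from the second normal form above
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE teachers(pid INTEGER PRIMARY KEY, surname TEXT);
    CREATE TABLE courses(cid INTEGER PRIMARY KEY, pid INTEGER, course TEXT);
    INSERT INTO teachers VALUES (15, 'Gore'), (6, 'Bush');
    INSERT INTO courses VALUES (44, 15, 'German'), (45, 15, 'Chemistry'),
                               (46, 15, 'Sports'), (47, 6, 'Maths'),
                               (48, 6, 'Computer Science'), (49, 6, 'English');
""")

# Each surname is stored exactly once; a join restores the original rows
rows = con.execute("""
    SELECT t.surname, c.course
    FROM teachers t JOIN courses c ON c.pid = t.pid
    WHERE t.surname = 'Gore'
    ORDER BY c.cid
""").fetchall()
print(rows)  # [('Gore', 'German'), ('Gore', 'Chemistry'), ('Gore', 'Sports')]
```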

In the mid-seventies, Codd discovered that the second normal form can also lead to anomalies. Assume that we add a column category to the table where we store the courses. Because many courses might belong to the same category, we would still have redundancies in the table and would have to update many rows when one value changes. In this case, we would have to introduce a third table to store the categories. This would be a data model using the third normal form.

In most cases, a data model in third normal form is also in Boyce-Codd normal form.

There are, however, cases in which a table in third normal form is not in Boyce-Codd normal form.

When designing data models, keep Codd's theories about normalization in mind, because you will build more flexible data structures when taking the most important points of the theory into consideration.

It is a good idea to create a highly flexible data model. If your data structure is not flexible enough, you will soon run into trouble and extending your application can become truly painful.

The Waterfall Model

A guide to database design says: "Using waterfall model, maybe is for cowards. Using no model is for kamikazes only!" (quoted from http://www.cdt.luth.se/~jmb/presents/19990903/index.htm). Many situations have proven that the author of this sentence knows what he is talking about.

The waterfall model is one of the most widespread models available. Academics seem to be especially enthusiastic about it; every student who has to deal with databases and modeling techniques will face the waterfall model at least once.

The waterfall model is a systematic, step-by-step approach to software development, in which the next phase of the project is entered as soon as one phase is completed. According to the basic waterfall model, the process moves in one direction only: once a phase is finished, you never return to it.

Nowadays, many slightly different and more sophisticated versions of the waterfall model exist, but the most widespread version still seems to be the original model.

A process described by the waterfall model consists of seven steps (see Figure 3.5):

  1. Feasibility study. Determine whether a project can be done.

  2. Requirement analysis and specification. If the project can be done, find out what the basic requirements are. Requirements are not only hardware; human resources are, in most cases, more important than hardware. Define the key features of the product.

  3. Design and specification. Define every detail of the final product precisely. The waterfall model is like a one-way street—you can't revise the specification for the rest of the project. Everything has to be clear; the better the specification, the better the final product. In software design, four parts have to be defined: data structure, software architecture, procedural detail, and the user interface, or human computer interface (HCI).

  4. Coding and module testing. While programming the code, extensively test the modules to ensure high quality of the final product. If the specification is good, coding is a small part of the process.

  5. Integration and system testing. Integrate and connect code with other components. Test the system under real-world circumstances (using people who have not participated in the project to this point) to find all bugs.

  6. Delivery. Deliver the product to the customer(s) and implement it on site.

  7. Maintenance. Make ongoing changes to correct errors and meet new demands. Reapply every step during changes.

Remember, in the original waterfall model, when you finish one step of the process, there is no way back. If you keep this in mind when designing and implementing an IT system, you will ensure a product of good quality.

Figure 3.5 An original waterfall model.

As mentioned before, many versions of waterfall models exist. In some versions, you can go one step back in the process if something turns out to be wrong (see Figure 3.6).

Figure 3.6 A waterfall model in which you can go back to a previous state in the process.

Many programmers and project leaders use a "let's see later" strategy. This can lead to real disaster. If you start coding immediately, you will most likely discover that you have forgotten some crucial points and features. You then have to rewrite a lot of code, which leads to bugs and hard-to-understand software. You have to know exactly what you are doing before you write the first line of code.
