Home > Articles > Data > SQL

  • Print
  • + Share This
This chapter is from the book

This chapter is from the book

Retrieving Data

Let's look at a sample database called shop. The database is a prototype of a bookshop, and we will use this database in most parts of this chapter. It might look a bit confusing at first, but we'll work though every detail of the SQL code and the database structure. The following is the output of pg_dump, an interface for dumping databases into ASCII files:

CREATE SEQUENCE "products_id_seq" start 5 increment 
    1 maxvalue 2147483647 minvalue 1 cache 1 ;
SELECT nextval ('"products_id_seq"');
CREATE SEQUENCE "prodtext_id_seq" start 8 increment 
    1 maxvalue 2147483647 minvalue 1 cache 1 ;
SELECT nextval ('"prodtext_id_seq"');
CREATE SEQUENCE "prodcat_id_seq" start 9 increment 
    1 maxvalue 2147483647 minvalue 1 cache 1 ;
SELECT nextval ('"prodcat_id_seq"');
CREATE SEQUENCE "customer_id_seq" start 3 increment 
    1 maxvalue 2147483647 minvalue 1 cache 1 ;
SELECT nextval ('"customer_id_seq"');
CREATE SEQUENCE "sales_id_seq" start 4 increment 
    1 maxvalue 2147483647 minvalue 1 cache 1 ;
SELECT nextval ('"sales_id_seq"');

CREATE TABLE "products" (
    "id" int4 DEFAULT 
        nextval('products_id_seq'::text) NOT NULL,
    "prodid" int8 NOT NULL,
    "price" numeric(9,2) NOT NULL,
    "stored" numeric(9,2) DEFAULT 0,
    PRIMARY KEY ("prodid")
);

CREATE TABLE "prodtext" (
    "id" int4 DEFAULT 
        nextval('prodtext_id_seq'::text) NOT NULL,
    "prodid" int8,
    "lang" text,
    "proddesc" text
);

CREATE TABLE "prodcat" (
    "id" int4 DEFAULT 
        nextval('prodcat_id_seq'::text) NOT NULL,
    "prodid" int8 NOT NULL,
    "prodcat" character varying(50) NOT NULL,
    PRIMARY KEY ("id")
); 

CREATE TABLE "customer" (
    "id" int4 DEFAULT 
        nextval('customer_id_seq'::text) NOT NULL,
    "name" character varying(50) NOT NULL,
    "email" character varying(50),
    "state" character varying(50) NOT NULL,
    PRIMARY KEY ("name")
);

CREATE TABLE "sales" (
    "id" int4 DEFAULT 
        nextval('sales_id_seq'::text) NOT NULL,
    "salestime" timestamp,
    "prodid" int8,
    "prodprice" numeric(9,2),
    "amount" int4,
    "name" character varying(50),
    "email" character varying(50),
    "state" character varying(50),
    PRIMARY KEY ("id")
);

COPY "products" FROM stdin;
1    385343 44.99 20.00
2    385342 49.99 64.00
3    394568 39.99 18.00
4    106666 39.99 120.00
5    765354 29.99 98.00
\.

COPY "prodtext" FROM stdin;
1    385343 english Python Developer's Handbook
2    385343 german Python Entwickler Handbuch
3    385342 english Linux Socket Programming
4    394568 english Linux Hardware Handbook
5    394568 german Das Linux Hardware Buch
6    106666 english Linux: Networking for your Office
7    106666 german Linuxnetzwerke fürs Büro 
8    765354 english Palm Programming
\.

COPY "prodcat" FROM stdin;
1    385343 Python
2    385343 Developer
3    385343 Handbook
4    385342 Linux
5    394568 Linux
6    394568 Handbook
7    106666 Networking
8    106666 Linux
9    765354 Palm
\.

COPY "customer" FROM stdin;
1    John  john@no.any  Florida
2    Robert rober@no.any  Georgia
3    Peter peter@no.any  Alaska
\.

COPY "sales" FROM stdin;
1    2001-04-03 15:01:27+02 106666 39.99  1   John  john@no.any  Florida
2    2001-05-12 16:22:09+02 394568 39.99  1   John  john@no.any  Florida
3    2001-05-12 17:19:04+02 385343 49.99  2   Robert robert@no.any Florida
4    2001-06-12 17:19:04+02 385343 44.99  1   Robert robert@no.any Florida
\.

CREATE INDEX "products_id_key" on "products" 
    using btree ( "id" "int4_ops" );

CREATE INDEX "prodtext_id_key" on "prodtext" 
    using btree ( "id" "int4_ops" );

CREATE INDEX "prodcat_id_key" on "prodcat" 
    using btree ( "id" "int4_ops" );

CREATE INDEX "customer_id_key" on "customer" 
    using btree("id" "int4_ops" );

CREATE INDEX "customer_email_key" on "customer" 
    using btree ( "email" "varchar_ops" );

CREATE INDEX "sales_email_key" on "sales" 
    using btree ( "email" "varchar_ops" );

CREATE CONSTRAINT TRIGGER "<unnamed>" AFTER DELETE 
    ON "products" NOT DEFERRABLE INITIALLY IMMEDIATE 
    FOR EACH ROW EXECUTE PROCEDURE 
    "RI_FKey_noaction_del" ('<unnamed>', 'prodtext',
    'products', 'UNSPECIFIED', 'prodid', 'prodid');

CREATE CONSTRAINT TRIGGER "<unnamed>" AFTER UPDATE 
    ON "products" NOT DEFERRABLE INITIALLY 
    IMMEDIATE FOR EACH ROW EXECUTE PROCEDURE
    "RI_FKey_noaction_upd" ('<unnamed>', 'prodtext', 
    'products', 'UNSPECIFIED', 'prodid', 'prodid');

CREATE CONSTRAINT TRIGGER "<unnamed>" AFTER 
    INSERT OR UPDATE ON "prodtext" 
    NOT DEFERRABLE INITIALLY IMMEDIATE FOR EACH ROW 
    EXECUTE PROCEDURE "RI_FKey_check_ins" ('<unnamed>',
    'prodtext', 'products', 'UNSPECIFIED', 
    'prodid', 'prodid');

Let's take a closer look at this code. First, sequences are created. Every table has a column called id, and the sequences assign a unique id to every row in the database. Then we create a table called products to store all information about the products in the shop. The column called prodid will be a unique product id. prodid is also the primary key of the table. price will contain the current price of the product, and stored will contain the number of available units of the product.

prodcat contains the product categories. One book might have multiple entries. The table is referred to by the product id. prodtext stores the title of a book in multiple languages. The customer table stores information about all registered users, and the sales table stores all information about sales, such as the number of books a user purchased or the price of a product.

The COPY commands are used to read data directly from the input file. Then, indices are created on the id column of all tables. At this point, it is useful to add additional indices for certain columns to speed up queries. It depends on the type of queries the user wants to perform on which columns to define indices.

Note

Creating indices is not always good, because sometimes disk space is wasted. Reduced storage is not the only drawback. Sometimes indices can slow the database down, especially when performing INSERT statements. This happens because the database must update the index and the data, which is actually slower than just doing the updates for the data.

Finally, triggers are added to guarantee the integrity of the database. It is not useful to have a product in table prodtext that doesn't exist in table products anymore—the triggers help the user to do some data cleaning.

If you want to create a database with the data shown previously, create an empty database (use createdb shop) and write a simple text file containing the SQL code. Then type

psql shop < yourtextfile.sql

As soon as the import is ready, you can connect to your database with psql.

Simple Queries

Retrieving data is probably the most important activity when dealing with databases. Storing data is useful only if you can find the right piece of information any time it is needed. In case of relational databases, you use the SELECT command to retrieve data from a database. SELECT statements can be very simple but also very tricky. This section should give you a short overview of how to use SELECT for simple queries. Let's start with an example. We want to retrieve where the product id is 385343:

SELECT * FROM products WHERE prodid='385343';

The database executes the query and if everything is configured correctly, displays this result:

 id | prodid | price | stored
----+--------+-------+--------
 1 | 385343 | 44.99 | 20.00
(1 row)

The database has returned all requested records and all columns of the table. If you want to have only the last three columns of the table and only products with a prodid that is higher than 385343, you use the following query:

SELECT prodid,price,stored FROM products WHERE prodid > '385343';

The result is

 prodid | price | stored
--------+-------+--------
 394568 | 39.99 | 18.00
 765354 | 29.99 | 98.00
(2 rows)

Sometimes it is necessary to build more complex WHERE clauses. Imagine a query where we want to retrieve all records where the product id is higher than 385343 and the price is lower than 35$. The query looks like this:

SELECT * FROM products WHERE prodid > '385343' AND price < 35;

The result contains only one value, because the price of product number 394568 is higher than 35$. The AND token tells the database to return only values where both conditions are fulfilled.

Very often AND and OR tokens have to be combined to find the right result. The following query retrieves all records in table products where the product id is 385343 if the price is lower than 35%. Additionally, all records are displayed where the price is exactly 44.99$. With the help of the OR token, the database returns all records where the left expression (prodid > '385343' AND price < 35) or the right expression (price='44.99') returns true:

SELECT * FROM products WHERE (prodid > '385343' 
    AND price < 35) OR price='44.99';

Here is the result for this query:

 id | prodid | price | stored
----+--------+-------+--------
 1 | 385343 | 44.99 | 20.00
 5 | 765354 | 29.99 | 98.00  

The product with id of 385343 is returned because the price matches the right expression exactly. The product with number 765354 is returned because the product is higher than 385343 and the price is lower than 29.99$.

You can see in these first few examples that SQL offers an easy syntax to perform simple queries quickly and without writing huge amounts of code.

Let's look at another example, finding all categories in table prodcat. Because some categories might be listed twice, you can use the DISTINCT operator to make sure that each word appears only once in the result:

SELECT DISTINCT prodcat FROM prodcat;

The result is

 prodcat
------------
 Developer
 Handbook
 Linux
 Networking
 Palm
 Python
(6 rows) 

Only 6 rows are returned, although 9 rows are in the database, because the words Handbook and Linux appear more than once in the table. Use the following if you want the database to return only the first 3 rows in alphabetical order:

SELECT DISTINCT prodcat FROM prodcat 
    ORDER BY prodcat ASC LIMIT 3;

The result is

 prodcat
-----------
 Developer
 Handbook
 Linux
(3 rows) 

You use the ORDER BY clause to sort the result. ASC means that the result has to be sorted in ascending order. If you want the database to return the result in descending order, you use the DESC token. The LIMIT command saves time when performing SELECT * operations on extremely huge tables. You also can display all but the first three records. With the help of OFFSET, you can skip the first n records returned.

You can use OFFSET and LIMIT together. If you want to get only the fourth record returned by the query, you use the following:

SELECT DISTINCT prodcat FROM prodcat 
    ORDER BY prodcat ASC LIMIT 1 OFFSET 3;

This query returns Networking. The explanation for the result is simple: OFFSET 3 makes the database skip the first 3 rows. LIMIT 1 takes the first line and omits the rest.

Performing Simple Calculations

In many cases, it is useful to perform simple calculations directly within an SQL statement. Assume a query where you want to calculate the total amount of money a user has to pay:

SELECT prodid, prodprice, amount, prodprice*amount, name 
    FROM sales; 

You take the price of one unit of the product and multiply it with the amount of units the user bought. The result is

 prodid | prodprice | amount | ?column? | name
--------+------------+-------+----------+--------
 106666 |   39.99 |   1 |  39.99 | John
 394568 |   39.99 |   1 |  39.99 | John
 385343 |   49.99 |   2 |  99.98 | Robert
 385343 |   44.99 |   1 |  44.99 | Robert

You also can write more complex queries. In all cases, SQL uses the "infix" syntax.

Joining Tables

Joining tables is an extremely important issue when talking about relational and object relational databases. Joins are used to select and combine information from multiple columns. The tables you want to join must have at least one column in common.

Suppose you want to create a list with all product descriptions, their price, and the amount of books stored. The amount of books stored and the price can be found in the products table. The description of the products can be found in the prodtext table. Both tables have a column for the product id, so you can join the tables with the help of the product id:

SELECT prodtext.lang, prodtext.proddesc, 
    products.price, products.stored 
    FROM products, prodtext 
    WHERE products.prodid=prodtext.prodid;

The result is the following list of eight records containing the German and English titles of the books:

 lang  |      proddesc        | price | stored

---------+-----------------------------------+-------+--------
 english | Linux: Networking for your Office | 39.99 | 120.00
 german | Linuxnetzwerke fürs Büro     | 39.99 | 120.00
 english | Linux Socket Programming     | 49.99 | 64.00
 english | Python Developer's Handbook    | 44.99 | 20.00
 german | Python Entwickler Handbuch    | 44.99 | 20.00
 english | Linux Hardware Handbook      | 39.99 | 18.00
 german | Das Linux Hardware Buch      | 39.99 | 18.00
 english | Palm Programming         | 29.99 | 98.00  

If you now want to retrieve all books (English title only) where more than 80 books are stored, you use the following query:

SELECT prodtext.lang, prodtext.proddesc, 
    products.price, products.stored 
    FROM products, prodtext 
    WHERE products.prodid=prodtext.prodid
        AND prodtext.lang='english' 
        AND products.stored>80;

The result is

 lang  |      proddesc        | price | stored

---------+-----------------------------------+-------+--------
 english | Linux: Networking for your Office | 39.99 | 120.00
 english | Palm Programming         | 29.99 | 98.00

Only two books are left. In the example, you can see that combining data using AND tokens is extremely easy. Remember the most important point when joining tables: The tables must have at least one column in common. This example contains the column called prodid. Don't forget to perform the actual join (products.prodid=prodtext.prodid) in your query. If you do forget, you will get the Cartesian product of the two tables as the result. The Cartesian product is a join without a WHERE clause. Try the following:

SELECT * FROM products, prodtext;

to find out about the outcome. The result will be 40 lines long, and most of the lines won't have anything in common.

In the next example, we want to generate a complete list of all sales including the product description:

SELECT sales.salestime, prodtext.proddesc, sales.amount 
    FROM sales, prodtext 
    WHERE prodtext.prodid=sales.prodid 
        AND prodtext.lang='english'; 

This joins the table prodtext and the table sales via the product id:

    salestime    |      proddesc       | amount

------------------------+----------------------------------+--------
 2001-05-12 17:19:04+02 | Python Developer's Handbook   |   2
 2001-06-12 17:19:04+02 | Python Developer's Handbook   |   1
 2001-05-12 16:22:09+02 | Linux Hardware Handbook     |   1
 2001-04-03 15:01:27+02 | Linux: Networking for your Office|   1   

Self-Joins and Aliases

This section gives you insight into aliasing and self-joins. Aliases can be used to assign an additional name to a column. On the one hand, it is possible to write shorter and easier-to-read queries (aliasing can speed up writing a query significantly, especially when the query string is more complex). On the other hand, aliases can be used to perform self-joins. With a self-join, a table performs a join with itself. You will see an example of this later. First let's look at a simple example using aliases:

SELECT * FROM products AS a where a.stored=20;

This query selects all records from table products where the amount of books stored is exactly 20. In this case, a is an alias for products. The query can also be written the following way:

SELECT * FROM products a where a.stored=20;

It is not necessary to use the AS token, but AS can sometimes make the query easier to read.

Let's look at a shop where each product can be assigned to multiple product categories. Because the number of categories of a product should not be limited, we won't define a separate column for every category. It would be nice to have a table with just one table for the product category. The following table is called prodcat in the sample database (use "SELECT

* FROM prodcat;"):

 id | prodid | prodcat
----+--------+------------
 1 | 385343 | Python
 2 | 385343 | Developer
 3 | 385343 | Handbook
 4 | 385342 | Linux
 5 | 394568 | Linux
 6 | 394568 | Handbook
 7 | 106666 | Networking
 8 | 106666 | Linux
 9 | 765354 | Palm    

First you write a query to find all products that are assigned to the category Handbook:

SELECT * FROM prodcat WHERE prodcat='Handbook'; 

Two records are returned:

 id | prodid | prodcat
----+--------+----------
 3 | 385343 | Handbook
 6 | 394568 | Handbook

Then we want to find all products that are listed in categories Handbook and Developer. We use a self-join:

SELECT * 
    FROM prodcat AS a, prodcat AS b 
    WHERE a.prodcat='Handbook'
        AND b.prodcat='Developer' 
        AND a.prodid=b.prodid;

The result is

 id | prodid | prodcat | id | prodid | prodcat
----+--------+----------+----+--------+-----------
 3 | 385343 | Handbook | 2 | 385343 | Developer

The database performs the join as if we had two identical tables. The result is only one record because only this record matches both categories. It is also possible to look for all records that match three categories.

Note

When working with multiple tables, you should use the * operator with care, because * displays all columns from all tables listed in the FROM clause.

In the next example, we look for all records that are assigned to Handbook, Developer, and Python:

SELECT * 
    FROM prodcat AS a, prodcat AS b, prodcat AS c 
    WHERE a.prodcat='Handbook' 
        AND b.prodcat='Developer' 
        AND c.prodcat='Python'
        AND a.prodid=b.prodid 
        AND a.prodid=c.prodid 
        AND b.prodid=c.prodid;

The result is again one record:

id | prodid | prodcat | id | prodid | prodcat | id | prodid | prodcat 
----+--------+----------+----+--------+-----------+----+--------+---------
 3 | 385343 | Handbook | 2 | 385343 | Developer | 1 | 385343 | Python

You can see in these examples that self-joins can be done rather easily. The only crucial point is that you must not forget the actual joins to get the right result. If you try your queries on small testing tables, make sure that you have every possible difficulty in them. If the result of a query seems to be right, try to go through the code again and check whether you have all necessary join operations in it. If you don't test thoroughly, you might have bad luck when performing the query on large tables.

The output of the previous query is not a beautiful one. Try the following query to make the result easier to understand:

SELECT a.id AS rowid, a.prodid AS "product id" 
    FROM prodcat AS a, prodcat AS b, prodcat AS c 
    WHERE a.prodcat='Handbook' 
        AND b.prodcat='Developer' 
        AND c.prodcat='Python' 
        AND a.prodid=b.prodid 
        AND a.prodid=c.prodid 
        AND b.prodid=c.prodid; 

The result is much nicer than the previous one:

 rowid | product id
-------+------------
   3 |  385343  

In this example, you see how you can change the description of a table with the help of aliases. If the new heading contains whitespace characters, such as blanks, the new heading has to be quoted in the SQL query—if not, it can be written without quotes. The table can now be read easier.

Aggregation Functions

Aggregates are a powerful feature of relational databases. They can be used to perform operations with multiple lines involved. Let's look at an example where we want to count the lines in a table with the help of the count() function:

SELECT COUNT(*) FROM products;

The result is

 count
-------
   5
(1 row)

The count() function can also be combined with a WHERE clause. The following example shows you how to count all records where the amount of stored books is equal to or higher than 64:

SELECT COUNT(*) FROM products WHERE stored >= 64;

The result of the query is 3. As you might have noticed in the previous example, we queried and displayed only one column; now we also want the product id to be displayed:

SELECT prodid, COUNT(*) FROM products; 

This query leads to an error:

ERROR: Attribute products.prodid must be GROUPed or used in an aggregate function     

The count() function doesn't know what to count any more and the database won't perform the query. The problem can be solved by adding an aggregate function to the query:

SELECT prodid, COUNT(*) FROM products GROUP BY prodid;

The result is a list with all product ids and the frequency they appear in the table:

 prodid | count
--------+-------
 106666 |   1
 385342 |   1
 385343 |   1
 394568 |   1
 765354 |   1
(5 rows)    

In the example, you can see that every product id appears only once in the table.

Now we want to write a more complex query. We want to get a list with the sum of all sales of a certain book, including the product description. We have to perform a join of table sales and table prodtext. We want only the English titles to be displayed. Here is the query:

SELECT a.proddesc, SUM(b.amount) 
    FROM prodtext AS a, sales AS b 
    WHERE a.prodid=b.prodid 
        AND a.lang='english' 
    GROUP BY b.prodid, a.proddesc; 

The result is

       proddesc       | sum
-----------------------------------+-----
 Linux: Networking for your Office |  1
 Python Developer's Handbook    |  3
 Linux Hardware Handbook      |  1
(3 rows)    

We use the sum() function to get the sum of all sales. In the example, we use the GROUP BY token two times. This is necessary because one product id can have many titles. We have also used aliasing to make the actual code of the query shorter. Don't forget the actual join operation in the query; in this example, the join is performed by the a.prodid=b.prodid expression.

In some cases, it can be useful to get only a small piece of the result of a query using aggregate functions. Imagine an example where we want to get all products that have been sold more than once:

SELECT prodid, SUM(amount) AS "total sum" 
    FROM sales 
    GROUP BY prodid 
    HAVING SUM(amount) > 1;

The result is very short:

 prodid | total sum
--------+-----------
 385343 |     3
(1 row)  

Only product number 385343 is available in the list, where more than one unit has been sold. The query contains a HAVING clause. HAVING is used to perform condition tests on aggregate functions. In our case, we check whether the result of the GROUP BY is higher than 1. A query can also have AND or OR tokens in the HAVING clause. The following example adds an AND token to the query we used earlier:

SELECT prodid, SUM(amount) AS "total sum" 
    FROM sales 
    GROUP BY prodid 
    HAVING SUM(amount) > 1
        AND SUM(amount) < 10;  

The result is the same as the result of the query used before, because no product was sold more often than nine times.

Now we want to calculate the average price of the products in table products:

SELECT AVG(price) FROM products; 

The database performs the calculation and returns the following:

   avg
---------------
 40.9900000000
(1 row)  

The value returned is precise, but in many cases not suitable. We want the result to be displayed like a currency, which means only two numbers after the comma. For that purpose, we use the round() function, which needs two arguments. The first argument is the actual value that has to be processed, and the second argument is used to define the precision. In our case, the query looks like this:

SELECT ROUND(AVG(price),2) FROM products; 

The result is

 round
-------
 40.99
(1 row) 

This result is what we expected. Now we want the result to be displayed in full dollars without a comma:

SELECT ROUND(AVG(price),0) FROM products; 

The result of the query is now 41, because 41 is the closest integer value. The second parameter of the function can also be negative:

SELECT ROUND(AVG(price),-2) FROM products; 

As you might have thought, the result is 0. You can see in these examples that PostgreSQL supports nested functions.

The next example shows how to calculate the minimum or the maximum of a column:

SELECT MIN(price) FROM products;

The result is

 min
-------
 29.99
(1 row)   

Note

Check out the docs of your PostgreSQL database to determine whether your version can also perform MAX() and MIN() with the help of indices. Otherwise, your query can be very slow on large tables.

Subqueries

Subqueries are used to calculate values that are needed within the query that is executed. To make it clearer, here is an example: Imagine you want to find all persons in a table whose income is higher than the average income of the persons stored in the table. You need the average income to find the persons with a higher income. One possibility to solve the problem is to perform two separate statements, but what do you do when half of the persons stored in the tables are removed by another user while processing the two SELECT statements? The only way to solve the problem properly is to use subqueries. A subquery is actually a query within a query.

Here is an example where we select all products with a higher price than the average price:

SELECT * 
    FROM products 
    WHERE price > 
        (SELECT AVG(price) FROM products); 

First, the result of the subquery is calculated. After that, the result is used to perform the main query. Let's try it:

SELECT * 
    FROM products 
    WHERE price > 
        (SELECT * FROM products);

The database displays the following error:

ERROR: Subselect must have only one field

It is obvious that the database can't perform the query because a price is only one value. When performing subqueries, you have to make sure that the subquery returns exactly one field.

One query can contain many subselects. The following is a query with two subselects for selecting all products that are more expensive than the average price, but less expensive than the average price plus 5$:

SELECT *
    FROM products 
    WHERE price > 
        (SELECT AVG(price) FROM products) 
    AND price < 
        (SELECT AVG(price)+5 FROM products);  

There will be only one record left in the result:

 id | prodid | price | stored
----+--------+-------+--------
 1 | 385343 | 44.99 | 20.00
(1 row)           

Not only a query can contain many subselects, subselects can also contain subselects. This feature is powerful and useful—especially when working with complex queries and data. Just insert a subquery in your SQL code where you need a special value that has to be calculated at runtime.

Using SELECT and INSERT

INSERT statements can easily be combined with SELECT statements. These are usually done when a large amount of data has to be copied from one table into another or when the results of complex queries are stored in temporary tables to achieve better performance. In the following example, we create a temporary table and insert some values into it. Here is the CREATE TABLE command:

CREATE TEMPORARY TABLE temptable 
    (id integer, prodid int8, price numeric(9,2));

Note

The temporary table won't be listed by a \d command, but the table can be treated like a nontemporary table.

Now we insert some data from table products into table temptable:

INSERT INTO temptable(prodid, price) (SELECT prodid, price FROM products);   

Let's see whether the operations have been performed correctly:

SELECT * FROM temptable;

The test result is

 id | prodid | price
----+--------+-------
  | 385343 | 44.99
  | 385342 | 49.99
  | 394568 | 39.99
  | 106666 | 39.99
  | 765354 | 29.99
(5 rows)     

You can see that the database has done all INSERT operations properly.

Pattern Matching and Regular Expressions in SQL

A regular expression is a formula for matching strings that follow some pattern. Many people are afraid to use regular expressions, but when you get used to them, you can't live without them. Perl hackers especially seem to be addicted to regular expressions, because they are a fast and precise way of performing substring searching.

Regular expressions are implemented in all major databases and can therefore also be used in PostgreSQL. Although the syntax of regular expressions differs slightly from the syntax programming languages such as Perl, it is easy to learn.

Regular expressions consist of characters and metacharacters. Metacharacters are used to tell the database that a pattern has to occur more than once or something like that. In the simplest case, a regular expression can be a search string. Let's start with an example: We want to retrieve all people in table customer whose name is John:

SELECT * FROM customer WHERE name LIKE 'John';

The result is

 id | name |  email  | state
----+------+-------------+---------
 1 | John | john@no.any | Florida
(1 row)

Note

The database performs a substring search for John.

The LIKE token can be used to perform the pattern matching, but it is also possible to use the ~ operator instead.

Let's write a query where all names starting with R are displayed:

SELECT * FROM customer WHERE name ~ '^R';

The ^ operator is used to match the beginning of a string. The R must be the first character after the beginning.

 id | name |  email   | state
----+--------+--------------+---------
 2 | Robert | rober@no.any | Georgia
(1 row)   

What happens if the name starts with r instead of R? The ~ operator performs case-sensitive searching. We can use the ~* operator to perform case-insensitive queries:

SELECT * FROM customer WHERE name ~* '^r';

The result is the same as the previous one. Now let's query all names that do not start with r:

SELECT * FROM customer WHERE name !~ '^R';  

The result contains the remaining records:

 id | name |  email   | state
----+-------+--------------+---------
 1 | John | john@no.any | Florida
 3 | Peter | peter@no.any | Alaska
(2 rows)  

The case-insensitive version of the query is

SELECT * FROM customer WHERE name !~* '^r';

In some cases, it is necessary to search for a sequence of characters in a string. Here is an example where we search for all names that have er somewhere in them:

SELECT * FROM customer WHERE name ~* 'er';

The query returns the records of Robert and Peter. It is also possible to see whether certain characters appear in a string:

SELECT * FROM customer WHERE name ~ '[A-P]'; 

This query returns all names that contain one of the letters from A to P. If you don't want to have the records from A to P, but do want all others, write [^A-P] instead. The same kind of query can be done with numbers.

Suppose we want to look for all records that do not have the numbers 2 or 3 in the id and contain the letters from A to P:

SELECT * FROM customer WHERE id ~ '[^23]' AND name ~ '[A-P]'; 

The result is

 id | name |  email  | state
----+------+-------------+---------
 1 | John | john@no.any | Florida
(1 row)

The $ operator symbolizes the end of a string. If we want to query all records that end with an n, we can use the following SQL command:

SELECT * FROM customer WHERE name ~ 'n$';

The only name ending with an n is John, so that is the only record returned. What happens if we want to query records containing a $ or any other metacharacter? To show the solution for such a problem, let's create a temporary table and insert one record:

CREATE TEMPORARY TABLE demo(demo text);
INSERT INTO demo VALUES ('demo$');

Now we want to write a query that retrieves the records:

SELECT * FROM demo WHERE demo ~ '\$';

The character has to be masqueraded using a backslash (\). The result of the query is

 demo
-------
 demo$
(1 row) 

In this example, you can see how powerful regular expressions are, but the real power of regular expressions can be seen when you have to perform complex queries. Assume a situation where someone tells you that he has bought a book on "the third." You don't know the month but you know that it must have been in the year 2001. You can write a query that selects all possible records from table sales:

SELECT id, salestime 
    FROM sales 
    WHERE salestime ~ '2001-.{2}-03';

The query selects all records whose salestime contains 2001, a - and exactly 2 characters before -03. The two undefined characters are the month when the sale took place. A period (.) matches any character exactly once. Parentheses can be used to tell the database how many undefined characters must be found before the rest. If you don't know whether the month is stored with one or two numbers, you can use the following:

SELECT id, salestime 
    FROM sales 
    WHERE salestime ~ '2001-.{1,2}-03';   

The result of the query contains the requested record:

 id |    salestime
----+------------------------
 1 | 2001-04-03 15:01:27+02
(1 row)

Assume that you knew that the sale took place in March or April; you can use this query:

SELECT id, salestime 
    FROM sales 
    WHERE salestime ~ '2001-.3?4?-03'; 

A question mark (?) symbolizes that a certain character may occur zero times or once. In this example, we look for March (3) or April (4). If one of the characters is found, the right result will be displayed.

Two additional operators are the + and the *. + means that the previous character must match at least once. * means that the previous character must match at least zero times. Imagine a query where you are looking for a certain product id. The only thing you know is that it started with 106 and continued with 6, but you don't know how many times the 6 is in the product id. Here is the solution for your problem:

SELECT * FROM products WHERE prodid ~ '106*';

In this example, 6 has to occur after 10 at least zero times; the result will contain the requested product id:

 id | prodid | price | stored
----+--------+-------+--------
 4 | 106666 | 39.99 | 120.00
(1 row)     

Tables 3.1 and 3.2 present an overview of the most important operators and metacharacters, respectively, of regular expressions.

Table 3.1 Operators of Regular Expressions

Operator

Description

~

Case-sensitive matching

~*

Case-insensitive matching

!~

Does not match, case-sensitive

!~*

Does not match, case-insensitive


Table 3.2 Metacharacters of Regular Expressions

Special Metacharacter

Description

^

Matches the beginning of a string.

$

Matches the end of a string.

.

Matches any character exactly once.

+

Matches previous character once or more often.

*

Matches previous character zero times or more often.

?

Matches previuous character zero times or more often.

[abc]

Matches one of the characters in the list.

[^abc]

Doesn't match one of the characters in the list.

[a-c]

Matches all characters from a to c.

[^a-c]

Doesn't match the characters from a to c.

{x}

The previous character must occur exactly x times.

{x,y}

The previous character must occur x to y times.

( )

Parentheses are used to group expressions.


Groups occur as one character for the expression after the parentheses (for example, the * operator after the parentheses means that the whole expression in parentheses has to match zero times or more often).

The examples of regular expressions might seem a bit strange. You might wonder who is looking for persons or for products with a stupid sequence of characters in it. Don't be surprised, but most of the examples are taken from real-world situations. Sometimes a customer wants to know something, but he can't say precisely what. He might say, "It must be something like that." In these cases, you can use regular expressions. Regular expressions can save you a lot of time and lots of code. You can solve complex problems within a few seconds and with only a few characters or lines of code.

Those people who are already addicted to regular expressions—like me—know what I am talking about.

Using SELECT INTO Statements

You use SELECT INTO to redirect the result of a query into a table. This is sometimes useful when debugging code; or you can use it to write two simple, rather than one complex, SQL statement:

SELECT prodid, price 
    INTO TABLE sel 
    FROM products;

The SELECT INTO statement creates a new table called sel. Let's look at whether the table has been created and filled correctly:

shop=# SELECT * FROM sel;
 prodid | price
--------+-------
 385343 | 44.99
 385342 | 49.99
 394568 | 39.99
 106666 | 39.99
 765354 | 29.99
(5 rows)    

sel has two columns called prodid and price, and contains the complete result of the SELECT statement. sel is not a temporary table, so it cannot be seen only by the user who created it. If you want sel to be a temporary table, try the following command:

DROP TABLE sel; 
SELECT prodid, price 
    INTO TEMPORARY TABLE sel 
    FROM products;   

The data you want to insert into a table created by SELECT INTO can be the result of a very complex SQL query.

Note

If you want to add data to a table without creating a new table, use INSERT in combination with SELECT statements.

Formatting the Output and Casting

When performing SELECT statements, psql creates a header for the tables. This header might not be, in many cases, the result you want to see. Let's see how you can make the output look right. We start with a simple SELECT statement:

SELECT AVG(price) FROM products; 

The result of the query does not look right:

   avg
---------------
 40.9900000000
(1 row)

We want to change the header to average price, which can be done with the following query:

shop=# SELECT AVG(price) AS "average price" FROM products;
 average price
---------------
 40.9900000000
(1 row)

Now we want to display the price and cast it to integer:

SELECT CAST(AVG(price) AS int4) AS "average price" 
    FROM products; 

The result of the query looks like this:

 average price
---------------
      41
(1 row)

In this example, we perform an explicit cast. First the average price is calculated, and then the cast to integer is done. We can use the CAST() function to perform the operation. Casts can also be done without the help of the CAST() function. Simply add a colon and the datatype you want to receive to the column:

SELECT AVG(price)::int4 AS "average price" 
    FROM products;  

Finally we want the price to be displayed with a $ at the end of the number, so we use the following code:

SELECT '$' || CAST(AVG(price) AS int4) AS "average price"
  FROM products;

The || operator concatenates values in SQL. In our case, we add the $ symbol to the result:

 average price
---------------
 $41
(1 row)

Note

NOTE

Not all types of casts can be performed by the database. PostgreSQL 7.0.3, for instance, does not support casts from numeric to text; we had to use integer in the example , but to_char()is also possible.

Now we want to write a query where multiple columns are treated as one:

SELECT name || ' bought ' || amount || ' units' AS "result" 
    FROM sales; 

We select and concatenate multiple columns and assign the name result to the column. The table is now displayed as one column:

    result
-----------------------
 John bought 1 unit
 John bought 1 unit
 Robert bought 2 units
 Robert bought 1 unit
(4 rows)

In the next example, we add a WHERE clause to the query that selects only those records that contain John:

SELECT name || ' bought ' || amount || ' units' AS result 
    FROM sales 
    WHERE name || ' bought ' || amount || ' units' 
        ~ 'John'; 

The result contains two records:

    result
---------------------
 John bought 1 unit
 John bought 1 unit
(2 rows)

We can't use the alias of the column in the WHERE clause because this would lead to an error in PostgreSQL 7.0.3. We have to use the full name of the column as it is shown in the previous example:

shop=# SELECT name || ' bought ' || amount || ' units' 
    AS result 
    FROM sales 
    WHERE result ~ 'John';
ERROR: Attribute 'result' not found

Generating HTML

HTML is the most important language when dealing with Web applications. psql offers a flag that tells the database to generate HTML code instead of ordinary ASCII text. Unfortunately, this feature is supported only by psql and not by other interfaces of PostgreSQL. Because most Web interfaces won't use psql in combination with pipes, the flag is not used very often.

Use psql with the -H flag to enable HTML output. From now on, all results will be returned as HTML code. The most common way of performing a query that returns HTML code is to use your favorite UNIX shell.

Here is an example where we select all products that are more expensive than 40 dollars:

echo "SELECT * FROM products WHERE price > 40;" | psql -H shop

The result will be a table including all HTML tags:

<table border=1>
 <tr>
  <th align=center>id</th>
  <th align=center>prodid</th>
  <th align=center>price</th>
  <th align=center>stored</th>
 </tr>
 <tr valign=top>
  <td align=right>1</td>
  <td align=right>385343</td>
  <td align=right>44.99</td>
  <td align=right>20.00</td>
 </tr>
 <tr valign=top>
  <td align=right>2</td>
  <td align=right>385342</td>
  <td align=right>49.99</td>
  <td align=right>64.00</td>
 </tr>
</table>
(2 rows)<br>
  • + Share This
  • 🔖 Save To Your Account