Introducing Amazon SimpleDB
Amazon has been offering its customers computing infrastructure via Amazon Web Services (AWS) since 2006. AWS aims to use its own infrastructure to provide the building blocks for other organizations to use. The Elastic Compute Cloud (EC2) is an AWS offering that enables you to spin up virtual servers as you need the computing power and shut them off when you are done. Amazon Simple Storage Service (S3) provides fast and unlimited file storage for the web. Amazon SimpleDB is a service designed to complement EC2 and S3, but the concept is not as easy to grasp as "extra servers" and "extra storage." This chapter will cover the concepts behind SimpleDB and discuss how it compares to other services.
What Is SimpleDB?
SimpleDB is a web service providing structured data storage in the cloud and backed by clusters of Amazon-managed database servers. The data requires no schema and is stored securely in the cloud. There is a query function, and all the data values you store are fully indexed. In keeping with Amazon's other web services, there is no minimum charge, and you are only billed for your actual usage.
What SimpleDB Is Not
The name "SimpleDB" might lead you to believe that it is just like relational database management systems (RDBMS), only simpler to use. In some respects, this is true, but it is not just about making simplistic database usage simpler. SimpleDB aims to simplify the much harder task of creating and managing a database cluster that is fault-tolerant in the face of multiple failures, replicated across data centers, and delivers high levels of availability.
One misconception that seems to be very common among people just learning about SimpleDB is the idea that migrating from an RDBMS to SimpleDB will automatically solve your database performance problems. Performance certainly is an important part of the equation when you seek to evaluate databases. Unfortunately, for some people, speed is the beginning and the end of the thought process. It can be tempting to view any of the new hosted database services as a silver bullet when offered by a mega-company like Microsoft, Amazon, or Google. But the fact is that SimpleDB is not going to solve your existing speed issues. The service exists to solve an entirely different set of problems. Reads and writes are not blazingly fast. They are meant to be "fast enough." It is entirely possible that AWS may increase performance of the service over time, based on user feedback. But SimpleDB is never going to be as speedy as a standalone database running on fast hardware. SimpleDB has a different purpose.
Robust database clusters replicating data across multiple data centers is not a data storage solution that is typically easy to throw together. It is a time consuming and costly undertaking. Even in organizations that have the database administrator (DBA) expertise and are using multiple data centers, it is still time consuming. It is costly enough that you would not do it unless there was a quantifiable business need for it. SimpleDB offers data storage with these features on a pay-as-you-go basis.
Of course, taking advantage of these features is not without a downside. SimpleDB is a moderately restrictive environment, and it is not suitable for many types of applications. There are various restrictions and limitations on how much data can be stored and transferred and how much network bandwidth you can consume.
SimpleDB differs from relational databases where you must define a schema for each database table before you can use it and where you must explicitly change that schema before you can store your data differently. In SimpleDB, there is no schema requirement. Although you still have to consider the format of your data, this approach has the benefit of freeing you from the time it takes to manage schema modifications.
The lack of schema means that there are no data types; all data values are treated as variable length character data. As a result, there is literally nothing extra to do if you want to add a new field to an existing database. You just add the new field to whichever data items require it. There is no rule that forces every data item to have the same fields.
The drawbacks of a schema-less database include the lack of automatic integrity checking in the database and an increased burden on the application to handle formatting and type conversions. Detailed coverage of the impact of schema-less data on queries appears in Chapter 4, "A Closer Look at Select," along with a discussion of the formatting issues.
Stored Securely in the Cloud
The data that you store in SimpleDB is available both from the Internet and (with less latency) from EC2. The security of that data is of great importance for many applications, while the security of the underlying web services account should be important to all users.
To protect that data, all access to SimpleDB, whether read or write, is protected by your account credentials. Every request must bear the correct and authorized digital signature or else it is rejected with an error code. Security of the account, data transmission, and data storage is the subject of Chapter 8, "Security in SimpleDB-Based Applications."
Billed Only for Actual Usage
In keeping with the AWS philosophy of pay-as-you-go, SimpleDB has a pricing structure that includes charges for data storage, data transfer, and processor usage. There are no base fees and there are no minimums. At the time of this writing, Amazon's monthly billing for SimpleDB has a free usage tier that covers the first gigabyte (GB) of data storage, the first GB of data transfer, and the first 25 hours of processor usage each month. Data transfer costs beyond the free tier have historically been on par with S3 pricing, whereas storage costs have always been somewhat higher. Consult the AWS website at https://aws.amazon.com/simpledb/ for current pricing information.
Domains, Items, and Attribute Pairs
The top level of data storage in SimpleDB is the domain. A domain is roughly analogous to a database table. You can create and delete domains as needed. There are no configuration options to set on a domain; the only parameter you can set is the name of the domain.
All the data stored in a SimpleDB domain takes the form of name-value attribute pairs. Each attribute pair is associated with an item, which plays the role of a table row. The attribute name is similar to a database column name but unlike database rows that must all have identical columns, SimpleDB items can each contain different attribute names. This gives you the freedom to store different data in some items without changing the layout of other items that do not have that data. It also allows the painless addition of new data fields in the future.
It is possible for each attribute to have not just one value, but an array of values. For example, an application that allows user tagging can use a single attribute named "tags" to hold as many or as few tags as needed for each item. You do not need to change a schema definition to enable multi-valued attributes. All you need to do is add another attribute to an item and use the same attribute name with a different value. This provides you with flexibility in how you store your data.
SimpleDB is primarily a key-value store, but it also has useful query functionality. A SQL-style query language is used to issue queries over the scope of a single domain. A subset of the SQL select syntax is recognized. The following is an example SimpleDB select statement:
SELECT * FROM products WHERE rating > '03' ORDER BY rating LIMIT 10
You put a domain name—in this case, products—in the FROM clause where a table name would normally be. The WHERE clause recognizes a dozen or so comparison operators, but an attribute name must always be on the left side of the operator and a literal value must always be on the right. There is no relational comparison between attributes allowed here. So, the following is not valid:
SELECT * FROM users WHERE creation-date = last-activity-date
All the data stored in SimpleDB is treated as plain string data. There are no explicit indexes to maintain; each value is automatically indexed as you add it.
High availability is an important benefit of using SimpleDB. There are many types of failures that can occur with a database solution that will affect the availability of your application. When you run your own database servers, there is a spectrum of different configurations you can employ.
To help quantify the availability benefits that you get automatically with SimpleDB, let's consider how you might achieve the same results using replication for your own database servers. At the easier end of the spectrum is a master-slave database replication scheme, where the master database accepts client updates and a second database acts as a slave and pulls all the updates from the master. This eliminates the single point of failure. If the master goes down, the slave can take over. Managing these failures (when not using SimpleDB) requires some additional work for swapping IP addresses or domain name entries, but it is not very difficult.
Moving toward the more difficult end of the self-managed replication spectrum allows you to maintain availability during failure that involves more than a single server. There is more work to be done if you are going to handle two servers going down in a short period, or a server problem and a network outage, or a problem that affects the whole data center.
Creating a database solution that maintains uptime during these more severe failures requires a certain level of expertise. It can be simplified with cloud computing services like EC2 that make it easy to start and manage servers in different geographical locations. However, when there are many moving parts, the task remains time consuming. It can also be expensive.
When you use SimpleDB, you get high availability with your data replicated to different geographic locations automatically. You do not need to do any extra work or become an expert on high availability or the specifics of replication techniques for one vendor's database product. This is a huge benefit not because that level of expertise is not worth attaining, but because there is a whole class of applications that previously could not justify that effort.
One of the consequences of replicating database updates across multiple servers and data centers is the need to decide what kind of consistency guarantees will be maintained. A database running on a single server can easily maintain strong consistency. With strong consistency, after an update occurs, every subsequent database access by every client reflects the change and the previous state of the database is never seen.
This can be a problem for a database cluster if the purpose of the cluster is to improve availability. If there is a master database replicating updates to slave databases, strong consistency requires the slaves to accept the update at the same time as the master. All access to the database would then be strongly consistent. However, in the case of a problem preventing communication between the master and a slave, the master would be unable to accept updates because doing so out of sync with a slave would break the consistency guarantee. If the database rejects updates during even simple problem scenarios, it defeats the availability. In practice, replication is often not done this way. A common solution to this problem is to allow only the master database to accept updates and do so without direct contact with any slave databases. After the master commits each transaction, slaves are sent the update in near real-time. This amounts to a relaxing of the consistency guarantee. If clients only connect to the slave when the master goes down, then the weakened consistency only applies to this scenario.
SimpleDB sports the option of either eventual consistency or strong consistency for each read request. With eventual consistency, when you submit an update to SimpleDB, the database server handling your request will forward the update to the other database servers where that domain is replicated. The full update of all replicas does not happen before your update request returns. The replication continues in the background while other requests are handled. The period of time it takes for all replicas to be updated is called the eventual consistency window. The eventual consistency window is usually small. AWS does not offer any guarantees about this window, but it is frequently less than one second.
A couple things can make the consistency window larger. One is a high request load. If the servers hosting a given SimpleDB domain are under heavy load, the time it takes for full replication is increased. Additionally a network or server failure can block replication until it is resolved. Consider a network outage between data centers hosting your data. If the SimpleDB load-balancer is able to successfully route your requests to both data centers, your updates will be accepted at both locations. However, replication will fail between the two locations. The data you fetch from one will not be consistent with updates you have applied to the other. Once the problem is fixed, SimpleDB will complete the replication automatically.
Using a consistent read eliminates the consistency window for that request. The results of a consistent read will reflect all previous writes. In the normal case, a consistent read is no slower than an eventually consistent read. However, it is possible for consistent read requests to display higher latency and lower bandwidth on occasion.