Table of Contents
- Microsoft SQL Server Defined
- Microsoft SQL Server Features
- Microsoft SQL Server Administration
- Microsoft SQL Server Programming
- Performance Tuning
- Practical Applications
- Professional Development
- Application Architecture Assessments
- Business Intelligence
- Tips and Troubleshooting
- Additional Resources
NoSQL for the SQL Server Professional, Part One
Last updated Mar 28, 2003.
Many data professionals are very familiar with Relational Database Management Systems (RDBMS) like SQL Server or Oracle. The concepts within an RDBMS have served data needs since it was adopted by enterprises in the early 1970s. In fact, most of us have "grown up" in our IT careers using an RDBMS of one form or another for almost all programmatic data. We spend our days working with SQL Server and other packages, learning about them, and eagerly await each new release, loaded with features and enhancements.
But those features and enhancements aren't always what a developer is looking for. In fact, sometimes they are looking for a quick, simple storage method — and this part is most important of all — that will scale with large data sets without their having to change the design. They don't want or need an administrator, they don't design the data the same way, and the structures they care about do not necessarily map directly to the relational calculus that RDBMS systems are built on.
Often the data design process is different for developers. Many data professionals, including myself, normally follow this process for designing a database from a set of business requirements:
- Pull out nouns and verbs
- Arrange nouns into entities
- Further divide nouns into "Parent" and "Child" entities
- Apply structures such as tables, data types and constraints to entities
- Verbs become relationships between entities
This is called "Database-First" design and is common among data professionals. But many developers use a different process called "Model-First" or a "Domain-Driven Development Process." Using this process, developers create code, often using Object-Oriented designs, which then drive out "Classes." A class in this context looks a lot like an entity or table to a database professional, but it lacks things like Primary and Foreign Keys, might have different data types, and any RDBMS programming logic is not contained in the class definition.
These classes are often created using Object-Role Modeling (ORM) or other tools, using frameworks from Java or the Entity Framework in .NET programming. So you'll often see these data designs look vastly different from an RDBMS database model — the normalization is different, and Referential Integrity might be placed in the programming layer, called Programmatic Referential Integrity or PRI as opposed to the database program making sure that people can't insert, update or delete incorrectly, called Declarative Referential Integrity or DRI. Whereas you and I would create constraints like Keys or CHECK objects to make sure a number entered falls within a certain range, a developer will just code that right into the program. Often their assumption is that no one can access the data layer except through their program, where a data professional might make the same data source available to other programs, reporting systems and so on.
Often this causes a conflict. The developers might start work on a project, and the data professionals are brought in after the initial design. At that point, the data team wants to begin a database design effort, and in some cases the developers are already past that stage. Everyone has commendable goals — the data professional wants to ensure the safety, stability and integrity of the data, and the developers want fast programs and programming methodologies that don't have time to design the data down to this level of granularity.
Enter "NoSQL." This is a fairly new term that originally indicated that the Structured Query Language (SQL) was not the interface to this class of data storage and retrieval. After much questioning from the SQL community that was trying to make sense of it all, some in the NoSQL community stated that the name meant "Not Only SQL."
Once you begin to peel back the terminology arguments, however, the NoSQL offerings can become confusing very quickly. There is a mix of exactly what the products do with a discussion of how they do it, and a new one seems to be released every couple of months of so. It can be a challenge to find out which one is in vogue at any given time or company.
Developers don't actually care about or use the term "NoSQL" very often — they simply want a quick way to lay out a set of classes and map them back to a reliable, fast, widely scalable data engine. Since most of the NoSQL products are aligned to development methods. The developer simply creates their classes for data, and calls the engine's Application Programming Interface (API) to read or write data (updates are merely writes). In some cases, these API's can be called from an HTTP string — more about that in other books and articles.
But why should the data professional care about NoSQL? Shouldn't we focus on making sure our systems are performing well, have good security, and are optimized? Well, the primary reason we should care is because we should be data stewards, not just tied to a single platform. We need to help the organization handle data in whatever form it works, and however it best suits them.
You also need to care because this methodology is very attractive to developers. If they toss up a NoSQL system, they no longer need to deal with the data team — they simply code and go. They'll explain to the business teams that companies like Google and Twitter use a NoSQL implementation, and those systems are large, and inexpensive. Those are things that resonate well with business people. If you understand how the NoSQL implementations work, you can become part of the process — using it where it makes sense, and calling out the times where an RDBMS might be more suited to a particular task.
In this overview, I'll briefly describe the rationale behind NoSQL databases, how they differ from an RDBMS, and where folks are using them.
It's important to differentiate between the physical way a NoSQL product is persisted, and the interfaces, coding and access methods they use for the abstraction of data. You'll often run into confusion on this point, since some NoSQL offerings imply a certain data storage method and physical architecture, and others can use differing underlying structures and physical layouts. I'll try and point these out as I explain each offering.
At a high level, and keeping the caveat I mentioned above in mind, there are some general similarities between the popular NoSQL products. Many of the offerings at the physical level are really a distributed data store, not really all that dissimilar from COBOL flat-files.
Most of the NoSQL products differ in the way the data is designed and the physical layout from an RDBMS. The data structures are more simplistic, with only a few columns. In fact, in some of the offerings there are only two columns, a key to find the data and the data, line-by-line.
The structure of a data set — what we would refer to as a table or entity - can change from line to line. In other words, one "row" of data in a NoSQL schema can have different attributes (or what we call columns) than the row below it. This can look very strange when you're staring at the data directly, but again, developers don't do that. They call the data object in code, and then query it. So to a developer the underlying design isn't very interesting to look at, while we data professionals usually live in that area.
The program access methods and the languages used in NoSQL are much different. While the name NoSQL indicates that you don't use the Structured Query Language to access data, in some cases there is a SQL layer over a NoSQL structure, which can be confusing. In most NoSQL products, however, the query language is left in the programming API's, with an interesting set of twists. As I've mentioned, for some NoSQL implementations an "Update" is really a delete/write process. Interestingly, this is also technically the way SQL Server and other RDBMS systems perform an update as well, it's just that they give you an UPDATE statement, and in the programming API there is not. Many developers use a framework such as EF or they use something like the LINQ query language so that the query stays constant for them regardless of the underlying query engine.
The next interesting thing about the programming methods for some NoSQL products is that the query language is often less about a SELECT operation that limits the data that is brought back than a filter that is applied across the entire data set. While the effect is the same in that only some data is returned to the program, the way that is done at the engine level is quite dramatic. Using a filter operation instead of a seek-and-build-a-data-set, the work set can be broken down, allowing a huge level of parallel operations. In some cases the "query" is broken down into a series of steps, and then mapped out to which nodes of the system hold the data elements. This means you can add nodes and data sources almost indefinitely, distributing the load using a set of hash tables to break out the work.
There are some trade-off considerations, of course. While it's easier to program for the developer in a NoSQL environment and it scales well, you may need more program-side code than server-side code for a given operation. What that means is that there are no Stored Procedures, automatically scaling lock operations and other places to put the business logic. Everything, or perhaps a lot more of everything, needs to be in the code. Since bugs are inevitable, you can get phantom reads, integrity problems and commitment intervals that are not acceptable.
It should be noted that while NoSQL is often synonymous with a relaxed ACID proteries for data, there are variants out now that do enforce more ACID properties, called "NewSQL." Yet another term to learn.
Your tools for managing a NoSQL system are often programmatic, rather than a graphical toolset. So you will often see third party tools in wide use for NoSQL products. That may be in fact a hidden cost, so it's one you should consider.
The data types in a NoSQL product may be different than the data types you will see in a traditional RDBMS. Of course, coding languages like Java and .NET languages have quite a few data types themselves, but those don't always map one-to-one with SQL Server. But don't let the differences fool you — developers will explain that the primitive data types in a programming language like INT or BOOL can be used in a Class to define very complicated data types — perhaps even more complicated than RDBMS database data types, if you leave out the CLR layer in SQL Server or the Java types in Oracle.
In most cases, the relationships in a NoSQL implementation are done programmatically. There are no Primary or Foreign Keys, at least as they are defined in an RDBMS. I'll explain more about the set-theory and mathematics they use in the next article.
The final consideration I keep in mind when deciding whether a data set is used in NoSQL or an RDBMS are the multiple camps in the NoSQL world. It's not an exaggeration to say that NoSQL variants are released at an astounding rate. Standardization can become a real issue as one NoSQL is used over another. Companies need to be aware of this fragmentation. Not that it's a bad thing, but it is something to keep in mind for supportability and migrations and portability.
In the next article, I'll close out this topic with a discussion when and where you can use NoSQL databases, a few examples of those, and some places to learn more.