Relational (SQL) vs Non-relational (noSQL) Databases

Brad Vander Zanden

Material for these notes has been taken from various Wikipedia pages and the following Stack Exchange post.
  1. Relational Database Management System (RDBMS): A database system that organizes its data into tables with each table containing a fixed, named set of columns and each row providing values for each column (i.e., each cell). Here's an example of a relation for Employees:

    employeeIdfirstname lastnamebranchIdgenderbirthdateemploymentDate
    38396BradVander ZandenB003M1964-02-031990-08-01
    53986SmileyHoundB002F2005-11-282007-04-27
    14983WinnieThe PoohB002NULL1940-10-062010-12-25
    00576EbbyLabB003FNULL2002-8-15

    1. A NULL value may be provided to indicate the absence of a value for a cell.
    2. Tables are typically kept in a "normalized" form that minimize data redundancy. In particular, RDBMS's try to limit data redundancy to keys.
      1. Primary keys uniquely identify rows and may consist of one or more attributes. Often times artificial keys are used, such as social security numbers or student ids, to ensure uniqueness.
      2. Foreign Keys refer to a primary key in another table and act like a pointer that allows the columns of two tables to be joined. For example, consider the following two tables:
          Employee(employeeId, firstname, lastname, branchId, gender, birthdate, employmentDate)
          Branch(branchId, streetAddress, city, state, zipcode, name)
        
        employeeId in Employee and branchId are primary keys while branchId in Employee is a foreign key. the branchId allows the Employee and Branch relations to be joined, so that, for example, we can discover the name of the branch at which "Brad Vander Zanden" works.
    3. RDBMS's enforce various constraints
      1. entity integrity: a primary key is not allowed to be NULL
      2. referential integrity: a foreign key either points to a valid entry in another table or else all of its columns are NULL (a good analogy is that dangling pointers are forbidden)
      3. cell integrity: Every cell has either a single value or the NULL value. A cell may not be empty and it may not contain multiple values, such as multiple hobbies.
      4. business constraints : Users can write queries that enforce various business policies, such as no employee making more than $100K or no sales person handling more than 50 accounts.
    4. RDBMS's support rich, declarative query languages that allow end-users with no programming experience to specify queries that indicate what information the end-user wants without having to specify how to obtain it using a programatic procedure.
      1. Simple selection example: We want to find all employees who started before Jan. 1, 1990.
        SELECT firstname, lastname, employmentDate		  
            FROM Employee
            WHERE employmentDate < '1990-01-01';
        		
      2. Join example: We want to find the name of the branch at which Brad Vander Zanden works.
        SELECT E.firstname, E.lastname, B.name
            FROM Branch B NATURAL JOIN Employee E
            WHERE E.firstname = "Brad" AND E.lastname = "Vander Zanden";
        
        The NATURAL JOIN command tells the RDBMS to join the two tables on columns with identical names, which in this case is branchId.
      3. Grouping example: Here is an example where we want to find all branches that employ more than 50 employees--we are using some aggregate operations:
        SELECT branchId, COUNT(employeeID) as staffCount
            FROM Employee
            GROUP BY branchId
            HAVING COUNT(employeeID) > 50
            ORDER BY branchId;
        		
      4. The results are printed in easy to read tables.
    5. RDBMS's allow multiple transactions to concurrently access a database. A transaction is one or more statements that are executed as an atomic unit and that may Create, Read, Update, or Delete rows from a table. The term CRUD is often used to denote these four operations. RDBMS's maintain 4 conditions of transactions commonly referred to as ACID. ACID stands for Atomicity, Consistency, Isolation, and Durability. These 4 properties of database transactions are intended to guarantee validity even in the event of errors, power failures, etc.
      1. Atomicity requires that a transaction which consists of several statements either complete or rollback completely, thus making the multiple statements succeed or fail as an atomic unit. A failure of a single statement should cause the entire transaction to fail. Partially finished transactions should never be visible, since a statement could subsequently fail, thus causing the entire transaction to rollback.
      2. Consistency requires that a transaction bring the database from one valid state to another that adheres to all constraints. For example, RDBMS's require that a foreign key either point to an existing parent or be NULL. So if a transaction tries to insert a child that points to a parent, such as "Brad Vander Zanden", then "Brad Vander Zanden" must be verified to exist and must continue to exist until the transaction has been completed.
      3. Isolation requires that transactions should not interfere with each other. The system should guarantee the same results if the transactions are executed in parallel or sequentially. In practice most RDBMS products allow modes that trade off isolation against performance.
      4. Durability requires that once a transaction has been committed, it will remain committed even in the case of a system failure (e.g., power outage or crash). This usually means that completed transactions are recorded in non-volatile memory, such as on a disk or solid-state memory.
    6. RDBMS's use a schema or catalog to specify 1) the names of each relation, 2) the names and types of each of a relation's columns, 3) constraints on each relation, such as primary and foreign keys, and 4) the order in which the columns appear in the relation.
    7. SQL is a widely used standard for RDBMS's. Oracle's SQL system and the open source mySQL system are the two most widely-used SQL systems.
  2. NoSQL databases refer to a collection of non-relational database systems that have been developed to handle semi-structured data and real-time data. NoSQL databases are frequently used for big data analysis and by social media companies.
    1. Why traditional relational DBMS (RDBMS) are less than ideal for such applications
      1. They require structured data that can be easily organized into tables with the same set of properties for each table entry. Social media posts often vary in the set of properties that are present or absent, thus leading to a fair amount of sparcity in the data. RDBMS's can handle sparce data in one of two ways:
        1. Use NULL values for missing property values: This can consume unnecessary storage.
        2. Factor sparcely used properties into separate relations that can be reconstructed using joins: This can consume extra time as multiple secondary storage accesses may be required for both reads and updates.
      2. It is harder to implement RDBMS's using horizontal clusters of computers, especially commodity computers, that do not share memory. This is because distributed computing requires complicated and time-consuming locking in order to maintain the ACID conditions of a transaction. In theory it can be done but there's just not a big enough market for it.
      3. Because of the need to perform joins to bring together related information and because of the need to maintain consistency of transactions, relational databases may be too slow to support thousands of simultaneous queries, which may be needed for social media platforms.
    2. How NoSQL Databases solve these problems
      1. semi-structured data: Most NoSQL databases use some form of key-value pairs to represent data, where the key is a unique identifier for an entity and the value is a set of one or more fields (also key-value pairs). Each entity is allowed to have a different set of fields, and hence data can be stored more compactly because each entity only stores those fields that contain values.
        1. This only works well with pretty sparse data because noSQL databases have to explicitly store field names whereas relational databases store the field name in their schemas and hence only have to store the value of each column in a table.
        2. Some noSQL databases support schemas that at least specify the allowable field names and types for the fields, whereas others allow arbitrary field names and dynamic types for the fields in order to allow maximum flexibility.
      2. horizontal clusters: Many, although not all, NoSQL databases relax the ACID requirements of RDBMS's and instead use a concept known as eventual consistency.
        1. Eventual consistency maintains multiple copies of data that are held on different nodes in a cluster and allows reads to be made from any node. When an update transaction is executed, it is executed at multiple nodes concurrently and when a "quorem" of nodes indicates that they have committed the transaction, then the transaction is considered committed and the transaction is then completed at the other nodes as well (if the transaction fails at another node it is rolled back at that node and attempted again).
        2. Eventual consistency informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. However, it does have two drawbacks:
          1. Between the time when a transaction is committed and the transaction is completed at all nodes, some reads will obtain the older, obsolete data. Hence eventual consistency trades off faster reads for potentially out-of-date reads. Obviously this is not acceptable in situations that require up-to-date information, such as financial account balances or inventory reports. However, it is acceptable on many social media platforms. For example, when a user asks to read all the comments on a blog post, it is not critical that they see the most recently added comments--it is okay if they see most of the posts.
          2. If two or more updates come in at roughly the same time and they clash, there is no guarantee as to which of the updates the database will converge to, only that the database will eventually converge to one given enough time. The mechanics by which this happens are complicated and beyond the scope of this course.
        3. Despite its name, eventual consistency relaxes the Isolation property of ACID, not the Consistency condition.
        4. Eventual consistency dramatically reduces the sophisticated locking required by RDBMS's for horizontal clusters, thus making it easier to implement noSQL databases on horizontal clusters.
      3. performance: noSQL databases achieve performance improvements over RDBMS's in the following ways:
        1. Using "eventual consistency" to support replicated data
        2. Using non-normalized data in which instead of just storing foreign keys, you also store the foreign values. For example, you might store all of the comments associated with a blog post in the same MongoDB document. This means you do not have to perform an expensive join, but it also means that if something changes, such as the username associated with comments, then you have to make multiple changes to the username in the database, rather than just the one change that would be required with normalized tables. Hence using non-normalized data works better in a read-heavy database.
        3. Using multiple, simple queries rather than one complicated join query: Rather than using non-normalized data, one may instead use multiple queries to retrieve data from two different collections of data (e.g., one collection may contain blog posts and a second collection might contain comments on posts). NoSQL queries tend to be more procedural while SQL queries are more declarative. What this means is that NoSQL makes the programmer spend the time to make queries efficient while SQL makes the query optimizer spend time to make queries efficient. Since the query optimizer has to "think" before executing the query, the NoSQL query tends to be faster. Second, since the NoSQL queries are typically simpler (but more numerous), it is easier for the programmer to make them more efficient, while the SQL query optimizer has to deal with more complicated queries that have multiple moving parts which make them harder to optimize.
        4. Using horizontal sharding. Horizontal sharding takes horizontal slices of a database and places them on different nodes by using a consistent hash function which evenly distributes the objects to different nodes. This partitioning evenly distributes the load among different processors and makes it easy to scale up or down as processors either come on-line or go off-line. In particular, a consistent hash function only has to remap K/n keys when a node is added or deleted, where K is the number of keys and n is the number of nodes, rather than remapping all K keys which is what a conventional hash functin must do.
    3. Types of NoSQL Databases: There are many types of NoSQL databases but the most common types are:
      1. key-value pair databases: These are very simple databases that look like associative arrays in scripting languages. Keys are unique but the value fields could be comma-separated lists of values and the values could themselves be records with different fields. For example, if a sensor is recording daily high temperatures in different cities, you might have a database that looks like:
        { knoxville : [78, 81, 80, 69],	
          nashville : [86, 82, 93, 95, 94, 78],
          memphis : [93, 92]
        }
        
        A more complicated example might be a contacts database:
        { "brad vanderzanden" : { street : "1678 Cardiff Rd", city : "Columbus", 
                                  state: "OH", zipcode: 43221,  
                                  email: "bvanderz@utk.edu" }
          "ebby" : { apartment: "Steeplechase Apartments", unit: 3503,
                       city : "Knoxville", state: "TN", zipcode: 37919, 
                       poBox: 481, phone: "865-486-3153"n }
        }
        
        Notice that there are some shared fields and some distinctly different fields in the two key-value pairs.
        1. Key-value databases are often used for big data analysis, such as for Map-Reduce operations. In a Map-Reduce operation, the same operation is applied to a set of files (Map) and then an operation is used to summarize the results (Reduce). For example, suppose we have 5 temperature databases that contain daily high temperatures. We could first use a Map operation to find the daily high temperature for each city in each file. For example:
          File 1: (knoxville, 81), (nashville, 95), (memphis, 93)
          File 2: (chattanooga, 75), nashville, 83), (franklin, 62), (memphis, 88)
          File 3: (maryville, 69), (johnson city, 78), (knoxville, 89)
          File 4: (nashville, 98), (memphis, 83), (alcoa, 75), (maryville, 78)
          File 5: (knoxville 48), (franklin, 49), (nashville, 53)
              
          We could then use a Reduce operation to find the maximum daily high for each city:
          (alcoa, 75),
          (chattanooga, 75),
          (franklin, 62),
          (johnson city, 78)
          (knoxville, 89),
          (maryville, 78),
          (memphis, 93),
          (nashville, 98)
              
        2. Apache's Hadoop system is an example of a key-value noSQL database system.
      2. Document Store: A more sophisticated type of key-value pair database that uses a unique key to identify a document and a record as the value. Like a key-value database, the record may consist of an arbitrary number of fields (which are themselves key-value pairs)
        1. Additional characteristics beyond key-value pair databases
          1. Use an encoding, such as JSON (or more likely Binary JSON (BSON)) or XML
          2. Provide a query language or API that allows documents to be accessed via their contents, and not just their keys. For example, here is a simple MongoDB query to find all contacts living at "Steeplechase Apartments":
            db.contacts.find( { apartment: "Steeplechase Apartments" } );
            	  
            These query languages tend to be much more low-level than the RDBMS query languages and they also do not tend to support joins.
        2. Document store databases are good for storing social media documents.
        3. MongoDB is a frequently used document store, noSQL database. It organizes documents as collections, where a collection may be loosely thought of as a table in a RDBMS and a document may be loosely thought of as a row. The contacts database shown above would be a representative collection in a MongoDB (it might have other collections as well). As another example, a MongoDB collection could be blog posts, with posts containing the user id, user name, blog post, and an array of comments on the blog post.
      3. Column Store Database: A database that can be loosely thought of as a two-dimensional database with each file representing a column storing a property of entities.
        1. Only entities that contain the property are stored in the file, unlike RDBMS systems where entities would store a NULL value in the column if they did not have a value for that property.
        2. Column databases essentially use a key-value representation with each key representing an entity and the value representing the entity's value for that property. You can collect all the information for an entity by querying for it across all column files, but obviously this is not as efficient as storing it as a document in a document-store database or a row in a RDBMS.
        3. Column databases work well for big data analysis in concert with Map-Reduce operations.
        4. Google's Bigtable database is a good example of a column database. It maps two arbitrary string values (row key and column key) and timestamp (hence three-dimensional mapping) into an associated arbitrary byte array. It can be thought of as a sparse, distributed multi-dimensional sorted map.
      4. Graph Database: A "network" database with origins in the 1960s that uses links to represent relationships between entities. For example, you could use a graph database to represent friendship links, road maps, network topologies, or public transport networks.
      5. Object-Oriented Database: An object-oriented database stores its entities as objects, rather than relations and uses pointers, rather than foreign keys, to link related objects. Like document store databases OO databases typically provide an API or query language for retrieving documents using either a primary key or a query on the contents of a document. These query languages tend to be much more low-level than the RDBMS query languages and they also do not tend to support joins. One would programmatically chase pointers to perform the equivalent of a join operation. Object-oriented databases have been around since the 1980s, so they tend to be older than most other types of noSQL databases.
  3. Advantages of an RDBMS over a noSQL database: RDBMS's are still the most widely used type of database because they have many advantages over noSQL databases (I constantly hear my friends in industry complain about their bosses forcing them to use noSQL databases when RDBMS's would do the job just as well with less trouble):
    1. The query languages are much richer for SQL databases and declarative, which means that end-users can write queries. That results in much greater flexibility and much faster turnaround in competitive business environments. Before the advent of databases, data was stored in loosely connected file systems that relied on programmers to write procedural queries. End users could wait days or weeks before a programmer became available to write the query, which can significantly slow down decision making. noSQL databases are essentially a throwback to those loosely connected file systems with all of their attendant drawbacks.
    2. Eventual consistency can be difficult to implement in a distributed system because you need convergence of updates. In addition, you cannot guarantee which update will eventually prevail nor can you guarantee that a read will be up-to-date, which is fatal in many systems that require a guarantee on which updates get executed (e.g., how would you feel if there was a race condition for that last item at Amazon and you lost or how would a bank feel about only one withdrawal getting posted to a user's account rather than all withdrawals getting posted--perhaps bankrupt?).
    3. If you are willing to invest in more expensive storage area network (SAN) hardware, then RDBMS's scale up to large workloads rather well because the hardware is faster (e.g., faster bandwidth between processors and faster switches) and more reliable. For example, commodity clusters use replicated nodes not just to speed up read operations, but also to guarantee Durability in case of a node failure. The reason they do that is because the nodes are much more likely to fail than they are in more expensive storage area networks. You can also use shared memory with multiple, high-speed processors, which requires some locking, but much simpler locking than what would be required for distributed, no-shared memory systems.
    4. RDBMS's are older and hence more stable than the more recent noSQL databases. For example, RDBMS's have more standardized interfaces due to the SQL standard, whereas noSQL has a plethora of interfaces, many of which are not standardized by any standards-making body.
    5. Many applications are highly amenable to the sort of structured data required by RDBMS's, and semi-structured data can be easily factored into multiple relations that work with RDBMS's, so if the data is not particulary sparse, RDBMS's can be a good solution for semi-structured data.
    6. Sharding is a technique that can be used with RDBMS's and in fact is used by mySQL to improve performance.
  4. The CAP and PACELC Theorems: These theorems state that distributed data stores require trade-offs, regardless of whether they are an RDBMS or a noSQL database. They could be called the "no free lunch" or "you can't have your cake and eat it too" theorems:
    1. CAP Theorem (originator Eric Brewer)
      1. Theorem statement: It is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:
        1. Consistency: Every read receives the most recent write or an error
        2. Availability: Every request receives a non-error response, but without the guarantee that it contains the most recent write
        3. Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped or delayed by the network between nodes.
      2. Theorem implications: The CAP theorem implies that in the presence of a network partition, one has to choose between consistency and availability. Note that consistency as defined in the CAP theorem is quite different from the consistency guaranteed in ACID database transactions.
      3. How RDBMS's and noSQL databases deal with the CAP theorem in the presence of a network partition
        1. RDBMS's favor consistency, and hence may either stall the user or return an error message if consistency cannot be guaranteed.
        2. noSQL favor availability, and use eventual consistency to eventually regain consistency.
      4. Network partitions are fairly rare, especially with high quality hardware, and hence the CAP theorem does not frequently come into play with RDBMS's (which tend to be implemented on high quality hadware). Even commodity hardware does not fail that frequently, although as the size of the cluster increases, network partitions do increase in frequency as well.
    2. PACELC theorem is an extension to the CAP theorem and comes into play all the time, whereas the CAP theorem comes into play infrequently.
      1. Theorem statement: In case of network partitioning (P) in a distributed computer system, one has to choose between availability (A) and consistency (C) (as per the CAP theorem), but else (E), even when the system is running normally in the absence of partitions, one has to choose between latency (L) and consistency (C).
      2. Theorem implications: In a distributed data store, you always have to deal with latency, which means that if you want high availability (i.e., low latency), then you will need to replicate data, which will decrease consistency.
      3. How RDBMS's and noSQL deal with the PACELC theorem. You will see systems rated as P[AC]/E[LC] where the P says what happens in the case of network partition and the E says what happens in the else case. [AC] are your two choices when faced with network partition (availability or consistency) and [LC] are your two choices when there is no network partition. If you choose L, then you are choosing low latency (and hence high availability) and if you choose C you are choosing consistency but higher latency.
        1. RDBMS's are PC/EC
        2. noSQL databases are all over the place. MongoDB and BigTable, the two noSQL databases cited in these notes are both PC/EC. However, noSQL databases that use eventual consistency are PA/EL.