This exam covers all the material for databases. Make sure you look at the
sample midterm and solutions on the CS465 website to get a feel for what the
format of the exam will be.

Introductory Material
    Functions provided by DBMS's
    	 Data Definition Language (DDL)
	 Data Manipulation Language (DML)
	 Access Control
    Advantages of DBMS's
    Disadvantages of DBMS's

Relational Model (Chap 4)
    Relational model is the theoretical basis for the DDL
    Relational model terminology
    	Relation/Table/File
	Tuple/Row/Record
	Attribute/Column/Field
	Degree
	Cardinality
    Know the Properties of a Relation
    Keys
        Super key
	Candidate key
	Composite key: More than one attribute
	Primary key
	Alternate key
	Foreign key
    Integrity constraints
    	NULL: represents a value that is currently unknown
	Entity integrity
	Referential integrity
	General constraints
    Views
        What they are
	How they're formed from base relations
	Purposes/Advantages
	Restrictions on updating views

SQL
    Select/Project/Join: These are the basic SQL operations. Know what each
    			 operation does (e.g., select filters rows)
    Know how to write queries involving the above operations including
    	 the various filters that can be used in where clauses (e.g., between,
	 in, relational and boolean operators, like, NULL)
	 
    Aggregate operations and group by/having
         Restrictions on aggregate operations:
	     1) can't be used in where clauses
	     2) only allowed attributes are results of aggregate operations
	        and attributes in group by
	 Aggregate operations can be used in Having clauses
	     1) where filters individual rows
	     2) having filters groups
    Order by: Sorts the result
    Joins
	Types
	    Theta
	    Equi
	    Natural
	Participation
	    Semi join
	    Inner join
	    Outer join
	        left
		right
		full

    Modification statements
        Insert
	Update
	Delete

    Subqueries
        Scalar:
	   --Sometimes are an alternative to joins
	   --Useful for getting a scalar from an aggregate for use in a
	   	    where clause
	   --Can also use scalar variables (must use views to hold relations
	   	 that are intermediate results)
        Row: Returns multiple rows/columns
	   --Only used with EXISTS predicate
	Table: Returns a table with one column
	   --Used with IN

SQL DDL
    Specifies
	tables, their attributes, and attributes' domains
	integrity constraints
	views

    Domain types not common in programming languages
    	date
	time
	timestamp
	interval: a period of time
	precise real numbers (numeric or decimal--either keyword works)

    Know ISO SQL syntax for creating tables, including for constraints
    	 required data
	 domain constraints: restrictions on values such as for salary or
	 	gender
	 entity integrity
	 referential integrity
	 general constraints

    Views
	How to create them (use a defining query)
	Horizontal versus Vertical view
	View maintenance
	     view resolution: reexecute the query on the base relations each
	     	  time
	     view materialization: cache the view tuples
	View updateability (if view is updated then must also update base
	     relation)
	     DISTINCT cannot be in defining query
	     every element in SELECT of defining query is a column name
	     FROM clause of defining query involves only one table
	     Defining query has no group by or having clause
	     Cannot violate integrity constraints on base table

Relational Algebra/Calculus
    Relational calculus: declarative, non-procedural language
    Relational algebra: high-level, procedural language
    Relational calculus is the theoretical basis for the relational model's DML
        SQL is based on the relational calculus
    Relationally complete: a language is relationally complete if each query
       can produce a relation that can be derived using relational calculus
    Relational algebra
       Fundamental operations
           Selection
	   Projection
	   Cartesian Product
	   Union
	   Set Difference
       Convenience operations (know how to derive the convenience operations
       		   from the fundamental operations)
           Join
	   Set Intersection
	   Set Division
       Union compatibility for set union, intersection, and difference

    Relational calculus: Only need to know tuple calculus
       Predicate versus proposition: predicate is a truth-valued function with
          arguments and a proposition is a substitution of concrete values
	  for the arguments and can be either true or false

  	  predicate generally contains tuple variables that have a range which
	  	    is all tuples in a relation
	  basic syntax: {S | Predicate(S) }: The set of tuples S such that
	  	Predicate(S) is true
	  Syntax
		How to write tuple variables
		How to write boolean operations
		Existential quantifier: true if there exists at least one tuple for
			    which the predicate is true
		Universal quantifier: true only if the predicate is true for
		 	  all tuples over which a tuple variable ranges
		Free versus Bound variables
		    bound variables: tuple variables bound by the existential
		    	  or universal quantifier
		    free variables: not bound by such quantifiers
		    only free variables can occur to left of the bar
	  Mapping relational calculus to relational algebra/SQL
	        Boolean predicates are typically used for Select queries
		Existential quantifier used for joins
		Universal quantifier and not exists quantifier
			  used for integrity constraints
		Set operations (slide 53 of relational calculus):
		    Union: Exist quantifier with boolean or
		    Difference: Not exists quantifier with boolean and
		    Intersection: Exist quantifier with boolean and
		    Difference: Universal quantifier with nested exists
		        quantifier

    Be able to:
       Write relational algebra queries
       Write relational calculus queries
       Convert SQL, relational algebra, and relational calculus queries to the
       	       other two forms
       Give English language descriptions of relational algebra and relational
       	    calculus queries

ER Design
    ER Concepts
       Entities
       Attributes
          Composite vs Simple
	  Single-valued vs Multi-valued: Multi-valued must go in a separate
	  	relation
	  Derived
       Relationships
	  Naming
              Use action verbs
	      Don't repeat names
	  Drawing
	      Use arrows for binary relationships but undirected edges for
	      	  anything else
	      Use diamonds to represent anything but a binary relationship.
	      	  For binary relationships always use a direct edge between
		  the two entities
	  Recursive Relationships
	      Name both sides with the role (e.g., supervisor supervises supervisee)
	  Constraints
	     Multiplicity
		participation
		cardinality
		determining multiplicity for complex relationships (above
		   binary): Provide concrete
		   instances for all but one entity and then figure out how
		   many instances of an entity can be associated with these
		   instances
	     Degree: Take cardinality of multiplicities for both entities
	     	     in the relationship
		1:1 : one to one
		1:* : one to many
		*:* : many to many

     Problems with ER Models
         Fan traps
	 Chasm traps

     In addition to the above information you should be able to:
         1) draw an ER diagram from a case description
         2) examine a set of relations and determine which ones represent:
	 	a) Entities
		b) Relationships
		c) Multi-valued attributes

     	 3) compute multiplicity and degree of a relationship
	 4) examine an ER diagram and
	     a) identify fan or connection traps
	     b) fix fan or connection traps

Normalization
    Terms
       Normalization
       Functional dependency
           Full functional dependency
	   Partial functional dependency
	   Transitive dependency
	   Determinant

    Purpose of Normalization
       1) Reduce inconsistency
       2) Minimize redundancy

    Important properties of normalization
       1) Lossless join
       2) Dependency preservation
       
    Types of anomalies
       Update
       Insert
       Delete

    Normal Forms
       0th Normal Form: Table that contains multi-valued attributes or
       	   repeating groups
       1st Normal Form: Every cell contains only one value
       2nd Normal Form: 1st Normal Form and every non-primary key attribute
       	   is fully functionally dependent on the primary key
       3rd Normal Form: 2nd Normal Form and no non-primary key attribute is
       	   transitively dependent on the primary key

    Normalization Algorithm
    	0th -> 1st NF:
	    Eliminate multi-valued attributes by moving them to
	    	      a separate relation
	    Separate repeating groups by putting them in separate rows
	1st -> 2nd NF: Move attributes in partial dependencies to new
	    relations. Determinant stays in old relation and becomes foreign
	    key in new relation.
	2nd -> 3rd NF: Move attributes in transitive dependencies to new
	    relations. Determinant stays in old relation and becomes foreign
	    key in new relation.

        There are also "candidate key" dependencies: These won't be used in
	    the normalization process but they will be preserved in one of
	    the final 3rd NF relations.

     In addition to knowing the above information, you should be able to:
	 1) Identify update, delete, and insert anomalies
	 2) Identify functional dependencies from a case description
	 3) Transform relations from x Normal Form to (x+1) Normal Form

Physical Design: Mapping relations to files and determining which attributes
    should be indexed

    Elements of access time for hard drive
    	 1) Seek Time
	 2) Rotational Latency

    Access times for blocks on a disk/SSD are much slower than to RAM so we
        need to minimize block accesses when finding tuples

    File Organizations
        1) Heaps: unordered files
	2) B+ trees: used for physically clustering records--good for range
	   queries
	3) Extendible hash tables: generally used for secondary indices--good
	   for point queries

    Types of Indices
    	1) Primary key index: File physically clustered on primary key: this
	   is unusual because primary keys are often artificial and hence
	   useless for range queries
	2) Cluster index: File physically clustered on a non-primary key--this
	   is far more common
	3) Secondary Index: Index on a non-clustered attribute(s) that helps
	   the query optimizer more efficienty locate records 

     Things to Index
        1) Attributes used in joins: primary and foreign keys
	2) Frequently accessed attributes
	3) Attributes involved in sorting
	   a) Order by
	   b) Group by
	   c) Set operations
	   d) DISTINCT
	4) Attributes used in aggregate functions: can often construct index-only
	   plans for such queries

     Things not to Index
     	1) Small relations
	2) Frequently updated attributes
	3) Attributes in which a large proportion of the relation will be
	   returned (E[#records/block] << 1)
	4) Attributes that are long character strings

     Indexing is an art not a science because the above guidelines conflict.
        You must use judgement and then tweak based on performance data.

     B+ Trees: Trees with large branching factors. They store only keys in
         interior nodes and tuples in the leaves
           M = max number of children per interior node
	   L = max number of tuples per leaf

	 1) Interior nodes must have at least M/2 children (hence M/2-1 keys)
	 2) Leaf nodes must have at least L/2 tuples
	 3) Root nodes may have 2 to M children
	 4) Guarantees log_M/2N disk accesses
	 5) When splitting nodes, if there is an extra key/tuple then place
	    it in the left node
	 6) When promoting keys and there are two "middle" keys, promote the
	    leftmost of the two keys

	 Be able to:
	    1) insert a value into a B+ tree
	    2) delete a value from a B+ tree

     Extendible Hash Tables: Guarantee that the index is at most two levels
     	1) Require at most 3 disk accesses--2 for the index and one for the
	   block containing the tuples
	2) Top-level directory is an array (all blocks must be physically
	   co-located) with "generic" hash keys
	3) Second-level directory stores the actual keys
	4) D = number of leading bits used by the root directory (top-level)
	5) D_L = number of leading bits used by each leaf (second level)
	6) Size
	   a) Expected number of leaves (N/M)log₂e where N is
	      number of tuples and M is number of keys per index block
	   b) Each leaf on average is .69 full
	   c) Expected directory size is O(N^1+1/M/M)