This exam covers all the material for databases. Make sure you look at the
sample midterm and solutions on the CS465 website to get a feel for what the
format of the exam will be.
Introductory Material
Functions provided by DBMS's
Data Definition Language (DDL)
Data Manipulation Language (DML)
Access Control
Advantages of DBMS's
Disadvantages of DBMS's
Relational Model (Chap 4)
Relational model is the theoretical basis for the DDL
Relational model terminology
Relation/Table/File
Tuple/Row/Record
Attribute/Column/Field
Degree
Cardinality
Know the Properties of a Relation
Keys
Super key
Candidate key
Composite key: More than one attribute
Primary key
Alternate key
Foreign key
Integrity constraints
NULL: represents a value that is currently unknown
Entity integrity
Referential integrity
General constraints
Views
What they are
How they're formed from base relations
Purposes/Advantages
Restrictions on updating views
SQL
Select/Project/Join: These are the basic SQL operations. Know what each
operation does (e.g., select filters rows)
Know how to write queries involving the above operations including
the various filters that can be used in where clauses (e.g., between,
in, relational and boolean operators, like, NULL)
Aggregate operations and group by/having
Restrictions on aggregate operations:
1) can't be used in where clauses
2) only allowed attributes are results of aggregate operations
and attributes in group by
Aggregate operations can be used in Having clauses
1) where filters individual rows
2) having filters groups
Order by: Sorts the result
Joins
Types
Theta
Equi
Natural
Participation
Semi join
Inner join
Outer join
left
right
full
Modification statements
Insert
Update
Delete
Subqueries
Scalar:
--Sometimes are an alternative to joins
--Useful for getting a scalar from an aggregate for use in a
where clause
--Can also use scalar variables (must use views to hold relations
that are intermediate results)
Row: Returns multiple rows/columns
--Only used with EXISTS predicate
Table: Returns a table with one column
--Used with IN
SQL DDL
Specifies
tables, their attributes, and attributes' domains
integrity constraints
views
Domain types not common in programming languages
date
time
timestamp
interval: a period of time
precise real numbers (numeric or decimal--either keyword works)
Know ISO SQL syntax for creating tables, including for constraints
required data
domain constraints: restrictions on values such as for salary or
gender
entity integrity
referential integrity
general constraints
Views
How to create them (use a defining query)
Horizontal versus Vertical view
View maintenance
view resolution: reexecute the query on the base relations each
time
view materialization: cache the view tuples
View updateability (if view is updated then must also update base
relation)
DISTINCT cannot be in defining query
every element in SELECT of defining query is a column name
FROM clause of defining query involves only one table
Defining query has no group by or having clause
Cannot violate integrity constraints on base table
Relational Algebra/Calculus
Relational calculus: declarative, non-procedural language
Relational algebra: high-level, procedural language
Relational calculus is the theoretical basis for the relational model's DML
SQL is based on the relational calculus
Relationally complete: a language is relationally complete if each query
can produce a relation that can be derived using relational calculus
Relational algebra
Fundamental operations
Selection
Projection
Cartesian Product
Union
Set Difference
Convenience operations (know how to derive the convenience operations
from the fundamental operations)
Join
Set Intersection
Set Division
Union compatibility for set union, intersection, and difference
Relational calculus: Only need to know tuple calculus
Predicate versus proposition: predicate is a truth-valued function with
arguments and a proposition is a substitution of concrete values
for the arguments and can be either true or false
predicate generally contains tuple variables that have a range which
is all tuples in a relation
basic syntax: {S | Predicate(S) }: The set of tuples S such that
Predicate(S) is true
Syntax
How to write tuple variables
How to write boolean operations
Existential quantifier: true if there exists at least one tuple for
which the predicate is true
Universal quantifier: true only if the predicate is true for
all tuples over which a tuple variable ranges
Free versus Bound variables
bound variables: tuple variables bound by the existential
or universal quantifier
free variables: not bound by such quantifiers
only free variables can occur to left of the bar
Mapping relational calculus to relational algebra/SQL
Boolean predicates are typically used for Select queries
Existential quantifier used for joins
Universal quantifier and not exists quantifier
used for integrity constraints
Set operations (slide 53 of relational calculus):
Union: Exist quantifier with boolean or
Difference: Not exists quantifier with boolean and
Intersection: Exist quantifier with boolean and
Difference: Universal quantifier with nested exists
quantifier
Be able to:
Write relational algebra queries
Write relational calculus queries
Convert SQL, relational algebra, and relational calculus queries to the
other two forms
Give English language descriptions of relational algebra and relational
calculus queries
ER Design
ER Concepts
Entities
Attributes
Composite vs Simple
Single-valued vs Multi-valued: Multi-valued must go in a separate
relation
Derived
Relationships
Naming
Use action verbs
Don't repeat names
Drawing
Use arrows for binary relationships but undirected edges for
anything else
Use diamonds to represent anything but a binary relationship.
For binary relationships always use a direct edge between
the two entities
Recursive Relationships
Name both sides with the role (e.g., supervisor supervises supervisee)
Constraints
Multiplicity
participation
cardinality
determining multiplicity for complex relationships (above
binary): Provide concrete
instances for all but one entity and then figure out how
many instances of an entity can be associated with these
instances
Degree: Take cardinality of multiplicities for both entities
in the relationship
1:1 : one to one
1:* : one to many
*:* : many to many
Problems with ER Models
Fan traps
Chasm traps
In addition to the above information you should be able to:
1) draw an ER diagram from a case description
2) examine a set of relations and determine which ones represent:
a) Entities
b) Relationships
c) Multi-valued attributes
3) compute multiplicity and degree of a relationship
4) examine an ER diagram and
a) identify fan or connection traps
b) fix fan or connection traps
Normalization
Terms
Normalization
Functional dependency
Full functional dependency
Partial functional dependency
Transitive dependency
Determinant
Purpose of Normalization
1) Reduce inconsistency
2) Minimize redundancy
Important properties of normalization
1) Lossless join
2) Dependency preservation
Types of anomalies
Update
Insert
Delete
Normal Forms
0th Normal Form: Table that contains multi-valued attributes or
repeating groups
1st Normal Form: Every cell contains only one value
2nd Normal Form: 1st Normal Form and every non-primary key attribute
is fully functionally dependent on the primary key
3rd Normal Form: 2nd Normal Form and no non-primary key attribute is
transitively dependent on the primary key
Normalization Algorithm
0th -> 1st NF:
Eliminate multi-valued attributes by moving them to
a separate relation
Separate repeating groups by putting them in separate rows
1st -> 2nd NF: Move attributes in partial dependencies to new
relations. Determinant stays in old relation and becomes foreign
key in new relation.
2nd -> 3rd NF: Move attributes in transitive dependencies to new
relations. Determinant stays in old relation and becomes foreign
key in new relation.
There are also "candidate key" dependencies: These won't be used in
the normalization process but they will be preserved in one of
the final 3rd NF relations.
In addition to knowing the above information, you should be able to:
1) Identify update, delete, and insert anomalies
2) Identify functional dependencies from a case description
3) Transform relations from x Normal Form to (x+1) Normal Form
Physical Design: Mapping relations to files and determining which attributes
should be indexed
Elements of access time for hard drive
1) Seek Time
2) Rotational Latency
Access times for blocks on a disk/SSD are much slower than to RAM so we
need to minimize block accesses when finding tuples
File Organizations
1) Heaps: unordered files
2) B+ trees: used for physically clustering records--good for range
queries
3) Extendible hash tables: generally used for secondary indices--good
for point queries
Types of Indices
1) Primary key index: File physically clustered on primary key: this
is unusual because primary keys are often artificial and hence
useless for range queries
2) Cluster index: File physically clustered on a non-primary key--this
is far more common
3) Secondary Index: Index on a non-clustered attribute(s) that helps
the query optimizer more efficienty locate records
Things to Index
1) Attributes used in joins: primary and foreign keys
2) Frequently accessed attributes
3) Attributes involved in sorting
a) Order by
b) Group by
c) Set operations
d) DISTINCT
4) Attributes used in aggregate functions: can often construct index-only
plans for such queries
Things not to Index
1) Small relations
2) Frequently updated attributes
3) Attributes in which a large proportion of the relation will be
returned (E[#records/block] << 1)
4) Attributes that are long character strings
Indexing is an art not a science because the above guidelines conflict.
You must use judgement and then tweak based on performance data.
B+ Trees: Trees with large branching factors. They store only keys in
interior nodes and tuples in the leaves
M = max number of children per interior node
L = max number of tuples per leaf
1) Interior nodes must have at least M/2 children (hence M/2-1 keys)
2) Leaf nodes must have at least L/2 tuples
3) Root nodes may have 2 to M children
4) Guarantees logM/2N disk accesses
5) When splitting nodes, if there is an extra key/tuple then place
it in the left node
6) When promoting keys and there are two "middle" keys, promote the
leftmost of the two keys
Be able to:
1) insert a value into a B+ tree
2) delete a value from a B+ tree
Extendible Hash Tables: Guarantee that the index is at most two levels
1) Require at most 3 disk accesses--2 for the index and one for the
block containing the tuples
2) Top-level directory is an array (all blocks must be physically
co-located) with "generic" hash keys
3) Second-level directory stores the actual keys
4) D = number of leading bits used by the root directory (top-level)
5) DL = number of leading bits used by each leaf (second level)
6) Size
a) Expected number of leaves (N/M)log2e where N is
number of tuples and M is number of keys per index block
b) Each leaf on average is .69 full
c) Expected directory size is O(N1+1/M/M)