Physical Database Design

  1. Goals of Physical Design

    1. Map relations into files on the file system
    2. Select indices to aid the query processor

  2. Basic File Organization on Disk

    1. Block storage
    2. Elements of access time for hard drive

      1. Seek time: time to move mechanical read/write arm to the appropriate track
      2. Rotational latency: time for appropriate sector of track to rotate under the read/write arm
      3. 2017 seek times (source: wikipedia): Random access times have not improved significantly in recent years

        1. 3-4 ms for high end (3000-4000 microseconds)
        2. 9 ms for average desk top
        3. 12-15 ms for average mobile device or laptop

      4. 2017 rotational latency times: 2-7.15ms (source: wikipedia)
      5. 2017 transfer rates:

        1. consumer grade hard disk drive: 200 MB/second for both read and write (source: User Benchmark).
        2. enterprise grade hard disk drive: 200-300MB/second for both read and write (source: Tom's IT Pro.

    3. Solid State Devices (SSD): Uses persistent solid state memory to store data.
      1. Cost: About 24 cents per gigabyte in 2017, which is about 4 times the cost of a gigabyte for a hard disk drive.
      2. Access times for Solid State Device (SSD) (source: wikipedia).

        1. < 0.1ms seek time (no latency because no rotation)
        2. transfer rates
          1. 200-2500MB per second
          2. write times for less expensive SSD's can be 10 times slower than read rates
          3. write times for higher end SSD's is comparable to read times

  3. Basic Organizations

    1. Heap (unordered)
    2. B+ tree
    3. Extendible Hash

  4. Indices

    1. Importance of a cluster key: improves performance of range queries
    2. Types of indices

      1. Primary Index: File is physically clustered using the primary key
      2. Cluster Index: File is physically clustered on a key other than the primary key--Often you do not cluster on the primary key if it is an artificial key since an artificial key is unlikely to be used in range queries
      3. Secondary Index: Index on a non-clustered attribute(s) that helps the query optimizer more efficienty locate records
    3. Some queries, especially aggregate queries, might be solvable from the index alone

  5. Considerations in Choosing Indices

    1. Things to Index
      1. Index PK of a relation if it is not a cluster key of the file organization.
      2. Add secondary index to a FK since it is involved in joins
      3. Add secondary index to any attribute heavily used in selection queries
      4. Add secondary index on attributes involved in:

        1. ORDER BY;
        2. GROUP BY; and
        3. other operations involving sorting (such as UNION or DISTINCT).
      5. Add secondary index on attributes involved in aggregate functions.
      6. Add secondary index on attributes that could result in an index-only plan.
    2. Things Not to Index
      1. small relations.
      2. an attribute or relation that is frequently updated.
      3. an attribute if the query will retrieve a significant proportion of the relation.
        1. If E[#records/block] << 1, then on average only a few blocks will be retrieved and the index is worth it
        2. If E[#records/block] is roughly 1, then on average you will access every block in the file and the index is a waste of time
      4. Avoid indexing attributes that consist of long character strings.

  6. Transactions To Analyze

    1. It's difficult to analyze all transactions so try to concentrate on a few that will be frequently executed