Learning Objectives

Introduction

We know what block devices are and how they work, but what about the data stored on them? It's just groups of bytes, right? A file system organizes the data on a block device. In fact, many file systems take the underlying block device into account, whether it be a flash file system, such as YAFFS or JFFS, or a file system with solid state drive optimizations, such as BTRFS. The choice of file system can have a major impact on performance and size constraints.

As a practical example, I will be covering the Minix 3 file system in this lecture. It mimics many of the original Unix-style file system principles, such as direct, indirect, doubly indirect, and triply indirect pointers.

Blocks

Just like a block device, a file system operates on a "block" concept. However, unlike the block device, where a sector is usually 512 bytes, the most common block size is 1024 bytes (two sectors).

Let's say we have a file that takes 10 bytes and another that takes 4 kilobytes (4096 bytes). The 10-byte file requires one block, but it'll waste \(1024-10=1014\) bytes, whereas the 4 KB file will take four blocks and waste nothing.

This is the block concept. A small block size favors a file system that sees mostly small files, whereas a large block size favors fewer, larger files. The downside to a smaller block size is that the file system itself can address far fewer bytes. We'll see why when we talk about direct and indirect block pointers.

To summarize, it is important to know that every file occupies a whole number of blocks. So, even though we say a file is 10 bytes, on the disk it takes 1024 bytes, or whatever the block size is. If our block size were 4096 bytes, we would use 10 bytes for the data and waste 4086 bytes.
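The arithmetic above can be sketched in a few lines. This is only an illustration; the function names `blocks_needed` and `wasted_bytes` are my own and not part of any Minix 3 code.

```python
def blocks_needed(file_size: int, block_size: int = 1024) -> int:
    """Number of whole blocks required to hold file_size bytes."""
    # Ceiling division using integers only.
    return (file_size + block_size - 1) // block_size


def wasted_bytes(file_size: int, block_size: int = 1024) -> int:
    """Bytes allocated on disk that the file's data does not use."""
    return blocks_needed(file_size, block_size) * block_size - file_size


if __name__ == "__main__":
    for size in (10, 4096):
        print(size, blocks_needed(size), wasted_bytes(size))
```

Passing `block_size=4096` for the 10-byte file reproduces the 4086 wasted bytes mentioned above.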

Bitmaps

There are two bitmaps in the Minix 3 file system: the inode map (imap) and the zone map (zmap). Each bit represents an inode in the imap and a block (zone) in the zmap. If the inode or zone is taken, then the bit will be 1, otherwise the bit will be 0.

The superblock (described below) records how many blocks are used to store each bitmap. With a block size of 1024, each bitmap block holds \(1024\times 8=8192\) bits, so it can track 8192 inodes or zones. So, if we have 200,000 inodes, we need \(\text{ceil}(200,000 / (1024\times 8))=25\) blocks to store the inode map. The same calculation gives the number of blocks for the zone map.

Since we only care about free or taken, we can use one bit to represent an inode or block. This is a space efficient way to represent a free or taken zone or inode.
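As a rough sketch, the bitmap sizing and a bit test might look like the following. The helper names are hypothetical, and the bit order inside each byte (least significant bit first) is an assumption that should be checked against a real file system image.

```python
def bitmap_blocks(count: int, block_size: int = 1024) -> int:
    """Blocks needed to hold one bit for each of `count` inodes or zones."""
    bits_per_block = block_size * 8
    return (count + bits_per_block - 1) // bits_per_block


def is_taken(bitmap: bytes, index: int) -> bool:
    """True if bit `index` is set, meaning the inode or zone is in use."""
    # Assumption: bit i lives in byte i // 8 at bit position i % 8 (LSB first).
    return ((bitmap[index // 8] >> (index % 8)) & 1) == 1


if __name__ == "__main__":
    print(bitmap_blocks(200_000))
    print(is_taken(bytes([0b00000101]), 2))
```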

Index Nodes

Index nodes (inodes) are metadata about a file. We use an inode to find the blocks associated with a file, to see how big the file is (in bytes), to give a file permissions and a type, and so forth. An inode in the Minix 3 file system takes 64 bytes; its fields are described in the paragraphs that follow.

The mode contains a bitfield that stores the permissions using a combined list (lower 9 bits), such as rwxr-xr--. However, the mode also stores what type of file this is. The mode can store S_IFDIR (directory), S_IFREG (regular file), S_IFLNK (symbolic link), and so forth. So, the mode not only contains the permissions, but also the type of file.
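To make the split between type bits and permission bits concrete, here is a small sketch using the constants from Python's standard `stat` module. The functions `file_type` and `permission_string` are illustrative names of my own.

```python
import stat


def file_type(mode: int) -> str:
    """Name the file type stored in the upper bits of the mode."""
    kind = stat.S_IFMT(mode)  # keep only the file-type bits
    if kind == stat.S_IFDIR:
        return "directory"
    if kind == stat.S_IFREG:
        return "regular file"
    if kind == stat.S_IFLNK:
        return "symbolic link"
    return "other"


def permission_string(mode: int) -> str:
    """Render the lower 9 bits as owner, group, other triples."""
    out = ""
    for shift in (6, 3, 0):  # owner, then group, then everybody else
        triple = (mode >> shift) & 0b111
        out += "r" if triple & 4 else "-"
        out += "w" if triple & 2 else "-"
        out += "x" if triple & 1 else "-"
    return out


if __name__ == "__main__":
    mode = 0o040754
    print(file_type(mode), permission_string(mode))
```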

A hard link must exist on the same file system, and it shares the same inode. The purpose of a hard link is to have two different names (usually in two different locations) point to the same data. Hard links are not used much, and symbolic links are more common. Symbolic links, unlike hard links, require their own inode. The data of a symbolic link is the path of whatever it points to, which the file system then resolves to an inode. This is why a symbolic link may exist even if the file it points to does not. Finally, whenever we delete a file, we call a function called unlink. This decreases the number of hard links recorded in the inode. When this count reaches 0, the inode can be safely freed. In other words, there is no delete function in most file systems due to hard links.

The uid (user id) and gid (group id) are 16-bit numbers that identify the user and group who own the file. Recall that with a combined list, we set the owner's permissions first, then the group's permissions, and finally, everybody else's permissions.

The inode is required to store the number of bytes that a file takes. For Minix 3, this is a 4-byte field, so a file has a maximum size of 4 gigabytes. We need the size because we store the file block by block. Remember the 10-byte file above? We can store 10 in this size field and then allocate a 1024-byte block for the file.

The atime, mtime, and ctime fields store 4-byte UNIX timestamps of the file's last access, last modification, and last inode (status) change, respectively. Note that in Unix-style file systems ctime is the change time, not a creation time.

The zones are described below, but these point to the blocks that belong to this inode--where the data can be found.
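Putting the fields above together, a 64-byte inode could be unpacked roughly as follows. This is a sketch under assumptions: the field order, the 16-bit link count (`nlinks`), and the little-endian byte order are my reading of the description above, and `parse_inode` is a hypothetical helper.

```python
import struct

# Assumed layout: mode, nlinks, uid, gid (u16 each); size, atime, mtime,
# ctime (u32 each); then 10 zone pointers (u32 each). "<" = little-endian.
INODE_FORMAT = "<4H4I10I"
INODE_SIZE = struct.calcsize(INODE_FORMAT)  # 8 + 16 + 40 = 64 bytes


def parse_inode(raw: bytes) -> dict:
    """Decode one on-disk inode into a dictionary of its fields."""
    fields = struct.unpack(INODE_FORMAT, raw[:INODE_SIZE])
    mode, nlinks, uid, gid, size, atime, mtime, ctime = fields[:8]
    return {
        "mode": mode,
        "nlinks": nlinks,
        "uid": uid,
        "gid": gid,
        "size": size,
        "atime": atime,
        "mtime": mtime,
        "ctime": ctime,
        "zones": list(fields[8:]),
    }


if __name__ == "__main__":
    # A made-up 10-byte regular file whose data lives in zone 5.
    sample = struct.pack(INODE_FORMAT, 0o100644, 1, 0, 0, 10, 0, 0, 0,
                         5, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    print(parse_inode(sample))
```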

The inode array follows the zone map blocks, so we can use the following formula to find the byte offset of an inode from its inode number (array index + 1). Each inode is 64 bytes, which I use as a constant below.

$$ \text{inode byte} = \text{block size} \times (2+\text{imap blocks}+\text{zmap blocks})+64\times (\text{number - 1}) $$

We're skipping the boot block and super block (2 blocks), plus the imap blocks, plus the zmap blocks. Then, each inode is 64 bytes, so we multiply by 64 to get to the correct inode. Inode number 0 is invalid and is reserved for a null inode. Finally, inode #1 is the root directory inode. We can use inode #1 to find all other directory entries and inodes.
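The formula translates directly into code. In this sketch, `inode_byte_offset` is a name I made up; its parameters correspond to the values read from the super block.

```python
INODE_SIZE = 64  # bytes per on-disk inode


def inode_byte_offset(number: int, block_size: int,
                      imap_blocks: int, zmap_blocks: int) -> int:
    """Byte offset of inode `number`, counted from the start of the device."""
    if number < 1:
        raise ValueError("inode 0 is the null inode and has no slot on disk")
    # Skip the boot block, the super block, and both bitmaps.
    table_start = block_size * (2 + imap_blocks + zmap_blocks)
    return table_start + INODE_SIZE * (number - 1)


if __name__ == "__main__":
    # Root directory inode with 1 KiB blocks and one block per bitmap.
    print(inode_byte_offset(1, 1024, 1, 1))
```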

Boot Block

The boot block is reserved for a boot loader. In most PCs, the boot loader is within the first sector (512 bytes) and is automatically loaded by the BIOS. Many file systems reserve an entire block for the boot block. For example, if we have a 1024-byte block size, then the boot block is 1024 bytes.

This space is not used by the file system, and instead is used by the operating system or BIOS for booting (where applicable).

The Super Block

The super block comes right after the boot block. This also is an entire block, although the actual structure in the Minix 3 file system is only 32 bytes, so quite a bit is wasted. The super block describes the file system, including the block size, the number of inodes, and so forth.

The super block is created when the file system is created. Many file systems cannot be resized, and the super block is a constant data structure. So, when creating a file system on a disk, we need to know the entire size of the disk and divide it up into blocks. Then, we store the boot block first, then the super block, then the inode and zone bitmaps, and then the inodes. The number of inodes must be determined when the file system is created. This can limit the number of files we can have.

The super block shows how you can navigate around the file system and also identifies the file system itself. A 16-bit value, called the magic, is a special sequence of bytes. For Minix 3, this value will be \(4d5a_{16}\). If we read a super block and this field is NOT 0x4d5a, we can assume that it is not a valid Minix 3 file system. Every file system has its own magic sequence.
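A sketch of reading the 32-byte structure and checking the magic is shown below. The field names and their order reflect my understanding of the Minix 3 on-disk layout and are assumptions here, as are the little-endian byte order and the helper name `parse_superblock`; verify them against the Minix 3 headers before relying on them.

```python
import struct

# Assumed layout (little-endian): ninodes u32, pad u16, imap_blocks u16,
# zmap_blocks u16, first_data_zone u16, log_zone_size u16, pad u16,
# max_size u32, zones u32, magic u16, pad u16, block_size u16,
# disk_version u8, plus one trailing pad byte to reach 32 bytes.
SUPERBLOCK_FORMAT = "<I6H2I3HBx"
SUPERBLOCK_SIZE = struct.calcsize(SUPERBLOCK_FORMAT)  # 32
MINIX3_MAGIC = 0x4D5A


def parse_superblock(raw: bytes) -> dict:
    """Decode the super block, rejecting anything without the Minix 3 magic."""
    (ninodes, _pad0, imap_blocks, zmap_blocks, first_data_zone,
     log_zone_size, _pad1, max_size, zones, magic, _pad2,
     block_size, disk_version) = struct.unpack(
        SUPERBLOCK_FORMAT, raw[:SUPERBLOCK_SIZE])
    if magic != MINIX3_MAGIC:
        raise ValueError(f"bad magic 0x{magic:04x}: not a Minix 3 file system")
    return {
        "ninodes": ninodes,
        "imap_blocks": imap_blocks,
        "zmap_blocks": zmap_blocks,
        "first_data_zone": first_data_zone,
        "log_zone_size": log_zone_size,
        "max_size": max_size,
        "zones": zones,
        "block_size": block_size,
        "disk_version": disk_version,
    }
```

The returned `block_size`, `imap_blocks`, and `zmap_blocks` are exactly the inputs needed to locate the inode array.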

Direct Zones

Minix 3 calls its blocks zones, although this terminology isn't used by all Unix-style file systems. Zone pointers are 4-byte values that hold the number of a block. A zone pointer of 0 is reserved to mean an unallocated zone, so 0 never refers to real data. An unallocated zone can occur if the file isn't big enough to need every zone pointer, or after a series of expansions and contractions.

So, we can find the block using the formula (for non-zero zone pointer values): $$\text{block offset} = \text{zone pointer value} \times \text{block size}$$

What's nice about Minix 3's zone pointers is that they are absolute block numbers, so they don't have to be adjusted for everything else in the file system, such as the number of zone map and inode map blocks. Instead, you just multiply the pointer by the block size and get the byte offset of the block this zone refers to.

After applying the formula, we are now looking at a block with our data. We need to read the entire block, but only care about the amount of information addressable by the size of the file (stored in the inode).
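Here is a minimal sketch of reading a small file through its direct zones, treating the whole device as an in-memory `bytes` object. The function `read_direct` and its arguments are hypothetical; a real driver would issue block reads instead of slicing.

```python
def read_direct(image: bytes, direct_zones: list, file_size: int,
                block_size: int = 1024) -> bytes:
    """Read file data through the direct zone pointers only."""
    data = bytearray()
    remaining = file_size
    for zone in direct_zones:
        if remaining <= 0:
            break
        if zone == 0:
            continue  # unallocated zone pointer: skip it
        start = zone * block_size  # absolute byte offset of the block
        block = image[start:start + block_size]  # read the entire block
        take = min(remaining, block_size)  # keep only what the size allows
        data += block[:take]
        remaining -= take
    return bytes(data)


if __name__ == "__main__":
    image = bytearray(4 * 1024)
    image[2048:2058] = b"helloworld"  # file data stored in zone 2
    print(read_direct(bytes(image), [2, 0, 0, 0, 0, 0, 0], 10))
```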

In the Minix 3 file system, there are 7 direct pointers, which means we can address up to \(\text{block size} \times 7\) bytes for a file. For a 1024-byte block size, we're looking at only 7 kilobytes. Nowhere near enough.

Indirect Zones

We need something more powerful than direct zones to carry the weight of larger files. This is where the indirect zones come into play. For the Minix 3 file system, we have three types of indirect zones: (1) singly indirect, (2) doubly indirect, and (3) triply indirect.

Indirect means that the pointer points to a block where other pointers can be found, much like a double pointer u32 **pointer;. So, for a singly indirect pointer, the block does not contain the file's data; it contains other pointers, which in turn point to the file's data. For a doubly indirect pointer (u32 ***pointer;), the pointer points to a block of pointers. Each of those points to yet another block of pointers, and in there, each pointer points to a block where the data for the file can be found. Triply indirect is this + 1 more (u32 ****pointer;)! Each pointer is 4 bytes, so each block holds \(\text{block size} / 4\) pointers.

To calculate how much data we can address using this scheme, we need to know the block size and how many pointers fit in one block. For demonstration purposes, I'm going to use a block size of 1024 bytes. Recall that each pointer is 4 bytes, so a block size of 1024 can contain 256 pointers.
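The capacity of each level can be written out as follows. This is a reconstruction from the pointer counts above, using \(N = \text{block size} / 4\) as my own shorthand for the number of pointers per block:

$$ \text{max bytes} = \text{block size} \times \left(7 + N + N^2 + N^3\right) $$

For a block size of 1024 bytes (\(N = 256\)), the four terms are:

$$ \begin{aligned} \text{direct} &= 7 \times 1024 = 7{,}168 \text{ bytes (7 KiB)} \\ \text{singly indirect} &= 256 \times 1024 = 262{,}144 \text{ bytes (256 KiB)} \\ \text{doubly indirect} &= 256^2 \times 1024 = 67{,}108{,}864 \text{ bytes (64 MiB)} \\ \text{triply indirect} &= 256^3 \times 1024 = 17{,}179{,}869{,}184 \text{ bytes (16 GiB)} \end{aligned} $$

The sum is 17,247,247,360 bytes.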

Adding the direct, singly indirect, doubly indirect, and triply indirect capacities together supports a file of up to about 17 gigabytes (roughly 16 GiB). So, in this case, our limitation is the 4-byte size field in the inode. Essentially, using this system, we can address all 4 gigabytes that a file is allowed to have.

Recall that these pointers can hold any zone number and that zone number 0 means unallocated. Therefore, if a zone pointer is 0, we skip it and pretend it isn't even there.

Our maximum using the calculations above is about 16 GiB, but this number is misleading. Since we're using a 4-byte size in the inode, our absolute maximum file size is capped at 4 GiB no matter the block size. However, recall that if the zone number is 0, then it is NULL, so you skip it and move to the next zone. Because of this, a file's allocated zones can be scattered throughout the pointer tree, so the triply indirect pointer may still be in use even though the file itself never exceeds 4 GiB.
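The skip-on-zero rule and the three levels of indirection fit naturally into a short recursive walk. The sketch below assumes the inode's 10 zone pointers are ordered as 7 direct, then singly, doubly, and triply indirect, and that pointers are little-endian; `collect_zones` and `file_zones` are hypothetical helpers operating on an in-memory image.

```python
import struct


def collect_zones(image: bytes, zone: int, depth: int,
                  block_size: int = 1024) -> list:
    """Data zone numbers reachable from `zone`; depth 0 is a data zone."""
    if zone == 0:
        return []  # NULL pointer: pretend it isn't there
    if depth == 0:
        return [zone]
    start = zone * block_size
    count = block_size // 4  # 4-byte pointers per block
    pointers = struct.unpack(f"<{count}I", image[start:start + block_size])
    found = []
    for pointer in pointers:
        found.extend(collect_zones(image, pointer, depth - 1, block_size))
    return found


def file_zones(image: bytes, zones: list, block_size: int = 1024) -> list:
    """All data zones of a file, given the 10 zone pointers from its inode."""
    # Assumed order: 7 direct, then singly, doubly, triply indirect.
    depths = [0] * 7 + [1, 2, 3]
    result = []
    for zone, depth in zip(zones, depths):
        result.extend(collect_zones(image, zone, depth, block_size))
    return result
```

Each returned zone number can then be multiplied by the block size to locate the data, with the inode's size field deciding how many bytes of the final block are meaningful.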

Directory Entries

Notice that the inode does not store the name of a file. This is because we can have multiple names for a given inode (hard links, remember?). So, instead, we use a directory entry structure, which contains a 4-byte inode number and a 60-byte name.

When we come across an inode, we check its mode to see if it is a regular file or a directory. If it is a directory, then the blocks that the zones refer to contain directory entries, one after another. Each block can store \(\text{block size} / 64\) directory entries. A large directory can outgrow the direct zones, so we still need to check the indirect pointers for directories.

Recall that the 16-bit mode in the inode tells you whether this is a directory or a file. A directory's mask is \(0100\,0000\,0000\,0000_2\) or \(040000_8\). A regular file's mask is \(1000\,0000\,0000\,0000_2\) or \(100000_8\). Compare the whole type field (the top four bits) rather than testing a single bit, because other types share bits with these masks; a symbolic link, for example, is \(120000_8\).
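Decoding one directory block might look like the sketch below. It assumes little-endian inode numbers, NUL-padded names, and that an entry with inode number 0 marks an unused slot (consistent with inode 0 being the null inode); `parse_dir_block` is a hypothetical helper.

```python
import struct

DIRENT_FORMAT = "<I60s"  # 4-byte inode number followed by a 60-byte name
DIRENT_SIZE = struct.calcsize(DIRENT_FORMAT)  # 64


def parse_dir_block(block: bytes) -> list:
    """Return (inode number, name) pairs for the used entries in one block."""
    entries = []
    for offset in range(0, len(block) - DIRENT_SIZE + 1, DIRENT_SIZE):
        inode, raw_name = struct.unpack_from(DIRENT_FORMAT, block, offset)
        if inode == 0:
            continue  # assumption: inode 0 means the slot is unused
        # Names shorter than 60 bytes are assumed to be NUL-padded.
        name = raw_name.split(b"\0", 1)[0].decode("utf-8", errors="replace")
        entries.append((inode, name))
    return entries


if __name__ == "__main__":
    block = struct.pack(DIRENT_FORMAT, 1, b".")
    block += struct.pack(DIRENT_FORMAT, 7, b"hello.txt")
    block += bytes(1024 - len(block))  # rest of the block is empty
    print(parse_dir_block(block))
```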