Learning Objectives
- Understand the elements of a virtual queue.
- Understand the fields in a virtual queue descriptor.
- Understand the available and used rings.
- Be able to calculate the next available or used ring index.
- Understand how to use MMIO to create a communication channel.
- Be able to read from MMIO the number for a block device.
- Understand block device packets for making requests.
- Be able to fill a request header, data, and status.
- Be able to use three descriptors to make a block device request.
- Understand the overview of how a block request is made.
Introduction
The virt I/O protocol is a device communication protocol for virtual devices, such as hard drives, mice, keyboards, and so forth. It is mainly used by virtual machines so that the guest and the host can communicate.
Virtual Queues
Recall that block I/O uses the descriptor pattern, where we fill out a structure somewhere in memory and then we point the device to that memory address. Using RAM as the common communication point makes it simple. Since we can control when the device is notified, we can control for race conditions.
A virtual queue contains three parts:
- An array of descriptors
- The available ring (OS to device)
- The used ring (device to OS)
Descriptors
The descriptors contain certain information, such as an address, the length of the buffer at that address, certain flags, and other information. As you can see, with this descriptor we can point the device to any buffer's memory address in RAM.
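struct Descriptor {
    u64 address;  /* physical address of the buffer */
    u32 length;   /* size of the buffer in bytes */
    u16 flags;    /* VIRTQ_DESC_F_WRITE and/or VIRTQ_DESC_F_NEXT */
    u16 next;     /* index of the next descriptor, if chained */
};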
The descriptor above has a 64-bit address so that we can tell the device a memory location anywhere within a 64-bit address space. We also give it a length, so that when the device goes to that memory address, it knows how much of that memory is for the device. The flags control this descriptor. The first flag is VIRTQ_DESC_F_WRITE. This gives the device permission to write to the memory address specified by the address field. The second flag is VIRTQ_DESC_F_NEXT. This tells the device that we've chained multiple descriptors. Chaining gives us flexibility because we can now give pieces of information to the device in non-contiguous memory. The device will treat all of the chained descriptors together as one request. The last field, next, tells the device the index of the next descriptor. The device only reads this field if VIRTQ_DESC_F_NEXT is specified. Otherwise, this field has no effect.
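To make the chaining concrete, here is a minimal sketch that links two descriptors. The helper name chain_two and the indices used are made up for illustration; the flag values (1 for NEXT, 2 for WRITE) are the ones given in the virt I/O specification.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t u16;
typedef uint32_t u32;
typedef uint64_t u64;

#define VIRTQ_DESC_F_NEXT  1  /* the next field is valid */
#define VIRTQ_DESC_F_WRITE 2  /* the device may write to the buffer */

struct Descriptor {
    u64 address;
    u32 length;
    u16 flags;
    u16 next;
};

/* Hypothetical helper: chain desc[first] -> desc[second]. */
static void chain_two(struct Descriptor *desc, u16 first, u16 second,
                      u64 addr_a, u32 len_a, u64 addr_b, u32 len_b)
{
    assert(first != second);

    desc[first].address = addr_a;
    desc[first].length  = len_a;
    desc[first].flags   = VIRTQ_DESC_F_NEXT;   /* read-only for the device */
    desc[first].next    = second;

    desc[second].address = addr_b;
    desc[second].length  = len_b;
    desc[second].flags   = VIRTQ_DESC_F_WRITE; /* end of chain, device-writable */
    desc[second].next    = 0;                  /* ignored: NEXT is not set */
}
```

Notice that the two descriptors do not need to be neighbors in the array; the next field is what ties them together.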
Available Ring
When we want to make a request, we fill out a descriptor, and then we place the index of that descriptor into the available ring. When the device gets the notification signal, it will check the available ring to see which descriptors it needs to read. Remember that all of this information, including the descriptors and the available ring, is stored in RAM.
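struct AvailableRing {
    u16 flags;            /* VIRTQ_AVAIL_F_NO_INTERRUPT or 0 */
    u16 index;            /* where the OS will place its next entry */
    u16 ring[NUM_RINGS];  /* descriptor indices for the device to read */
};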
The flags field can tell the device to change its behavior based on the available ring. The only flag we can specify here is VIRTQ_AVAIL_F_NO_INTERRUPT. This tells the device that after it has serviced this request, it should NOT interrupt us. Instead, we, as the OS, would be responsible for polling the used ring to see when the device is finished with our request.
The NUM_RINGS constant is negotiated by the device and the operating system when the operating system initializes the device. The device has two registers for this, QueueNum and QueueNumMax. The OS can ask the device what its limit is by reading from the QueueNumMax register. Then, we as the OS can respond by writing the size we want to QueueNum. QueueNum must be less than or equal to QueueNumMax, otherwise the device might reject the request.
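The negotiation can be sketched as below. The register offsets are taken from the legacy virt I/O MMIO register layout; the helper names and the idea of clamping our wanted size to the maximum are assumptions made for this example. A real driver would also select the queue it is configuring before doing this.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;

/* Byte offsets from the start of the device's MMIO region. */
#define VIRTIO_MMIO_QUEUE_NUM_MAX 0x034
#define VIRTIO_MMIO_QUEUE_NUM     0x038

static u32 mmio_read(const volatile u32 *base, u32 offset)
{
    return base[offset / 4];
}

static void mmio_write(volatile u32 *base, u32 offset, u32 value)
{
    base[offset / 4] = value;
}

/* Hypothetical helper: returns the size written to QueueNum,
 * or 0 if the device reports that the queue is not available. */
static u32 negotiate_queue_size(volatile u32 *base, u32 wanted)
{
    assert(wanted != 0);

    u32 max = mmio_read(base, VIRTIO_MMIO_QUEUE_NUM_MAX);
    if (max == 0) {
        return 0;
    }

    /* Never ask for more than the device says it can handle. */
    u32 size = wanted < max ? wanted : max;
    mmio_write(base, VIRTIO_MMIO_QUEUE_NUM, size);
    return size;
}
```

In a kernel, base would be the device's MMIO address; for experimenting on a host machine, an ordinary array standing in for the registers works just as well.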
We use a ring because it lets us have many outstanding requests that the device can then service when it finds the time. If we had only a single descriptor, we would have to submit that request, wait for the device, and then submit the next request. This means that the CPU and the operating system would have to pause waiting on I/O.
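Placing a descriptor index into the available ring, as described at the start of this section, might look like the sketch below. The helper name avail_push and the fixed queue size of 16 are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t u16;

#define NUM_RINGS 16  /* assumed queue size for this sketch */

struct AvailableRing {
    u16 flags;
    u16 index;
    u16 ring[NUM_RINGS];
};

/* Hypothetical helper: hand one descriptor (the head of a chain)
 * to the device by placing its index in the available ring. */
static void avail_push(struct AvailableRing *avail, u16 head)
{
    assert(head < NUM_RINGS);

    /* The index keeps counting up; only the slot wraps around. */
    avail->ring[avail->index % NUM_RINGS] = head;

    /* A real driver needs a memory barrier here so that the device
     * never sees the new index before it sees the ring entry. */
    avail->index = (u16)(avail->index + 1);
}
```

Several requests can be pushed back to back before the device is notified, which is exactly the benefit of having a ring.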
Used Ring
The used ring is where the device can send information to the OS. It is usually used by the device to tell the OS that it has completed a request. The used ring is much like the available ring, except we as the OS are required to look at the ring to see which of our descriptors have been serviced.
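struct UsedRing {
    u16 flags;                   /* VIRTQ_USED_F_NO_NOTIFY or 0 */
    u16 index;                   /* where the device will place its next entry */
    struct Elem ring[NUM_RINGS]; /* completed requests */
    u16 avail_event;             /* only used with event-index suppression */
};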
The used ring has its own flags field, which can only be given the flag VIRTQ_USED_F_NO_NOTIFY. This is the mirror image of the available ring's flag: if it is set, the device is telling the operating system that it does not need to be notified when the OS adds new entries to the available ring.
The index field specifies the index of the first unused element in the ring. We can use this because we keep an internal store of the index too. Therefore, if our index and this index field are not the same, we know that we have some outstanding requests that we need to read. Every time we read from the used ring, we increase our internal index. Whenever our internal index and the used ring's index are the same, we know we've read all of the data. The device is the only one who should write to the index field.
struct Elem {
    u32 id;   /* index of the head descriptor of the completed request */
    u32 len;  /* number of bytes the device wrote */
};
Both the device and the operating system keep track of where we are in both rings. This makes sure the operating system and the device are talking about the same request. Notice that instead of just numbers, the used ring contains a structure called an element. This structure contains two fields, an id and a length. The id is the index number of the descriptor the device is responding to. The length is the number of bytes that the device wrote while responding to the request.
Typically, the virt I/O device will interrupt the operating system when it is finished with any request. Then, our OS handler can look at the used ring and check the ID of each element. Recall that the ID is the index number of the descriptor that the device is responding to. Therefore, we can see what the response is for and then send that response to the appropriate process.
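The comparison between our internal index and the device's index can be sketched as follows. The helper name used_drain, the output array, and the queue size of 16 are assumptions for this example; in a real driver the used ring would be read through a volatile pointer because the device changes it behind our back.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t u16;
typedef uint32_t u32;

#define NUM_RINGS 16  /* assumed queue size for this sketch */

struct Elem {
    u32 id;
    u32 len;
};

struct UsedRing {
    u16 flags;
    u16 index;
    struct Elem ring[NUM_RINGS];
    u16 avail_event;
};

/* Hypothetical helper: copy the ids of all unread elements into ids[]
 * and advance our internal index. Returns how many were read. */
static u16 used_drain(const struct UsedRing *used, u16 *last_seen,
                      u32 *ids, u16 max_ids)
{
    assert(used && last_seen && ids);

    u16 count = 0;
    /* If our index differs from the device's, there is unread data. */
    while (*last_seen != used->index && count < max_ids) {
        const struct Elem *e = &used->ring[*last_seen % NUM_RINGS];
        ids[count++] = e->id;
        *last_seen = (u16)(*last_seen + 1);
    }
    return count;
}
```

An interrupt handler could call this and then look up which process is waiting on each returned id.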
Rings Mean Circular Buffers
The available and used rings are called "rings" because they are circular buffers. Recall that we get the number of elements in a ring through the QueueNum register. Say, for example, that there are 16 elements in the ring. Therefore, we can use 16 simultaneous descriptors to send requests or do whatever we want. That would be descriptors 0, 1, 2, ..., 15. However, when we want to make our next request, we have to wrap around and use ring slot 0 again. So, the calculation is always index % size, where index is our running ring index and size is the number of elements in the ring.
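As a tiny worked example of the wrap-around calculation, assuming a ring of 16 elements (the helper name ring_slot is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t u16;

/* Hypothetical helper: map a running index onto a slot in the ring. */
static u16 ring_slot(u16 index, u16 size)
{
    assert(size != 0);
    return (u16)(index % size);
}
```

With a size of 16, index 15 maps to slot 15, index 16 wraps around to slot 0, and index 17 maps to slot 1.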
QueuePFN
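We have to tell the device where to look in memory for all of the descriptors, the available ring, and the used ring. There is an MMIO register called QueuePFN (the queue's Page Frame Number), which tells the device where the structure can be found in RAM. The register holds a page number, that is, the physical address of the queue divided by the page size. This must be computed from a physical address, not a virtual one.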
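In the legacy MMIO interface, QueuePFN takes a page number rather than a byte address, so the value to write can be computed as in this sketch. The helper name queue_pfn is hypothetical, and the 4096-byte page size in the usage note is an assumption.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;

/* Hypothetical helper: turn the queue's physical address into the
 * page number that would be written to QueuePFN. */
static u32 queue_pfn(u64 phys_addr, u32 page_size)
{
    /* The queue must start on a page boundary. */
    assert(page_size != 0 && phys_addr % page_size == 0);
    return (u32)(phys_addr / page_size);
}
```

For example, a queue placed at physical address 0x80010000 with 4096-byte pages gives a page number of 0x80010.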
Now that we've negotiated the queue size and pointed the device to the memory address where we created our queue, we now have a communication channel from the OS to the device (and vice-versa).
Hopefully, this illustrates that we use the normal MMIO registers to set up the queues and so forth, but then use the descriptor pattern (a common memory address for both OS and device to use) to set up a common channel for communication.
VirtIO Block Device
The descriptors and rings are for ANY device connected to the virt I/O bus. However, now we want to control a specific device on the virt I/O bus, that is, we want to control a block device (i.e. a disk).
When we probe the MMIO bus for block devices, we will first read the magic register, which is 4 bytes. The value should be "virt" to tell us that this is a virtual device. Then, we need to read the device identifier. If this identifier is 0, that means that no device is connected to this bus. Otherwise, this identifier tells us what type of device this is. A block device is assigned the identifier 2. Therefore, if we are scanning the virt I/O bus, we can forward any device numbered 2 to the block driver.
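The probing steps above can be sketched as follows. The register offsets come from the virt I/O MMIO register layout, and 0x74726976 is the string "virt" read as a little-endian 32-bit value; the enum and the helper name probe_slot are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;

#define VIRTIO_MMIO_MAGIC_VALUE 0x000
#define VIRTIO_MMIO_DEVICE_ID   0x008
#define VIRTIO_MAGIC            0x74726976u /* "virt", little-endian */
#define VIRTIO_DEVICE_BLOCK     2

enum ProbeResult {
    PROBE_NOT_VIRTIO, /* magic value did not match */
    PROBE_EMPTY,      /* device identifier 0: nothing connected */
    PROBE_BLOCK,      /* device identifier 2: block device */
    PROBE_OTHER       /* some other virt I/O device */
};

/* Hypothetical helper: classify one MMIO slot. */
static enum ProbeResult probe_slot(const volatile u32 *base)
{
    assert(base);

    if (base[VIRTIO_MMIO_MAGIC_VALUE / 4] != VIRTIO_MAGIC) {
        return PROBE_NOT_VIRTIO;
    }

    u32 id = base[VIRTIO_MMIO_DEVICE_ID / 4];
    if (id == 0) {
        return PROBE_EMPTY;
    }
    return id == VIRTIO_DEVICE_BLOCK ? PROBE_BLOCK : PROBE_OTHER;
}
```

A bus scan would call this once per slot and forward every PROBE_BLOCK slot to the block driver.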
All of this information comes from the virt I/O specification. Below shows a part of the documentation.
Block Device Packets
The requests to the block device are made by using a request packet. Recall that generic virt I/O uses the descriptors and available ring to make a request. Also recall that the descriptor has an address, a length field, and a write flag to give the device permission to write to the memory address.
We will be making a block I/O request using three descriptors to split our packet into three uses: (1) the header, (2) the buffer, and (3) the response status. Recall that we can chain these three descriptors together using the flag VIRTQ_DESC_F_NEXT. We will use this flag for the header and the buffer. Also recall that we must set the next field to the index of the next descriptor. This allows us to use non-contiguous descriptors for a single request.
Block Request Header
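struct BlockRequestHeader {
    u32 type;      /* VIRTIO_BLK_T_IN, VIRTIO_BLK_T_OUT, or VIRTIO_BLK_T_FLUSH */
    u32 reserved;  /* padding */
    u64 sector;    /* starting sector (512-byte units) */
};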
The header lets the device know what to expect when it starts filling a request. The type field specifies the direction. VIRTIO_BLK_T_IN specifies that we want to read from the block device. VIRTIO_BLK_T_OUT specifies that we want to write to the block device. Finally, VIRTIO_BLK_T_FLUSH tells the device to synchronize all reads and writes so that what is in memory is the same as what's on the disk.
The reserved portion is used to pad the header to 16 bytes and move the 64-bit sector field to the correct place.
The sector is the sector at which we want to start reading. Recall that one sector is 512 bytes. So, if we want to read 2200 bytes starting at byte offset 7120, we have to start reading at sector \(s=\lfloor 7120/512 \rfloor=13\).
struct BlockRequestData {
    u8 data[][512];
};
The size of the data field must be a multiple of 512 bytes, but we can specify the size of the data using the virt I/O descriptor (recall the length field).
Using the example above, we start reading at sector 13, but we want 2200 bytes. Since we must start at a sector boundary and read a multiple of 512 bytes (each sector is 512 bytes), the math gets a little tricky here. We read starting at sector 13, but we start copying at byte offset \(7120~\text{MOD}~512=464\) within that sector. Recall that sector 13 contains bytes 6656 (\(13~\times~512\)) through 7167 (\(14~\times~512-1\)). The last byte we want is byte \(7120+2200-1=9319\), which falls in sector \(\lfloor 9319/512 \rfloor=18\), so the request has to cover sectors 13 through 18, which is 6 sectors or 3072 bytes.
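The sector arithmetic can be written out as a small helper. The struct and function names here are hypothetical; the only fact assumed is the 512-byte sector size used throughout this section.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;

#define SECTOR_SIZE 512

struct SectorSpan {
    u64 first_sector; /* sector to put in the request header */
    u32 skip;         /* bytes to skip at the front of the buffer */
    u64 num_sectors;  /* whole sectors that must be transferred */
};

/* Hypothetical helper: which sectors cover [offset, offset + length)? */
static struct SectorSpan sector_span(u64 offset, u64 length)
{
    assert(length > 0);

    struct SectorSpan span;
    u64 last_sector = (offset + length - 1) / SECTOR_SIZE;

    span.first_sector = offset / SECTOR_SIZE;
    span.skip         = (u32)(offset % SECTOR_SIZE);
    span.num_sectors  = last_sector - span.first_sector + 1;
    return span;
}
```

For the running example, sector_span(7120, 2200) starts at sector 13, skips 464 bytes, and covers 6 sectors, so the data buffer must be 6 × 512 = 3072 bytes even though we only want 2200 of them.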
struct BlockRequestStatus {
    u8 status;  /* written by the device: 0 (OK), 1 (IOERR), or 2 (UNSUPP) */
};
The status is just one byte, but the block device will put its particular read or write status in this byte. If the device doesn't change the status, that means it could not handle the request or that the request is still ongoing. However, if our status is zero (0), that means that the request succeeded. The status must always be initialized to some value other than 0 (OK), 1 (IOERR), or 2 (UNSUPP). Otherwise, if we initialize status to 0, we won't know whether the device wrote 0 into there to signal OK, or if it's still ongoing.
Recall that when we look at the used ring, we're going to have to see which descriptor the device is responding to. When we get that descriptor, we can follow the chain to its third descriptor and read the status. Recall that the status is just a 1-byte value whose memory address we sent to the block device.
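Checking the status byte could look like the sketch below. The values 0, 1, and 2 are the ones listed above; the sentinel 0xFF, the enum, and the helper name check_status are assumptions for illustration (any value other than 0, 1, or 2 would work as the sentinel).

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t u8;

#define VIRTIO_BLK_S_OK     0
#define VIRTIO_BLK_S_IOERR  1
#define VIRTIO_BLK_S_UNSUPP 2
#define STATUS_PENDING      0xFF /* hypothetical sentinel, not a device code */

enum RequestState { REQ_PENDING, REQ_OK, REQ_FAILED };

/* Hypothetical helper: interpret the byte the device may have written. */
static enum RequestState check_status(const volatile u8 *status)
{
    assert(status);

    u8 s = *status;
    if (s == STATUS_PENDING) {
        return REQ_PENDING; /* the device has not touched it yet */
    }
    if (s == VIRTIO_BLK_S_OK) {
        return REQ_OK;
    }
    return REQ_FAILED;      /* IOERR, UNSUPP, or anything unexpected */
}
```

Because the byte starts out as the sentinel, a 0 can only mean that the device wrote it.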
struct FullBlockRequest {
    struct BlockRequestHeader header;  /* first descriptor */
    struct BlockRequestData data;      /* second descriptor */
    struct BlockRequestStatus status;  /* third descriptor */
};
As you can see above, our full block request is actually three smaller portions, the header, the data, and the status. We will use one descriptor for each of these portions, so each request requires three descriptors in total.
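Putting the three portions together for a read request might look like the following sketch. The helper names, the 0xFF status sentinel, and the assumption that buffer addresses are already physical (an identity mapping) are all made up for this example; the flag values and VIRTIO_BLK_T_IN = 0 are from the virt I/O specification.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t  u8;
typedef uint16_t u16;
typedef uint32_t u32;
typedef uint64_t u64;

#define VIRTQ_DESC_F_NEXT  1
#define VIRTQ_DESC_F_WRITE 2
#define VIRTIO_BLK_T_IN    0    /* read from the block device */
#define STATUS_PENDING     0xFF /* hypothetical sentinel */

struct Descriptor { u64 address; u32 length; u16 flags; u16 next; };
struct BlockRequestHeader { u32 type; u32 reserved; u64 sector; };

static void fill(struct Descriptor *d, const void *buf, u32 len,
                 u16 flags, u16 next)
{
    d->address = (u64)(uintptr_t)buf; /* assumes identity-mapped memory */
    d->length  = len;
    d->flags   = flags;
    d->next    = next;
}

/* Hypothetical helper: idx[] names the three descriptors to use. */
static void build_read_request(struct Descriptor *desc, const u16 idx[3],
                               struct BlockRequestHeader *hdr,
                               u8 *data, u32 data_len,
                               u8 *status, u64 sector)
{
    assert(data_len > 0 && data_len % 512 == 0);

    hdr->type     = VIRTIO_BLK_T_IN;
    hdr->reserved = 0;
    hdr->sector   = sector;
    *status       = STATUS_PENDING;

    /* (1) header: the device only reads it */
    fill(&desc[idx[0]], hdr, sizeof *hdr, VIRTQ_DESC_F_NEXT, idx[1]);
    /* (2) data: the device writes the sectors it read into it */
    fill(&desc[idx[1]], data, data_len,
         VIRTQ_DESC_F_NEXT | VIRTQ_DESC_F_WRITE, idx[2]);
    /* (3) status: the device writes one byte, end of chain */
    fill(&desc[idx[2]], status, 1, VIRTQ_DESC_F_WRITE, 0);
}
```

For a write request the data descriptor would drop the WRITE flag, since the device only needs to read the buffer. The index in idx[0] is the one that goes into the available ring.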
The Request Cycle
We set up the central memory addresses as our communication hub using QueuePFN. After we fill out the descriptors and place the first descriptor's index into the available ring, we have to tell the block device that we just made a request. This is done through the GO button, which is a register called QueueNotify. As soon as we write the queue number (which is always 0 for a block device) to this register using MMIO, the device is off! It'll read the descriptors and start processing our request. When it is done, the device will send an external interrupt (routed through the PLIC). We then handle this interrupt, check the used ring, and do whatever needs to be done with the data.
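Pressing the GO button is a single MMIO write. In this sketch the offset 0x050 is QueueNotify's position in the virt I/O MMIO register layout, while the helper name notify_queue is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;

#define VIRTIO_MMIO_QUEUE_NOTIFY 0x050

/* Hypothetical helper: tell the device that the given queue has new
 * entries in its available ring. */
static void notify_queue(volatile u32 *base, u32 queue)
{
    assert(base);
    base[VIRTIO_MMIO_QUEUE_NOTIFY / 4] = queue;
}
```

For the block device we would call notify_queue(base, 0) once the descriptors and the available ring have been filled out, and then wait for the interrupt.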