
The QNX4 Filesystem

The QNX4 filesystem implements an extremely robust design, utilizing an extent-based, bitmap allocation scheme with fingerprint control structures to safeguard against data loss and to provide easy recovery. Features include:

Disk structure

The QNX4 filesystem consists of the following components found at the beginning of its partition: the loader block, the root block, the bitmap, and the root directory.

These structures are created when the filesystem is initialized with the dinit utility.

Loader block

This is the first physical block of a disk partition. This block contains the code that is loaded and then executed by the BIOS of the computer to load an OS from the partition. If a disk hasn't been partitioned (e.g. a floppy diskette), this block is the first physical block on the disk.

Root block

The root block is structured as a standard directory. It contains inode information for these special files:

The files /.boot and /.altboot contain images of the operating system that can be loaded by the QNX bootstrap loader.

Normally, the QNX loader loads the OS image stored in the /.boot file. But if the /.altboot file isn't empty, you'll be given the option to load the image stored in the /.altboot file.

Bitmap

To allocate space on a disk, QNX uses a bitmap stored in the /.bitmap file. This file contains a map of all the blocks on the disk, indicating which blocks are used. Each block is represented by a bit. If the value of a bit is 1, its corresponding block on the disk is in use.
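
The one-bit-per-block idea can be sketched as follows. This is only an illustrative model: the class name, the method names, and the bit order within each byte are assumptions, not the actual layout of the /.bitmap file.

```python
class BlockBitmap:
    """One bit per disk block; a 1 bit means the block is in use.

    The bit order within a byte (lowest bit first) is an assumption
    made for this sketch, not the documented on-disk format.
    """

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.bits = bytearray((total_blocks + 7) // 8)

    def is_used(self, block):
        return bool(self.bits[block // 8] & (1 << (block % 8)))

    def mark_used(self, block):
        self.bits[block // 8] |= 1 << (block % 8)

    def mark_free(self, block):
        self.bits[block // 8] &= ~(1 << (block % 8)) & 0xFF

    def find_free_run(self, length):
        """Return the first block of a run of `length` free blocks, or None."""
        run_start, run_len = 0, 0
        for block in range(self.total_blocks):
            if self.is_used(block):
                run_start, run_len = block + 1, 0
            else:
                run_len += 1
                if run_len == length:
                    return run_start
        return None
```

A search for a run of free bits, as in find_free_run(), is what lets an allocator prefer contiguous space.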

Root directory

The root directory of a partition behaves as a normal directory file with two exceptions:

Extents

In the QNX filesystem, regular files and directory files are stored as a sequence of extents. An extent is a contiguous sequence of blocks on disk.

Files that have only a single extent store the extent information in the directory entry. If more than one extent is needed to hold the file, the extent location information is stored in one or more linked extent blocks. Each extent block can hold location information for up to 60 extents.


Figure: A file with multiple extents.
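
The relationship between a directory entry, its first extent, and the linked extent blocks can be modeled roughly as below. The class and field names are hypothetical; only the limit of 60 extents per extent block comes from the text above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

EXTENTS_PER_XBLK = 60  # from the text: up to 60 extents per extent block


@dataclass
class Extent:
    start: int   # first disk block of the extent
    length: int  # number of contiguous blocks


@dataclass
class ExtentBlock:
    extents: List[Extent] = field(default_factory=list)
    next: Optional["ExtentBlock"] = None  # link to the next extent block


@dataclass
class FileEntry:
    first_extent: Extent                 # kept in the directory entry
    xblk: Optional[ExtentBlock] = None   # only needed for further extents

    def all_extents(self):
        yield self.first_extent
        blk = self.xblk
        while blk is not None:
            yield from blk.extents
            blk = blk.next

    def add_extent(self, extent):
        if self.xblk is None:
            self.xblk = ExtentBlock()
        blk = self.xblk
        # Walk the chain until an extent block with room is found.
        while len(blk.extents) >= EXTENTS_PER_XBLK:
            if blk.next is None:
                blk.next = ExtentBlock()
            blk = blk.next
        blk.extents.append(extent)

    def disk_block(self, logical_block):
        """Translate a block offset within the file to a disk block."""
        offset = logical_block
        for ext in self.all_extents():
            if offset < ext.length:
                return ext.start + offset
            offset -= ext.length
        raise IndexError("block beyond end of file")
```

A file with a single extent never touches an extent block in this model, which mirrors the single-extent case described above.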


Extending files

When the QNX filesystem needs to extend a file, it uses the bitmap to see if it can extend the file contiguously on disk. If not, it tries to allocate a new extent. This may require allocating a new extent block as well. When an extent is allocated or grown, the filesystem may over-allocate space under the assumption that the process will continue to write and fill the extra space. When the file is closed, any extra space is returned.

This design ensures that when files are written, even several files at one time, they are as contiguous as possible. Since most hard disk drives implement track caching, this not only ensures that files are read as quickly as possible from the disk hardware, but also serves to minimize the fragmentation of data on disk.
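
The policy described in this section (grow in place if possible, otherwise start a new extent, over-allocate, and give back the excess on close) might be modeled as in the sketch below. The function names, the `slack` parameter, and the first-fit search are assumptions made for illustration; the text doesn't specify how much extra space is reserved or how a new extent is chosen.

```python
def extend_file(used, extents, want, slack=2):
    """Grow a file by `want` blocks plus `slack` over-allocated blocks.

    used    -- list of bools, one per disk block (True = in use)
    extents -- list of [start, length] pairs describing the file
    Returns the number of blocks actually allocated.
    """
    need = want + slack
    grown = 0
    # 1. Try to grow the last extent contiguously.
    if extents:
        nxt = extents[-1][0] + extents[-1][1]
        while grown < need and nxt < len(used) and not used[nxt]:
            used[nxt] = True
            extents[-1][1] += 1
            nxt += 1
            grown += 1
    # 2. Put whatever is still needed into a new extent (first fit).
    remaining = need - grown
    if remaining > 0:
        run = 0
        for blk, in_use in enumerate(used):
            run = 0 if in_use else run + 1
            if run == remaining:
                start = blk - remaining + 1
                for b in range(start, blk + 1):
                    used[b] = True
                extents.append([start, remaining])
                grown += remaining
                break
    return grown


def close_file(used, extents, blocks_written):
    """Return over-allocated blocks past `blocks_written` to the free map."""
    excess = sum(length for _, length in extents) - blocks_written
    while excess > 0 and extents:
        start, length = extents[-1]
        give_back = min(excess, length)
        for b in range(start + length - give_back, start + length):
            used[b] = False
        extents[-1][1] -= give_back
        if extents[-1][1] == 0:
            extents.pop()
        excess -= give_back
```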

Links and inodes

File data is stored distinctly from its name and can be referenced by more than one name. Each filename, called a link, points to the actual data of the file itself. (There are actually two kinds of links: hard links, which we refer to simply as "links," and symbolic links. Symbolic links are described in the next section.)

In order to support links for each file, the filename is separated from the other information that describes a file. The non-filename information is kept in a storage table called an inode (for "information node").

If a file has only one link (i.e. one filename), the inode information (i.e. the non-filename information) is stored in the directory entry for the file. If the file has more than one link, the inode is stored as a record in a special file named /.inodes -- the file's directory entry will point to the inode record.


Figure: One file referenced by two links.


Note that you can create a link to a file only if the file and the link are in the same filesystem.
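
The effect of a second link can be observed from any POSIX-style environment. The sketch below uses Python's standard os module on a temporary directory; the file names are arbitrary, and it shows the general hard-link behavior rather than anything specific to the /.inodes file.

```python
import os
import tempfile


def link_demo():
    """Create a file, add a second hard link, and compare the two names."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "report.txt")
        alias = os.path.join(d, "alias.txt")
        with open(original, "w") as f:
            f.write("shared data")
        os.link(original, alias)  # second name for the same file data
        st_a, st_b = os.stat(original), os.stat(alias)
        with open(alias) as f:
            data = f.read()
        # Both names resolve to the same inode, whose link count is now 2.
        return st_a.st_ino == st_b.st_ino, st_a.st_nlink, data
```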

There are two other situations in which a file can have an entry in the /.inodes file:

Removing links

When a file is created, it is given a link count of one. As links to the file are added, this link count is incremented; as links are removed, the link count is decremented. The disk space occupied by the file data isn't freed and marked as unused in the bitmap until its link count goes to zero and all programs using the file have closed it. This allows an open file to remain in use, even though it has been completely unlinked. This behavior is stipulated by POSIX and matches common UNIX practice.
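
A short experiment shows an unlinked file remaining usable while it is open. This is a sketch that assumes a POSIX-style system; the file name is arbitrary.

```python
import os
import tempfile


def unlink_while_open():
    """Unlink a file that is still open and keep reading from it."""
    d = tempfile.mkdtemp()
    path = os.path.join(d, "scratch.dat")
    with open(path, "w+") as f:
        f.write("still here")
        f.flush()
        os.unlink(path)                  # remove the only link
        gone = not os.path.exists(path)  # the name no longer exists
        f.seek(0)
        data = f.read()                  # the data is still readable
    # Only after the file is closed can its space be reclaimed.
    os.rmdir(d)
    return gone, data
```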

Directory links

Although you can't create hard links to directories, each directory has two hard-coded links already built in: . ("dot") and .. ("dot dot").

The filename "dot" refers to the current directory; "dot dot" refers to the parent directory in the hierarchy.

Note that if there's no predecessor, "dot dot" also refers to the current directory. For example, the "dot dot" entry of "/" is simply "/" -- you can't go further up the path.
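
These two built-in entries can be checked with a few lines of Python on a POSIX-style system. The helper name is hypothetical, and it expects an absolute path with no trailing slash.

```python
import os
import tempfile


def dot_entries(directory):
    """Check that "." is the directory itself and ".." is its parent."""
    here = os.path.samefile(directory, os.path.join(directory, "."))
    parent = os.path.samefile(os.path.dirname(directory),
                              os.path.join(directory, ".."))
    return here, parent


def root_is_its_own_parent():
    """At the top of the hierarchy, ".." leads back to "/"."""
    return os.path.samefile("/", "/..")
```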

Symbolic links

A symbolic link is a special file that usually has a pathname as its data. When the symbolic link is named in an I/O request -- by open(), for example -- the link portion of the pathname is replaced by the link's "data" and the path is re-evaluated.

Symbolic links are a flexible means of pathname indirection and are often used to provide multiple paths to a single file. Unlike hard links, symbolic links can cross filesystems and can also point to directories.

In the following example, the directories /net/node1/usr/fred and /net/node2/usr/barney are linked even though they reside on different filesystems -- they're even on different nodes (see the following diagram). This couldn't be done using hard links:

/net/node1/usr/fred --> /net/node2/usr/barney

Note how the symbolic link and the target directory need not share the same name. In most cases, you use a symbolic link for linking one directory to another directory. However, you can also use symbolic links for files, as in this example:

/net/node1/usr/eric/src/test.c --> /net/node1/usr/src/game.c


Figure: Symbolic links.



Note: Remember that removing a symbolic link acts only on the link, not the target.

Several functions operate directly on the symbolic link. For these functions, the replacement of the symbolic element of the pathname with its target is not performed. These functions include unlink() (which removes the symbolic link), lstat(), and readlink().
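
The difference between functions that follow a symbolic link and those that operate on the link itself can be seen in a small sketch. It uses Python's os wrappers around the POSIX calls named above; the file names echo the earlier example but live in a temporary directory.

```python
import os
import stat
import tempfile


def symlink_demo():
    """Compare calls that follow a symbolic link with calls that don't."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "game.c")
        link = os.path.join(d, "test.c")
        with open(target, "w") as f:
            f.write("int main(void) { return 0; }\n")
        os.symlink(target, link)
        # stat() follows the link, so both names report the same file.
        follows = os.path.samefile(link, target)
        # lstat() describes the link itself.
        is_link = stat.S_ISLNK(os.lstat(link).st_mode)
        # readlink() returns the pathname stored as the link's data.
        stored = os.readlink(link) == target
        # unlink() removes only the link; the target is untouched.
        os.unlink(link)
        target_survives = os.path.exists(target)
        return follows, is_link, stored, target_survives
```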

Since symbolic links can point to directories, incorrect configurations can result in problems such as circular directory links. To recover from circular references, the system imposes a limit on the number of hops; this limit is defined as SYMLOOP_MAX in the <limits.h> include file.
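
A circular pair of links shows the hop limit in action. In this sketch, which assumes a POSIX-style system, opening either link fails instead of looping forever; the error reported is ELOOP.

```python
import errno
import os
import tempfile


def open_circular_link():
    """Create two links that point at each other and try to open one."""
    with tempfile.TemporaryDirectory() as d:
        a, b = os.path.join(d, "a"), os.path.join(d, "b")
        os.symlink(b, a)  # a -> b
        os.symlink(a, b)  # b -> a
        try:
            open(a).close()
        except OSError as exc:
            return exc.errno  # the resolution gives up after too many hops
    return None
```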

Filesystem robustness

The QNX filesystem achieves high throughput without sacrificing reliability. This has been accomplished in several ways.

While most data is held in the buffer cache and written after only a short delay, critical filesystem data is written immediately. Updates to directories, inodes, extent blocks, and the bitmap are forced to disk to ensure that the filesystem structure on disk is never corrupt (i.e. the data on disk should never be internally inconsistent).

Sometimes all of the above structures must be updated. For example, if you move a file to a directory and the last extent of that directory is full, the directory must grow. In such cases, the order of operations has been carefully chosen such that if a catastrophic failure occurs when the operation is only partially completed (e.g. a power failure), the filesystem, upon rebooting, would still be "intact." At worst, some blocks may have been allocated, but not used. You can recover these for later use by running the chkfsys utility.
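
The value of ordering the updates can be illustrated with a toy model. The text doesn't give the exact sequence the filesystem uses, so the two-step order below (mark the block used, then reference it) is an assumption chosen because it reproduces the stated worst case: blocks that are allocated but not used. All names are hypothetical, and reclaim_unused() only mimics the kind of cleanup attributed to chkfsys.

```python
def grow_with_crash(bitmap, referenced, block, crash_after=None):
    """Add `block` to a file in two ordered steps.

    bitmap     -- set of blocks marked as used
    referenced -- set of blocks that some file actually points to
    crash_after=1 simulates a power failure between the two steps.
    """
    bitmap.add(block)        # step 1: mark the block as used
    if crash_after == 1:
        return
    referenced.add(block)    # step 2: reference it from the file


def is_consistent(bitmap, referenced):
    """No file may point to a block that the bitmap considers free."""
    return referenced <= bitmap


def reclaim_unused(bitmap, referenced):
    """Free blocks that are marked used but referenced by no file."""
    leaked = bitmap - referenced
    bitmap.difference_update(leaked)
    return leaked
```

With the steps reversed, a crash could leave a file pointing at a block the bitmap still reports as free, which a later allocation might hand to another file.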

Filesystem recovery

Even in the best systems, true catastrophes such as these may happen:

To recover as many files as possible if such events occur, unique "signatures" have been written on the disk to aid in the automatic identification and recovery of the critical filesystem pieces. The inodes file (/.inodes), as well as each directory and extent block, all contain unique patterns of data that a "fixer" program could scan for.
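
A scan of this kind might look like the sketch below. The signature bytes, the 512-byte block size, and the assumption that the pattern sits at the start of a block are all placeholders for illustration; they are not the real on-disk values.

```python
BLOCK_SIZE = 512             # assumed block size for this sketch
SIGNATURE = b"EXAMPLE-SIG"   # placeholder, not the real on-disk pattern


def find_signed_blocks(image, signature=SIGNATURE, block_size=BLOCK_SIZE):
    """Return the numbers of all blocks that begin with `signature`."""
    hits = []
    for offset in range(0, len(image), block_size):
        if image[offset:offset + len(signature)] == signature:
            hits.append(offset // block_size)
    return hits
```

Because the scan needs nothing but the raw blocks, it can still locate candidate structures when the directories that normally lead to them are damaged.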

