DOSContainer

Another update on the filesystem crate

📅 2025-05-20  · ✍️ Bas v.d. Wiel  ·  🏷 design

Because the filesystem is such a crucial part of how DOSContainer aims to achieve that legendary ‘museum quality’ output, let’s look into it some more. Progress so far amounts to two things: an abstract model of the data structures that constitute FAT, and a proof-of-concept serializer struct that takes this model’s AllocationTable and spits out an IBM PC-DOS 1.00 compatible interpretation of it.

The abstract model of a File Allocation Table is there to support the basic data structure that all versions of the FAT filesystem have in common, without any of the specific quirks introduced by vendors over the decades. An AllocationTable in the abstract now looks like this:

pub struct AllocationTable {
    clusters: BTreeMap<ClusterIndex, ClusterValue>,
    cluster_size: usize,
    cluster_count: usize,
    fat_type: FatType,
}

For the complete picture, you should realize that a ClusterIndex is an alias for Rust’s built-in usize type, an unsigned integer as wide as a pointer on the target platform: 32 or 64 bits on any machine DOSContainer is likely to run on. I may at some point peg that to u64 for good measure, but that depends on whether exFAT ever comes into scope (a no, for now).

The more interesting part is ClusterValue, which abstracts over the numerical values that a real File Allocation Table records like so:

pub enum ClusterValue {
    Next(ClusterIndex),
    EndOfChain,
    Free,
    Reserved,
    Bad,
}

You see, a File Allocation Table is just a sequence of entries, one per cluster on disk, that all start out unallocated and can each hold a numerical value. For a cluster that is in use, this value denotes the location in the table of the next cluster that belongs to the current ‘chain’. A ‘chain’ in this context denotes the complete list of on-disk clusters where the bytes of a particular file are stored. This allows a file’s contents to be spread across the disk in cluster-sized chunks, enabling efficient use of available disk space because files don’t need to be written in a contiguous way.
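To make the idea of a chain concrete, here is a minimal sketch of walking one through the abstract model. The `follow_chain` helper and the derives on the enum are my own additions for illustration and are not part of the DOSContainer codebase; only the `ClusterIndex` alias and the `ClusterValue` variants come from the model above.

```rust
use std::collections::BTreeMap;

// Mirrors the alias and enum from the post; the derives are added for this sketch.
type ClusterIndex = usize;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ClusterValue {
    Next(ClusterIndex),
    EndOfChain,
    Free,
    Reserved,
    Bad,
}

/// Hypothetical helper: collect every cluster in the chain that starts at `start`.
/// Returns `None` when the chain hits a free, reserved, bad or missing entry,
/// or when it loops back onto itself.
fn follow_chain(
    clusters: &BTreeMap<ClusterIndex, ClusterValue>,
    start: ClusterIndex,
) -> Option<Vec<ClusterIndex>> {
    let mut chain = Vec::new();
    let mut current = start;
    loop {
        // A valid chain can never visit more clusters than the table holds.
        if chain.len() >= clusters.len() {
            return None;
        }
        chain.push(current);
        match clusters.get(&current)? {
            ClusterValue::Next(next) => current = *next,
            ClusterValue::EndOfChain => return Some(chain),
            _ => return None,
        }
    }
}
```

With a table where cluster 2 points to 5, 5 points to 3 and 3 holds EndOfChain, following the chain from cluster 2 yields the clusters 2, 5 and 3 in that order, which is exactly the non-contiguous layout described above.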

You’ll see that most of the values in my Rust enum are symbolic rather than numerical. This fits my abstract data model. In FAT12 a bad cluster is marked by the numerical value 0xFF7, which is a convention. An end-of-chain marker is denoted by 0xFFF, but not in all versions of the FAT12 file system.

By first creating an abstract model that just holds EndOfChain in the last cluster of a chain, we can sidestep the fact that 0xFFF is not always the magic number for it. In FAT16 the number is 0xFFFF for instance, in FAT32 you’d write 0x0FFFFFFF because only the lower 28 bits of each entry are used, and that’s not even considering ambiguities. These numbers are a convention, an agreement between the programmers of yesteryear that not everybody always respected in equal measure.
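As a rough sketch of where those numbers would come in, the function below turns a symbolic ClusterValue into the conventional raw number for a given FAT width. The `encode` function, the `FatType` variants and the choice of 0x…FF6 for Reserved are assumptions made for this example; the post only names the `FatType` type, and real DOS versions deviate from these conventions in places.

```rust
type ClusterIndex = usize;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ClusterValue {
    Next(ClusterIndex),
    EndOfChain,
    Free,
    Reserved,
    Bad,
}

// Assumed variants; the post does not show the real definition of FatType.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FatType {
    Fat12,
    Fat16,
    Fat32,
}

/// Hypothetical encoder: map a symbolic value to its conventional raw number.
/// Returns `None` when a `Next` index is not a valid data cluster for this width.
fn encode(value: ClusterValue, fat_type: FatType) -> Option<u32> {
    // Largest value an entry can hold: 12, 16 or 28 significant bits.
    let max: u32 = match fat_type {
        FatType::Fat12 => 0x0FFF,
        FatType::Fat16 => 0xFFFF,
        FatType::Fat32 => 0x0FFF_FFFF,
    };
    match value {
        ClusterValue::Free => Some(0),
        ClusterValue::Next(index) => {
            let raw = u32::try_from(index).ok()?;
            // Data clusters run from 2 up to just below the reserved range.
            (2..=max - 0x10).contains(&raw).then_some(raw)
        }
        ClusterValue::Reserved => Some(max - 0x9), // ...FF6, an assumed choice
        ClusterValue::Bad => Some(max - 0x8),      // ...FF7
        ClusterValue::EndOfChain => Some(max),     // ...FFF
    }
}
```

The model itself never sees any of these numbers. A version-specific serializer is free to replace this table with whatever its particular DOS actually wrote.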

FAT serializers

In DOSContainer we can construct a model AllocationTable in the same way every time. It won’t look any different between PC-DOS 1.00 from 1981 and Windows 95 OSR2. The differences only appear when we interpret the model into a specific on-disk implementation. For IBM PC-DOS 1.00 the code does something that would be nuts in later versions: it hard-codes the first two clusters’ numerical values to 0xFFE, 0xFFF. Why? Because that is exactly what IBM did back in the day!

IBM PC-DOS 1.00 never supported anything other than 160KB 5.25" floppies, and these would always have these same two cluster values in the first locations of their FAT. No exceptions. But the AllocationTable doesn’t need to know that. All it has is cluster 0 marked Reserved and cluster 1 marked EndOfChain. This is enough to tell DOSContainer not to allocate these for actual use by file/directory data, and all later calculations match up nicely.
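A serializer along these lines could look roughly like the sketch below. It is not the actual 54-line DOSContainer serializer: the function name `serialize_pcdos100`, its signature and the fallback numbers for Reserved and Bad are assumptions for illustration. What it does show is the hard-coded 0xFFE, 0xFFF pair and the standard FAT12 packing of two 12-bit entries into three bytes.

```rust
use std::collections::BTreeMap;

type ClusterIndex = usize;

enum ClusterValue {
    Next(ClusterIndex),
    EndOfChain,
    Free,
    Reserved,
    Bad,
}

/// Hypothetical PC-DOS 1.00 style serializer: FAT12 with the first two entries fixed.
fn serialize_pcdos100(
    clusters: &BTreeMap<ClusterIndex, ClusterValue>,
    entry_count: usize,
) -> Vec<u8> {
    // Step 1: resolve every entry to a 12-bit number.
    let raw: Vec<u16> = (0..entry_count)
        .map(|index| match index {
            // Hard-coded regardless of what the model holds for clusters 0 and 1.
            0 => 0xFFE,
            1 => 0xFFF,
            _ => match clusters.get(&index) {
                // Masked to 12 bits; a real serializer should reject larger indices.
                Some(ClusterValue::Next(next)) => (*next & 0xFFF) as u16,
                Some(ClusterValue::EndOfChain) => 0xFFF,
                Some(ClusterValue::Bad) => 0xFF7,
                Some(ClusterValue::Reserved) => 0xFF6, // assumed choice
                Some(ClusterValue::Free) | None => 0x000,
            },
        })
        .collect();

    // Step 2: pack pairs of 12-bit entries into three bytes each.
    let mut bytes = Vec::with_capacity((entry_count * 3 + 1) / 2);
    for pair in raw.chunks(2) {
        let a = pair[0];
        let b = pair.get(1).copied().unwrap_or(0);
        bytes.push((a & 0xFF) as u8); // low 8 bits of the first entry
        bytes.push(((a >> 8) as u8 & 0x0F) | (((b & 0x0F) as u8) << 4)); // shared byte
        if pair.len() == 2 {
            bytes.push((b >> 4) as u8); // high 8 bits of the second entry
        }
    }
    bytes
}
```

Whatever the model holds, the output of this sketch always starts with the bytes FE FF FF, which is the packed form of 0xFFE followed by 0xFFF.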

When I reach the point of implementing later versions of DOS, the concept of a FAT-ID or Media Descriptor byte comes into play. This may require enriching the model’s data structures slightly to allow for some more information, but only in the abstract. The model should never care about any specific IBM-isms or quirks introduced by other OEMs in the 1980s.
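For a taste of what that might look like, later DOS versions conventionally store the media descriptor in the low 8 bits of the first FAT entry and set the remaining bits to 1. A minimal sketch for FAT12, with `fat12_leading_entries` being a hypothetical name of my own choosing:

```rust
/// Hypothetical helper: the first two FAT12 entries for a given media descriptor.
/// Entry 0 carries the descriptor in its low byte with the top nibble set,
/// entry 1 is the conventional end-of-chain filler.
fn fat12_leading_entries(media_descriptor: u8) -> [u16; 2] {
    [0x0F00 | u16::from(media_descriptor), 0x0FFF]
}
```

Feeding this the descriptor 0xFE reproduces the 0xFFE, 0xFFF pair that the PC-DOS 1.00 serializer hard-codes, so a later serializer could derive these entries from data the model supplies rather than from a constant.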

The current codebase is ready to accept as many AllocationTable serializers as there were DOS versions released. The serializer for IBM PC-DOS 1.00 is 54 lines long. Additional ones will likely duplicate bits of code between them, but that’s a price I’m willing to pay for maintainability and a realistic outlook on supporting a few dozen different FAT implementations in a clean and testable way.
