File Systems

  • A pattern to the arrangement of data on a disk
  • Have a mounting scheme
  • Usually involve a naming and directory system
  • Other possible features
    • Redundancy
    • Networking
  • In modern systems, the operating system is an intermediary between the process and the disk

Files

  • Logically named persistent storage
  • Can be stored as a byte sequence or a record sequence (a record is just multiple bits)
  • Also a paradigm for I/O devices, communication
  • File Type
    • Determined by
      • extension
      • magic numbers (signature at start of file)
      • file attributes (metadata)
      • look at file contents
  • File Control Block
    • Attribute and location information of a file
    • Called a file descriptor when loaded into memory
    • Distributed across the file system (within directories or reachable with them)
  • File Operations
    • Can be accessed sequentially or via random access depending on the media type
    • Sequential Access
      • Tapes, Pipes
      • Must be accessed/navigated in linear order
      • Some but not all can rewind
    • Random Access
      • HDD, SSD, Floppy
      • Data can be accessed in any order
  • Shared Files
    • Regular/Hard Links
      • Holds reference to inode of the target file
      • Works only within the same file system
      • Removing any link doesn’t affect other links
      • Cannot be used with directories
    • Symbolic/Soft Links
      • Holds path to the target file, not the contents
      • Can link to a directory
      • Can link across different file systems
      • If the original file is deleted or moved, the links will not work correctly
  • When links to a file hits 0, it is considered deleted

Directories

  • A special file type that holds part of the file control block about other files
  • Implementation
    • Normal list (hard to keep sorted for efficient search)
    • Linked List (easier to keep sorted but link overhead)
    • B+ Tree (actually used)
    • Hash Table (typically used alongside a linear structure)
  • Names
    • Windows
      • FAT32: Stores all the metadata in 28 out of 32 of the file block
      • Virtual Fat
        • A way of tacking longer names onto FAT32
        • Has a long entry that connects to a base entry that supports longer names
    • UNIX
      • INODE links to a name entry that contains 255 bytes for the name

Mounting

  • We mount a file system to make it accessible to programs through the operating system
  • During the boot process, the root filesystem gets mounted
  • Its information is added into a mount table structure
  • The file system may be checked for errors or inconsistencies
  • Filesystems may be mounted either automatically or manually

Linux Virtual Filesystem (VFS)

  • With VFS, most userspace programs are written filesystem agnostic
  • Sandwiched between two layers
    • Upper: system call layer
    • Lower: set of function pointers (one per filesystem implementation)
  • Implementation
    • File Objects
      • Represents an open file
      • Contains data such as flags used while opening the file and the offset from which a process can read or write from
    • Directory Entries (Dentries)
      • Holds inodes and files objects together
      • Contains names or identifiers
      • Exists for both files and directories
    • Filesystem Type
      • In order to use a filesystem, the kernel must know the filesystem type
      • Filesystems types must be registered
      • Once registered, a filesystem of that type may be mounted
    • Inodes
      • File data structure that stores information about the file
      • Does not include name (in directory entry) and data (in block that it points to)
    • Superblock
      • Structure that represents an instance of a filesystem
      • Typically stored on the storage device itself and loaded into memory when mounted

File Allocation

  • Operating systems often use multiple methods depending on what is most efficient for a particular file

  • Contiguous (all in a row)

    • Files are stored purely in order on the drive
    • Performance is good
    • May have external fragmentation
    • Problems when file grow or the exact size is unknown at creation time
  • Linked List (block)

    • Each block holds a pointer to the next
    • Prevents external fragmentation
    • Inefficient random access
    • Storage must be used for each link (can group blocks to reduce this)
  • Chained table (FAT)

    • File allocation table
      • Statically allocated at time of formatting
    • Chain store the linked list as a table in memory to minimize disk access
    • File points to the first index in the table, which points to the next block for that file. Continues until it hits the end
    • A single point of failure
    • Usable disk size is limited by the size of the table
  • Indexed allocation (index nodes)

    • Stores locations of each block for a file in a single index block referenced by the file control block
    • If an index block breaks, only that one file is lost
    • Limit for file size is how many blocks the index block can contain
  • Multi-level indexed allocation (tree structure)

    • Use direct and indirect blocks to increase maximum file size
    • The index block points to more index blocks which eventually point to the file blocks
    • Supports big files but wastes space on small files
  • Chained indexed allocation (index + linked list)

    • use direct blocks with an additional indirect pointer to the next index
    • Primary index block contains a link to the next index block and it continues for as long as needed
    • Allows for small files not to waste space
  • Combined

    • Supports jumping to small files directly (efficient because most files are small)
    • Index blocks are used when needed
    • Index blocks can also contain other index blocks when they are needed (but as a full tree instead of a single linked list of index blocks)
    • Used in unix systems

Blocks

  • Unit of allocation on the disk (like pages in memory)
  • Size
    • Trade-off between disk utilization and data transfer rate
    • Bigger: better data rate but possibly worse disk utilization (fragmentation)
    • Smaller: more time spent seeking, leaking to lower data rate, but better disk utilization
  • Tracking free blocks
    • Free blocks form a linked list
      • Problem: Writing and deleting a file rapidly can cause thrashing (frequently copying to memory)
      • Solution: Keep multiple blocks loaded into memory
  • File System Consistency
    • Systems unexpectedly shutting down could cause an inconsistent state
    • Check using a block consistency checker
    • Types of problems
      • Missing block (block that is not used is not marked as free)
        • Update the free block tracking based on what files say they are being used
      • Duplicate block in free list (block is found twice in the linked list)
        • Bring it down to one
      • Duplicate data block (two files are referencing the same data block)
        • No easy solution. Split the files into separate blocks and then have the user check. Might save one.

Reliability

  • Backup Types
    • Physical Dump
      • Copy everything, exactly as it is stored on the disk
      • Very simple to implement
      • Runs very fast
      • Inefficient
    • Logical Dump
      • Starts at a directory
      • Dumps all the files which changed since the last backup
      • Used by most UNIX systems
  • Resiliency
    • Mirrored storage
      • Every write is duplicated
      • Reads are to whichever system is last loaded
      • Automatic failover when disk, file system, or network fails
    • Redundant Arrays of Independent Disks (RAID)
      • Why
        • A way of storing the same data in different places on multiple drives
        • Provide performance gains through parallel hardware access
        • Provide reliability through redundant storage
        • Fast recovery
        • Hot swap is possible
        • Can be software or hardware based
      • Types
        • RAID 0 (Data Stripping)
          • Splits files into blocks and scatters them across physical storage units
          • Increases overall performance due to higher cumulative throughput
          • Offers no data redundancy
        • RAID 1 (Data Mirroring)
          • Mirrors data between two drives
        • RAID 2 (Bit Stripping)
          • Stripes data at the bit (rather than block) level
          • Rarely used in practice
          • Could get very high data transfer rates in multiple magnetic disks spinning at the same rate
        • RAID 3 (Byte Stripping with Parity Disk)
          • Parity disk is an error recovery disk
            • Can lose one disk
          • Cannot service multiple requests simultaneously (any single block of data is spread across all members of the set)
          • Rarely used in practice
        • RAID 4 (Block Stripping with Parity Disk)
          • More often used because can support multiple requests at once
          • Random writes are slow because each one has to update the parity on the same disk
        • RAID 5 (Block Stripping with Distributed Parity)
          • Parity information is distributed among the drives
          • Requires at least 3 disks
          • Makes random writes faster because the parity disk is no longer a bottleneck
        • RAID 6 (Block Stripping with two distributed parity blocks)
          • High fault tolerance: allows losing two drives
          • Slow rebuild time because it has to check parity across all drives
    • Journaling
      • Keeps track of changes not yet committed to the file system
      • Records the goal of such changes in a data structure known as a journal
      • Uses for systems with sensitive data
      • Costly in speed and storage

Kernel Data Structures

  • File Descriptor Table
    • Stores pointers to the corresponding entry for a file in the file table
    • Opening a file returns an index in this table if the file successfully opens
    • Per process
  • File Table
    • Currently open files
    • Stores flags (R, W, …), offset within the file, refcount
    • Points to Inode table
    • Can have multiple entries for a single file (if open is called on that file multiple times)
    • System wide
  • Inode Table
    • Permanent information
    • Stores metadata like size, type, etc.
    • System wide
    • This is not the file system inode table, an in memory representation for all the file systems (with only the relevant files populated)

Relevant System Calls

  • open
    • creates a new entry in the file descriptor table and the file table
  • close
    • delete a file descriptor, decrementing refcount of the associated entry in the file table
  • write
    • writes to a buffer cache
    • the kernel then later initiates a write to the actual device (fsync)
  • dup
    • creates a new file table entry for the file the provided file id is for (incrementing that file’s refcount in the file table)
    • dup2(oldfd, newfd) replaces oldfd descriptor with a new one that points to the file of newfd (closing oldfd in the process)
  • int pipe(int fd[])
    • Pipes are special files that are actually a shared memory buffer managed by the kernel
    • One “file” for read, one for write. fd[0] is the read end, fd[1] is the write end
    • read a pipe returns 0 (end of file) when there are no write ends open (the refcount of the write end in the file table) and there is no data left in the pipe buffer
    • It is best practice to only have one read end and one write end, though you can have multiple by forking (which creates a new file descriptor table and increments the refcounts)

IO Buffering

  • Types (and default use)
    • Line buffered (stdout, stdin)
    • Not buffered (stderr)
    • Block/Full buffered (files)
  • Can set the buffer with
    • setvbuf(FILE* fp, char* buf, scheme, size)
    • setbuf(FILE* fp, char* buf)
      • NULL for not buffered, full buffered with size of char* otherwise
    • Note: set the buffer to something in the data segment, locally initializing something (even if it is in main) will not live long enough because it needs to live past exit
  • printf uses a userspace buffer that only executes write to STDOUT when there is a linebreak (to run fewer syscalls)
    • mixing printf and write(STDOUT_FNO, ...) could lead to unintuitive behavior if they don’t all end with newlines because they would be writing to different buffers
  • When stdout is redirected to a file, the buffering scheme is changed to block buffering so behavior of what is ultimately written might change