Linux File System

Understand this design statement first

In Linux, most system resources are represented as files or can be accessed through the file system, integrating the Unix philosophy that "everything is a file."

The Linux file system forms a single hierarchical tree rooted at /, which unifies all storage including devices. It adheres to the Filesystem Hierarchy Standard (FHS), with directories like /bin (essential binaries), /etc (configuration files), /home (user data), and /var (logs and variable data) ensuring consistent navigation across distributions. Various underlying file systems, such as ext4, XFS, or Btrfs, are abstracted through this common interface.

This philosophy means hardware devices appear as special files in /dev, processes and system info live as virtual files in /proc and /sys, and even sockets or kernel structures can be manipulated via file operations. However, not everything is a traditional disk-stored file; these representations prioritize a unified interaction model over literal storage.
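This can be observed directly from the shell; a quick sketch using two standard paths:

```shell
# /dev/null is a character device, not a regular disk file
ls -l /dev/null

# /proc/uptime is a virtual file the kernel generates on every read
cat /proc/uptime
```

Both respond to ordinary file operations (`open`, `read`) even though neither is backed by disk storage.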

Virtual File System

A virtual file system (VFS) is an abstraction layer in the Linux operating system that sits between the kernel and the actual file systems. The VFS allows applications and the kernel to interact with various types of file systems using a uniform set of operations, such as open(), read(), and write(), without needing to know the details of each specific file system.

This abstraction enables Linux to support a wide variety of local and network file systems and provides access transparency: users and programs access files in the same way regardless of whether they are stored locally or remotely. The VFS also underpins the "everything is a file" philosophy by representing devices, directories, and even processes as files, making system resources easier to manage and access.

VFS Architecture

VFS exposes a single system call API to user space while dispatching each request to the appropriate file-system implementation through a table of operations. Open files become struct file instances, path components are cached as dentry objects, and each unique file or directory is represented by an inode with a fixed-size metadata record.
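The per-process slice of the kernel's file table is itself exposed through the file system: each open descriptor of a process appears as a symlink under /proc/self/fd.

```shell
# Every open file descriptor of the current shell shows up as a symlink
# pointing at the underlying file, pipe, or terminal
ls -l /proc/self/fd
```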


The VFS architecture is broken down into three main layers:

  • File-system interface: system calls such as open(), read(), write() create or use file descriptors that point into the kernel’s file table.
  • VFS layer: defines file_operations, inode_operations, and super_operations so each concrete file system can plug in its own behavior.
  • Storage level: ext4, XFS, Btrfs, NFS, and others implement the actual block I/O, metadata layout, and journaling semantics hidden from user space.
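The set of concrete file systems the VFS can dispatch to on a given kernel is listed in /proc/filesystems; entries marked `nodev` are virtual file systems with no backing block device.

```shell
# File-system types registered with the VFS on this kernel
cat /proc/filesystems
```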

Inode Structure & Purpose

Every file or directory is uniquely identified by an inode number. The inode stores permissions, ownership, timestamps, link count, and pointers to data blocks: everything except the name, which lives in the directory entry that maps a name to that inode number. Running out of inodes (rather than disk space) can halt new file creation, a situation particularly common on filesystems holding many small files.
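The separation between names and inodes can be seen with hard links: two directory entries pointing at one inode. A minimal sketch using a scratch directory:

```shell
tmp=$(mktemp -d)
echo "hello" > "$tmp/a"
ln "$tmp/a" "$tmp/b"        # second name for the same inode

ls -i "$tmp/a" "$tmp/b"     # both names show the identical inode number
stat -c '%h' "$tmp/a"       # link count is now 2

rm -r "$tmp"
```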


The inode maps logical file blocks to physical disk blocks through a pointer hierarchy. This scheme optimizes for both small and large files:

  • Direct Blocks (~12 pointers): Point directly to data blocks. Small files access data in a single disk seek.
  • Single Indirect: A pointer to a block that itself holds block pointers (for example, 1024 four-byte pointers in a 4 KiB block; the count varies with block size). Enables mid-size files without excessive indirection.
  • Double/Triple Indirect: Additional indirection layers for very large files, trading seek time for capacity.

This hierarchical design ensures efficient access patterns—small files stay fast, while large files remain possible without consuming inode space for individual block pointers.
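The capacity trade-off can be made concrete with shell arithmetic, assuming a 4 KiB block size and 4-byte block pointers (both assumptions; the exact figures vary by filesystem and configuration):

```shell
bs=4096                    # block size in bytes (assumed)
ptr=4                      # bytes per block pointer (assumed)
per_block=$((bs / ptr))    # pointers held by one indirect block: 1024

direct=$((12 * bs))                      # 12 direct pointers -> 48 KiB
single=$((per_block * bs))               # one indirect block  -> 4 MiB
double=$((per_block * per_block * bs))   # double indirect     -> 4 GiB

echo "direct: $direct  single: $single  double: $double"
```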

Inode Layout and Metadata

In ext4, each inode is a fixed-size structure (~256–512 bytes) containing:

  • File mode & permissions: rwx bits for owner, group, others
  • Ownership: UID (user ID) and GID (group ID)
  • Timestamps: atime (access), mtime (modification), ctime (inode change), and crtime (creation)
  • Link count: Number of directory entries pointing to this inode
  • Size and block count: File size in bytes and number of allocated blocks
  • Block-pointer array: 60-byte array of direct, single-, double-, and triple-indirect pointers
  • Extended fields: ACLs, generation numbers, checksums

Block groups partition inodes on disk—the kernel computes the group and offset from the inode number, enabling direct lookup without searching. Use ls -i to see inode numbers and df -i to monitor inode exhaustion.
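Most of these fields can be read per file with stat's format specifiers; a quick check on a scratch file:

```shell
f=$(mktemp)
# %i inode number, %h link count, %a octal mode, %u owner UID, %s size in bytes
stat -c 'inode=%i links=%h mode=%a uid=%u size=%s' "$f"
rm "$f"
```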

Troubleshooting Inode Issues

If you see "No space left on device" but df shows free space, your inode count is exhausted:

df -i                    # Shows inode usage per filesystem
ls -i filename           # Displays the inode number for a file
stat filename            # Reveals full metadata including inode details
find /path -printf '%i\n' | sort -u | wc -l  # Count unique inodes in a tree (dedups hard links)

Many small files (like caches or temporary data) can exhaust inodes before disk space—a common issue in web servers, Docker layers, or log directories.
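GNU du can attribute inode usage per directory, which helps locate the offending tree (the --inodes flag requires coreutils 8.22 or later; /var here is just an illustrative target):

```shell
# Largest inode consumers under /var; run as root for full coverage
du --inodes -x /var 2>/dev/null | sort -n | tail -5
```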

Filesystem Hierarchy Standard (FHS)

FHS defines a predictable tree so administrators know where binaries, configs, logs, and user data live. The root directory / is the ancestor of every path; distributions vary in symlink usage, but the conceptual layout remains consistent:

  • /bin, /sbin: Essential binaries for boot and single-user mode
  • /usr: User utilities and read-only data (binaries, libraries, documentation)
  • /usr/local: Locally compiled or third-party software
  • /etc: System configuration files
  • /var: Variable runtime data—logs (/var/log), caches (/var/cache), spools (/var/spool)
  • /home: User home directories
  • /tmp, /var/tmp: Temporary files (/tmp is typically cleared at boot or kept in RAM; /var/tmp persists across reboots)
  • /root: Superuser home directory
  • /dev: Device files (block and character devices)
  • /proc, /sys: Virtual filesystems exposing kernel state
  • /boot: Kernel, initrd, and bootloader files

Static, read-only data typically resides under /usr, while variable runtime data lives under /var. This separation enables mounting /usr read-only for security and /var with different performance tuning.
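A minimal sanity check that the FHS directories are present on a given system (some, like /boot or /home, may be absent in minimal containers):

```shell
for d in /etc /var /usr /home /tmp /dev /proc /boot; do
  if test -d "$d"; then echo "$d ok"; else echo "$d missing"; fi
done
```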


Modern File-System Types

ext4

Fourth extended filesystem with journaling, extents, and support for volumes up to 1 EiB (individual files up to 16 TiB). Balances reliability with broad compatibility and is the default on most Linux distributions.

Key features:

  • Journaling: Default mode is data=ordered—metadata is journaled and data blocks are flushed before their metadata commits. Options like data=journal or data=writeback trade safety for speed.
  • Extents: Groups of consecutive blocks, reducing fragmentation and speeding up large file I/O compared to older block-by-block allocation.
  • Delayed allocation: Defers actual block allocation until flush time, allowing the kernel to optimize placement; lazy initialization additionally speeds up mkfs by deferring inode-table zeroing.
  • Metadata checksums: Detect corruption silently accumulating on disk (the metadata_csum feature).
  • Mature tooling: fsck.ext4, resize2fs, and rich monitoring make it safe for desktops and general servers.

When to use: Default choice for most systems; stable, well-tested, and broadly supported.

XFS

High-performance, extent-based filesystem designed for large files and parallel workloads. Scales to 8 EiB volumes on 64-bit Linux and is common in enterprise and HPC environments.

Key features:

  • Allocation Groups (AGs): Divides storage into independent chunks, each managing its own inodes, free-space B+ trees, and metadata. Eliminates global locks, enabling near-linear scalability on multi-core systems.
  • Extent-based allocation: Allocates contiguous blocks efficiently, reducing fragmentation.
  • Dynamic inode allocation: Inodes are created on-demand, avoiding inode exhaustion on systems with unpredictable file counts.
  • Delayed logging: Batches metadata updates, reducing journal overhead.
  • Optimized for streaming: Throughput-oriented; excellent for media servers, HPC, and sequential workloads.

When to use: Large files, parallel workloads, or throughput-sensitive applications. Less ideal for small files or extreme IOPS requirements.

Btrfs

Copy-on-write (COW) filesystem with built-in snapshots, RAID-like redundancy, checksums, and subvolumes for flexible layouts. Combines advanced features with modern design but remains less mature than ext4.

Key features:

  • Copy-on-write: Newly written data lands in fresh blocks, leaving old versions intact. Enables instantaneous, space-efficient snapshots.
  • Subvolumes: Mountable trees with independent inode namespaces; snapshots are read-only subvolumes sharing data via COW.
  • Self-healing and checksums: Detects and corrects corruption across redundant copies (when using RAID modes).
  • Incremental backups: send/receive streams preserve the COW graph, enabling efficient incremental backups between filesystems.
  • Flexible RAID modes: Unlike hardware RAID, filesystem-level RAID avoids controller overhead but requires understanding data/metadata redundancy modes (RAID 5/6 remain less mature than RAID 0/1/10).

When to use: Snapshots are essential, backups must be incremental, or self-healing is critical. Avoid if max throughput is the only concern or on systems with unstable hardware.

When Fragmentation Matters

All modern filesystems reduce fragmentation via extents and intelligent allocation, but XFS and Btrfs excel here. ext4 can fragment over time, especially with many small updates. Monitor with e4defrag or filefrag, though defragmentation is rarely necessary in practice.

# Check fragmentation on ext4
filefrag -v /path/to/file

# Check free-space fragmentation (ext4)
e2freefrag /dev/sda1

Mount Options & /etc/fstab

The filesystem table (/etc/fstab) defines what to mount, where, and with which options at boot time. Each line has six whitespace-separated fields: device, mount point, filesystem type, options, dump frequency, and fsck order.

Field breakdown:

  1. Device: UUID, device path, or network location (for NFS/CIFS)
  2. Mount point: Directory where the filesystem appears
  3. Filesystem type: ext4, xfs, btrfs, nfs, cifs, swap, etc.
  4. Options (field 4): Mount behavior flags
  5. Dump (field 5): 1 to include in backups, 0 to skip (legacy; rarely used)
  6. Fsck order (field 6): 0 (skip fsck), 1 (root, checked first), 2+ (checked in order)
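Because the fields are whitespace-separated, a line can be split with awk; the sample line below is illustrative, not a real device:

```shell
line='UUID=abc123 / ext4 defaults,noatime 0 1'
echo "$line" | awk '{ printf "device=%s mount=%s type=%s opts=%s dump=%s fsck=%s\n", $1, $2, $3, $4, $5, $6 }'
```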

Common mount options:

  • rw / ro: Read-write or read-only
  • noexec: Prevent binary execution (security for /tmp, /var/tmp)
  • nosuid: Ignore setuid/setgid bits (prevents privilege escalation via setuid binaries on untrusted mounts)
  • nodev: Ignore device files (security hardening)
  • relatime / noatime: Optimize access-time updates for performance
  • defaults: Shorthand for rw,suid,dev,exec,auto,nouser,async
  • _netdev: Wait for network before mounting (critical for NFS/CIFS)
  • nofail: Continue boot even if mount fails (for removable/unreliable devices)
  • discard / nodiscard: Enable/disable TRIM for SSDs

Security hardening example:

/dev/mapper/data  /var/cache  ext4  defaults,nodev,nosuid,noexec  0  0

Sample /etc/fstab entries

# Device                        Mount point    Filesystem    Options                 Dump    Fsck order
UUID=30fcb748-...-7b56df        /              ext4          defaults                0       1
UUID=64351209-...-56fea         /boot          xfs           defaults                0       2
/dev/mapper/rhel-swap           swap           swap          defaults,pri=5          0       0
# Network share with _netdev so systemd waits for the network
//fileserver/projects /srv/projects  cifs          _netdev,uid=devuser,gid=devgroup  0       0

Troubleshooting mount issues:

mount -o remount,rw /path  # Remount as read-write without unmounting
mount | grep /path         # Show mounted filesystem and active options
findmnt -A                 # List all mounted filesystems with options

Permissions, ACLs & Extended Attributes

Linux combines traditional rwx bits with Access Control Lists (ACLs) and extended attributes (xattrs) for fine-grained authorization and metadata storage.

Standard POSIX permissions:

ls -l /path/file
# Output: -rw-r--r-- 1 user group 1234 Jan 15 10:30 file
#         ^^^^^^^^^^   ^^^^  ^^^^^
#         mode bits    user  group
  • First character: File type (- for regular, d for directory, l for symlink, etc.)
  • Next 9 characters: rwx for owner, group, others (e.g., rw- = read+write)
  • Directories need execute bit for traversal: chmod u+x dir allows entering the directory
  • Modified with chmod and chown; displayed via ls -l and stat
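The octal and symbolic views of the same mode can be checked side by side on a scratch file:

```shell
f=$(mktemp)
chmod 640 "$f"          # owner rw-, group r--, others ---
stat -c '%A %a' "$f"    # prints the symbolic and octal modes: -rw-r----- 640
rm "$f"
```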

ACLs (Access Control Lists):

Traditional permissions are coarse—either the owner has access or the group does. ACLs grant per-user or per-group permissions without changing owner/group:

# Grant user 'bob' read+execute on a directory
setfacl -m u:bob:rx /var/project

# View effective permissions (including ACLs)
getfacl /var/project

# Remove a specific ACL
setfacl -x u:bob /var/project

ACLs are stored as extended attributes (system.posix_acl_access and system.posix_acl_default). They add overhead and complexity—use only when standard rwx bits don't suffice.

Extended Attributes (xattrs):

Name/value metadata pairs stored on inodes, useful for application-specific data:

# Set a custom attribute
setfattr -n user.myapp.version -v "1.0" /path/file

# Retrieve attributes
getfattr -d /path/file

# List all extended-attribute names (all namespaces)
getfattr -m - /path/file

# Note: lsattr shows ext4 file attributes (immutable, append-only flags),
# a separate mechanism from xattrs
lsattr /path/file

Common namespaces:

  • user.*: Application-specific metadata (visible to unprivileged users)
  • trusted.*: Kernel and trusted-process metadata (root only)
  • security.*: SELinux labels, IMA hashes (system security)
  • system.*: ACLs and other filesystem metadata

Practical uses: MIME types, checksums, backup flags, or container layer metadata. Not all filesystems support xattrs—check with attr -l /path.

Performance Outlook & Benchmarking

Filesystem choice impacts throughput, latency, and scalability. Recent benchmarks (Phoronix, Linux 6.15 with PCIe Gen 5 NVMe) show:

  • XFS: ~20% faster than F2FS, significantly ahead of ext4/Btrfs thanks to allocation-group concurrency and B+ tree free-space management. Excels at large sequential I/O and parallel workloads.
  • ext4: Mature, battle-tested default with broad distro support. Slightly slower in raw throughput but predictable and feature-complete. Best for balanced workloads.
  • Btrfs: Slightly slower in throughput tests but offers snapshots, checksums, and self-healing. Choose when data integrity or backup efficiency outweighs peak speed.

Real-world considerations:

  • Cache hierarchy matters more than filesystem choice: Page cache, buffer cache, and disk controller cache often dominate performance.
  • Workload-dependent: Sequential large files favor XFS; random IOPS favor ext4 with careful tuning; snapshots/integrity favor Btrfs.
  • Aging and maintenance: ext4 may fragment over years; XFS and Btrfs are more resilient but require modern tooling.

Benchmarking your own setup:

# Measure 4 KiB random-read performance
fio --name=random-read --ioengine=libaio --iodepth=16 --rw=randread \
    --bs=4k --direct=1 --size=10G --filename=/mnt/test.dat

# Check disk queue depth and I/O stats
iostat -x 1

# Monitor filesystem-specific metrics (ext4)
dumpe2fs -h /dev/sda1 | grep "Block count\|Inode count"

VFS Layer Explorer

File-System Interface

User-space programs invoke system calls (open, read, write, close) that map to descriptors in the kernel’s file table.

DATA STRUCTURES

  • struct file
  • file descriptor table

USER-VISIBLE ARTIFACTS

  • fd numbers returned to user space
  • open file flags (O_RDWR, O_APPEND, etc.)

EXAMPLE OPERATIONS

  • open()
  • read()
  • write()
  • close()
  • lseek()
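The same open/read/close sequence can be driven from the shell, where exec opens a numbered descriptor, read consumes it, and exec 3<&- closes it (a sketch on a scratch file):

```shell
f=$(mktemp)
printf 'vfs demo\n' > "$f"

exec 3< "$f"        # open(): descriptor 3 now points into the kernel file table
read -r line <&3    # read(): data flows through the descriptor
exec 3<&-           # close(): the descriptor is released

echo "$line"
rm "$f"
```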

VFS Dispatch Layer

The Virtual File System abstracts per-filesystem implementations through tables of operations (file_ops, inode_ops, super_ops).

DATA STRUCTURES

  • struct inode
  • struct dentry
  • dcache
  • struct super_block

USER-VISIBLE ARTIFACTS

  • dentry cache hits/misses
  • inode lookups by number
  • mount point metadata

EXAMPLE OPERATIONS

  • path_walk()
  • inode lookup
  • mount/umount
  • sync_filesystems()
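The mount-point metadata tracked at this layer is exported through /proc/self/mounts, one line per mount (source, target, filesystem type, options):

```shell
# First few mounts visible to the current process
head -5 /proc/self/mounts
```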

Concrete File-System Storage

ext4, XFS, Btrfs, NFS, and others implement block allocation, journaling, and on-disk metadata formats.

DATA STRUCTURES

  • block groups / allocation groups
  • journal blocks
  • B+ trees for free space
  • COW trees

USER-VISIBLE ARTIFACTS

  • on-disk superblocks
  • inode tables
  • extent maps
  • checksum trees

EXAMPLE OPERATIONS

  • block_map()
  • journal_commit()
  • snapshot_create()
  • grow_filesystem()