Linux File System
Table of Contents
- Linux File System
- Virtual File System
- VFS Architecture
- Inode Structure & Purpose
- inode Layout and Metadata
- Troubleshooting Inode Issues
- Filesystem Hierarchy Standard (FHS)
- Modern File-System Types
- ext4
- XFS
- Btrfs
- When Fragmentation Matters
- Mount Options & /etc/fstab
- Permissions, ACLs & Extended Attributes
- Performance Outlook & Benchmarking
- VFS Layer Explorer
- File-System Interface
- VFS Dispatch Layer
- Concrete File-System Storage
Linux File System
Understand this design statement first
In Linux, most system resources are represented as files or can be accessed through the file system, integrating the Unix philosophy that "everything is a file."
The Linux file system forms a single hierarchical tree rooted at /, which unifies all storage including devices. It adheres to the Filesystem Hierarchy Standard (FHS), with directories like /bin (essential binaries), /etc (configuration files), /home (user data), and /var (logs and variable data) ensuring consistent navigation across distributions. Various underlying file systems, such as ext4, XFS, or Btrfs, are abstracted through this common interface.
This philosophy means hardware devices appear as special files in /dev, processes and system info live as virtual files in /proc and /sys, and even sockets or kernel structures can be manipulated via file operations. However, not everything is a traditional disk-stored file; these representations prioritize a unified interaction model over literal storage.
Virtual File System
A virtual file system (VFS) is an abstraction layer in the Linux operating system that sits between the kernel and the actual file systems. The VFS allows applications and the kernel to interact with various types of file systems using a uniform set of operations, such as open(), read(), and write(), without needing to know the details of each specific file system.
This abstraction enables Linux to support a wide variety of local and network file systems, providing access transparency meaning users and programs can access files in the same way regardless of whether they are stored locally or remotely. The VFS also plays a key role in the "everything is a file" philosophy by representing devices, directories, and even processes as files, making system resources easier to manage and access.
VFS Architecture
VFS exposes a single system call API to user space while dispatching each request to the appropriate file-system implementation through a table of operations. Open files become struct file instances, path components are cached as dentry objects, and each unique file or directory is represented by an inode with a fixed-size metadata record.
The VFS architecture is broken down into three main layers:
- File-system interface: system calls such as open(), read(), and write() create or use file descriptors that point into the kernel's file table.
- VFS layer: defines file_operations, inode_operations, and super_operations so each concrete file system can plug in its own behavior.
- Storage level: ext4, XFS, Btrfs, NFS, and others implement the actual block I/O, metadata layout, and journaling semantics hidden from user space.
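The uniform system-call interface is visible from user space: the same open/read/write/close sequence works no matter which concrete filesystem backs the target path. A minimal sketch using Python's os-level wrappers, which map directly onto the underlying syscalls (the temporary file path is illustrative):

```python
import os
import tempfile

# The same syscall sequence works whether the target lives on ext4,
# XFS, tmpfs, or a network filesystem: the VFS dispatches each call
# to the concrete implementation behind the mount point.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)  # returns a file descriptor
os.write(fd, b"hello vfs")                         # dispatched to the fs-specific write
os.lseek(fd, 0, os.SEEK_SET)                       # reposition the file offset
data = os.read(fd, 64)                             # dispatched to the fs-specific read
os.close(fd)

print(data)  # b'hello vfs'
```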
Inode Structure & Purpose
Every file or directory is uniquely identified by an inode number. The inode stores permissions, ownership, timestamps, link count, and pointers to data blocks - everything except the directory entry that maps a name to that number. Running out of inodes (not disk space) can halt new file creation, which is particularly common on filesystems with many small files.
The inode maps logical file blocks to physical disk blocks through a pointer hierarchy. This scheme optimizes for both small and large files:
- Direct Blocks (~12 pointers): Point directly to data blocks. Small files access data in a single disk seek.
- Single Indirect: A pointer to a block containing further block pointers (e.g., 1024 with 4 KiB blocks and 4-byte pointers). Enables mid-size files without excessive indirection.
- Double/Triple Indirect: Additional indirection layers for very large files, trading seek time for capacity.
This hierarchical design ensures efficient access patterns—small files stay fast, while large files remain possible without consuming inode space for individual block pointers.
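The capacity trade-off can be made concrete with a little arithmetic. A sketch of the classic direct/indirect scheme (as used by ext2/ext3; ext4 normally uses extents instead), assuming 4 KiB blocks and 4-byte block pointers:

```python
# Maximum file size reachable through the direct/indirect pointer
# hierarchy. With 4 KiB blocks and 4-byte pointers, one indirect
# block holds 1024 pointers.
BLOCK = 4096
PTRS = BLOCK // 4              # 1024 pointers per indirect block

direct = 12 * BLOCK            # 48 KiB via the 12 direct pointers
single = PTRS * BLOCK          # + 4 MiB via single indirect
double = PTRS**2 * BLOCK       # + 4 GiB via double indirect
triple = PTRS**3 * BLOCK       # + 4 TiB via triple indirect

max_file = direct + single + double + triple
print(f"direct reach: {direct // 1024} KiB, "
      f"max file: {max_file / 2**40:.2f} TiB")
```

Small files never pay for indirection: anything under 48 KiB is reachable through direct pointers alone.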
inode Layout and Metadata
In ext4, each inode is a fixed-size structure (~256–512 bytes) containing:
- File mode & permissions: rwx bits for owner, group, others
- Ownership: UID (user ID) and GID (group ID)
- Timestamps: atime (access), mtime (modification), ctime (change), creation time
- Link count: Number of directory entries pointing to this inode
- Size and block count: File size in bytes and number of allocated blocks
- Block-pointer array: 60-byte array of direct, single-, double-, and triple-indirect pointers
- Extended fields: ACLs, generation numbers, checksums
Block groups partition inodes on disk—the kernel computes the group and offset from the inode number, enabling direct lookup without searching. Use ls -i to see inode numbers and df -i to monitor inode exhaustion.
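The group-and-offset computation is simple modular arithmetic. A sketch, where inodes_per_group would come from the superblock (8192 here is an illustrative value, not a universal default):

```python
# ext4-style inode -> (block group, index within group) lookup.
# Inode numbers start at 1, so subtract 1 before dividing.
INODES_PER_GROUP = 8192   # illustrative; read from the superblock in reality

def locate_inode(ino: int) -> tuple[int, int]:
    group, index = divmod(ino - 1, INODES_PER_GROUP)
    return group, index

print(locate_inode(1))      # (0, 0): the first inode sits in group 0
print(locate_inode(8193))   # (1, 0): first inode of the second group
```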
Troubleshooting Inode Issues
If you see "No space left on device" but df shows free space, your inode count is exhausted:
df -i # Shows inode usage per filesystem
ls -i filename # Displays the inode number for a file
stat filename # Reveals full metadata including inode details
find /path -printf '%i\n' | wc -l # Count inodes in a directory tree
Many small files (like caches or temporary data) can exhaust inodes before disk space—a common issue in web servers, Docker layers, or log directories.
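The name-versus-inode distinction is easy to demonstrate programmatically: a hard link adds a second directory entry for the same inode, so both names report the same inode number and the link count rises. A sketch using the standard library:

```python
import os
import tempfile

# A hard link is a second name for the same inode: both directory
# entries resolve to one inode, and st_nlink counts the names.
d = tempfile.mkdtemp()
orig = os.path.join(d, "a.txt")
link = os.path.join(d, "b.txt")

with open(orig, "w") as f:
    f.write("data")

os.link(orig, link)                  # create a second directory entry

st_a, st_b = os.stat(orig), os.stat(link)
print(st_a.st_ino == st_b.st_ino)    # True: same inode
print(st_a.st_nlink)                 # 2: two names, one inode
```

Deleting one name only decrements the link count; the inode (and its data) is freed when the count reaches zero.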
Filesystem Hierarchy Standard (FHS)
FHS defines a predictable tree so administrators know where binaries, configs, logs, and user data live. The root directory / is the ancestor of every path; distributions vary in symlink usage, but the conceptual layout remains consistent:
- /bin, /sbin: Essential binaries for boot and single-user mode
- /usr: User utilities and read-only data (binaries, libraries, documentation)
- /usr/local: Locally compiled or third-party software
- /etc: System configuration files
- /var: Variable runtime data such as logs (/var/log), caches (/var/cache), and spools (/var/spool)
- /home: User home directories
- /tmp, /var/tmp: Temporary files (cleaned periodically by the system)
- /root: Superuser home directory
- /dev: Device files (block and character devices)
- /proc, /sys: Virtual filesystems exposing kernel state
- /boot: Kernel, initrd, and bootloader files
Static, read-only data typically resides under /usr, while variable runtime data lives under /var. This separation enables mounting /usr read-only for security and /var with different performance tuning.
Modern File-System Types
ext4
Fourth extended filesystem with journaling, extents, and support for volumes up to 1 EiB (individual files up to 16 TiB with 4 KiB blocks). Balances reliability with broad compatibility and is the default on most Linux distributions.
Key features:
- Journaling: Default mode is data=ordered, where metadata is journaled and data blocks are flushed before their metadata commits. Options like data=journal or data=writeback trade safety for speed.
- Extents: Groups of consecutive blocks, reducing fragmentation and speeding up large file I/O compared to older block-by-block allocation.
- Lazy initialization and delayed allocation: Defer actual block allocation until flush time, allowing the kernel to optimize placement.
- Metadata checksums: Detect corruption silently accumulating on disk.
- Mature tooling: fsck.ext4, resize2fs, and rich monitoring make it safe for desktops and general servers.
When to use: Default choice for most systems; stable, well-tested, and broadly supported.
XFS
High-performance, extent-based filesystem designed for large files and parallel workloads. Scales to 8 EiB volumes and is common in enterprise and HPC environments.
Key features:
- Allocation Groups (AGs): Divides storage into independent chunks, each managing its own inodes, free-space B+ trees, and metadata. Eliminates global locks, enabling near-linear scalability on multi-core systems.
- Extent-based allocation: Allocates contiguous blocks efficiently, reducing fragmentation.
- Dynamic inode allocation: Inodes are created on-demand, avoiding inode exhaustion on systems with unpredictable file counts.
- Delayed logging: Batches metadata updates, reducing journal overhead.
- Optimized for streaming: Throughput-oriented; excellent for media servers, HPC, and sequential workloads.
When to use: Large files, parallel workloads, or throughput-sensitive applications. Less ideal for small files or extreme IOPS requirements.
Btrfs
Copy-on-write (COW) filesystem with built-in snapshots, RAID-like redundancy, checksums, and subvolumes for flexible layouts. Combines advanced features with modern design but remains less mature than ext4.
Key features:
- Copy-on-write: Newly written data lands in fresh blocks, leaving old versions intact. Enables instantaneous, space-efficient snapshots.
- Subvolumes: Mountable trees with independent inode namespaces; snapshots are read-only subvolumes sharing data via COW.
- Self-healing and checksums: Detects and corrects corruption across redundant copies (when using RAID modes).
- Incremental backups: send/receive streams preserve the COW graph, enabling efficient incremental backups between filesystems.
- Flexible RAID modes: Unlike hardware RAID, filesystem-level RAID avoids controller overhead but requires understanding data/metadata redundancy modes.
When to use: Snapshots are essential, backups must be incremental, or self-healing is critical. Avoid if max throughput is the only concern or on systems with unstable hardware.
When Fragmentation Matters
All modern filesystems reduce fragmentation via extents and intelligent allocation, but XFS and Btrfs excel here. ext4 can fragment over time, especially with many small updates. Monitor with e4defrag or filefrag, though defragmentation is rarely necessary in practice.
# Check fragmentation on ext4
filefrag -v /path/to/file
# Check free-space fragmentation (ext4)
e2freefrag /dev/sda1
Mount Options & /etc/fstab
The filesystem table (/etc/fstab) defines what to mount, where, and with which options at boot time. Each line has six whitespace-separated fields: device, mount point, filesystem type, options, dump frequency, and fsck order.
Field breakdown:
- Device: UUID, device path, or network location (for NFS/CIFS)
- Mount point: Directory where the filesystem appears
- Filesystem type: ext4, xfs, btrfs, nfs, cifs, swap, etc.
- Options (field 4): Mount behavior flags
- Dump (field 5): 1 to include in backups, 0 to skip (legacy; rarely used)
- Fsck order (field 6): 0 (skip fsck), 1 (root, checked first), 2+ (checked in order)
Common mount options:
- rw / ro: Read-write or read-only
- noexec: Prevent binary execution (security for /tmp, /var/tmp)
- nosuid: Ignore setuid bits (prevents privilege escalation via world-writable directories)
- nodev: Ignore device files (security hardening)
- relatime / noatime: Optimize access-time updates for performance
- defaults: Shorthand for rw,suid,dev,exec,auto,nouser,async
- _netdev: Wait for network before mounting (critical for NFS/CIFS)
- nofail: Continue boot even if mount fails (for removable/unreliable devices)
- discard / nodiscard: Enable/disable TRIM for SSDs
Security hardening example:
/dev/mapper/data /var/cache ext4 defaults,nodev,nosuid,noexec 0 0
Sample /etc/fstab entries
# Device Mount point Filesystem Options Dump Fsck order
UUID=30fcb748-...-7b56df / ext4 defaults 0 1
UUID=64351209-...-56fea /boot xfs defaults 0 2
/dev/mapper/rhel-swap swap swap defaults,pri=5 0 0
# Network share with _netdev so systemd waits for the network
//fileserver/projects /srv/projects cifs _netdev,uid=devuser,gid=devgroup 0 0
Troubleshooting mount issues:
mount -o remount,rw /path # Remount as read-write without unmounting
mount | grep /path # Show mounted filesystem and active options
findmnt -A # List all mounted filesystems with options
Permissions, ACLs & Extended Attributes
Linux combines traditional rwx bits with Access Control Lists (ACLs) and extended attributes (xattrs) for fine-grained authorization and metadata storage.
Standard POSIX permissions:
ls -l /path/file
# Output: -rw-r--r-- 1 user group 1234 Jan 15 10:30 file
# ^^^^^^^^^^ ^^^^ ^^^^^
# mode bits user group
- First character: File type (- for regular, d for directory, l for symlink, etc.)
- Next 9 characters: rwx for owner, group, others (e.g., rw- = read+write)
- Directories need execute bit for traversal: chmod u+x dir allows entering the directory
- Modified with chmod and chown; displayed via ls -l and stat
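The mode bits that ls -l renders live in the inode's st_mode field, and the standard library can decode them. A short sketch:

```python
import os
import stat
import tempfile

# The mode string shown by `ls -l` is derived from st_mode;
# stat.filemode() renders it in the familiar rwx form.
path = os.path.join(tempfile.mkdtemp(), "f")
open(path, "w").close()

os.chmod(path, 0o644)                 # set rw-r--r--
mode = os.stat(path).st_mode
print(stat.filemode(mode))            # -rw-r--r--
print(oct(stat.S_IMODE(mode)))        # 0o644 (permission bits only)
```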
ACLs (Access Control Lists):
Traditional permissions are coarse—either the owner has access or the group does. ACLs grant per-user or per-group permissions without changing owner/group:
# Grant user 'bob' read+execute on a directory
setfacl -m u:bob:rx /var/project
# View effective permissions (including ACLs)
getfacl /var/project
# Remove a specific ACL
setfacl -x u:bob /var/project
ACLs are stored as extended attributes (system.posix_acl_access and system.posix_acl_default). They add overhead and complexity—use only when standard rwx bits don't suffice.
Extended Attributes (xattrs):
Name/value metadata pairs stored on inodes, useful for application-specific data:
# Set a custom attribute
setfattr -n user.myapp.version -v "1.0" /path/file
# Retrieve attributes
getfattr -d /path/file
# List chattr file-attribute flags (a separate mechanism from xattrs)
lsattr /path/file # Shows immutable, append-only, and other flags
Common namespaces:
- user.*: Application-specific metadata (visible to unprivileged users)
- trusted.*: Kernel and trusted-process metadata (root only)
- security.*: SELinux labels, IMA hashes (system security)
- system.*: ACLs and other filesystem metadata
Practical uses: MIME types, checksums, backup flags, or container layer metadata. Not all filesystems support xattrs; test by setting a user.* attribute with setfattr and watching for an "Operation not supported" error.
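Python exposes the Linux xattr syscalls as os.setxattr/os.getxattr. A sketch that tags a file with application metadata, treating missing support gracefully (tag_file is a hypothetical helper; tmpfs and some other filesystems may reject user.* attributes):

```python
import errno
import os
import tempfile

# os.setxattr/os.getxattr wrap the Linux setxattr(2)/getxattr(2)
# syscalls; they are unavailable on non-Linux platforms, and some
# filesystems reject user.* attributes, so handle both cases.
def tag_file(path: str, name: str, value: bytes):
    if not hasattr(os, "setxattr"):        # xattr syscalls are Linux-only
        return None
    try:
        os.setxattr(path, name, value)
        return os.getxattr(path, name)
    except OSError as e:
        if e.errno in (errno.ENOTSUP, errno.EPERM, errno.EACCES):
            return None                    # filesystem lacks xattr support
        raise

path = os.path.join(tempfile.mkdtemp(), "f")
open(path, "w").close()
result = tag_file(path, "user.myapp.version", b"1.0")
print(result)   # b'1.0' where supported, None otherwise
```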
Performance Outlook & Benchmarking
Filesystem choice impacts throughput, latency, and scalability. Recent benchmarks (Phoronix, Linux 6.15 with PCIe Gen 5 NVMe) show:
- XFS: ~20% faster than F2FS, significantly ahead of ext4/Btrfs thanks to allocation-group concurrency and B+ tree free-space management. Excels at large sequential I/O and parallel workloads.
- ext4: Mature, battle-tested default with broad distro support. Slightly slower in raw throughput but predictable and feature-complete. Best for balanced workloads.
- Btrfs: Slightly slower in throughput tests but offers snapshots, checksums, and self-healing. Choose when data integrity or backup efficiency outweighs peak speed.
Real-world considerations:
- Cache hierarchy matters more than filesystem choice: Page cache, buffer cache, and disk controller cache often dominate performance.
- Workload-dependent: Sequential large files favor XFS; random IOPS favor ext4 with careful tuning; snapshots/integrity favor Btrfs.
- Aging and maintenance: ext4 may fragment over years; XFS and Btrfs are more resilient but require modern tooling.
Benchmarking your own setup:
# Measure random read throughput (direct I/O, bypassing the page cache)
fio --name=random-read --ioengine=libaio --iodepth=16 --rw=randread \
--bs=4k --direct=1 --size=10G --filename=/mnt/test.dat
# Check disk queue depth and I/O stats
iostat -x 1
# Monitor filesystem-specific metrics (ext4)
dumpe2fs -h /dev/sda1 | grep "Block count\|Inode count"
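For a quick, rough feel for read throughput without installing fio, a few lines of Python suffice; note this goes through the page cache, so it measures the cache plus filesystem path, not raw device speed (use fio with --direct=1 for that):

```python
import os
import tempfile
import time

# Rough sequential-read throughput sketch. Reads go through the
# page cache, so results will be far above raw device speed on
# a second run; this is for illustration, not benchmarking.
path = os.path.join(tempfile.mkdtemp(), "bench.dat")
size = 16 * 2**20                      # 16 MiB test file

with open(path, "wb") as f:
    f.write(os.urandom(size))

start = time.perf_counter()
with open(path, "rb") as f:
    read = len(f.read())
elapsed = time.perf_counter() - start

print(f"read {read / 2**20:.0f} MiB in {elapsed * 1000:.1f} ms")
```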
VFS Layer Explorer
File-System Interface
User-space programs invoke system calls (open, read, write, close) that map to descriptors in the kernel’s file table.
DATA STRUCTURES
- struct file: an open file description; entries in the per-process file descriptor table point to these
USER-VISIBLE ARTIFACTS
- fd numbers returned to user space
- open file flags (O_RDWR, O_APPEND, etc.)
EXAMPLE OPERATIONS
- open(), read(), write(), close(), lseek()
VFS Dispatch Layer
The Virtual File System abstracts per-filesystem implementations through tables of operations (file_ops, inode_ops, super_ops).
DATA STRUCTURES
- struct inode
- struct dentry (dcache)
- struct super_block
USER-VISIBLE ARTIFACTS
- dentry cache hits/misses
- inode lookups by number
- mount point metadata
EXAMPLE OPERATIONS
- path_walk() / inode lookup
- mount/umount
- sync_filesystems()
Concrete File-System Storage
ext4, XFS, Btrfs, NFS, and others implement block allocation, journaling, and on-disk metadata formats.
DATA STRUCTURES
- block groups / allocation groups
- journal blocks
- B+ trees for free space
- COW trees
USER-VISIBLE ARTIFACTS
- on-disk superblocks
- inode tables
- extent maps
- checksum trees
EXAMPLE OPERATIONS
- block_map()
- journal_commit()
- snapshot_create()
- grow_filesystem()