Supporting Raid Devices
This section discusses software raid devices from an initial boot
image perspective: how to get the root device up and running.
There are other aspects to consider, the bootloader for example:
if your root device is on a mirror for reliability, it would be
a disappointment if, after a disk crash, you still faced long downtime
because the MBR was available only on the crashed disk. Then there's
the issue of managing raid devices in combination with hotplugging:
once the system is operational, how should the raid devices that
the initial image left untouched be brought online?
Raid devices are mostly managed via ioctls (there is also something
called "autorun" in the kernel, discussed below).
The interface from userland is simple: mknod a block device file,
send an ioctl to it specifying the devnos of the underlying block
devices and whether you'd like mirroring or striping, then send
a final ioctl to activate the device. This leaves the managing
application free to pick any unused device (minor) number and
makes no assumptions about device file names.
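That sequence can be sketched in Python. The ioctl names (SET_ARRAY_INFO, ADD_NEW_DISK, RUN_ARRAY) and the fixed md major come from the kernel's md_u.h, and the request numbers are built with the generic Linux ioctl encoding; the struct sizes and the bring_up helper are illustrative assumptions, and actually issuing the ioctls requires root and real component devices, so those calls are left commented out.

```python
# Sketch of the userland sequence for bringing a raid device online,
# assuming the ioctl names from <linux/raid/md_u.h>.  The struct
# payloads (mdu_array_info_t and friends) are elided; a real
# implementation would pack them with the struct module.
import fcntl, os, stat

def _ioc(direction, magic, nr, size):
    # Generic Linux ioctl request encoding: dir | size | type | nr.
    return (direction << 30) | (size << 16) | (magic << 8) | nr

MD_MAJOR = 9          # fixed major number of non-partitionable md devices
_IOC_WRITE = 1

# Request numbers as in md_u.h; the payload sizes here are illustrative.
SET_ARRAY_INFO = _ioc(_IOC_WRITE, MD_MAJOR, 0x23, 72)
ADD_NEW_DISK   = _ioc(_IOC_WRITE, MD_MAJOR, 0x21, 20)
RUN_ARRAY      = _ioc(_IOC_WRITE, MD_MAJOR, 0x30, 12)

def bring_up(path, minor):
    """mknod a block device file, then configure and start the array."""
    os.mknod(path, 0o600 | stat.S_IFBLK, os.makedev(MD_MAJOR, minor))
    fd = os.open(path, os.O_RDWR)
    # fcntl.ioctl(fd, SET_ARRAY_INFO, packed_array_info)
    # fcntl.ioctl(fd, ADD_NEW_DISK, packed_disk_info)  # once per component
    # fcntl.ioctl(fd, RUN_ARRAY, 0)                    # activate the array
    os.close(fd)
```

Note that nothing in this sequence depends on the file being called /dev/md0: any path and any free minor will do, which is exactly the freedom the text describes.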
Devices that take part in a raid set also have a "superblock",
a header at the end of the device that contains a uuid and indicates
how many drives and spares are supposed to take part in the raid set.
This can be used by the kernel to do consistency checking; it can also
be used by applications to scan for all disks belonging to a raid set,
even if one of the component drives is moved to another disk controller.
The fact that the superblock is at the end of a device has an obvious
advantage: if you somehow lose your raid software, the device
underlying a mirror can be mounted directly as a fallback measure.
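For the version-0.90 superblock format, "at the end of the device" means the last 64 KiB-aligned 64 KiB block, which is why a filesystem starting at offset zero never overlaps it. A small sketch of the offset calculation (the reserved-space constant is taken from the kernel's md_p.h; the function name is ours):

```python
# The version-0.90 md superblock occupies the last 64 KiB-aligned
# 64 KiB block of the component device (constant per md_p.h).
MD_RESERVED_BYTES = 64 * 1024

def sb_offset(device_size_bytes):
    """Byte offset of the 0.90 superblock on a device of the given size."""
    return (device_size_bytes & ~(MD_RESERVED_BYTES - 1)) - MD_RESERVED_BYTES

# On a 1 GiB component the superblock sits 64 KiB before the end:
print(sb_offset(1 << 30))   # → 1073676288
```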
If raid is compiled into the kernel rather than provided as a module,
the kernel uses superblocks at boot time to find raid sets and make
them available without user interaction. In this case the filename of
the created block device is hardcoded: /dev/md\d.
This feature is intended for machines with root on a raid device
that don't use an initial boot image. This autorun feature is
also accessible via an ioctl, but it's not used in management
applications, since it won't work with an initial boot image and
it can be a nuisance if some daemon brings a raid set back online just
after the administrator took it offline for replacement.
Finally, by picking a different major device number for the raid device,
the raid device can be made partitionable without use of LVM.
There are at least three different raid management applications
for Linux: raidtools, the oldest; mdadm, more modern; and EVMS, a
suite of graphical and command line tools that manages not only raid
but also LVM, partitioning and file system formatting. We'll only
consider mdadm for now. The use of mdadm is simple:
There's an option to create a new device from components,
building the superblock.
Another option assembles a raid device from components,
assuming the superblocks are already available.
Optionally, a configuration file can be used, specifying which
components make up a device, whether a device file should
be created or it is assumed to exist, whether it's stripe or
mirror, and the uuid. Also, a wildcard pattern can be given:
disks matching this pattern will be searched for superblocks.
Information given in the configuration file can be omitted
on the command line. If there's a wildcard, you don't even
have to specify the component devices of the raid device.
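A hypothetical mdadm.conf fragment along those lines (the device names and the uuid are placeholders, not taken from a real system):

```
# Wildcard pattern: scan these disks for raid superblocks.
DEVICE /dev/hd* /dev/sd*

# auto=md: create the device file if it doesn't exist yet.
# The UUID below is a placeholder.
ARRAY /dev/md-root auto=md UUID=0a1b2c3d:4e5f6071:8293a4b5:c6d7e8f9
```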
A typical command is mdadm --assemble --auto=md
--uuid=... /dev/md-root, which translates to "create
/dev/md-root with some unused minor number,
and put the components with matching uuid in it."
So far, raid devices look fairly simple to use; the complications
arise when you have to play nicely with all the other software
on the box. It turns out there are quite a lot of packages that
interact with raid devices:
When the md module is loaded, it registers 256 block devices
with devfs. These devices
are not actually allocated; they're just names set up to
allocate the underlying device when opened. These names in
devfs have no counterpart in sysfs.
When the LVM vgchange is started,
it opens all md devices to scan for headers, only to find that the
raid devices have no underlying components and return
no data. In this process, all these stillborn md devices get
registered with sysfs.
When udevstart is executed
at boot time, it walks over the sysfs tree and lets
udev create block device files for
every block device it finds in sysfs. The name and permissions
of the created file are configurable, and there is a hook to
initialise SELinux access controls.
When mdadm is invoked with the auto
option, it will create a block device file with an unused
device number and put the requested raid volume under it.
The created device file is owned by whoever executed the
mdadm command, its permissions are 0600,
and there are no hooks for SELinux.
When the Debian installer builds a system with LVM and raid, the
raid volumes have names such as /dev/md0,
where there is an assumption about the device minor number in
the name of the file.
For the current Debian mkinitrd, this all works together in
a wonderful manner: devfs creates file names for raid devices,
LVM scans them, which as a side effect enters the devices in sysfs,
and after pivot_root udevstart triggers
udev into creating block device files with proper permissions and
SELinux hooks. Later in the processing of rcS.d,
mdadm will put a raid device under the
created special file. Convoluted but correct, except for the fact
that out of 256 generated raid device files, up to 255 are unused.
In yaird, we do not use devfs.
Instead, we do a mknod before the
mdadm, taking care to use the same
device number that's in use in the running kernel. We expect
mdadm.conf to contain an auto=md
option for any raid device files that need to be created.
This approach should work regardless of whether the fstab uses
/dev/md\d or a device-number-independent name.
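A sketch of that idea, assuming the major:minor pair is picked up from the running kernel's sysfs tree, which exposes it in /sys/block/&lt;name&gt;/dev; the parse_devno and mknod_like_kernel helpers are illustrative, not yaird's actual code.

```python
# Create a raid device file with the same device number the running
# kernel uses, read from the "dev" attribute under /sys/block.
import os, stat

def parse_devno(text):
    """Turn the "major:minor" contents of a sysfs dev file into a dev_t."""
    major, minor = (int(part) for part in text.strip().split(":"))
    return os.makedev(major, minor)

def mknod_like_kernel(name, path):
    """mknod `path` with the device number the kernel shows for `name`."""
    with open("/sys/block/%s/dev" % name) as f:
        devno = parse_devno(f.read())
    os.mknod(path, 0o600 | stat.S_IFBLK, devno)   # needs root

# An md0 on the fixed md major would show up in sysfs as "9:0":
print(parse_devno("9:0") == os.makedev(9, 0))   # → True
```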