
Installation of a FreeBSD 8.2 system with root-on-ZFS, mps driver and ZFSv28

This article explains how to install a FreeBSD 8.2 system on a remote server known as a dedibox [1]. It is closely related to my previous HOWTO [2] but with several changes due to the remoteness of the machine and the level of security we need.

IMPORTANT NOTE

Please note that this article is deprecated in favour of the 9.1 ZFS/GELI one. Every server I have has moved or is being moved over to 9.1, which is easier to deal with since no patching is required.

Table of contents

  1. Installation of a FreeBSD 8.2 system with root-on-ZFS, mps driver and ZFSv28
  2. IMPORTANT NOTE
  3. Table of contents
  4. Prerequisites
  5. Installed system
  6. Hardware
  7. Notes on ZFS, disks and all that
  8. Constraints
  9. Generation of mfsbsd image
  10. Creation of the customized distribution
  11. Actual FreeBSD installation
  12. Finishing up
  13. Feedback
  14. History
  15. Credits

Prerequisites

Like many dedicated servers in a hosting datacenter, the one used here requires access to a console (such as iLO or iDRAC) to be able to manipulate BIOS settings, plus access to some kind of rescue mode where you can upload an ISO image and boot from it. The example used here will be the Dedibox rescue system (as described in [5] and used in [7]).

You must have a generated mfsbsd [3] image or the ability to generate one (which means an entire /usr/src tree). See the above mfsbsd URL for details.

Please note that the whole process is currently not for the faint of heart: not only do you need to generate a custom mfsbsd image, you also need to generate your own release. Dedibox (and possibly other hosting companies) manages dedicated servers through a rescue mode, which we will be using to boot our mfsbsd image and work through the installation that way.

Installed system

Hardware

The hardware of choice is the Dedibox PRO R210 (see [1] for reference), a rather powerful system.

One caveat: the mps driver that we will be using for the H200 HBA controller does not support the RAID1 option installed by default by the Online people. The first step is therefore to break that RAID1 setup and configure the BIOS in passthrough mode to get the drives “back” as separate devices (da0 and da1).

Do not be confused by the “SCSI” naming of the drives. The H200 controller, made by LSI and also called the LSI SAS 2008, is a SAS controller, but SATA2 drives are compatible and will appear as SAS drives (hence the da name).

LSI has released a new driver (using the existing mfi driver) that does support the RAID option and has been merged into 10-CURRENT, but we are quite happy not to use it anyway, relying on ZFS instead.

Notes on ZFS, disks and all that

Please go read this article for useful information on ZFS, disks and how to use them. It is not specific to ZFS on FreeBSD; most of it applies anywhere.

Constraints

The fact that we want to use encryption to protect our data is a major constraint on the on-disk architecture and on how we lay out and use partitions. /boot/loader has the ability to ask for encryption keys at boot time, but that also means that it must be unencrypted…

The main choice to be made is whether we use a plain UFS boot partition and then mount everything from the ZFS pool, or ZFS pools for both partitions.

We choose to have two separate ZFS pools to be able to use the other ZFS features such as snapshots, mirroring and so on.

Generation of mfsbsd image

Buildworld & buildkernel

make buildworld
make buildkernel KERNCONF=GENERIC

Please add any options you want (such as -j8) and define variables as you see fit (such as NO_CLEAN=yes).
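
For instance, on a multi-core build machine, something like this (the exact values are mine, adjust to your hardware):

make -j8 buildworld
make -j8 NO_CLEAN=yes buildkernel KERNCONF=GENERIC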

Configuration

You need to modify and customize the various configuration files located in conf/:

authorized_keys
interfaces.conf
loader.conf
rc.conf
resolv.conf

The main things to change are the IP addresses of the server, the DNS servers, the default route and all that. Most of this information is on the page describing your server in the console.

loader.conf is important because that is where you specify the tunables for the kernel. As we are installing a ZFS system, you will need at least to specify vm.kmem_size. I’d advise you to use 2x RAM here.
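
As an illustration, a minimal sketch of the relevant conf/loader.conf lines, assuming a machine with 16 GB of RAM (hence 2x = 32G):

zfs_load="YES"
vm.kmem_size="32G"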

There is also one file we need to modify: the Makefile. By default, only some kernel modules are built in, a few more are put in as loadable modules, and some are missing (in the context of the current article): geom_eli and its dependencies (crypto, zlib). ZFS and opensolaris are included though.
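
I will not reproduce the whole Makefile here; the idea is simply to append the missing modules to the list the Makefile copies into the image. The variable name below is an assumption, check your mfsbsd Makefile for the actual one:

# assumed module list in the mfsbsd Makefile; add the GELI bits
MODULES+= geom_eli crypto zlib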

mfsbsd creation

make BASE=/data/work/freebsd/9 SRC_DIR=/data/work/freebsd/9  CUSTOM=1

Copy the newly created mfsboot.img somewhere you can reach over the Internet. Any FTP or HTTP server will do.

Installation of the mfsbsd image

Put the dedibox server in “rescue mode”. Typically it will be running some form of Linux system like Ubuntu. Now get the image and install it:

wget -O - <url>/mfsboot.img | dd of=/dev/sda bs=1048576
When using this procedure, you end up destroying the installed system and will have to reinstall everything.

Booting off the mfsbsd image

Reboot your dedibox normally by exiting the rescue mode; it should now boot off the mfsbsd image. A few minutes later you should be able to access the system through ssh.
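
Something like this, with a.b.c.d standing for your server’s address, authenticating with the key you put in conf/authorized_keys earlier:

ssh root@a.b.c.d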

Creation of the customized distribution

In parallel with what we are going to do on the target machine, you can generate your custom distribution (incl. your modified sources and kernel configuration) by going into the release directory and running the following commands, after creating a big enough space somewhere to hold the result:

mkdir /data/work/release
make release EXTSRCDIR=/data/work/freebsd/8v28-20110301 EXTPORTSDIR=/usr/ports \ 
	CHROOTDIR=/data/work/release NODOC=yes NOPORTSATALL=yes NOPORTS=yes

Now, you can find the result in /data/work/release under the snapshot’s name:

ls /data/work/release/R/cdrom 
bootonly/       disc1/          disc2/          dvd1/           livefs/

and here is the one we are most interested in (dvd1):

ls /data/work/release/R/cdrom/dvd1
.cshrc              etc/          sbin/
.profile            lib/          stand@
8.2-20110309-SNAP/  libexec/      sys@
COPYRIGHT           media/        tmp/
bin/                mnt/          usr/
boot/               proc/         var/
rescue/             dev/          root/
cdrom.inf

The main distribution is located even further below in the 8.2-* snapshot:

base/           doc/            kernels/        proflibs/
catpages/       games/          lib32/          src/
dict/           info/           manpages/

That will need to be copied over to your shiny new server somewhere.

Actual FreeBSD installation

We will more or less follow the instructions in [4]. The things we change are not essential but reflect our special requirements.

We will later be using the “dvd1” ISO image to install the system. For now, the mfsbsd image has a subset of the installation disk with enough commands to get you going.

Partitioning the drives

As we will be using the two disks in mirror mode, we will be replicating the commands we do on da0 on da1. That way, if either disk is broken at some point, the system will be able to find all the information it needs to boot.

Later on, when we are finished with the partitioning and encryption phases, we will transfer the dvd1 image into memory and mount it as a directory with the mdconfig command. In the meantime, let’s begin.
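
As a preview, mounting the image will look something like the following; the image path is an assumption, use wherever you put yours (mdconfig(8) prints the allocated unit, md0 here):

mdconfig -a -t vnode -f /tmp/dvd1.iso
mount -t cd9660 /dev/md0 /mnt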

Before we install our own GPT partition table, we must wipe out the previous partition table installed by the Dedibox installation system.

dd if=/dev/zero of=/dev/da0 bs=512 count=10
dd if=/dev/zero of=/dev/da1 bs=512 count=10

then install our own:

gpart create -s gpt da0
gpart create -s gpt da1

scoite# gpart show
=>        34  3907029101  da0  GPT  (1.8T)
          34  3907029101       - free -  (1.8T)

=>        34  3907029101  da1  GPT  (1.8T)
          34  3907029101       - free -  (1.8T)

Now create the boot partition, the 1st freebsd-zfs partition, the swap, then the 2nd freebsd-zfs partition. The first ZFS partition is not big because we only need what is necessary for booting. As we will also be encrypting the swap (it makes no sense encrypting the data and not the swap as well) and mirroring it for safety reasons, swap will be twice the RAM (32 GB in our case).

An issue you want to watch out for: really big hard drives (2 and 3 TB now, soon more) have been getting 4 KB sectors and run really slowly with 512-byte sectors, so you want your partitions aligned on 4 KB boundaries. From now on, I will be adding “-a 4k” to the gpart(8) command lines:

gpart add -s 64K -a 4k -t freebsd-boot da0
gpart add -s 2G -a 4k -t freebsd-zfs -l boot0 da0
gpart add -s 32G -a 4k -t freebsd-swap -l swap0 da0
gpart add -a 4k -t freebsd-zfs -l tank0 da0

The alignment flag is probably only needed on the first call, as the rest will be aligned thanks to their sizes.

We mirror that configuration on da1:

gpart add -s 64K -a 4k -t freebsd-boot da1
gpart add -s 2G -a 4k -t freebsd-zfs -l boot1 da1
gpart add -s 32G -a 4k -t freebsd-swap -l swap1 da1
gpart add -a 4k -t freebsd-zfs -l tank1 da1

You can not use something like gpart backup da0 | gpart restore -F da1 to copy the entire partition table in one go because the labels would then be identical on both disks.

You should end up with something like this:

=>        34  3907029101  da0  GPT  (1.8T)
          34         128    1  freebsd-boot  (64K)
         162     4194304    2  freebsd-zfs  (2.0G)
     4194466    67108864    3  freebsd-swap  (32G)
    71303330  3835725805    4  freebsd-zfs  (1.8T)

=>        34  3907029101  da1  GPT  (1.8T)
          34         128    1  freebsd-boot  (64K)
         162     4194304    2  freebsd-zfs  (2.0G)
     4194466    67108864    3  freebsd-swap  (32G)
    71303330  3835725805    4  freebsd-zfs  (1.8T)

You can check that the different partitions and labels do now exist in /dev/gpt:

scoite# ls /dev/gpt
boot0	boot1	swap0	swap1	tank0	tank1

As we want to be able to boot from either disk, we mark the 2nd partition as a boot candidate:

gpart set -a bootme -i 2 da0
gpart set -a bootme -i 2 da1

We then load the bootcode in place on both disks:

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da1

Encrypting the disks

Create the keyfile for the partitions; we will use the same passphrase on both drives for convenience.

mkdir /root/keys
dd if=/dev/random of=/root/keys/boot.key bs=128k count=1

Now, you have to choose a passphrase (as usual, not too short, not guessable; remember you only need it at boot time).

geli init -b -K /root/keys/boot.key -s 4096 -l 256 /dev/gpt/tank0
geli init -b -K /root/keys/boot.key -s 4096 -l 256 /dev/gpt/tank1

Attach both drives:

geli attach -k /root/keys/boot.key /dev/gpt/tank0
geli attach -k /root/keys/boot.key /dev/gpt/tank1

Now, create the 1st ZFS partition in mirror mode for the unencrypted part:

zpool create zboot mirror gpt/boot0 gpt/boot1

and the encrypted also mirrored 2nd ZFS partition:

zpool create tank mirror gpt/tank0.eli gpt/tank1.eli

The two pools should appear like this:

NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
tank   1.78T  1.76G  1.78T     0%  1.01x  ONLINE  -
zboot  1.98G  92.5K  1.98G     0%  1.00x  ONLINE  -

Once we have created some filesystems on the disks, we will set the bootfs property on both pools. I use a separate root filesystem on my pools; it makes changing the / filesystem much simpler and allows having different ones.

Now that the pools have been created, we switch the algorithm used to checksum disk blocks. “fletcher4” is only slightly slower but better (just like CRC16 vs CRC32).

zfs set checksum=fletcher4 zboot
zfs set checksum=fletcher4 tank

Encrypted swap

Swap is slightly different: we will use geli(8)’s “onetime” command through the way we declare swap in /etc/fstab. That way we do not need to enter any passphrase, because there is no need to know the key once the partition has been attached (see geli(8) for details).

As I said earlier, we will use encrypted swap, and geli has an automatic setup for that: adding the .eli suffix in /etc/fstab. So let’s create the gmirror configuration for swap.

gmirror label swap gpt/swap0 gpt/swap1

The /etc/fstab entry will look like the following:

/dev/mirror/swap.eli	none swap sw 0 0

In this scheme, we do not encrypt each swap partition but the mirror itself (swap over geli over gmirror) to avoid encrypting everything twice.
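
After the first boot, you can check that both layers are in place with tools from the base system:

gmirror status
swapinfo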

Filesystems

We will globally follow the filesystem layout we used in [2] with more or less the same options.

Compression seems to create issues for kernel loading so we will avoid that on tank/root. All other FS will inherit the compression property though.

As for the compression scheme: in zpool v13 and later, we can select different algorithms for compression. lzjb is the fastest available but gzip compresses better, so pjd recommends using lzjb on things like /usr/obj, where speed is more interesting than the actual compression ratio.

I have not done any benchmarking yet on this pool-wide compression. The default gzip level (-6) may be too slow to use; please feel free to experiment there. You may wish to only enable compression on selected filesets.
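
For instance, to try a lighter gzip level on one of the filesets created below and see what it actually buys you (gzip-1 through gzip-9 are valid values for the compression property):

zfs set compression=gzip-1 tank/usr/obj
zfs get compressratio tank/usr/obj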

zfs create -o compression=off tank/root
zfs create -o mountpoint=/tank/root/usr tank/usr
zfs create -o mountpoint=/tank/root/usr/obj -o compression=lzjb tank/usr/obj
zfs create -o mountpoint=/tank/root/usr/local tank/usr/local

The reason why I create a separate /usr fileset is that I want different policies for compression, the atime property, snapshots and all that. You can also create another fileset for /usr/local, once again to be able to snapshot the base system and the ports you will be using separately.
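
As an example of such a per-fileset policy, disabling atime updates on /usr/obj where they are pure overhead:

zfs set atime=off tank/usr/obj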

To complete what we want under /usr, we will create /usr/src to hold the system’s sources. We will need these to recompile a new trimmed-down kernel.

zfs create -o mountpoint=/tank/root/usr/src tank/usr/src

Now /var, plus a few useful filesets with special properties we care about to avoid security issues.

zfs create -o mountpoint=/tank/root/var tank/var
zfs create -o exec=off -o setuid=off tank/var/run

zfs create -o mountpoint=/tank/root/var/tmp -o compression=lzjb  tank/var/tmp
zfs set exec=off tank/var/tmp
zfs set setuid=off  tank/var/tmp
chmod 1777 /tank/root/var/tmp

zfs create -o mountpoint=/tank/root/tmp -o compression=lzjb  tank/tmp
zfs set exec=off tank/tmp
zfs set setuid=off  tank/tmp
chmod 1777 /tank/root/tmp

I would also recommend putting users’ home directories in a separate fileset for the same reason, if not a fileset per user should you want to limit each user’s area to a specific size.

zfs create -o mountpoint=/tank/root/home tank/home

Later, you will want to create tank/usr/ports/{distfiles,packages} without compression as well. Properties like snapdir can be changed later on, so we are not forced to set them right now.

zfs create -o mountpoint=/tank/root/usr/ports -o compression=lzjb -o setuid=off tank/usr/ports
zfs create -o mountpoint=/tank/root/usr/ports/distfiles  -o compression=off -o exec=off -o setuid=off tank/usr/ports/distfiles
zfs create -o mountpoint=/tank/root/usr/ports/packages -o compression=off -o exec=off -o setuid=off tank/usr/ports/packages

One thing you want to know about ZFS is that it uses the copy-on-write principle and never overwrites data in place. Any time you rewrite a block, a fresh one is written elsewhere and the pointers are updated (a very condensed summary, see the ZFS docs for more details). The main result is that when a fileset is completely full, you can not remove files to make space, as removing them requires some free space first. A way to mitigate that is to make sure you never fill up a fileset; to that end, you can reserve some space in the “root” fileset:

zfs set reservation=512m tank

Deduplication

ZFSv28 supports one interesting feature among many: deduplication (see http://en.wikipedia.org/wiki/ZFS#Deduplication for more details). It needs to be enabled on every fileset you want deduplicated. Beware though that enabling deduplication will make ZFS use much more memory than before.

zfs set dedup=on tank/usr/src
...

Afterwards, when you have put some files in there, you can check the deduplication status with zpool:

scoite# zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
tank   1.78T  1.76G  1.78T     0%  1.01x  ONLINE  -
zboot  1.98G  92.5K  1.98G     0%  1.00x  ONLINE  -

Installing the system

Because of the constraints I talked about earlier, I had to generate my own release (or snapshot, to be slightly more precise). As I trimmed it down pretty heavily, it is only 300 MB worth and a simple scp is enough to get it into a temporary filesystem.
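
From the build machine, that boils down to something like this (the a.b.c.d hostname is a placeholder):

scp -r /data/work/release/R/cdrom/dvd1/8.2-20110309-SNAP root@a.b.c.d:/tank/tmp/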

I just put the whole tree in /tank/tmp, a temporary fileset created just for this purpose:

scoite# ls -l /tank/tmp/8.2-20110309-SNAP/
total 94
drwxr-xr-x  2 root  wheel   50 Mar  9 16:16 base
drwxr-xr-x  2 root  wheel   13 Mar  9 16:16 catpages
drwxr-xr-x  2 root  wheel    8 Mar  9 16:16 dict
drwxr-xr-x  2 root  wheel    9 Mar  9 16:16 doc
drwxr-xr-x  2 root  wheel    8 Mar  9 16:16 games
drwxr-xr-x  2 root  wheel    9 Mar  9 16:16 info
drwxr-xr-x  2 root  wheel   62 Mar  9 16:18 kernels
drwxr-xr-x  2 root  wheel   20 Mar  9 16:16 lib32
drwxr-xr-x  2 root  wheel   14 Mar  9 16:16 manpages
drwxr-xr-x  2 root  wheel   12 Mar  9 16:16 proflibs
drwxr-xr-x  2 root  wheel  122 Mar  9 16:20 src

Just like in [2], we will extract all distributions manually and fetch everything else from the ’net.

Extract all distributions

By default, root uses /bin/csh as its login shell; you need to type sh now in order to cut&paste the examples below.

cd /tank/tmp/8*
export DESTDIR=/tank/root
for i in base dict doc games info lib32 manpages; do
   (cd $i && sh ./install.sh)
done
You are about to extract the base distribution into /tank/root - are you SURE
you want to do this over your installed system (y/n)? y
You are about to extract the doc distribution into /tank/root - are you SURE
you want to do this over your installed system (y/n)? y

cd kernels
sh ./install.sh generic
cd /tank/root/boot
cp -Rp GENERIC/* kernel/

cd /tank/tmp/8*/src
sh ./install.sh all
Extracting sources into /tank/root/usr/src...
  	Extracting source component: base
...
  	Extracting source component: usbin
Done extracting sources.

Install configuration variables in the proper places

We need to add variables to several files needed for the boot phase; you can use echo(1) or even vi(1).

In /boot/loader.conf:

zfs_load="YES"
geom_label_load="YES"
geom_mirror_load="YES"
geom_uzip_load="YES"
vm.kmem_size="32G"
vfs.root.mountfrom="zfs:tank/root"

For the first boot, in order to verify everything is correct, you can use the following variable to make the passphrase appear as you type it:

kern.geom.eli.visible_passphrase="1"

to be removed in production of course.

Then you can add some tunables for your ZFS installation:

# http://lists.freebsd.org/pipermail/freebsd-stable/2011-February/061388.html
vfs.zfs.txg.timeout="5"

You also need to point geli(4) at the right bits for it:

geli_tank0_keyfile0_load="YES"
geli_tank0_keyfile0_type="gpt/tank0:geli_keyfile0"
geli_tank0_keyfile0_name="/boot/keys/boot.key"
geli_tank1_keyfile0_load="YES"
geli_tank1_keyfile0_type="gpt/tank1:geli_keyfile0"
geli_tank1_keyfile0_name="/boot/keys/boot.key"

Current recommendations for ZFS tuning include setting kmem_size at between 1.5x and 2x the available RAM.
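
You can check the amount of installed RAM from the shell before picking a value:

sysctl hw.physmem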

Do not forget to add the following (or another value) to /etc/sysctl.conf or you will have issues at boot-time with vnode depletion:

##-- tuning
kern.maxvnodes=260000

Your /etc/rc.conf should have some variables defined to properly boot:

zfs_enable="YES"
sshd_enable="YES"
hostname="hostname.example.com"
ntpd_enable="YES"
ntpd_sync_on_start="YES"
ifconfig_em0="inet a.b.c.d netmask 0xffffff00"
geli_swap_flags="-e aes -l 256 -s 4096 -d"

(do not forget things like defaultrouter and all that).
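
For example, with a hypothetical gateway address:

defaultrouter="a.b.c.1"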

Exit the chroot area if you were in one, for easy editing of the previous files.

Finishing up

There are several steps to follow before even rebooting for the first time (you do remember that every time you reboot, you have to log on to the console and enter the encryption passphrase?):

Generate zpool.cache

mkdir /boot/zfs
zpool export tank && zpool import tank
cp /boot/zfs/zpool.cache /tank/root/boot/zfs/

Copy the /boot bits into place for real boot

cp -r /root/keys /tank/root/boot/

cd /tank/root
mv boot oldboot
mkdir /zboot/boot
cp -Rp oldboot/* /zboot/boot/

Configuring the encrypted swap in /tank/root/etc/fstab

scoite# cat /tank/root/etc/fstab
/dev/mirror/swap.eli	none swap sw 0 0

Another issue to look out for: by default, you won’t be able to get kernel crash dumps on a gmirror device (see gmirror(8) for the details and the solution). We need two small additions to the boot process to work around that limitation (as we do not want to always use the prefer setting for mirrored swap):

echo 'gmirror configure -b prefer swap'>>/tank/root/etc/rc.early
echo 'gmirror configure -b round-robin swap'>>/tank/root/etc/rc.local

Fixing mount points

zfs umount -a
zfs set mountpoint=legacy tank
zfs set mountpoint=/tmp tank/tmp
zfs set mountpoint=/var tank/var
zfs set mountpoint=/var/tmp tank/var/tmp
zfs set mountpoint=/usr tank/usr
zfs set mountpoint=/usr/obj tank/usr/obj

…and so on for all other filesets you created above, without forgetting to set the bootfs property on the right fileset:

zpool set bootfs=tank/root tank
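
You can verify the property took before rebooting:

zpool get bootfs tank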

Feedback

Please send any comments, additions or corrections to my FreeBSD mail or my personal mail. Thanks to all who have already done so.

History

1.0 Creation
1.1 Fixes by P. Guyot
1.2 Migration to 9.x, mention 4KB address alignment issues

Credits

Thanks to these people for input and corrections.

Stéphane “KingBug” Clodic
Paul Guyot paul@semiocast.com