Installation of a FreeBSD 8.2 system with root-on-ZFS, mps driver and ZFSv28
This article explains how to install a FreeBSD 8.2 system on a remote server known as a dedibox1. It is closely related to my previous HOWTO2, but with several changes due to the remoteness of the machine and the level of security we need.
IMPORTANT NOTE
Please note that this article is deprecated in favour of the 9.1 ZFS/GELI one. Every server I have is being moved or has moved over to 9.1, which is easier to deal with as no patching is required.
Table of contents
- Installation of a FreeBSD 8.2 system with root-on-ZFS, mps driver and ZFSv28
- IMPORTANT NOTE
- Table of contents
- Prerequisites
- Installed system
- Hardware
- Notes on ZFS, disks and all that
- Constraints
- Generation of mfsbsd image
- Creation of the customized distribution
- Actual FreeBSD installation
- Finishing up
- Feedback
- History
- Credits
Prerequisites
Like many dedicated servers in a hosting datacenter, access to a console (such as iLO or iDRAC) is mandatory to be able to manipulate BIOS settings and to access some kind of rescue mode where you can upload an ISO image and boot from it. The example used here will be the Dedibox rescue system (as described in 5 and used in 7).
You must have an MFSBSD3 generated image or the ability to generate one (which means an entire /usr/src tree). See the above mfsbsd URL for details.
Please note that the whole process right now is not for the faint of heart: not only do you need to generate a custom mfsbsd image, you also need to generate your own release. Dedibox (and possibly other hosting companies) provides a rescue mode that we will be using to boot our mfsbsd image and run the installation from there.
Installed system
- ZFS-only system with Root-on-ZFS
- two disks are in the machine, running with ZFS/mirror
- the main zpool is encrypted with geli(8) so that if a disk needs to be replaced, data on it will be secure4
- ZPool v28 along with ZFS v5 to get deduplication and performance fixes
- two ZFS pools are defined: an unencrypted one containing the minimal booting system, and the real (encrypted) one that is mounted as /
Hardware
The hardware of choice is the Dedibox PRO R210 (see 1 for reference), a rather powerful system with the following characteristics:
- L3426 Nehalem quad-core CPU running at 1.86 GHz
- 16 GB of RAM
- 2 disks of 2 TB each (Hitachi HUA72202 apparently)
The mps driver that we will be using for the H200 HBA controller does not support the RAID1 option installed by default by the Online people. The first step is to break that RAID1 setup and configure the BIOS in passthrough mode to get the drives “back” as separate devices (da0 and da1).
Do not be confused by the “SCSI” naming of the drives. The H200 controller, made by LSI and based on the LSI SAS 2008 chip, is a SAS controller, but SATA2 drives are compatible and will appear as SAS drives (hence the da name).
LSI has released a new driver (replacing the existing mps driver) that does support the RAID option and has been merged into 10-CURRENT, but we are quite happy not to use it anyway, relying on ZFS for that.
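As a quick sanity check (a suggestion, not part of the original procedure), once you have a FreeBSD shell on the machine, e.g. from the mfsbsd image, you can confirm that both drives now show up as plain devices:
camcontrol devlist    # both Hitachi drives should appear individually
ls /dev/da*           # da0 and da1 should exist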
Notes on ZFS, disks and all that
Please go read this article to find useful information on ZFS, disks and how to use them. It is not specific to FreeBSD's ZFS but most of it applies here as well.
Constraints
The fact that we want to use encryption to protect our data is a major constraint on the on-disk architecture and how we lay out and use partitions. /boot/loader has the ability to ask for encryption keys at boot time, but that also means that it must itself be unencrypted…
The main choice to be made is whether we use a plain UFS boot partition and then mount everything from the ZFS pool, or use ZFS pools for both. We choose two separate ZFS pools so that we can use the other ZFS features (snapshots, mirroring and so on) everywhere.
Generation of mfsbsd image
Buildworld & buildkernel
make buildworld
make buildkernel KERNCONF=GENERIC
Please add any options you want (such as -j8) and define variables as you see fit (such as NO_CLEAN=yes).
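For instance, a parallel rebuild reusing previously built objects could look like this (an illustrative invocation for this quad-core machine, adjust to yours):
make -j8 NO_CLEAN=yes buildworld
make -j8 NO_CLEAN=yes buildkernel KERNCONF=GENERIC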
Configuration
You need to modify and customize the various configuration files located in conf/
:
authorized_keys
interfaces.conf
loader.conf
rc.conf
resolv.conf
The main things to change are the IP addresses of the server, the DNS servers, the default route and so on. Most of this information is on the page describing your server in the console.
loader.conf is important because that is where you specify the tunables for the kernel. As we are installing a ZFS system, you will need at least to specify vm.kmem_size. I’d advise you to use 2x RAM here.
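As a minimal sketch, the matching entry in conf/loader.conf for the 16 GB machine described earlier would be:
# twice the amount of physical RAM, as advised above
vm.kmem_size="32G"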
There is also one file we need to modify: the Makefile. By default, only some kernel modules are built in, a few more are included as loadable modules, and some are missing (in the context of the current article): geom_eli and its dependencies (crypto, zlib). ZFS and opensolaris are included though.
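As an illustration (the exact variable name depends on the mfsbsd version you have, so treat this as an assumption and check your Makefile), the change amounts to extending the list of modules copied into the image:
# ship geom_eli and its dependencies as loadable modules
MFSMODULES+= geom_eli crypto zlib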
mfsbsd creation
make BASE=/data/work/freebsd/9 SRC_DIR=/data/work/freebsd/9 CUSTOM=1
Copy the newly created mfsboot.img somewhere you can reach over the Internet. Any FTP or HTTP server will do.
Installation of the mfsbsd image
Put the dedibox server in “rescue mode”. Typically it will be running some form of Linux system like Ubuntu. Now get the image and install it:
wget -O - <url>/mfsboot.img | dd of=/dev/sda bs=1048576
Booting off the mfsbsd image
Reboot your dedibox normally by exiting the rescue mode, it should now boot off the mfsbsd image. A few minutes later you should be able to access the system through ssh
.
Creation of the customized distribution
In parallel to what we are going to do on the target machine, you can generate your custom distribution (incl. your modified source and kernel configuration) by visiting the release directory and running the following command, after creating a big enough space somewhere to hold the result:
mkdir /data/work/release
make release EXTSRCDIR=/data/work/freebsd/8v28-20110301 EXTPORTSDIR=/usr/ports \
CHROOTDIR=/data/work/release NODOC=yes NOPORTSATALL=yes NOPORTS=yes
Now, you can find the result in /data/work/release under the snapshot’s name:
ls /data/work/release/R/cdrom
bootonly/ disc1/ disc2/ dvd1/ livefs/
and in the one we are most interested in (dvd1):
ls /data/work/release/R/cdrom/dvd1
.cshrc              etc/        sbin/
.profile            lib/        stand@
8.2-20110309-SNAP/  libexec/    sys@
COPYRIGHT           media/      tmp/
bin/                mnt/        usr/
boot/               proc/       var/
cdrom.inf           rescue/
dev/                root/
The main distribution is located even further down, in the 8.2-* snapshot:
base/ doc/ kernels/ proflibs/
catpages/ games/ lib32/ src/
dict/ info/ manpages/
That will need to be copied over to your shiny new server somewhere.
Actual FreeBSD installation
We will more or less follow the instructions in 4. Things that we will change are not essential but reflect our special requirements.
We will later be using the “dvd1” ISO image to install the system. For now, the mfsbsd image has a subset of the installation disk with enough commands to get you going.
Partitioning the drives
As we will be using the two disks in mirror mode, we will replicate on da1 the commands we run on da0. That way, if either disk breaks at some point, the system will still be able to find all the information it needs to boot.
Later on, when finished with the partitioning and encryption phases, we will transfer the dvd1 image into memory and mount it as a directory with the mdconfig command. In the meantime, let’s begin.
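For reference, mounting the image later will look something like this (a sketch; the local path of the image is an assumption):
# attach the ISO to a memory disk, then mount it
mdconfig -a -t vnode -f /tmp/dvd1.iso -u 0
mount -t cd9660 /dev/md0 /mnt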
Before we install our own GPT partition table, we must wipe out the previous partition table installed by the Dedibox installation system.
dd if=/dev/zero of=/dev/da0 bs=512 count=10
dd if=/dev/zero of=/dev/da1 bs=512 count=10
then install our own:
gpart create -s gpt da0
gpart create -s gpt da1
scoite# gpart show
=> 34 3907029101 da0 GPT (1.8T)
34 3907029101 - free - (1.8T)
=> 34 3907029101 da1 GPT (1.8T)
34 3907029101 - free - (1.8T)
Now create the boot partition, the 1st freebsd-zfs partition, swap, then the 2nd freebsd-zfs partition. The first ZFS partition is not big because we only need what is necessary for booting. As we will also be encrypting the swap (it makes no sense encrypting the data and not the swap as well) and mirroring it for safety reasons, swap will be twice the RAM (32 GB in our case).
An issue you want to look out for: now that we have really big hard drives (2 and 3 TB now, soon more), these have been getting 4 KB sectors and run really slowly with 512-byte sectors, so you want your partitions aligned on 4 KB boundaries. From now on, I will be adding “-a 4k” to the gpart(8) command lines:
gpart add -s 64K -a 4k -t freebsd-boot da0
gpart add -s 2G -a 4k -t freebsd-zfs -l boot0 da0
gpart add -s 32G -a 4k -t freebsd-swap -l swap0 da0
gpart add -a 4k -t freebsd-zfs -l tank0 da0
It is probably only needed on the first call as the rest will be aligned due to their sizes.
We mirror that configuration on da1:
gpart add -s 64K -a 4k -t freebsd-boot da1
gpart add -s 2G -a 4k -t freebsd-zfs -l boot1 da1
gpart add -s 32G -a 4k -t freebsd-swap -l swap1 da1
gpart add -a 4k -t freebsd-zfs -l tank1 da1
You cannot use something like gpart backup da0 | gpart restore -F da1 to copy the entire partition table in one go, because the labels would then be identical on both disks.
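To double-check that each disk got its own labels, you can ask gpart(8) to display them (a quick sanity check, not part of the original procedure):
gpart show -l da0 da1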
You should end up with something like this:
=> 34 3907029101 da0 GPT (1.8T)
34 128 1 freebsd-boot (64K)
162 4194304 2 freebsd-zfs (2.0G)
4194466 67108864 3 freebsd-swap (32G)
71303330 3835725805 4 freebsd-zfs (1.8T)
=> 34 3907029101 da1 GPT (1.8T)
34 128 1 freebsd-boot (64K)
162 4194304 2 freebsd-zfs (2.0G)
4194466 67108864 3 freebsd-swap (32G)
71303330 3835725805 4 freebsd-zfs (1.8T)
You can check that the different partitions and labels now exist in /dev/gpt:
scoite# ls /dev/gpt
boot0 boot1 swap0 swap1 tank0 tank1
As we want to be able to boot from either disk, we mark the 2nd partition as a boot candidate:
gpart set -a bootme -i 2 da0
gpart set -a bootme -i 2 da1
We then install the bootcode on both disks:
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da1
Encrypting the disks
Create the keyfile for the partitions; we will use the same passphrase for both disks for convenience.
mkdir /root/keys
dd if=/dev/random of=/root/keys/boot.key bs=128k count=1
Now, you have to choose a passphrase (as usual: not too short, not guessable; remember you only need it at boot time).
geli init -b -K /root/keys/boot.key -s 4096 -l 256 /dev/gpt/tank0
geli init -b -K /root/keys/boot.key -s 4096 -l 256 /dev/gpt/tank1
Attach both drives:
geli attach -k /root/keys/boot.key /dev/gpt/tank0
geli attach -k /root/keys/boot.key /dev/gpt/tank1
Now, create the 1st ZFS partition in mirror mode for the unencrypted part:
zpool create zboot mirror gpt/boot0 gpt/boot1
and the encrypted also mirrored 2nd ZFS partition:
zpool create tank mirror gpt/tank0.eli gpt/tank1.eli
The two pools should appear like this:
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
tank 1.78T 1.76G 1.78T 0% 1.01x ONLINE -
zboot 1.98G 92.5K 1.98G 0% 1.00x ONLINE -
Once we have created some filesystems on the disks, we will set the bootfs property on both pools. I use a separate root filesystem on my pools; it makes changing the / filesystem much simpler and allows having different ones.
Now that the pools have been created, we switch the algorithm used to checksum disk blocks. “fletcher4” is only slightly slower but better (just like CRC16 vs CRC32).
zfs set checksum=fletcher4 zboot
zfs set checksum=fletcher4 tank
Encrypted swap
Swap is slightly different: we will use geli(8)’s “onetime” mode through the way we declare swap in /etc/fstab. That way we do not need to enter any passphrase, because there is no need to know it once the partition is attached (see geli(8) for details).
As I said earlier, we will use encrypted swap, and geli has an automatic setup for that, triggered by adding the .eli suffix in /etc/fstab. So let’s create the gmirror configuration for swap.
gmirror label swap gpt/swap0 gpt/swap1
The /etc/fstab entry will look like the following:
/dev/mirror/swap.eli none swap sw 0 0
In this scheme, we do not encrypt each swap partition but the mirror itself (swap over geli over gmirror), to avoid doing things twice.
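After the first boot, you can verify that the encrypted mirrored swap is in place (a suggested check):
swapinfo       # should list /dev/mirror/swap.eli
geli status    # the swap.eli provider should show as ACTIVE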
Filesystems
We will globally follow the filesystem layout we used in 2 with more or less the same options.
Compression seems to create issues for kernel loading, so we will avoid it on tank/root. All other filesystems will inherit the compression property though.
As for the compression scheme: in zpool v13 and later, we can select different algorithms. lzjb is the fastest available but gzip compresses better, so pjd recommends using lzjb on things like /usr/obj where speed matters more than the actual compression ratio.
I have not done any benchmarking yet on pool-wide compression. The default gzip level (-6) may be too slow to use, so please feel free to experiment there. You may wish to only enable compression on selected filesets.
zfs create -o compression=off tank/root
zfs create -o mountpoint=/tank/root/usr tank/usr
zfs create -o mountpoint=/tank/root/usr/obj -o compression=lzjb tank/usr/obj
zfs create -o mountpoint=/tank/root/usr/local tank/usr/local
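If you do experiment, the compressratio property will tell you what a given algorithm actually buys you (a suggested check):
zfs get compressratio tank/usr/obj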
The reason why I create a separate /usr fileset is that I want different policies for compression, the atime property, snapshots and all that. You can also create another fileset for /usr/local, once again to be able to snapshot the base system and the ports you will be using separately.
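For instance, one such per-fileset policy could be disabling atime updates where they bring nothing (an illustrative choice, not part of the original setup):
zfs set atime=off tank/usr/obj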
To complete what we want under /usr, we will create /usr/src to hold the system’s sources. We will need these to recompile a new trimmed-down kernel.
zfs create -o mountpoint=/tank/root/usr/src tank/usr/src
Now /var, plus a few useful filesets with special properties we care about, to avoid security issues.
zfs create -o mountpoint=/tank/root/var tank/var
zfs create -o exec=off -o setuid=off tank/var/run
zfs create -o mountpoint=/tank/root/var/tmp -o compression=lzjb tank/var/tmp
zfs set exec=off tank/var/tmp
zfs set setuid=off tank/var/tmp
chmod 1777 /tank/root/var/tmp
zfs create -o mountpoint=/tank/root/tmp -o compression=lzjb tank/tmp
zfs set exec=off tank/tmp
zfs set setuid=off tank/tmp
chmod 1777 /tank/root/tmp
I would also recommend putting users’ home directories in a separate fileset for the same reason, if not a fileset per user if you want to limit each user’s area to a specific size.
zfs create -o mountpoint=/tank/root/home tank/home
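A per-user fileset with a size limit could look like this (the user name and quota are hypothetical):
zfs create -o quota=10G tank/home/fred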
Later, you will want to create tank/usr/ports/{distfiles,packages} without compression as well. Properties like snapdir can be changed later on, so we are not forced to set them right now.
zfs create -o mountpoint=/tank/root/usr/ports -o compression=lzjb -o setuid=off tank/usr/ports
zfs create -o mountpoint=/tank/root/usr/ports/distfiles -o compression=off -o exec=off -o setuid=off tank/usr/ports/distfiles
zfs create -o mountpoint=/tank/root/usr/ports/packages -o compression=off -o exec=off -o setuid=off tank/usr/ports/packages
One thing you want to know about ZFS is that it uses the copy-on-write principle and never overwrites data in place. Any time you rewrite a block, a fresh one is written elsewhere and the pointers are updated (a very fast summary, see the ZFS docs for more details). The main consequence is that when a fileset is completely full, you cannot remove files to make space, as doing so would itself require some free space. A way to mitigate that is to ensure you never fill up a fileset, and you can reserve some space in the “root” fileset for this:
zfs set reservation=512m tank
Deduplication
ZFSv28 supports one interesting feature among many, called deduplication (see http://en.wikipedia.org/wiki/ZFS#Deduplication for more details). It needs to be enabled on every fileset you want deduplicated. Beware though that enabling deduplication will make ZFS use much more memory than before.
zfs set dedup=on tank/usr/src
...
Afterwards, when you have put some files in there, you can check the deduplication status with zpool:
scoite# zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
tank 1.78T 1.76G 1.78T 0% 1.01x ONLINE -
zboot 1.98G 92.5K 1.98G 0% 1.00x ONLINE -
Installing the system
Because of the constraints I talked about earlier, I had to generate my own release (or snapshot, to be slightly more precise). As I trimmed it down pretty heavily, it is only 300 MB worth and a simple scp is enough to get it into a temporary filesystem.
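Something like the following does the job (the host name and paths are assumptions):
scp -r buildhost:/data/work/release/R/cdrom/dvd1/8.2-20110309-SNAP /tank/tmp/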
I just put the whole tree in /tank/tmp, a temporary fileset created just for this purpose:
scoite# ls -l /tank/tmp/8.2-20110309-SNAP/
total 94
drwxr-xr-x 2 root wheel 50 Mar 9 16:16 base
drwxr-xr-x 2 root wheel 13 Mar 9 16:16 catpages
drwxr-xr-x 2 root wheel 8 Mar 9 16:16 dict
drwxr-xr-x 2 root wheel 9 Mar 9 16:16 doc
drwxr-xr-x 2 root wheel 8 Mar 9 16:16 games
drwxr-xr-x 2 root wheel 9 Mar 9 16:16 info
drwxr-xr-x 2 root wheel 62 Mar 9 16:18 kernels
drwxr-xr-x 2 root wheel 20 Mar 9 16:16 lib32
drwxr-xr-x 2 root wheel 14 Mar 9 16:16 manpages
drwxr-xr-x 2 root wheel 12 Mar 9 16:16 proflibs
drwxr-xr-x 2 root wheel 122 Mar 9 16:20 src
Just like in 2, we will extract all distributions manually and fetch everything else from the ‘net.
Extract all distributions
By default, root uses /bin/csh as its login shell; you need to type sh now in order to cut and paste the examples below.
cd /tank/tmp/8*
export DESTDIR=/tank/root
for i in base dict doc games info lib32 manpages; do
  (cd $i && sh ./install.sh)
done
You are about to extract the base distribution into /tank/root - are you SURE
you want to do this over your installed system (y/n)? y
You are about to extract the doc distribution into /tank/root - are you SURE
you want to do this over your installed system (y/n)? y
cd kernels
sh ./install.sh generic
cd /tank/root/boot
cp -Rp GENERIC/* kernel/
cd /tank/tmp/8*/src
sh ./install.sh all
Extracting sources into /tank/root/usr/src...
Extracting source component: base
...
Extracting source component: usbin
Done extracting sources.
Install configuration variables at proper places
We need to add variables to several files needed for the boot phase, you can use echo(1)
or even vi(1)
.
In /boot/loader.conf:
zfs_load="YES"
geom_label_load="YES"
geom_mirror_load="YES"
geom_uzip_load="YES"
vm.kmem_size="32G"
vfs.root.mountfrom="zfs:tank/root"
For the first boot, in order to verify everything is correct, you can use the following variable to make the passphrase visible as you type it:
kern.geom.eli.visible_passphrase="1"
to be removed in production of course.
Then you can add some tunables for your ZFS installation:
# http://lists.freebsd.org/pipermail/freebsd-stable/2011-February/061388.html
vfs.zfs.txg.timeout="5"
You also need to point geli(4) at the right bits for it:
geli_tank0_keyfile0_load="YES"
geli_tank0_keyfile0_type="gpt/tank0:geli_keyfile0"
geli_tank0_keyfile0_name="/boot/keys/boot.key"
geli_tank1_keyfile0_load="YES"
geli_tank1_keyfile0_type="gpt/tank1:geli_keyfile0"
geli_tank1_keyfile0_name="/boot/keys/boot.key"
Current recommendations for ZFS tuning include setting kmem_size to between 1.5x and 2x the available RAM.
Do not forget to add the following (or another value) to /etc/sysctl.conf or you will have issues at boot time with vnode depletion:
##-- tuning
kern.maxvnodes=260000
Your /etc/rc.conf should have some variables defined to boot properly:
zfs_enable="YES"
sshd_enable="YES"
hostname="hostname.example.com"
ntpd_enable="YES"
ntpd_sync_on_start="YES"
ifconfig_em0="inet a.b.c.d netmask 0xffffff00"
geli_swap_flags="-e aes -l 256 -s 4096 -d"
(do not forget things like defaultrouter and all that).
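For instance (the gateway address is of course illustrative):
defaultrouter="a.b.c.1"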
Exit the chroot area if you were in one, for easy editing of the previous files.
Finishing up
There are several steps to follow before even rebooting for the first time (you do remember that every time you reboot, you have to log on to the console and enter the encryption passphrase?):
Generate zpool.cache
mkdir /boot/zfs
zpool export tank && zpool import tank
cp /boot/zfs/zpool.cache /tank/root/boot/zfs/
Copy the /boot bits into place for real boot
cp -r /root/keys /tank/root/boot/
cd /tank/root
mv boot oldboot
mkdir /zboot/boot
cp -Rp oldboot/* /zboot/boot/
Configuring the encrypted swap in /tank/root/etc/fstab
scoite# cat /tank/root/etc/fstab
/dev/mirror/swap.eli none swap sw 0 0
Another issue to look out for: by default, you won’t be able to get kernel crash dumps on a gmirror device (see gmirror(8) for the details and the solution). We use two special scripts run during the boot process to work around that limitation (as we do not want to always use the prefer balance algorithm for mirrored swap):
echo 'gmirror configure -b prefer swap'>>/tank/root/etc/rc.early
echo 'gmirror configure -b round-robin swap'>>/tank/root/etc/rc.local
Fixing mount points
zfs umount -a
zfs set mountpoint=legacy tank
zfs set mountpoint=/tmp tank/tmp
zfs set mountpoint=/var tank/var
zfs set mountpoint=/var/tmp tank/var/tmp
zfs set mountpoint=/usr tank/usr
zfs set mountpoint=/usr/obj tank/usr/obj
…and likewise for all the other filesets you added above, without forgetting to set the bootfs property on the right fileset:
zpool set bootfs=tank/root tank
Feedback
Please send any comments, additions or corrections to my FreeBSD mail or my personal mail. Thanks to all who have already done so.
History
1.0 Creation
1.1 Fixes by P. Guyot
1.2 Migration to 9.x, mention 4KB address alignment issues
Credits
Thanks to these people for input and corrections.
Stéphane “KingBug” Clodic
Paul Guyot paul@semiocast.com