A couple weeks ago I was coming into work and wanted to quickly finish up some
experimentation I was doing with regards to setting up automatic disk decryption
at boot using TPM 2.0 and LUKS. I have been using full disk encryption (FDE) for
a while now but I wanted to try out the different ways that systemd-cryptsetup
supports unlocking the disk. Of the various methods, it seems that using a TPM
2.0 device to unlock the disk, given the validity of the system firmware and
secure boot state, was most interesting. The most desireable approach is
actually to use TPM 2.0 along with a passphrase or PIN, but this
functionality is not yet
available as it will be released in systemd v251.
Anyways, I had previously added a slot to the TPM consisting of PCRs 0,2,4,7 and found that the disk was not automatically unlocking. I could verify that I had the slot added with:
# systemd-cryptenroll /dev/nvme0n1p2
I wanted to remove the old slot and try again, so I issued this command:
# systemd-cryptenroll /dev/nvme0n1p2 --wipe-slot=0,2,4,7
Then, I went to add a new slot but with different PCRs, so I did:
# systemd-cryptenroll /dev/nvme0n1p2 --tpm2-device=auto --tpm2-pcrs=0,7
This command prompted me for a passphrase like usual, but for some reason it wouldn’t accept my passphrase. When I reissued a command to view the slots, this is what I saw:
# systemd-cryptenroll /dev/nvme0n1p2
SLOT TYPE
1 tpm2
My passphrase was gone! 😱
At this point I was in a bit of a panic. I tried a few more times out of insanity to try and add in the passphrase again so that I could work on my encrypted disk, but to no avail. I started to collect myself and had a couple realizations. First, I couldn’t think of any way to restore the passphrase into the LUKS header of the disk without just reformatting it, and second, my disk was still decrypted at this point in time.
So, while continuously being paranoid of a kernel panic or largely improbable cosmic event knocking out my laptop and costing all my data, I continued to work until lunch came around, at which point I went back to the house and grabbed my SATA to USB hard drive adapter and a spare 8TB hard disk drive I had laying around. (Okay, the drive was meant to go into my desktop as a replacement but I have been too lazy to do it as of late.)
I brought the two devices back to the office and continued to work until the
closing of the day. I immediately hooked up the drive to my laptop and began
copying over all my data with an rsync
command. I left my laptop at the office
that day (locked to the desk, mind you) while the backup operation ran
overnight.
The next morning I came in to see that it had thankfully finished. With my
fingers crossed, I rebooted the laptop back into a Gentoo LiveDVD environment
and reformatted the LUKS partition with my old passphrase. Then, I began
recreating my LVM volumes on top of it and adjusting my /etc/fstab
accordingly. After the volumes were all mounted, I ran another rsync
to begin
the restoration process and found something else to do at work while this was
running. Needless to say this also took basically the entire day, and I left my
laptop at the office for a second night while this completed.
At the dawn of the final day, I returned to my laptop to see it had finished.
After verifying that my /etc/fstab
was correct and making the same adjustments
to the kernel commandline, I rebooted the machine and was greeted with a
pleasant TTY asking for my credentials. 👏
The last thing I needed to do after rebooting was fix the SELinux contexts on
the machine with a simple rlpkg -ar
and manually fix the contexts on the
various mountpoints for the root inode of the volumes (sit tight for a minor
writeup on this as well).
After all was done it was like I had never screwed up in the first place. What I
didn’t mention was that I was also researching on the side what exactly I did
wrong while the recovery was taking place. You may have already guessed it, but
my error was in thinking that the --wipe-slot
argument took PCRs to wipe and
not the slot itself. Normally when you encrypt a disk with LUKS and a
passphrase, the passphrase goes into slot 0. When issuing the above wipe
command, I ended up removing the passphrase slot entirely. If the TPM unlocking
had worked, I could in theory have not worried so much about it, however I would
not be able to update the kernel or initramfs without breaking the disk
decryption process. Additionally, I never got it to work in the first place with
those PCRs.
I am writing this from the same laptop a couple weeks since this happened with a fully working disk decryption with TPM 2.0 setup. It turns out that in addition to the errors I made above, dracut currently has a bug that omits some of the required libraries for systemd-cryptsetup to do its job. I had to manually include these in my dracut config for the time being until a new version with the fix is released. Lastly, I opted to stick with PCRs 0,7 for the time being as this allows me to update my kernel and initramfs together without needing to add a new slot for them every time.
Going forward, I intend to add a recovery key to the disk in the event that something like this happens again. That way, I should still be able to manipulate the LUKS header in case I do something similarly silly in my experimentation.
Hopefully this was an interesting read and a lesson to carefully read the corresponding documentation when working on particularly sensitive projects.
’til next time, see ya! 👋