Multi-Kernel Drifting
I was setting up some automation to build Windows images pre-loaded with some drivers and software (a story for another day). I had already gotten it working with QEMU under KVM on Linux but wanted to port it to propolis on our illumos distro, Helios. I figured it should be mostly straightforward; maybe a couple of different flags or utilities to futz around with the disk images and mount them. Which was the case. Mostly. That is, except for the one minor detail of not being able to mount an NTFS image.
$ pfexec mount -F ntfs-3g $LOOPBACK_DEV /mnt/test
fuse: mount failed: Not a directory
the setup
Ok, let's step back a second. To give some context, I was trying to create a raw image that contained an NTFS partition. Maybe it didn't like the way I created the GPT? Ok, let's try something simpler and forget partitions for a moment and just try solely creating an NTFS file system:
- Create an empty disk image:
$ qemu-img create -f raw test.img 8G
Formatting 'test.img', fmt=raw size=8589934592
- Create loopback device:
$ pfexec lofiadm -l -a test.img
/dev/dsk/c2t1d0p0
- Create an NTFS file system:
$ mkntfs -Q /dev/dsk/c2t1d0p0
The sector size was not specified for /dev/dsk/c2t1d0p0 and it could not be obtained automatically. It has been set to 512 bytes.
The partition start sector was not specified for /dev/dsk/c2t1d0p0 and it could not be obtained automatically. It has been set to 0.
The number of sectors per track was not specified for /dev/dsk/c2t1d0p0 and it could not be obtained automatically. It has been set to 0.
The number of heads was not specified for /dev/dsk/c2t1d0p0 and it could not be obtained automatically. It has been set to 0.
Cluster size has been automatically set to 4096 bytes.
To boot from a device, Windows needs the 'partition start sector', the 'sectors per track' and the 'number of heads' to be set.
Windows will not be able to boot from this device.
Creating NTFS volume structures.
mkntfs completed successfully. Have a nice day.
Ok, that's a lot of warnings (that I did handle properly in the real scenario!) but they shouldn't be relevant right now. We don't care if Windows can't boot off of this image.
- Mount the file system:
$ pfexec mount -F ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test
fuse: mount failed: Not a directory
Something's clearly wrong.
linux
Mind you, this same scenario using the FUSE-based ntfs-3g driver works on Linux:
➜ ~ qemu-img create -f raw test.img 8G
Formatting 'test.img', fmt=raw size=8589934592
➜ ~ sudo losetup -f --show test.img
/dev/loop1
➜ ~ sudo mkntfs -Q /dev/loop1
The partition start sector was not specified for /dev/loop1 and it could not be obtained automatically. It has been set to 0.
The number of sectors per track was not specified for /dev/loop1 and it could not be obtained automatically. It has been set to 0.
The number of heads was not specified for /dev/loop1 and it could not be obtained automatically. It has been set to 0.
Cluster size has been automatically set to 4096 bytes.
To boot from a device, Windows needs the 'partition start sector', the 'sectors per track' and the 'number of heads' to be set.
Windows will not be able to boot from this device.
Creating NTFS volume structures.
mkntfs completed successfully. Have a nice day.
➜ ~ mkdir test
➜ ~ sudo mount -t ntfs-3g /dev/loop1 test
➜ ~ echo hello > test/world
➜ ~ cat test/world
hello
ntfs-3g
If you're using NTFS on a non-Windows system, chances are you're using some ntfs-3g based driver. It is used along with Filesystem in USErspace (FUSE) to provide access to NTFS volumes. This arrangement consists of two parts: the FUSE kernel driver and the userspace application that links against libfuse. In this case, that is ntfs-3g.
Let's skip the mount wrapper and just ask ntfs-3g directly to mount our image:
$ pfexec ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test
fuse: mount failed: Not a directory
Alas, still no good. I guess we gotta dig.
truss
On Linux you have strace to trace system calls. On illumos there's truss:
$ pfexec truss ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test
execve("/opt/ooce/ntfs-3g/bin/ntfs-3g", 0xFFFFFC7FFFDFDE48, 0xFFFFFC7FFFDFDE68) argc = 3
sysinfo(SI_MACHINE, "i86pc", 257) = 6
[...snip...]
mount("/devices/pseudo/lofi@1:q", "/mnt/test", MS_NOSUID|MS_OPTIONSTR, "fuse", 0x00000000, 0, 0x00E63E60, 1024) Err#20 ENOTDIR
open("/usr/lib/locale/en_US.UTF-8/LC_MESSAGES/SUNW_OST_OSLIB.mo", O_RDONLY) Err#2 ENOENT
fstat(2, 0xFFFFFC7FFFDFC720) = 0
fuse: mount failed: write(2, " f u s e : m o u n t ".., 20) = 20
Not a directorywrite(2, " N o t a d i r e c t".., 15) = 15
write(2, "\n", 1) = 1
close(5) = 0
fdsync(4, FSYNC) = 0
fcntl(4, F_SETLK, 0xFFFFFC7FFFDFDBD0) = 0
close(4) = 0
_exit(21)
Hmmm, mount("/devices/pseudo/lofi@1:q", "/mnt/test", MS_NOSUID|MS_OPTIONSTR, "fuse", 0x00000000, 0, 0x00E63E60, 1024) Err#20 ENOTDIR.
That ENOTDIR error is not from ntfs-3g; in fact, we see it returns an exit code of 21, and the manual page tells us that's an "Unclassified FUSE error". The mount syscall here is what returned that ENOTDIR, and its manual page says:
ENOTDIR
The dir argument is not a directory, or a component of
a path prefix is not a directory.
Not a directory you say?
$ file /mnt/test
/mnt/test: directory
Presumably it is the fuse kernel driver which is handling the mount syscall in this case. One quick way to check: DTrace.
dtrace
DTrace on illumos offers a wealth of information on a live system. With a lot of introspection capabilities, it makes for a great debugging tool. I'm still learning to reach for it, but it works perfectly here:
$ pfexec dtrace -n 'fuse::return /arg1 == ENOTDIR && pid == $target/ { stack(); }' -c "ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test"
dtrace: description 'fuse::return ' matched 98 probes
fuse: mount failed: Not a directory
dtrace: pid 11852 has exited
CPU ID FUNCTION:NAME
1 17911 fuse_mount:return
genunix`fsop_mount+0x14
genunix`domount+0x948
genunix`mount+0xfe
genunix`syscall_ap+0x98
unix`sys_syscall+0x17d
So what did we do there? We ran dtrace (pfexec dtrace) and:
- told it what probes to match (fuse::return)
  Probes are specified as [[[provider:] module:] function:] name, where any unspecified field acts as a wildcard. We want to match the exit (return) probe of any function in the fuse kernel module.
- for any such probes, a predicate to further filter them (/arg1 == ENOTDIR && pid == $target/)
  arg1 for a return probe corresponds to its return value. Here we compare to the error we're looking for: ENOTDIR. pid == $target is to further constrain it by using the provided $target macro which refers to:
- the command we want to trace (-c "ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test")
- and what actions to take for any matches ({ stack(); })
  DTrace has a number of actions to inspect the system; here we use stack() to record and print out the kernel stack trace for our matched probes.
FUSE
FUSE seems to be the one stumbling over our supposedly "not a directory" directory. Now that dtrace was helpful enough to point out where the error comes from, let's take a look at the code. It certainly doesn't take long to find the spot:
static int
fuse_mount(struct vfs *vfsp, struct vnode *mvp, struct mounta *uap,
    struct cred *cr)
{
	fuse_vfs_data_t *vfsdata;
	fuse_session_t *se;
	dev_t dev;
	char *fdstr;
	int err;
	if (secpolicy_fs_mount(cr, mvp, vfsp) != 0)
		return (EPERM);
	if (mvp->v_type != VDIR)
		return (ENOTDIR);
Every file is allocated a vnode, and mvp here should represent the one for our mountpoint (/mnt/test). mount understandably requires you only mount things at a directory and so every file system driver should verify that is the case, just as FUSE does here. But if /mnt/test isn't a directory (VDIR), what is it?
Back to dtrace!
$ pfexec dtrace -n 'fuse_mount:entry /pid == $target/ { printf("v_type = %d", args[1]->v_type); }' -c "ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test"
dtrace: description 'fuse_mount:entry ' matched 1 probe
fuse: mount failed: Not a directory
dtrace: pid 11921 has exited
CPU ID FUNCTION:NAME
6 17910 fuse_mount:entry v_type = 2
This time we match just on entry to fuse_mount, and for an entry probe we have access to args, which allows typed access to the function arguments. In this case we print out the v_type field of the second arg (mvp = args[1]). Let's take a look at the enum definition:
typedef enum vtype {
	VNON = 0,
	VREG = 1,
	VDIR = 2,
	VBLK = 3,
	VCHR = 4,
	VLNK = 5,
	VFIFO = 6,
	VDOOR = 7,
	VPROC = 8,
	VSOCK = 9,
	VPORT = 10,
	VBAD = 11
} vtype_t;
...it's VDIR?
This is when I started questioning my sanity a little. I entertained theories of weird corruption happening between function entry and the condition check. Was secpolicy_fs_mount secretly modifying it? (No.)
Eventually I decided to look at the actual code running on my machine and used the kernel debugger to disassemble the fuse module in-memory:
$ pfexec mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci zfs sata ip hook neti sockfs arp usba xhci mm smbios stmf stmf_sbd lofs crypto random cpc ufs logindmux nsmb ptm smbsrv nfs ]
> fuse`fuse_mount::dis
[...snip...]
fuse_mount+0x28: call +0x3a21ec3 <secpolicy_fs_mount>
fuse_mount+0x2d: testl %eax,%eax
fuse_mount+0x2f: jne +0x110 <fuse_mount+0x145>
fuse_mount+0x35: cmpl $0x2,0x30(%r13)
fuse_mount+0x3a: movl $0x14,%ebx
fuse_mount+0x3f: jne +0x100 <fuse_mount+0x145>
0x14 is ENOTDIR and the cmpl $0x2,0x30(%r13) would line up with the v_type != VDIR check.
But a quick hacked-up validation of that offset does not square:
v_type is at: 0x28
How about comparing other file system drivers since they all have the same check:
NFS?
nfs_mount+0x66: call +0x3678985 <secpolicy_fs_mount>
nfs_mount+0x6b: testl %eax,%eax
nfs_mount+0x6d: jne +0x2fd <nfs_mount+0x370>
nfs_mount+0x73: cmpl $0x2,0x28(%rbx)
nfs_mount+0x77: jne +0x31b <nfs_mount+0x398>
[...snip...]
nfs_mount+0x398: movl $0x14,%eax # ENOTDIR
nfs_mount+0x39d: jmp -0x2f <nfs_mount+0x370>
It uses an offset of 0x28. We need another data point, tmpfs?
tmp_mount+0x58: call +0x3f1a993 <secpolicy_fs_mount>
tmp_mount+0x5d: testl %eax,%eax
tmp_mount+0x5f: movl %eax,%r15d
tmp_mount+0x62: jne +0xc <tmp_mount+0x70>
tmp_mount+0x64: cmpl $0x2,0x28(%rbx)
tmp_mount+0x68: movl $0x14,%r15d # ENOTDIR
tmp_mount+0x6e: je +0x30 <tmp_mount+0xa0>
It also uses an offset of 0x28.
local build
Something's definitely going on here. At this point it's looking like the fuse driver has a different idea of what the vnode struct looks like. If for some reason it was compiled without _LP64 defined then an offset of 0x30 could make sense but would certainly lead to other issues.
And this is definitely a 64-bit module:
$ file /usr/kernel/drv/amd64/fuse
/usr/kernel/drv/amd64/fuse: ELF 64-bit LSB relocatable AMD64 Version 1
This is the point where I took a detour into building the module locally. After some time trawling through build scripts I got it built. TL;DR:
wget https://mirrors.omnios.org/fuse/Version-1.4.tar.gz -O illumos-fusefs-Version-1.4.tar.gz
gtar xf illumos-fusefs-Version-1.4.tar.gz
cd illumos-fusefs-Version-1.4/kernel/amd64
PATH=$PATH:/opt/onbld/bin/i386 dmake CC=gcc CFLAGS="-fident -fno-builtin -fno-asm -nodefaultlibs -Wall -Wno-unknown-pragmas -Wno-unused -fno-inline-functions -m64 -mcmodel=kernel -g -O2 -fno-inline -ffreestanding -fno-strict-aliasing -Wpointer-arith -gdwarf-2 -std=gnu99 -mno-red-zone -D_KERNEL -D__SOLARIS__ -mindirect-branch=thunk-extern -mindirect-branch-register"
Now to check what offset our newly built driver uses:
$ objdump -D fuse | less
0000000000008040 <fuse_mount>:
[...snip...]
8069: e8 00 00 00 00 call 806e <fuse_mount+0x2e>
806e: 85 c0 test %eax,%eax
8070: 0f 85 0c 01 00 00 jne 8182 <fuse_mount+0x142>
8076: 83 7b 28 02 cmpl $0x2,0x28(%rbx)
807a: 41 be 14 00 00 00 mov $0x14,%r14d
8080: 0f 85 fc 00 00 00 jne 8182 <fuse_mount+0x142>
It's 0x28! Not 0x30! Does that mean it would work? Let's try:
# unload the current driver
$ modinfo | grep fuse
257 fffffffff800d000 d188 284 1 fuse (fuse driver)
257 fffffffff800d000 d188 28 1 fuse (filesystem for fuse)
$ pfexec modunload -i 257
# load the newly built one
$ pfexec modload ./fuse
# try mounting the image again
$ pfexec ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test
$ echo hello > /mnt/test/world
$ cat /mnt/test/world
hello
🎉 Success! 🎉
Although, in this case success just brings more questions than answers.
breakthrough
At this point I'm really confused. Using the pre-built binary package fails on Helios. Building the driver locally works on Helios.
I also gave OmniOS (a different illumos distro and the source for a lot of the build scripts used for Helios packages) a try. The pre-built packages worked there too. But it was on OmniOS that I discovered that the "bad" offset of 0x30 was actually fine!? And not just for the FUSE driver but also NFS and tmpfs.
Eventually, while trying to figure out this odd difference between Helios and OmniOS, came the breakthrough. Recall we were able to use typed arguments in our DTrace commands; that is enabled by the fact that a lot of software on illumos comes with Compressed Type Format (CTF) data. CTF is a compact representation of data types and function signatures stored inside ELF objects. It is much smaller than the DWARF it is derived from. The smaller footprint makes it easy to ship by default and enables rich use cases like DTrace.
We can use ctfdump to print out all the CTF data in our pre-built vs locally built driver and compare the vnode definitions used:
First for the local build:
$ ctfdump -t ./fuse | grep -A 8 'struct vnode ('
<208> struct vnode (216 bytes)
v_lock type=98 off=0
v_flag type=28 off=64
v_count type=28 off=96
v_data type=36 off=128
v_vfsp type=735 off=192
v_stream type=736 off=256
v_type type=230 off=320
v_rdev type=65 off=384
v_type is at offset 320 bits = 40 / 0x28 bytes, as expected. What about the pre-built:
$ ctfdump -t /usr/kernel/drv/amd64/fuse | grep -A 8 'struct vnode ('
<229> struct vnode (224 bytes)
v_lock type=103 off=0
v_flag type=28 off=64
v_count type=28 off=96
v_phantom_count type=28 off=128
v_data type=37 off=192
v_vfsp type=830 off=256
v_stream type=831 off=320
v_type type=255 off=384
Would you look at that: v_type shows an offset of 384 bits = 48 / 0x30 bytes. The even more suspicious line is this field that's not present in our local version: v_phantom_count (aptly named in this instance).
So uh, what gives? The upstream header (which we're more or less using in Helios) certainly doesn't contain it. A little searching leads us to this PR adding it to SmartOS's (another distro) illumos fork. But what's probably more relevant in this case is that it also exists in the OmniOS fork.
A couple of messages later and that more-or-less explains it: this package meant for Helios was accidentally built on an OmniOS box, which has a slightly different definition of some kernel structures.