I was setting up some automation to build Windows images pre-loaded with some drivers and software (a story for another day). I had already gotten it working with QEMU under KVM on Linux but wanted to port it to propolis on our illumos distro, Helios. I figured it should be mostly straightforward; maybe a couple of different flags or utilities to futz around with the disk images and mount them. Which was the case. Mostly. That is, except for the one minor detail of not being able to mount an NTFS image.

$ pfexec mount -F ntfs-3g $LOOPBACK_DEV /mnt/test
fuse: mount failed: Not a directory

the setup

Ok, let's step back a second. To give some context, I was trying to create a raw image that contained an NTFS partition. Maybe it didn't like the way I created the GPT? Ok, let's try something simpler: forget partitions for a moment and create nothing but an NTFS file system:

  1. Create an empty disk image:
$ qemu-img create -f raw test.img 8G
Formatting 'test.img', fmt=raw size=8589934592
  2. Create loopback device:
$ pfexec lofiadm -l -a test.img
/dev/dsk/c2t1d0p0
  3. Create an NTFS file system:
$ mkntfs -Q /dev/dsk/c2t1d0p0
The sector size was not specified for /dev/dsk/c2t1d0p0 and it could not be obtained automatically.  It has been set to 512 bytes.
The partition start sector was not specified for /dev/dsk/c2t1d0p0 and it could not be obtained automatically.  It has been set to 0.
The number of sectors per track was not specified for /dev/dsk/c2t1d0p0 and it could not be obtained automatically.  It has been set to 0.
The number of heads was not specified for /dev/dsk/c2t1d0p0 and it could not be obtained automatically.  It has been set to 0.
Cluster size has been automatically set to 4096 bytes.
To boot from a device, Windows needs the 'partition start sector', the 'sectors per track' and the 'number of heads' to be set.
Windows will not be able to boot from this device.
Creating NTFS volume structures.
mkntfs completed successfully. Have a nice day.

Ok, that's a lot of warnings (which I did handle properly in the real scenario!) but they shouldn't be relevant right now. We don't care if Windows can't boot off of this image.

  4. Mount the file system:
$ pfexec mount -F ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test
fuse: mount failed: Not a directory

Something's clearly wrong.

linux

Mind you, this same scenario using the FUSE-based ntfs-3g driver works on Linux:

➜  ~ qemu-img create -f raw test.img 8G
Formatting 'test.img', fmt=raw size=8589934592
➜  ~ sudo losetup -f --show test.img   
/dev/loop1
➜  ~ sudo mkntfs -Q /dev/loop1
The partition start sector was not specified for /dev/loop1 and it could not be obtained automatically.  It has been set to 0.
The number of sectors per track was not specified for /dev/loop1 and it could not be obtained automatically.  It has been set to 0.
The number of heads was not specified for /dev/loop1 and it could not be obtained automatically.  It has been set to 0.
Cluster size has been automatically set to 4096 bytes.
To boot from a device, Windows needs the 'partition start sector', the 'sectors per track' and the 'number of heads' to be set.
Windows will not be able to boot from this device.
Creating NTFS volume structures.
mkntfs completed successfully. Have a nice day.
➜  ~ mkdir test
➜  ~ sudo mount -t ntfs-3g /dev/loop1 test
➜  ~ echo hello > test/world
➜  ~ cat test/world 
hello

ntfs-3g

If you're using NTFS on a non-Windows system, chances are you're using some ntfs-3g-based driver. It is used along with Filesystem in USErspace (FUSE) to provide access to NTFS volumes. This arrangement consists of two parts: the FUSE kernel driver and a userspace application that links against libfuse. In this case, that application is ntfs-3g.

Let's skip the mount wrapper and just ask ntfs-3g directly to mount our image:

$ pfexec ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test
fuse: mount failed: Not a directory

Alas, still no good. I guess we gotta dig.

truss

On Linux you have strace to trace system calls. On illumos there's truss:

$ pfexec truss ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test                                                                                                                                                          
execve("/opt/ooce/ntfs-3g/bin/ntfs-3g", 0xFFFFFC7FFFDFDE48, 0xFFFFFC7FFFDFDE68)  argc = 3                                                                                                                   
sysinfo(SI_MACHINE, "i86pc", 257)               = 6
[...snip...]
mount("/devices/pseudo/lofi@1:q", "/mnt/test", MS_NOSUID|MS_OPTIONSTR, "fuse", 0x00000000, 0, 0x00E63E60, 1024) Err#20 ENOTDIR
open("/usr/lib/locale/en_US.UTF-8/LC_MESSAGES/SUNW_OST_OSLIB.mo", O_RDONLY) Err#2 ENOENT
fstat(2, 0xFFFFFC7FFFDFC720)                    = 0
fuse: mount failed: write(2, " f u s e :   m o u n t  ".., 20)  = 20
Not a directorywrite(2, " N o t   a   d i r e c t".., 15)       = 15

write(2, "\n", 1)                               = 1
close(5)                                        = 0
fdsync(4, FSYNC)                                = 0
fcntl(4, F_SETLK, 0xFFFFFC7FFFDFDBD0)           = 0
close(4)                                        = 0
_exit(21)

Hmmm, mount("/devices/pseudo/lofi@1:q", "/mnt/test", MS_NOSUID|MS_OPTIONSTR, "fuse", 0x00000000, 0, 0x00E63E60, 1024) Err#20 ENOTDIR.

That ENOTDIR error is not from ntfs-3g itself. In fact, we see ntfs-3g exit with code 21, which its manual page describes as an "Unclassified FUSE error".

The mount syscall here is what returned that ENOTDIR and its manual page says:

       ENOTDIR
                       The dir argument is not a directory, or a component of
                       a path prefix is not a directory.

Not a directory you say?

$ file /mnt/test
/mnt/test:      directory

Presumably it is the fuse kernel driver which is handling the mount syscall in this case. One quick way to check: DTrace.

dtrace

DTrace on illumos offers a wealth of information on a live system. With a lot of introspection capabilities, it makes for a great debugging tool. I'm still learning to reach for it, but it works perfectly here:

$ pfexec dtrace -n 'fuse::return /arg1 == ENOTDIR && pid == $target/ { stack(); }' -c "ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test"
dtrace: description 'fuse::return ' matched 98 probes
fuse: mount failed: Not a directory
dtrace: pid 11852 has exited
CPU     ID                    FUNCTION:NAME
  1  17911                fuse_mount:return 
              genunix`fsop_mount+0x14
              genunix`domount+0x948
              genunix`mount+0xfe
              genunix`syscall_ap+0x98
              unix`sys_syscall+0x17d

So what did we do there? We ran dtrace (pfexec dtrace) and:

  1. told it what probes to match (fuse::return)

    Probes are specified as [[[provider:] module:] function:] name, where any unspecified field acts as a wildcard.

    We want to match the exit (return) probe of any function in the fuse kernel module.

  2. for any such probes, a predicate to further filter them (/arg1 == ENOTDIR && pid == $target/)

    arg1 for a return probe corresponds to its return value. Here we compare to the error we're looking for: ENOTDIR.

    pid == $target is to further constrain it by using the provided $target macro which refers to:

  3. the command we want to trace (-c "ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test")

  4. and what actions to take for any matches ({ stack(); })

    DTrace has a number of actions for inspecting the system; here we use stack() to record and print out the kernel stack trace for our matched probes.

FUSE

FUSE seems to be the one stumbling over our supposedly "not a directory" directory. Now that dtrace was helpful enough to point out where the error comes from, let's take a look at the code.

It certainly doesn't take long to find the spot:

static int
fuse_mount(struct vfs *vfsp, struct vnode *mvp, struct mounta *uap,
    struct cred *cr)
{
	fuse_vfs_data_t	 *vfsdata;
	fuse_session_t	 *se;
	dev_t dev;
	char *fdstr;
	int err;

	if (secpolicy_fs_mount(cr, mvp, vfsp) != 0)
		return (EPERM);

	if (mvp->v_type != VDIR)
		return (ENOTDIR);

Every file is allocated a vnode, and mvp here should represent the one for our mountpoint (/mnt/test). mount understandably requires that you only mount things on a directory, so every file system driver should verify that is the case, just as FUSE does here. But if /mnt/test isn't a directory (VDIR), what is it?

Back to dtrace!

$ pfexec dtrace -n 'fuse_mount:entry /pid == $target/ { printf("v_type = %d", args[1]->v_type); }' -c "ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test"
dtrace: description 'fuse_mount:entry ' matched 1 probe
fuse: mount failed: Not a directory
dtrace: pid 11921 has exited
CPU     ID                    FUNCTION:NAME
  6  17910                 fuse_mount:entry v_type = 2

This time we match just on entry to fuse_mount and for an entry probe we have access to args, which allows typed access to the function arguments. In this case we print out the v_type field of the second arg (mvp = args[1]). Let's take a look at the enum definition:

typedef enum vtype {
	VNON	= 0,
	VREG	= 1,
	VDIR	= 2,
	VBLK	= 3,
	VCHR	= 4,
	VLNK	= 5,
	VFIFO	= 6,
	VDOOR	= 7,
	VPROC	= 8,
	VSOCK	= 9,
	VPORT	= 10,
	VBAD	= 11
} vtype_t;

...it's VDIR?

This is when I started questioning my sanity a little. Theories of weird corruption happening between function entry and the condition check. Was secpolicy_fs_mount secretly modifying it? (No.)

Eventually I decided to look at the actual code running on my machine and use the kernel debugger to disassemble the fuse module in-memory:

$ pfexec mdb -k                                                                                       
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci zfs sata ip hook neti sockfs arp usba xhci mm smbios stmf stmf_sbd lofs crypto random cpc ufs logindmux nsmb ptm smbsrv nfs ]
> fuse`fuse_mount::dis                            
[...snip...]
fuse_mount+0x28:                call   +0x3a21ec3       <secpolicy_fs_mount>
fuse_mount+0x2d:                testl  %eax,%eax
fuse_mount+0x2f:                jne    +0x110   <fuse_mount+0x145>
fuse_mount+0x35:                cmpl   $0x2,0x30(%r13)
fuse_mount+0x3a:                movl   $0x14,%ebx
fuse_mount+0x3f:                jne    +0x100   <fuse_mount+0x145>

0x14 is ENOTDIR and the cmpl $0x2,0x30(%r13) would line up with the v_type != VDIR check. But a quick hacked-up validation of that offset does not square:

v_type is at: 0x28

How about comparing against other file system drivers, since they all have the same check?

NFS?

nfs_mount+0x66:                 call   +0x3678985       <secpolicy_fs_mount>                                                                                                                                
nfs_mount+0x6b:                 testl  %eax,%eax                                                                                                                                                            
nfs_mount+0x6d:                 jne    +0x2fd   <nfs_mount+0x370>                                                                                                                                           
nfs_mount+0x73:                 cmpl   $0x2,0x28(%rbx)                                                                                                                                                      
nfs_mount+0x77:                 jne    +0x31b   <nfs_mount+0x398>
[...snip...]
nfs_mount+0x398:                movl   $0x14,%eax       # ENOTDIR
nfs_mount+0x39d:                jmp    -0x2f    <nfs_mount+0x370>

It uses an offset of 0x28. We need another data point, tmpfs?

tmp_mount+0x58:                 call   +0x3f1a993       <secpolicy_fs_mount>
tmp_mount+0x5d:                 testl  %eax,%eax
tmp_mount+0x5f:                 movl   %eax,%r15d
tmp_mount+0x62:                 jne    +0xc     <tmp_mount+0x70>
tmp_mount+0x64:                 cmpl   $0x2,0x28(%rbx)
tmp_mount+0x68:                 movl   $0x14,%r15d      # ENOTDIR
tmp_mount+0x6e:                 je     +0x30    <tmp_mount+0xa0>

It also uses an offset of 0x28.

local build

Something's definitely going on here. At this point it's looking like the fuse driver has a different idea of what the vnode struct looks like. If for some reason it was compiled without _LP64 defined then an offset of 0x30 could make sense but would certainly lead to other issues. And this is definitely a 64-bit module:

$ file /usr/kernel/drv/amd64/fuse
/usr/kernel/drv/amd64/fuse:     ELF 64-bit LSB relocatable AMD64 Version 1

This is the point I took a detour into building the module locally. After some time trawling through build scripts I got it built. TL;DR:

wget https://mirrors.omnios.org/fuse/Version-1.4.tar.gz -O illumos-fusefs-Version-1.4.tar.gz
gtar xf illumos-fusefs-Version-1.4.tar.gz
cd illumos-fusefs-Version-1.4/kernel/amd64
PATH=$PATH:/opt/onbld/bin/i386 dmake CC=gcc CFLAGS="-fident -fno-builtin -fno-asm -nodefaultlibs -Wall -Wno-unknown-pragmas -Wno-unused -fno-inline-functions -m64 -mcmodel=kernel -g -O2 -fno-inline -ffreestanding -fno-strict-aliasing -Wpointer-arith -gdwarf-2 -std=gnu99 -mno-red-zone -D_KERNEL -D__SOLARIS__ -mindirect-branch=thunk-extern -mindirect-branch-register"

Now to check what offset our newly built driver uses:

$ objdump -D fuse | less
0000000000008040 <fuse_mount>:
[...snip...]
    8069:       e8 00 00 00 00          call   806e <fuse_mount+0x2e>
    806e:       85 c0                   test   %eax,%eax
    8070:       0f 85 0c 01 00 00       jne    8182 <fuse_mount+0x142>
    8076:       83 7b 28 02             cmpl   $0x2,0x28(%rbx)
    807a:       41 be 14 00 00 00       mov    $0x14,%r14d
    8080:       0f 85 fc 00 00 00       jne    8182 <fuse_mount+0x142>

It's 0x28! Not 0x30! Does that mean it would work? Let's try:

# unload the current driver
$ modinfo | grep fuse
257 fffffffff800d000   d188 284   1  fuse (fuse driver)
257 fffffffff800d000   d188  28   1  fuse (filesystem for fuse)
$ pfexec modunload -i 257                                                                      

# load the newly built one
$ pfexec modload ./fuse

# try mounting the image again
$ pfexec ntfs-3g /dev/dsk/c2t1d0p0 /mnt/test                                                                                   
$ echo hello > /mnt/test/world
$ cat /mnt/test/world 
hello

🎉 Success! 🎉

Although, in this case success just brings more questions than answers.

breakthrough

At this point I'm really confused. Using the pre-built binary package fails on Helios. Building the driver locally works on Helios.

I also gave OmniOS (a different illumos distro and the source for a lot of the build scripts used for Helios packages) a try. The pre-built packages worked there too. But it was on OmniOS that I discovered that the "bad" offset of 0x30 was actually fine!? And not just for the FUSE driver but also NFS and tmpfs.

Eventually, while trying to figure out this odd difference between Helios and OmniOS, came the breakthrough. Recall we were able to use typed arguments in our DTrace commands; that is enabled by the fact that a lot of software on illumos comes with Compressed Type Format (CTF) data. CTF is a compact representation of data types and function signatures stored inside ELF objects. It is much smaller than the DWARF it is derived from. The smaller footprint makes it easy to ship by default and enables rich use cases like DTrace.

We can use ctfdump to print out all the CTF data in our pre-built vs locally built driver and compare the vnode definitions used:

First for the local build:

$ ctfdump -t ./fuse | grep -A 8 'struct vnode ('
  <208> struct vnode (216 bytes)
        v_lock type=98 off=0
        v_flag type=28 off=64
        v_count type=28 off=96
        v_data type=36 off=128
        v_vfsp type=735 off=192
        v_stream type=736 off=256
        v_type type=230 off=320
        v_rdev type=65 off=384

v_type is at offset 320 bits = 40 / 0x28 bytes, as expected. What about the pre-built:

$ ctfdump -t /usr/kernel/drv/amd64/fuse | grep -A 8 'struct vnode ('
  <229> struct vnode (224 bytes)
        v_lock type=103 off=0
        v_flag type=28 off=64
        v_count type=28 off=96
        v_phantom_count type=28 off=128
        v_data type=37 off=192
        v_vfsp type=830 off=256
        v_stream type=831 off=320
        v_type type=255 off=384

Would you look at that: v_type shows an offset of 384 bits = 48 / 0x30 bytes. The even more suspicious line is this field that's not present in our local version: v_phantom_count (aptly named in this instance).

So uh, what gives? The upstream header (which we're more or less using in Helios) certainly doesn't contain it. A little searching leads us to this PR adding it to SmartOS's (another distro) illumos fork. But what's probably more relevant in this case is that it also exists in the OmniOS fork.

A couple of messages later and that more-or-less explains it: this package meant for Helios was accidentally built on an OmniOS box, which has a slightly different definition of a kernel structure.

Multi-Track Drifting Meme: Train labeled "FUSE" with front wheels on track labeled "OmniOS Kernel" and back labeled "Helios Kernel". Bottom panel is close-up of manga character's eyes looking surprised/intense with action bubble to the left, "MULTI-KERNEL DRIFTING!!!"