systrace overall design

This is an overview of the systrace design. For an introduction to systrace, see Niels Provos' systrace web page, read the onlamp.com articles, or buy a book about OpenBSD that covers systrace.
This document describes the FreeBSD-specific interface; the implementations in other BSD systems are very similar.



  kernel				    .  userland   
  ---------------------------------------|  .		 
  |	      /-->  [ in kernel policy ] |  .		
  |	      |			         |  .  --------------------------
  |	      |			         |  .  |   systraced binary     |
  |	systrace_enter() <------------------.--| (p_flag & P_SYSTRACE)  |
  |	       	|  		  syscall2().  --------------------------
  |		v			 |  .		 ^
  |	systrace_msg_ask()	 	 |  .		 |
  ----------------------------------------  .		 | execvp()
  		|			    .	         |
		| 			    .  ------------------------
		|  ioctl	            .  |     /bin/systrace    |
		\----<--------------->------.->|  [ userland policy ] |
				    poll    .  ------------------------
					    .		 |  requestor_start()
					    .		 v   ( execvp() )
					    .   ---------------------
					    .	|    xsystrace	    |
					    .	---------------------
The /bin/systrace binary uses the execvp() system call to launch the traced binary. It also sets a specific flag (P_SYSTRACE) on the traced process to indicate that it is to be systraced. When the traced binary wants to make a system call, it does so via a special assembler instruction which switches the processor into kernel mode (int 0x80 on the i386 architecture, syscall on the MIPS architecture) and invokes an exception handler. (A syscall can be thought of as a special kind of exception.)
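
The launch step can be sketched in plain C. This is only an illustration: launch_and_wait() is a hypothetical helper, the fork() before execvp() is an assumption about how the launcher keeps running next to the traced program, and the systrace-specific step of marking the child with P_SYSTRACE is left out because it depends on the kernel interface.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: start argv[0] in a child process and wait for it.
 * Returns the child's exit status, or -1 on error or abnormal termination.
 */
int launch_and_wait(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: a real tracer would have the process marked as traced
         * before this point (assumption, not shown here). */
        execvp(argv[0], argv);
        _exit(127);             /* reached only if execvp() failed */
    }

    /* Parent: in systrace this is where the policy loop would run. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called with an argument vector such as { "sh", "-c", "exit 3", NULL }, the helper returns the exit status of the launched program.
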
In the kernel, the syscall handler is called. For the i386 architecture this is syscall2(), defined in /sys/i386/i386/trap.c. This function contains hooks that call systrace functions, which evaluate the policy for the syscall, decide whether it will be called, and clean up after the syscall has (or has not) been called. These hooks are systrace_enter() and systrace_exit(). More systrace hooks exist elsewhere in the kernel code, but there are only a few of them (fewer than 5).

So, the most interesting part of the syscall handler looks like this:

if (ISSET(p->p_flag, P_SYSTRACE)) 
  error = systrace_enter(p, code, args, p->p_retval);

if (error == 0) /* permit decision */
  error = (*callp->sy_call)(p, args); /* call the syscall */

if (ISSET(p->p_flag, P_SYSTRACE))
  systrace_exit(p, code, args, p->p_retval, error);
fig.2: this is just pseudo-code; see /sys/i386/i386/trap.c for the actual code.
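
To make the control flow of the hooks easier to follow, here is a small userland simulation of it. It is a sketch under stated assumptions, not kernel code: struct proc is a toy stand-in, the P_SYSTRACE value is made up, and sketch_enter(), sketch_exit() and handle_syscall() are hypothetical names with simplified signatures.

```c
#include <errno.h>
#include <stddef.h>

#define P_SYSTRACE 0x1      /* illustrative value, not the real flag bit */

/* Toy stand-in for the kernel's process structure (assumption). */
struct proc {
    int p_flag;             /* P_SYSTRACE marks a traced process */
    int p_retval;           /* return value slot of the syscall */
    int denied_code;        /* toy policy: the one syscall number to deny */
    int exit_hook_calls;    /* counts how often the exit hook ran */
};

typedef int (*sy_call_t)(struct proc *, void *);

/* Entry hook: returns 0 to permit, an errno value to deny. */
int sketch_enter(struct proc *p, int code)
{
    return (code == p->denied_code) ? EPERM : 0;
}

/* Exit hook: runs whether or not the syscall itself was called. */
void sketch_exit(struct proc *p, int code, int error)
{
    (void)code;
    (void)error;
    p->exit_hook_calls++;
}

/* Same shape as the pseudo-code: enter hook, syscall, exit hook. */
int handle_syscall(struct proc *p, int code, sy_call_t call, void *args)
{
    int error = 0;
    int traced = (p->p_flag & P_SYSTRACE) != 0;

    if (traced)
        error = sketch_enter(p, code);
    if (error == 0)             /* permit decision */
        error = call(p, args);  /* call the syscall */
    if (traced)
        sketch_exit(p, code, error);
    return error;
}

/* Example syscall body: stores 42 as its return value. */
int sys_answer(struct proc *p, void *args)
{
    (void)args;
    p->p_retval = 42;
    return 0;
}
```

In this sketch a denied syscall never reaches its body, while the exit hook still runs; an untraced process bypasses both hooks.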

The systrace kernel portion then needs to decide whether to permit or deny this particular syscall. This presumes that systrace policies are loaded. Systrace policies for a given binary (and optionally its children) consist of userland policies and kernel policies. The kernel policy set contains syscalls which don't need their arguments evaluated, so they can be decided inside the kernel without consulting userland. This is done for performance reasons.
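
One way to picture the kernel policy set is a table indexed by syscall number. The following is a minimal sketch, not the actual data structure: the action names, MAX_SYSCALL, and both functions are hypothetical, and treating out-of-range syscall numbers as denied is a choice made only for this example.

```c
#include <stddef.h>

/* Hypothetical action codes; the real systrace constants may differ. */
enum policy_action {
    POLICY_ASK = 0,     /* no in-kernel verdict: consult the userland policy */
    POLICY_PERMIT,      /* permit without looking at the arguments */
    POLICY_DENY         /* deny without looking at the arguments */
};

#define MAX_SYSCALL 512 /* assumed table size */

struct kernel_policy {
    unsigned char action[MAX_SYSCALL];  /* indexed by syscall number */
};

/* Record a verdict for one syscall number; out-of-range codes are ignored. */
void kernel_policy_set(struct kernel_policy *pol, int code,
                       enum policy_action act)
{
    if (code >= 0 && code < MAX_SYSCALL)
        pol->action[code] = (unsigned char)act;
}

/* Constant-time lookup; entries never set stay POLICY_ASK. */
enum policy_action kernel_policy_lookup(const struct kernel_policy *pol,
                                        int code)
{
    if (code < 0 || code >= MAX_SYSCALL)
        return POLICY_DENY;     /* example choice for unknown numbers */
    return (enum policy_action)pol->action[code];
}
```

A zero-initialized table asks userland about everything; each entry set to permit or deny removes one round trip to userland for that syscall.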

This decision may require a search in the userland policy. The kernel portion 'sends a message' to the userland portion via systrace_msg_ask(). This function returns only after it has received an answer from userspace or some error has occurred (e.g. the traced process has ended). In the meantime, the traced process is put to sleep.
Userspace systrace gets this message via poll() on the descriptor associated with /dev/systrace. It 'sends' the answer via an ioctl() call on the same descriptor. More on this is in kernel-userspace communication.
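
The ask/answer exchange can be imitated in userland to show its shape. This is a stand-in, not the real interface: a socketpair replaces the /dev/systrace descriptor, the answer is written back with write() instead of ioctl(), and the message structures, make_channel() and serve_one_request() are all hypothetical.

```c
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical message layouts; not the real /dev/systrace structures. */
struct ask_msg {
    int seqnr;      /* lets the asker match an answer to its question */
    int code;       /* syscall number being asked about */
};

struct answer_msg {
    int seqnr;
    int permit;     /* 1 = permit, 0 = deny */
};

/* Stand-in for opening the device: fds[0] plays the kernel side,
 * fds[1] the userland side. Returns 0 on success, -1 on error. */
int make_channel(int fds[2])
{
    return socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
}

/*
 * Wait up to timeout_ms for one question on fd and answer it.
 * Toy userland policy: permit everything except deny_code.
 * Returns 0 if a question was answered, -1 on timeout or error.
 */
int serve_one_request(int fd, int deny_code, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    struct ask_msg ask;
    struct answer_msg ans;

    if (poll(&pfd, 1, timeout_ms) <= 0 || !(pfd.revents & POLLIN))
        return -1;
    if (read(fd, &ask, sizeof(ask)) != (ssize_t)sizeof(ask))
        return -1;

    ans.seqnr = ask.seqnr;
    ans.permit = (ask.code != deny_code);

    /* The real tool would answer with ioctl(); write() is the stand-in. */
    if (write(fd, &ans, sizeof(ans)) != (ssize_t)sizeof(ans))
        return -1;
    return 0;
}
```

The asking side blocks until the answer arrives, which mirrors the traced process sleeping inside systrace_msg_ask() until userland has decided.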

If the syscall is permitted, the kernel portion in systrace_enter() will, if needed, rewrite its arguments via systrace_replace() and after that elevate privileges via systrace_seteuid() and systrace_setegid().


Vladimir Kotal <vlada--at--devnull.cz>
$Id: overall-design.html,v 1.3 2003/12/18 09:15:44 techie Exp $