
added PaX docs from http://pax.grsecurity.net/docs/

Signed-off-by: Oliver Pinter <oliver.pntr@gmail.com>
commit 0ee5482c0c by Oliver Pinter

18 changed files with 2676 additions and 0 deletions

 1. +208 -0  aslr.txt
 2.  +40 -0  emusigrt.txt
 3.  +75 -0  emutramp.txt
 4. +131 -0  mprotect.txt
 5.  +72 -0  noexec.txt
 6. +145 -0  pageexec.txt
 7. +423 -0  pax-future.txt
 8. +141 -0  pax.txt
 9.  +79 -0  paxteam-on-kaslr.txt
10. +126 -0  randexec.txt
11.  +73 -0  randkstack.txt
12.  +46 -0  randmmap.txt
13.  +48 -0  randustack.txt
14.  +64 -0  segmexec.txt
15.   +8 -0  uderef-SMAP.txt
16.  +74 -0  uderef-amd64.txt
17. +559 -0  uderef.txt
18. +364 -0  vmmirror.txt

aslr.txt (+208, -0)

1. Design

The goal of Address Space Layout Randomization is to introduce randomness
into addresses used by a given task. This will make a class of exploit
techniques fail with a quantifiable probability and also allow their
detection since failed attempts will most likely crash the attacked task.

To help understand the ideas behind ASLR, let's look at an example task
and its address space: we made a copy of /bin/cat into /tmp/cat then
disabled all PaX features on it and executed "/tmp/cat /proc/self/maps".
The [x] marks are not part of the original output, we use them to refer
to the various lines in the explanation (note that the VMMIRROR document
contains more examples with various PaX features active).

[1] 08048000-0804a000 R+Xp 00000000 00:0b 812 /tmp/cat
[2] 0804a000-0804b000 RW+p 00002000 00:0b 812 /tmp/cat
[3] 40000000-40015000 R+Xp 00000000 03:07 110818 /lib/ld-2.2.5.so
[4] 40015000-40016000 RW+p 00014000 03:07 110818 /lib/ld-2.2.5.so
[5] 4001e000-40143000 R+Xp 00000000 03:07 106687 /lib/libc-2.2.5.so
[6] 40143000-40149000 RW+p 00125000 03:07 106687 /lib/libc-2.2.5.so
[7] 40149000-4014d000 RW+p 00000000 00:00 0
[8] bfffe000-c0000000 RWXp fffff000 00:00 0

As we can see, /tmp/cat is a dynamically linked ELF executable whose address
space contains several file mappings.

[1] and [2] correspond to the loadable ELF segments of /tmp/cat containing
code and data (both initialized and uninitialized), respectively.

[3] and [4] represent the dynamic linker whereas [5], [6] and [7] are the
segments of the C runtime library ([7] holds its uninitialized data that is
big enough to not fit into the last page of [6]).

[8] is the stack which grows downwards.

There are other mappings as well that this simple example does not show us:
the brk() managed heap that would directly follow [2] and various anonymous
and file mappings that the task can create via mmap() and would be placed
between [7] and [8] (unless an explicit mapping address outside this region
was requested using the MAP_FIXED flag).

For our purposes all these possible mappings can be split into three groups:
- [1], [2] and the brk() managed heap following them,
- [3]-[7] and all the other mappings created by mmap(),
- [8], the stack.

The mappings in the first and last groups are established during execve()
and do not move (only their size can change) whereas the mappings in the
second group may come and go during the lifetime of the task. Since the
base addresses used to map each group are not related to each other, we can
apply a different amount of randomization to each. This also has the benefit
that whenever a given attack technique needs advance knowledge of addresses
from more than one group, the attacker will likely have to guess or brute
force all entropies at once, which further reduces the chances of success.

Let's analyze now the (side) effects of ASLR. For our purposes the most
important effect is on the class of exploit techniques that need advance
knowledge of certain addresses in the attacked task, such as the address
of the current stack pointer or libraries. If there is no way to exploit
a given bug to divulge information about the attacked task's randomized
address space layout then there is only one way left to exploit the bug:
guessing or brute forcing the randomization.

Guessing occurs when the randomization applied to a task changes in every
attacked task in an unpredictable manner. This means that the attacker
cannot learn anything of future randomizations and has the same chance of
succeeding in each attack attempt. Brute forcing occurs when the attacker
can learn something about future randomizations and build that knowledge
into his attack. In practice brute forcing can be applied to bugs that are
in network daemons that fork() on each connection since fork() preserves
the randomized layout, as opposed to execve() which replaces it with a new
one. This distinction between the attack methods becomes meaningless if the
system has monitoring and reaction mechanisms for program crashes because
the reaction can then be triggered at such low levels that the two attack
methods will have practically the same (im)probability to succeed.

To quantify the above statements about probability of success, let's first
introduce a few variables:

Rs: number of bits randomized in the stack area,
Rm: number of bits randomized in the mmap() area,
Rx: number of bits randomized in the main executable area,
Ls: least significant randomized bit position in the stack area,
Lm: least significant randomized bit position in the mmap() area,
Lx: least significant randomized bit position in the main executable area,
As: number of bits of stack randomness attacked in one attempt,
Am: number of bits of mmap() randomness attacked in one attempt,
Ax: number of bits of main executable randomness attacked in one attempt.

For example, for i386 we have Rs = 24, Rm = 16, Rx = 16, Ls = 4, Lm = 12
and Lx = 12 (e.g., the stack addresses have 24 bits of randomness in bit
positions 4-27 leaving the least and most significant four bits unaffected
by randomization). The number of attacked bits represents the fact that in
a given situation more than one bit at a time can be attacked (obviously
A <= R), e.g., by duplicating the attack payload multiple times in memory
one can overcome the least significant bits of the randomization.

The probabilities of success within x number of attempts are given by the
following formulae (for guessing and brute forcing, respectively):

(1) Pg(x) = 1 - (1 - 2^-N)^x, 0 <= x
(2) Pb(x) = x / 2^N, 0 <= x <= 2^N

where N = Rs-As + Rm-Am + Rx-Ax, the number of randomized bits to find.
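
For illustration, the following small C program (not part of PaX, just a
sketch of the formulae above) evaluates both; the sample values reproduce
the N = 16 entries of the tables below (e.g., attacking the i386 stack
randomness with As = 8 bits at a time gives N = Rs - As = 16):

    #include <math.h>
    #include <stdio.h>

    /* Pg(x): chance that at least one of x independent guesses hits
       the right one of 2^N equally likely layouts */
    static double Pg(int N, double x)
    {
        return 1.0 - pow(1.0 - pow(2.0, -N), x);
    }

    /* Pb(x): chance of success within x attempts when failed attempts
       rule out layouts (valid for 0 <= x <= 2^N) */
    static double Pb(int N, double x)
    {
        return x / pow(2.0, N);
    }

    int main(void)
    {
        printf("Pg(2^14) = %.2f\n", Pg(16, 1 << 14)); /* ~0.22 */
        printf("Pb(2^14) = %.2f\n", Pb(16, 1 << 14)); /* 0.25  */
        return 0;
    }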

Based on the above the following tables summarize the probabilities of
success as a function of how many bits are tried in one attempt and the
number of attempts.

Pg(x)|                                  x
-----+----------------------------------------------------------------------
    N|    1    4   16   64  256 2^10 2^14 2^18 2^20 2^24 2^32 2^40 2^56 2^64
-----+----------------------------------------------------------------------
    1| 0.50 0.94   ~1   ~1   ~1   ~1   ~1   ~1   ~1   ~1   ~1   ~1   ~1   ~1
    2| 0.25 0.68 0.99   ~1   ~1   ~1   ~1   ~1   ~1   ~1   ~1   ~1   ~1   ~1
    4| 0.06 0.23 0.64 0.98   ~1   ~1   ~1   ~1   ~1   ~1   ~1   ~1   ~1   ~1
    8|   ~0 0.02 0.06 0.22 0.63 0.98   ~1   ~1   ~1   ~1   ~1   ~1   ~1   ~1
   16|   ~0   ~0   ~0   ~0   ~0 0.02 0.22 0.98   ~1   ~1   ~1   ~1   ~1   ~1
   24|   ~0   ~0   ~0   ~0   ~0   ~0   ~0 0.02 0.06 0.63   ~1   ~1   ~1   ~1
   32|   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0 0.63   ~1   ~1   ~1
   40|   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0 0.63   ~1   ~1
   56|   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0 0.63   ~1

Pb(x)|                                  x
-----+----------------------------------------------------------------------
    N|    1    4   16   64  256 2^10 2^14 2^18 2^20 2^24 2^32 2^40 2^56 2^64
-----+----------------------------------------------------------------------
    1| 0.50
    2| 0.25    1
    4| 0.06 0.25    1
    8|   ~0 0.02 0.06 0.25    1
   16|   ~0   ~0   ~0   ~0   ~0 0.02 0.25
   24|   ~0   ~0   ~0   ~0   ~0   ~0   ~0 0.02 0.06    1
   32|   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0    1
   40|   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0    1
   56|   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0   ~0    1

It is obvious that from the defense point of view the goal would be to make
N as high as possible while keeping x as low as possible. Unfortunately N is
not under the control of the defense side (re-randomizing the address space
at runtime is not feasible because part of the necessary relocation
information is simply lost); rather, it depends on the nature of the bug and
the attacker's exploit skills. What we know is that there has been very little research
done and published on countering ASLR so far (e.g., it is unknown how certain
real-life bugs such as stack or heap overflows can be used for information
leaking in a general way). Reducing N is possible if an attacker can store
multiple instances of the attack payload (e.g., stack frame chain if NOEXEC
is active, or some shellcode when it is not) in the attacked task's address
space. This would typically be possible by exploiting an overflow style bug
where the attacker can fill a contiguous range of memory with data of his
choice. As the size of this memory range grows above the value of L relevant
to the given range, more and more randomized bits can be ignored in the
attack payload. For example, to overcome all of R for a given range on i386,
the attacker would have to send 256 MB of data, something that is not always
possible (e.g., the stack typically has a maximum limit of 8 MB and grows to
much less in practice).

It is also unknown how bugs that have been neglected so far can be used
against ASLR, that is, bugs that give only read access (vs. write) to the
attacker and were not considered as a serious security problem before but
may now be used to help counter ASLR in an exploit attempt of another bug.

On the other hand however the defense side has quite some control over the
value of x: whenever an attack attempt makes a wrong guess on the randomized
bits, the attacked application will go into a state that will likely result
in a crash and hence becomes detectable by the kernel. It is therefore a
good strategy to use a crash detection and reaction mechanism together with
ASLR (PaX itself contains no such mechanism).

The last set of side effects of ASLR is address space fragmentation and
entropy pool exhaustion. Since randomization shifts entire ranges of memory,
it will also randomly change the gaps between them (which were constant
before). This in turn will change the maximum size of memory mappings that
will fit in there and applications expecting to be able to create them will
fail. Finally, ASLR increases the consumption of the system's entropy pool
since every task creation (through the execve() system call) requires some
bits of randomness to determine the new address space layout. Depending on
the system's threat model however a given implementation can relax the
requirements for the quality of this entropy. In particular, if only remote
attacks are considered, then ASLR does not need cryptographically secure
random bits as a remote attacker cannot observe them (or if he can, he does
not need to care about ASLR at all).


2. Implementation

PaX can apply ASLR to tasks that are created from ELF executables and
use ELF libraries. The randomized layout is determined at task creation
time in the load_elf_binary() function in fs/binfmt_elf.c where three
per-task (or more precisely, per-mm_struct) variables are initialized with
random numbers: delta_exec, delta_mmap and delta_stack.
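
As a sketch, using the i386 figures from section 1 (Rs = 24, Ls = 4,
Rm = Rx = 16, Lm = Lx = 12), the initialization amounts to something like
the following, where pax_get_random_long() stands in for whatever
randomness source the kernel uses and the masks/shifts are illustrative:

    /* 16 bits of randomness in bit positions 12-27 */
    current->mm->delta_exec  = (pax_get_random_long() & 0xFFFFUL)   << 12;
    current->mm->delta_mmap  = (pax_get_random_long() & 0xFFFFUL)   << 12;
    /* 24 bits of randomness in bit positions 4-27 */
    current->mm->delta_stack = (pax_get_random_long() & 0xFFFFFFUL) << 4;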

The following list specifies which ASLR feature affects which part of
the task address space layout (they are discussed in detail in separate
documents):

RANDEXEC/RANDMMAP (delta_exec)  - main executable code/data/bss segments
RANDEXEC/RANDMMAP (delta_exec)  - brk() managed memory (heap)
RANDMMAP (delta_mmap)           - mmap() managed memory (libraries, heap,
                                  thread stacks, shared memory)
RANDUSTACK (delta_stack)        - user stack
RANDKSTACK                      - kernel stack (not part of the task's
                                  address space)

The main executable and the brk() managed heap can be randomized in two
different ways depending on the file format of the main executable: if it
is an ET_EXEC ELF file, then RANDEXEC can be applied to it; if it is an
ET_DYN ELF file, then RANDMMAP can.

emusigrt.txt (+40, -0)

1. Design

The goal of EMUSIGRT is to automatically emulate instruction sequences that
the kernel generates for the signal return stubs.

While EMUTRAMP allows one to enable certain instruction sequence emulations
on a per-task basis, there are some situations where this is not sufficient
or practical (libc does not use a restorer, many applications are statically
linked against such a libc, etc). EMUSIGRT solves this problem by bypassing
EMUTRAMP when the conditions are right. These conditions are established
to limit the security hole that arises from automatic emulation (it is
possible in an attack to simulate the signal stack and cause an arbitrary
change in the task's registers).

What we can verify before proceeding with the emulation is that the signal
stack has a valid signal number (which the kernel puts there before it
dispatches a signal to userland, so it must be there upon return as well)
and that the task has actually established a signal handler for the given
signal (otherwise the kernel would not have delivered the signal in the
first place and hence the task could not have executed a signal return
trampoline; in this case we will require EMUTRAMP for emulation). The last
check we can do is the consistency between the type of the signal return
trampoline and that of the signal handler (for historical reasons Linux
has two of them, one supports real-time signals whereas the legacy one
does not).


2. Implementation

Emulation is implemented by pax_handle_fetch_fault() in arch/i386/mm/fault.c
where both the kernel signal return stubs and the gcc nested function
trampolines are recognized and emulated. EMUSIGRT changes the former only
by retrieving the signal number from the userland stack and then verifying
that it is a valid signal number for which the task has a signal handler:
the signal number must be in the range of [1,_NSIG] and cannot be one for
which userland cannot establish a signal handler (and consequently the
kernel never delivers to userland). Next we look up the signal handler and
verify that it is neither the default nor ignored (if it is, then we will
check for EMUTRAMP before proceeding with the emulation) and that it is of
the right type (the SA_SIGINFO flag differentiates between the two types).
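
A schematic C version of these checks (names and calling convention are
illustrative, the real code is part of pax_handle_fetch_fault(); a 0 return
here means falling back to the EMUTRAMP check):

    #include <signal.h>

    static int emusigrt_ok(int sig, int rt_stub, const struct sigaction *sa)
    {
        if (sig < 1 || sig > _NSIG)
            return 0;                   /* bogus signal number on the stack */
        if (sig == SIGKILL || sig == SIGSTOP)
            return 0;                   /* userland cannot catch these */
        if (sa->sa_handler == SIG_DFL || sa->sa_handler == SIG_IGN)
            return 0;                   /* no handler established */
        if (!!(sa->sa_flags & SA_SIGINFO) != !!rt_stub)
            return 0;                   /* stub/handler type mismatch */
        return 1;
    }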

emutramp.txt (+75, -0)

1. Design

The goal of EMUTRAMP is to emulate certain instruction sequences that are
known to be generated at runtime in otherwise non-executable memory regions
and hence would cause task termination when the non-executable feature is
enforced (by PAGEEXEC or SEGMEXEC).

While there are many sources of runtime generated code, PaX emulates only
two of them: the kernel generated signal return stubs and the gcc nested
function trampolines. We chose to emulate them and only them because they
are the most common and still the shortest and best defined sequences. The
other reason is that runtime code generation is by its nature incompatible
with PaX's PAGEEXEC/SEGMEXEC and MPROTECT features; therefore the real
solution lies not in emulation but in designing a kernel API for runtime
code generation and modifying userland to make use of it.

Whenever a task attempts to execute code from a non-executable page, the
PAGEEXEC or SEGMEXEC logic raises a page fault and therefore the page fault
handler is the place where EMUTRAMP can carry out its job. Since the code
to be emulated is architecture specific, EMUTRAMP is architecture dependent
as well (for now it exists for IA-32 only).

The kernel signal return stubs are generated by the kernel itself before
a signal is dispatched to the userland signal handler. There are two stubs
for returning from normal (old style) and real-time signal (new style)
handlers, respectively. By default the kernel puts all information for the
userland signal handler on the userland stack therefore the signal return
stubs end up there as well, hence the need for emulating them. It is worth
noting however that userland can tell the kernel to rely on userland
supplied signal return code (the SA_RESTORER flag) and glibc 2.1+ actually
makes use of this feature and therefore applications linking against it
do not require this emulation.

The gcc nested function trampolines implement a gcc specific C extension
where one can define a function within another one (this is the nesting).
Since the inner function has access to all the local variables of the
outer one, the compiler has to pass an extra parameter (pointer to the
outer function's stack frame) to such a nested function. This becomes a
challenge when one takes the address of the nested function and calls it
later via a function pointer (while the outer function's stack frame still
exists of course). In this case the address of the function body does not
fully identify the function since the inner function also needs access to
the outer function's stack frame. The gcc solution is that whenever the
address of a nested function is taken, gcc generates a small stub on the
stack (in the stack frame of the outer function) and uses the stub's
address as the function pointer. The stub then passes the current frame's
address to the actual inner function body. While this is pretty much the
only generic solution for implementing function pointers to nested
functions, there is a better one for a special (but likely the most often
encountered) situation: if the outer function is not called recursively
then we know that there's at most one instance of the stack frame of the
outer function and therefore its address can be stored in a normal (per
thread in case of a multithreaded program) static variable and referenced
from within the nested function's body - no need to generate the trampoline
stub at runtime.
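
A minimal example of the construct in GNU C (taking the address of the
nested function below is exactly what forces gcc to emit the on-stack
trampoline):

    #include <stdio.h>

    static void apply(int (*fn)(int), int arg)
    {
        printf("%d\n", fn(arg));
    }

    int main(void)
    {
        int base = 41;

        int add_base(int x)         /* nested function, sees 'base' */
        {
            return base + x;
        }

        apply(add_base, 1);         /* add_base points to a stack stub */
        return 0;
    }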


2. Implementation

EMUTRAMP is implemented by the pax_handle_fetch_fault() function in
arch/i386/mm/fault.c. This function is called during page fault processing
when an instruction fetch attempt from a non-executable page is detected.
Each signal return stub and nested function trampoline is recognized by
checking for their instruction sequence signature and ensuring that some
additional conditions (e.g., register contents) are met as well. Emulation
itself is very simple, we only modify the userland register context which
was saved by the kernel and will be restored when it returns to userland.
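
For reference, a user-space flavoured sketch of the signature matching; the
byte sequences are the well-known i386 stubs (0x77 and 0xad being the
sigreturn and rt_sigreturn system call numbers), while the real code must
of course fetch the bytes from userland safely:

    #include <string.h>

    static int is_sigreturn_stub(const unsigned char *ip)
    {
        /* popl %eax; movl $__NR_sigreturn,%eax; int $0x80 */
        static const unsigned char sig[] =
            { 0x58, 0xb8, 0x77, 0x00, 0x00, 0x00, 0xcd, 0x80 };
        /* movl $__NR_rt_sigreturn,%eax; int $0x80 */
        static const unsigned char rt_sig[] =
            { 0xb8, 0xad, 0x00, 0x00, 0x00, 0xcd, 0x80 };

        return !memcmp(ip, sig, sizeof sig) ||
               !memcmp(ip, rt_sig, sizeof rt_sig);
    }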

The signal return stub emulation requires a bit of extra handling because
it wants to invoke a system call at the end and the task has already entered
the kernel (it is processing a page fault). Whenever EMUTRAMP is enabled,
the low-level page fault handler stub in arch/i386/kernel/entry.S can be
told to simulate a system call entry into the kernel by simply restoring
the kernel stack to the point before the page fault processing began then
jumping to the system call processing stub.

mprotect.txt (+131, -0)

1. Design

The goal of MPROTECT is to help prevent the introduction of new executable
code into the task's address space. This is accomplished by restricting the
mmap() and mprotect() interfaces.

The restrictions prevent
- creating executable anonymous mappings
- creating executable/writable file mappings
- making an executable/read-only file mapping writable except for performing
relocations on an ET_DYN ELF file (non-PIC shared library)
- making a non-executable mapping executable

To understand the restrictions consider the writability/executability of a
mapping as state information. This state is stored in the vma structure in
the vm_flags field and determines whether the given area (and consequently
each page covered by it) is currently writable/executable and/or can be
made writable/executable (by using mprotect() on the area). The flags that
describe each attribute are: VM_WRITE, VM_EXEC, VM_MAYWRITE and VM_MAYEXEC.

These four attributes mean that any mapping (vma) can be in 16 different
states (for our discussion at least, we ignore the other attributes here),
and our goal can be achieved by restricting what state a vma can be in or
change to throughout its lifetime.

Introducing new executable code into a mapping is impossible in any of the
following ('good') states:
VM_WRITE
VM_MAYWRITE
VM_WRITE | VM_MAYWRITE
VM_EXEC
VM_MAYEXEC
VM_EXEC | VM_MAYEXEC

In every other state it is either possible to directly write new executable
code into the mapping or the mapping can be changed by mprotect() so that
it becomes writable/executable.

Note that the default kernel behaviour does already prevent certain states
(in particular, a mapping cannot have VM_WRITE and VM_EXEC without also
having VM_MAYWRITE and VM_MAYEXEC, respectively) so this leaves us with
4 good states:
VM_MAYWRITE
VM_MAYEXEC
VM_WRITE | VM_MAYWRITE
VM_EXEC | VM_MAYEXEC
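
The rule behind the good states can be written down compactly; a sketch in
C using the vm_flags bit values from the Linux headers (a state is good
exactly when write-type and exec-type capabilities are not mixed):

    #define VM_WRITE    0x00000002UL
    #define VM_EXEC     0x00000004UL
    #define VM_MAYWRITE 0x00000020UL
    #define VM_MAYEXEC  0x00000040UL

    static int is_good_state(unsigned long vm_flags)
    {
        unsigned long w = vm_flags & (VM_WRITE | VM_MAYWRITE);
        unsigned long x = vm_flags & (VM_EXEC  | VM_MAYEXEC);

        /* no way to have or gain both writability and executability */
        return !(w && x);
    }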

Let's see now what kind of mappings the kernel creates and what MPROTECT
has to change in them:

- anonymous mappings (stack, brk() and mmap() controlled heap): these
are created in the VM_WRITE | VM_EXEC | VM_MAYWRITE | VM_MAYEXEC state
which is not a good state. Since these mappings have to be writable, we
can only change the executable status (this will still break real life
applications, see later what could be done about them), MPROTECT simply
changes their state to VM_WRITE | VM_MAYWRITE,

- shared memory mappings: these are created in the VM_WRITE | VM_MAYWRITE
state which is a good state,

- file mappings: similarly to anonymous mappings, these can be created in
all the bad states (list omitted for brevity), in particular the kernel
grants VM_MAYWRITE | VM_MAYEXEC to any mapping regardless of what rights
were requested. In order to break as few applications as possible yet
still achieve our goal, we decided to use the following states for file
mappings:

- VM_WRITE | VM_MAYWRITE or VM_MAYWRITE if PROT_WRITE was requested at
mmap() time

- VM_EXEC | VM_MAYEXEC if PROT_WRITE was not requested.

Effectively executable mappings are forced to be non-writable and writable
mappings are forced to be non-executable (and it is impossible to change
this state during their existence). There is one exception to this
which is needed in order for the dynamic linker to be able to perform
relocations on the executable segment of non-PIC ELF files. If one can
ensure that no such libraries exist on his system (libraries should be
PIC anyway), then this exception can be removed. Note that the ET_DYN ELF
executables suggested for use under RANDMMAP should also be PIC (for this
one needs a PIC version of crt1.o however).

The above restrictions ensure that the only way to introduce executable
code into a task's address space is by mapping a file into memory while
requesting PROT_EXEC as well. For an attacker it means that he has to be
able to create/write to a file on the target system before he can mmap()
it into the attacked task's address space. There are various ways of
preventing/detecting such venues of attack but they are beyond the scope
of the PaX project.

As mentioned before, the MPROTECT restrictions break existing applications
that rely on the bad vma states. Most often this means the non-executable
anonymous mappings as they are used for satisfying higher-level memory
allocation requests (such as the malloc() family in C) and are assumed to
be executable (java, gcc trampolines, etc). One way of allowing such
applications to work under MPROTECT would be to extend the mmap() interface
and allow setting the VM_MAY* flags to certain states. The following
example demonstrates how an application would make use of this change:

- mmap(..., PROT_READ | PROT_WRITE | PROT_MAYREAD | PROT_MAYEXEC, ...)
- generate code into the above area
- mprotect(..., PROT_READ | PROT_EXEC)

Note that PROT_EXEC is neither requested nor allowed in the initial mmap()
call therefore application programmers are forced to call mprotect()
explicitly and hence cannot accidentally violate the MPROTECT policy.
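
Put into C, the proposed sequence would look roughly as follows; note that
the PROT_MAY* flags do not exist in stock Linux, so their values below are
mere placeholders for the proposed interface:

    #include <string.h>
    #include <sys/mman.h>

    #define PROT_MAYREAD 0x10    /* hypothetical */
    #define PROT_MAYEXEC 0x40    /* hypothetical */

    void *emit_code(const void *code, size_t len)
    {
        /* writable now, executable only after the explicit mprotect() */
        void *p = mmap(NULL, len,
                       PROT_READ | PROT_WRITE | PROT_MAYREAD | PROT_MAYEXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return NULL;
        memcpy(p, code, len);                     /* generate the code */
        if (mprotect(p, len, PROT_READ | PROT_EXEC)) {
            munmap(p, len);
            return NULL;
        }
        return p;
    }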


2. Implementation

The first two restrictions are implemented in do_mmap_pgoff() and do_brk()
in mm/mmap.c while the other two are in sys_mprotect() in mm/mprotect.c
(non-PIC ELF libraries are handled by pax_handle_maywrite()).

Since MPROTECT makes sense only when non-executable pages are enforced as
well, the restrictions are enabled only when either of PAGEEXEC or SEGMEXEC
is enabled for the given task. Furthermore some of the restrictions are
already meaningful/necessary for enforcing just the non-executable pages,
therefore they are applied even if MPROTECT itself is not enabled (but
enabling MPROTECT is necessary to complete the feature).

The special case of allowing non-PIC ELF relocations is managed by
pax_handle_maywrite() in mm/mprotect.c. The logic is quite straightforward,
first we verify that the mapping for which PROT_WRITE was requested is a
candidate for relocations (it has to be an executable file mapping that has
not yet been made writable) then we check that the backing file is an
ET_DYN ELF file whose dynamic table has an entry showing the need for text
relocations. If it is to be allowed we simply change the mapping state so
that the rest of the do_mprotect() logic will allow the request, and we also
set the VM_MAYNOTWRITE flag that will disallow further PROT_WRITE requests
on the mapping.

noexec.txt (+72, -0)

1. Design

The goal of NOEXEC is to prevent the injection and execution of code into a
task's address space and render this exploit technique unusable under PaX.

There are two ways of introducing new executable code into a task's address
space: creating an executable mapping or modifying an already existing
writable/executable mapping. The first method is not handled by PaX; we
believe that various access control systems should manage it. The second
method can be stopped if we prevent the creation of writable/executable
mappings altogether. This also implies that we can have proper executable
semantics for mappings (i.e., one that is not coupled to the readability
of the mapping).

While having the executable semantics is not a security feature itself, it
plays an important role in the PaX strategy because it allows the separation
of the writable and executable properties on memory pages (in most systems
if a page is writable it is also readable and hence executable). This in
turn makes a number of approaches possible that otherwise would be easy to
defeat.

From another point of view NOEXEC is a form of least privilege enforcement.
In our case the idea is that if some data in a task's address space does
not need to be executable then it should not be, hence we need the ability
to mark such data (pages) non-executable. Taking this idea further, if an
application does not need to generate code at runtime then it should not
be able to, therefore we need the ability to prevent state transitions on
memory pages between writability and executability (if an application does
need the ability to generate code at runtime, then these restrictions cannot
be applied to it nor can we then make guarantees about exploit techniques
that are unusable against its bugs).

The first feature NOEXEC implements is the executable semantics on memory
pages. On architectures where the Memory Management Unit (MMU) has direct
support for this we can trivially make use of it. The main and (so far)
only exception is IA-32 where we have to resort to some tricks to get true
non-executable pages. The two approaches PaX has are based on the paging
and the segmentation logic of IA-32 and make various tradeoffs between
performance and usability (the other architectures make no such tradeoff).

The second feature of NOEXEC is making the kernel actually use the newly
available executable semantics. In particular, PaX makes the stack and the
heap (all anonymous mappings in general) non-executable. Furthermore, ELF
file mappings are created with the requested access rights, that is, only
ELF segments holding code will actually be executable.

The last feature of NOEXEC is the locking down of permissions on memory
pages. This means that we prevent the creation of writable/executable file
mappings (anonymous mappings are already made non-executable) and we also
refuse to turn a writable or non-executable mapping into an executable one
and vice versa. Since this lockdown breaks real life applications that do
need to generate code at runtime, PaX allows these restrictions to be
relaxed on a per-executable basis (along with other features).


2. Implementation

The Linux implementation of NOEXEC is split into two main feature sets: the
actual non-executable page implementations (PAGEEXEC and SEGMEXEC, the latter
is IA-32 only) and the page protection restrictions (MPROTECT).

PAGEEXEC uses the paging logic of the CPU whereas SEGMEXEC uses the IA-32
specific segmentation logic to implement the non-executable page semantics
(the exact details are discussed in separate documents). There is also a set
of changes that actually makes the kernel use the non-executable page feature
where it makes sense, e.g., the stack or the brk() managed heap - this is
implemented by simply modifying the appropriate constants to not request
executable memory for these areas.

Since page protection rights originate from mmap() and can be changed by
mprotect(), they are modified to enforce the restrictions, the details are
described in the MPROTECT document.
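
The effect of these changes is easy to probe from userland; the following
sketch assumes a PaX kernel with NOEXEC and MPROTECT active, where the
PROT_EXEC request on the anonymous mapping is dropped and the indirect
call therefore terminates the task:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        unsigned char ret = 0xc3;                 /* x86 'ret' */
        void (*fn)(void);
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return 1;
        memcpy(p, &ret, sizeof ret);
        fn = (void (*)(void))p;
        puts("calling into an anonymous mapping...");
        fn();                                     /* faults under NOEXEC */
        puts("returned: executable semantics not enforced");
        return 0;
    }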

pageexec.txt (+145, -0)

1. Design

The goal of PAGEEXEC is to implement the non-executable page feature using
the paging logic of IA-32 based CPUs.

Traditionally page protection is implemented by using the features of the
CPU Memory Management Unit. Unfortunately IA-32 lacks the hardware support
for execution protection, i.e., it is not possible to directly mark a page
as executable/non-executable in the paging related structures (the page
directory (pde) and table entries (pte)). What still makes it possible to
implement non-executable pages is the fact that from the Pentium series on
the Intel CPUs have a split Translation Lookaside Buffer for code and data
(AMD CPUs have had a split TLB since the K5 series; however, due to its
organization it is usable for our purposes only from the K7 core based
CPUs on).

The role of the TLB is to act as a cache for virtual/physical address
translations that the CPU has to perform for every single memory access (be
that instruction fetch or data read/write). Without the TLB the CPU would
have to perform an expensive page table walk operation for every such
memory access and obviously that would be detrimental to performance.

The TLB operates in a simple manner: whenever the CPU wants to access a
given virtual address, it will first check whether the TLB has a cached
translation or not. On a TLB hit it will take the physical address directly
from the TLB, otherwise it will perform a page table walk to look up the
required translation and cache the result in the TLB as well (if the page
table walk is unable to find the translation or the result is in conflict
with the access type, e.g., a write to a read-only page, then the CPU will
instead raise a page fault exception). Note that hardware assisted page
table walking and automatic TLB loading are features specific to IA-32;
other CPUs may have or need software assistance in this operation. Since
the TLB has a finite size, sooner or later it becomes full and the CPU will
have to purge entries to make room for new translations (on IA-32 this is
again automatically done in hardware). Software can also purge TLB entries
by either removing all translations (e.g., whenever a userland context
switch happens) or those corresponding to a specific virtual address.

As mentioned already, from the Pentium on Intel CPUs have a split TLB, that
is, virtual/physical translations are cached in two independent TLBs
depending on the access type: instruction fetch related memory accesses will
load the ITLB, everything else loads the DTLB (if both kinds of accesses are
made to a page then both TLBs will have an entry). TLB entry replacement
works also on a per TLB basis except for the software initiated purges which
act on both.

The above described TLB behaviour means that software has explicit control
over ITLB/DTLB loading: it can get notified on hardware TLB load attempts
if it sets up the page tables so that such attempts will fail and trigger
a page fault exception, and it can also initiate a TLB load by making the
appropriate memory access to have the CPU walk the page tables and load one
of the TLBs. This in turn is the key to implement non-executable pages:
such pages can be marked either as non-present or requiring supervisor level
access in the page tables hence userland memory accesses would raise a page
fault. The page fault handler can then decide whether it was an instruction
fetch attempt (by comparing the fault address to that of the instruction
that raised the fault) or a legitimate data access. In the former case we
will have detected an execution attempt in a non-executable page and can
act accordingly (terminate the task), in the latter case we can just change
the affected page table entry temporarily to allow user level access and
have the CPU load it into the DTLB (we will of course have to restore the
page table entry to the old state so that further page table walks will
again raise a page fault).

The decision between using non-present or supervisor mode page table entries
for marking a page as non-executable comes down to performance in the end,
the latter being less intrusive because kernel initiated data accesses to
userland pages will not raise a page fault.

To sum it up, PAGEEXEC as implemented in PaX overloads the meaning of the
User/Supervisor bit in the ptes to mean the executable/non-executable status
and also makes sure that data accesses to non-executable pages still work as
before.


2. Implementation

PAGEEXEC requires two sets of changes in Linux: the kernel has to be taught
that the i386 architecture can do the proper non-executable semantics, and
next we have to deal with the special page faults that require kernel
assisted DTLB loading.

The low-level definitions of the capabilities of the paging logic are in
include/asm-i386/pgtable.h. Here we simply redefine the constants that are
used for creating the ptes of non-executable pages. One such use of these
constants is the protection_map[] array defined in mm/mmap.c which is
referenced whenever the kernel sets up a pte for a userland mapping. Since
PAGEEXEC can be disabled on a per-task basis, we have to modify all code
that accesses this array so that we provide an executable pte even if it
was not explicitly requested. Affected files include fs/exec.c (where the
stack pages are set up), mm/mprotect.c, mm/filemap.c and mm/mmap.c. The
changes in the latter two cooperate in a non-trivial way: do_mmap_pgoff()
creates executable non-anonymous mappings by default and it is the job of
generic_file_mmap() to turn it into a non-executable one (as the mapping
turned out to be a file mapping). This logic ensures that non-anonymous
mappings of devices remain executable regardless of PAGEEXEC. We opted for
this approach to remain as compatible as possible (by not affecting all
non-anonymous mappings) yet still make use of the non-executable feature
in the most frequently encountered case.

The kernel assisted DTLB loading logic is in the IA-32 specific page fault
handler which in Linux is do_page_fault() in arch/i386/mm/fault.c. For
easier code maintenance we created our own page fault entry point called
pax_do_page_fault() which gets called first from the low-level page fault
exception handler page_fault found in arch/i386/kernel/entry.S.

First we verify that the given page fault is ours by checking for a userland
fault caused by access conflict (vs. not present page). Next we pay special
attention to faults caused by an instruction fetch since this means an
attempt of code execution in a non-executable page. Such faults are easily
identified by checking for a read access where the target address of the
page fault is equal to the userland instruction pointer (which is saved by
the CPU at the time of the fault for us). The default action is of course a
task termination along with a log message, only EMUTRAMP when enabled can
change it (see separate document).

Next we prepare the mask for setting up the special pte for loading the
DTLB and then we acquire the spinlock that guards MMU state changes (since
we are about to cause such a change ourselves). Holding the spinlock is
also necessary for looking up the target pte that we will modify and load
into the DTLB. If the pte state we looked up no longer corresponds to the
fault type then we must have raced with other MMU state changing code and
pass down the fault to the original fault handler. It is also the time when
we can identify (and pass down) copy-on-write page faults that have the
same fault type but a different pte state than what is caused by the
PAGEEXEC logic.

Finally we change the pte to allow userland accesses to the given page then
perform a dummy read memory access that will have the CPU page table walk
logic load it into the DTLB and then we change the state back to be in
supervisor mode. There is a trick in this part of the code that is worth
a few words. If the TLB already has an entry for a given virtual/physical
translation then initiating a memory access will not cause a page table
walk, that is, for our DTLB loading to work we would have to ensure that
the DTLB has no entries for our virtual address. It turns out that different
members of the Intel IA-32 family have a different behaviour when the CPU
raises a page fault during a page table walk (which is our case): the old
Pentium (but not the MMX version) CPUs would still cache the translation
if it described a present mapping but had an access conflict (which is our
case since we have a supervisor mode pte that is accessed while executing
in user mode) whereas newer CPUs (P6 core based ones, P4 and probably future
CPUs as well) would not cache them at all. This means that in the second
case we can be sure that the DTLB has no translations for our target virtual
address and can omit a very expensive 'invlpg' instruction (it sped up the
fast path by some 20% on a P3).
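
Schematically, the fast path boils down to something like the sketch below
(helper names and details are illustrative; the real code lives in
pax_do_page_fault(), runs under the MMU spinlock and must cope with the
races described above):

    #define _PAGE_USER 0x004    /* i386 pte User/Supervisor bit */

    static void pax_load_dtlb(volatile unsigned long *pte,
                              unsigned long address)
    {
        unsigned long entry = *pte;

        *pte = entry | _PAGE_USER;       /* temporarily allow user access */
        (void)*(volatile char *)address; /* dummy read loads the DTLB */
        *pte = entry;                    /* back to supervisor-only */
        /* old Pentium cores would also need an 'invlpg' here; P6 and
           newer ones do not cache translations from faulting walks */
    }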

pax-future.txt (+423, -0)

1. Generic Notes & Some Design

To understand the future direction of PaX, let's summarize what we achieve
currently. The goal is to prevent/detect exploiting of software bugs that
allow arbitrary read/write access to the attacked process. Exploiting such
bugs gives the attacker three different levels of access into the life of
the attacked process:

(1) introduce/execute arbitrary code
(2) execute existing code out of original program order
(3) execute existing code in original program order with arbitrary data

Non-executable pages (NOEXEC) and mmap/mprotect restrictions (MPROTECT)
prevent (1) with one exception: if the attacker is able to create/write
to a file on the target system then mmap() it into the attacked process
then he will have effectively introduced and executed arbitrary code.

Address space layout randomization (ASLR) prevents all of (1), (2) and (3)
in a probabilistic sense for bugs where the attacker needs advance knowledge
of addresses in the attacked process *and* he cannot learn about them (i.e.,
there is no information leaking bug in the target).

It is also worth noting that since all of PaX is implemented in the kernel,
the kernel is considered as the Trusted Computing Base which can be subject
to the same kind of attacks as userland.

Based on the above the future efforts will aim at the following:

(a) try to handle the exception to (1)
(b) implement all possible features for protecting the kernel itself against
exploiting kernel bugs, optionally (long term) create a non-kernel mode
TCB
(c) implement deterministic protection for (2) and maybe (3)
(d) implement probabilistic protection for (2) potentially resisting
information leaking

We do not plan to deal with (a) in general; it is better left for Access
Control and/or Trusted Path Execution systems which are beyond our project
(e.g., grsecurity or some LSM modules).

Note also that detecting/preventing (3) is probably equivalent to the
halting problem (since the protection would have to detect data changes and
their influences on all possible execution flows), so we do not expect
generic solutions such as for (1). Even detecting/preventing (2) looks
suspicious in this regard (given our attack model), but we have yet to see
where and to what extent we can make a compromise to achieve the goal.


2. More Design & Implementation

In the following we will give some details on (a), (b), (c) and (d) although
it should be noted that these are just plans; we do not know yet what will
prove to be practical in the end. There are three measures that will help
answer this question though:
- how many/extensive changes are required (complexity),
- how much protection is achieved (efficiency),
- is it fast enough for practical purposes (performance).
In the following we will try to answer these questions as well.

(a.1) for a certain subset of the exception there can be a simple solution,
requiring only some changes in the dynamic linker: if the target
application is dynamically linked but does not load libraries
explicitly (via the dlopen() interface) then the dynamic linker could
be modified to tell the kernel when it is done loading all the shared
libraries and afterwards the kernel could prevent all executable file
mappings (or at least remove the executable status on them).

This solution will require userland changes (vs. kernel only) but
they will be fairly simple. It will also be efficient (achieves the
goal for this set of applications) and fast (basically no measurable
performance impact).

(a.2) actually, the above kernel notification is not strictly needed; the
kernel itself could determine if previous file mappings reference
dlopen() and whether the dynamic linker is done loading libraries.

This solution will be kernel only however it will be more complex
as we will have to parse the ELF headers to determine whether the
dlopen() interface is referenced for dynamic linking (in (a.1) the
dynamic linker already has the necessary parsing code). It will also
be as efficient and fast as (a.1) since it will do the same kind of
processing, just inside the kernel.

(b.1) non-executable kernel pages: using the segmentation logic of IA-32
we can reduce the KERNEL_CS segment to cover only the kernel image;
furthermore, we can then make this area read-only using the paging
logic. This step will most likely require the reorganization of the
kernel image (by modifying the kernel linker script and maybe even
source code). How module support fits into this is yet to be determined.

This solution will be complex and may not be kernel-only (depends on
how the module problem can be solved, in particular if the current
modutils support mapping different ET_REL ELF file sections into non-
contiguous memory areas). We expect no performance impact from this
(much like SEGMEXEC) and to be very efficient (i.e., this will prevent
introduction/execution of arbitrary code in kernel land).

(b.2) read-only kernel function pointers: one way of doing (2) is to modify
function pointers and this is what this method would prevent.

This solution will require kernel-only changes, however they will be
quite extensive (although simple) as normally Linux does not use the
'const' qualifier for structures that contain such pointers (even if
they are really not meant to be modified). Maybe this could also be
brought to the attention of the mainline kernel developers since this
is more of a program correctness issue than that of security per se.

We expect no performance impact from this however we note that this
is not an efficient solution since there are still methods of
achieving (2), in particular there are function pointer like entities
on the kernel stack (saved return EIP values from function calls).
These can however be addressed as well, as we will show later.
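
At the source level this is a one-word change wherever it applies; a
made-up example of the pattern (only the 'const' matters, it moves the
pointer table into a read-only section):

    struct ops {
        int  (*open)(const char *name);
        void (*close)(int handle);
    };

    static int  my_open(const char *name) { (void)name; return 0; }
    static void my_close(int handle)      { (void)handle; }

    /* 'const' places the function pointer table in .rodata, out of
       reach of an arbitrary-write primitive */
    static const struct ops my_ops = {
        .open  = my_open,
        .close = my_close,
    };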

(b.3) non-kernel based TCB: IA-32 has two execution modes that are normally
beyond the scope of normal programs (be that OS kernels or userland):
System Management Mode (SMM) and Probe Mode. SMM is normally used to
implement power management schemes (APM/ACPI), while Probe Mode is
used for low-level (hardware) debugging. Since using Probe Mode would
require extra hardware normally not present in consumer systems, we
will discuss SMM only as most current systems have the necessary
hardware support in place already (the PCI chipset) and the rest
requires pure software only.

SMM has a few features that would allow for implementing a new TCB,
beyond the reach and control of traditional OS kernels (i.e., it could
not be compromised even by ring-0 protected mode code).

The first feature is that the PCI chipset (North Bridge or Memory
Controller Hub in particular) allows carving out some amount of RAM
for SMM use only (i.e., such memory could be accessed only while the
CPU is in SMM). Normally this memory is set up by the BIOS which will
store the power management code there (the System Management Interrupt
(SMI) handler and related data).

The second feature is that SMI can be generated by both software and
hardware means, i.e., it is possible to create an API between the SMI
handler and the rest of the kernel. Also the SMI handler can be
invoked periodically, beyond normal kernel control.

These features would allow for storing various cryptographic
primitives inside the SMI handler and have them validate the state of
the kernel periodically. They would also provide the kernel with some
limited amount of secure data storage (albeit volatile).

This solution is very complex and very hardware specific, but very
efficient as well. Depending on the frequency of the SMI, it may or
may not have a noticeable performance impact.

(c.0) an attacker can achieve (2) if he is able to change the execution flow
of the target to his liking. Since such changes normally occur at very
specific instructions (jmp, call, retn), our goal is to protect these.

All of these methods will necessarily be userland changes, the kernel
could not perform them fast enough, if at all. Note also that these
methods can be implemented at various levels of abstraction, such as
the source code, the compiler, the static and the dynamic linker.

Complexity is expected to be high in general regardless of what level
they are implemented at. We also expect that we will have to make
tradeoffs between efficiency and performance, and even in the extreme
we do not expect to be very efficient (that is, there will remain
attack methods under specific circumstances and we may not even be
able to precisely determine them for a given target).

With this said, let's look at a few ideas now.

(c.1) To protect execution flow changes via the jmp/call instructions, we
would have to be able to identify all explicit function pointers in
the target and make them read-only. While we can at least identify
them if we have the source code or there are sufficient relocations
present in the given ELF file (this is where using ET_DYN ELF files
becomes important, even for normal executables), making them read-only
is a problem.

First, such function pointers are normally not qualified as 'const'
in the original source code (in whatever language), so in memory
they will end up being mixed with other writable data and hence using
the paging logic to make such pages read-only would introduce a big
performance impact. Second, there are function pointers which need
to be writable by the program logic itself. In any case, reducing the
number of writable function pointers does decrease the success rate
of (2) since there will be less places where the execution flow can be
diverted (and the remaining places may not be enough to successfully
attack the given target).

The most important function pointers are the following:

- GOT entries
- .ctors/.dtors
- atexit() handlers
- malloc() callbacks
- linuxthread callbacks (atfork() handlers)

Of these at least the first two sets can be easily isolated and made
read-only, the others are by their nature writable. To protect the
GOT entries we will have to make changes to both the static and the
dynamic linker. The static linker will have to place the GOT into
its own PT_LOAD segment so that later the dynamic linker can use
mprotect() to make it read-only. Making the GOT read-only is possible
either at application startup or on every GOT entry update. The former
requires the enforcement of the LD_BIND_NOW behaviour and it affects
startup time (for the worse), whereas the latter affects GOT entry
resolution time (it needs two mprotect() calls to make the GOT
temporarily writable then read-only again) and is open to races
(although probably not easy to exploit since by the time an attack
would take place, most needed GOT entries should have been resolved).

Note that being able to change the writability status of the GOT is
not a problem since under PaX abusing it would require function
pointer hijacking (to call mprotect() with proper arguments) which
is either not possible or if it is, then the attacker is already able
to call any function and pass arguments to it, so being able to change
the GOT does not give him anything extra.

The actual implementation does not need to be an ld.so modification;
it can be another shared library loaded through /etc/ld.so.preload.
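
A sketch of such a preload library (finding the GOT segment boundaries
from the program headers is omitted; got_start and got_len are
placeholders for that lookup):

    #include <sys/mman.h>

    extern char got_start[];        /* assumed: page aligned GOT start */
    extern unsigned long got_len;   /* assumed: GOT segment length */

    /* once all entries are resolved (LD_BIND_NOW behaviour) nothing
       legitimate needs to write the GOT anymore */
    static void seal_got(void)
    {
        mprotect(got_start, got_len, PROT_READ);
    }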

Protecting .ctors/.dtors is as simple as changing the linker script
to put these sections into a read-only segment.

(c.2) To protect execution flow changes via the retn instruction, we face
the same problem as above with writable (by their nature) function
pointers. What makes the situation different is that we can determine
in advance (at compile/link time) and verify at runtime the valid
values for a given saved EIP on the stack. An efficient and simple
example is shown below:

    callee
    epilogue:  mov register,[esp]
               cmp [register+1],MAGIC
               jnz .1
               retn
    .1:        jmp esp

    caller:
               call callee
               test eax,MAGIC

What happens here is that we insert an instruction (that does not
otherwise affect the program logic) after the call which encodes in
its body a magic number derived from the callee's prototype and then
before returning from the callee we verify whether we are returning to
a proper call site (register is anything that can be trashed
inside a function). The jmp esp will simply trigger the non-executable
page enforcement logic and terminate the task.

This method is efficient in that it sharply reduces the number of code
locations where a retn can be forced to return to, and in particular
it cannot return to the normal entry point of functions which makes
argument passing and stack balancing non-trivial. Note also that it
does not matter if the call itself is a direct or indirect one, i.e.,
calls through traditional function pointers are just as well protected
as normal function calls (although it may be worth separating the two
cases because the magic number for directly called functions can be
based on more information, such as the function's name).

The drawback of this method is that reaction on an attack is delegated
to userland. It means that when deploying this method in a system,
every component (executables + libraries) must be recompiled at once;
it cannot be gradually introduced into the system, nor can binaries be
moved across systems of different nature nor can it be disabled easily
(that is, it would require another full system recompilation). As we
will see later, there is a probabilistic approach that remedies some
of these problems.

(c.3) The next method works by assuming a weakened attack model where we
still assume arbitrary read/write access but we only aim to prevent
attacks that need to divert execution flow more than once (i.e., the
multiple return-to-libc style attack).

In this case we can modify the function prologue/epilogue code so
that a necessary 'ingredient' for the multiple style of the attack
would no longer be present at all. Let's observe first that on IA-32
function parameters are passed on the stack normally (vs. registers)
and hence the stack needs to be rebalanced after a function call. In
a multiple return-to-libc style attack an attacker has to explicitly
rebalance the stack after every diverted function call (under Linux
the default calling convention is 'caller rebalances' vs. the 'callee
rebalances' convention used in Windows). Rebalancing requires the
modification of ESP by a specific amount and then also transferring
control to execute the next step of the attack. Such an instruction
sequence is normally found in the epilogue code of functions and
therefore this is what has to be modified (and for symmetry reasons,
the prologue as well). Our idea is to modify ESP inside a function so
that it would be way off track while the function executes and have
a prologue and epilogue code similar to this:

               without frame ptr               with frame ptr

    prologue:  sub esp,BIG_VALUE+4             push ebp
               mov [esp+BIG_VALUE],register    mov ebp,esp
               ...                             push register
                                               sub esp,BIG_VALUE
                                               ...

               mov [esp+BIG_VALUE+TOP],param   mov [esp+BIG_VALUE+TOP],param
               add esp,BIG_VALUE-TOP-4         add esp,BIG_VALUE-TOP-4
               call func                       call func
               sub esp,BIG_VALUE-TOP           sub esp,BIG_VALUE-TOP
               ...                             ...

    epilogue:  mov register,[esp+BIG_VALUE]    mov register,[esp+BIG_VALUE]
               add esp,BIG_VALUE+4             mov ebp,[esp+BIG_VALUE+4]
               retn                            add esp,BIG_VALUE+8
                                               retn

It is clear that an appropriately chosen BIG_VALUE will prevent the
stack rebalancing since during an attack only the epilogue is executed
without the corresponding prologue. Note also that ESP has to be
restored across function calls so that the callee can perform the same
ESP modification. This method is expected to be efficient but also to
have a noticeable impact on performance, although we cannot tell how
much (and whether it will be acceptable for practical purposes). Also
it will be complex if for no other reason than signal delivery which
wants to use the current stack (ESP) and would result in the kernel
killing the task in our case. There are different ways of addressing
this problem, such as delaying signal delivery (e.g., during system
calls we will have a valid userland ESP so at that time signals can
be safely delivered) or setting up a special signal stack (which is
an already supported POSIX feature, although doing it in multithreaded
applications would not be trivial) or teaching the kernel about the
userland ESP shift value and have it take that into account before
delivering the signal.

(c.4) attack method (3) is probably the hardest problem of all. Since here
the attacker relies on normal program behaviour (as far as execution
flow is concerned), we cannot detect changes there. What is possible
is to try to prevent specific ways of modifying arbitrary memory and
hence reduce the potential arsenal an attacker can use.

The following is a list of the most well-known and widespread bugs
that give some kind of write (and sometimes read) ability:

- stack based array overflow
- heap based array overflow
- user controlled format string
- integer overflow
- incorrect integer signedness change

There are already a few known solutions for detecting/preventing
exploiting such bugs (compiler modifications), where we can extend
such work is to try to do them at the binary level (i.e., without
access to the source code). The best candidates for such modifications
are ET_DYN ELF files (typically libraries), ET_EXEC ELF executables
have the fundamental problem of not having enough relocations (maybe
vendors should be educated to produce either ET_DYN executables in
the future or at least generate sufficient relocations into ET_EXEC
files, GNU ld makes the latter very easy now).

(d.1) if we weaken our attack model by assuming arbitrary write ability but
no reading then we can extend ASLR to introduce randomness into the
saved values of EIP and hence deter attacks such as the one described
in Phrack #59-9. An efficient and simple (fast) implementation could
be the modification of function prologue and epilogue code as shown
below:

               without frame ptr       with frame ptr

    prologue:  xor [esp],esp           xor ebp,[esp]
               sub esp,xxxx            xor [esp],esp
               ...                     push ebp
                                       mov ebp,esp
                                       sub esp,xxxx
                                       ...

    epilogue:  add esp,xxxx            add esp,xxxx/mov esp,ebp
               xor [esp],esp           pop ebp
               retn/retn xx            xor [esp],esp
                                       xor ebp,[esp]
                                       retn/retn xx

This trick uses the fact that under ASLR ESP has randomness in its
lower 8 bits as well (bits 4-7).

(d.2) The next method is the probabilistic version of saved EIP checking.
Here the kernel will create a random 32 bit value (cookie) on every
system call and place it into a variable provided by userland (it can
be described in the ELF header for example). Then we modify userland
as follows:

callee
epilogue: mov register,[esp]         ; fetch the saved return address
          mov register,[register+1]  ; skip the 1-byte 'test' opcode, read its imm32
          sub register,MAGIC         ; 0 if the return address is genuine
          neg register               ; sets CF iff register was non-zero
          sbb register,register      ; 0 on match, 0xffffffff on mismatch
          lock or [cookie],register  ; irreversibly taints the cookie on mismatch
          retn

caller:
          call callee
          test eax,MAGIC             ; executed but harmless: its immediate is
                                     ; the MAGIC signature read by the callee

It is clear that during a normal execution flow the value of the cookie
does not change after a function returns, hence it can be checked
whenever the task enters the kernel through a system call. Should the
kernel find a discrepancy between the cookie as stored in userland and
its own version, it can react as it does for non-executable page
violations. To make information leaking useless the kernel can simply
generate a new cookie on each system call (after having verified the
previous one of course); this way an attacker cannot learn the current
value of the cookie since that itself requires communication and hence
a system call, which would invalidate the leaked information.

This method is expected to be as efficient as its deterministic
version and it also suffers from one less problem: it can be applied
gradually to a system (there is no need to recompile everything at
once), and protected executables can be freely moved between different
systems without adverse effects.

Note that this method is free from races (coming from signal delivery
or multiple threads) because the cookie is updated by a single
irreversible atomic 'or' instruction (the kernel should of course avoid
generating the 0xffffffff value for the cookie, but since its
probability is very low, this is not critical).
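
To make the kernel side of this scheme concrete, here is a minimal
sketch (all names are assumptions based on the description above - the
cookie's userland location, the task fields and the helper names are
illustrative, not actual PaX code):

    #include <linux/sched.h>     /* struct task_struct, force_sig() */
    #include <linux/random.h>    /* get_random_bytes() */
    #include <linux/types.h>     /* u32 */
    #include <asm/uaccess.h>     /* get_user()/put_user() */

    /* hypothetical per-syscall check: tsk->pax_cookie would be the kernel's
       copy, tsk->cookie_uaddr the userland variable (e.g., advertised in
       the ELF header) - both fields are made up for illustration */
    static void pax_check_cookie(struct task_struct *tsk)
    {
        u32 seen, fresh;

        if (get_user(seen, tsk->cookie_uaddr) || seen != tsk->pax_cookie)
            force_sig(SIGKILL, tsk);   /* react as for a NOEXEC violation */

        do
            get_random_bytes(&fresh, sizeof fresh);
        while (fresh == 0xffffffffU);  /* never generate the 'tainted' value */

        tsk->pax_cookie = fresh;       /* per-syscall regeneration makes leaks stale */
        put_user(fresh, tsk->cookie_uaddr);
    }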


Finally, we would like to improve on the existing methods PaX provides.
In particular, we should change the way we get randomness for ASLR from
the kernel: the current method is overkill since we do not really need
such good quality randomness at all. The other possible improvement is
in the fast path of the PAGEEXEC page fault handler, which could be
sped up a bit by using assembly.

----------------------------------------------------------------------------
pax.txt

1. Design

The goal of the PaX project is to research various defense mechanisms
against the exploitation of software bugs that give an attacker arbitrary
read/write access to the attacked task's address space. This class of bugs
contains among others various forms of buffer overflow bugs (be they stack
or heap based), user supplied format string bugs, etc.

It is important to realize that our focus is not on finding and fixing
such bugs but rather on the prevention and containment of exploit techniques.
For our purposes these techniques can affect the attacked task at three
different levels:

(1) introduce/execute arbitrary code
(2) execute existing code out of original program order
(3) execute existing code in original program order with arbitrary data

For example the well known shellcode injection technique belongs to (1)
while the so-called return-to-libc style technique belongs to (2).

Introducing code into a task's address space is possible by either creating
an executable mapping or modifying an already existing writable/executable
mapping. The first method can be prevented by controlling what can be mapped
into the task; this is beyond the scope of the PaX project, as access control
systems are the proper way of handling it. The second method can be prevented by not
allowing the creation of writable/executable mappings at all. While this
solution breaks some applications that do need such mappings, until they are
rewritten to handle such mappings more carefully this is the best we can do.
The details of this solution are in a separate document describing NOEXEC.

Executing code (be that introduced by the attacker or already present in
the task's address space) requires the ability to change the execution flow
using already existing code. Such changes occur when code dereferences a
function pointer. An attacker can intervene if such a pointer is stored in
writable memory. Although it would seem a good idea to not have function
pointers in writable memory at all, it is unfortunately not possible (e.g.,
saved return addresses from procedures are on the stack), so a different
approach is needed. Since the changes need to be in userland and PaX has
so far been a kernel oriented project, they will be implemented in the
future, see the details in a separate document.

The next category of features PaX employs is a form of diversification:
address space layout randomization (ASLR). The generic idea behind this
approach is based on the observation that in practice most attacks require
advance knowledge of various addresses in the attacked task. If we can
introduce entropy into such addresses each time a task is created then we
will force the attacker to guess or brute force it which in turn will make
the attack attempts quite 'noisy' because any failed attempt will likely
crash the target. It will be easy then to watch for and react on such
events. The details of this solution are in a separate document describing
ASLR.

Before going into the analysis of the above techniques, let's note an often
overlooked or misunderstood property of combining defense mechanisms. Some
like to look at the individual pieces of a system and arrive at a conclusion
regarding the effectiveness of the whole based on that (or worse, dismiss one
mechanism because it is not effective without employing another, and vice
versa). In our case this approach can lead to misleading results. Consider
that one has a defense mechanism against (1) and (2) such as NOEXEC and the
future userland changes in PaX. If only NOEXEC is employed, one could argue
that it is pointless since (2) can still be used (in practice this reason
has often been used to dismiss non-executable stack approaches, which is
not to be confused with NOEXEC however). If one protects against (2) only,
then one could equally well ask why bother at all if the attacker can go
directly for (1), and the final conclusion becomes that none of these
defense mechanisms is effective. As hinted at above, this turns out to be
the wrong conclusion here: deploying both kinds of defense mechanisms will
protect against both (1) and (2) at the same time - where one defense line
would fail, the other prevents that (i.e., NOEXEC can be broken only by a
return-to-libc style attack and vice versa).

In the following we will assume that both NOEXEC (the non-executable page
feature and the mmap/mprotect restrictions) and full ASLR (using ET_DYN
executables) are active in the system. Furthermore we also require that
there be only PIC ELF libraries on the system and also a crash detection
and reaction system be in place that will prevent the execution of the
attacked program after a fixed (low) number of crashes. The possible venues
of attack against such a system are as follows:

- attack method (3) is possible with 100% reliability if the attacker
does not need advance knowledge of addresses in the attacked task.

- attack methods (2) and (3) are possible with 100% reliability if the
attacker needs advance knowledge of addresses and can derive them by
reading the attacked task's address space (i.e., the target has an
information leaking bug).

- attack methods (2) and (3) are possible with a small probability if the
attacker needs advance knowledge of addresses but cannot derive them
without resorting to guessing or a brute force search ('small' can be
further quantified, see the ASLR documentation).

- attack method (1) is possible if the attacker can have the attacked
task create, write to and mmap a file. This in turn requires attack
method (2), so the analysis of that applies here as well (note that
although not part of PaX per se, it is recommended, among other things, that
production systems use an access control system that would prevent
this venue of attack).

Based on the above it should come as no surprise that the future direction
of PaX will be to prevent or at least reduce the efficiency of method (2)
and eliminate or reduce the number of ways method (3) can be done (which
will also help counter the other methods of course).


2. Implementation

The main line of development is Linux 2.4 on IA-32 (i386) although most
features already exist for alpha, ia64, parisc, ppc, sparc, sparc64 and
x86_64 as well and other architectures are coming as hardware becomes
available (thanks to the grsecurity and Hardened Gentoo projects). For
this reason all implementation documentation is i386 specific (the generic
design ideas apply to all architectures though).

The non-executable page feature exists for alpha, i386, ia64, parisc, ppc,
sparc, sparc64 and x86_64 while ppc64 can have the same implementation as
ppc. The mips and mips64 architectures are hopeless in general as they have
a unified TLB (the models with a split one will be supported by PaX). The
main document on the non-executable pages and related features is NOEXEC,
the two i386 specific approaches are described by PAGEEXEC and SEGMEXEC.

The mmap/mprotect restrictions are mainly architecture independent, only
special case handling needs architecture specific code (various pieces of
code that need to be executed from writable and therefore non-executable
memory, e.g., the stack or the PLT on some architectures). Here the main
document is MPROTECT whereas EMUTRAMP and EMUSIGRT describe the i386
specific emulations.

ASLR is also mostly architecture independent, only the randomizable bits
of various addresses vary among the architectures. The documents are split
based on the randomized region, so RANDKSTACK and RANDUSTACK describe the
feature for the kernel and user stack, respectively. RANDMMAP and RANDEXEC
are about randomizing the regions used for (among others) ELF libraries
and executables, respectively. The infrastructure that makes both SEGMEXEC
and RANDEXEC possible is vma mirroring described by the VMMIRROR document.

Since some applications need to do things that PaX prevents (runtime code
generation) or make assumptions that are no longer true under PaX (e.g.,
fixed or at least predictable addresses in the address space), we provide
a tool called 'chpax' that gives the end user fine grained control over the
various PaX features on a per executable basis.

----------------------------------------------------------------------------
paxteam-on-kaslr.txt

KASLR: An Exercise in Cargo Cult Security
Postby spender » Wed Mar 20, 2013 6:46 pm
Since this post about Kernel Address Space Layout Randomization (KASLR) extends beyond a critique of the feature itself and into a commentary on the state of commercial defensive security and how it is evaluated both by the security community and by end-users, I asked the PaX Team to contribute some valuable context to the discussion. As the creator of ASLR in 2001, he shares below some history and motivations for ASLR at the time. His exploit vector classification and ASLR analysis cover important nuances and fundamental truths lost in the barrage of "bypasses" in the industry. I continue later in more depth under the heading "Why KASLR is a Failure".
Before talking about KASLR it seems high time that we revisited a little history regarding ASLR itself. About 12 years ago PaX had already proved that there was in fact a practical way to prevent code injection attacks, the prevalent exploit technique against memory corruption bugs at the time (and even today thanks to the widespread use of JIT compiler engines). It was also clear that the next step for both sides would be to focus on executing already existing code, albeit in an order not intended by the programmer of the exploited application (the market word for this is ROP/JOP/etc). Much less relevant back then but there was always the possibility to exploit these bugs by merely changing data and without disturbing the program logic directly (data only attacks, [1] [2]). Foreseeing these future developments in practical exploit techniques made me think whether there was perhaps some general way to prevent them or at least reduce their effectiveness until specific reliable and practical defenses could be developed against the remaining exploit techniques (it was clear that such defenses wouldn't come by as easily as non-executable pages, and alas, in 2013AD nobody has still much to show). In other words, ASLR was always meant to be a temporary measure and its survival for this long speaks much less to its usefulness than our inability to get our collective acts together and develop/deploy actual defenses against the remaining exploit techniques.
In any case, thus was the concept of ASLR born which was originally called (for a whole week maybe ;)) ASR for Address Space Randomization (the first proof of concept implementation did in fact randomize every single mmap call as it was the simplest implementation).
The concept was really simple: by randomizing memory addresses on a per process basis we would turn every otherwise reliable exploit needing hardcoded addresses into a probabilistic one where the chances of success were partially controlled by the defense side. While simple in concept and implementation, ASLR doesn't come without conditions; let's look at them briefly. For ASLR to be an effective prevention measure the following must hold:
1. the exploit must have to know addresses from the exploited process (there's a little shade of grey here in that, depending on the situation, knowledge of partial addresses may be enough (think partial or unaligned overwrites, heap spraying), 'address' is meant to cover both the 'full' and 'partial' conditions),
2. the exploit must not be able to discover such addresses (either by having the exploited application leak them or brute force enumeration of all possible addresses)
These are conditions that are not trivially true or false for specific situations, but in practice we can go with a few heuristics:
1. remote attacks: this is the primary protection domain of ASLR by design because if an exploit needs addresses at all, this gives an attacker the least amount of a priori information. It also puts a premium on infoleaking bugs on the attack side and info leak and brute force prevention mechanisms on the defense side.
2. local attacks: defense relying on ASLR here faces several challenges:
- kernel bugs: instead of attacking userland it is often better to attack the kernel (the widespread use of sandboxes often makes this the path of least resistance) where userland ASLR plays a secondary role only. In particular it presents a challenge only if exploiting the kernel bug requires the participation of userland memory, a technique whose lifetime is much reduced now that Intel (Haswell) and ARM (v6+) CPUs allow efficient userland/kernel address space separation.
- information leaks: the sad fact of life is that contemporary operating systems all have, by design, features that provide address information that an exploit needs, and it's almost a whack-a-mole game to try to find and eliminate them. Even with such intentional leaks fixed or at least worked around, there are still kernel bugs left that leak either kernel or userland addresses back to the attacker. Eliminating these hasn't received much research yet, the state of the art being grsecurity, but there is still much more work to be done in this area.
So how does KASLR relate to all this? First of all, the name itself is a bit unfortunate since what is being randomized is not exactly what happens in userland. For one, userland ASLR leaves no stone unturned so to speak. That is, a proper ASLR implementation would randomize all memory areas and would do so for every new process differently. There is no equivalent of this mechanism for a kernel since once booted, the kernel image does not change its location in memory nor is it practically feasible to re-randomize it on every syscall or some other more coarse-grained boundary. In other words, it's as if in userland we applied ASLR to a single process and kept running it indefinitely in the hope that nothing bad would happen to it. At this rate we could call the long existing relocatable kernel feature in Linux KASLR as well since it's trivial to specify a new randomly chosen load address at boot time there.
Second, the amount of randomization that can be applied to the base address of the kernel image is rather small due to address space size and memory management hardware constraints, a good userland ASLR implementation provides at least twice the entropy of what we can see in KASLR implementations. To balance this deficiency there's usually already some form of inadvertent brute force prevention present in that most kernels usually don't recover from the side effects of failed exploit attempts (Linux and its oops mechanisms being an exception here).
Third and probably the biggest issue for KASLR is its sensitivity to even the smallest amounts of information leaks, the sheer amount of information leaking bugs present in all kernels and the almost complete lack of prevention mechanisms against such leaks.
This situation came about because historically there was no interest until very recently in finding such bugs let alone systematically eliminating them or preventing their effects (except on hardened systems such as grsecurity, recognizing this fact early is the reason why we've been working on this problem space for many years now). Based on our experience with Linux, this will be a long drawn out uphill battle until such bugs are found or at least their effects neutralized in KASLR based kernels.
Why KASLR is a Failure
Continuing where the PaX Team left off, it should begin to become clear that ASLR has been taken out of the context of its original design and held on a pedestal as something much more than what it was originally intended as: a temporary fix until more effective measures could be implemented (which have much less to do with difficulty than the lack of resources on our side).
Information leakage comes in various forms. For our purposes here we'll consider two types of leakage: addresses and content, the former being a subset of the latter. The leaks can have spatial locality (say by leaking a string whose null terminator has been overwritten), be constrained in some other way (say due to an incomplete bounds check), or be completely arbitrary. They can also be active (by creating a leak) or passive (e.g. uninitialized struct fields). Fermin J. Serna's 2012 talk, "The info leak era on software exploitation" [3] covers lots of background information and common vulnerabilities and techniques as they pertain to userland.
The KASLR implementations of Microsoft and Apple operate in an environment where the kernel is a known entity whose contents can be obtained in a variety of ways. While on a custom-compiled Linux kernel set up properly, both the content and addresses of the kernel image are secret, for Microsoft and Apple the contents of the kernel image are known. To use "ROP" against the kernel, one needs to know not only the address of what one is returning to, but also what exists at that address. So the only "secret" in KASLR is the kernel's base address. It follows from this that any known pointer to the kernel image reveals its base address.
Due to operational constraints, however, the situation is even more dire. iOS KASLR for instance uses only a small 8 bits of entropy. If this weren't enough, the model is even further weakened from what we discussed above as not even a known pointer is needed to reveal the kernel base address. Any pointer to the kernel will reveal its base address via the upper 11 bits (the uppermost three bits are a given). The kernel is mapped aligned to a 2MB boundary. In Stefan Esser's recent iOS presentation [4] he called this 2MB alignment "arbitrary" wondering why there was no smaller alignment. This alignment is not arbitrary at all and is in fact a platform and performance-based constraint on ARM. "Sections," the ARM equivalent of large pages on x86, are implemented in the short mode descriptor format as 2MB first-level page table entries. This is why the iOS kernel as mapped in memory is composed of one 2MB "text" region and one 2MB "data" region -- because a page table entry is needed for each region to express their different memory protections. The kernel is mapped using sections as opposed to normal 4kB pages because it doesn't pollute the TLB with many entries (and potentially other reasons). Don't expect this alignment at the page table level to change. Inside the kernel, leaks will exist in the fixed-address vector mapping until it is relocated via TrustZone extensions.
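To make the consequence of that alignment concrete, recovering the base from any leaked kernel pointer is one mask away (a sketch; known_offset stands for whatever image-internal offset the leaked pointer is known to correspond to):

    #include <stdint.h>

    /* 2MB sections mean the image sits on a 2MB boundary, so a leaked
       pointer minus its known offset within the image, rounded down to
       2MB, yields the base - illustrative, not from any actual exploit */
    static uint32_t kernel_base_from_leak(uint32_t leaked_ptr, uint32_t known_offset)
    {
        return (leaked_ptr - known_offset) & ~(((uint32_t)1 << 21) - 1);
    }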
KASLR has likewise been praised [5] out of context on OS X. Though the latest version of OS X supports SMEP on Ivy Bridge processors (the latest generation of Intel Core architecture), no processors are available yet that support SMAP. OS X running with a 64bit kernel does not have the userland/kernel memory separation people have been used to in years past. Though similar memory separation is possible without SMEP/SMAP on the 64bit kernel from the "no_shared_cr3" boot argument (thanks to Tarjei Mandt for pointing me to this), it is unlikely that anyone is running in this mode as it imposes a severe performance hit (upwards of 30%+ with today's TLB architectures and sizes). Since cr3 is swapped on changing between a kernel-only and shared address space, cr3 modifications (and thus implied TLB flushes) occur on kernel entry, kernel exit, and before and after every copy to/from userland. Therefore, without SMEP, arbitrary code execution is trivially done in userland. Without SMAP, crafted data for sensitive APIs or ROP payloads can be easily stored in userland, removing any need for reliable storage addresses in the kernel.
Information leaks are the critical weakness of ASLR, and KASLR amplifies this weakness. Bases are only randomized once at boot, and (at least OS X/iOS's) heap/stack randomization is weak and irrelevant over time. For every usable privilege escalation vulnerability found, at least one usable infoleak will exist. Obvious infoleaks will be fixed by Apple and Microsoft (e.g. NtQuerySystemInformation), but other infoleak sources will prove more difficult to eradicate. Uninitialized structure fields can very easily creep into code (as we've seen time and again in Linux). Improper use of certain string routines like snprintf() can cause infoleaks. Pointers get used as unique IDs [6], pointers are printed. Structure padding often introduces infoleaks, especially on 64bit architectures. An OS that had 32bit kernels only until very recently switching to 64bit might find many infoleaks suddenly appearing due to this structure padding if one were to look. Linux has been struggling with infoleaks for years and even still they're readily found. Mathias Krause found 21 of them recently in the Linux kernel [7], and "3" more even more recently [8]. I say "3" because if you look at the first commit, for instance, you'll see 8 infoleaks being fixed in a single file. 13 infoleaks rolled up into one CVE -- tidy. During the writing of this article, even, I discovered a new infoleak in the Linux kernel that would have evaded any kind of manual code analysis. An unsanitized field that turned into a limited local ASLR infoleak was found via PaX's size_overflow plugin recently as well that evaded manual inspection by the "many eyes" for years [9]. This vulnerability goes back to Linux 1.3.0 (yes you read that correctly, from 1995 -- 18 years).
Of important note is that large infoleaks via these bugs are rare (and we've taken many steps in grsecurity to further reduce the possibility of leaks through approved copying interfaces). What is not rare are small leaks, large enough to leak pointers. The leaks are often local to the source of the bug. An uninitialized heap struct might leak pointers, but it will never directly leak code from the kernel image. Uninitialized stack entries can be coerced into providing desired pointers from previous syscalls. All these things mean it's much more likely to leak addresses that would reveal all useful "secrets". These secrets are the translations between addresses and known content, and their discovery enables full scale ROP. It's much less likely that content itself will be leaked in quantities sufficient enough for the same kind of attack. While Halvar's famous quote "you do not find info leaks... you create them" rings true for much userland exploitation, in the kernel you will come to know that it is much easier (and safer) to find info leaks than create them.
We've seen this kind of cargo cult security before [10] [11], of copying techniques into a different environment and via a kind of post hoc, ergo propter hoc logic fallacy, assuming the technique in its new environment will provide the same security. The kptr_restrict sysctl currently exists in the upstream Linux kernel and in most modern distros. It was derived from my GRKERNSEC_HIDESYM feature and submitted upstream by Dan Rosenberg [12]. The intent of the feature was to not leak kernel pointers to userland via /proc interfaces, symbol lists, etc. While the configuration help for GRKERNSEC_HIDESYM mentioned explicitly three things that needed to hold true for the feature to be useful at all, among them being that the kernel was not compiled by a distribution and that the kernel and associated modules, etc on disk are not visible to unprivileged users, you'll note that nowhere in the commit message or the in-kernel documentation for kptr_restrict is any kind of qualification for its efficacy mentioned. So what do we see as a result of this? Take Ubuntu [13]:
When attackers try to develop "run anywhere" exploits for kernel vulnerabilities, they frequently need to know the location of internal kernel structures. By treating kernel addresses as sensitive information, those locations are not visible to regular local users. Starting with Ubuntu 11.04, /proc/sys/kernel/kptr_restrict is set to "1" to block the reporting of known kernel address leaks. Additionally, various files and directories were made readable only by the root user: /boot/vmlinuz*, /boot/System.map*
All of it utterly useless as every one of these files is publicly available to anyone, including the attacker. And so the false security spreads.
KASLR is an easy to understand metaphor. Even non-technical users can make sense of the concept of a moving target being harder to attack. But in this obsession with an acronym outside of any context and consideration of its limitations, we lose sight of the fact that this moving target only moves once and is pretty easy to spot. We forget that the appeal of ASLR was in its cost/benefit ratio, not because of its high benefit, but because of its low cost. A cost which becomes not so low when we consider all the weaknesses (including the side-channel attacks mentioned in the paper that triggered this whole debate [14] which have not been covered in this article for various reasons, mostly because I don't want to sway the many optimists that popped up away from their firmly held beliefs that these are the only problems of this kind that can and will be fixed [15]). KASLR is more of a marketing tool (much like the focus of the rest of the industry) than a serious addition to defense. Many other strategies exist to deal with the problem KASLR claims to deal with. To use some wording from the PaX Team, the line of reasoning is: we need to do something. KASLR is something. Let's do KASLR. "Make attacks harder" is not a valid description of a defense. Nor is "it's better than nothing" an acceptable excuse in the realm of security. If it is, then we need to give up the facade and admit that these kinds of fundamentally broken pseudo-mitigations are nothing more than obfuscation, designed to give the public presence of security while ensuring the various exploit dealers can still turn a profit off the many weaknesses.
The details matter.
Consider this our "I told you so" that we hope you'll remember in the coming years as KASLR is "broken" time and again. Then again, in this offensive-driven industry, that's where the money is, isn't it?
[1] http://static.usenix.org/event/sec05/te ... n/chen.pdf
[2] http://www.cs.princeton.edu/~dpw/papers/yarra-csf11.pdf
[3] http://media.blackhat.com/bh-us-12/Brie ... Slides.pdf
[4] http://www.slideshare.net/i0n1c/csw2013 ... 0dayslater
[5] https://twitter.com/aionescu/status/312945665888120832
[6] http://lists.apple.com/archives/darwin- ... 00012.html
[7] http://www.openwall.com/lists/oss-security/2013/03/07/2
[8] http://seclists.org/oss-sec/2013/q1/693
[9] viewtopic.php?f=7&t=2521
[10] viewtopic.php?f=7&t=2574
[11] https://lkml.org/lkml/2013/3/11/498
[12] https://lwn.net/Articles/420403/
[13] https://wiki.ubuntu.com/Security/Features
[14] http://scholar.googleusercontent.com/sc ... oogle.com/
[15] 3a5f33b4af2ffbc27530087979802613fae8ed3ce0ae10e41c44c2877a76605b

----------------------------------------------------------------------------
randexec.txt

1. Design

The goal of RANDEXEC is to introduce randomness into the main executable
file mapping addresses.

While RANDMMAP takes care of randomizing ELF executables of the ET_DYN
kind, a special approach is needed for randomizing the position of ET_EXEC
ELF executables. As far as randomization is concerned the fundamental
difference between the two kinds of ELF files is the lack/presence of
adequate relocation information. Traditionally ET_EXEC files are created
by the linker with the assumption that they will be loaded at a fixed
address and hence relocation information is superfluous and mostly omitted.
It is worth noting though that ld from the GNU binutils package can be told
to emit full relocation information into ET_EXEC files, but this is a
relatively new and not widely (if at all) used feature.

To understand the technique that allows randomization even without enough
relocation information, consider what can happen when such an ET_EXEC file
is mapped into memory at an address other than its expected base address
(0x08048000 on i386). As soon as the CPU attempts to execute an instruction
(e.g., because of an absolute jump) or access data at the original address,
it will encounter a page fault (provided there is no other mapping at those
addresses). To avoid these page faults, we have to provide a file mapping
of the ET_EXEC file at the original base address as well.

This is not enough however for two reasons. First, the code may make data
accesses in a position independent manner (e.g., the various crt*.o files
linked into an executable contain such code) so we would have to ensure
that the two file mappings are mirroring each other's content all the time.

Second, we cannot allow the text segment mapping at the original base
address to remain executable, since that would defeat the very purpose of
randomization (there should be no executable code at a predictable/known
address in the task's address space). However the text segment must be
available for data accesses since the code may very well contain data
references to absolute addresses (which would point back to the original
mapping).

So the next refinement of our technique is that we will create two mappings
of the ET_EXEC file that will mirror each other, the first one at its
original base address and the second at a randomized one and we ensure that
the original mappings are not executable.

There is one last feature missing: since the code may attempt to execute
from the original and now non-executable mapping, it will produce extra
page faults that will need special handling. In particular, whenever we
detect a page fault due to an instruction fetch attempt in such a region,
we will redirect the execution flow back into the randomized mirror by
simply modifying the userland instruction pointer. Since automatic
redirection would again defeat the purpose of randomization, we can do
various checks in the page fault handler before we decide to go ahead with
the redirection. In the current implementation we attempt to detect so
called return-to-libc style attacks against the ET_EXEC mapping by checking
the userland stack for its signature. Since such a signature may (and does)
occur due to normal program code as well, this detection can (and does)
produce false alarms (that is, it can kill an innocent task); however, it
will never miss the attack attempt it was designed to detect.

Now that the basic RANDEXEC technique is clear, let's see how it is affected
by SEGMEXEC. Under SEGMEXEC the executable status of a page is expressed by
not its page protection flags but its location in linear address space.
That is, executable regions have to be mirrored. Since RANDEXEC itself needs
mirroring, one may wonder how the triple mirroring for the ET_EXEC file's
text section can be accomplished (the current vma mirroring implementation
supports only the simplest mirroring setup, that is two regions mirroring
each other). The triple mirroring would be needed because we have the
original and the randomized mapping in the data segment region (0-1.5 GB in
linear address space) plus we need the third one in the code segment region
(1.5-3 GB in linear address space). However consider if we really need the
randomized mapping of the file's text section in the data segment region.
Such data accesses could occur only if there was position independent code
in the executable that would learn its own position in memory then use that
to access data stored in the text section. In most real life applications
no such code exists, simply because programs are compiled/linked under the
assumption that they will be executed at a fixed/known address and
therefore there is no need for position independence (exceptions are incorrectly
linked programs where a position independent library is statically linked
into the executable, such as the Modula-3 runtime in CVSup). This leaves
us now with just two mappings which the vma mirroring system supports well.


2. Implementation

The core of RANDEXEC is vma mirroring which is discussed in a separate
document. The mirrors for the ET_EXEC file segments are set up in the
load_elf_binary() function in fs/binfmt_elf.c. The mirrors are created
differently depending on which of PAGEEXEC/SEGMEXEC is active.

The PAGEEXEC logic: every ELF segment is mapped as non-executable at its
original base address and the mirrors are created at a random address
returned by get_unmapped_area(). This also means that they will be using
the same randomization that other mmap() requests see.

The SEGMEXEC logic: the mappings at the original base address are mapped
as normal (no need for removing PROT_EXEC from the requests since they will
go into the data segment region anyway) with a little change: instead of
using elf_map() we directly use do_mmap_pgoff(). This is because elf_map()
is a wrapper around do_mmap() which in turn would attempt to create the
normal SEGMEXEC mirrors and is not suitable here since we need to position
the mirrors in a different way. As under PAGEEXEC, get_unmapped_area() is
used to get a random address where the ELF segment mirrors will be mapped.
The difference is in the mapping of executable mirrors: they will go into
the code segment region (1.5-3 GB linear address range) and in the data
segment (0-1.5 GB range) at the same logical addresses we create dummy
anonymous mappings as placeholders (otherwise there would be holes in the
data segment which the kernel later would try to use for mapping the ELF
interpreter and that could potentially unmap our randomized mirrors).

The extra page fault handling and execution flow redirection happens in
the architecture specific low-level page fault handlers; for now it exists
for the i386 architecture only. When the do_page_fault() or the
pax_do_page_fault() functions in arch/i386/mm/fault.c detect an instruction
fetch attempt (that is, when the fault address is equal to the faulting EIP)
then pax_handle_fetch_fault() will examine whether the fault occurred within
the main executable's code segment. If it did then we take a look at just
below the current userland stack pointer (ESP-4) to see if it contains the
address of the fault location. If it does, we declare it a return-to-libc
style attack attempt and terminate the task, otherwise we readjust the
userland EIP to point back to the randomized mirror region and return to
userland.

Note that this check works for detecting attacks where control flow was
changed by a 'retn' instruction (this is the normal way of returning from
a function as the caller is expected to adjust the stack), other forms
('retf' or 'retn xx') would need more, slightly different verification (as
they modify the ESP register by more than 4, we would have to check more
below ESP at the expense of increasing the false positive ratio).
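
A simplified sketch of that logic follows (illustrative only: the real
handler in arch/i386/mm/fault.c does more bookkeeping, and delta_exec is
an assumed name for the random shift of the mirror):

    #include <asm/ptrace.h>      /* struct pt_regs */
    #include <asm/uaccess.h>     /* get_user() */

    extern unsigned long delta_exec;   /* random shift of the mirror (assumed) */

    /* returns nonzero if execution was redirected into the randomized mirror */
    static int pax_handle_fetch_fault_sketch(struct pt_regs *regs, unsigned long address)
    {
        unsigned long saved;

        /* look at the word just below the userland stack pointer (ESP-4) */
        if (get_user(saved, (unsigned long *)(regs->esp - 4)))
            return 0;

        /* a just-popped return address equal to the fault address is the
           signature of a 'retn' into the non-executable mapping */
        if (saved == address)
            return 0;                      /* attack detected: caller kills the task */

        regs->eip = address + delta_exec;  /* back into the randomized mirror */
        return 1;
    }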

----------------------------------------------------------------------------
randkstack.txt

1. Design

The goal of RANDKSTACK is to introduce randomness into the kernel stack
addresses of a task.

Linux assigns two pages of kernel stack to every task. This stack is used
whenever the task enters the kernel (system call, device interrupt, CPU
exception, etc). Note that once the task is executing in kernel land,
further kernel (re)entry events (device interrupts or CPU exceptions can
occur at almost any time) will not cause a stack switch.

By the time the task returns to userland, the kernel land stack pointer
will be at the point of the initial entry to the kernel (this also means
that signals are not delivered to the task until it is ready to return to
userland, that is signals are asynchronous notifications from the userland
point of view, not from the kernel's).

This behaviour means that a userland originating attack against a kernel
bug would always find itself at the same place on the task's kernel stack;
therefore we introduce randomization into the initial value of the
task's kernel stack pointer. There is another interesting consequence in
that we can not only introduce but also change the randomization at each
userland/kernel transition (compare this to RANDUSTACK where the userland
stack pointer is randomized only once at the very beginning of the task's
life therefore it is vulnerable to information leaking attacks). In the
current implementation we opted for changing the randomization at every
system call since that is the most likely method of kernel entry during
an attack. Note that while per system call randomization prevents making
use of leaked information about the randomization in a given task, it does
not help with attacks where more than one task is involved (where one task
may be able to learn the randomization of another and then communicate it
to the other without having to enter the kernel).


2. Implementation

RANDKSTACK randomizes every task's kernel stack pointer before returning
from a system call to userland. The i386 specific system call entry point
is in arch/i386/kernel/entry.S and this is where PaX adds a call to the
kernel stack pointer randomization function pax_randomize_kstack() found
in arch/i386/kernel/process.c. The code in entry.S needs to duplicate some
code found at the ret_from_sys_call label because this code can be reached
from different paths (e.g., exception handlers) where we do not want to
apply randomization. Note also that if the task is being debugged (via
the ptrace() interface) then new randomization is not applied.

pax_randomize_kstack() gathers entropy from the rdtsc instruction (read
time stamp counter) and applies it to bits 2-6 of the kernel stack pointer.
This means that 5 bits are randomized providing a maximum shift of 128
bytes - this was deemed safe enough to not cause kernel stack overflows
yet give enough randomness to deter guessing/brute forcing attempts.

The use of rdtsc is justified by the facts that we need a quick entropy
source and that by the time a remote attacker would get to execute his
code to exploit a kernel bug, enough system calls would have been issued
by the attacked process to accumulate more than 5 bits of 'real' entropy
in the randomized bits (this is why xor is used to apply the rdtsc output).
Note that most likely due to its microarchitecture, the Intel P4 CPU seems
to always return the same value in the least significant bit of the time
stamp counter; therefore we ignore it in kernels compiled for the P4.

The kernel stack pointer has to be modified at two places: tss->esp0 which
is the ring-0 stack pointer in the Task State Segment of the current CPU
(and is used to load esp on a ring transition, such as a system call), and
second, current->thread.esp0 which is used to reload tss->esp0 on a context
switch. There is one last subtlety: since the randomization is applied via
xor and the initial kernel stack pointer of a new task points just above
its assigned stack (and hence is a page aligned address), we would produce
esp values pointing above the assigned stack, therefore in copy_thread() we
initialize thread.esp0 to 4 bytes less than usual so that the randomization
will shift it within the assigned stack and not above. Since this solution
'wastes' a mere 4 bytes on the kernel stack, we saw no need to apply it
selectively to specific tasks only.
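
Putting the above together, a minimal sketch of the randomization step
(the mask and function body are illustrative; the real function also
skips ptrace()d tasks and special-cases the P4 quirk mentioned above):

    #include <linux/sched.h>     /* current */
    #include <linux/smp.h>       /* smp_processor_id() */
    #include <asm/msr.h>         /* rdtscl() */
    #include <asm/processor.h>   /* init_tss, struct tss_struct */

    /* re-randomize the kernel stack pointer before returning to userland */
    static void pax_randomize_kstack_sketch(void)
    {
        struct tss_struct *tss = init_tss + smp_processor_id();
        unsigned long time;

        rdtscl(time);       /* quick (if low quality) entropy source */
        time &= 0x7cUL;     /* keep bits 2-6: 5 bits, shifts within 128 bytes */

        tss->esp0 ^= time;                 /* ring-0 stack pointer used on entry */
        current->thread.esp0 = tss->esp0;  /* reloads tss->esp0 on context switch */
    }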

----------------------------------------------------------------------------
randmmap.txt

1. Design

The goal of RANDMMAP is to introduce randomness into memory regions handled
by the do_mmap() kernel interface. This includes all file and anonymous
mappings, such as the main executable, libraries, the brk() and mmap()
managed heaps.

Since the brk() managed heap is tied to the main executable and the latter
cannot be randomly remapped without further tricks if it is an ET_EXEC
ELF executable (see RANDEXEC for more details), RANDMMAP handles ET_DYN
ELF executables only. Luckily creating ET_DYN ELF executables is a very
simple process and their randomization is much easier and does not have
the drawbacks of RANDEXEC.


2. Implementation

All methods of populating the address space of a task are based on the
do_mmap_pgoff() internal kernel function in mm/mmap.c. This function can
establish a mapping at a caller specified address (if the MAP_FIXED flag
is used) or at an address chosen by the kernel. PaX honours the first type
of request and intervenes in the second only.

The core function that chooses a large enough unpopulated region in the
task's address space is arch_get_unmapped_area() in mm/mmap.c. The search
algorithm is simple: the function enumerates all memory mappings from a
given address up (either a user supplied hint or TASK_UNMAPPED_BASE) and
returns the first hole big enough to satisfy the request.

PaX applies randomization (delta_mmap) to TASK_UNMAPPED_BASE in bits 12-27
(16 bits) and ignores the hint for file mappings (unfortunately there is
a 'feature' in linuxthreads where the thread stack mappings do not specify
MAP_FIXED but still expect that behaviour, so the hint cannot be overridden
for anonymous mappings). Note that overriding the hint is not a problem as
MAP_FIXED requests are directly satisfied in get_unmapped_area() and never
end up in arch_get_unmapped_area().
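
In other words, the randomized search base amounts to the following (a
sketch; pax_get_random_long() is a stand-in name for the actual entropy
helper):

    /* a 16-bit random value placed in bits 12-27 (PAGE_SHIFT is 12 on i386)
       gives a page-aligned shift of up to 256 MB */
    unsigned long delta_mmap = (pax_get_random_long() & 0xffffUL) << PAGE_SHIFT;
    unsigned long search_base = TASK_UNMAPPED_BASE + delta_mmap;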

There is one more place where RANDMMAP introduces randomness: in the
load_elf_binary() function in fs/binfmt_elf.c. As mentioned already, there
are two ways to randomize the mapping of the main executable: RANDEXEC
for ET_EXEC ELF files and RANDMMAP for ET_DYN ELF files. The latter is
accomplished here by overriding the standard ELF_ET_DYN_BASE address used
for mapping ET_DYN ELF files: PaX chooses the new base at the standard
ET_EXEC base address of 0x08048000 and adds the delta_exec random value
to it. This way the task address space layout will look similar to the
normal ET_EXEC case.

----------------------------------------------------------------------------
randustack.txt

1. Design

The goal of RANDUSTACK is to introduce randomness into the userland stack
addresses of a task.

Every task has a userland stack that is created during execve() (and copied
in fork() into the child). This stack is mandatory because this is the way
the kernel can pass the arguments and the environment to the new task. The
kernel normally creates the stack at the end of the userland address space
so that it can grow downwards later. If the application is multithreaded,
thread stacks are created by userland using the mmap() interface and hence
they are subject to RANDMMAP not RANDUSTACK (or rather, they would be were
it not for a 'feature' in linuxthreads that effectively prevents thread
stack randomization for now). Linuxthreads has another 'feature' that
prevents one from arbitrarily moving the task's stack as it assumes that
this stack will always have the highest address in the address space and
thread stacks will go below that.


2. Implementation

RANDUSTACK randomizes every task's userland stack on task creation. Since
the userland stack is created in two steps (from PaX's point of view),
randomization is applied in two steps as well.

In the first step the kernel allocates and populates pages for the stack
then in the second step it maps the pages into the task's address space.

The first step begins in fs/exec.c in the do_execve() function. The kernel
uses a temporary stack pointer stored in bprm.p to track the data copied
on the would-be stack pages, this is where PaX applies the first part of
the randomization: on i386 bits 2-11 are randomized resulting in a maximum
of 4 kB shift. Since at this point no information is available about the
new task, we cannot apply this randomization selectively.

The second step occurs when setup_arg_pages() gets called: this is where
the kernel maps the previously populated physical stack pages into the new
task's address space. Normally the bottom of the stack goes at STACK_TOP,
PaX modifies this constant in include/asm-i386/a.out.h to include a random
shift (delta_stack) in bits 12-27. This results in an additional maximum
shift of 256 MB. At this point we know enough already to be able to apply
this randomization selectively.

The end result of the randomization is that data which was copied on the
stack before setup_arg_pages() has bits 2-27 randomized (26 bits), the rest
has bits 4-27 randomized (24 bits) because the create_elf_tables() function
in fs/binfmt_elf.c aligns the stack pointer on a 16 byte boundary, that is,
it discards the randomization in bits 2-3.
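
In code, the two steps amount to roughly the following (a sketch with
assumed helper and variable names; the masks follow the bit ranges given
above):

    /* step 1, in do_execve(): shift the temporary stack pointer by a random
       amount in bits 2-11, i.e., at most ~4 kB (before any later alignment) */
    bprm.p -= pax_get_random_long() & 0xffcUL;

    /* step 2, via STACK_TOP in include/asm-i386/a.out.h: subtract a random
       page-aligned delta in bits 12-27, i.e., at most 256 MB, from the
       default top of the address space (0xc0000000 assumed here) */
    delta_stack = (pax_get_random_long() & 0xffffUL) << PAGE_SHIFT;
    stack_top   = 0xc0000000UL - delta_stack;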

----------------------------------------------------------------------------
segmexec.txt

1. Design

The goal of SEGMEXEC is to implement the non-executable page feature using
the segmentation logic of IA-32 based CPUs.

On IA-32 Linux runs in protected mode with paging enabled. This means that
for every memory access (be that instruction fetch or normal data access)
the CPU will perform a two step address translation. In the first step the
logical address decoded from the instruction is translated into a linear
(or in another terminology, virtual) address. This translation is done by
the segmentation logic whose details are explained in a separate document.

While Linux effectively does not use segmentation by creating 0 based and
4 GB limited segments for both code and data accesses (therefore logical
addresses are the same as linear addresses), it is possible to set up
segments that make it possible to implement non-executable pages.

The basic idea is that we divide the 3 GB userland linear address space
into two equal halves and use one to store mappings meant for data access
(that is, we define a data segment descriptor to cover the 0-1.5 GB linear
address range) and the other for storing mappings for execution (that is,
we define a code segment descriptor to cover the 1.5-3 GB linear address
range). Since an executable mapping can be used for data accesses as well,
we will have to ensure that such mappings are visible in both segments
and mirror each other. This setup will then separate data accesses from
instruction fetches in the sense that they will hit different linear
addresses and therefore allow for control/intervention based on the access
type. In particular, if a data-only (and therefore non-executable) mapping
is present only in the 0-1.5 GB linear address range, then instruction
fetches to the same logical addresses will end up in the 1.5-3 GB linear
address range and will raise a page fault, thus allowing the detection of
such execution attempts.
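
The split can be pictured with the base/limit pairs of the two user
segments (values are illustrative of the design; the actual descriptors
are defined in gdt_table2, see the implementation below):

    /* data accesses map logical addresses to linear 0-1.5 GB, instruction
       fetches to linear 1.5-3 GB (logical address + 1.5 GB) */
    struct pax_segment { unsigned long base, limit; };

    static const struct pax_segment segmexec_user_data = { 0x00000000UL, 0x60000000UL };
    static const struct pax_segment segmexec_user_code = { 0x60000000UL, 0x60000000UL };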


2. Implementation

The core of SEGMEXEC is vma mirroring which is discussed in a separate
document. The mirrors for executable file mappings are set up in do_mmap()
(an inline function defined in include/linux/mm.h) except for a special
case with RANDEXEC (see separate document). do_mmap() is the one common
function called by both userland and kernel originated mapping requests.

The special code and data segment descriptors are placed into a new GDT
called gdt_table2 in arch/i386/kernel/head.S. The separate GDT is needed
for two reasons: first it simplifies the implementation in that the CS/SS
selectors used for userland do not have to change, and second, this setup
prevents a simple attack that a single GDT setup would be subject to (the
retf and other instructions could be abused to break out of the restricted
code segment used for SEGMEXEC tasks). Since the GDT stores the userland
code/data descriptors which are different for SEGMEXEC tasks, we have
to modify the low-level context switching code called __switch_to() in
arch/i386/kernel/process.c and the last steps of load_elf_binary() in
fs/binfmt_elf.c (where the task is first prepared to execute in userland).

The GDT also has APM specific descriptors which are set up at runtime and
must be propagated to the second GDT as well (in arch/i386/kernel/apm.c).
Finally the GDT stores also the per CPU TSS and LDT descriptors whose
content must be synchronized between the two GDTs (in set_tss_desc() and
set_ldt_desc() in arch/i386/kernel/traps.c).

Since the kernel allows userland to define its own code segment descriptors
in the LDT, we have to disallow it since it could be used to break out of
the SEGMEXEC specific restricted code segment (the extra checks are in
write_ldt() in arch/i386/kernel/ldt.c).

----------------------------------------------------------------------------
uderef-SMAP.txt

With the latest release of their Architecture Instruction Set Extensions Programming Reference Intel has finally lifted the veil on a new CPU feature to debut in next year's Haswell line of processors. This new feature is called Supervisor Mode Access Prevention (SMAP) and there's a reason why its name so closely resembles Supervisor Mode Execution Prevention (SMEP), the feature that debuted with Ivy Bridge processors a few months ago. While the purpose of SMEP was to control instruction fetches and code execution from supervisor mode (traditionally used by the kernel component of operating systems), SMAP is concerned with data accesses from supervisor mode. In particular, SMEP, when enabled, prevents code execution from userland memory pages by the kernel (the favourite exploit technique against kernel security bugs), whereas SMAP will prevent unintended data accesses to userland memory. The twist in the story and the reason why these security features couldn't be implemented as one lies in the fact that the kernel does have a legitimate need to access data in userland memory at times while no contemporary kernel needs to execute code from there. In other words, while SMEP can be enabled unconditionally by flipping a bit at boot time, SMAP needs more care because it has to be disabled/enabled around legitimate accessor functions in the kernel. Intel has added two new instructions for this very purpose (CLAC/STAC) and repurposed the alignment check status bit in supervisor mode to enable quick switching around SMAP at runtime. This will require more extensive changes in kernel code than SMEP did but the amount of code is still quite manageable. Third party kernel modules that don't use the kernel's userland accessor functions will have to take care of switching SMAP on/off themselves.
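As a rough illustration of that bracketing (a sketch only; real kernels wrap their existing userland accessor primitives and patch the instructions in via alternatives rather than open-coding them like this):

    /* hypothetical userland read bracketed by stac/clac; the function name
       and the direct dereference are for illustration only */
    static inline unsigned long read_user_word(const unsigned long *uaddr)
    {
        unsigned long val;

        asm volatile("stac" ::: "memory");  /* open the SMAP window */
        val = *uaddr;                       /* the one intended userland access */
        asm volatile("clac" ::: "memory");  /* close it again */

        return val;
    }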
What does SMAP mean for PaX? The situation is similar to last year's SMEP that made efficient implementation of (partial) KERNEXEC possible on amd64 (i386/KERNEXEC continues to rely on segmentation instead which provides better protection than SMEP can). SMAP's analog feature in PaX is called UDEREF which so far couldn't be efficiently implemented on amd64 (once again, i386/UDEREF will continue to rely on segmentation to provide better userland/kernel separation than SMAP can). Beyond allowing an efficient implementation of UDEREF there'll be other uses for SMAP (or perhaps a future variant of it) in PaX: sealed kernel memory whose access is carefully controlled even for kernel code itself.
What does SMAP mean for security? Similarly to UDEREF, an SMAP enabled kernel will be prevented from accessing userland memory in unintended ways, e.g., attacker controlled pointers can no longer target userland memory directly, but even simple kernel bugs such as NULL pointer based dereferences will just trigger a CPU exception instead of letting the attacker take over kernel data flow. Coupled with SMEP this means that future exploits against memory corruption bugs will have to rely entirely on targeting kernel memory (which has been the case under UDEREF/KERNEXEC for many years now). This of course means that for reliable exploitation detailed knowledge of runtime kernel memory will become a premium, therefore abusing bugs that leak kernel memory to userland will become the first step towards exploiting memory corruption bugs. While UDEREF and SMAP prevent gratuitous memory leaks, they still have to allow intended userland accesses and that is exactly the escape hatch that several exploits have already targeted and we can expect more in the future. Fortunately we are once again at the forefront of this game with several features that prevent or at least greatly reduce the amount of information that can be so leaked from the kernel to userland (HIDESYM, SANITIZE, SIZE_OVERFLOW, STACKLEAK, USERCOPY).
TL;DR: Intel implements UDEREF equivalent 6 years after PaX, PaX will make use of it on amd64 for improved performance.

----------------------------------------------------------------------------
uderef-amd64.txt

Hello everyone,
i guess those of you tracking the test patches have already noticed that
we recently added support for UDEREF on amd64 as well. now that hopefully
the silly problems have been worked out, it's time to talk about it a bit.
before everything, let's get out one thing that i'll probably repeat every
now and then: UDEREF on amd64 isn't and will never be the same as on i386.
it's just the way it is, it cannot be 'fixed'. now let's see what it can
still do on amd64.
as you probably know (does anyone read config help? ;), UDEREF wants to
ensure that userland and kernel address spaces are properly separated. in
particular, gratuitous dereference of userland addresses by kernel code
should result in an oops instead of userland taking over kernel data flow,
or worse, control flow as well (think of the past year's worth of NULL
dereference based exploits). this separation can be implemented with pretty
much no overhead on i386, but unfortunately amd64 lacks the necessary
segmentation logic and the alternative ain't pretty ;).
so what does UDEREF do on amd64? on userland->kernel transitions it basically
unmaps the original userland address range and remaps it at a different address
using non-exec/supervisor rights (so direct code execution as used by most
exploits is not possible at least). this remapping is the main cause of its
performance impact as well, and i think it cannot really be reduced any further.
in any case, most kernel code will run without access to the actual userland
address range, so in this sense it's similar to what UDEREF on i386 offers.
this is also where the similarities end :), so let's look at the bad stuff
now. UDEREF/amd64 doesn't ensure that the (legitimate) userland accessor
functions cannot actually access kernel memory when only userland is allowed
(some in-kernel users of certain syscalls can temporarily access kernel memory
as userland, and that is enforced on UDEREF/i386 but not on amd64). so if
there's a bug where userland can trick the kernel into accessing a userland
pointer that actually points to kernel space, it'll succeed, unlike on i386.
the other bad thing is the presence of the userland shadow area. this has
two consequences: 1. the userland address space size is smaller under UDEREF
(42 vs. 47 bits, with corresponding reduction of ASLR of course), 2. this
shadow area is always mapped so kernel code accidentally accessing its range
may not oops on it and can be exploited (such accesses can usually happen only
if an exploit can make the kernel dereference arbitrary addresses in which
case the presence of this area is the least of your concerns though).
what about performance? well, 'it depends', in particular it depends on the
amount of user/kernel transitions of your workload as that's where the extra
code really hits (it's basically a TLB flush and two CR0 writes if you have
KERNEXEC as well, say 600 cycles + TLB repopulation time). on a simple
compilation test i get these times:
#time emerge portage -j2 on 2.6.33.1-pax no UDEREF
25.55user 7.44system 0:36.16elapsed 91%CPU (0avgtext+0avgdata 555648maxresident)k
56inputs+56816outputs (0major+1715421minor)pagefaults 0swaps
#time emerge portage -j2 on 2.6.32.10-pax UDEREF KERNEXEC
28.01user 11.03system 0:38.54elapsed 101%CPU (0avgtext+0avgdata 555600maxresident)k
56inputs+56832outputs (0major+1718704minor)pagefaults 0swaps
feel free to submit benchmarks (preferably on real life apps, not synthetic) so
that people know better what to expect. as usual, virtualization doesn't
like these tricks and suffers more, although less than on i386, so the
pax_nouderef kernel command line option works for amd64 as well.
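if you want a quick ballpark of the raw transition cost on your box, a
trivial (and admittedly synthetic) userland measurement could look like
this sketch:

/* measure the average user/kernel round-trip cost in cycles;
 * rdtsc isn't serializing so treat the result as a ballpark only */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <x86intrin.h>

int main(void)
{
        enum { LOOPS = 1000000 };
        unsigned long long start, end;
        int i;

        start = __rdtsc();
        for (i = 0; i < LOOPS; i++)
                syscall(SYS_getpid);    /* cheap syscall: mostly transition */
        end = __rdtsc();
        printf("%llu cycles/round trip\n", (end - start) / LOOPS);
        return 0;
}

run it on the same kernel with and without pax_nouderef to see the delta.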
last but not least a note on implementation. besides the already mentioned
special userland shadow area there's another important bit: per-CPU PGDs.
what this does is simple: each CPU gets its own top-level page directory
for its exclusive use (instead of the usual per-process PGD). this among
other things means that we can begin the proper lockdown of the entire page
table hierarchy (a todo item for KERNEXEC as well, that's why this feature
is also now enabled there, even on i386/PAE).
so this is it in a nutshell, if you have questions, comments, complaints,
etc, you know where to reach us ;).

uderef.txt
in the following i'll assume that you have good knowledge of i386
protected mode details and some concepts about how modern OSs make
use of them and i'll only explain what linux (2.6 in particular)
does and how UDEREF modifies it to achieve best-effort (read: not
perfect) userland/kernel separation.

----------------------------------------------------------------------------
of the basic protected mode resources (TSS, GDT, LDT, IDT) we'll
look at the GDT only as that's what defines all the interesting
descriptors/segments needed for UDEREF (userland has some control
over the per-process LDT but whatever is there isn't relevant as
linux will load GDT selectors upon a userland->kernel transition
and will never dereference anything in the LDT).

linux 2.6 has per-CPU GDTs, each initialized from cpu_gdt_table
in arch/i386/kernel/head.S . the kernel uses 3 descriptors there,
__KERNEL_CS for %cs, __KERNEL_DS for %ss and __USER_DS for %ds/%es
(since 2.6.20 there's a per-CPU data segment stored in %gs and in
2.6.21 it'll move into %fs, but it's not relevant for UDEREF).
of these, __KERNEL_* are DPL=0 (i.e., for kernel use only), however
__USER_DS is DPL=3, the default userland data selector loaded for
any new userland process as well.

all of these descriptors define a flat segment (0-4GB), so there's
no userland/kernel separation at the segmentation level, it's
solely up to the paging logic (i guess you already know why that's
bad, but see below). the reason that the kernel uses the default
userland data selectors/segments is that presumably there's some
performance gain when the CPU reloads a segment register with the
same value it already has (coming from a typical ELF process the
data selectors will already have __USER_DS in them). i never checked
this claim and maybe i misremember it, but in any case, regardless
of which __*_DS selector the kernel uses, they're all flat anyway.
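for reference, here's a little userland toy that decodes the flat GDT
entries as found in a typical linux 2.6 cpu_gdt_table (descriptor values
quoted from memory, modulo the accessed bit, so double check them against
your head.S):

/* decode base/limit/DPL from raw i386 segment descriptors */
#include <stdio.h>

static void decode(const char *name, unsigned long long d)
{
        unsigned int base  = ((d >> 16) & 0xffffff) | ((d >> 32) & 0xff000000);
        unsigned int limit = (d & 0xffff) | ((d >> 32) & 0xf0000);
        int g   = (d >> 55) & 1;        /* granularity: 1 -> 4k units */
        int dpl = (d >> 45) & 3;

        if (g)
                limit = (limit << 12) | 0xfff;
        printf("%s: base=%#x limit=%#x dpl=%d\n", name, base, limit, dpl);
}

int main(void)
{
        decode("__KERNEL_CS", 0x00cf9a000000ffffULL);
        decode("__KERNEL_DS", 0x00cf92000000ffffULL);
        decode("__USER_CS",   0x00cffa000000ffffULL);
        decode("__USER_DS",   0x00cff2000000ffffULL);
        return 0;
}

all four print base=0 limit=0xffffffff, i.e., flat 0-4GB segments that
differ only in DPL and code/data type.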

now we're getting to the actual topic ;-). the problem i set out to
solve with UDEREF was that many kernel bugs can be exploited (at all
or more reliably) due to the fact that on i386 most OSs don't separate
the userland virtual address space from that of the kernel. this in
turn means that whenever userland can make the kernel (unexpectedly)
dereference a userland controlled pointer, userland can control the
data (and sometimes, control) flow of the kernel by virtue of providing
the attack data in its own userland address range as it's fully visible
in the kernel's virtual address space as well (the two virtual address
spaces are the same because of the use of flat segments and lack of
effective address space IDs in the i386 paging logic).

since there're two stages in logical->physical address translation,
one can attack the problem at each level. one approach is to make
use of paging by realizing that %cr3 is effectively the address space
ID on i386. that is, one can switch (reload) %cr3 on every kernel
entry to point to page tables that simply don't map userland. this
approach has been tried on linux in the so-called 4:4 VM patches
(Red Hat, Ingo Molnar) and was used/offered in some RHEL series (it
wasn't for security per se, but to provide a 4GB effective userland
virtual address space for aging 32-bit machines with lots of memory).
it has one big problem: performance impact. so much that it wasn't
even on the table for me.
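(for comparison, the address space switch itself is a one-liner, it's the
implied TLB flush on every transition that makes 4:4 expensive:)

/* kernel-mode only sketch: reloading %cr3 switches the page tables
 * and flushes all non-global TLB entries */
static inline void load_cr3(unsigned long pgd_phys)
{
        asm volatile("mov %0, %%cr3" : : "r" (pgd_phys) : "memory");
}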

the other approach is to make use of segmentation, which is what i
do in UDEREF. the basic idea is very simple and very old, i first
saw it in use (don't laugh ;-) on windows 95 where this particular
feature caught NULL userland derefs in win32 processes (and prevented
them from trashing the low DOS related memory). the trick they used
was that they set up the userland segments as so-called expand-down
segments, that is, the limit field in the segment descriptor meant
the lowest valid address for the segment, so they could set it to
something low, i forget the exact number now, but it was either 4MB
or 4k.
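the expand-down semantics in a nutshell (a sketch with unscaled limits,
i.e., G=0; real descriptors usually scale the limit by 4k via the G bit):

/* for a 32-bit data segment (D/B=1):
 * expand-up:   valid offsets are       0 .. limit
 * expand-down: valid offsets are limit+1 .. 0xffffffff */
#include <stdio.h>

static int offset_ok(unsigned int off, unsigned int limit, int expand_down)
{
        return expand_down ? off > limit : off <= limit;
}

int main(void)
{
        /* the win95-style trick: a lower limit at 4MB catches NULL derefs */
        printf("%d\n", offset_ok(0x00000000, 0x3fffff, 1)); /* 0: faults  */
        printf("%d\n", offset_ok(0x00400000, 0x3fffff, 1)); /* 1: allowed */
        return 0;
}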

the same trick is used in UDEREF with a few more modifications. the
new segment setup is as follows:

kernel %cs:
still __KERNEL_CS which remains flat (there's another PaX feature
which limits it properly, but let's not digress)

kernel %ss/%ds/%es:
all switched to __KERNEL_DS which is no longer flat but turned
into an expand-down segment with a lower limit set at the top of
the userland address space (__PAGE_OFFSET in linux, normally it's
at 3GB)

userland %cs: stays the same __USER_CS, flat, doesn't matter

userland %ds/%es/%ss:
stays as __USER_DS but limited (most of the time) to the userland
address space size.
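in raw descriptor bits the above could look like this (illustrative
values for the usual 3GB split, not necessarily the exact PaX ones):

/* base 0, G=1, D/B=1, limit field 0xbffff in both cases */
#define UDEREF_KERNEL_DS 0x00cb96000000ffffULL /* expand-down data, DPL 0:
                                                * valid offsets are
                                                * 0xc0000000-0xffffffff */
#define UDEREF_USER_DS   0x00cbf2000000ffffULL /* expand-up data, DPL 3:
                                                * valid offsets are
                                                * 0-0xbfffffff */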

this setup so far is enough to cause a GPF whenever the kernel tries
to dereference a pointer value that falls into the userland virtual
address range (say, 0-3GB, i.e., NULL pointers are automatically
included). what it's not enough for is the need to allow data transfer
between userland/kernel and kernel/kernel in certain contexts. let's
see them one by one.

as you know, syscalls often need to copy data to/from userland which
on i386 is most often implemented by the (rep) movs insn. this insn
uses two data segments, one for the source, one for the destination
(former can be overridden). this blind movs works in the usual case
because all segments are flat, so data can freely move between kernel
and userland virtual addresses.

with the above described setup however we have a problem since
neither %ds nor %es allow userland access. furthermore, even if one
could override the source segment (so that one could use one of the
remaining segment registers for copying from userland), it wouldn't
help with copying to userland. the solution is that in these special
copy functions (on linux it's all in asm) we reload the proper segment
register with __USER_DS for the duration of the copy therefore we allow
the copy to succeed *and* at the same time not allow kernel addresses
where they're not expected (e.g., when the kernel copies from userland,
this won't allow a source pointer to kernel land and vice versa).
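as a sketch (byte-at-a-time, no fault fixup, selector value hardcoded to
what __USER_DS is on linux/i386; the real code is more careful), copying
to userland could look like:

/* destination of rep movs goes through %es, which we temporarily
 * reload with __USER_DS so the destination is bounds-checked against
 * the userland segment; %ds stays __KERNEL_DS for the kernel source */
static unsigned long
uderef_copy_to_user(void *to, const void *from, unsigned long n)
{
        asm volatile(
                "pushl  %%es\n\t"
                "movl   %[uds], %%eax\n\t"
                "movl   %%eax, %%es\n\t"
                "rep    movsb\n\t"              /* %ds:(%esi) -> %es:(%edi) */
                "popl   %%es"
                : "+S" (from), "+D" (to), "+c" (n)
                : [uds] "i" (0x7b)              /* __USER_DS selector */
                : "eax", "cc", "memory");
        return n;                               /* 0 if fully copied */
}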

we're still not done however, as there's a special kernel/kernel copy
problem (i don't know if it's specific to linux): certain internal
kernel functions call syscall functions more or less directly (say,
setting up a kernel thread or even just starting init makes use of
the fork interface). this among other things means that any normal userland
pointer input these syscalls expect will fall into the kernel address
range - and would fail input validation right at the beginning, were
it not for a trick linux uses: set_fs().

there's some history behind this function because in early days linux
actually used to be more secure in this regard as it used segmentation
logic for userland/kernel separation (yes, there's some irony of fate
here ;-) and it provided this set_fs() call to tell the kernel that
further userland/kernel memory copy operations will actually stay within
the kernel. many years later this interface was effectively switched off
because linux turned to flat segments (the baby went with the bathwater
or something like that) however one of its features stayed: the address
space limit check in the low level copying functions. this limit check
simply compared the to-be-dereferenced pointer against the current
address space limit allowed for such copying (set_fs() itself was reduced
to change only this limit as needed).
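from the callers' side the interface looks like this (old kernel API,
since removed from linux; the callee here is made up for illustration):

/* classic in-kernel pattern: temporarily raise the address space
 * limit so the userland copy functions accept kernel pointers */
static int kernel_kernel_copy_example(void *kernel_ptr)
{
        mm_segment_t old_fs = get_fs();
        int err;

        set_fs(KERNEL_DS);
        err = some_syscall_body(kernel_ptr);    /* hypothetical callee */
        set_fs(old_fs);
        return err;
}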

with the non-flat segment setup described above we would still pass the
address space limit check because it's not directly derived from the
GDT descriptor limit but a per-thread variable, however we would fail
the actual copy because one of the segment registers is reloaded with
__USER_DS and that won't allow kernel addresses. the assumption i broke
is that the separately maintained address space limit corresponds to
the actual segment limit therefore the fix has to be that somehow this
segment limit needs to be updated every time set_fs() is called.

this can be accomplished by either updating the limit on __USER_DS (so
that the actual copy functions don't have to change) or by loading either
__USER_DS or __KERNEL_DS in said copy functions (then each one of them
has to be patched for this).

i went with the former for two reasons. one, it's a lot less code/work to
update the (per-CPU) GDT descriptor vs. changing all the copy functions
to conditionally load either __USER_DS or __KERNEL_DS (it took some time
to find most/all such copy code, besides the obvious copy_{to/from}_user
functions linux has a special IP checksumming code used for both kernel
and userland buffers, then there were some direct userland address accesses
in reboot code (to patch BIOS variables), etc).

two, i wasn't actually sure that set_fs() is only ever called for
kernel/kernel copy situations (since it only raises the allowed address
space limit, therefore on i386 it'd still allow userland accesses), so
doing the 'conditional load in the copy code' stuff just to find out it
was in vain wasn't worth it. i still want to try it out though because
this is the reason i called the current approach 'best-effort' separation
only: in set_fs() context a kernel bug actually has free rein over
pointers in the copy functions (yes, it's still a lot smaller attack
surface but i'm a maximalist ;-).
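the descriptor surgery itself is a small bit manipulation job, roughly
like this sketch (granularity handling and per-CPU details omitted):

/* write a new limit into a GDT descriptor in place:
 * limit[15:0] lives in bits 0-15, limit[19:16] in bits 48-51 */
static void gdt_set_limit(unsigned long long *desc, unsigned int limit)
{
        *desc = (*desc & ~0x000f00000000ffffULL)
              | (limit & 0xffff)
              | ((unsigned long long)(limit & 0xf0000) << 32);
}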

so, as a summary: UDEREF makes sure that (data) segments for userland and
the kernel are properly limited, either upwards (userland) or downwards
(kernel). furthermore, every userland/kernel copy code is patched to use
the proper segments for the inter-segment copy (the challenging part was
to find all such code ;-). as a linux specialty, some care is taken to
ensure that the same copy code works for kernel/kernel copying as well.

note that GDT patching can be done safely because linux 2.6 has per-CPU
GDTs (and in the context switch code i take care of the __USER_DS limit
based on the set_fs() limit), on linux 2.4 and earlier this cannot be
done (there's one shared GDT only) and there all the copy functions are
potential escape points.
----------------------------------------------------------------------------

that'd be it in a nutshell, feel free to ask me if you have questions.

cheers,
PaX Team


To: pageexec@freemail.hu
Cc: Elad Efrat
Subject: Re: PaX uderef
Date: Wed, 4 Apr 2007 15:14:45 +0900 (JST)
From: [censored]

> On 3 Apr 2007 at 20:28, Elad Efrat wrote:
>
> > [censored],
> >
> > I've cc'd the PaX author to this mail so he can elaborate on what uderef
> > is, how it works, and why it was introduced.
>
> hello guys,
>
> in the following i'll assume that you have good knowledge of i386
> protected mode details and some concepts about how modern OSs make
> use of them and i'll only explain what linux (2.6 in particular)
> does and how UDEREF modifies it to achieve best-effort (read: not