  1. 1. Design
  2. The goal of vma mirroring is to allow for creating special file mappings
  3. where a given set of physical pages (those backing the file) is visible
  4. at two different linear addresses in a given task. Furthermore the mirroring
  5. logic ensures that the two mappings in linear address space will see the
  6. same physical pages even after they go through a swap-out/swap-in cycle
  7. or copy-on-write.
  8. While vma mirroring is a generic idea, PaX uses it for very specific
  9. purposes and therefore the implementation is a bit less generic than it
  10. could be (but it results in simpler and less intrusive changes).
  11. The first use is for mirroring executable regions into the code segment
  12. under SEGMEXEC. In this case the 3 GB userland linear address space is
  13. divided into two halves of 1.5 GB each and the code/data segment descriptors
  14. are modified to cover only one or the other. To be able to execute code
  15. under this setup we have to ensure that executable mappings are visible
  16. in the code segment region (1.5-3 GB range in linear address space). Since
  17. such executable mappings may contain data as well (constant strings,
  18. function pointer tables, etc), we have to have a mirror of these mappings
  19. at the same logical addresses in the data segment as well (0-1.5 GB range
  20. in linear address space). The nice property of this setup is that a pair
  21. of mirrored regions will have a constant difference between their start/end
  22. addresses: 1.5 GB (or SEGMEXEC_TASK_SIZE as it is often referenced in the
  23. code).
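As a rough illustration of this constant offset, the following Python sketch (a userland model, not kernel code) computes the code segment mirror of a data segment address. The value of SEGMEXEC_TASK_SIZE is an assumption that follows from halving the 3 GB userland address space, and mirror_address() is a hypothetical helper name.

```python
# Userland model of the SEGMEXEC address arithmetic; not kernel code.
TASK_SIZE = 0xC0000000                 # 3 GB of userland on i386
SEGMEXEC_TASK_SIZE = TASK_SIZE // 2    # 1.5 GB, assumed from the text


def mirror_address(addr):
    """Return the code segment mirror of a data segment linear address."""
    if not 0 <= addr < SEGMEXEC_TASK_SIZE:
        raise ValueError("address is not in the data segment half")
    return addr + SEGMEXEC_TASK_SIZE
```

For example, mirror_address(0x08048000) yields 0x68048000, one half of the userland address space above the original.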
  24. The second use of vma mirroring is to implement the mirror of the main
  25. executable at a randomized address under RANDEXEC. Here again we will have
  26. a constant (task specific) difference between the mirrored regions and can
  27. simplify the implementation the same way as under SEGMEXEC.
  28. There is also an implicit third situation, when both SEGMEXEC and RANDEXEC
  29. are active for a task. At first glance this may appear very complex since
  30. the executable region of the main executable would have to be mirrored
  31. at three places instead of one: randomized mappings in both the data and
  32. the code segment (at the same logical addresses) plus a mirror into the
  33. code segment at the original logical address. Luckily, we can save on two
  34. of them: first, the randomized mapping in the data segment is not needed as
  35. we do not expect code to reference data in its executable segment in a position
  36. independent manner (which is what would be needed for code to learn its own
  37. location); second, we do not need the original mapping mirrored in the code
  38. segment as we explicitly do not want it to be executable (so that code
  39. references to this region would raise a page fault and the RANDEXEC logic
  40. could then react on it).
  41. 2. Implementation
  42. vma mirroring requires two basic changes to the VM in Linux. First we have
  43. to provide an interface for setting up the mirrors, second we have to
  44. maintain the synchronicity between the mirrored regions' linear/physical
  45. mappings.
  46. Linux maintains a per task database of what is present in the given task's
  47. address space. This database is a set of structures called vm_area_struct
  48. each of which describes a single mapping (defined in include/linux/mm.h).
  49. The database for a task can be viewed in /proc/<pid>/maps. For our purposes
  50. the relevance of the vma database is that it directly guides the page fault
  51. resolution logic which in turn is responsible for setting up the linear to
  52. physical address translation on a per page basis.
  53. To understand how all this works, consider task creation and its first
  54. moment of life in userland. For ELF executables the load_elf_binary()
  55. function in fs/binfmt_elf.c is responsible for populating the task's
  56. address space with a few basic mappings such as the stack, the dynamic
  57. linker (if the application in question is dynamically linked) and the main
  58. executable itself. The file mappings are established by using the kernel's
  59. internal do_mmap() interface through a simple wrapper called elf_map().
  60. Note that at this point only the stack region has physical pages assigned
  61. (since that is where arguments, the environment, etc go and therefore must
  62. be present in physical memory at this early stage), the file mappings are
  63. not yet backed by physical memory pages.
  64. When the task begins its life in userland, the very first instruction fetch
  65. in ld.so or the main executable will raise a page fault since the Linux VM
  66. system does not establish a valid physical mapping until it is actually
  67. needed (i.e., it is demand based). The first thing the architecture specific
  68. page fault handler (for i386 it is do_page_fault() in arch/i386/mm/fault.c)
  69. does is to find the vma structure that describes the region in which the
  70. page fault happened then call the architecture independent handler
  71. (handle_mm_fault() in mm/memory.c) which based on the fault and the vma
  72. type will call the appropriate function to establish a physical page
  73. containing the expected data (in our case it would be read from the file
  74. backing the mapping, that is, somewhere from the .text section in an ELF
  75. file).
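The demand based fault resolution described above can be pictured with a small Python model. This is only a sketch under simplifying assumptions: the Vma class, lookup_vma() and handle_fault() are hypothetical names, a dictionary stands in for the page tables, and the string stored there stands in for a physical page.

```python
# Toy model of demand paging: vma lookup followed by page population.
from dataclasses import dataclass

PAGE_SIZE = 4096


@dataclass
class Vma:
    start: int      # inclusive
    end: int        # exclusive
    prot: str       # e.g. "r-x"
    backing: str    # file name, or "anon"


def lookup_vma(vmas, addr):
    """Return the vma covering addr, or None if addr is unmapped."""
    for vma in vmas:
        if vma.start <= addr < vma.end:
            return vma
    return None


def handle_fault(vmas, page_table, addr, access):
    """Resolve a fault at addr for access 'r', 'w' or 'x'."""
    vma = lookup_vma(vmas, addr)
    if vma is None or access not in vma.prot:
        return "SIGSEGV"                      # invalid fault
    page = addr & ~(PAGE_SIZE - 1)
    # First touch: establish the translation only now, on demand.
    page_table.setdefault(page, f"page of {vma.backing}")
    return "ok"
```

The page table stays empty until the first access, which mirrors the point made above that file mappings are not backed by physical pages at exec time.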
  76. The interface for setting up a vma mirror is a simple extension to the
  77. already existing memory mapping interface. This interface is accessible
  78. from userland as the mmap() library call. Since the vma mirroring facility
  79. is meant to be used by specific PaX features only, userland initiated vma
  80. mirroring requests are not allowed (that is why PaX returns an error from
  81. do_mmap2() in arch/i386/kernel/sys_i386.c). Care must be taken however for
  82. handling mmap() requests from tasks running under SEGMEXEC. This is because
  83. they can create executable file mappings and therefore they must be mirrored
  84. just like when the kernel itself establishes the initial file mappings as
  85. discussed above. Since all mmap() requests go through do_mmap() (an inline
  86. function defined in include/linux/mm.h) this is where PaX requests the extra
  87. mirrored mappings for SEGMEXEC executables. Since do_mmap2() originally
  88. gets around do_mmap() by directly calling do_mmap_pgoff(), we modified it
  89. to use do_mmap() instead. This way we can ensure that the SEGMEXEC logic
  90. gets to see both userland and kernel originated file mapping requests.
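The idea of funnelling every request through one entry point can be sketched as follows. This is a Python model and not the PaX code: the dictionary based mm, the condition used to decide when a mirror is requested and the simplified argument lists are all assumptions made for illustration.

```python
# Toy model: one entry point that also requests the SEGMEXEC mirror.
PROT_EXEC = 0x4
SEGMEXEC_TASK_SIZE = 0x60000000   # assumed: half of a 3 GB userland


def do_mmap_pgoff(mm, addr, length, prot, file):
    """Stand-in for the real mapping primitive: just record the mapping."""
    mm["vmas"].append({"start": addr, "end": addr + length,
                       "prot": prot, "file": file})
    return addr


def do_mmap(mm, addr, length, prot, file):
    """Map the region, then mirror it if it is an executable file mapping."""
    ret = do_mmap_pgoff(mm, addr, length, prot, file)
    if mm["segmexec"] and file is not None and prot & PROT_EXEC:
        do_mmap_pgoff(mm, ret + SEGMEXEC_TASK_SIZE, length, prot, file)
    return ret
```

Because both kernel and userland requests pass through do_mmap() in this model, neither path can create an executable file mapping without its mirror.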
  91. vma mirror requests use special arguments for calling do_mmap_pgoff() in
  92. the end:
  93. 'file' must be NULL because the mirror will reference the same file
  94. as the vma to be mirrored,
  95. 'addr' has its normal meaning of specifying a hint for searching a
  96. suitable hole in the address space where the mapping can go,
  97. 'len' must be 0 because it will be derived from the vma that is about
  98. to get mirrored,
  99. 'prot' has its normal meaning,
  100. 'flags' has its normal meaning except that it must also specify the new
  101. MAP_MIRROR flag and it must request a private mapping,
  102. 'pgoff' specifies the linear start address of the vma to be mirrored
  103. (note that here it is measured in bytes and not PAGE_SIZE units).
  104. The vma to be mirrored must exist at the specified start address ('pgoff')
  105. and must not be mirrored or be a mirror itself already. Furthermore PaX will
  106. not allow a writable mirror for a read-only vma. Note that these are only
  107. sanity checks to detect early if there is a bug in the rest of the vma
  108. mirroring logic (denied mirror requests will end up in a non-functioning
  109. task and are therefore easy to see for an end user).
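The argument rules and sanity checks listed above can be summarized in a short Python model. It is a sketch only: check_mirror_request() and the Vma class are hypothetical, the numeric value given to MAP_MIRROR is a placeholder chosen for this example, and the real checks live in kernel C code.

```python
# Toy model of the sanity checks applied to a vma mirror request.
from dataclasses import dataclass
from typing import Optional

MAP_PRIVATE = 0x02        # value from the Linux mmap ABI
MAP_MIRROR = 0x20000      # placeholder value, assumed for this sketch


@dataclass
class Vma:
    start: int
    end: int
    writable: bool
    mirror: Optional["Vma"] = None   # set once the vma takes part in a mirror


def check_mirror_request(vmas, file, length, flags, pgoff, want_write):
    """Return the vma to be mirrored, or raise ValueError."""
    if file is not None or length != 0:
        raise ValueError("'file' must be NULL and 'len' must be 0")
    if not (flags & MAP_MIRROR and flags & MAP_PRIVATE):
        raise ValueError("MAP_MIRROR and a private mapping are required")
    # 'pgoff' carries the start address of the target vma, in bytes.
    target = next((v for v in vmas if v.start == pgoff), None)
    if target is None:
        raise ValueError("no vma starts at 'pgoff'")
    if target.mirror is not None:
        raise ValueError("vma already takes part in a mirror")
    if want_write and not target.writable:
        raise ValueError("writable mirror of a read-only vma")
    return target
```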
  110. The second basic change needed for implementing vma mirroring is in the MMU
  111. state management logic (which governs the linear/physical translation). Our
  112. goal is simple: whenever the state of a mirrored page changes we will have
  113. to propagate the change into the state of the mirror page as well (and do all
  114. this atomically, that is, other state changing code must be locked out until
  115. we have finished). Such state changes occur in the following operations: page
  116. fault servicing, munmap, mremap, mprotect, mlock and vma merging.
  117. Servicing a fault means that the kernel finds out why a page fault occurred
  118. and when it is valid (it occurred in a region described by a vma having proper
  119. access rights) it will allocate storage in physical memory and set up a valid
  120. linear/physical translation in the MMU (on i386 it means setting up a present
  121. pte).
  122. While the page fault classification (valid/invalid) is done in architecture
  123. specific code, the actual servicing no longer needs to care about such details
  124. and is architecture independent: it is done in handle_mm_fault(), therefore
  125. this is what we have to modify. Also note that handle_mm_fault() is used by
  126. other code as well that we would otherwise have to modify explicitly
  127. (get_user_pages() for example, which is used by ptrace() among others).
  128. The strategy for servicing a page fault in a mirrored vma is the following:
  129. first we do some sanity checks on the mirror's vma (again in order to detect
  130. potential bugs in the implementation early) then allocate the necessary MMU
  131. resources (various levels of paging structures) so that by the time we get
  132. to propagate the MMU state information, we will not have to worry about
  133. resource allocation failures. After successful resource allocation we let
  134. the original fault handling logic carry out its work (swap-in a page, do
  135. copy-on-write, etc) and intervene when it has just established the new MMU
  136. state for the mirrored page: we call the core of the vma mirroring logic in
  137. pax_mirror_fault() in mm/memory.c.
  138. To simplify the logic of the mirroring code, we established a simple naming
  139. convention for variables related to one or the other vma: the vma for which
  140. handle_mm_fault() was called is said to be mirrored and the corresponding
  141. variables have no suffixes, whereas the other vma is called the mirror and
  142. its variables are suffixed by _m. For example, vma_m is the vm_area_struct
  143. pointer of the mirror vma.
  144. pax_mirror_fault() first determines if it has anything to do in the first
  145. place and if so, it looks up the mirror vma and associated information, such
  146. as the mirror of the fault address and the related MMU structures (we are
  147. interested in the page table entry as it contains the physical page number
  148. that will have to be synchronized between the mirrors). Once the mirror's
  149. pte is known, we have to see if it currently specifies a valid mapping and
  150. if so, we have to invalidate it (and while handling the different cases, we
  151. also take care of the resident set size: we have to increment it if the
  152. mapping was not valid since it will be after mirroring). Invalidating the
  153. current mapping in the mirror is derived from kernel code doing the same:
  154. the munmap() and swap-out operations. The next and final step is to actually
  155. propagate the new linear/physical mapping into the mirror: we look up the
  156. physical page in the mirrored pte (and increment its use count since we are
  157. going to create another reference to it in the mirror's pte) then construct
  158. the mirror's pte from it and the appropriate access rights (the writability
  159. state must be copied verbatim from the mirrored pte otherwise we would ruin
  160. the copy-on-write logic).
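The propagation steps above can be modelled in a few lines of Python. This is a toy sketch and not pax_mirror_fault() itself: dictionaries stand in for the page tables and the per page use counts, a pte is reduced to a (frame, writable) tuple, and all names are hypothetical.

```python
# Toy model of pte propagation; dictionaries stand in for page tables.
def mirror_fault(ptes, use_count, mm, address, delta):
    """Copy the pte just established at address into address + delta."""
    address_m = address + delta          # _m suffix: belongs to the mirror
    frame, writable = ptes[address]      # the mirrored pte
    old_m = ptes.get(address_m)
    if old_m is not None:
        use_count[old_m[0]] -= 1         # invalidate the stale mirror mapping
    else:
        mm["rss"] += 1                   # the mirror page becomes resident
    use_count[frame] += 1                # the mirror pte is a new reference
    ptes[address_m] = (frame, writable)  # writability is copied verbatim
```

Calling it again after a copy-on-write fault has replaced the frame behind the mirrored pte moves the mirror over to the new frame and leaves the resident set size unchanged.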
  161. The atomicity of all the above actions is ensured by holding the appropriate
  162. page_table_lock on entry and never releasing it inside. This way the higher
  163. level callers (who establish the mirrored pte) ensure that the mirror's pte
  164. is established at the same time.
  165. The last set of vma mirroring related changes ensures that userland can
  166. modify/destroy mirrored regions only along with the corresponding mirrors
  167. (creation was described in the do_mmap() changes).
  168. The most complex change is in the munmap() logic which is responsible for
  169. destroying all kinds of mappings. To understand the changes let's first look
  170. at how it works in the standard kernel. The core function is do_munmap() in
  171. mm/mmap.c which begins by doing some checks on the area to be removed then
  172. proceeds with moving all the vma structures that fall in there (fully or in
  173. part) from the mm's vma list to a special one. In the next phase this list is
  174. processed for clearing all linear/physical mappings in the MMU for each vma.
  175. The final step is to free up page tables that may have become empty and are
  176. no longer needed.
  177. Mirror handling requires two changes in the above logic. First, we have to
  178. detect whether any vma to be removed is mirrored and if so we have to move its
  179. mirror vma to another special list (during the same atomic operation as it
  180. is done originally). Second, we have to clear the corresponding mappings in
  181. the MMU for this list of vma structures as well.
  182. While setting up the second special list is straightforward, the second step
  183. is not as the original kernel code is rather badly organized and does not
  184. lend itself to easy reuse. In order to avoid unnecessary code replication we
  185. opted for rearranging the original code a bit through simple program
  186. transformations: the MMU cleanup logic has been split into unmap_vma_list()
  187. and unmap_vma(). This way processing the second special list can be done in
  188. unmap_vma_mirror_list() which makes use of unmap_vma().
  189. There is one last trick worth noting: map_count handling. This counter has to
  190. be decremented for each vma which gets unmapped. The original do_munmap()
  191. logic delegates this task to unmap_vma(). The problem with it is that at this
  192. point all the vma structures have already been removed from the main mm vma
  193. list, however this counter is decremented one by one for each vma. This in
  194. turn will trigger a kernel BUG message because of the inconsistency between
  195. map_count and the actual number of vma structures on the mm vma list. Since
  196. this is a kernel BUG (or 'feature', after all at the end of do_munmap()
  197. everything will be in synch again), we decided to modify the original kernel
  198. logic so that map_count gets decremented during the special lists preparation
  199. phase.
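The two list approach and the early map_count adjustment can be sketched with a Python model. It is deliberately simplified and makes several assumptions: partial unmaps (vma splitting) are ignored, the mm is a plain dictionary, and munmap_model() is a hypothetical name rather than the structure of the real do_munmap().

```python
# Toy model of unmapping with a second list for the mirror vmas.
from dataclasses import dataclass
from typing import Optional

PAGE_SIZE = 4096


@dataclass(eq=False)                 # compare by identity; mirrors are cyclic
class Vma:
    start: int
    end: int
    mirror: Optional["Vma"] = None


def munmap_model(mm, start, end):
    """Detach vmas overlapping [start, end) together with their mirrors."""
    doomed = [v for v in mm["vmas"] if v.start < end and start < v.end]
    mirrors = [v.mirror for v in doomed
               if v.mirror is not None and v.mirror not in doomed]
    # Preparation phase: detach both lists and adjust map_count right away,
    # so the counter always matches the length of the main list.
    for vma in doomed + mirrors:
        mm["vmas"].remove(vma)
        mm["map_count"] -= 1
        assert mm["map_count"] == len(mm["vmas"])
    # Cleanup phase: clear the translations covered by both lists.
    for vma in doomed + mirrors:
        for page in range(vma.start, vma.end, PAGE_SIZE):
            mm["ptes"].pop(page, None)
    return doomed, mirrors
```

The assertion inside the preparation loop plays the role of the consistency check mentioned above: it holds at every step because the counter is adjusted at the moment each vma leaves the main list.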
  200. The next change is in do_mremap() in mm/mremap.c where we ensure that
  201. mirrored regions simply cannot be remapped (they can shrink however as it
  202. simply means a call to do_munmap() which handles mirrors fine).
  203. The remaining two userland interfaces are handled the same way because both
  204. mprotect() and mlock() have the same internal logic: enumerate all mappings
  205. in the given range and then act on each one of them individually. The
  206. functions we modify are mprotect_fixup() and mlock_fixup(), respectively.
  207. First the original functions are moved to __mprotect_fixup() and
  208. __mlock_fixup() then they are called for both the vma and its mirror when a
  209. mirrored vma is encountered.
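The wrapper pattern used for both interfaces can be illustrated with a minimal Python model. It is a sketch under assumptions: raw_mprotect_fixup() stands in for __mprotect_fixup(), the Vma class is hypothetical, and applying identical flags to the mirror is a simplification made for this example.

```python
# Toy model of the fixup wrapper: act on a vma and on its mirror.
class Vma:
    def __init__(self, start, end, flags):
        self.start, self.end, self.flags = start, end, flags
        self.mirror = None           # the other half of a mirror pair, if any


def raw_mprotect_fixup(vma, newflags):
    """Stands in for __mprotect_fixup(): acts on a single vma."""
    vma.flags = newflags


def mprotect_fixup(vma, newflags):
    """Wrapper: apply the change to the vma and to its mirror, if any."""
    raw_mprotect_fixup(vma, newflags)
    if vma.mirror is not None:
        raw_mprotect_fixup(vma.mirror, newflags)
```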
  210. The vma merging mechanism is governed by the inline can_vma_merge() function
  211. in include/linux/mm.h. PaX modifies this function to prevent anonymous mirror
  212. mappings from getting inadvertently merged with others (file mappings are
  213. never merged).
  214. 3. Examples
  215. To help better understand vma mirroring, we present a few address space
  216. layouts and explain what happened there. In each case we used a copy of
  217. /bin/cat in /tmp to execute "/tmp/cat /proc/self/maps". Note that for
  218. the sake of simplicity we disabled RANDMMAP; this of course should not
  219. be done in production systems. The [x] marks are not part of the original
  220. output, we use them to refer to the various lines in the explanation.
  221. Active PaX features: SEGMEXEC and MPROTECT
  222. [1] 08048000-0804a000 R-Xp 00000000 00:0b 1109 /tmp/cat
  223. [2] 0804a000-0804b000 RW-p 00002000 00:0b 1109 /tmp/cat
  224. [3] 0804b000-0804d000 RW-p 00000000 00:00 0
  225. [4] 20000000-20015000 R-Xp 00000000 03:07 110818 /lib/ld-2.2.5.so
  226. [5] 20015000-20016000 RW-p 00014000 03:07 110818 /lib/ld-2.2.5.so
  227. [6] 2001e000-20143000 R-Xp 00000000 03:07 106687 /lib/libc-2.2.5.so
  228. [7] 20143000-20149000 RW-p 00125000 03:07 106687 /lib/libc-2.2.5.so
  229. [8] 20149000-2014d000 RW-p 00000000 00:00 0
  230. [9] 5fffe000-60000000 RW-p fffff000 00:00 0
  231. [10] 68048000-6804a000 R-Xp 00000000 00:0b 1109 /tmp/cat
  232. [11] 80000000-80015000 R-Xp 00000000 03:07 110818 /lib/ld-2.2.5.so
  233. [12] 8001e000-80143000 R-Xp 00000000 03:07 106687 /lib/libc-2.2.5.so
  234. Since cat is a dynamically linked executable, its address space will have
  235. several file mappings besides the main executable. Let's see what each line
  236. represents.
  237. [1] is the first PT_LOAD segment of the /tmp/cat ELF file, it is mapped
  238. with R-X rights, that is, it contains the executable code plus all
  239. read-only initialized data as well. It is also mirrored by [10] because
  240. it is executable.
  241. [2] is the second PT_LOAD segment of the /tmp/cat ELF file, it is mapped
  242. with RW- rights, that is, it contains writable data (all initialized and
  243. the beginning of the uninitialized data, in our case all of it as they
  244. fit into a single page).
  245. [3] is the brk() managed heap (its size changes at runtime as cat calls
  246. malloc()/free()/etc). Note that if cat had more uninitialized data than
  247. what would fit into the gap left on the last page of mapping [2] then
  248. the rest would be mapped here from the beginning of [3] and the brk()
  249. managed heap would follow then.
  250. [4] and [5] are the PT_LOAD segments of the dynamic linker, whereas
  251. [6] and [7] are those of the C library. [4] and [6] are also mirrored by
  252. [11] and [12] respectively as they are executable.
  253. [8] is an anonymous mapping corresponding to uninitialized data in the C
  254. library (if we take a look at the ELF program headers of libc, we will
  255. see that the memory size of the second PT_LOAD segment is 4 pages more
  256. than its file size).
  257. [9] is another anonymous mapping containing the stack. We can observe that
  258. it is at the end of the userland address space (which under SEGMEXEC
  259. is at TASK_SIZE/2) and grows downwards.
  260. [10], [11] and [12] are the mirrors of the executable file mappings [1],
  261. [4] and [6] respectively (notice that each pair has exactly TASK_SIZE/2
  262. "distance"). They are all above the TASK_SIZE/2 limit as well which
  263. means that they are part of the code segment and hence executable.
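The pairing described above can be checked mechanically. The following Python sketch parses the listing shown earlier (with the [x] marks removed) and reports every mapping that has a counterpart of the same file exactly TASK_SIZE/2 higher. The function names are hypothetical and the value of SEGMEXEC_TASK_SIZE assumes a 3 GB userland.

```python
# Find mirror pairs in a SEGMEXEC maps listing by their constant distance.
SEGMEXEC_TASK_SIZE = 0x60000000   # TASK_SIZE/2, assuming a 3 GB userland

MAPS = """\
08048000-0804a000 R-Xp 00000000 00:0b 1109 /tmp/cat
0804a000-0804b000 RW-p 00002000 00:0b 1109 /tmp/cat
0804b000-0804d000 RW-p 00000000 00:00 0
20000000-20015000 R-Xp 00000000 03:07 110818 /lib/ld-2.2.5.so
20015000-20016000 RW-p 00014000 03:07 110818 /lib/ld-2.2.5.so
2001e000-20143000 R-Xp 00000000 03:07 106687 /lib/libc-2.2.5.so
20143000-20149000 RW-p 00125000 03:07 106687 /lib/libc-2.2.5.so
20149000-2014d000 RW-p 00000000 00:00 0
5fffe000-60000000 RW-p fffff000 00:00 0
68048000-6804a000 R-Xp 00000000 00:0b 1109 /tmp/cat
80000000-80015000 R-Xp 00000000 03:07 110818 /lib/ld-2.2.5.so
8001e000-80143000 R-Xp 00000000 03:07 106687 /lib/libc-2.2.5.so
"""


def parse_maps(text):
    """Yield (start, end, perms, path) for each line of a maps listing."""
    for line in text.splitlines():
        fields = line.split()
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5] if len(fields) > 5 else ""
        yield start, end, fields[1], path


def find_mirrors(text):
    """Return (start, mirror start, path) for each mirrored mapping."""
    regions = {(s, e): path for s, e, _perms, path in parse_maps(text)}
    pairs = []
    for (s, e), path in sorted(regions.items()):
        key = (s + SEGMEXEC_TASK_SIZE, e + SEGMEXEC_TASK_SIZE)
        if regions.get(key) == path:
            pairs.append((s, key[0], path))
    return pairs
```

On this listing find_mirrors(MAPS) returns three pairs, which correspond to [1]/[10], [4]/[11] and [6]/[12].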
  264. Active PaX features: SEGMEXEC and RANDEXEC and MPROTECT
  265. [1] 08048000-0804a000 R-Xp 00000000 00:0b 1109 /tmp/cat
  266. [2] 0804a000-0804b000 RW-p 00002000 00:0b 1109 /tmp/cat
  267. 0804b000-0804d000 RW-p 00000000 00:00 0
  268. [3] 20000000-20002000 ++-p 00000000 00:00 0
  269. [4] 20002000-20003000 RW-p 00002000 00:0b 1109 /tmp/cat
  270. 20003000-20018000 R-Xp 00000000 03:07 110818 /lib/ld-2.2.5.so
  271. 20018000-20019000 RW-p 00014000 03:07 110818 /lib/ld-2.2.5.so
  272. 20021000-20146000 R-Xp 00000000 03:07 106687 /lib/libc-2.2.5.so
  273. 20146000-2014c000 RW-p 00125000 03:07 106687 /lib/libc-2.2.5.so
  274. 2014c000-20150000 RW-p 00000000 00:00 0
  275. [5] 5fffd000-60000000 RW-p ffffe000 00:00 0
  276. [6] 80000000-80002000 R-Xp 00000000 00:0b 1109 /tmp/cat
  277. 80003000-80018000 R-Xp 00000000 03:07 110818 /lib/ld-2.2.5.so
  278. 80021000-80146000 R-Xp 00000000 03:07 106687 /lib/libc-2.2.5.so
  279. Enabling RANDEXEC changes the layout slightly. In particular, beyond the
  280. previously shown mappings we can see [3] which represents the dummy
  281. anonymous mapping corresponding to the first (executable) PT_LOAD segment
  282. of cat and [4] which is the mirror of the second PT_LOAD segment of cat
  283. (notice that [2] and [4] have the same page offset values). Also observe
  284. that [1] is mirrored above the TASK_SIZE/2 limit by [6] at a different
  285. distance than TASK_SIZE/2, hence despite its having the R-X rights, it is
  286. not actually executable: logical addresses of [1] are invalid in the code
  287. segment, instead it is region [3] whose logical addresses are valid there
  288. (in region [6]).
  289. The careful reader has probably noticed a small difference between this
  290. and the previous situation: the stack area in the first case is 2 pages
  291. long whereas [5] here has 3 pages. The reason for this discrepancy
  292. has to do with RANDUSTACK: the first part of the stack randomization
  293. cannot be disabled and in our case it happened to cause a big enough shift
  294. that made the kernel allocate an extra page for the initial stack.
  295. Active PaX features: PAGEEXEC and RANDEXEC and MPROTECT
  296. [1] 08048000-0804a000 R--p 00000000 00:0b 1109 /tmp/cat
  297. [2] 0804a000-0804b000 RW-p 00002000 00:0b 1109 /tmp/cat
  298. 0804b000-0804d000 RW-p 00000000 00:00 0
  299. [3] 40000000-40002000 R-Xp 00000000 00:0b 1109 /tmp/cat
  300. [4] 40002000-40003000 RW-p 00002000 00:0b 1109 /tmp/cat
  301. 40003000-40018000 R-Xp 00000000 03:07 110818 /lib/ld-2.2.5.so
  302. 40018000-40019000 RW-p 00014000 03:07 110818 /lib/ld-2.2.5.so
  303. 40021000-40146000 R-Xp 00000000 03:07 106687 /lib/libc-2.2.5.so
  304. 40146000-4014c000 RW-p 00125000 03:07 106687 /lib/libc-2.2.5.so
  305. 4014c000-40150000 RW-p 00000000 00:00 0
  306. bfffe000-c0000000 RW-p fffff000 00:00 0
  307. The last case where vma mirroring takes place has the simplest layout of
  308. all as only the main executable is mirrored: [3] mirrors [1] and [4]
  309. mirrors [2]. Notice that [1] no longer has R-X rights but R-- as under
  310. PAGEEXEC it is the mapping rights that decide what is executable, not the
  311. mapping's position.