  1. 1. Design
  2. The goal of PAGEEXEC is to implement the non-executable page feature using
  3. the paging logic of IA-32 based CPUs.
  4. Traditionally page protection is implemented by using the features of the
  5. CPU Memory Management Unit. Unfortunately IA-32 lacks the hardware support
  6. for execution protection, i.e., it is not possible to directly mark a page
  7. as executable/non-executable in the paging related structures (the page
  8. directory (pde) and table entries (pte)). What still makes it possible to
  9. implement non-executable pages is the fact that from the Pentium series on
  10. the Intel CPUs have a split Translation Lookaside Buffer for code and data
  11. (AMD CPUs have had a split TLB since the K5 series; however, due to its
  12. organization it is usable for our purposes only from the K7 core based
  13. CPUs on).
  14. The role of the TLB is to act as a cache for virtual/physical address
  15. translations that the CPU has to perform for every single memory access (be
  16. that instruction fetch or data read/write). Without the TLB the CPU would
  17. have to perform an expensive page table walk operation for every such
  18. memory access and obviously that would be detrimental to performance.
  19. The TLB operates in a simple manner: whenever the CPU wants to access a
  20. given virtual address, it will first check whether the TLB has a cached
  21. translation or not. On a TLB hit it will take the physical address directly
  22. from the TLB, otherwise it will perform a page table walk to look up the
  23. required translation and cache the result in the TLB as well (if the page
  24. table walk is unable to find the translation or the result is in conflict
  25. with the access type, e.g., a write to a read-only page, then the CPU will
  26. instead raise a page fault exception). Note that hardware assisted page
  27. table walking and automatic TLB loading are features of IA-32; other CPUs
  28. may have or need software assistance in this operation. Since
  29. the TLB has a finite size, sooner or later it becomes full and the CPU will
  30. have to purge entries to make room for new translations (on IA-32 this is
  31. again automatically done in hardware). Software can also purge TLB entries
  32. by either removing all translations (e.g., whenever a userland context
  33. switch happens) or those corresponding to a specific virtual address.
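
The lookup sequence described above can be illustrated with a toy model. This
is only a sketch under stated assumptions, not how any real CPU is built: the
names (struct tlb, translate(), page_walk(), tlb_flush_page(),
tlb_flush_all()), the 4-entry size, the round-robin replacement and the fake
page table are all invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT       12u                        /* 4 KiB pages */
#define PAGE_OFFSET_MASK ((1u << PAGE_SHIFT) - 1u)
#define TLB_SLOTS        4u                         /* toy size (assumption) */

struct tlb_entry { bool valid; uint32_t vpn; uint32_t pfn; };

struct tlb {
    struct tlb_entry slot[TLB_SLOTS];
    unsigned next;    /* round-robin replacement cursor (assumption) */
    unsigned walks;   /* number of page table walks performed */
};

/* Hypothetical page table: virtual pages 0..15 are present, the rest are not. */
static bool page_walk(uint32_t vpn, uint32_t *pfn)
{
    if (vpn >= 16u)
        return false;
    *pfn = vpn + 0x100u;
    return true;
}

/* Returns true and fills *paddr on success; false stands for a page fault. */
static bool translate(struct tlb *t, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    uint32_t pfn;

    assert(t != NULL && paddr != NULL);

    /* TLB hit: take the physical address directly from the TLB. */
    for (unsigned i = 0; i < TLB_SLOTS; i++) {
        if (t->slot[i].valid && t->slot[i].vpn == vpn) {
            *paddr = (t->slot[i].pfn << PAGE_SHIFT) | (vaddr & PAGE_OFFSET_MASK);
            return true;
        }
    }

    /* TLB miss: walk the page tables; a failed walk caches nothing. */
    t->walks++;
    if (!page_walk(vpn, &pfn))
        return false;

    /* Cache the result, evicting whatever occupied the chosen slot. */
    t->slot[t->next] = (struct tlb_entry){ true, vpn, pfn };
    t->next = (t->next + 1u) % TLB_SLOTS;
    *paddr = (pfn << PAGE_SHIFT) | (vaddr & PAGE_OFFSET_MASK);
    return true;
}

/* Software purge of the translation for one virtual address. */
static void tlb_flush_page(struct tlb *t, uint32_t vaddr)
{
    for (unsigned i = 0; i < TLB_SLOTS; i++)
        if (t->slot[i].valid && t->slot[i].vpn == (vaddr >> PAGE_SHIFT))
            t->slot[i].valid = false;
}

/* Software purge of all translations (e.g., on a context switch). */
static void tlb_flush_all(struct tlb *t)
{
    for (unsigned i = 0; i < TLB_SLOTS; i++)
        t->slot[i].valid = false;
}
```

In this model a second access to the same page does not increase the walk
counter, while an access after either kind of purge does.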
  34. As mentioned already, from the Pentium on Intel CPUs have a split TLB, that
  35. is, virtual/physical translations are cached in two independent TLBs
  36. depending on the access type: instruction fetch related memory accesses will
  37. load the ITLB, everything else loads the DTLB (if both kinds of accesses are
  38. made to a page then both TLBs will have an entry). TLB entry replacement
  39. also works on a per-TLB basis, except for software initiated purges, which
  40. act on both.
  41. The above described TLB behaviour means that software has explicit control
  42. over ITLB/DTLB loading: it can get notified on hardware TLB load attempts
  43. if it sets up the page tables so that such attempts will fail and trigger
  44. a page fault exception, and it can also initiate a TLB load by making the
  45. appropriate memory access to have the CPU walk the page tables and load one
  46. of the TLBs. This in turn is the key to implementing non-executable pages:
  47. such pages can be marked either as non-present or as requiring supervisor
  48. level access in the page tables, so that userland memory accesses raise a
  49. page fault. The page fault handler can then decide whether it was an
  50. instruction fetch attempt (by comparing the fault address to that of the
  51. instruction that raised the fault) or a legitimate data access. In the
  52. former case we will have detected an execution attempt in a non-executable
  53. page and can act accordingly (terminate the task); in the latter case we
  54. can just change the affected page table entry temporarily to allow user
  55. level access and have the CPU load it into the DTLB (we will of course have
  56. to restore the page table entry to the old state so that further page table
  57. walks will again raise a page fault).
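
The mechanism can be shown with a small simulation of one page, one ITLB entry
and one DTLB entry. This is a deliberately simplified sketch, not the PaX
code: all names (struct cpu, hw_access(), user_access()) are hypothetical, the
access type is passed in directly instead of being derived from the fault
address, and the kernel's dummy read is modelled as an ordinary access made
while the pte is temporarily user accessible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum access  { FETCH, DATA };
enum outcome { OK, KILLED };

struct pte { bool present; bool user; };

struct cpu {
    struct pte pte;      /* the single page table entry in this model */
    bool itlb_valid;     /* translation cached for instruction fetches */
    bool dtlb_valid;     /* translation cached for data accesses */
    unsigned faults;     /* page faults seen by the handler */
};

/* "Hardware": a TLB hit needs no walk; a walk loads the matching TLB only
 * if the pte permits the user access. Returns false for a page fault. */
static bool hw_access(struct cpu *c, enum access a)
{
    bool *tlb = (a == FETCH) ? &c->itlb_valid : &c->dtlb_valid;

    if (*tlb)
        return true;
    if (!c->pte.present || !c->pte.user)
        return false;
    *tlb = true;
    return true;
}

/* A user access, including the fault handler's reaction when it faults. */
static enum outcome user_access(struct cpu *c, enum access a)
{
    assert(c != NULL);

    if (hw_access(c, a))
        return OK;

    c->faults++;
    if (a == FETCH)
        return KILLED;               /* execution attempt: terminate the task */

    c->pte.user = true;              /* temporarily allow user level access */
    (void)hw_access(c, DATA);        /* dummy access loads the DTLB */
    c->pte.user = false;             /* restore so later walks fault again */

    return hw_access(c, DATA) ? OK : KILLED;   /* restart the faulting access */
}
```

After the first data access the DTLB holds the translation, so further data
accesses succeed without faulting, whereas an instruction fetch still finds a
supervisor pte on its page table walk and is caught.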
  58. The decision between using non-present or supervisor mode page table entries
  59. for marking a page as non-executable comes down to performance in the end,
  60. the latter being less intrusive because kernel initiated data accesses to
  61. userland pages will not raise a page fault.
  62. To sum it up, PAGEEXEC as implemented in PaX overloads the meaning of the
  63. User/Supervisor bit in the ptes to mean the executable/non-executable status
  64. and also makes sure that data accesses to non-executable pages still work as
  65. before.
  66. 2. Implementation
  67. PAGEEXEC requires two sets of changes in Linux: the kernel has to be taught
  68. that the i386 architecture can do the proper non-executable semantics, and
  69. next we have to deal with the special page faults that require kernel
  70. assisted DTLB loading.
  71. The low-level definitions of the capabilities of the paging logic are in
  72. include/asm-i386/pgtable.h. Here we simply redefine the constants that are
  73. used for creating the ptes of non-executable pages. One such use of these
  74. constants is the protection_map[] array defined in mm/mmap.c which is
  75. referenced whenever the kernel sets up a pte for a userland mapping. Since
  76. PAGEEXEC can be disabled on a per task basis we have to modify all code
  77. that accesses this array so that we provide an executable pte even if it
  78. was not explicitly requested. Affected files include fs/exec.c (where the
  79. stack pages are set up), mm/mprotect.c, mm/filemap.c and mm/mmap.c. The
  80. changes in the latter two cooperate in a non-trivial way: do_mmap_pgoff()
  81. creates executable non-anonymous mappings by default and it is the job of
  82. generic_file_mmap() to turn such a mapping into a non-executable one (once
  83. it turns out to be a file mapping). This logic ensures that non-anonymous
  84. mappings of devices remain executable regardless of PAGEEXEC. We opted for
  85. this approach to remain as compatible as possible (by not affecting all
  86. non-anonymous mappings) yet still make use of the non-executable feature
  87. in the most frequently encountered case.
  88. The kernel assisted DTLB loading logic is in the IA-32 specific page fault
  89. handler which in Linux is do_page_fault() in arch/i386/mm/fault.c. For
  90. easier code maintenance we created our own page fault entry point called
  91. pax_do_page_fault() which gets called first from the low-level page fault
  92. exception handler page_fault found in arch/i386/kernel/entry.S.
  93. First we verify that the given page fault is ours by checking for a userland
  94. fault caused by access conflict (vs. not present page). Next we pay special
  95. attention to faults caused by an instruction fetch since this means an
  96. attempt at code execution in a non-executable page. Such faults are easily
  97. identified by checking for a read access where the target address of the
  98. page fault is equal to the userland instruction pointer (which is saved by
  99. the CPU at the time of the fault for us). The default action is of course
  100. task termination along with a log message; only EMUTRAMP, when enabled, can
  101. change it (see separate document).
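
The two checks above map naturally onto the IA-32 page fault error code, in
which bit 0 is set for a protection violation (clear for a not present page),
bit 1 for a write access and bit 2 for a fault raised in user mode. The
following is a minimal sketch of that classification only; struct pf_info,
classify_fault() and the constant names are hypothetical and not taken from
the PaX sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* IA-32 page fault error code bits */
#define PF_PROT  0x1u   /* 1: access conflict, 0: not present page */
#define PF_WRITE 0x2u   /* 1: write access, 0: read access */
#define PF_USER  0x4u   /* 1: fault raised while in user mode */

enum pf_class {
    PF_NOT_OURS,      /* hand over to the regular fault handler */
    PF_EXEC_ATTEMPT,  /* instruction fetch from a non-executable page */
    PF_DATA_ACCESS    /* candidate for kernel assisted DTLB loading */
};

/* Hypothetical container for what the CPU reports at fault time. */
struct pf_info {
    uint32_t error_code;  /* pushed by the CPU */
    uint32_t address;     /* faulting address (CR2) */
    uint32_t eip;         /* saved userland instruction pointer */
};

static enum pf_class classify_fault(const struct pf_info *f)
{
    assert(f != NULL);

    /* Ours only if it is a userland fault caused by an access conflict. */
    if ((f->error_code & (PF_PROT | PF_USER)) != (PF_PROT | PF_USER))
        return PF_NOT_OURS;

    /* A read whose target is the instruction pointer is an instruction fetch. */
    if (!(f->error_code & PF_WRITE) && f->address == f->eip)
        return PF_EXEC_ATTEMPT;

    return PF_DATA_ACCESS;
}
```

A PF_DATA_ACCESS result is only a candidate: the pte still has to be examined
under the lock before the DTLB loading is attempted.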
  102. Next we prepare the mask for setting up the special pte for loading the
  103. DTLB and then we acquire the spinlock that guards MMU state changes (since
  104. we are about to cause such a change ourselves). Holding the spinlock is
  105. also necessary for looking up the target pte that we will modify and load
  106. into the DTLB. If the pte state we looked up no longer corresponds to the
  107. fault type then we must have raced with other MMU state changing code and
  108. we pass the fault down to the original fault handler. This is also when we
  109. can identify (and pass down) copy-on-write page faults, which have the
  110. same fault type but a different pte state than what is caused by the
  111. PAGEEXEC logic.
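
One plausible way to express the pte re-check is sketched below. It assumes
the standard IA-32 pte bit layout (present = bit 0, read/write = bit 1,
user/supervisor = bit 2); the helper pte_is_pageexec_fault() and struct
pte_check are hypothetical and the actual PaX code may test the state
differently. The caller is assumed to hold the lock mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* IA-32 pte bits */
#define PTE_PRESENT 0x1u
#define PTE_RW      0x2u
#define PTE_USER    0x4u

/* IA-32 page fault error code bit for a write access */
#define PF_WRITE    0x2u

/* Hypothetical snapshot taken while holding the lock. */
struct pte_check {
    uint32_t pte;         /* pte value looked up for the fault address */
    uint32_t error_code;  /* error code of the fault being handled */
};

/* Returns true if the fault should be resolved by DTLB loading, false if
 * it has to be passed down to the original fault handler. */
static bool pte_is_pageexec_fault(const struct pte_check *c)
{
    assert(c != NULL);

    /* Raced: the page is gone, so this is now a not present fault. */
    if (!(c->pte & PTE_PRESENT))
        return false;

    /* Raced: the pte is user accessible, so it is no longer in the state
     * PAGEEXEC uses for non-executable pages. */
    if (c->pte & PTE_USER)
        return false;

    /* Write to a read-only pte: copy-on-write or a real protection fault,
     * same fault type but a pte state that PAGEEXEC did not create. */
    if ((c->error_code & PF_WRITE) && !(c->pte & PTE_RW))
        return false;

    return true;
}
```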
  112. Finally we change the pte to allow userland accesses to the given page, then
  113. perform a dummy read access that makes the CPU page table walk logic load
  114. it into the DTLB, and then change the pte back to supervisor mode. There is
  115. a trick in this part of the code that is worth
  116. a few words. If the TLB already has an entry for a given virtual/physical
  117. translation then initiating a memory access will not cause a page table
  118. walk, that is, for our DTLB loading to work we would have to ensure that
  119. the DTLB has no entries for our virtual address. It turns out that different
  120. members of the Intel IA-32 family have a different behaviour when the CPU
  121. raises a page fault during a page table walk (which is our case): the old
  122. Pentium (but not the MMX version) CPUs would still cache the translation
  123. if it described a present mapping but had an access conflict (which is our
  124. case since we have a supervisor mode pte that is accessed while executing
  125. in user mode) whereas newer CPUs (P6 core based ones, P4 and probably future
  126. CPUs as well) would not cache them at all. This means that in the second
  127. case we can be sure that the DTLB has no translations for our target virtual
  128. address and can omit a very expensive 'invlpg' instruction (it sped up the
  129. fast path by some 20% on a P3).