PwIN - Pwning Intel Pin
Zhechko Zhechev & Julian Kirsch (Technical University of Munich)
Binary instrumentation is a robust and powerful technique which facilitates binary code modification of computer programs in order to better analyse their behaviour and characteristics even when no source code is available. Dynamic Binary Instrumentation (DBI) frameworks achieve this either statically by rewriting the binary instructions of the program and then executing the altered program or dynamically, changing the binary’s code at run-time right before it is executed. The design of most DBI frameworks puts emphasis on ease-of-use, portability, and efficiency, which has established them as a powerful tool utilised for analysis tasks such as profiling, performance evaluation, tainting, and bug detection.
Moreover, interest in employing DBI tools for the development of integrity protection software (e.g. CFI) and for malware analysis is constantly increasing among researchers in academia. This inspired us to look more closely at how DBI frameworks influence the security of the instrumented binary. Our three main findings can be summarised as follows: a binary can detect that it is being instrumented, an instrumented binary can escape the DBI VM and execute code outside of it, and DBI can even increase the attack surface of the instrumented program.
The most prominent examples of DBI frameworks nowadays are, in our perception, Intel PIN, Dyninst, Valgrind, DynamoRIO and (more recently) QBDI. In the following, we focus (almost exclusively) on Intel PIN version 3.5 in JIT mode on Linux. Moreover, we use the latest Ubuntu release, 17.10 64 bit, so that we benefit from the most recent security mechanisms, such as a higher number of bits randomised by ASLR.
All of the examples presented here can be executed in a Docker container. Build the image by running docker build . in the folder containing the Dockerfile and the tools folder, obtain the image id via docker image ls, and finally run the resulting container using docker run --privileged -ti <image-id>.
Building on the work of Falcon and Riva presented at REcon 2012, we were able to port some of their PIN detection techniques to Linux x86_64. Unfortunately, parts of their research were based on unintentional bugs in Intel PIN which were fixed in later versions, or were Windows-specific and not applicable to Linux. For example, on Windows the instrumented process remains a child process of the PIN DBI binary (pinbin) throughout its whole execution. In contrast, on Linux PIN employs the ptrace debugging API to spawn the instrumented process as its child, copies itself into the new process' memory and continues its execution there, while the parent process exits. Moreover, the instrumented program is started using the execve system call, which preserves the original process PID while the text, data, bss, and stack of the calling process are overwritten by those of the loaded program. Hence, the instrumented program appears to have no relation to the PIN framework. As a result, we cannot determine whether a process is being instrumented by checking whether its parent is the PIN binary.
After some caffeine-intensive drilling into the PIN binary, we managed to extend the list of PIN detection techniques in a couple of surprising ways. We have implemented a tool called jitmenot which employs 11 PIN framework detection mechanisms. In the following, we describe some of the most prominent examples we have found, adding to the work of Falcon and Riva. Additionally, we provide information on how the tool can be built and executed.
There are three main categories which encapsulate our detection techniques, namely code cache artefacts, JIT compiler overhead, and environment artefacts.
In the first category – code cache artefacts – we include anomalies introduced in the program’s execution by the fact that the executed code is not the original one. Falcon and Riva already explained in their presentation how the
fxsave instruction can be (ab)used to detect PIN’s VM instruction pointer which the framework tries to hide by always calculating and returning the instrumented program’s original instruction pointer value.
However, there is another method involving system calls and the way the PIN framework emulates them. Firstly, when PIN has to accomplish some task outside of the VM, such as system call emulation or determining the next instruction trace to execute, the register state of the application is saved on PIN’s stack and upon return this state is restored. Secondly, when executing any system call using the
syscall instruction the current instruction pointer value is copied to the
rcx register, such that the
sysret instruction can restore execution. As
sysret is used by the operating system's kernel, user land perceives the syscall instruction to have the side effect of setting the rcx register to the address of the instruction right behind the syscall. However, this is not the case for applications executed under PIN. Since PIN emulates all system calls issued by the application outside of the VM, it has to save the program's register state before exiting the VM. Apart from the syscall's result, no other side effects are propagated back to the program. As a result, the value in the application's rcx register stays the same as before the syscall instruction was executed. This discrepancy can be used as an instrumentation detection mechanism.
Yet another code cache artefact involves the way PIN handles self-modifying code (SMC) together with the fact that instrumentation is done with basic block granularity. According to Intel, the PIN framework does attempt to detect manipulations in the binary’s original code by utilising the
PIN_SetSmcSupport configuration function and the
TRACE_AddSmcDetectedFunction callback function. However, the Pintool programmer has to manually trigger a code cache invalidation upon receiving an SMC notification in order to jit the altered code again. In the following, we show how a malicious binary could adopt this as an instrumentation detection technique. Firstly, we mark the binary's code as rwx by calling mprotect with prot set to PROT_READ | PROT_WRITE | PROT_EXEC. Then we modify the immediate operand of a mov instruction which loads a value into a register. The next listing illustrates this approach.
mov byte [label+1], 0x0
label:
mov eax, 0x1
Since PIN does not automatically invalidate the code cache and only the original code (but not the translated copy in the code cache) is modified, the value of eax after executing this code segment depends on whether the application is being instrumented. If it is, the stale code cache is executed and the eax register will hold 1; if it is not, the code change takes effect and eax will hold 0 after the execution completes. Additionally, we notice that the callback registered by
TRACE_AddSmcDetectedFunction is really only a callback that triggers if
mprotect was called with
prot set to
PROT_READ | PROT_WRITE | PROT_EXEC; it is then up to the Pintool to raise a code cache invalidation, which is followed by recompilation of the code already present in the code cache.
The overhead introduced by the JIT compiler and the necessary VM switch can also be used as a detection mechanism. For example, if we measure the cycles (using the rdtsc instruction) elapsed between the last instruction before entering a loop and the first instruction of the loop body (t0) and compare it with the time spent on every subsequent iteration of the same loop (ti, i >= 1), we can see that t0 is vastly larger than ti for all i >= 1. Normally, when an application is not instrumented, this initial time is much smaller, as the PIN VM introduces a significant overhead the first time a trace is translated.
As already described, PIN masks (almost) all register values, including the current instruction pointer. However, there are exceptions, one of which is the FS base. After starting the application, PIN saves the application's FS base value in its cache and restores it every time this value is needed, e.g. when the application retrieves the FS base via the corresponding system call (SYS_arch_prctl). Unfortunately, PIN fails to emulate the rdfsbase instruction available on Intel processors since Ivy Bridge, and instead of the application's FS base the instruction returns PIN's own FS base. By comparing these two values, the application can detect whether it is being instrumented. It has to be noted, however, that the underlying operating system has to allow the use of the rdfsbase instruction, which is currently not the case for Linux, making this detection approach less useful in practice.
Lastly, PIN expects certain environment variables to be set in order to run properly (e.g.
PIN_INJECTOR64_LD_LIBRARY_PATH). As the framework spawns the instrumented process as its own child, these inherited environment variables can still be found in the original application's memory. Searching for them can also expose the underlying JIT engine.
As one can see, a binary can notice whether it is currently being executed in a PIN environment. By nature, JIT compilers cause a lot of noise which is hard to disguise; trying to do so can introduce even more irregularities into the instrumented program's execution. This is especially relevant for split-personality malware analysis: instrumented malware may pretend to be a harmless program, and one can never be completely sure that all anti-instrumentation techniques have been deactivated. We also executed jitmenot under three other DBI frameworks, DynamoRIO, Valgrind and QBDI. The results can be found in
jitmenot can be built with the provided
Makefile. Executing it without the presence of PIN (
./jitmenot) should output a green
NEGATIVE text next to each of the 11 tests, while instrumenting it with any Pintool (
pin -t any/nice/pintool -- ./jitmenot) should result in red
POSITIVE labels again for every test case. If one starts jitmenot with the
-v parameter, it also prints additional information for each test case. In order to execute the fsbase test, one has to load a kernel module (tools/fsgsbase-mod) using make start, which allows the execution of the rdfsbase instruction in user space; the instruction is available only on Intel processors starting with Ivy Bridge.
If one happens to study the PIN framework paper[1:1] closely, they will discover that in section 3.3.1 the authors clearly state that the instrumented program's code is never executed directly; instead it is compiled (from machine instructions to the same kind of machine instructions) and executed together with the Pintool's procedures in a VM. Every machine instruction executed resides in the VM's code cache, and the effect of any instruction cannot "escape" the VM region. Like other VMs, the PIN framework manages the instrumented program's instruction pointer and translates each basic block of the original code lazily (i.e. when necessary). More importantly, the VM may and will reuse already compiled code for performance reasons. Moreover, PIN does not employ any integrity checking of already translated instructions in the code cache. Therefore, we can alter already executed instructions in memory, as they are comfortably kept rwx by the VM for us. We believe this to be a necessity by design, since the JIT engine has to compile and store on-the-fly the instructions which will later be executed by the VM. This allows us to first execute a function f so that the PIN VM translates its assembly code and places the address of the translated code in an internal hash table in order to find it later. Then the instrumented program can locate f's translated code in the code cache (using the real instruction pointer detection techniques described in the previous chapter) and modify it arbitrarily to point to some code g. In the end, we call f again and PIN will effectively execute g. (Note that g can be any instruction sequence specified by a malicious instrumented binary.)
tools/sandbox does exactly as described: when executed with any command line argument, it attempts to change its own already jitted code in PIN's VM code cache and to execute its original code, this time without the presence of the VM. In
tools/sandbox one can also find a Pintool (
SandboxPinTool) which prints basic information about every system call executed by the instrumented binary to the console. After successfully escaping from the sandbox, we execute one system call (
getpid) which is not recorded by the Pintool and exit the program, effectively demonstrating the sandbox escape.
Of course, nothing stops us from exiting and re-entering the sandbox, which the PIN VM will not notice. However, one has to be careful to save and restore all of the necessary register values (essentially acting as a second VM). For example, the internal PIN
fs:0x0 value (used, for example, for thread local storage and the stack canary reference value) is always set to 0. Once the instrumented application accesses its
fs:0x0 value (e.g. to check a stack cookie), PIN restores it from its storage (always at
r15 + 0x40). If we want the application to keep working autonomously after escaping from the VM, we have to restore the original
fs:0x0 value in its place before executing any libc functions or returning from a function containing a stack cookie.
Finally, escaping the PIN sandbox in Linux without necessarily knowing any code cache address is also possible. We measured the relative offsets between all mapped pages in different executions of an application instrumented by PIN (
tools/sandbox/lin_graph.pdf). It is clear that the offsets between libc, the code cache, and pinbin (the PIN binary; PIN's own stack, too ;]) are constant. Having a leaked address from any of these code regions allows us to find the other mappings. Therefore, we can use all gadgets present in the code base to build ROP chains, as well as explicitly write shellcode into the code cache. This is due to the fact that, as already explained, the PIN framework copies itself into the application's memory by allocating memory using
mmap. As pointed out in our earlier work, the addresses of consecutive
mmap allocations are predictable (i.e. relative distances remain constant) on Linux. All needed information can therefore be calculated a priori from the known binaries of Intel PIN, the instrumentation tool and our own binary.
Unfortunately, these page offsets are not constant across consecutive executions of PIN's Windows version.
Finally, we show how enforcing security mechanisms on a common off-the-shelf (COTS) binary by executing it in a DBI environment may introduce more possibilities to exploit already present bugs (i.e. the attack surface is increased instead of decreased). There are two main concerns involved: the
rwx pages, which we have already put into action in the previous chapter, and the way PIN decides which code is executed next.
During our experiments we made the critical observation that the PIN framework fails to check the permissions of the code that is to be processed by its JIT engine. This means any data in memory can (and will) be translated to executable instructions if reached by the control flow, transporting us back to the dawn of the buffer overflow and shellcode execution era. As a simple example, we can run an application which places some shellcode on the stack and then jumps to it. Normally, because the
NX bit is set in the page table entries of the stack, the program crashes as soon as the instruction pointer points to an address on the stack. However, instrumenting the same binary with PIN does not crash the application. In fact, the execution continues and opens a shell (an example C program spawning a shell can be found in
tools/stack-exec/shell.c). We could not resist coagulating the words PIN and pwn into the neologism PwIN, hence the name of this work.
PwIN makes it possible to execute code residing on pages which have (at least) read permissions. We argue that this PIN "feature" makes previously difficult-to-exploit bugs in the application easier to abuse. We discuss this in the following by means of a bug in wget versions older than
1.19.2, found in
http.c:skip_short_body() and documented as CVE-2017-13089. Without Intel PIN the strongest attack (known to us) results in a 1:16 probability of leaking an arbitrary file stored on the victim to the server (see below). We will discuss how the same bug can be escalated to full code execution if the victim is instrumented using Intel PIN.
The vulnerable function in wget is called when processing HTTP redirects together with HTTP chunked encoding. The chunk parser uses
strtol() to parse each chunk's length into a long variable but does not check whether it is negative. The code then tries to skip the chunk in pieces of 512 bytes and ends up passing the negative length to
fd_read(). Since fd_read()'s length argument is of type
int, the high 32 bits of the length variable are discarded. This allows an attacker to completely control the length of the read chunk and overflow the
dlbuf buffer on the stack.
Using a partial overwrite (the binary is PIE) of the return address of
skip_short_body(), one may divert the program's execution flow into the surrounding 64 KiB of existing code with a probability of 1:16. Hence the constraint on the impact of the exploit without PIN: there is no trivially reachable function or gadget that results in remote code execution.
However, when running in the context of Intel PIN we can inject and execute shellcode situated in non-executable memory regions, reducing the challenge of achieving remote code execution to finding a reliable way of jumping to a pointer to data we control. Fortunately, at the end of the vulnerable function the
rsi register contains the address of
dlbuf. However, there are no convenient gadgets reachable with a partial overwrite of the return address which would divert the code execution to the address contained in
rsi. We remedy this by injecting our own
jmp rsi gadget into a buffer that we can divert control to using the partial overwrite: We can jump into the HTTP cookie using a stack lifting gadget (
add rsp, 0xXX; ret). This allows us to execute the cookie as machine instructions where we have placed a
push rsi; ret shellcode which then transfers the execution to the payload situated in dlbuf (rsi is still pointing to the stack). We could have used the cookie directly to inject our shellcode, but since wget imposes strong restrictions on the cookie (it has to be very short, < 16 bytes, and all bytes > 0x7f get UTF-8 encoded) this approach was discarded. (Note how
push rsi; ret is
0x56 0xc3: 0x56 is plain ASCII, and the UTF-8 encoding of the byte 0xc3 again starts with 0xc3, which also happens to be a ret instruction.)
An example of this exploit can be found in
tools/exec-cookie-wget. In order to execute it, start the server by using
python3 pwn.py and then connect with instrumented
pin -t /any/nice/pintool -- ./wget localhost:55555. Use
docker attach <container-id> to obtain a second shell on the same container. In this scenario the server tries to pwn the client with a probability of 1:16 – if you see a
SIGSEGV just retry.
In summary, we presented an extended list of instrumentation detection techniques focusing on Intel PIN, as well as possibilities of escaping the underlying VM and executing code outside of it, which is enabled by DBI's design. Lastly, we presented PwIN, an Intel PIN bug which allows code execution in non-executable memory regions.
Chi-Keung Luk et al. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. Programming Language Design and Implementation (PLDI). 2005. https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool
Nahuel Riva, Francisco Falcon. Dynamic Binary Instrumentation Frameworks: I know you're there spying on me. REcon (2012). https://recon.cx/2012/schedule/attachments/42_FalconRiva_2012.pdf
Mario Polino et al. Measuring and Defeating Anti-Instrumentation-Equipped Malware. Detection of Intrusions and Malware and Vulnerability Assessment (DIMVA). 2017.
Julian Kirsch, Bruno Bierbaumer, Thomas Kittel and Claudia Eckert. Dynamic Loader Oriented Programming on Linux. Reversing and Offensive-oriented Trends Symposium (ROOTS). 2017. https://github.com/kirschju/wiedergaenger