Recently in PoC||GTFO 0x12 Chris Domas demonstrated a minimal Turing complete virtual machine that only implements a mov instruction where the operands for the mov instruction are taken from a data list of memory address and offsets. When executed, every program will run using the same instructions but provide different functionality. With this paper he released a tool called 'reductio', a python script that leverages his Movfuscator compiler and reduction techniques to compile C programs to an operand data list that is acted on by the VM. The proof of concept he demonstrated was on x86 but he states that it should be possible to adapt it to other architectures.
Here we adapt the Movfuscator VM from x86 assembler to AVR assembler and go one step further by building a PoC VM from ROP gadgets after exploiting a vulnerable application running on an Arduino Mega2560. This single ROP VM can then interpret different data payloads to provide different functionality with the same ROP chain on a Harvard device with separate program and data memory regions.
AVRs are an 8-bit RISC Harvard architecture by Atmel. They are found in a wide range of embedded devices including the popular Arduino development boards. Like many MCUs registers and hardware are memory mapped allowing for them to be directly read or written to.
Harvard devices are common in embedded systems where code and data are stored in separate memory regions. Although this choice is not with security in mind it does prevent execution of shell code from data memory and limits remote code execution to return oriented programming (ROP). As these devices have a relatively small amount of program memory, the number of useful gadgets is often limited making complex functionality difficult. Modern devices often feature a modified Harvard architecture that enables software to write to program memory in order to enable firmware updating and patching. Typical techniques to gain remote code execution on such devices would be to build a ROP chain that returns into the bootloader and utilises the firmware update routine to add shellcode into program memory. However in production devices the existence of a bootloader cannot be guaranteed or an attacker may not want to have a persistent implant.
The VM developed by Chris Domas implements a simple mov instruction that performs a RISC like load and store instructions moving data stored at an offset from an address stored in a virtual register to a location at an offset of an address stored in another virtual register. His implementation of this in x86 is as follows:
mov esi, operands ; Point esi to the start of the operand list loop: ; Load data from source mov ebx, [esi] ; Read address of virtual register mov ebx, [ebx] ; Read virtual index register add ebx, [esi+0x04] ; Add offset and index to compute source address mov ebx, [ebx] ; Read data from source address ; Store data to destination mov edx, [esi+0x08] ; Read address of virtual register mov edx, [edx] ; Read virtual index register add edx, [esi+0x0C] ; Add offset and index to compute destination address mov [edx], ebx ; Write data to destination address ; Increment virtual instruction pointer add esi, 0x10 ; Increment esi to point to the next set of operands jmp loop
Adapting this to AVR assembler is not complicated but requires more instructions as AVR is a load/store RISC architecture. Because we are using an 8-bit AVR with 16-bit addressable memory we utilise special 16-bit registers that combine two 8-bit registers together to address memory. These register pairs are known as X, Y and Z registers. As most of our memory mapped IO registers are 8-bit, our VM only loads and stores 8-bits per instruction in comparison to the x86 implementation that moves 32-bits.
; Point X to start of operand list ldi XL, ops&0xFF ldi XH, (ops>>8)&0xFF loop : ; Load data from source ; Read address of virtual index register ld YH, X+ ld YL, X+ ; Read virtual index register ld ZH, Y+ ld ZL, Y ; Add offset and index to compute source address ld YH, X+ ld YL, X+ add ZL, YL adc ZH, YH ; Read data from source address (8 bits) ld r24, Z+ ; Store data to destination ; Read address of virtual index register ld YH, X+ ld YL, X+ ; Read virtual index register ld ZH, Y+ ld ZL, Y ; Add offset and index to compute destination address ld YH, X+ ld YL, X+ add ZL, YL adc ZH, YH ; Store data to destination (8 bits) st Z+, r24 jmp loop
As the Movfuscator compiler and 'reductio' is for x86 and 32-bit mov instructions we have to adapt some of the methodology implemented. Movfuscator is also notorious for producing very long data segments which is fine for a PC with plenty of RAM and fast processor speeds but on an embedded device we have limited resources, for example on the AVR on which we built our PoC is limited to just 8 KB of RAM.
To reduce the amount of data used we only include lookup tables and jump tables that are required for the functionality we want and manually write the operands using Atmel's AVR assembler directives. We also implement a different branching technique which would not be possible for movfuscator to utilise.
Movfuscator does not implement true branching, rather to branch it switches all pointers to a dummy data space, turning off execution and continues to process on a dummy state until the destination is reached. This technique requires a lot of overhead for checking if the branch address has been reached and for changing the pointers to and from real and dummy data.
As AVR has real registers that are memory mapped we can directly change the real register we are using for our virtual instruction pointer. This allows us to implement branching by directly storing values into the virtual instruction pointer. For conditional branching we build jump-tables of destination addresses and then use the index of these destination pointers as part of a comparison routines to implement if/then/else, loops, and switch case statements.
However, we are limited to 8-bit load/store instructions with 16-bit addressable memory. This means we can only change one byte of the 16-bit virtual instruction pointer at a time limiting direct branching to destinations within the same 0x100 byte aligned segment of memory. If a branch crosses a 0x100 aligned segment we have to include a branch handling routine at the beginning of each segment. When we want to branch we store the destination and instead branch to the branch handler routine in the same segment. This routine first stores the high byte of the destination into the virtual instruction pointer, this moves execution to the next instruction of the branch routine in the correct segment. The next instruction then moves the low byte of the destination into the virtual instruction pointer finishing the branch.
To demonstrate a ROP chain that implements the VM a simple vulnerable application was developed for an Atmega2560. The application is a note logging application featuring a buffer overflow when adding notes over UART. When the buffer is overflowed we overwrite the return address and gain control of the instruction pointer.
Because the AVR C compiler places the initial stack pointer right at the end of memory this leaves us with very little space to build our ROP chain and stage our data. To solve this limitation we need to grow the amount of stack we have available by repeatably pivoting the stack and returning back to the vulnerable function. Once we have moved the stack as close to the top of memory as possible (without hitting the memory mapped IO) we deliver the VM ROP chain, VM data, and list of executable operands. We then perform a final stack pivot and begin execution of our ROP based Movfuscator VM. To loop our ROP chain the final gadget pivots the stack back to the start of the chain. It is vital that interrupts are disabled to prevent our ROP chain from being corrupted.
We developed two PoC data payloads that use the same ROP chain but have different functionality. Our first example flashes an LED on the Arduino device and the other echos data sent to it over UART.
PoC code can be found on Github.
It would be nice to be able to compile C to Movfuscator VM operands, one approach would be to use Movfuscator and adapt the 'reductio' script to output 8-bit moves rather than 32-bit however we will be stuck with a lot of overhead from Movfuscator. Alternatively we could write a new movfuscator compiler by going through AVR assembler instructions and building macros of each instruction.
Another approach for our VM would be to build a JIT movfuscator VM that reads, processes and disregards one VM instruction at a time. This would require more ROP gadgets to fetch and return data from the host but we would no longer be space limited and we would be able to offload some of the processing.