Proving Equivalence of x86 programs with McSema + SeaHorn (III)

In our first post, we started talking about our experience using McSema and SeaHorn to prove equivalence between two x86 binary programs, where one is a variant of the other. In our second post, we showed how McSema translates x86 binary programs into LLVM bitcode and highlighted the main challenges for reasoning about them. In this post, we finish the series by describing our experience applying SeaHorn to the bitcode generated by McSema.

Remember that SeaHorn relies on a context- and field-sensitive pointer analysis to statically partition memory into disjoint memory regions. Read this post for more details about our pointer analysis. During Verification Condition (VC) generation, it maps each memory region into a logical array and replaces loads and stores from/to a memory region with reads and writes from/to the corresponding logical array. SeaHorn provides several back-end solvers (aka engines) to solve VCs. The most powerful engines use Spacer and Crab to automatically infer loop invariants if the VCs are expressed over the quantifier-free theories of arrays and linear integer arithmetic. These engines allow SeaHorn to prove absence of errors, not just find bugs. Unfortunately, they are not applicable in our context because they are not appropriate for bit-precise reasoning.

For this reason, we rely on another SeaHorn engine based on Bounded Model Checking (BMC). This engine is more precise but less powerful, because it cannot construct loop invariants. Essentially, BMC can prove properties that hold for bounded program executions. Applying BMC to McSema bitcode is challenging because of the various issues with McSema mentioned previously: no type information, every register can be both an integer and a pointer, low-level bitcode to simulate processor flags, etc.

The problem of proving equivalence between two binaries is split into the following steps:

  1. For every function f in the original binary, we identify the corresponding function f’ in the variant. This is possible because the binary transforms keep this mapping.

  2. Both functions f and f’ are translated to LLVM bitcode using McSema as described in our previous post. Within each function body, calls to other functions are modeled as uninterpreted functions: a call to a user-defined function is replaced by a call to an external function with the same name in both the original and the variant.

  3. We merge the bitcode of f and f’ using the llvm-link tool and create a main function that calls f and f’ sequentially. Before the two functions are called, we insert an assumption that both memory states are identical (Mem_in_f == Mem_in_f'), and after both functions return, we insert an assertion that the outputs of both functions are identical (Mem_out_f == Mem_out_f'):

   assume(Mem_in_f == Mem_in_f');
   Mem_out_f := f(Mem_in_f);
   Mem_out_f' := f'(Mem_in_f');
   assert(Mem_out_f == Mem_out_f');
  4. Finally, the annotated LLVM bitcode from Step 3 is analyzed with SeaHorn’s BMC engine. This requires specifying a bound on the number of loop unrollings.

  5. If SeaHorn returns “safe”, the two functions (the original and the variant) are equivalent up to the given number of loop unrollings. Otherwise, SeaHorn produces a counterexample that exhibits different behaviors between the original and the variant.

We needed to tackle several important technical challenges before SeaHorn could be effective on the LLVM bitcode produced by McSema. As we have seen, the bitcode produced by McSema is very different from regular LLVM bitcode produced by a C/C++ compiler. All program variables are integers that are cast to pointers via inttoptr instructions when accessing memory. This is a major problem for SeaHorn’s pointer analysis. Like every pointer analysis we know of, it can reason precisely about inttoptr instructions only if the pointer does not escape into memory (e.g., if it is only used for pointer comparison). Complicated pointer arithmetic and coercions from integers to pointers and back are not handled precisely by any such analysis. Since McSema casts an integer to a pointer and then reads or writes memory through that pointer for most operations, SeaHorn’s pointer analysis is imprecise when applied to McSema bitcode. It often collapses all memory regions into one, because it cannot tell exactly which memory region is accessed after the execution of an inttoptr instruction. The result is a “flat” memory model in which memory is represented by one single array and each allocation returns an offset in that single array. We tried this memory model, but it did not scale.

Our solution is to refine this flat memory model by using static knowledge about how McSema translates programs into LLVM bitcode. This refinement separates memory into disjoint regions, which helps SeaHorn’s BMC engine and makes the verification more practical.

Program Annotations

The main aspects we needed to encode before we could effectively use SeaHorn’s BMC engine are:

  1. Memory layout: exploit McSema’s representation to improve over the flat memory model.

  2. Assumptions: state that before the execution of the two functions their states are identical.

  3. Property: assert that after the execution of the two functions their outputs are identical.

In the implementation, we encode all these constraints by automatically adding extra LLVM instructions to the bitcode. These instructions are interpreted by SeaHorn and produce the desired behavior. To simplify the presentation, we do not show the bitcode; instead, we show an equivalent logical representation using SMTLIB syntax.

Program State

CPU registers and status flags:

  Reg_1 = (Array (_ BitVec 64) (_ BitVec 64))
  Reg_2 = (Array (_ BitVec 64) (_ BitVec 64))

The processor state (i.e., CPU registers and flags) is modeled as a one-dimensional array. Indices and array elements are 64-bit bitvectors.

Memory:

  Mem_1 = (Array (_ BitVec 64) (_ BitVec 64))
  Mem_2 = (Array (_ BitVec 64) (_ BitVec 64))

The memory is also modeled as a one-dimensional array, from 64-bit addresses to 64-bit values.

By combining these two arrays, the complete state of each function is defined as a pair of arrays (Reg_i, Mem_i) for i in {1,2}.

This is a major improvement over the flat memory model. The states of both variants are separated: function 1 and function 2 have disjoint memory spaces and register states. Moreover, the register state of each variant is separated from its memory. At the LLVM level, this separation is enforced by annotations that assign different logical array names to memory reads and writes. These annotations are automatically inserted by our tool, so the BMC engine does not need to infer this separation during its reasoning.
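
For instance, a 64-bit load through a pointer held in %RDI (register index 6 in the layout shown below) is encoded against the memory array of the corresponding variant. The snippet below is only an illustrative sketch of this encoding; the value names v_1 and v_2 are hypothetical:

  ; the load in the original function reads Mem_1
  (= v_1 (select Mem_1 (select Reg_1 6)))
  ; the corresponding load in the variant reads Mem_2
  (= v_2 (select Mem_2 (select Reg_2 6)))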

Memory Layout

McSema assigns a predefined offset to each register and flag:

RIP = 0   RDI = 6   R11 = 12   PF = 18
RAX = 1   RSP = 7   R12 = 13   AF = 19
RBX = 2   RBP = 8   R13 = 14   ZF = 20
RCX = 3   R8  = 9   R14 = 15   SF = 21
RDX = 4   R9  = 10  R15 = 16   OF = 22
RSI = 5   R10 = 11  CF  = 17   DF = 23

We use these indices to encode register accesses: for instance, the content of %RSP is denoted by the expression (select Reg_i 7), where the subscript i in {1,2} identifies the two function variants under consideration.
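
As a sketch of how register updates fit into this encoding (not the literal output of our instrumentation), an x86 instruction that copies %RDI into %RAX would roughly correspond to the following update, where Reg_i' denotes the register state after the instruction:

  ; RAX (index 1) := RDI (index 6)
  (= Reg_i' (store Reg_i 1 (select Reg_i 6)))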

To describe memory layout, we define the following variables (for i in {1,2}):

(declare-fun stack_i_start () (_ BitVec 64))
(declare-fun stack_i_end () (_ BitVec 64))
(declare-fun heap_i_start () (_ BitVec 64))
(declare-fun heap_i_end () (_ BitVec 64))
(declare-fun global_i_start () (_ BitVec 64))
(declare-fun global_i_end () (_ BitVec 64))
(declare-fun fun_param_i_start () (_ BitVec 64))
(declare-fun fun_param_i_end () (_ BitVec 64))

Each of them defines the start or end of a particular memory region. We assert several constraints on these regions.

Stack, heap, global and function parameters are in disjoint regions

(and (< stack_i_start stack_i_end) 
     (< heap_i_start  heap_i_end)
     (< global_i_start global_i_end) 
     (< fun_param_i_start fun_param_i_end)
     (< stack_i_end heap_i_start)
     (< heap_i_end global_i_start)
     (< global_i_end fun_param_i_start))

Both functions have the same initial stack address

(= stack_1_start stack_2_start)

Initially, RSP points to the bottom of the stack (which grows downwards)

(= (select Reg_i 7) stack_i_end)
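
Under this layout, stack accesses are expressed relative to %RSP (register index 7). For example, reading a hypothetical 64-bit stack slot at address %RSP - 8 would amount to the following expression (an illustrative sketch):

  (select Mem_i (- (select Reg_i 7) 8))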

Global variables point to the global region

We assume that both functions have the same global variables {gv_1, gv_2, ..., gv_n}. We map each LLVM global variable to a logical bitvector variable:

(declare-fun gv_1 () (_ BitVec 64))
(declare-fun gv_2 () (_ BitVec 64))
 …
(declare-fun gv_n () (_ BitVec 64))

And we generate the following constraints:

(and
 (= gv_1 global_i_start)
 (= gv_2 (+ global_i_start sizeof(gv_1)))
  …
 (= gv_n (+ global_i_start
         (+ sizeof(gv_1) (+ … (+ sizeof(gv_n-2) sizeof(gv_n-1))))))
)

The function sizeof returns the size of a global variable. This size is statically known.
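
As a concrete (hypothetical) instance, suppose there are only two global variables and sizeof(gv_1) = 8. The constraints above reduce to pinning them to consecutive addresses at the beginning of the global region:

  (and
   (= gv_1 global_i_start)
   (= gv_2 (+ global_i_start 8)))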

Pointers to dynamically allocated memory point to the heap region

For each function, we identify all calls to malloc-like functions (malloc, realloc, calloc, etc.). We require that both functions perform the same dynamic allocations. Since we unroll loops finitely many times, the number of allocation sites is finite.

For the n allocation sites of the form h_k := malloc(sz_k), with k in {1, ..., n}, we declare the following logical variables:

(declare-fun h_1 () (_ BitVec 64))
(declare-fun h_2 () (_ BitVec 64))
 …
(declare-fun h_n () (_ BitVec 64))
(declare-fun sz_1 () (_ BitVec 64))
(declare-fun sz_2 () (_ BitVec 64))
 …
(declare-fun sz_n () (_ BitVec 64))

and we generate the following constraints:

(and
 (= h_1 heap_i_start)
 (= h_2 (+ heap_i_start sz_1))
  …
 (= h_n (+ heap_i_start (+ sz_1 (+ … (+ sz_n-2 sz_n-1)))))
)

Contents of function parameters (%RSI, %RDI, %RDX, %RCX, %R8, %R9) point to the fun_param region

(and (<= fun_param_i_start (select Reg_i 5)) (< (select Reg_i 5) fun_param_i_end)
     (<= fun_param_i_start (select Reg_i 6)) (< (select Reg_i 6) fun_param_i_end)
     (<= fun_param_i_start (select Reg_i 4)) (< (select Reg_i 4) fun_param_i_end)
     (<= fun_param_i_start (select Reg_i 3)) (< (select Reg_i 3) fun_param_i_end)
     (<= fun_param_i_start (select Reg_i 9)) (< (select Reg_i 9) fun_param_i_end)
     (<= fun_param_i_start (select Reg_i 10)) (< (select Reg_i 10) fun_param_i_end))

These constraints do not prevent the registers’ contents from overlapping, which might produce spurious counterexamples. An alternative encoding would assign a fixed offset to each function parameter’s content in the memory region between fun_param_i_start and fun_param_i_end. This alternative encoding is more precise, but it requires knowing the sizes of the function parameters in advance.
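
For illustration, if the objects passed in %RDI and %RSI were known to occupy 64 bytes each (a hypothetical size), the alternative fixed-offset encoding would look roughly as follows:

  (and
   (= (select Reg_i 6) fun_param_i_start)           ; %RDI points to the first object
   (= (select Reg_i 5) (+ fun_param_i_start 64)))   ; %RSI points right after it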

Other constraints

Other constraints are added to assert that additional registers, such as %RAX and %RBX, also point within the memory regions described above.

Assumptions

Before execution starts, both functions have identical register and memory states:

(and (= Reg_1 Reg_2) (= Mem_1 Mem_2))

Property

After execution, both functions must produce identical outputs (i.e., the value of %RAX):

(= (select Reg_1 1) (select Reg_2 1))

We could easily extend this to check more consistency properties (e.g., portions of the heap or global variables should be equal). In general, the right notion of consistency might be program- and transform-dependent.
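
For example, a check that a particular global variable holds the same value in both final memory states could be expressed as follows, where Mem_out_1 and Mem_out_2 are hypothetical names for the memory arrays after executing each function:

  (= (select Mem_out_1 gv_1) (select Mem_out_2 gv_1))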

Implementation and some preliminary results

We use McSema version 1, LLVM 3.8, and the latest version of SeaHorn. We made some changes to McSema to prevent irrelevant ELF sections from polluting the bitcode (e.g., the .eh_frame section), and we wrote several Python scripts and LLVM passes to instrument the McSema bitcode.

The current implementation has some limitations: for instance, it cannot handle recursive functions or 128-bit integers. These are not fundamental limitations; we simply did not have time to tackle them.

We evaluated our prototype on thttpd, an open-source HTTP server available at https://acme.com/software/thttpd/.

To generate variants, we used the BILR (block-level instruction layout randomization) transform provided by DoubleHelix. Our main result is that we can verify the BILR transform applied to thttpd.

For thttpd, our tool generated 78 pairs of functions. We performed bounded model checking with a fixed loop-unrolling bound of 5. Three function pairs could not be processed because they involved either recursive functions or 128-bit integers. We ran SeaHorn on the remaining 75 function pairs and experimented with different back-end SMT solvers. We obtained the best results with the Yices 2 solver: with a timeout of 900 seconds, SeaHorn + Yices 2 can prove functional equivalence for all function pairs except four. We also tried Boolector and Z3. With SeaHorn + Boolector, verification fails on 17 pairs of functions. Z3’s performance on these examples was much worse, producing timeouts on most checks.

Conclusions

This is the end of our series of posts about proving equivalence of x86 programs using McSema + SeaHorn. We hope you enjoyed it!

We have shown once again the versatility of the SeaHorn verification framework. The novelty has been the integration of McSema, a new front-end that allowed us to translate x86 code into LLVM bitcode. The main benefit was to leverage SeaHorn’s BMC engine.

Reasoning about x86 binaries is hard, and scalability is still an open problem. We do not claim by any means that the approach we took is the most practical. Our goal was only to use SeaHorn’s components as they are, whenever possible, in order to see how far we could go. However, there are many directions that are worth investigating in future work.

Well, that’s all!

*****
Written by SeaHorn on 12 December 2018