Field Sensitive Pointer Analysis

Field-sensitivity is the ability to distinguish between the individual fields of a structure. It introduces a problem with cycle elimination, called the positive weight cycle problem, which I’ll talk about later.

Field-sensitive analysis was introduced by Pearce et al. in “Efficient Field-Sensitive Pointer Analysis”.

Cycle detection and collapsing is a very important optimization in Andersen’s analysis. To recap, a cycle is of the form p ⊇ q, q ⊇ r, r ⊇ p. In SVF’s world, this is a CopyEdge from q → p, another from r → q, and finally one from p → r. All the pointers in this cycle share the same solution, and can therefore be collapsed.

In order to make Andersen’s analysis field-sensitive, Pearce extended the original constraint model with the concept of “weighted constraints” or “weighted constraint edges”. Every field in a struct type has an index relative to the base of the object, and this index is assigned as the weight of the constraint.

The constraint itself is of the form p ⊇ q + k, where k is the weight / field-index. 

In order to derive the solution of this constraint, let’s first recap the solution without field-sensitivity. The derivation rule looks like —

p ⊇ q, q ⊇ {r} then, p ⊇ {r}

Now, with the addition of the field constraint, we need to match the field index in addition to the base object (r, in this case). So we get —

p ⊇ q + k, q ⊇ {r}, idx(s) = idx(r) + k, idx(s) ≤ end(r), then p ⊇ {s}

Which is just a mathematical way of saying that s is the object at an offset of k from r. Note that s is bounded by end(r), which marks the last field of the object, so a weighted edge can never step past the end of the object.

Cycles in the Constraint Graph

One of the common optimizations in Andersen’s points-to analysis is cycle elimination. At a source code level, consider the following pointer assignments —

int a = 100;
int *p, *q, *r;
p = &a;
q = p;
r = q;
p = r;

This represents a cycle: p → q → r → p. This is a very simple example; cycles can also arise across functions and in more complex situations, so it’s useful to be able to eliminate them.

The key point about cycles is that all nodes in a cycle share the same points-to set. So, the analysis can be optimized by replacing all nodes in a cycle by a single node. 

Now let’s look at SVF’s implementation.

Note that only copy constraints can cause cycles. In the SVF world, copy constraints are represented by CopyEdges in the constraint graph. Remember that while solving LoadEdges and StoreEdges, new CopyEdges are introduced, and these CopyEdges make it easy to track cycles. So, any time a cycle forms during solving, it shows up as a cycle of CopyEdges, and all nodes that are part of this cycle can be merged/collapsed into a single node.

Now, let’s take a look at a simple example of a cycle — int *p, *q; p = q; q = p;

This is the same example that we were looking at earlier. Figure 3 shows how the cycle of Copy Edges is created in this case. The cycle is nodes 22 → 30 → 20 → 28 → 22. So all of these nodes will be collapsed into a single node. This creates the final constraint graph that looks like Figure 4.

Figure 4

Note that each Copy Constraint in the C code is translated into two constraints in the IR — a Deref/Load constraint and an Assign/Store constraint. The effect of solving the two Copy Constraints in the C source code is the same as the effect of solving the two Load and two Store constraints in the IR.

Equivalence of analysis in C Source Code and IR

The LLVM IR is a representation of the C source code, but the constraints generated for the same statement in C and in the IR might be different.

For example, p = *q; is a single Deref/Load constraint in C, but results in two Load constraints and one Store constraint in the IR.

This doesn’t mean that the result of performing the points-to analysis on the C source code is different from performing it on the IR! The IR just gives you a finer view into the memory operations; eventually, the points-to set for a memory location / variable will be the same whether you compute it from C or from LLVM IR.

SVF Implementation of Andersen’s Analysis

This brings us to the SVF implementation of Andersen’s analysis. SVF operates on the LLVM IR, and this presents some quirks.

SVF represents the constraints as a constraint graph. As expected, there are 4 types of constraints (Address-of, Copy, Assign, and Deref); SVF just calls them —

  1. Address-of
  2. Copy
  3. Store
  4. Load

(In addition to these 4, SVF also has a fifth type of constraint to handle field sensitivity, that I’ll talk about later).

In the constraint graph, the memory objects and pointers are represented as nodes, and the constraints between them as edges. IR pointers are represented as ValPN nodes, and memory objects as ObjPN nodes. Every node, both ValPN and ObjPN, has a points-to set associated with it.

SVF solves constraints by first reducing all Deref (Load) and Assign (Store) constraints to Copy Constraints. Check the underlined part in the Andersen’s Analysis section for why this is reasonable. Reducing to Copy Constraints allows SVF to apply optimizations (such as cycle detection, which I’ll talk about later).

Let’s take a look at how different LLVM IR instructions result in the creation of new nodes and edges in the constraint graph (Actually in the PAG graph, but let’s defer the discussion of PAG vs Constraint graph for later). 

  1. AllocaInst: This instruction is used to create a local variable on the stack. When SVF sees this instruction, say %p = alloca i32*, corresponding to the C statement int *p, it creates two nodes: one for the pointer p (a ValPN node), and one for the memory object (an ObjPN node). The memory object represents the memory associated with the variable.

    An AddrEdge is added between the ObjPN and the ValPN, denoting the address-of relationship. When this AddrEdge constraint is solved, the points-to set of the ValPN includes the ObjPN.

    This is exactly the same as solving the address-of constraint p = &q.

    Though the language changes (C → IR), the rules that guide the derivation of the constraints don’t change. A Copy Constraint derives in the same way in the IR as in C.

    For example, consider the bolded C statements from the simple C program, whose Constraint Graph is shown in the figure below —

    int a = 100;
    int *p, *q;

    p = &a;
    q = p;
    p = q;

    Translated to IR —
    %a = alloca i32, align 4
    %p = alloca i32*, align 8
    %q = alloca i32*, align 8

This creates three ValPNs and their corresponding three ObjPNs. The Green edges are the Address Edges.

Each Node in the ConstraintGraph has a unique id (19, 20, etc).

Figure 1:

  2. LoadInst: This instruction is used to explicitly load the contents of an IR pointer into a virtual register. When SVF sees this instruction, say %0 = load i32*, i32** %q, it creates one ValPN node for the virtual register %0, and adds a LoadEdge from the ValPN node for q (not the ObjPN) to the ValPN node for %0.

    Consider the bolded statement in the C program, 

int a = 100;
int *p, *q;
p = &a;
q = p;
p = q;

This is translated to the IR instructions:

%0 = load i32*, i32** %p, align 8
store i32* %0, i32** %q, align 8
%1 = load i32*, i32** %q, align 8
store i32* %1, i32** %p, align 8

Let’s look at the first Load instruction. In the above constraint graph it is represented by the Red edge from nodes 21 → 30.

This LoadEdge represents a Deref constraint. As stated before, SVF reduces this to Copy Constraints from every node in the points-to set of the source to the destination of the load. So in the simplest case, a LoadEdge leads to the creation of a CopyEdge from the ObjPN in the source ValPN’s points-to set (added because of the AddrEdge) to the destination ValPN.

[Quick tip: Think about how you’d solve p = *q. p is the destination, q is the source of this load. You’d first find all elements you know q to point to (call them e1, e2, etc.), and then add copy constraints p = e1, p = e2, etc. The sources are the e’s and the destination is p.]

Solving the two LoadEdges (21 → 30 and 19 → 28) results in adding the following CopyEdges (black edges)
Figure 2

  3. StoreInst: This instruction is used to explicitly store the contents of a virtual register to a memory location. When SVF sees this instruction, say store i32* %1, i32** %p, it creates a StoreEdge from the source ValPN (%1) to the destination ValPN (%p).

    This StoreEdge represents an Assign Constraint. In the simplest case, it will be reduced to a CopyEdge from the ValPN of the source pointer to the ObjPN in the points-to set of the destination pointer’s ValPN (added because of the AddrEdge).

    Let’s continue with the previous IR instructions —

    %0 = load i32*, i32** %p, align 8
    store i32* %0, i32** %q, align 8
    %1 = load i32*, i32** %q, align 8
    store i32* %1, i32** %p, align 8

[Quick Tip: In order to understand how SVF solves this, think about the assign constraint *p = q. To solve it, you’d first find all elements e in the points-to set of p, and then add a Copy Constraint e = q for each one.]

The three Store Edges add the three (bolded, dashed) black Copy Edges.

Figure 3

  4. CastInst: CastInst instructions perform BitCasts and other casts, and these result in CopyEdges in the constraint graph. CopyEdges are solved in the same way as Copy Constraints in Andersen’s analysis.

Translation of C language to LLVM IR

Let’s now consider a different language: the LLVM IR. The LLVM IR is also a language that has pointers and dereferences of pointers.

  1. The Load instruction loads a value into a virtual register from a memory location. This is like loading via a pointer dereference, similar to the p = *q constraint, a.k.a. the Dereference Constraint.
  2. The Store instruction stores the contents of a virtual register to a memory location. This is like a store via a pointer dereference, similar to the *p = q constraint, a.k.a. the Assign Constraint.

Another thing to remember is that because the LLVM IR is closer to machine code, it makes every memory access explicit, unlike the C language, which hides one level of memory access. What I mean is that when we write c = 10; it accesses a memory location, but we don’t call c a pointer. In the LLVM IR world, a pointer is basically anything that points to memory, and that C statement (c = 10;) will be translated to a StoreInst which involves a pointer to the memory location of c.

One handy rule to translate C statements into IR is to count the number of memory reads and writes. 

For example, a = *ptr; involves first loading the value at the address of ptr. This value is a pointer; then we load the value at this pointer. Finally, we store this value into the memory location of variable a.

  %0 = load i32*, i32** %ptr  ; first load, to load the ptr
  %1 = load i32, i32* %0      ; second load, to load value @ ptr
  store i32 %1, i32* %a       ; store value into var a

The key thing to remember here is that the IR *knows* the memory addresses of the various variables, but not their values (what they hold). It knows the address of variable a, but not what it contains. It knows the address of ptr, but not what it contains (which is another pointer). So it needs the extra load compared to the C source code. The C language hides this from us: when we write the name of a variable, it implicitly means “the value” of the variable.