Andersen Wave Diff

This is the variant of Andersen’s algorithm that seems to be the “default” in SVF. If you use -ander, this is the analysis that is run. I’ll describe this algorithm in short.

The key idea that differentiates this from vanilla-Andersen’s analysis is that instead of propagating the full points-to set, for each CopyEdge, at each iteration, it propagates only the difference between it’s points-to set at the end of the previous iteration, and, its points-to set at this iteration.

In order to do this, the constraint graph must not have any cycles, and must be sorted in topological order.

At a high level, each iteration is divided into three phases: 

Phase 1: First phase of each iteration is to collapse all cycles, and sort the graph in topological order. Phase 2: Then, process the Copy constraints/edges and propagate the difference in the points-to sets up the constraint-tree. 

Phase 3: Then, process the Load and Store constraints/edges and add the new Copy Edges.

If a new Copy Edge is added during Phase 3, then repeat.

The worst-case complexity of this approach is the same as vanilla-Andersen, but because it propagates only the difference in points-to sets, it’s faster in the average-case. And also, the Phase 2 and Phase 3 can be parallelized ( according to the paper http://compilers.cs.ucla.edu/fernando/publications/papers/CGO09.pdf)

Note: Because Phase 2 and 3 requires an acyclic graph, any positive-weight-cycles are collapsed and the struct object is made field-insensitive.

Variant GEP

GEP Edges can be of two types — Normal Gep Edges and Variant Gep Edges.

A normal gep edge is one where the index or offset is known. A variant GEP is a gep whose index is not a constant. For example — 

struct Obj { int a; int b; int c; }

int* ptr = &sobj;

for (int i = 0; i < 3; i++) {

int *c = (ptr + i); // ← non-constant offset

}

In this case, a VarGepPE PAGEdge will be inserted for the source ValPN for ptr, and this VarGepPE is converted into a VariantGepCGEdge in the ConstraintGraph and solved in the following way —

Src → Dst

Any object that the src can point to is first made field-insensitive. Then, these field-insensitive objects are added to the pts-to-set(Dst). 

VariantGeps and Arrays of struct

The default MemoryModel is field-insensitive when it comes to arrays. All elements in an array are considered to be the same. Now if there is a variable based access to this array, it’ll result in a VarGepPE from this array. 

Consider this example —

struct Obj { int* a; int* b;};

struct Obj objects[2];

struct Obj* optr = objects;

for (int i = 0; i < 2; i++) { optr[i].b = &bint; }

In this case, the variable i is being used to index into the array objects, via the pointer optr. The IR for the highlighted part will be as follows —

for.body:                                         ; preds = %for.cond

  %1 = load %struct.Obj*, %struct.Obj** %optr, align 8

  %2 = load i32, i32* %i, align 4

  %idxprom = sext i32 %2 to i64

  %arrayidx = getelementptr inbounds %struct.Obj, %struct.Obj* %1, i64 %idxprom

  %b = getelementptr inbounds %struct.Obj, %struct.Obj* %arrayidx, i32 0, i32 1

  store i32* %bint, i32** %b, align 8

  br label %for.inc

There are two gep pointers involved in computing the address of optr[i].b. The first one computes the address of the i-th object in the array, and the second one computes the address of the field ‘b’, within the struct. 

The Constraint / PAG graph considering only the first gep is shown in Figure 1.

We’d expect the GEP edge for the second gep instruction to be a normal gep edge because the offset is constant (1), but because the source of this gep is originated from a VarGep, the second GEP edge also becomes a VarGep. Logically it makes sense. The array itself is element/field-insensitive, so it’s impossible to distinguish between fields within the elements of this array.

This causes much imprecision in apr-hook framework for httpd. Figure 2 shows the Constraint Graph after the second gep edge is added.

Figure 1

Figure 2

SVF: Cycle Handling in Field Sensitive Analysis

TL;DR – The AndersenWaveDiff implementation of SVF handles positive-weight cycles by making the structure object involved field-insensitive!

Let’s look at how these positive-weight-cycles look like in the IR Representation, and the cycle that shows up in SVF.

The IR for the above C code is shown below. Note the BitCast Instructions that are used to assign to and from the void pointer q.
 

  %a = alloca %struct.aggr, align 8

  %p = alloca %struct.aggr*, align 8

  %q = alloca i8*, align 8

  store i32 0, i32* %retval, align 4

  %0 = bitcast %struct.aggr* %a to i8*

  store i8* %0, i8** %q, align 8

  %1 = load i8*, i8** %q, align 8

  %2 = bitcast i8* %1 to %struct.aggr*

  store %struct.aggr* %2, %struct.aggr** %p, align 8

  %3 = load %struct.aggr*, %struct.aggr** %p, align 8

  %f2 = getelementptr inbounds %struct.aggr, %struct.aggr* %3, i32 0, i32 1

  %4 = bitcast i32** %f2 to i8*

  store i8* %4, i8** %q, align 8

The initial constraint Graph for the above example is shown in Figure 8. The BitCast Instructions are specified as CopyEdges. The rest of the IR instructions are handled as expected.

Figure 8

After solving the AddrEdges, and the CopyEdge 9 → 17, we get Figure 9.

Figure 9

Solving, the CopyEdges 9 → 17, 19 → 20, and the StoreEdges, 17 → 13, 20 → 12, and the LoadEdge 13→ 19, gives this constraint graph

Finally, solving edges LoadEdges, 11 → 22, and StoreEdge 24 → 13, we get the constraint graph shown in Figure 10.

Figure 10

As you can see, this has a cycle 14 → 19 → 20 → 12 → 22 → gep-edge → 23 → 24 → 14. This is how a positive weight cycle (PWC) looks in the SVF’s Constraint Graph. 

Because, there’s a GepEdge for the struct obj 10, then object will become field-insensitive. Then the cycle will be collapsed into the final constraint graph shown in Figure 11. Note that the GepEdge is gone.

Figure 11

Cycles in Field Sensitive Pointer Analysis

In the original Field-Sensitive paper by Pearce, this problem was described as the Positive Weight Cycle problem. Let’s first look at this problem as formulated in the original paper.

Consider the following C code, that has a void* pointer.

typedef struct { int *f1; int *f2; } aggr;

int main(void) {

    aggr a, *p; void *q;

    q = &a; // q {a}

    p = q; // p ⊇ q

    q = &(p->f2); // q ⊇ p + 1

    // do stuff with q

    return 0;

}

This leads to a weighted cycle — p → q, q — 1 → p. Unlike non-weighted cycles, the nodes in a weighted cycle don’t share the same solution, and it can lead to infinite derivations. 

For example, 

q = {a}

p = {a}

q = {a, r}, where r is the field at the index 1 from the base object a.

Then, because it’s a cycle, the derivation continues,

p = {a, r}
q = {a, r, s}, where s is the field at the index 1 from the base object r.

So on and so forth. Pearce et.al. basically puts a limit to these derivations, and say that the maximum number of fields we’ll support is N.

SVF’s Field-Sensitivity: Handling of GEP Edges

Let’s now look at how SVF handles the simpler cases of field-sensitive analysis (without any cycle elimination and positive-weighted-cycles)

Imagine a struct A {int idx; struct A* next};
And consider the C statement:
p->idx = 10; // p is a pointer pointing to an object of type A

Clearly, this statement involves computing the address of the field ‘idx’ of the type struct A. In LLVM IR, this is done by the Instruction GetElementPtrInst. It looks like —

%idx = getelementptr inbounds %struct.A, %struct.A* %1, i32 0, i32 0


idx is a virtual register that contains the address of the field ‘idx’.

The GEP instruction has its own dedicated page on LLVM website 🙂 https://llvm.org/docs/GetElementPtr.html

Anyhow, what we need to remember is that GEP computes a pointer into a struct type. It doesn’t access any memory, it’s purely pointer computation. And the last index (0 in this case) denotes the index of the field being accessed (0 means the first field — that is, ‘idx’)

Now, every time SVF encounters a GEP instruction, it creates a GEPEdge from the base pointer to the field pointer. The base pointer is the IR pointer to the object, and field pointer is the IR pointer to the field within the object. This means that there can be multiple GEPEdges for the same field, if the same field is accessed multiple times. It simply means that the address of that field was computed multiple times (and loaded into a virtual register).

So, let’s take a look at what that looks like in the SVF Constraint Graph. Consider the C source code —

struct A {int idx; int *p};

int main( void){

    int k = 10;

    struct A aobj;

    aobj.p = &k;

    return 0;

}


The IR instructions — 

  %k = alloca i32, align 4

  %aobj = alloca %struct.A, align 8

  store i32 0, i32* %retval, align 4

  store i32 10, i32* %k, align 4

  %p = getelementptr inbounds %struct.A, %struct.A* %aobj, i32 0, i32 1

  store i32* %k, i32** %p, align 8

The initial constraint graph is shown in Figure 5. The purple edge is the GepEdge (There are actually two types of Gep Edges, but for now, let’s talk only about the “Normal” GepEdge — the NormalGepCGEdge). It does not show in the figure, but every Normal Gep Edge has the field index associated with it. And there can be multiple edges with the same index, if the field at the same index is accessed multiple times.

Figure 5. 

Conceptually, a GepEdge is similar to a CopyEdge, with an added constraint on which fields are copied (similar to how Pearce et.al extended Andersen’s plain copy constraint). Figure 5 shows an example. Let’s focus only on nodes 12, 11, and 17. First the AddressEdge will make sure that pts-to-set(11) = {12}. Then, the purple GepEdge will act like a Copy Constraint, but instead of copying the whole pts-to-set(11), it’ll copy only the objects corresponding to the field of the GepEdge. In this case, the GepEdge corresponded to field index 1 (the second field), so it’ll copy only the second field of the ObjPN with id 12.

Now, as you can see, there’s no such node in the graph (yet). So far, we just have a single node for the ObjPN for aobj. So SVF will create this ObjPN — a GepObjPN, with base node-id as 12, and field-index as 1. Assume this GepObjPN gets a NodeId of 20, then points-to-set(17) = {20}.

If the second field is accessed again, a different GepEdge will be created, but solving it will result in the same GepObjPN node with NodeId = 20, being copied into the points-to-set of the destination of the GepEdge. Figure 6, shows the constraint graph at this stage.

Figure 6

Note: The handling for the first field (0th index) is slightly different because the pointer to the struct and the pointer to the first field can be used interchangeably. I’ll talk about it later (if I get around to it).

Note: The ObjPN corresponding to the whole object aobj is a FIObjPN (Field Insensitive Obj PN), but this is a bit of a misnomer. It does not mean that the object itself is field-insensitive. It just means that you can’t use this FIObjPN to talk about the individual fields of the object.

Now, let’s take a look at how the StoreEdge from Node 9 to Node 17 will be processed. As discussed earlier, the StoreEdge will be reduced to a CopyEdge from the source of the StoreEdge, to all nodes that exist in the points-to set of the destination. Thus, it adds a CopyEdge from 9 to 20. This is the final Constraint Graph, shown in Figure 7.

Figure 7