# SVF’s Field-Sensitivity: Handling of GEP Edges

Let’s now look at how SVF handles the simpler cases of field-sensitive analysis (without any cycle elimination and positive-weighted-cycles)

Imagine a struct A {int idx; struct A* next};
And consider the C statement:
p->idx = 10; // p is a pointer pointing to an object of type A

Clearly, this statement involves computing the address of the field ‘idx’ of the type struct A. In LLVM IR, this is done by the Instruction GetElementPtrInst. It looks like —

%idx = getelementptr inbounds %struct.A, %struct.A* %1, i32 0, i32 0

idx is a virtual register that contains the address of the field ‘idx’.

The GEP instruction has its own dedicated page on LLVM website 🙂 https://llvm.org/docs/GetElementPtr.html

Anyhow, what we need to remember is that GEP computes a pointer into a struct type. It doesn’t access any memory, it’s purely pointer computation. And the last index (0 in this case) denotes the index of the field being accessed (0 means the first field — that is, ‘idx’)

Now, every time SVF encounters a GEP instruction, it creates a GEPEdge from the base pointer to the field pointer. The base pointer is the IR pointer to the object, and field pointer is the IR pointer to the field within the object. This means that there can be multiple GEPEdges for the same field, if the same field is accessed multiple times. It simply means that the address of that field was computed multiple times (and loaded into a virtual register).

So, let’s take a look at what that looks like in the SVF Constraint Graph. Consider the C source code —

struct A {int idx; int *p};

int main( void){

int k = 10;

struct A aobj;

aobj.p = &k;

return 0;

}

The IR instructions —

%k = alloca i32, align 4

%aobj = alloca %struct.A, align 8

store i32 0, i32* %retval, align 4

store i32 10, i32* %k, align 4

%p = getelementptr inbounds %struct.A, %struct.A* %aobj, i32 0, i32 1

store i32* %k, i32** %p, align 8

The initial constraint graph is shown in Figure 5. The purple edge is the GepEdge (There are actually two types of Gep Edges, but for now, let’s talk only about the “Normal” GepEdge — the NormalGepCGEdge). It does not show in the figure, but every Normal Gep Edge has the field index associated with it. And there can be multiple edges with the same index, if the field at the same index is accessed multiple times.

Figure 5.

Conceptually, a GepEdge is similar to a CopyEdge, with an added constraint on which fields are copied (similar to how Pearce et.al extended Andersen’s plain copy constraint). Figure 5 shows an example. Let’s focus only on nodes 12, 11, and 17. First the AddressEdge will make sure that pts-to-set(11) = {12}. Then, the purple GepEdge will act like a Copy Constraint, but instead of copying the whole pts-to-set(11), it’ll copy only the objects corresponding to the field of the GepEdge. In this case, the GepEdge corresponded to field index 1 (the second field), so it’ll copy only the second field of the ObjPN with id 12.

Now, as you can see, there’s no such node in the graph (yet). So far, we just have a single node for the ObjPN for aobj. So SVF will create this ObjPN — a GepObjPN, with base node-id as 12, and field-index as 1. Assume this GepObjPN gets a NodeId of 20, then points-to-set(17) = {20}.

If the second field is accessed again, a different GepEdge will be created, but solving it will result in the same GepObjPN node with NodeId = 20, being copied into the points-to-set of the destination of the GepEdge. Figure 6, shows the constraint graph at this stage.

Figure 6 Note: The handling for the first field (0th index) is slightly different because the pointer to the struct and the pointer to the first field can be used interchangeably. I’ll talk about it later (if I get around to it).

Note: The ObjPN corresponding to the whole object aobj is a FIObjPN (Field Insensitive Obj PN), but this is a bit of a misnomer. It does not mean that the object itself is field-insensitive. It just means that you can’t use this FIObjPN to talk about the individual fields of the object.

Now, let’s take a look at how the StoreEdge from Node 9 to Node 17 will be processed. As discussed earlier, the StoreEdge will be reduced to a CopyEdge from the source of the StoreEdge, to all nodes that exist in the points-to set of the destination. Thus, it adds a CopyEdge from 9 to 20. This is the final Constraint Graph, shown in Figure 7.

Figure 7 # Field Sensitive Pointer Analysis

Field-sensitivity is the ability to distinguish between individual fields of a structure. Field-sensitivity introduces a problem with cycle-elimination, called Positive-Weighted-Cycle which I’ll talk about later.

Field Sensitive Analysis was introduced by Pearce et.al. in Efficient Field Sensitive Pointer Analysis.

Cycle detection and collapsing is a very important optimization that helps Andersen’s analysis. To recap, a cycle is of the form p⊇ q, ⊇ q r, r ⊇ p. In SVF’s world, it is a CopyEdge from q → p, r → q, and finally from p → r. And all the pointers in this cycle share the same solution, and therefore, can be collapsed.

In order to make Andersen’s analysis field-sensitive, Pearce et.al extended the original constraint model with the concept of “weighted constraints” or “weighted constraint edges”. Every field in a complex struct type has an index with respect to the base of the object. And this index is assigned as the weight of the constraint.

The constraint itself is of the form p ⊇ q + k, where k is the weight / field-index.

In order to derive the solution of this constraint, let’s first recap the solution without the field-sensitivity. The derivation rule looks like —

p ⊇ q, q ⊇ {r} then, p ⊇ {r}

Now, with the addition of the field constraint, we need to match the field index, in addition with the base-object (r, in this case). So we get —

p ⊇ q+k, p {r}, idx(s) = idx(r) + k, idx(s) <= end(r), then, p ⊇ {s}

Which is just a mathematical way of saying s is the object at an offset of k from r. Note that s is bounded by end(r), which is the maximum number of fields in the object.

# Cycles in the Constraint Graph

One of the common optimizations in Andersen’s points-to analysis is cycle elimination. At a source code level, consider the following pointer assignments —

int a = 100;
int *p, *q, *r;
p = &a;
q = p;
r = q;
p = r;

This represents a cycle, from p → q → r → p. This is a very simple example, and these cycles can happen across functions, and complex situations, so it’s useful to be able to eliminate these cycles.

The key point about cycles is that all nodes in a cycle share the same points-to set. So, the analysis can be optimized by replacing all nodes in a cycle by a single node.

Now let’s look at SVF’s implementation.

Note that only copy constraints can cause cycles. In SVF world, the copy constraints are represented by CopyEdges in the constraint graph. Remember, that during solving the LoadEdges and the StoreEdges, CopyEdges are introduced. These CopyEdges are introduced to make it easier to track cycles. So, anytime a cycle is introduced during solving, it shows up as a cycle of CopyEdges, and all nodes that are part of this cycle can be merged/collapsed into a single Node.

Now, let’s take a look at a simple example of a cycle — int *p, *q; p = q; q = p;

This is the same example that we were looking at earlier. Figure 3 shows how the cycle of Copy Edges is created in this case. The cycle is nodes 22 → 30 → 20 → 28 → 22. So all of these nodes will be collapsed into a single node. This creates the final constraint graph that looks like Figure 4.

Figure 4

Note that each Copy Constraints in the C code, is translated into two constraints in the IR — a Deref/Load constraint, and a Assign/Store constraint, and the effect of solving the two Copy Constraints in the C source code is the same as the effect of solving the two Load and two Store constraints in the IR.

# Equivalence of analysis in C Source Code and IR

The LLVM IR is a representation of the C Source Code. But the constraints generated for the statement in C and in IR might be different.

For example, p = *q; is a single constraint of Deref/Load type in C, but results in two Load constraints and one Store constraint in IR.

But this doesn’t mean that the result of performing the points-to analysis on C source code is different from performing it on the IR!!! In IR, you just have a more finer view into the memory operations, but eventually, the points-to sets for a memory location / variable will be the same whether you do it in C or LLVM IR.

# SVF Implementation of Andersen’s Analysis

This brings us to the SVF implementation of Andersen’s Analysis. SVF operates on the LLVM IR. And this presents some quirks.

SVF represents the constraints as a constraint graph. As expected there are 4 types of constraints (Address-Of, Copy, Assign, and Deref). Only SVF calls them —

2. Copy
3. Store

(In addition to these 4, SVF also has a fifth type of constraint to handle field sensitivity, that I’ll talk about later).

In the constraint graph, the memory objects and pointers are represented as nodes, and the constraints between them as edges. IR pointers are represented as ValPN nodes, memory objects are represented as ObjPN nodes. Every node, both ValPN and ObjPN has points-to set associated with it.

SVF solves constraints by first reducing all Deref (Load) and Assign (Store) constraints to Copy Constraints. Check the underlined part in the Andersen’s Analysis for why this is reasonable. Reducing to Copy Constraints allows SVF to apply optimizations (such as cycle detection, that I’ll talk about later)

Let’s take a look at how different LLVM IR instructions result in the creation of new nodes and edges in the constraint graph (Actually in the PAG graph, but let’s defer the discussion of PAG vs Constraint graph for later).

1. AllocaInst: This instruction is used to create a local variable on the stack. When SVF sees this instruction, say p = alloca i32**, corresponding to the C statement int *p, it creates two nodes. One for the pointer p (a ValPN node), and one for the memory object (a ObjPN node). The memory object represents the memory associated with the variable.

An Addr Edge is added between the ObjPN and the ValPN, denoting the address-of relationship. When this AddrEdge constraint is solved, points-to set of the ValPN includes the ObjPN.

This is exactly the same thing as solving the Addr-of constraint. p = &q.

Though the language changes ( C → IR) the rules that guide the derivation of the constraints don’t change. A Copy Constraint derives in the same way in the IR, as in C.

For example, consider the bolded C statements from the simple C program, whose Constraint Graph is shown in the figure below —

int a = 100;
int *p, *q;

p = &a;
q = p;
p = q;

Translated to IR —
%a = alloca i32, align 4
%p = alloca i32*, align 8
%q = alloca i32*, align 8

This creates three ValPNs and their corresponding three ObjPNs. The Green edges are the Address Edges.

Each Node in the ConstraintGraph has a unique id (19, 20, etc).

Figure 1: 1. LoadInst: This instruction is used to explicitly load the contents of an IR pointer into a virtual register. When SVF sees this instruction, say %0 = load i32*, i32** q, it creates one ValPN node for the virtual register %0, and adds a LoadEdge between the ValPN node for q (not the ObjPN) and the ValPN node for %0.

Consider the bolded statement in the C program,

int a = 100;
int *p, *q;
p = &a;
q = p;
p = q;

This is translated to the IR instructions:

%0 = load i32*, i32** %p, align 8

store i32* %0, i32** %q, align 8
%1 = load i32*, i32** %q, align 8

store i32* %1, i32** %p, align 8

Let’s look at the first Load instruction. In the above constraint graph it is represented by the Red edge from nodes 21 → 30.

This LoadEdge represents a Deref constraint. As stated before, SVF reduces this to a Copy Constraint from all nodes that exist in the points-to set (source) to the destination of the load. So in the simplest case, LoadEdge will lead to the creation of a CopyEdge from the ObjPN that existed in the source ValPN’s pts-to set (added because of the AddrEdge), to the ValPN of the destination.

[Quick tip: Think about how you’d solve p = *q. p is the destination, q is the source of this load. You’d first find all elements you know q to point to, (call them e1, e2, etc) and then add copy constraints q = e1, q = e2, etc. The source is the e’s and the destination is q.]

Solving the two LoadEdges (21 → 30 and 19 → 28) results in adding the following CopyEdges (black edges)
Figure 2 1. StoreInst: This instruction is used to explicitly store the contents of a virtual register to a memory location. When SVF sees this instruction, say  store i32* %1, i32** %p, it creates an StoreEdge from the source ValPN (%1) to the destination ValPN (%p).

This StoreEdge represents an Assign Constraint. In the simplest case, this will be reduced to a CopyEdge from the ValPN of the source pointer, to the ObjPN in the pts-to set of the ValPN of the destination pointer (added because of the AddrEdge)

Let’s continue with the previous IR instructions —

%0 = load i32*, i32** %p, align 8

store i32* %0, i32** %q, align 8
%1 = load i32*, i32** %q, align 8

store i32* %1, i32** %p, align 8

[Quick Tip: In order to understand how SVF solves this, think about the assign constraint *p = q. To solve this, you’d first find all elements e that exist in the points to set for p, and then add a Copy Constraint e = q.]

The three Store Edges add the three (bolded, dashed) black Copy Edges.

Figure 3 CastInst: CastInst instructions do BitCasts and other casts and these result in CopyEdges in the LLVM IR. CopyEdges are solved in the same way as Copy Constraints in the Andersen’s analysis.