-
Notifications
You must be signed in to change notification settings - Fork 17
Home
This specification focuses on the PDG model definition only, the representation syntaxes for portable exchange and visualization will be addressed in a separate document.
Companion files containing C source code and LLVM IR code are included to provide a running example that will be referenced throughout this document. The example.ll is generated by issuing the following command: clang -c -emit-llvm -g example.c
The Program Dependency Graph (PDG) is a graphical representation of a salient portion of the LLVM IR of a program; it is composed of PDG nodes and PDG edges. We use a TYPE_SUBTYPE naming convention for PDG nodes and PDG edges. We define two properties that we will use below:
- AllDisjoint(...) : The specified sets are disjoint
- Universe(...) = <UNIV> : The union of specified sets is the set <UNIV>
The different types/subtypes of PDG nodes are listed below.
INST_FUNCALL,
INST_RET,
INST_BR,
INST_OTHER, // In the future, we may break out INST_INDBR and INST_SWITCH to capture the other types of branching instructions in LLVM
VAR_STATICALLOCGLOBALSCOPE, // C: global variable LLVM: global with common linkage
VAR_STATICALLOCMODULESCOPE, // C: static global variable LLVM: global with internal linkage
VAR_STATICALLOCFUNCTIONSCOPE, // C: static function variable LLVM: global with internal linkage, variable name is prefixed by function name
VAR_OTHER // In the future we may also call out VAR_STACK and VAR_HEAP
FUNCTIONENTRY,
PARAM_FORMALIN,
PARAM_FORMALOUT,
PARAM_ACTUALIN,
PARAM_ACTUALOUT
ANNOTATION_VAR,
ANNOTATION_GLOBAL, // PL propose splitting global annotation into variables and functions so we would have ANNOTATION_GLOBAL_FUNC and ANNOTATION_GLOBAL_VAR
ANNOTATION_OTHER, We define each node type/sub-type below and provide an example for each.
Description: The PDG nodes representing LLVM instructions have the type <INST>_. For convenience, we create separate subtypes for some instructions, and the rest are included in other. Each type of instruction node is disjoint and the union of each type is the universe of possible instruction nodes.
INST_FUNCALL Description: Represents an LLVM call instruction.
- Example:
- Source Line:
36: greeter(username, &age); - IR Line:
159: call void @greeter(i8* %10, i32* %2), !dbg !114
- Source Line:
INST_RET Description: Represents an LLVM ret instruction.
- Example:
- Source Line:
27: return sz; - IR Line:
140: ret i32 %35, !dbg !93
- Source Line:
INST_BR Description: Represents an LLVM br instruction that contains a condition.
- Example:
- Source Line:
25: for (i=0; i<sz; i++) - IR Line:
108: br i1 %11, label %12, label %34, !dbg !79
- Source Line:
INST_OTHER Description: Represents all other LLVM instructions not specified above. Future versions of this specification may call out instructions for switch and indirect branches as separate subtypes.
- Description: Variable nodes denote where memory is allocated and have the type <VAR>_. Each type of variable node is disjoint and the union of each type is the universe of possible variable nodes.
VAR_STATICGLOBAL Description: Represents the allocation site for a global variable.
- Example:
- Source Line:
6: char *ciphertext; - IR Line:
11: @ciphertext = common dso_local global i8* null, align 8, !dbg !16
- Source Line:
VAR_STATICMODULE Description: Represents the allocation site for a static global variable.
- Example:
- Source Line:
7: static unsigned int i; - IR Line:
10: @i = internal global i32 0, align 4, !dbg !18
- Source Line:
VAR_STATICFUNCTION Description: Represents the allocation site for a static function variable.
- Example:
- Source Line:
11: static int sample = 1; - IR Line:
6: @greeter.sample = internal global i32 1, align 4, !dbg !0
- Source Line:
VAR_OTHER Description: Represents a variable allocation site not captured by VAR_STATICGLOBAL, VAR_STATICMODULE, or VAR_STATICFUNCTION.
FUNCTIONENTRY Description: Defines an entry point to a function.
- Example:
- Source Line:
17: void initkey (int sz) { - IR Line:
51: define dso_local void @initkey(i32 %0) #0 !dbg !38 {
- Source Line:
- Description: Parameter nodes denote formal input/output parameters or actual input/output parameters and have the type <PARAM>_. These parameter types are disjoint and the union of their types is the universe of possible parameter types.
PARAM_FORMALIN Description: Associated with every formal parameter in a function is a tree of PARAM_FORMALIN node(s).
- Example:
- Source Line:
9: void greeter (char *str, int* s) { - IR Line:
24: define dso_local void @greeter(i8* %0, i32* %1) #0 !dbg !2 {
- Source Line:
PARAM_FORMALOUT Description: Associated with every formal parameter in a function that is modified in the function body is a tree of PARAM_FORMALOUT node(s).
- Example:
- Source Line:
9: void greeter (char *str, int* s) { - IR Line:
26: define dso_local void @greeter(i8* %0, i32* %1) #0 !dbg !2 {
- Source Line:
PARAM_ACTUALIN Description: Associated with every actual parameter of a function call is a tree of PARAM_ACTUALIN node(s).
- Example:
- Source Line:
26: greeter(username, &age); - IR Line:
161: call void @greeter(i8* %10, i32* %2), !dbg !114
- Source Line:
PARAM_ACTUALOUT Description: Associated with every actual parameter received by the caller after the corresponding argument has been modified during the function call is a tree of PARAM_ACTUALOUT node(s).
- Example:
- Source Line:
26: greeter(username, &age); - IR Line:
161: call void @greeter(i8* %10, i32* %2), !dbg !114
- Source Line:
- Description: Annotation nodes denote LLVM annotations and have the type <ANNOTATION>_. The label associated with an annotation can have multiple values. Each annotation type is disjoint and the union of annotation types gives the universe of possible annotation nodes.
ANNOTATION_VAR Description: Represents the annotation of a variable.
- Example:
- Source Line:
32: char __attribute__((annotate("confidential")))username[20]; - IR Line:
14: @.str.4 = private unnamed_addr constant [13 x i8] c"confidential\00", section "llvm.metadata" - IR Line:
155: call void @llvm.var.annotation(i8* %6, i8* getelementptr inbounds ([13 x i8], [13 x i8]* @.str.4, i32 0, i32 0), i8* getelementptr inbounds ([10 x i8], [10 x i8]* @.str.3, i32 0, i32 0), i32 32), !dbg !104
- Source Line:
ANNOTATION_GLOBAL Description: Contains the annotations for exactly one function or global variable.
- Example:
- Source Line:
23: int __attribute__((annotate("sensitive"))) encrypt (char *plaintext, int sz) { - IR Line:
12: @.str.2 = private unnamed_addr constant [10 x i8] c"sensitive\00", section "llvm.metadata" - IR Line:
23: @llvm.global.annotations = appending global [2 x { i8*, i8*, i8*, i32 }] [{ i8*, i8*, i8*, i32 } { i8* bitcast (i32 (i8*, i32)* @encrypt to i8*), i8* getelementptr inbounds ([10 x i8], [10 x i8]* @.str.2, i32 0, i32 0), i8* getelementptr inbounds ([10 x i8], [10 x i8]* @.str.3, i32 0, i32 0), i32 23 }, { i8*, i8*, i8*, i32 } { i8* bitcast (i8** @key to i8*), i8* getelementptr inbounds ([7 x i8], [7 x i8]* @.str.12, i32 0, i32 0), i8* getelementptr inbounds ([10 x i8], [10 x i8]* @.str.3, i32 0, i32 0), i32 5 }], section "llvm.metadata" - Note: The node will refer to the first index of the llvm.global.annotations array in this example.
- Source Line:
ANNOTATION_OTHER Description: Represents an annotation not captured by ANNOTATION_VAR or ANNOTATION_GLOBAL.
Listed below are some useful properties of PDG node types and subtypes. Recall that AllDisjoint(<set1>, <set2>, ..., <setn>) indicates the set(s) specified are disjoint, meaning their intersection is empty. Universe(<set1>, <set2>, ..., <setn>) results in the union of the specified set(s).
- Every instruction in an LLVM IR instance will have a corresponding node in the PDG, and the set of these nodes is INST.
- Universe(INST_FUNCALL, INST_RET, INST_BR, INST_OTHER) = INST
- Every variable allocation in an LLVM IR instance will have a corresponding node in the PDG, and the set of these nodes is VAR.
- Universe(VAR_STATICGLOBAL, VAR_STATICFUNCTION, VAR_STATICMODULE, VAR_OTHER) = VAR
- Every annotation in an LLVM IR instance will have a corresponding node in the PDG, and the set of these nodes is ANNOTATION.
- Universe(ANNOTATION_VAR, ANNOTATION_GLOBAL, ANNOTATION_OTHER) = ANNOTATION
- Universe(PARAM_FORMALIN, PARAM_FORMALOUT, PARAM_ACTUALIN, PARAM_ACTUALOUT) = PARAM
- Universe(INST, VAR, ANNOTATION, FUNCTIONENTRY, PARAM) = PDGNODES
- AllDisjoint(INST_FUNCALL, INST_RET, INST_BR, INST_OTHER)
- AllDisjoint(VAR_STATICGLOBAL, VAR_STATICFUNCTION, VAR_STATICMODULE, VAR_STACK, VAR_HEAP)
- AllDisjoint(ANNOTATION_VAR, ANNOTATION_GLOBAL, ANNOTATION_OTHER)
- AllDisjoint(PARAM_FORMALIN, PARAM_FORMALOUT, PARAM_ACTUALIN, PARAM_ACTUALOUT)
- AllDisjoint(INST, VAR, ANNOTATION, FUNCTIONENTRY, PARAM)
We summaizee the PDG edge types below.
CONTROLDEP_CALLINV,
CONTROLDEP_CALLRET,
CONTROLDEP_ENTRY,
CONTROLDEP_BR,
CONTROLDEP_OTHER,
DATADEP_DEFUSE,
DATADEP_RAW,
DATADEP_RET,
DATADEP_ALIAS,
PARAMETER_IN,
PARAMETER_OUT,
PARAMETER_FIELD,
ANNO_GLOBAL, // PL propose splitting global annotation edges into variables and functions so we would have ANNO_GLOBAL_FUNC and ANNO_GLOBAL_VAR
ANNO_VAR,
ANNO_OTHER,- Description: Control dependency edges are used to indicate that the execution of the destination node depends on the execution of the source node and have type: <CONTROLDEP>_. Each type of CONTROLDEP edge is disjoint and the union of each type is the universe of possible CONTROLDEP edges.
CONTROLDEP_CALLINV Description: CONTROLDEP_CALLINV edge connects an INST_FUNCALL node with the FUNCTIONENTRY node of callee. It indicates the control flow transition from the caller to callee.
- Example:
- Source
- Source Line:
26: greeter(username, &age); - IR Line:
161: call void @greeter(i8* %10, i32* %2), !dbg !114
- Source Line:
- Destination
- Source Line:
9: void greeter (char *str, int* s) { - IR Line:
26: define dso_local void @greeter(i8* %0, i32* %1) #0 !dbg !2 {
- Source Line:
- Source
CONTROLDEP_CALLRET Description: CONTROLDEP_CALLRET edge connects an INST_RET node with the corresponding INST_FUNCALL of the caller. It indicates the control flow transition from the callee back to the caller. If there are multiple INST_RET instructions, they will map to the same INST_FUNCALL node.
- Example:
- Source
- Source Line:
27: return sz; - IR Line:
140: ret i32 %35, !dbg !93
- Source Line:
- Destination
- Source Line:
41: int sz = encrypt(text, strlen(text)); - IR Line:
174: %21 = call i32 @encrypt(i8* %17, i32 %20), !dbg !126
- Source Line:
- Source
CONTROLDEP_ENTRY Description: Connects the FUNCTIONENTRY node with every INST_[TYPE] node that would be unconditionally executed inside a function body.
- Example:
- Source
- Source Line:
9: void greeter (char *str, int* s) { - IR Line:
26: define dso_local void @greeter(i8* %0, i32* %1) #0 !dbg !2 {
- Source Line:
- Destination
- Source Line:
10: char* p = str; - IR Line:
27: %3 = alloca i8*, align 8
- Source Line:
- Source
CONTROLDEP_BR Description: Connects an INST_BR node to every INST_[TYPE] node of the control-dependent basic block(s)'s.
- Example:
In the example label %12 corresponds to line 108
- Source
- Source Line:
25: for (i=0; i<sz; i++) - IR Line:
108: br i1 %11, label %12, label %34, !dbg !79
- Source Line:
- Destination
- Source Line:
26: ciphertext[i]=plaintext[i] ^ key[i]; - IR Line:
111: %13 = load i8*, i8** %3, align 8, !dbg !80
- Source Line:
- Source
CONTROLDEP_OTHER Description: Captures control dependencies not captured by the above.
- Description: Data dependency edges are used to indicate that the source node refers to data used by the destination node and have type: <DATADEP>_. Each type of DATADEP edge is disjoint and the union of each type is the universe of possible DATADEP edges.
DATADEP_DEFUSE Description: DATADEP_DEFUSE edge connects a def node (Either an INST_[Type] or VAR_[Type]) and use node (INST_[Type]). It is directly computed from the LLVM def-use chain.
- Example:
- Source
- Source Line:
40: initkey(strlen(text)); - IR Line:
166: %15 = call i64 @strlen(i8* %14) #7, !dbg !119
- Source Line:
- Destination
- Source Line:
40: initkey(strlen(text)); - IR Line:
167: %16 = trunc i64 %15 to i32, !dbg !119
- Source Line:
- Source
DATADEP_RAW Description: DATA_RAW connects two INST_[Type] nodes with read-after-write dependence. This is flow-sensitive. We use memory dependency LLVM pass to compute this information.
- Example:
- Source
- Source Line:
24: ciphertext = (char *) (malloc (sz)); - IR Line:
95: store i32 %1, i32* %4, align 4
- Source Line:
- Destination
- Source Line:
24: ciphertext = (char *) (malloc (sz)); - IR Line:
97: %5 = load i32, i32* %4, align 4, !dbg !69
- Source Line:
- Source
DATADEP_RET Description: DATA_RET edge connects the return value from an Inst_RET node to the corresponding INST_FUNCALL node in the caller function. It indicates the data flow from the return instruction to the call instruction
- Example:
- Source
- Source Line:
27: return sz; - IR Line:
140: ret i32 %35, !dbg !93
- Source Line:
- Destination
- Source Line:
41: int sz = encrypt(text, strlen(text)); - IR Line:
174: %21 = call i32 @encrypt(i8* %17, i32 %20), !dbg !126
- Source Line:
- Source
DATADEP_ALIAS Description: DATA_ALIAS edge connects two nodes that have may-alias relations, meaning the source and destination node might (not must) refer to the same memory location. Note that if two nodes n1 and n2 have may-alias relation, then there are two DATA_ALIAS edges exist between them, one from n1 to n2 and one from n2 to n1. This is because the alias relation is bidirectional.
- Example:
- Source
- Source Line:
10: char* p = str; - IR Line:
35: %6 = load i8*, i8** %3, align 8, !dbg !31
- Source Line:
- Destination
- Source Line:
12: printf("%s\n", p); - IR Line:
37: %7 = load i8*, i8** %5, align 8, !dbg !32
- Source Line:
- Source
- Description: Parameter tree edges show data flow from the caller to callee and vice versa denoted with type: <PARAMETER>_. Each type of PARAMETER edge is disjoint and the union of each type is the universe of possible PARAMETER edges.
PARAMETER_IN Description: PARAMETER_IN edge represents interprocedural data flow from caller to the callee. It connects
-
actual_parameter_in_tree node and formal_in tree nodes
-
formal_in_tree node and the IR variables that correspond to these formal_in_tree nodes in the callee.
- Example:
- Source
- Source Line:
26: greeter(username, &age); - IR Line:
161: call void @greeter(i8* %10, i32* %2), !dbg !114
- Source Line:
- Destination
- Source Line:
9: void greeter (char *str, int* s) { - IR Line:
26: define dso_local void @greeter(i8* %0, i32* %1) #0 !dbg !2 {
- Source Line:
- Source
PARAMETER_OUT Description: edge represents data flow from callee to caller. It connects
-
arguments modified in callee to formal_out_tree node.
-
formal_out_tree node to actual_out_tree node.
-
actual_out_tree node to the modified variable in the caller.
- Example:
- Source
- Source Line:
9: void greeter (char *str, int* s) { - IR Line:
26: define dso_local void @greeter(i8* %0, i32* %1) #0 !dbg !2 {
- Source Line:
- Destination
- Source Line:
26: greeter(username, &age); - IR Line:
151: call void @greeter(i8* %10, i32* %2), !dbg !114
- Source Line:
- Source
PARAMETER_FIELD Description: PARAMETER_FIELD edge connects a parent parameter tree node to a child parameter tree node.
- Example:
- Source
- Source Line:
9: void greeter (char *str, int* s) { - IR Line:
26: define dso_local void @greeter(i8* %0, i32* %1) #0 !dbg !2 {
- Source Line:
- Destination
- Source Line:
9: void greeter (char *str, int* s) { - IR Line:
26: define dso_local void @greeter(i8* %0, i32* %1) #0 !dbg !2 {
- Source Line:
- Source
- Description: Annotation Nodes are connected to other nodes with <ANNO>_ edge type. Each type of ANNO edge is disjoint and the union of each type is the universe of possible ANNO edges.
ANNO_GLOBAL Description: Connects FUNCTIONENTRY nodes or VAR_STATIC* nodes to ANNOTATION_GLOBAL nodes.
- Example:
- Source
- Source Line:
23: int __attribute__((annotate("sensitive"))) encrypt (char *plaintext, int sz) { - IR Line:
90: define dso_local i32 @encrypt(i8* %0, i32 %1) #0 !dbg !62 {
- Source Line:
- Destination
- Source Line:
23: int __attribute__((annotate("sensitive"))) encrypt (char *plaintext, int sz) { - IR Line:
12: @.str.2 = private unnamed_addr constant [10 x i8] c"sensitive\00", section "llvm.metadata" - IR Line:
23: @llvm.global.annotations = appending global [2 x { i8*, i8*, i8*, i32 }] [{ i8*, i8*, i8*, i32 } { i8* bitcast (i32 (i8*, i32)* @encrypt to i8*), i8* getelementptr inbounds ([10 x i8], [10 x i8]* @.str.2, i32 0, i32 0), i8* getelementptr inbounds ([10 x i8], [10 x i8]* @.str.3, i32 0, i32 0), i32 23 }, { i8*, i8*, i8*, i32 } { i8* bitcast (i8** @key to i8*), i8* getelementptr inbounds ([7 x i8], [7 x i8]* @.str.12, i32 0, i32 0), i8* getelementptr inbounds ([10 x i8], [10 x i8]* @.str.3, i32 0, i32 0), i32 5 }], section "llvm.metadata"
- Source Line:
- Source
ANNO_VAR Description: Connects INST nodes or VAR_OTHER nodes to ANNOTATION_VAR nodes.
- Example:
- Source
- Source Line:
32: char __attribute__((annotate("confidential")))username[20]; - IR Line:
145: %3 = alloca [20 x i8], align 16
- Source Line:
- Destination
- Source Line:
32: 32: char __attribute__((annotate("confidential")))username[20]; - IR Line:
14: @.str.4 = private unnamed_addr constant [13 x i8] c"confidential\00", section "llvm.metadata"" - IR Line:
call void @llvm.var.annotation(i8* %6, i8* getelementptr inbounds ([13 x i8], [13 x i8]* @.str.4, i32 0, i32 0), i8* getelementptr inbounds ([10 x i8], [10 x i8]* @.str.3, i32 0, i32 0), i32 32), !dbg !104
- Source Line:
- Source
ANNO_OTHER Description: Connects a PDG node to an annotation type not captured by the above.
Listed below are useful properties of PDG edge types and subtypes.
- Every instruction node inside a function body is reachable from the FUNC_ENTRY node through CONTROLDEP_* edges.
- From the FUNC_ENTRY node for the main function, every other function is reachable unless the function dead code (never called).
- Universe(CONTROLDEP_CALLINV, CONTROLDEP_CALLRET, CONTROLDEP_ENTRY, CONTROLDEP_BR, CONTROLDEP_OTHER) = CONTROLDEP
- Universe(DATADEP_DEFUSE, DATADEP_RAW, DATADEP_READ, DATADEP_RET, DATADEP_ALIAS) = DATADEP
- Universe(ANNO_FUNC, ANNO_VAR, ANNO_OTHER) = ANNO
- Universe(VARDEP_STATIC, VARDEP_OTHER) = VARDEP
- Universe(PARAMETER_IN, PARAMETER_OUT, PARAMETER_FIELD) = PARAMETER
- Universe(CONTROLDEP, DATADEP, ANNO, VARDEP, PARAMETER) = PDGEDGES
- AllDisjoint(CONTROLDEP_CALLINV, CONTROLDEP_CALLRET, CONTROLDEP_ENTRY, CONTROLDEP_BR, CONTROLDEP_OTHER)
- AllDisjoint(DATADEP_DEFUSE, DATADEP_RAW, DATADEP_READ, DATADEP_RET, DATADEP_ALIAS)
- AllDisjoint(ANNO_FUNC, ANNO_VAR, ANNO_OTHER)
- AllDisjoint(VARDEP_STATIC, VARDEP_OTHER)
- AllDisjoint(PARAMETER_IN, PARAMETER_OUT, PARAMETER_FIELD)
- AllDisjoint(CONTROLDEP, DATADEP, ANNO, VARDEP, PARAMETER)
- Handles subset of the C language (excludes C++ etc.)
- Does not support goto statements
- Does not support function pointers
- Does not support variatic functions
| Version | Date | Comments |
|---|---|---|
| 1.0.2 | 4/14/2021 | Revised dcument based on discussion on 4/13/21. See requests from 4/13 for a complete list of changes. |
| 1.0.1 | 4/8/2021 | Added node and PDG definitions/properties and updated edge definitions/properties. Included example illustrating most nodes/edges. |
| 1.0.0 | 3/26/2021 | Initial Version |
| Request | Date | Description | Status | |
|---|---|---|---|---|
| Update DefUse example | 4/16/2021 | Pending | ||
| Update example to have multiple annotation labels | 4/14/2021 | Resolved | ||
| Update edge properties to include functions reachable from main | 4/13/2021 | Resolved | ||
| Update annotation node definition to include multiple labels | 4/13/2021 | Resolved | ||
| Remove VAR_DEP edges | 4/13/2021 | Resolved | ||
| Update def use to include static variables | 4/13/2021 | Resolved | ||
| Connect IR line 33 to line 35 in the alias example | 4/13/2021 | Resolved | ||
| Update control edges destinations | 4/13/2021 | Resolved | ||
| Update annotation global such that it can be broken into separate entities. | 4/13/2021 | Resolved | ||
| Define Control Edges | 4/5/2021 | Resolved | ||
| Define a PDG (a graph with nodes and edges) | 4/1/2021 | Resolved | ||
| Define all nodes | 4/1/2021 | Resolved | ||
| Try Structure the nodes such that node types are disjoint | 4/1/2021 | Resolved | ||
| Add an example for CONTROLDEP_IND_BR | 3/26/2021 | Not needed at this time | N/A | |
| PDG capture C at the moment. Need more changes on handling C++ in future. | 3/17/2021 | Unresolved | ||
| In C program, add function pointer (invoke) and indirect branch examples. | 3/17/2021 | Not needed at this time | N/A | |
| In C program, add two level pointer example. | 3/17/2021 | Unresolved | ||
| if an attributes can apply to more than one subtypes, make it an attribute of the type (a field). | 3/17/2021 | Unresolved | ||
| SUBTYPEs are disjointed; union of SUBTYPES is TYPE. | 3/17/2021 | Resolved | ||
| Use TYPE_SUBTYPE for node and edges. | 3/17/2021 | Resolved | ||
| create nodes for annotation instruction (must be a sink node). | 3/17/2021 | INST_ANNO_LOCAL INST_ANNO_GLOBAL | Resolved | |
| Create edges for ANNO_FUNC and ANNO_VAR. | 3/17/2021 | ANNO_FUNC: from function entry node to annotation node. ANNO_VAR: from variable node to annotation node. | Resolved | |
| Rename CONTROLDEP Edges | 3/17/2021 | CONTROLDEP_CALLINV CONTROLDEP_CALLRET CONTROLDEP_ENTRY CONTROLDEP_BR CONTROLDEP_IND_BR | Resolved | |
| For unhandled nodes and edges, label them with TYPE_OTHERNODE, TYPE_OTHEREDGE. | 3/17/2021 | Resolved | ||
| Add GLOBALVAR_GLOBAL / GLOBALVAR_LOCAL to represent global/static variables | 3/17/2021 | Resolved | ||
| Remove "sensitive" info from the annotation node. | 3/17/2021 | Resolved |