Custom Data-Flow Semantics

Motivation

Precisely modelling a program’s behaviour would require a whole-program analysis. The premise is sound, but capturing the whole execution environment means modelling third-party dependencies as well as native operations performed directly on the operating system.

This is expensive and non-trivial in practice, so Joern contains mechanisms that let it analyse a program well without modelling all of it. One such mechanism is the ability to define semantics for methods whose definitions are not modelled in the analysis. To remain sound, however, Joern treats external methods that have no semantic definition as able to propagate taint from every argument to every argument and to the return value.

Example

In the following example, an external method bar is called on the object foo with arguments a and b, and its return value is assigned to x.

a = "MALICIOUS"
foo = Foo()
x = foo.bar(a, b)
# True positives
print(a)
print(x)
# False positives
print(b)
print(foo)

Let’s assume the call to bar propagates taint from a to x. Without a semantic definition, Joern will propagate taint from a to itself, x, foo, and b. While this is sound, it is imprecise and could result in reachableBy returning additional, unrelated paths.
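The over-approximation can be pictured with a small toy model in Python. This is only an illustrative sketch, not Joern's implementation: propagate_unmodelled is a hypothetical helper, and positions are numbered with -1 for the return value, 0 for the receiver, and 1..n for the arguments.

```python
def propagate_unmodelled(tainted, arg_count):
    """Toy model: an external call with no semantics spreads taint everywhere.

    `tainted` is the set of positions tainted before the call
    (0 = receiver, 1..arg_count = arguments). Returns the set of positions
    tainted after the call, where -1 stands for the return value.
    """
    positions = set(range(0, arg_count + 1))  # receiver and arguments
    if tainted & positions:
        # Any tainted input reaches the receiver, every argument and the return value.
        return positions | {-1}
    return set()


# x = foo.bar(a, b) with only `a` (argument 1) tainted
after = propagate_unmodelled({1}, arg_count=2)
print(sorted(after))  # [-1, 0, 1, 2]
```

In this model foo (position 0) and b (position 2) end up tainted even though only a was, which corresponds to the false positives in the example.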

Defining Custom Semantics

The semantic definition language is fairly rudimentary. It lets a user express explicit data-flow paths between arguments; any flow that is not listed is assumed to be killed.

Semantics can be defined programmatically, as is done in DefaultSemantics, or parsed from a simple grammar via semanticsloader/Parser.

Basic Syntax

The basic syntax of a semantic is the method full name followed by source-destination argument pairs, e.g., "foo" 1->-1 2->3. Here -1 is the return value, 0 is the receiver/base of the call (relevant for object-oriented programming languages), and every index greater than 0 is one of the call’s arguments.

Following from the example above, the semantic definition for x = foo.bar(a, b) would look something like "Foo.bar" 1->-1 0->0 1->1 2->2. While the first entry is rather intuitive (flow from argument 1 propagates to the return value), the last three simply state that the data flow in the receiver and in each argument is not to be killed at this call site.
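To make the "missing flows are killed" rule concrete, here is a minimal Python sketch that reads a definition in the basic syntax and applies it to a set of tainted positions. It is a toy model rather than Joern's parser: parse_semantic and apply_semantic are hypothetical names, and named parameters are ignored.

```python
import re

# One mapping such as `1->-1` or `2 -> 3` (named parameters are ignored here).
MAPPING = re.compile(r"(-?\d+)\s*->\s*(-?\d+)")


def parse_semantic(line):
    """Split `"name" src->dst ...` into (name, [(src, dst), ...])."""
    match = re.match(r'\s*"([^"]+)"\s*(.*)', line)
    if match is None:
        raise ValueError(f"not a semantic definition: {line!r}")
    name, rest = match.groups()
    return name, [(int(s), int(d)) for s, d in MAPPING.findall(rest)]


def apply_semantic(mappings, tainted):
    """Positions tainted after the call; flows that are not listed are killed."""
    return {dst for src, dst in mappings if src in tainted}


name, mappings = parse_semantic('"Foo.bar" 1->-1 0->0 1->1 2->2')
print(name, mappings)
print(sorted(apply_semantic(mappings, {1})))  # [-1, 1]
```

With only argument 1 tainted, the result contains the return value and argument 1 itself, but neither the receiver nor argument 2.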

Named Arguments

Some languages allow for named arguments instead of only positional ones. For this, we allow the optional definition of the parameter name using the following syntax:

"foo" 1 "param1"->2 3 -> 2 "param2"

This should allow taint in the call foo(param1="MALICIOUS", a) to map flow from "MALICIOUS" to argument a, as well as taint in foo(param2=b, param1=a, param3="MALICIOUS") to map "MALICIOUS" to argument b.

Every mapping must have an argument index; a parameter name can optionally follow it. When a rule is interpreted at a matching call site, the parameter name takes precedence over the index.
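The precedence rule can be sketched as follows. This is a toy model in Python under stated assumptions: resolve is a hypothetical helper, a call site is represented as an ordered list of (name, expression) pairs, and an argument's index is assumed to be its position at the call site, which may not match how a given frontend assigns indices.

```python
def resolve(endpoint, call_args):
    """Pick the argument an endpoint refers to at a call site.

    `endpoint` is (index, name_or_None); `call_args` is the ordered list of
    (name_or_None, expression) pairs at the call. A matching parameter name
    wins over the index; otherwise the 1-based index is used.
    """
    index, name = endpoint
    if name is not None:
        for arg_name, expr in call_args:
            if arg_name == name:
                return expr
    if 1 <= index <= len(call_args):
        return call_args[index - 1][1]
    return None


# "foo" 1 "param1"->2 3 -> 2 "param2"
rule_a = ((1, "param1"), (2, None))
rule_b = ((3, None), (2, "param2"))

# foo(param1="MALICIOUS", a)
call_1 = [("param1", '"MALICIOUS"'), (None, "a")]
# foo(param2=b, param1=a, param3="MALICIOUS")
call_2 = [("param2", "b"), ("param1", "a"), ("param3", '"MALICIOUS"')]

print(resolve(rule_a[0], call_1), "->", resolve(rule_a[1], call_1))
print(resolve(rule_b[0], call_2), "->", resolve(rule_b[1], call_2))
```

In the second call the destination is found through the name param2 even though b is written first, which is the behaviour the precedence rule describes.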

Regex Matching

In some cases, calls cannot be resolved to the methods they invoke, and thus the method full name will not be fully present. The type recovery system currently in place may instead attempt to reconstruct a plausible path that one can match against, e.g., in Python we may see

from path import sanitizer

source = 1
# This should kill the flow from 1 -> -1, but not sanitize the argument pointer
x = sanitizer(source) 
sink(x)

We can add the following flow to the data-flow engine context, where the method full name should be interpreted as a regex.

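// Package paths as found in recent Joern releases; they may differ between versions.
import io.joern.dataflowengineoss.DefaultSemantics
import io.joern.dataflowengineoss.layers.dataflows.{OssDataFlow, OssDataFlowOptions}
import io.joern.dataflowengineoss.semanticsloader.FlowSemantic
import io.shiftleft.semanticcpg.layers.LayerCreatorContext

val extraFlows = List(
    FlowSemantic.from(
        "^path.*<module>\\.sanitizer$", // Method full name
        List((1, 1)), // Flow mappings: only the argument keeps its taint, so 1 -> -1 is killed
        regex = true  // Interpret the method full name as a regex string
    )
)

// `cpg` is the already loaded code property graph
val context = new LayerCreatorContext(cpg)
val options = new OssDataFlowOptions(semantics = DefaultSemantics().plus(extraFlows))
new OssDataFlow(options).run(context)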
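Before registering such a rule, it can help to check the pattern against the full names you expect. The sketch below does this with Python's re module; the candidate full names are hypothetical, since their exact shape depends on the frontend and on what type recovery reconstructs, and Joern evaluates the pattern with its own regex engine rather than Python's.

```python
import re

# Same pattern as in the rule; the Scala string "\\." is the regex `\.`
pattern = re.compile(r"^path.*<module>\.sanitizer$")

# Hypothetical method full names, for illustration only.
candidates = [
    "path.py:<module>.sanitizer",
    "path/__init__.py:<module>.sanitizer",
    "other.py:<module>.sanitizer",
    "path.py:<module>.sanitizer_v2",
]

matches = {name: bool(pattern.search(name)) for name in candidates}
for name, matched in matches.items():
    print(name, matched)
```

Only the first two candidates match: the third does not start with path, and the fourth does not end in sanitizer.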

Passthrough Mapping

To reduce the overhead of reiterating self-taint (e.g., 1->1), especially in the case of variadic arguments, the PASSTHROUGH syntax was added, e.g., "foo" PASSTHROUGH 0 -> 0 or FlowSemantic("foo", List(PassThroughMapping)).

This represents a call whose arguments are not sanitized, may affect the return value, and do not cross-taint, e.g., for foo(1, 2) it expands to 1 -> 1, 2 -> 2, 1 -> -1, 2 -> -1. Note that this does not map 0->0.
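The expansion described above can be written out as a short Python sketch. expand_passthrough is a hypothetical helper that mirrors the description for a call with a known number of arguments; it is not how Joern represents the mapping internally.

```python
def expand_passthrough(arg_count):
    """Explicit mappings implied by PASSTHROUGH for a call with `arg_count` arguments."""
    mappings = []
    for i in range(1, arg_count + 1):
        mappings.append((i, i))   # the argument keeps its own taint
        mappings.append((i, -1))  # the argument may taint the return value
    return mappings  # no (0, 0): the receiver needs its own explicit entry


print(expand_passthrough(2))  # [(1, 1), (1, -1), (2, 2), (2, -1)]
```

Because the number of pairs grows with the argument count, a single PASSTHROUGH entry is much easier to maintain than the explicit list for variadic calls.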

Notes on order vs argumentIndex

argumentIndex != order. order is based on the position of a node relative to its AST siblings. argumentIndex is similar, but refers to the position among the AST siblings that are joined to the parent call via an additional ARGUMENT edge. It is usually > 0, and the receiver (in OOP) has a RECEIVER edge in addition to its AST edge.

The semantic language interprets the argument positions using argumentIndex and not order!