The code property graph is a data structure designed to mine large codebases for instances of programming patterns. These patterns are formulated in a domain-specific language (DSL) based on Scala. It serves as a single intermediate program representation across all languages supported by Joern and its commercial brother Ocular.
Property graphs are a generic abstraction supported by many contemporary graph databases such as Neo4j, OrientDB, and JanusGraph. In fact, older versions of Joern made use of general purpose graph databases as a storage backend and the graph query language Gremlin. As the limitations of this approach became more apparent over the years, we replaced both the storage backend and query language with our own graph database OverflowDB.
Qwiet AI (formerly ShiftLeft) has open-sourced the implementation of the code property graph and its specification.
Building Blocks of Code Property Graphs #
Code property graphs are graphs, and more specifically property graphs. A property graph is composed of the following building blocks:
Nodes and their types. Nodes represent program constructs. This includes low-level language constructs such as methods, variables, and control structures, but also higher level constructs such as HTTP endpoints or findings. Each node has a type. The type indicates the type of program construct represented by the node, e.g., a node with the type
METHOD
represents a method while a node with typeLOCAL
represents the declaration of a local variable.Labeled directed edges. Relations between program constructs are represented via edges between their corresponding nodes. For example, to express that a method contains a local variable, we can create an edge with the label
CONTAINS
from the method’s node to the local’s node. By using labeled edges, we can represent multiple types of relations in the same graph. Moreover, edges are directed to express, e.g., that the method contains the local but not the other way around. Multiple edges may exist between the same two nodes.Key-Value Pairs. Nodes carry key-value pairs (attributes), where the valid keys depend on the node type. For example, a method has at least a name and a signature while a local declaration has at least the name and the type of the declared variable.
In summary, code property graphs are directed, edge-labeled, attributed multigraphs, and we insist that each node carries at least one attribute that indicates its type.
A Look into the Rear-View Mirror
The Code Property Graph was first introduced in the paper Modeling and Discovering Vulnerabilities with Code Property Graphs in the context of vulnerability discovery for C system code and the Linux kernel in particular. The core ideas outlined in this early work are the following:
Different classic program representations are merged into a property graph into a single data structure that holds information about the program’s syntax, control- and intra-procedural data-flow.
The property graph is stored in a graph database and made accessible via a domain specific language (DSL) for the identification of programming patterns - based on a DSL for graph traversals.
The query language allows to seamlessly transition between the original code representations, making it possible to combine aspects of the code from different views these representations offer in a single query.
From 2014-2016, research followed on (a) extending the concept for interprocedural analysis, (b) using the graph as a basis for learning typical data-flow patterns in a program, (c) the effects of introducing further classic program representations such as dominator trees, (d) and the applicability of the approach for dynamically-typed languages such as PHP.
From 2017 onwards, the code property graph served as the technological foundation for the static analysis solutions developed at ShiftLeft Inc. The representation has since undergone heavy extensions and transformations. At the statement and expression level, it has matured into a generic container format which allows for hosting of graphs generated by 8 different language frontends, to enable querying with the same query language across programming languages. Moreover, the concept of overlays was introduced to allow representing code on different levels of abstraction, enabling transitioning between these layers of abstraction using the query language in the same way as for the original three low-level code representations. Finally, programming models and APIs are now available for parallel graph processing at low memory footprint - a core ingredient for scaling the approach.