Graphs appear everywhere in software. A wide variety of difficult or slow algorithmic problems become much simpler when solved as graph problems instead. One of the most common queries that K2 performs against the database is figuring out which groups (and the groups that contain those groups, and so on) a user belongs to. The existing solution uses a naive recursive algorithm, expressed as a recursive CTE, to find all the groups the user belongs to. This algorithm is not only slow but also uses an enormous amount of memory to detect and break out of infinite recursion. It also imposes a hard limit on the depth of group nesting, because SQL won’t evaluate CTEs to an arbitrary depth.
What are Graphs?
Most people would consider graphs to be the kinds of charts you find in spreadsheets. While those are indeed called graphs, we are referring to the following type of structure:
A graph. I’ve never been able to figure out a real business case for having cycles in an identity store, but they exist, and so we have to support them.
- The nodes in the graph refer to the blue circles in the above diagram: individual values that are related in some way.
- The edges are the lines in the graph that form relationships between nodes. A graph can be directed (where edges have arrows, or a direction) or undirected.
- There are two broad types of directed graphs: cyclic and acyclic. A cycle is formed when edges form a loop, or recursive dependency, between one or more nodes. Cyclic graphs contain one or more cycles; acyclic graphs contain none.
A graph is a pretty useful structure for storing the relationships between users, their groups and their managers in identity stores (Active Directory, GSuite and so forth). Users and groups can be represented by nodes, and membership can be represented using directed edges.
Traversal is one of the many queries that can be run against a graph. It asks the question, “if I start at a node and repeatedly follow edges, which nodes do I arrive at?” For example, if we followed edges forward from Group A, we’d arrive at Group B, Group C, Group A (again), and finally Bob. This would also answer the question: “which groups and users does Group A contain?” Following edges backward from Group A would answer the question, “which groups does Group A belong to?”
How do we make traversal queries efficient, and, most importantly, how do we express them in SQL?
The Natural Solution
The natural (and naive) solution is pretty clear: use recursion.
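If the memberships were stored as a simple edge table, the naive query might look something like this sketch (SQLite is used here for illustration; the table and column names are hypothetical, and SQL Server’s recursive CTE syntax differs slightly):

```python
import sqlite3

# Hypothetical memberships(parent_id, child_id) edge table. The sample
# data is deliberately acyclic: a cycle would make this query recurse
# forever under UNION ALL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE memberships (parent_id TEXT, child_id TEXT);
    INSERT INTO memberships VALUES
        ('Group A', 'Group B'),
        ('Group B', 'Group C'),
        ('Group C', 'Bob');
""")

# Follow edges forward from Group A to find everything it contains.
rows = conn.execute("""
    WITH RECURSIVE members(id) AS (
        SELECT child_id FROM memberships WHERE parent_id = 'Group A'
        UNION ALL
        SELECT m.child_id
        FROM memberships m
        JOIN members ON m.parent_id = members.id
    )
    SELECT id FROM members
""").fetchall()
print(sorted(r[0] for r in rows))  # ['Bob', 'Group B', 'Group C']
```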
This query has a major problem, though: if it encounters a cycle it will enter an infinite recursion. This can be prevented by keeping track of all the IDs that have been encountered. Unfortunately, this is both awkward and horribly inefficient in SQL (in terms of both memory and CPU time): we have to concatenate them into a delimited string and perform a CONTAINS() on that string.
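For completeness, the workaround looks something like the sketch below: each row drags along a delimited string of the IDs already visited, and the recursive step filters out anything it has seen before (SQLite again, with its `instr()` standing in for a CONTAINS-style check; the schema is hypothetical):

```python
import sqlite3

# Hypothetical memberships table containing the cycle from the diagram:
# Group A -> Group B -> Group C -> Group A, with Group C also containing Bob.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE memberships (parent_id TEXT, child_id TEXT);
    INSERT INTO memberships VALUES
        ('Group A', 'Group B'),
        ('Group B', 'Group C'),
        ('Group C', 'Group A'),
        ('Group C', 'Bob');
""")

# Each row carries a '|'-delimited path of visited IDs so the recursion
# can stop when it meets a node it has already seen.
rows = conn.execute("""
    WITH RECURSIVE members(id, path) AS (
        SELECT child_id, '|' || child_id || '|'
        FROM memberships WHERE parent_id = 'Group A'
        UNION ALL
        SELECT m.child_id, members.path || m.child_id || '|'
        FROM memberships m
        JOIN members ON m.parent_id = members.id
        WHERE instr(members.path, '|' || m.child_id || '|') = 0
    )
    SELECT DISTINCT id FROM members
""").fetchall()
print(sorted(r[0] for r in rows))  # ['Bob', 'Group A', 'Group B', 'Group C']
```

Note that the path strings grow with depth, and every recursive row pays a linear scan over its path: that is exactly the memory and CPU cost described above.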
Things we want to SELECT. It’s pretty obvious when I put it like this, but visualizing this is important for arriving at the final solution.
Going by this visualization, the first thing we need to do is partition the graph into two parts: the acyclic graph and the cycles. This can be achieved using Tarjan’s Strongly Connected Components Algorithm, which can identify the cycles in a graph in O(N+E) time (where N is the number of nodes, and E is the number of edges).
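A compact recursive sketch of Tarjan’s algorithm: every strongly connected component with more than one node is a cycle we need to record (a single node with a self-loop is also a cycle; this simplified version ignores that case).

```python
def tarjan_scc(graph):
    """graph maps each node to the nodes its outgoing edges point at.
    Returns a list of strongly connected components."""
    index_of, lowlink, on_stack = {}, {}, set()
    stack, components, counter = [], [], [0]

    def strongconnect(v):
        index_of[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index_of:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index_of[w])
        if lowlink[v] == index_of[v]:
            # v is the root of a component: pop everything above it.
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index_of:
            strongconnect(v)
    return components

# The cycle from the diagram: A -> B -> C -> A, plus C -> Bob.
edges = {'Group A': ['Group B'], 'Group B': ['Group C'],
         'Group C': ['Group A', 'Bob'], 'Bob': []}
cycles = [c for c in tarjan_scc(edges) if len(c) > 1]
print(sorted(cycles[0]))  # ['Group A', 'Group B', 'Group C']
```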
If we’re only interested in traversal queries, we can further simplify the structure we are concerned with. Notice that starting at any node in a cycle will result in visiting every node in that cycle; the internal structure of the cycle can be discarded. Using these two optimizations, we’d arrive at a directed acyclic graph (DAG) and a list of cycles:
A directed graph and its cycles.
By inserting the data into SQL in this format, we can change our query to the following:
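With a hypothetical `edges` table for the condensed graph and a `cycles(cycle_id, node_id)` table recording which nodes belong to each cycle, the traversal might be sketched like this (SQLite; the A/B/C cycle from the diagram is collapsed into a single “Cycle 1” node, which is an assumed naming scheme):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (parent_id TEXT, child_id TEXT);
    CREATE TABLE cycles (cycle_id TEXT, node_id TEXT);
    -- The A -> B -> C -> A cycle is collapsed into the node 'Cycle 1'.
    INSERT INTO edges VALUES ('Cycle 1', 'Bob');
    INSERT INTO cycles VALUES
        ('Cycle 1', 'Group A'),
        ('Cycle 1', 'Group B'),
        ('Cycle 1', 'Group C');
""")

rows = conn.execute("""
    WITH RECURSIVE
    -- If the start node sits inside a cycle, start at the cycle node.
    start(id) AS (
        SELECT COALESCE(
            (SELECT cycle_id FROM cycles WHERE node_id = 'Group A'),
            'Group A')
    ),
    reach(id) AS (
        SELECT id FROM start
        UNION ALL
        SELECT e.child_id FROM edges e JOIN reach ON e.parent_id = reach.id
    )
    -- Expand cycle nodes back into their member groups.
    SELECT DISTINCT COALESCE(c.node_id, reach.id) AS member
    FROM reach LEFT JOIN cycles c ON c.cycle_id = reach.id
""").fetchall()
print(sorted(r[0] for r in rows))  # ['Bob', 'Group A', 'Group B', 'Group C']
```

The recursion is still there, but because the condensed graph is acyclic, no visited-ID bookkeeping is needed.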
This neatly solves the problem but can still be further improved.
ORDPATH (or hierarchyid) is an ingenious invention that allows you to efficiently store and query trees in SQL. Microsoft published a paper describing how it works, but it approximates the following approach: give each node in the tree a two-digit number, prefixed with its parent’s number. In other words:
The tree structure can then be removed and inserted into a table:
This table can be queried “downwards” (finding descendants) by asking for records that are numerically between the ORDPATH for the queried node and the ORDPATH of its next sibling (even if that sibling does not exist in the table). For example, 10101 (Bob) falls between 10100 (Sales) and 10200. All the records fall between 10000 (Everyone) and 20000. Assuming that the ORDPATH is the clustered index, this is a type of query that SQL can answer extremely quickly. As a bonus, we’d eliminate the imperative concepts in our query (the CTE).
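Using the values from the example above (with the ORDPATHs stored as fixed-width strings so that numeric order and string order coincide; the table layout is an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Fixed-width ORDPATHs: Everyone = 1, Sales = 1|01, Bob = 1|01|01,
    -- zero-padded to five digits.
    CREATE TABLE nodes (ordpath TEXT PRIMARY KEY, name TEXT);
    INSERT INTO nodes VALUES
        ('10000', 'Everyone'),
        ('10100', 'Sales'),
        ('10101', 'Bob');
""")

# Descendants of Sales: everything from its ORDPATH (10100) up to, but
# not including, the ORDPATH its next sibling would have (10200).
rows = conn.execute("""
    SELECT name FROM nodes
    WHERE ordpath >= '10100' AND ordpath < '10200'
    ORDER BY ordpath
""").fetchall()
print([r[0] for r in rows])  # ['Sales', 'Bob']
```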
Every directed acyclic graph (DAG) can be represented by one or more trees, if we allow nodes to appear multiple times across the trees. Wherever we find a merge in a DAG (such as the one at Bob), we create a new tree. Our final representation would be:
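One way to sketch that split: any node with more than one parent (or no parent at all) becomes the root of its own tree, and other trees keep only a leaf reference to it. The function below is a simplified illustration, not the real implementation, and the “Devs” group in the sample data is invented to create a merge at Bob:

```python
def split_into_forest(dag):
    """dag maps each node to its children. Returns {root: edge_list},
    where any node with zero parents or more than one parent roots
    its own tree."""
    indegree = {node: 0 for node in dag}
    for children in dag.values():
        for child in children:
            indegree[child] = indegree.get(child, 0) + 1
    roots = {node for node, d in indegree.items() if d != 1}
    forest = {}
    for root in roots:
        edges, frontier = [], [root]
        while frontier:
            node = frontier.pop()
            for child in dag.get(node, []):
                edges.append((node, child))
                # Other roots stay in this tree as leaf references only.
                if child not in roots:
                    frontier.append(child)
        forest[root] = edges
    return forest

# Everyone contains Sales and Devs, both of which contain Bob;
# the merge at Bob starts a new (single-node) tree.
dag = {'Everyone': ['Sales', 'Devs'], 'Sales': ['Bob'],
       'Devs': ['Bob'], 'Bob': []}
forest = split_into_forest(dag)
print(sorted(forest))  # ['Bob', 'Everyone']
```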
A representative forest and its cycles.
The Venn diagram now looks like this:
The Venn diagram that includes the representative forest.
Our query becomes:
This query no longer contains recursion and can be answered using a series of index scans: it is incredibly efficient.
The tree can also be stored in NoSQL systems that support comparison operators on keys, such as Azure Table Storage.
What We’ve Lost
This is an extremely efficient on-disk structure for graph traversal, but we’ve lost something in the process. By discarding the internal structure of cycles, we can no longer ask the graph for nth-level traversals, for example, “find me the groups that the user is contained by, but don’t go more than 5 groups deep.”
Full traversal is a really common query in the space that K2 operates in; we need traversal to determine the rights of users, determine who to route work to and so on. Our API conveniently only ever supported 1st-level and full traversal, so we could store both the original graph and this efficient representation to cover all our bases.
What We’ve Gained
While this structure is much faster on disk (or over the network), it is also faster in memory. Recursively chasing pointers around the heap wreaks havoc with the CPU caches and can even result in page faults. No matter how we choose to store the data, this format is probably the best solution (but always measure).
We achieved this by correctly identifying the problem as a graph problem, and by representing the data as a forest that is efficient to ascend or descend.