Spatial Index: R Trees

If you have been following the Spatial Index Series, it started with the need for multi-dimensional indexes and an introduction to space-filling curves, followed by a deep dive into grid systems (GeoHash and Google S2) and tessellation (Uber H3).

In this post, let's explore the R-Tree data structure (data-driven structure), which is popularly used to store multi-dimensional data, such as data points, segments, and rectangles.

1. R-Trees and Rectangles

For example, consider the plan of a university layout below. We can use the R-Tree data structure to index the buildings on the map.

To do so, we can place rectangles around a building or group of buildings and then index them. Suppose there's a much bigger section of the map signifying a larger department, and we need to query all the buildings within a department. We can use the R-Tree to find all the buildings within (partially or fully contained) the larger section (query rectangle).

Figure 0: Layout with MBRs and Query Rectangle

In the above figure, the red rectangle represent the query rectangle, used to ask the R-Tree to get all the buildings that intersect with the query rectangle (R2, R3, R6).

2. R-Tree - Intuition

The main idea in R-trees is the minimum bounding rectangles. We'll come to what "minimum" implies in a second.

The inner node of an R-tree is as follows: We start with the root node, representing the large landscape. The inner nodes are guideposts that hold pointers to the child nodes we need to go down to in the tree. i.e. each entry of a node points to an area of the data space (described by MBR).

Figure 1: R-Tree Inner Node

For instance, think of a Binary Search Tree. From the root node, we make a decision to go left or right. The R-tree is similar, but more of an M-way tree, where each node can have multiple entries as seen above. Instead of having integer or string values (one-dimensional), the inner nodes consist of entries (multi-dimensional). In the example, there are 4 entries of rectangles.

2.1. MBR - Minimum Bounding Rectangle

Figure 2: R-Tree Minimum Bounding Rectangle

Minimum Bounding Rectangles, R1, R2, R3, R4, contain the objects which are stored in the sub-trees in a minimal way. For instance, say we have 3 rectangles R11, R12, R13. R1 is the smallest rectangle that can be created to completely contain all three rectangles, hence the name "minimum."

2.2. Search Process and Overlapping MBRs

The search process in an R-tree is simple: for a query object/query rectangle; at an inner node, it is the decision to check if any of the entries in a node intersect with the query rectangle.

Figure 3: R-Tree Query Rectangle(s)

For example, consider a query rectangle Q1. It's clear that R1 intersects with Q1, so we would follow down the tree from R1. Similarly, Q2 intersects with R2. However, in scenarios where the query rectangle intersects with multiple entries/rectangles (Q3 with R2, R3, R4), all the intersecting rectangles have to be searched. This can happen if the indexing is not optimized and has to be avoided as it defeats the purpose of indexing in the first place.

2.3. R-Tree - Properties

Here's a bit of a larger example of an R-tree.

Figure 4: R-Tree Level-2

Every node in an R-tree has between m and M entries. More specifically, each node has between m ≤ ⌈M/2⌉ and M entries. The node has at least 2 entries unless it's a leaf.

By now, if you also read the blog post on B-Trees and B+ Trees, you’ll see that an R-Tree is quite similar to a B+ Tree. It uses a similar idea to split the space at each (inner) node into multiple areas. However, B+ Trees mostly work with one-dimensional data, and the data ranges do not overlap.

3. Search using an R-Tree

Now that we know the idea behind R-Trees and the search process, Let's put a clear-cut definition to the search process:

Goal: Find all rectangles that overlap with the given rectangle S (query rectangle).
Let T denote the node (at the current level/sub-tree).
S1 (Search in sub-trees): If T is not a leaf, check all the entries E in T. If the MBR of E overlaps with S, then continue the search in the sub-tree to which E points.
S2 (Search in Leaves): If T is a leaf node, inspect all entries of E. All entries that overlap with S are part of the query result.

4. Inserting to an R-Tree

Coming to inserts, consider a leaf node (MBR) as shown below with 3 entries/objects, R1, R2, and R3. Let's assume that the leaf is not full yet (MBR has a threshold capacity on the number of objects it can hold).

Say, there's a new rectangle R4 coming and it has to be inserted inside the leaf node. As you can see, in order to capture the new objects, the MBR is adjusted, i.e., enlarged to minimally contain R1 to R4. Going on and inserting another object R5, the MBR is once again adjusted.

Figure 5: R-Tree Insert (Adjusting MBR)

On an insert, when the MBR is updated, i.e., contains more objects, the new MBR has to be updated not only for the node but also propagated to other lower levels and potentially (not always) up to the root node. This is to reflect that the sub-tree now contains more information.

4.1. Choice for Insert

Unlike the example, it's not always clear in which node/sub-tree an object should be inserted. Here: MBR1, MBR2, or MBR3.

Figure 6: R-Tree Choice for Insert (1)

The question is, in which MBR should we insert R1 into? Setting aside any rules or justification for a second, R1 can be inserted into any MBR.

Figure 7: R-Tree Choice for Insert (2)

Inserting into MBR1 would need to immensely grow/expand MBR1 to fully contain R1. The implication? Say there's a query rectangle Q1. After leading down the sub-tree to MBR1, we find that there's nothing (no objects). This is because, to contain R1, we have expanded MBR1 so much that there is a lot of space without any objects. So, it's fair to conclude that one criterion to add is to insert into MBRs that need to expand the least.

Figure 8: R-Tree Choice for Insert (3)

Going by that, inserting into MBR2 is a better option as opposed to MBR1. Similarly, MBR3 may not be a bad option either, depending on the expansion factor.

Stating the obvious (for implementation), the minimum-bounding-rectangle (MBR) is defined as the rectangle that has the maximal and minimal values of all rectangles in each dimension.

Figure 9: R-Tree MBR Implementation

Summarizing the insertion into R-Tree so far:

In principle, a new rectangle can be inserted into any node.
If the node is full, a split needs to be performed (more on that in the next section).
If not, the MBR may have to be adjusted/expanded to accommodate new objects (as seen ).

Observations:

Extending bounding boxes is a critical factor for the performance of the R-Tree.
Try to minimize overlap (of the MBRs).
Try to minimize spread (the size of the MBR, as seen in section 4.1).

4.2. Insert - Algorithm

Here's the algorithm proposed by the author of the R-Tree paper "A Dynamic Index Structure for Spatial Searching," by A. Guttman, 1984.

The rest of this section is mostly going over snippets of code and explanations from this paper, but with more examples and visualization.

Algorithm: Search for leaf to insert (ChooseLeaf):

CS1: Let N be the root.
CS2:
- If N is a leaf, return N.
- If N is not a leaf: Search for an entry in N whose rectangle (MBR) requires the least area increase in order to accommodate the new rectangle. In the case where there are multiple options, consider an entry that has the smallest (in area) MBR.
CS3: Let N be the child node, then continue to step CS2 (repeat).

A much simpler example of 8 objects, each object with one multidimensional attribute (Range or line-segments on x-axis) and one identity (Color). To insert these objects one by one in an empty R-tree of degree M = 3 (maximum number of entries at each node) and m ≥ M/2 (minimum number of entries at each node = 2).

Figure 10: R-Tree Insertion Example

Observation: in the case where the selected leaf is already full, a splitting operation is performed. Let's understand the overflow problem better (the split problem):

4.3. Handling Overflow

In the case a node/leaf is full and a new entry cannot be stored anymore, a split needs to be performed, just as for a B+ Tree. The difference is that the split can be done arbitrarily and not only in the middle as for a B+ Tree.

Figure 11: R-Tree Insertion: Overflow

4.3.1. The Split Problem

Given M + 1 entries in a node (exceeded maximum capacity per node), which two subsets of these entries should be considered as new and old nodes?

To better understand the split problem, let's take a step back and consider 4 rectangles (R1, R2, R3, R4) that need to be assigned to two nodes (MBRs) in a meaningful way.

Figure 12: R-Tree Insertion: Split Problem

Why is one better than the other? As mentioned before (Section 4.1), the area of expansion of the poor split is much larger compared to the good split (despite the overlap). This leads to more empty spaces in the node/MBR that do not have any objects.

A realistic use case for an R-Tree is M = 50 and there are 2^(M-1) possibilities. Hence, a naive approach to look at all possible subsets and choose the best one is not practical (too expensive!).

4.3.2. The Split Problem: Quadratic Cost

Search for split with smallest possible area
Cost is Quadratic in M and linear in number of dimensions d.
Idea:

Search for pairs of entries that would cause the largest MBR area if placed in the same node. Then put these entries in two different nodes
Then: Consider all remaining entries and consider the one (among the 2 nodes) for which the increase in area (of MBR) has the largest possible difference between the two nodes.
This entry is assigned to the node with the smallest increase. Repeat until all entries are assigned

Figure 13: R-Tree Insertion: Choosing MBR

In this example, two nodes, MBR1 and MBR2, are created. R1 and R2 in the same MBR would lead to creating the largest MBR. R3 is then inserted into MBR1 and not MBR2, as the area increase of MBR1 is smaller compared to MBR2.

Method "AdjustTree," is called whenever a new entry is inserted. It is responsible for adapting the parent's MBR and propagating the changes bottom up, handling splits as well as changes to MBRs. In the worst case, the propagation can be up to the root node.

5. R-Tree Variants

R-trees do not guarantee good worst-case performance, but generally speaking, they perform well with real-world data. Addressing this specific problem, the Priority R-tree is a worst-case asymptotically optimal alternative to the spatial tree R-tree, which is essentially a hybrid between a k-dimensional tree (k-d tree) and an R-tree.

Another commonly used variant is the R*-Tree, which uses the same algorithm as the regular R-tree for query and delete operations. However, while inserting, the R*-tree uses a combined strategy: for leaf nodes, overlap is minimized, and for inner nodes, enlargement and area are minimized, making the tree construction slightly more expensive.

The R+-Tree, on the other hand, solves one main problem to ensure nodes do not overlap with each other, leading to better point query performance. However, it does so by inserting an object into multiple leaves if necessary, which is a disadvantage due to duplicate entries and larger tree size.

The Hilbert R-Tree uses space-filling curves, specifically the Hilbert curve, to impose a linear ordering on the data rectangles. It has two variants: Packed Hilbert R-trees, suitable for static databases in which updates are very rare, and dynamic Hilbert R-trees, suitable for dynamic databases where insertions, deletions, or updates may occur in real time.

6. Conclusion

R-trees have come a long way since the first paper was published in 1984. Today, their applications span over multi-dimensional indexes, computer graphics, video games, spatial data management systems, and many more.

On the flip side, R-trees can degrade badly with discrete data. Hence, it's highly recommended to understand the data representation before using R-trees. R-trees are also relatively slow when there's a very high mutation rate, i.e., where the index changes often; this is because of the higher cost for constructing and updating the index (due to tree rebalancing) and they are more optimized for various search operations. Lastly, R-trees can be a poor algorithm choice when primarily dealing with points as opposed to polygons/regions.

7. References

[1] A. Guttman, "A Dynamic Index Structure for Spatial Searching," presented at the ACM SIGMOD International Conference on Management of Data, 1984. [Online]. Available: https://www.researchgate.net/publication/220805321_A_Dynamic_Index_Structure_for_Spatial_Searching.
[2] "R-Tree," Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/R-tree.
[3] "B-Trees and B+ Trees," PyBlog. [Online]. Available: https://www.pyblog.xyz/b-trees-b-plus-trees.
[4] "Spatial Index R-Tree," YouTube, https://www.youtube.com/watch?v=U0jUvvQkaFw.