Data Structures - Trees
This post focuses on defining the relevant terms and explaining how trees are applied in databases.
The difference between the two common definitions is interesting. A tree (data structure) is:
In computer science, a tree is a widely used abstract data type that simulates a hierarchical tree structure, with a root value and subtrees of children with a parent node, represented as a set of linked nodes.
While a tree (graph theory) is:
In graph theory, a tree is an undirected graph in which any two vertices are connected by exactly one path, or equivalently a connected acyclic undirected graph. A forest is an undirected graph in which any two vertices are connected by at most one path, or equivalently an acyclic undirected graph, or equivalently a disjoint union of trees.
The graph theory definition is very precise, while the data structure definition is a bit looser.
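To make the data-structure definition concrete, here is a minimal Python sketch of a node with a value and any number of children. The names (`Node`, `add_child`) are purely illustrative:

```python
class Node:
    """A node in a general tree: one value plus any number of child subtrees."""
    def __init__(self, value):
        self.value = value
        self.children = []  # empty list means this node is a leaf

    def add_child(self, child):
        self.children.append(child)
        return child

# A tiny hierarchy: a root with two subtrees, one of which has its own child
root = Node("root")
a = root.add_child(Node("a"))
root.add_child(Node("b"))
a.add_child(Node("a1"))
```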
The B-Tree and the 1970 Paper Introducing It
The definition of a B-tree is a bit more precise, though:
In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for nodes with more than two children. Unlike other self-balancing binary search trees, the B-tree is well suited for storage systems that read and write relatively large blocks of data, such as disks. It is commonly used in databases and file systems.
So the B-tree is one implementation of a self-balancing search tree (a minimal sketch of a B-tree node follows the list below). There are other implementations, such as:
- AVL tree
- red-black tree
- splay tree
- and many others
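Unlike a binary search tree node, a B-tree node holds a whole sorted list of keys and one more child pointer than it has keys. Here is a minimal sketch of that shape and of searching down it - an illustrative outline only, with insertion, splitting and balancing omitted:

```python
import bisect

class BTreeNode:
    """A B-tree node: a sorted list of keys, plus (for internal nodes)
    one more child pointer than there are keys."""
    def __init__(self, keys=None, children=None):
        self.keys = keys or []          # always kept in sorted order
        self.children = children or []  # empty for a leaf node

def search(node, key):
    """Descend from `node` towards `key`; cost is one node per tree level."""
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node                     # key found in this node
    if not node.children:
        return None                     # reached a leaf without finding it
    return search(node.children[i], key)

# A two-level example: root with two keys and three children
root = BTreeNode(keys=[10, 20], children=[
    BTreeNode(keys=[1, 5]),
    BTreeNode(keys=[12, 15]),
    BTreeNode(keys=[25, 30]),
])
print(search(root, 15) is not None)  # True
print(search(root, 13))              # None
```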
So why are B-trees in particular talked about so much for databases? As the definition above puts it, it is because they are "well suited for storage systems that read and write relatively large blocks of data, such as disks". Why is that? It is explained in the 1970 paper that introduced them: it comes down to the concept of "pages", which are the nodes of the tree.
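To get a rough feel for why big nodes matter: if each page holds hundreds of keys, the tree stays extremely shallow, so only a handful of page reads are needed per lookup. The numbers below (8 KB pages, 16 bytes per entry) are assumptions for illustration only, not figures from the paper:

```python
import math

PAGE_SIZE = 8 * 1024     # assumed page size in bytes
ENTRY_SIZE = 16          # assumed bytes per key + child pointer
keys_per_page = PAGE_SIZE // ENTRY_SIZE   # 512 keys fit in one page

rows = 100_000_000
# Tree height is roughly log base keys_per_page of the number of rows,
# i.e. the number of pages read to find any single key.
height = math.ceil(math.log(rows, keys_per_page))
print(keys_per_page, height)  # 512 3 -> about three page reads for 100M rows

# Compare with a binary tree, where each node holds a single key:
print(math.ceil(math.log2(rows)))  # 27 -> about 27 node reads
```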
1970 Paper
When I first read this paper I kept referring back to its first main diagram, and I had to work out what was what - it is easy to assume that the stored data itself is shown on the diagram, but it is not. It seems that:
- there are nodes which are the rectangles
- there are keys inside nodes
- there are page numbers next to the nodes
- the lines between nodes represent the pointers, drawn diagrammatically
- the data or "information" is omitted from the diagram.
So far, so good. It looks like a familiar tree, and the nodes are pages - that was the big takeaway for me from this paper.
But digging deeper, there are of course many more details. For instance, one page is represented as a long row divided into sections of:
- pointers to child pages
- keys
- the data inside an index element
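Roughly, then, a page in the paper interleaves child page pointers with (key, data) entries: a leading pointer for keys smaller than the first key, and then each entry carries a pointer for keys larger than its own. A small sketch of that layout - the field names are mine, not the paper's notation:

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    key: int
    data: object             # the "information" associated with the key
    right_child: int = None  # page number of the subtree holding larger keys

@dataclass
class Page:
    page_number: int
    leftmost_child: int = None  # page number of the subtree holding smaller keys
    entries: list = field(default_factory=list)  # kept sorted by key

# One internal page with two entries and three child pages (4, 7 and 9)
p = Page(page_number=3, leftmost_child=4, entries=[
    Entry(key=10, data="...", right_child=7),
    Entry(key=20, data="...", right_child=9),
])
```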
The paper talks about plenty of other things, like page splitting, which we won't go into here.
Databases and Trees
Before we jump into how trees are used in databases, we should ask why they are used, and why other data structures are not. For example, databases are not implemented with, say, dictionaries (hash maps), because the following are true for trees but not for dictionaries:
- the natural order of keys is maintained (and order is important for performance, for indices and for using the SQL "order by" for example)
- storage utilisation is more than 50%
- storage is requested and released as a file grows/contracts
- batch operations are very efficient, as access is sequential
There are whole fields of study devoted to this - the above is just a simple summary.
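To illustrate the ordering point in particular: because a tree keeps its keys sorted, a range query (or an "order by") is just one lookup followed by a sequential walk, while a hash-based dictionary offers no useful key order and forces a full scan. A small sketch of the idea, using a sorted list to stand in for a tree's in-order traversal:

```python
import bisect

# Keys in sorted order, as an in-order walk of a search tree would yield them
sorted_keys = [3, 7, 12, 19, 25, 31, 40]

def range_query(keys, low, high):
    """Return every key in [low, high]: one binary search, then a sequential slice."""
    start = bisect.bisect_left(keys, low)
    end = bisect.bisect_right(keys, high)
    return keys[start:end]

print(range_query(sorted_keys, 10, 30))  # [12, 19, 25]

# A dict (hash map) keeps no sorted key order, so the same query must touch every key
unordered = {k: None for k in [31, 3, 40, 7, 25, 12, 19]}
print(sorted(k for k in unordered if 10 <= k <= 30))  # [12, 19, 25], via a full scan
```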
Implementations
SQL Server has its own implementation of all of this - for example, each page stores its page number, the amount of free space and so on. You can inspect this data in various ways (for example with DBCC PAGE or the sys.dm_db_* dynamic management views). The main points are that:
- a record is the smallest data structure - each row is a record, for example, but so are index entries and metadata
- pages are the smallest unit of data that is cached in memory
- there are various page types for indices, metadata and so on, but in general a page is 8KB and has a header and a body. The body has a record offset array at the end, which records where each record sits on the page and dictates the logical order of records
- pages are organised into heaps (unordered collections of pages) and indices (a B-tree of pages)
- a clustered index stores the actual row data in the leaves of its tree
- a non-clustered index stores pointers to the data rows in the leaves of its tree
- a table without a clustered index is stored as a heap
- fragmentation happens when the logical page order no longer matches the physical order, due to, for example, page splitting
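As a rough mental model of the page layout described above - header fields, record bodies, and an offset (slot) array at the end that fixes the logical order - here is a small sketch. The field names are illustrative only, not SQL Server's actual on-disk format:

```python
from dataclasses import dataclass, field

PAGE_SIZE = 8 * 1024  # SQL Server pages are 8 KB

@dataclass
class PageSketch:
    page_id: int
    page_type: str = "data"  # data, index, metadata, ...
    records: list = field(default_factory=list)     # record payloads, in arrival order
    slot_array: list = field(default_factory=list)  # indexes into `records`, in key order

    def add_record(self, payload, key):
        """Append the record and keep the slot array in logical (key) order."""
        self.records.append((key, payload))
        self.slot_array = sorted(range(len(self.records)),
                                 key=lambda i: self.records[i][0])

p = PageSketch(page_id=101)
p.add_record("row for key 20", key=20)
p.add_record("row for key 10", key=10)
print([p.records[i] for i in p.slot_array])  # logical order: key 10, then key 20
```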
Other B-tree variants (such as the B+-trees commonly used in databases) also have pointers between nodes at the same height, which is useful for sequential range scans.
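A sketch of what those same-level (sibling) pointers buy you: once a range scan has located its starting leaf, it can follow `next_leaf` links across the bottom of the tree instead of going back up through parent nodes. The names here are illustrative:

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys       # sorted keys stored in this leaf
        self.next_leaf = None  # pointer to the right-hand sibling at the same level

def scan_range(leaf, low, high):
    """Yield every key in [low, high], walking sibling pointers left to right."""
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return
            if k >= low:
                yield k
        leaf = leaf.next_leaf

# Link three leaves together and scan a range that spans all of them
a, b, c = Leaf([1, 4]), Leaf([7, 9]), Leaf([12, 15])
a.next_leaf, b.next_leaf = b, c
print(list(scan_range(a, 4, 12)))  # [4, 7, 9, 12]
```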
And that's it - most of the concepts needed to understand the SQL Server implementation of trees have now been defined.