Trees are ubiquitous in mathematics, computer science, data sciences, finance, and in many other attributes. Trees are especially useful when we are facing hierarchical data. For example, trees are used:
For more details, see the applications vignette by typing
vignette("applications", package = "data.tree")
Tree-like structures are already used in R. For example, environments
can be seen as nodes in a tree. And CRAN provides numerous packages that
deal with tree-like structures, especially in the area of decision
theory. Yet, there is no general purpose hierarchical data structure
that could be used as conveniently and generically as, say,
data.frame.
As a result, people often try to resolve hierarchical problems in a tabular fashion, for instance with data.frames. But often, hierarchies don’t marry with tables, and various workarounds are usually required.
data.treeThis package offers an alternative. The data.tree
package lets you create hierarchies, called data.tree
structures. The building block of theses structures are
Node objects. The package provides basic traversal, search,
and sort operations, and an infrastructure for recursive tree
programming. You can decorate Nodes with your own
attributes and methods, so as to extend the package to your needs.
The package also provides convenience methods for neatly printing and
plotting trees. It supports conversion from and to
data.frames, lists, and other tree structures
such as dendrogram, phylo objects from the ape
package, igraph, and other packages.
Technically, data.tree structures are bi-directional,
ordered trees. Bi-directional means that you can navigate from parent to
children and vice versa. Ordered means that the sort order of the
children of a parent node is well-defined.
data.tree basicsdata.tree structure: a tree,
consisting of multiple Node objects. Often, the entry point
to a data.tree structure is the root NodeNode: both a class and the basic
building block of data.tree structures?attr,
which have a different meaning. Many methods and functions have an
attribute arg, which can refer to a an active, a field or a
method. For example, see ?GetNode that can be called like an attribute, but behaves like
a function without arguments. For example:
node$positionNode,
e.g. node$cost <- 2500Node in this context). Many methods are available in OO
style (e.g. node$Revert()) or in traditional style
(Revert(node))Node inherits e.g. an
attribute from one of its ancestors. For example, see ?Get,
?SetNodeStyleThere are different ways to create a data.tree
structure. For example, you can create a tree
programmatically, by conversion from
other R objects, or from a file.
Let’s start by creating a tree programmatically. We do this by
creating Node objects, and linking them together so as to
define the parent-child relationships.
In this example, we are looking at a company, Acme Inc., and the tree reflects its organisational structure. The root (level 1) is the company. On level 2, the nodes represent departments, and the leaves of the tree represent projects that the company is considering for next year:
library(data.tree)
acme <- Node$new("Acme Inc.")
  accounting <- acme$AddChild("Accounting")
    software <- accounting$AddChild("New Software")
    standards <- accounting$AddChild("New Accounting Standards")
  research <- acme$AddChild("Research")
    newProductLine <- research$AddChild("New Product Line")
    newLabs <- research$AddChild("New Labs")
  it <- acme$AddChild("IT")
    outsource <- it$AddChild("Outsource")
    agile <- it$AddChild("Go agile")
    goToR <- it$AddChild("Switch to R")
print(acme)##                           levelName
## 1  Acme Inc.                       
## 2   ¦--Accounting                  
## 3   ¦   ¦--New Software            
## 4   ¦   °--New Accounting Standards
## 5   ¦--Research                    
## 6   ¦   ¦--New Product Line        
## 7   ¦   °--New Labs                
## 8   °--IT                          
## 9       ¦--Outsource               
## 10      ¦--Go agile                
## 11      °--Switch to RAs you can see from the previous example, each Node is
identified by its name, i.e. the argument you pass into the
Node$new(name) constructor. The name needs to be
unique among siblings, such that paths to Nodes
are unambiguous.
Node inherits from R6 reference class. This
has the following implications:
Node in OO style,
e.g. acme$Get("name")Node exhibits reference semantics. Thus,
multiple variables in R can point to the same Node, and
modifying a Node will modify it for all referencing
variables. In the above code example, both acme$IT and
it reference the same object. This is different from the
value semantics, which is much more widely used in R.data.frameCreating a tree programmatically is useful especially in the context
of algorithms. However, most times you will create a tree by conversion.
This could be by conversion from a nested list-of-lists, by conversion
from another R tree-structure (e.g. an ape phylo), or by
conversion from a data.frame. For more details on all the
options, type ?as.Node and refer to the See Also
section.
One of the most common conversions is the one from a
data.frame in table format. The following code illustrates
this. We load the GNI2014 data from the treemap package. This
data.frame is in table format, meaning that each row will
represent a leaf in the data.tree structure:
library(treemap)
data(GNI2014)
head(GNI2014)##   iso3          country     continent population    GNI
## 3  BMU          Bermuda North America      67837 106140
## 4  NOR           Norway        Europe    4676305 103630
## 5  QAT            Qatar          Asia     833285  92200
## 6  CHE      Switzerland        Europe    7604467  88120
## 7  MAC Macao SAR, China          Asia     559846  76270
## 8  LUX       Luxembourg        Europe     491775  75990Let’s convert that into a data.tree structure! We start
by defining a pathString. The pathString describes the
hierarchy by defining a path from the root to each leaf. In this
example, the hierarchy comes very naturally:
GNI2014$pathString <- paste("world", 
                            GNI2014$continent, 
                            GNI2014$country, 
                            sep = "/")Once our pathString is defined, conversion to Node is very easy:
population <- as.Node(GNI2014)
print(population, "iso3", "population", "GNI", limit = 20)##                                 levelName iso3 population    GNI
## 1  world                                               NA     NA
## 2   ¦--North America                                   NA     NA
## 3   ¦   ¦--Bermuda                         BMU      67837 106140
## 4   ¦   ¦--United States                   USA  313973000  55200
## 5   ¦   ¦--Canada                          CAN   33487208  51630
## 6   ¦   ¦--Bahamas, The                    BHS     309156  20980
## 7   ¦   ¦--Trinidad and Tobago             TTO    1310000  20070
## 8   ¦   ¦--Puerto Rico                     PRI    3971020  19310
## 9   ¦   ¦--Barbados                        BRB     284589  15310
## 10  ¦   ¦--St. Kitts and Nevis             KNA      40131  14920
## 11  ¦   ¦--Antigua and Barbuda             ATG      85632  13300
## 12  ¦   ¦--Panama                          PAN    3360474  11130
## 13  ¦   ¦--Costa Rica                      CRI    4253877  10120
## 14  ¦   ¦--Mexico                          MEX  111211789   9870
## 15  ¦   ¦--Grenada                         GRD      90739   7910
## 16  ¦   ¦--St. Lucia                       LCA     160267   7260
## 17  ¦   ¦--Dominica                        DMA      72660   6930
## 18  ¦   ¦--St. Vincent and the Grenadines  VCT     104574   6610
## 19  ¦   ¦--Dominican Republic              DOM    9650054   6040
## 20  ¦   °--... 7 nodes w/ 0 sub                        NA     NA
## 21  °--... 6 nodes w/ 171 sub                          NA     NAThis is a simple example, and more options are available. Type
?FromDataFrameTable for all the details.
Often, trees are created from one of many file formats. When
developing this package, We opted for a multi-step approach, meaning
that you first import the file into one of the well-known R data
structures. Then you convert these into a data.tree
structure. For example, typical import patterns could be:
?read.csv) ->
data.tree (?as.Node.data.frame)?ape::read.tree) ->
data.tree (?as.Node.phylo )?read.csv)
-> data.tree (c.f. ?FromDataFrameNetwork)?yaml::yaml.load) ->
data.tree (?as.Node.list)?jsonlite::fromJSON)
-> data.tree (?as.Node.list)If you have a choice, we recommend you consider yaml format to store and share your hierarchies. It is concise, human-readable, and very easy to convert to a data.tree. An example is provided here for illustration. The data represents what platforms and OS versions a group of students use:
library(yaml)
yaml <- "
name: OS Students 2014/15
OS X:
  Yosemite:
    users: 16
  Leopard:
    users: 43
Linux:
  Debian:
    users: 27
  Ubuntu:
    users: 36
Windows:
  W7:
    users: 31
  W8:
    users: 32
  W10:
    users: 4
"
osList <- yaml.load(yaml)
osNode <- as.Node(osList)
print(osNode, "users")##              levelName users
## 1  OS Students 2014/15    NA
## 2   ¦--OS X               NA
## 3   ¦   ¦--Yosemite       16
## 4   ¦   °--Leopard        43
## 5   ¦--Linux              NA
## 6   ¦   ¦--Debian         27
## 7   ¦   °--Ubuntu         36
## 8   °--Windows            NA
## 9       ¦--W7             31
## 10      ¦--W8             32
## 11      °--W10             4In cases where your leaf elements have no attributes, you might want
to interpret them as nodes, and not as attributes. In such cases, you
can use interpretNullAsList = TRUE to convert these into
Nodes (instead of attributes).
For example:
library(yaml)
yaml <- "
name: OS Students 2014/15
OS X:
  Yosemite:
  Leopard:
Linux:
  Debian:
  Ubuntu:
Windows:
  W7:
  W8:
  W10:
"
osList <- yaml.load(yaml)
osNode <- as.Node(osList, interpretNullAsList = TRUE)
osNode$printFormatters <- list(h = "\u2500" , v = "\u2502", l = "\u2514", j = "\u251C")
print(osNode, "users")##              levelName users
## 1  OS Students 2014/15    NA
## 2  ├─OS X                 NA
## 3  │├─Yosemite            NA
## 4  │└─Leopard             NA
## 5  ├─Linux                NA
## 6  │├─Debian              NA
## 7  │└─Ubuntu              NA
## 8  └─Windows              NA
## 9  ├─W7                   NA
## 10 ├─W8                   NA
## 11 └─W10                  NAAs seen above, a data.tree structure is composed of
Node objects, and the entry point to a
data.tree structure is always a Node, often
the root Node of a tree.
There are different types of methods:
Nodes, such as e.g. Node$isRootNodes, such as
e.g. Node$AddChild(name)Clone(node).Actives look and feel like attributes, but they are dynamically
evaluated. They are documented in the Node documentation,
which is accessed by typing ?Node.
Remember our population example:
print(population, limit = 15)##                        levelName
## 1  world                        
## 2   ¦--North America            
## 3   ¦   ¦--Bermuda              
## 4   ¦   ¦--United States        
## 5   ¦   ¦--Canada               
## 6   ¦   ¦--Bahamas, The         
## 7   ¦   ¦--Trinidad and Tobago  
## 8   ¦   ¦--Puerto Rico          
## 9   ¦   ¦--Barbados             
## 10  ¦   ¦--St. Kitts and Nevis  
## 11  ¦   ¦--Antigua and Barbuda  
## 12  ¦   ¦--Panama               
## 13  ¦   ¦--Costa Rica           
## 14  ¦   ¦--Mexico               
## 15  ¦   °--... 12 nodes w/ 0 sub
## 16  °--... 6 nodes w/ 176 subpopulation$isRoot## [1] TRUEpopulation$height## [1] 3population$count## [1] 7population$totalCount## [1] 196population$attributes## character(0)population$attributesAll## [1] "GNI"        "continent"  "country"    "iso3"       "population"population$averageBranchingFactor## [1] 24.375The naming convention of the package is that attributes and actives
are lower case, whereas methods are upper / CamelCase. RStudio and other
IDEs work well with data.tree. If you have a
Node, simply type myNode$ + SPACE to get a
list of available attributes, actives and methods.
Examples of OO-Style methods
You will find more information on these examples below.
Get will traverse the tree and collect specific values for the
Nodes it traverses:
sum(population$Get("population", filterFun = isLeaf))## [1] 6683146875Prune traverses the tree and keeps only the subtrees for which the pruneFun returns TRUE.
Prune(population, pruneFun = function(x) !x$isLeaf || x$population > 1000000)## [1] 39Note that the Prune function has side-effects, as it acts on the original population object. The population sum is now smaller:
sum(population$Get("population", filterFun = isLeaf), na.rm = TRUE)## [1] 6669737814popClone <- Clone(acme)Traditional S3 generics are available especially for conversion:
as.data.frame(acme)##                           levelName
## 1  Acme Inc.                       
## 2   ¦--Accounting                  
## 3   ¦   ¦--New Software            
## 4   ¦   °--New Accounting Standards
## 5   ¦--Research                    
## 6   ¦   ¦--New Product Line        
## 7   ¦   °--New Labs                
## 8   °--IT                          
## 9       ¦--Outsource               
## 10      ¦--Go agile                
## 11      °--Switch to RThough there is also a more specialised non-generic version:
ToDataFrameNetwork(acme)##          from                       to
## 1   Acme Inc.               Accounting
## 2   Acme Inc.                 Research
## 3   Acme Inc.                       IT
## 4  Accounting             New Software
## 5  Accounting New Accounting Standards
## 6    Research         New Product Line
## 7    Research                 New Labs
## 8          IT                Outsource
## 9          IT                 Go agile
## 10         IT              Switch to RJust as with, say, a list, we can add any custom field
to any Node in a data.tree structure. Let’s go
back to our acme company:
acme##                           levelName
## 1  Acme Inc.                       
## 2   ¦--Accounting                  
## 3   ¦   ¦--New Software            
## 4   ¦   °--New Accounting Standards
## 5   ¦--Research                    
## 6   ¦   ¦--New Product Line        
## 7   ¦   °--New Labs                
## 8   °--IT                          
## 9       ¦--Outsource               
## 10      ¦--Go agile                
## 11      °--Switch to RWe now add costs and probabilities to the projects in each department:
acme$Accounting$`New Software`$cost <- 1000000
acme$Accounting$`New Accounting Standards`$cost <- 500000
acme$Research$`New Product Line`$cost <- 2000000
acme$Research$`New Labs`$cost <- 750000
acme$IT$Outsource$cost <- 400000
acme$IT$`Go agile`$cost <- 250000
acme$IT$`Switch to R`$cost <- 50000
acme$Accounting$`New Software`$p <- 0.5
acme$Accounting$`New Accounting Standards`$p <- 0.75
acme$Research$`New Product Line`$p <- 0.25
acme$Research$`New Labs`$p <- 0.9
acme$IT$Outsource$p <- 0.2
acme$IT$`Go agile`$p <- 0.05
acme$IT$`Switch to R`$p <- 1
print(acme, "cost", "p")##                           levelName    cost    p
## 1  Acme Inc.                             NA   NA
## 2   ¦--Accounting                        NA   NA
## 3   ¦   ¦--New Software             1000000 0.50
## 4   ¦   °--New Accounting Standards  500000 0.75
## 5   ¦--Research                          NA   NA
## 6   ¦   ¦--New Product Line         2000000 0.25
## 7   ¦   °--New Labs                  750000 0.90
## 8   °--IT                                NA   NA
## 9       ¦--Outsource                 400000 0.20
## 10      ¦--Go agile                  250000 0.05
## 11      °--Switch to R                50000 1.00Note that there is a list of reserved names you cannot use as
Node attributes:
NODE_RESERVED_NAMES_CONST##  [1] "AddChild"               "AddChildNode"           "AddSibling"            
##  [4] "AddSiblingNode"         "attributes"             "attributesAll"         
##  [7] "averageBranchingFactor" "children"               "Climb"                 
## [10] "Navigate"               "FindNode"               "clone"                 
## [13] "count"                  "Do"                     "Get"                   
## [16] "GetAttribute"           "height"                 "initialize"            
## [19] "isBinary"               "isLeaf"                 "isRoot"                
## [22] "leafCount"              "leaves"                 "level"                 
## [25] "levelName"              "name"                   "parent"                
## [28] "path"                   "pathString"             "position"              
## [31] "printFormatters"        "Prune"                  "Revert"                
## [34] "RemoveAttribute"        "RemoveChild"            "root"                  
## [37] "Set"                    "siblings"               "Sort"                  
## [40] "totalCount"             ".*"An alternative, often convenient way to assign custom attributes is
in the constructor, or in the Node$AddChild method:
birds <- Node$new("Aves", vulgo = "Bird")
birds$AddChild("Neognathae", vulgo = "New Jaws", species = 10000)
birds$AddChild("Palaeognathae", vulgo = "Old Jaws", species = 60)
print(birds, "vulgo", "species")##           levelName    vulgo species
## 1 Aves                  Bird      NA
## 2  ¦--Neognathae    New Jaws   10000
## 3  °--Palaeognathae Old Jaws      60Nothing stops you from setting a function as a field. This calculates
a value dynamically, i.e. whenever a field is accessed in tree
traversal. For example, you can add a new Node to your
structure, and the function will reflect this. Think of this as a
hierarchical spreadsheet, in which you can set formulas into cells.
Consider the following example:
birds$species <- function(self) sum(sapply(self$children, function(x) x$species))
print(birds, "species")##           levelName species
## 1 Aves                10060
## 2  ¦--Neognathae      10000
## 3  °--Palaeognathae      60data.tree maps the self argument to the
Node at hand. Thus, you must name the argument
self.
Now, let’s assume we discover a new species. Then, the species on the root adjusts dynamically:
birds$Palaeognathae$species <- 61
print(birds, "species")##           levelName species
## 1 Aves                10061
## 2  ¦--Neognathae      10000
## 3  °--Palaeognathae      61This, together with the Set method and recursion,
becomes a very powerful tool, as we’ll see later.
Basic printing is easy, as you surely have noted in the previous
sections. print displays a tree in a tree-grid view. On the
left, you have the hierarchy. Then you have a column per variable you
want to print:
print(acme, "cost", "p")##                           levelName    cost    p
## 1  Acme Inc.                             NA   NA
## 2   ¦--Accounting                        NA   NA
## 3   ¦   ¦--New Software             1000000 0.50
## 4   ¦   °--New Accounting Standards  500000 0.75
## 5   ¦--Research                          NA   NA
## 6   ¦   ¦--New Product Line         2000000 0.25
## 7   ¦   °--New Labs                  750000 0.90
## 8   °--IT                                NA   NA
## 9       ¦--Outsource                 400000 0.20
## 10      ¦--Go agile                  250000 0.05
## 11      °--Switch to R                50000 1.00For more advanced printing, you have a few options.
You can use formatters to output a variable in a certain way. You can use formatters in two ways:
Node using the
SetFormat method. If you do this, then the formatter will
be picked up as a default formatter whenever you print,
Get, convert to data.frame, etc. Formatters
can be set on any Node in a data.tree
structure act on any descendant. So you can overwrite a formatter for a
sub-tree.Get
method (see below). This will overwrite default formatters previously
set via the SetFormat method. You can also set the
formatter to identity to void a default formatter.Setting a formatter using the SetFormat method:
SetFormat(acme, "p", formatFun = FormatPercent)
SetFormat(acme, "cost", formatFun = function(x) FormatFixedDecimal(x, digits = 2))
print(acme, "cost", "p")##                           levelName       cost        p
## 1  Acme Inc.                                           
## 2   ¦--Accounting                                      
## 3   ¦   ¦--New Software             1000000.00  50.00 %
## 4   ¦   °--New Accounting Standards  500000.00  75.00 %
## 5   ¦--Research                                        
## 6   ¦   ¦--New Product Line         2000000.00  25.00 %
## 7   ¦   °--New Labs                  750000.00  90.00 %
## 8   °--IT                                              
## 9       ¦--Outsource                 400000.00  20.00 %
## 10      ¦--Go agile                  250000.00   5.00 %
## 11      °--Switch to R                50000.00 100.00 %GetFormatting with the Get method overwrites any formatters
found along the path:
data.frame(cost = acme$Get("cost", format = function(x) FormatFixedDecimal(x, 2)),
           p = acme$Get("p", format = FormatPercent))##                                cost        p
## Acme Inc.                                   
## Accounting                                  
## New Software             1000000.00  50.00 %
## New Accounting Standards  500000.00  75.00 %
## Research                                    
## New Product Line         2000000.00  25.00 %
## New Labs                  750000.00  90.00 %
## IT                                          
## Outsource                 400000.00  20.00 %
## Go agile                  250000.00   5.00 %
## Switch to R                50000.00 100.00 %plotdata.tree is mainly a data structure. As it is easy to
convert data.tree structures to other formats, you have
access to a large number of tools to plot a data.tree
structure. For example, you can plot a data.tree structure
as a dendrogram, as an ape tree, as a treeview, etc. Additionally,
data.tree also provides its own plotting facility. It is
built on GraphViz/DiagrammeR, and you can access these features via the
plot and ToGraphViz functions. Note that
DiagrammeR is not required to use data.tree, so plot only
works if DiagrammeR is installed on your system. For example:
plot(acme)Similar to formatters for printing, you can style your tree and store the styling directly in the tree, for later use:
SetGraphStyle(acme, rankdir = "TB")
SetEdgeStyle(acme, arrowhead = "vee", color = "grey35", penwidth = 2)
SetNodeStyle(acme, style = "filled,rounded", shape = "box", fillcolor = "GreenYellow", 
            fontname = "helvetica", tooltip = GetDefaultTooltip)
SetNodeStyle(acme$IT, fillcolor = "LightBlue", penwidth = "5px")
plot(acme)For details on the styling attributes, see https://graphviz.org/Documentation.php .
Note that, by default, most Node style attributes will be inherited.
Though, for example, label will not be inherited. However,
inheritance can be avoided for all style attributes, as for the
Accounting node in the following example:
SetNodeStyle(acme$Accounting, inherit = FALSE, fillcolor = "Thistle", 
             fontcolor = "Firebrick", tooltip = "This is the accounting department")
plot(acme)Use Do to set style on specific nodes:
Do(acme$leaves, function(node) SetNodeStyle(node, shape = "egg"))
plot(acme)However, there are also endless other possibilities to visualise
data.tree structures. There are more examples in the
applications vignette. Type
vignette('applications', package = "data.tree").
For example, using dendrogram:
plot(as.dendrogram(CreateRandomTree(nodes = 20)), center = TRUE)Or, using igraph:
library(igraph)
plot(as.igraph(acme, directed = TRUE, direction = "climb"))Or, using networkD3: (you can actually touch these thingies and drag them around, don’t be shy!)
library(networkD3)
acmeNetwork <- ToDataFrameNetwork(acme, "name")
simpleNetwork(acmeNetwork[-3], fontSize = 12)Another example, which at the same time shows conversion from csv:
fileName <- system.file("extdata", "useR15.csv", package="data.tree")
useRdf <- read.csv(fileName, stringsAsFactors = FALSE)
#define the hierarchy (Session/Room/Speaker)
useRdf$pathString <- paste("useR", useRdf$session, useRdf$room, useRdf$speaker, sep="|")
#convert to Node
useRtree <- as.Node(useRdf, pathDelimiter = "|")
#plot with networkD3
useRtreeList <- ToListExplicit(useRtree, unname = TRUE)
radialNetwork( useRtreeList)In order to take advantage of the R eco-system, you can convert your
data.tree structure to other oft-used data types. The
general rule is that, for each target type, there is a one-does-it-all
generics, and a few more specialised conversion functions. For example,
in order to convert a data.tree to a data.frame, you can
either use as.data.frame.Node, or
ToDataFrameTree, ToDataFrameTable, or
ToDataFrameNetwork. The documentation for all of these
variations is accessible via ?as.data.frame.Node.
data.frameAs you saw just above, creating a data.frame is
easy.
Again, note that we always call such methods on the root
Node of a data.tree structure, or on the root
Node of a subtree:
acmedf <- as.data.frame(acme)
as.data.frame(acme$IT)##         levelName
## 1 IT             
## 2  ¦--Outsource  
## 3  ¦--Go agile   
## 4  °--Switch to RThe same can be achieved by using the more specialised method:
ToDataFrameTree(acme)We can also add field values of the Nodes as columns to
the data.frame:
ToDataFrameTree(acme, "level", "cost")##                           levelName level    cost
## 1  Acme Inc.                            1      NA
## 2   ¦--Accounting                       2      NA
## 3   ¦   ¦--New Software                 3 1000000
## 4   ¦   °--New Accounting Standards     3  500000
## 5   ¦--Research                         2      NA
## 6   ¦   ¦--New Product Line             3 2000000
## 7   ¦   °--New Labs                     3  750000
## 8   °--IT                               2      NA
## 9       ¦--Outsource                    3  400000
## 10      ¦--Go agile                     3  250000
## 11      °--Switch to R                  3   50000Note that it is not required that the field is set on each and every
Node.
Other data frame conversions are:
ToDataFrameTable(acme, "pathString", "cost")##                                      pathString    cost
## 1             Acme Inc./Accounting/New Software 1000000
## 2 Acme Inc./Accounting/New Accounting Standards  500000
## 3           Acme Inc./Research/New Product Line 2000000
## 4                   Acme Inc./Research/New Labs  750000
## 5                        Acme Inc./IT/Outsource  400000
## 6                         Acme Inc./IT/Go agile  250000
## 7                      Acme Inc./IT/Switch to R   50000ToDataFrameNetwork(acme, "cost")##          from                       to    cost
## 1   Acme Inc.               Accounting      NA
## 2   Acme Inc.                 Research      NA
## 3   Acme Inc.                       IT      NA
## 4  Accounting             New Software 1000000
## 5  Accounting New Accounting Standards  500000
## 6    Research         New Product Line 2000000
## 7    Research                 New Labs  750000
## 8          IT                Outsource  400000
## 9          IT                 Go agile  250000
## 10         IT              Switch to R   50000And, finally, we can also put attributes of our nodes in a column,
based on a type discriminator. This sounds more complicated then what it
is. Consider the default discriminator, level:
ToDataFrameTypeCol(acme, 'cost')##     level_1    level_2                  level_3    cost
## 1 Acme Inc. Accounting             New Software 1000000
## 2 Acme Inc. Accounting New Accounting Standards  500000
## 3 Acme Inc.   Research         New Product Line 2000000
## 4 Acme Inc.   Research                 New Labs  750000
## 5 Acme Inc.         IT                Outsource  400000
## 6 Acme Inc.         IT                 Go agile  250000
## 7 Acme Inc.         IT              Switch to R   50000Let’s look at a somewhat more advanced example. First, let’s assume that for the outsourcing project, we have two separate possibilities: Outsourcing to India or outsourcing to Poland:
acme$IT$Outsource$AddChild("India")
acme$IT$Outsource$AddChild("Poland")Now, with this slightly more complex tree structure, the level is not a usefully discriminator anymore, because some projects are in level 3, while the new projects are in level 4. For this reason, we introduce a type field on our node objects: A node type can be a company (root only), a department (Accounting, Research, and IT), a program (Oursource), and a project (the rest, i.e. all the leaves):
acme$Set(type = c('company', 'department', 'project', 'project', 'department', 'project', 'project', 'department', 'program', 'project', 'project', 'project', 'project'))Our tree now looks like this:
print(acme, 'type')##                           levelName       type
## 1  Acme Inc.                           company
## 2   ¦--Accounting                   department
## 3   ¦   ¦--New Software                project
## 4   ¦   °--New Accounting Standards    project
## 5   ¦--Research                     department
## 6   ¦   ¦--New Product Line            project
## 7   ¦   °--New Labs                    project
## 8   °--IT                           department
## 9       ¦--Outsource                   program
## 10      ¦   ¦--India                   project
## 11      ¦   °--Poland                  project
## 12      ¦--Go agile                    project
## 13      °--Switch to R                 projectWe can now create a data.frame in which we have one column per distinct type value. Namely, a company column, a department column, a program column, and a project column. Note that the columns are not hardcoded, but derived dynamically from your data in the tree structure:
ToDataFrameTypeCol(acme, type = 'type', prefix = NULL)##     company department   program                  project
## 1 Acme Inc. Accounting      <NA>             New Software
## 2 Acme Inc. Accounting      <NA> New Accounting Standards
## 3 Acme Inc.   Research      <NA>         New Product Line
## 4 Acme Inc.   Research      <NA>                 New Labs
## 5 Acme Inc.         IT Outsource                    India
## 6 Acme Inc.         IT Outsource                   Poland
## 7 Acme Inc.         IT      <NA>                 Go agile
## 8 Acme Inc.         IT      <NA>              Switch to RList of lists are useful for various use cases:
data.tree structure as an R object (see
performance considerations below)data(acme)
str(as.list(acme$IT))## List of 3
##  $ Outsource  :List of 2
##   ..$ cost: num 4e+05
##   ..$ p   : num 0.2
##  $ Go agile   :List of 2
##   ..$ cost: num 250000
##   ..$ p   : num 0.05
##  $ Switch to R:List of 2
##   ..$ cost: num 50000
##   ..$ p   : num 1str(ToListExplicit(acme$IT, unname = FALSE, nameName = "id", childrenName = "dependencies"))## List of 2
##  $ id          : chr "IT"
##  $ dependencies:List of 3
##   ..$ Outsource  :List of 3
##   .. ..$ id  : chr "Outsource"
##   .. ..$ cost: num 4e+05
##   .. ..$ p   : num 0.2
##   ..$ Go agile   :List of 3
##   .. ..$ id  : chr "Go agile"
##   .. ..$ cost: num 250000
##   .. ..$ p   : num 0.05
##   ..$ Switch to R:List of 3
##   .. ..$ id  : chr "Switch to R"
##   .. ..$ cost: num 50000
##   .. ..$ p   : num 1There are also conversions to igraph objects, to phylo / ape, to
dendrogram, and others. For details, see ?as.phylo.Node,
?as.dendrogram.Node, ?as.igraph.Node.
Tree traversal is one of the core concepts of trees. See, for example, here: Tree Traversal on Wikipedia.
GetThe Get method traverses the tree and collects values
from each node. It then returns a vector or a list, containing the
collected values.
Additional features of the Get method are:
Node method on each node, and append the
method’s return value to the returned vectorThe Get method can traverse the tree in various ways.
This is called traversal order.
The default traversal mode is pre-order.
This is what is used e.g. in print:
print(acme, "level")##                           levelName level
## 1  Acme Inc.                            1
## 2   ¦--Accounting                       2
## 3   ¦   ¦--New Software                 3
## 4   ¦   °--New Accounting Standards     3
## 5   ¦--Research                         2
## 6   ¦   ¦--New Product Line             3
## 7   ¦   °--New Labs                     3
## 8   °--IT                               2
## 9       ¦--Outsource                    3
## 10      ¦--Go agile                     3
## 11      °--Switch to R                  3The post-order traversal mode returns children first, returning parents only after all its children have been traversed and returned:
We can use it like this on the Get method:
acme$Get('level', traversal = "post-order")##             New Software New Accounting Standards               Accounting 
##                        3                        3                        2 
##         New Product Line                 New Labs                 Research 
##                        3                        3                        2 
##                Outsource                 Go agile              Switch to R 
##                        3                        3                        3 
##                       IT                Acme Inc. 
##                        2                        1This is useful if your parent’s value depends on the children, as we’ll see below.
This is a non-standard traversal mode that does not traverse the
entire tree. Instead, the ancestor mode starts from a Node,
then walks the tree along the path from parent to parent, up to the
root.
data.frame(level = agile$Get('level', traversal = "ancestor"))##           level
## Go agile      3
## IT            2
## Acme Inc.     1You can add a filter and/or a prune function to the Get
method. These functions have to take a Node as an input,
and return TRUE if the Node should be
considered, and FALSE otherwise. The difference between the
pruneFun and the filterFun is that filters act
only on specific nodes, whereas if the pruneFun returns
FALSE, then the entire sub-tree spanned by the
Node is ignored.
For example:
acme$Get('name', pruneFun = function(x) x$position <= 2)##                  Acme Inc.                 Accounting 
##                "Acme Inc."               "Accounting" 
##               New Software   New Accounting Standards 
##             "New Software" "New Accounting Standards" 
##                   Research           New Product Line 
##                 "Research"         "New Product Line" 
##                   New Labs 
##                 "New Labs"There are also some convenient filter functions available in the
package, such as isLeaf, isRoot,
isNotLeaf, etc.
acme$Get('name', filterFun = isLeaf)##               New Software   New Accounting Standards 
##             "New Software" "New Accounting Standards" 
##           New Product Line                   New Labs 
##         "New Product Line"                 "New Labs" 
##                  Outsource                   Go agile 
##                "Outsource"                 "Go agile" 
##                Switch to R 
##              "Switch to R"The attribute parameter determines what is collected.
This is called attribute, but it should not be confused
with R’s concept of object attributes (e.g. ?attributes).
In this context, an attribute can be either:
Node fieldNode method or activeThroughout this document, we refer to attribute in this
sense.
acme$Get('name')##                  Acme Inc.                 Accounting 
##                "Acme Inc."               "Accounting" 
##               New Software   New Accounting Standards 
##             "New Software" "New Accounting Standards" 
##                   Research           New Product Line 
##                 "Research"         "New Product Line" 
##                   New Labs                         IT 
##                 "New Labs"                       "IT" 
##                  Outsource                   Go agile 
##                "Outsource"                 "Go agile" 
##                Switch to R 
##              "Switch to R"You can pass a standard R function to the Get method
(and thus to print, as.data.frame, etc.). The
only requirement this function must satisfy is that its first argument
be of class Node. Subsequent arguments can be added through
the ellipsis (…). For example:
ExpectedCost <- function(node, adjustmentFactor = 1) {
  return ( node$cost * node$p * adjustmentFactor)
}
acme$Get(ExpectedCost, adjustmentFactor = 0.9, filterFun = isLeaf)##             New Software New Accounting Standards         New Product Line 
##                   450000                   337500                   450000 
##                 New Labs                Outsource                 Go agile 
##                   607500                    72000                    11250 
##              Switch to R 
##                    45000Recursion comes naturally with data.tree, and it is one of its core strengths:
Cost <- function(node) {
  result <- node$cost
  if(length(result) == 0) result <- sum(sapply(node$children, Cost))
  return (result)
}
print(acme, "p", cost = Cost)##                           levelName    p    cost
## 1  Acme Inc.                          NA 4950000
## 2   ¦--Accounting                     NA 1500000
## 3   ¦   ¦--New Software             0.50 1000000
## 4   ¦   °--New Accounting Standards 0.75  500000
## 5   ¦--Research                       NA 2750000
## 6   ¦   ¦--New Product Line         0.25 2000000
## 7   ¦   °--New Labs                 0.90  750000
## 8   °--IT                             NA  700000
## 9       ¦--Outsource                0.20  400000
## 10      ¦--Go agile                 0.05  250000
## 11      °--Switch to R              1.00   50000There is a built-in function that would make this example even
simpler: Aggregate. It is explained below.
DoDo is similar to Get in that it also traverses a tree in
a specific traversal order. However, instead of fetching an attribute,
it will (surprise!) do something, namely run a function. For example, we
can tell the Do method to assign a value to each
Node it traverses. This is especially useful if the
attribute parameter is a function, as in the previous examples. For
instance, we can store the aggregated cost for later use and
printing:
acme$Do(function(node) node$cost <- Cost(node), filterFun = isNotLeaf)
print(acme, "p", "cost")##                           levelName    p    cost
## 1  Acme Inc.                          NA 4950000
## 2   ¦--Accounting                     NA 1500000
## 3   ¦   ¦--New Software             0.50 1000000
## 4   ¦   °--New Accounting Standards 0.75  500000
## 5   ¦--Research                       NA 2750000
## 6   ¦   ¦--New Product Line         0.25 2000000
## 7   ¦   °--New Labs                 0.90  750000
## 8   °--IT                             NA  700000
## 9       ¦--Outsource                0.20  400000
## 10      ¦--Go agile                 0.05  250000
## 11      °--Switch to R              1.00   50000SetThe Set method is the counterpart to the
Get method. The Set method takes a vector or a
single value as an input, and traverses the tree in a certain order.
Each Node is assigned a value from the vector, one after
the other, recycling.
acme$Set(id = 1:acme$totalCount)
print(acme, "id")##                           levelName id
## 1  Acme Inc.                         1
## 2   ¦--Accounting                    2
## 3   ¦   ¦--New Software              3
## 4   ¦   °--New Accounting Standards  4
## 5   ¦--Research                      5
## 6   ¦   ¦--New Product Line          6
## 7   ¦   °--New Labs                  7
## 8   °--IT                            8
## 9       ¦--Outsource                 9
## 10      ¦--Go agile                 10
## 11      °--Switch to R              11The Set method can take multiple vectors as an input,
and, optionally, you can define the name of the attribute. Finally, just
as for the Get method, the traversal order
is important for the Set.
secretaries <- c(3, 2, 8)
employees <- c(52, 43, 51)
acme$Set(secretaries, 
         emps = employees,
         filterFun = function(x) x$level == 2)
print(acme, "emps", "secretaries", "id")##                           levelName emps secretaries id
## 1  Acme Inc.                          NA          NA  1
## 2   ¦--Accounting                     52           3  2
## 3   ¦   ¦--New Software               NA          NA  3
## 4   ¦   °--New Accounting Standards   NA          NA  4
## 5   ¦--Research                       43           2  5
## 6   ¦   ¦--New Product Line           NA          NA  6
## 7   ¦   °--New Labs                   NA          NA  7
## 8   °--IT                             51           8  8
## 9       ¦--Outsource                  NA          NA  9
## 10      ¦--Go agile                   NA          NA 10
## 11      °--Switch to R                NA          NA 11The Set method can also be used to assign a single value
directly to all Nodes traversed. For example, to remove the
avgExpectedCost, we assign NULL on each node,
using the fact that the Set recycles:
acme$Set(avgExpectedCost = NULL)However, note that setting a field to NULL will not make
it gone for good. You will still see it:
acme$attributesAll## [1] "avgExpectedCost" "cost"            "id"              "emps"           
## [5] "secretaries"     "p"In order remove it completely, you can use the
RemoveAttribute method:
acme$Do(function(node) node$RemoveAttribute("avgExpectedCost"))Earlier, we saw that we can add a function dynamically to a
Node. We can, of course, also do this via the
Set method
acme$Set(cost = c(function(self) sum(sapply(self$children, 
                                            function(child) GetAttribute(child, "cost")))), 
         filterFun = isNotLeaf)
print(acme, "cost")##                           levelName    cost
## 1  Acme Inc.                        4950000
## 2   ¦--Accounting                   1500000
## 3   ¦   ¦--New Software             1000000
## 4   ¦   °--New Accounting Standards  500000
## 5   ¦--Research                     2750000
## 6   ¦   ¦--New Product Line         2000000
## 7   ¦   °--New Labs                  750000
## 8   °--IT                            700000
## 9       ¦--Outsource                 400000
## 10      ¦--Go agile                  250000
## 11      °--Switch to R                50000acme$IT$AddChild("Paperless", cost = 240000)
print(acme, "cost")##                           levelName    cost
## 1  Acme Inc.                        5190000
## 2   ¦--Accounting                   1500000
## 3   ¦   ¦--New Software             1000000
## 4   ¦   °--New Accounting Standards  500000
## 5   ¦--Research                     2750000
## 6   ¦   ¦--New Product Line         2000000
## 7   ¦   °--New Labs                  750000
## 8   °--IT                            940000
## 9       ¦--Outsource                 400000
## 10      ¦--Go agile                  250000
## 11      ¦--Switch to R                50000
## 12      °--Paperless                 240000Traverse and explicit traversalPreviously, we have used the Get, Set and
Do methods in their OO-style version. This is often very
convenient for quick access to variables. However, sometimes you want to
re-use the same traversal for multiple sequential operations. For this,
you can use what is called explicit traversal. It works
like so:
traversal <- Traverse(acme, traversal = "post-order", filterFun = function(x) x$level == 2)
Set(traversal, floor = c(1, 2, 3))
Do(traversal, function(x) {
    if (x$floor <= 2) {
      x$extension <- "044"
    } else {
      x$extension <- "043"
    }
  })
Get(traversal, "extension")## Accounting   Research         IT 
##      "044"      "044"      "043"AggregateThe Aggregate method provides a shorthand for the
oft-used case when a parent is the aggregate of its child values, as
seen in the previous example. Aggregate calls a function
recursively on children. If a child holds the attribute, that value is
returned. Otherwise, the attribute is collected from all children, and
aggregated using the aggFun. For example:
Aggregate(node = acme, attribute = "cost", aggFun = sum)## [1] 5190000We can also use this in the Get method, of course:
acme$Get(Aggregate, "cost", sum)Note, however, that this is not very efficient:
Aggregate will be called twice on, say, IT: Once
when the traversal passes IT itself, the second time
recursively when Aggregate is called on the root. For this
reason, we have the option to store/cache the calculated value along the
way. For one thing, this is a convenient way to save an additional
Set call in case we want to store the aggregated value.
Additionally, it speeds up calculation because Aggregate on
an ancestor will use a cached value on a descendant:
acme$Do(function(node) node$cost <- Aggregate(node, attribute = "cost", aggFun = sum), traversal = "post-order")
print(acme, "cost")##                           levelName    cost
## 1  Acme Inc.                        5190000
## 2   ¦--Accounting                   1500000
## 3   ¦   ¦--New Software             1000000
## 4   ¦   °--New Accounting Standards  500000
## 5   ¦--Research                     2750000
## 6   ¦   ¦--New Product Line         2000000
## 7   ¦   °--New Labs                  750000
## 8   °--IT                            940000
## 9       ¦--Outsource                 400000
## 10      ¦--Go agile                  250000
## 11      ¦--Switch to R                50000
## 12      °--Paperless                 240000CumulateIn its simplest form, the Cumulate function just sums up
an attribute value along siblings, taking into consideration all
siblings before the Node on which Cumulate is
called:
Cumulate(acme$IT$`Go agile`, "cost", sum)## [1] 650000Or, to find the minimum cost among siblings:
Cumulate(acme$IT$`Go agile`, "cost", min)## [1] 250000This can be useful in combination with traversal, e.g. to calculate a
running sum among siblings. Specifically, the
cacheAttribute lets you store the running sum in a field.
This not only speeds up calculation, but lets you re-use the calculated
values later:
acme$Do(function(node) node$cumCost <- Cumulate(node, 
                                                attribute = "cost", 
                                                aggFun = sum))
print(acme, "cost", "cumCost")##                           levelName    cost cumCost
## 1  Acme Inc.                        5190000 5190000
## 2   ¦--Accounting                   1500000 1500000
## 3   ¦   ¦--New Software             1000000 1000000
## 4   ¦   °--New Accounting Standards  500000 1500000
## 5   ¦--Research                     2750000 4250000
## 6   ¦   ¦--New Product Line         2000000 2000000
## 7   ¦   °--New Labs                  750000 2750000
## 8   °--IT                            940000 5190000
## 9       ¦--Outsource                 400000  400000
## 10      ¦--Go agile                  250000  650000
## 11      ¦--Switch to R                50000  700000
## 12      °--Paperless                 240000  940000CloneAs stated above, Nodes exhibit reference semantics. If
you call, say, Set, then this changes the
Nodes in the tree. The changes will be visible for all
variables having a reference on the data.tree structure. As
a consequence, you might want to “save away” the current state of a
structure. To do this, you can Clone an entire tree:
acmeClone <- Clone(acme)
acmeClone$name <- "New Acme"
# acmeClone does not point to the same reference object anymore:
acme$name == acmeClone$name## [1] FALSESortWith the Sort method, you can sort an entire tree, a
sub-tree, or children of a specific Node. The method will
sort recursively and sort children with respect to a child attribute. As
explained earlier, the child attribute can be a function or a
method.
Sort(acme, "name")
acme##                           levelName
## 1  Acme Inc.                       
## 2   ¦--Accounting                  
## 3   ¦   ¦--New Accounting Standards
## 4   ¦   °--New Software            
## 5   ¦--IT                          
## 6   ¦   ¦--Go agile                
## 7   ¦   ¦--Outsource               
## 8   ¦   ¦--Paperless               
## 9   ¦   °--Switch to R             
## 10  °--Research                    
## 11      ¦--New Labs                
## 12      °--New Product LineSort(acme, Aggregate, "cost", sum, decreasing = TRUE, recursive = TRUE)
print(acme, "cost", aggCost = acme$Get(Aggregate, "cost", sum))##                           levelName    cost aggCost
## 1  Acme Inc.                        5190000 5190000
## 2   ¦--Research                     2750000 2750000
## 3   ¦   ¦--New Product Line         2000000 2000000
## 4   ¦   °--New Labs                  750000  750000
## 5   ¦--Accounting                   1500000 1500000
## 6   ¦   ¦--New Software             1000000 1000000
## 7   ¦   °--New Accounting Standards  500000  500000
## 8   °--IT                            940000  940000
## 9       ¦--Outsource                 400000  400000
## 10      ¦--Go agile                  250000  250000
## 11      ¦--Paperless                 240000  240000
## 12      °--Switch to R                50000   50000PruneYou can prune sub-trees out of a tree, by that removing an entire
sub-tree from a tree. There are two variations of this: *
temporary pruning, e.g. just for printing: This is the
pruneFun parameter, e.g. in Get * side
effect or permanent pruning, meaning that you modify your
data.tree structure for good. This is achieved with the
Prune method.
Consider the following example of permanent pruning:
acme$Do(function(x) x$cost <- Aggregate(x, "cost", sum))
Prune(acme, function(x) x$cost > 700000)## [1] 5print(acme, "cost")##                  levelName    cost
## 1 Acme Inc.                5190000
## 2  ¦--Research             2750000
## 3  ¦   ¦--New Product Line 2000000
## 4  ¦   °--New Labs          750000
## 5  ¦--Accounting           1500000
## 6  ¦   °--New Software     1000000
## 7  °--IT                    940000The data.tree package has been built to work with
hierarchical data, to support visualization, to foster rapid
prototyping, and for other applications where development time saved is
more important than computing time lost. Having said this, it becomes
clear that big data and data.tree do not marry particularly
well. Don’t expect R to build your data.tree structure with
a few million Nodes during your cigarette break. Do not try
to convert a gigabyte JSON document to a data.tree
structure in a testthat test case.
However, if you are respecting the following guidelines, I promise
that you and your Nodes will have a lot of fun together. So
here it goes:
Node is relatively expensive.
CreateRegularTree(6, 6) creates a data.tree
structure with 9331 Nodes. On an AWS c4.large instance,
this takes about 2.5 seconds.Clone is similar to Node creation, with an
extra penalty of about 50%.Traverse, Get,
Set and Do) is relatively cheap.This is really what you would expect. data.tree builds
on R6, i.e. reference objects. There is an overhead in creating them, as
your computer needs to manage the references they hold. However,
performing operations that change your tree (e.g. Prune or
Set) are often faster than value semantics, as your
computer does not need to copy the entire object in memory.
Just to give you an order of magnitude: The following times are achieved on an AWS c4.large instance:
system.time(tree <- CreateRegularTree(6, 6))##    user  system elapsed 
##   2.499   0.009   2.506system.time(tree <- Clone(tree))##    user  system elapsed 
##   3.704   0.023   3.726system.time(traversal <- Traverse(tree))##    user  system elapsed 
##   0.096   0.000   0.097system.time(Set(traversal, id = 1:tree$totalCount))##    user  system elapsed 
##   0.205   0.000   0.204system.time(ids <- Get(traversal, "id"))##    user  system elapsed 
##   0.569   0.000   0.569leaves <- Traverse(tree, filterFun = isLeaf)
Set(leaves, leafId = 1:length(leaves))
system.time(Get(traversal, function(node) Aggregate(node, "leafId", max)))##    user  system elapsed 
##   1.418   0.000   1.417With caching, you can save some time:
system.time(tree$Get(function(node) Aggregate(tree, "leafId", max, "maxLeafId"), traversal = "post-order"))##    user  system elapsed 
##    0.69    0.00    0.69data.tree structures have a relatively large memory footprint.
However, for every-day applications using modern computers, this will
not normally have an impact on your work except when saving a
data.tree structure to disk.
For an explanation why that is the case, you might want to read this answer on Stack Overflow.
Depending on your development environment, you might want to turn off the option to save the workspace to .RData on exit.