---
title: "Semantically Enriched Vectors with `defined()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Semantically Enriched Vectors with `defined()`}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setupdefined, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r load}
library(dataset)
```

The `defined()` function in the `dataset` package creates semantically enriched vectors that retain human-readable metadata: labels, measurement units, definitions (e.g. URIs), and namespaces. This vignette demonstrates how to create, manipulate, and interpret defined vectors, and how they integrate into data frames and tidy workflows.

## Introducing the `defined` Class

The `defined()` constructor enriches a vector by attaching attributes that convey semantic meaning. It builds on labelled vectors and introduces three further metadata elements:

- A `unit` of measurement (e.g. "million dollars")
- A `concept`, which can be a textual reference or, ideally, a URI
- A `namespace`, which enables the construction of meaningful, resolvable identifiers for values or categories

Let's inspect the metadata attached to a defined vector representing GDP values:

```{r definedvector}
# GDP values with a label, a unit and a concept URI
gdp_1 <- defined(
  c(3897, 7365),
  label = "Gross Domestic Product",
  unit = "million dollars",
  concept = "http://data.europa.eu/83i/aa/GDP"
)

cat("The print method:\n")
print(gdp_1)

cat("And the summary:\n")
summary(gdp_1)
```

When `summary()` is called on a defined vector, its label and unit (if available) are displayed above the summary statistics.

The `defined` class extends the attributes of a labelled vector with a unit of measure, a concept definition and a namespace. These can be inspected all at once with `attributes()`, or one at a time with accessor functions:

```{r definedattributes}
# All attributes at once
attributes(gdp_1)

cat("Get the label only: ")
var_label(gdp_1)

cat("Get the unit only: ")
var_unit(gdp_1)

cat("Get the concept definition only: ")
var_concept(gdp_1)
```

## Semantic Consistency in Concatenation

Defined vectors with matching metadata can be concatenated with `c()`:

```{r combine}
a <- defined(1:3, label = "Length", unit = "metres")
b <- defined(4:6, label = "Length", unit = "metres")
c(a, b)
```

What happens if we try to concatenate a semantically under-specified vector to the GDP vector?

```{r newexample}
# Only the label is given; unit and concept are missing
gdp_2 <- defined(2034, label = "Gross Domestic Product")
```

Attempting to concatenate the under-specified `gdp_2` vector with `gdp_1` triggers an error, because some of their attributes are not compatible:

```{r error, eval=FALSE}
c(gdp_1, gdp_2)
```

```
Error in `vec_c()`:
! Can't combine `..1` and `..2` .
✖ Some attributes are incompatible.
```

This error is intentional: it ensures that values with mismatched or incomplete semantic context (e.g., a different currency unit or an undefined concept) do not silently contaminate the dataset. You certainly want to avoid concatenating figures in euros with figures in dollars, for example.

### Aligning Metadata Before Concatenation

We can resolve this by explicitly setting the missing unit and concept definition of `gdp_2` so that they match those of `gdp_1`:

```{r smgdp}
var_unit(gdp_2) <- "million dollars"
```

```{r vardef2}
var_concept(gdp_2) <- "http://data.europa.eu/83i/aa/GDP"
```

With matching metadata, concatenation now succeeds:

```{r concat}
summary(c(gdp_1, gdp_2))
```

### Semantic Identifiers via Namespaces

Namespaces allow defined values, such as country codes, to be expanded into resolvable URIs. This is especially useful for linked data and machine-readable classification systems.

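
As a rough illustration of what such an expansion involves, the sketch below uses only base R. It is not how the `dataset` package implements namespaces: `expand_namespace()` is a hypothetical helper written for this example, and it assumes the template marks the position of the value with a literal `$1` placeholder.

```r
# Hypothetical helper (not part of the dataset package): replace the
# literal `$1` placeholder of a namespace template with each value in turn.
expand_namespace <- function(values, template) {
  vapply(
    values,
    function(v) sub("$1", v, template, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

# Country codes expanded against a GeoNames-style template
uris <- expand_namespace(
  c("AD", "LI", "SM"),
  "https://www.geonames.org/countries/$1/"
)
uris
```

In practice the template is stored once, as the `namespace` attribute of the vector, rather than expanded by hand.
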
```{r country}
country <- defined(
  c("AD", "LI", "SM"),
  label = "Country name",
  concept = "http://data.europa.eu/bna/c_6c2bb82d",
  namespace = "https://www.geonames.org/countries/$1/"
)
```

The `namespace` attribute turns each value of the vector into a resolvable URI. The point of using a namespace is that it can point to a definition, both human- and machine-readable, of the ID column or of any attribute column in the dataset. (Attributes in a statistical dataset are characteristics of the observations or of the measured variables.)

The namespace acts as a template: `$1` is replaced by the actual value of each element, producing links like:

- <https://www.geonames.org/countries/AD/> in the case of Andorra,
- <https://www.geonames.org/countries/LI/> for Liechtenstein, and
- <https://www.geonames.org/countries/SM/> for San Marino.

In addition, the definition URI, <http://data.europa.eu/bna/c_6c2bb82d>, resolves to a machine-readable classification of country names, helping to align datasets with official vocabularies and standards.

## Basic Usage

Working with character vectors:

```{r characters}
countries <- defined(
  c("AD", "LI"),
  label = "Country code",
  namespace = "https://www.geonames.org/countries/$1/"
)
countries
as_character(countries)
```

### Subsetting and coercion

```{r basicmethods}
gdp_1[1:2]     # subsetting
gdp_1 > 5000   # logical comparison
as.vector(gdp_1)
as.list(gdp_1)
```

## Coerce to base R types

Coerce the defined country vector back to a character vector:

```{r coerce-char}
as_character(country)
as_character(c(gdp_1, gdp_2))
```

And to numeric:

```{r coerce-num}
as_numeric(c(gdp_1, gdp_2))
```
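
Looking back at the concatenation rule from earlier in this vignette, the idea behind the error can be pictured as a comparison of metadata before values are combined. The sketch below is a base R analogy under stated assumptions, not the check performed by the `dataset` package: `same_semantics()` is a hypothetical helper, and the metadata is stored in plain attributes named `label`, `unit` and `concept` purely for illustration.

```r
# Hypothetical helper (not part of the dataset package): TRUE only when
# two vectors carry identical values for every listed metadata attribute.
same_semantics <- function(x, y, fields = c("label", "unit", "concept")) {
  all(vapply(
    fields,
    function(f) identical(attr(x, f), attr(y, f)),
    logical(1)
  ))
}

# Plain numeric vectors with illustrative metadata attributes
gdp_a <- structure(c(3897, 7365),
                   label = "Gross Domestic Product",
                   unit = "million dollars")
gdp_b <- structure(2034, label = "Gross Domestic Product")  # unit missing

before_fix <- same_semantics(gdp_a, gdp_b)

# Align the missing unit, then compare again
attr(gdp_b, "unit") <- "million dollars"
after_fix <- same_semantics(gdp_a, gdp_b)

c(before = before_fix, after = after_fix)
```

A strict `identical()` comparison treats a missing attribute as a mismatch, which mirrors the behaviour described above: an under-specified vector is rejected until its metadata is completed.
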