---
title: "Why Semantics Matter for R Data Frames"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Why Semantics Matter for R Data Frames}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setupvignette, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(dataset)
```

R users love `data.frame`s and `tibble`s for tidy, rectangular data. But tidy data isn't always **meaningful data**. What does a column labelled `gdp` actually represent? Euros? Millions? Per capita? Current prices? Constant 2010 prices?

These questions matter, especially in statistics, open data publishing, and knowledge graph integration.

The `dataset_df` class extends the familiar `data.frame` structure with lightweight, semantically meaningful metadata. It's built for:

- **Tidyverse lovers** who want better documentation and safer analysis
- **Open science workflows** that need interoperable metadata
- **Semantic web users** who want to export structured RDF data from R

`dataset_df` helps you preserve the meaning of variables, units, identifiers, and dataset-level context.

## From Tidy to Meaningful: An Example

Let's start with a basic data frame and upgrade it to a `dataset_df` with semantically enriched columns using `defined()`:

```{r smallcountries}
small_country_dataset <- dataset_df(
  country_name = defined(c("AD", "LI"),
    label = "Country name",
    concept = "http://data.europa.eu/bna/c_6c2bb82d",
    namespace = "https://www.geonames.org/countries/$1/"
  ),
  gdp = defined(c(3897, 7365),
    label = "Gross Domestic Product",
    unit = "million dollars",
    concept = "http://data.europa.eu/83i/aa/GDP"
  ),
  dataset_bibentry = dublincore(
    title = "Small Country Dataset",
    creator = person("Jane", "Doe"),
    publisher = "Example Inc."
  )
)
```

The `defined()` vectors attach metadata to each column:

- `label`: a human-readable name
- `unit`: an explicit measurement unit
- `concept`: a URI identifying the concept measured
- `namespace`: for generating full subject URIs when exporting to RDF

The `dataset_df()` call also allows bibliographic metadata:

- `dataset_bibentry`: Dublin Core metadata for citation, reuse, and provenance

## Why Units Matter

Many statistical errors begin with a silent assumption about units. In Eurostat data, it's common to see:

- `EUR`: Euros
- `MIO_EUR`: Millions of euros
- `PPS`: Purchasing Power Standards

By making units explicit at the column level, you:

- Prevent decimal-scale mistakes (e.g., thousands vs millions)
- Avoid joining or averaging incompatible series
- Gain confidence in your data exports (CSV, RDF, JSON-LD, etc.)

This is especially important in multi-currency and multi-country datasets such as those published by Eurostat, where harmonization is crucial.

## A Final Structure, Ready for Export

The enriched `dataset_df` object can be serialized to RDF using:

```{r serialisation}
triples <- dataset_to_triples(small_country_dataset)
n_triples(mapply(n_triple, triples$s, triples$p, triples$o))
```

This supports export to:

- **Wikibase** via `wbdataset`
- **RDF Data Cube** via `datacube`
- **DataCite or DCAT** metadata formats

This vignette represents the final conceptual structure for `dataset_df` before its rOpenSci submission. Future work will build on this foundation without breaking it.
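To build some intuition for what `namespace` and `concept` contribute during RDF export, here is a minimal base-R sketch. It is not the package's implementation: the helper `expand_namespace()` and the object `sketch_triples` are hypothetical names introduced only for this illustration, and treating the `concept` URI as the predicate is an assumption about one plausible mapping. The sketch expands the `$1` placeholder in the namespace into one subject URI per country code and pairs each subject with the GDP concept and value.

```{r triplesketch}
# Hypothetical base-R illustration; NOT how the dataset package works internally.
namespace_template <- "https://www.geonames.org/countries/$1/"
country_codes <- c("AD", "LI")
gdp_values <- c(3897, 7365)
gdp_concept <- "http://data.europa.eu/83i/aa/GDP"

# Replace the literal `$1` placeholder with each identifier (hypothetical helper).
expand_namespace <- function(template, ids) {
  vapply(
    ids,
    function(id) sub("$1", id, template, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

subjects <- expand_namespace(namespace_template, country_codes)

# One statement per observation: subject URI, predicate URI (assumed to be
# the concept), and the observed value as a literal.
sketch_triples <- data.frame(
  s = subjects,
  p = gdp_concept,
  o = as.character(gdp_values),
  stringsAsFactors = FALSE
)
sketch_triples
```

The point of the sketch is only that a stable namespace plus a concept URI is enough to turn each cell into a globally identifiable statement, which is why these two pieces of metadata are stored with the column.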
## Summary: Why Use `dataset_df`

| Feature            | What It Adds                           |
|--------------------|----------------------------------------|
| `label`            | Human-readable variable name           |
| `unit`             | Explicit unit (e.g., `MIO_EUR`)        |
| `concept`          | URI identifying what is measured       |
| `subject`          | Dataset-level topical classification   |
| `namespace`        | Base URI for RDF subject identifiers   |
| `dataset_bibentry` | Bibliographic metadata via Dublin Core |

The `dataset_df` class is designed to remain fully compatible with the **tidyverse** data workflow, while offering a metadata structure suitable for:

- **Receiving SDMX-style statistical data** into R
- **Exporting semantically meaningful datasets** to DCAT, RDF, or Wikibase
- **Complying with open science repository requirements** (e.g., DataCite, Zenodo)

Start tidy. Stay meaningful. Embrace `dataset_df`.