---
title: "Why Semantics Matter for R Data Frames"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Why Semantics Matter for R Data Frames}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setupvignette, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(dataset)
```

R users love `data.frame`s and `tibble`s for tidy, rectangular data. But tidy data isn't always **meaningful data**. What does a column labelled `gdp` actually represent? Euros? Millions? Per capita? Current prices? Constant 2010 prices? These questions matter, especially in statistics, open data publishing, and knowledge graph integration.

The `dataset_df` class extends the familiar `data.frame` structure with lightweight, semantically meaningful metadata. It's built for:

-   **Tidyverse lovers** who want better documentation and safer analysis

-   **Open science workflows** that need interoperable metadata

-   **Semantic web users** who want to export structured RDF data from R

`dataset_df` helps you preserve the meaning of variables, units, identifiers, and dataset-level context.

## From Tidy to Meaningful: An Example

Let's start with a basic data frame and upgrade it to a `dataset_df` with semantically enriched columns using `defined()`:

```{r smallcountries}
small_country_dataset <- dataset_df(
  country_name = defined(c("AD", "LI"),
    label = "Country name",
    concept = "http://data.europa.eu/bna/c_6c2bb82d",
    namespace = "https://www.geonames.org/countries/$1/"
  ),
  gdp = defined(c(3897, 7365),
    label = "Gross Domestic Product",
    unit = "million dollars",
    concept = "http://data.europa.eu/83i/aa/GDP"
  ),
  dataset_bibentry = dublincore(
    title = "Small Country Dataset",
    creator = person("Jane", "Doe"),
    publisher = "Example Inc."
  )
)
```

The `defined()` vectors attach metadata to each column:

-   `label`: a human-readable name

-   `unit`: an explicit measurement unit

-   `concept`: a URI identifying the concept measured

-   `namespace`: a URI template for generating full subject URIs when exporting to RDF

The `dataset_df()` call also accepts dataset-level bibliographic metadata:

-   `dataset_bibentry`: Dublin Core metadata for citation, reuse, and provenance
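
To make the role of `namespace` more concrete, the following base R sketch shows how a template like the one used for `country_name` could be expanded into full URIs. It assumes that the `$1` placeholder is simply replaced by each value. The helper `expand_namespace()` is hypothetical and not part of the `dataset` package, whose own expansion logic may differ.

```{r namespacesketch}
# Hypothetical helper, not part of the dataset package: replace the `$1`
# placeholder in a namespace template with each value in turn.
expand_namespace <- function(values, template) {
  vapply(
    values,
    function(value) sub("$1", value, template, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

expand_namespace(
  c("AD", "LI"),
  template = "https://www.geonames.org/countries/$1/"
)
```

Keeping the short codes in the column and the template in the metadata means the data stay compact while the full identifiers remain recoverable.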

## Why Units Matter

Many statistical errors begin with a silent assumption about units. In Eurostat data, it's common to see unit codes such as:

-   `EUR`: Euros

-   `MIO_EUR`: Millions of euros

-   `PPS`: Purchasing Power Standards

By making units explicit at the column level, you:

-   Prevent decimal-scale mistakes (e.g., thousands vs millions)

-   Avoid joining or averaging incompatible series

-   Gain confidence in your data exports (CSV, RDF, JSON-LD, etc.)

This is especially important in multi-currency and multi-country datasets such as those published by Eurostat, where harmonization is crucial.
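
The risk is easy to reproduce in plain R. The sketch below uses only base R attributes, not the `dataset` API: the `unit` attribute name, the illustrative values, and the helper `combine_same_unit()` are assumptions made for this example. It shows how an explicit unit lets code refuse to combine series recorded on different scales.

```{r unitsketch}
# The same two illustrative values, recorded once in millions of euros
# and once in euros.
gdp_mio_eur <- structure(c(3897, 7365), unit = "MIO_EUR")
gdp_eur <- structure(c(3897e6, 7365e6), unit = "EUR")

# Hypothetical helper: concatenate two vectors only if their units agree.
combine_same_unit <- function(x, y) {
  unit_x <- attr(x, "unit")
  unit_y <- attr(y, "unit")
  if (!identical(unit_x, unit_y)) {
    stop("Cannot combine '", unit_x, "' with '", unit_y, "'.", call. = FALSE)
  }
  # c() drops attributes, so the shared unit is re-attached explicitly.
  structure(c(x, y), unit = unit_x)
}

# Mixed units are rejected instead of being silently averaged;
# tryCatch() captures the error message so the vignette still builds.
result <- tryCatch(
  mean(combine_same_unit(gdp_mio_eur, gdp_eur)),
  error = function(e) conditionMessage(e)
)
result
```

Without the unit attribute, `mean(c(gdp_mio_eur, gdp_eur))` would return a number with no warning, even though it mixes values that differ by six orders of magnitude.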

## A Final Structure, Ready for Export

The enriched `dataset_df` object can be serialized to RDF using:

```{r serialisation}
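# Reshape the dataset into a table of subject (s), predicate (p), object (o) triples
triples <- dataset_to_triples(small_country_dataset)

# Format each subject-predicate-object row as N-Triples
n_triples(mapply(n_triple, triples$s, triples$p, triples$o))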
```
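
For readers new to RDF, the shape of a single N-Triples statement can be sketched in base R. This is an illustration of the format only, not of the package internals: the helper `format_ntriple()` and the subject URI are hypothetical, and using the column's concept URI as the predicate is an assumption made for the example.

```{r ntriplesketch}
# Hypothetical helper: format one statement whose object is a typed literal.
format_ntriple <- function(s, p, o, datatype) {
  sprintf('<%s> <%s> "%s"^^<%s> .', s, p, o, datatype)
}

format_ntriple(
  s = "http://example.com/dataset#obs1", # hypothetical subject URI
  p = "http://data.europa.eu/83i/aa/GDP", # concept URI of the gdp column
  o = 3897,
  datatype = "http://www.w3.org/2001/XMLSchema#double"
)
```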

This serialization supports export to:

-   **Wikibase** via `wbdataset`

-   **RDF Data Cube** via `datacube`

-   **DataCite or DCAT** metadata formats

This vignette describes the final conceptual structure of `dataset_df` ahead of its rOpenSci submission. Future work will build on this foundation without breaking it.

## Summary: Why Use `dataset_df`

| Feature            | What It Adds                           |
|--------------------|----------------------------------------|
| `label`            | Human-readable variable name           |
| `unit`             | Explicit unit (e.g., `MIO_EUR`)        |
| `concept`          | URI identifying what is measured       |
| `subject`          | Dataset-level topical classification   |
| `namespace`        | Base URI for RDF subject identifiers   |
| `dataset_bibentry` | Bibliographic metadata via Dublin Core |

The `dataset_df` class is designed to remain fully compatible with the **tidyverse** data workflow, while offering a metadata structure suitable for:

-   **Receiving SDMX-style statistical data** into R

-   **Exporting semantically meaningful datasets** to DCAT, RDF, or Wikibase

-   **Complying with open science repository requirements** (e.g., DataCite, Zenodo)

Start tidy. Stay meaningful. Embrace `dataset_df`.