---
title: "DSIR"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{DSIR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = TRUE
)
```

```{r setup}
library(DSIR)
library(dplyr)
library(ggplot2)
```

DSIR is a small R package for global health data work. It consists of WHO Member State metadata, lightweight clients for the GHO and UN SDG APIs, and reusable WHO-style `ggplot2` and `flextable` themes.

DSIR is designed for health professionals, WHO staff, and global health researchers: users who repeat the same routine tasks every day.

This vignette walks through the typical workflow: looking up countries, fetching data from GHO and SDG, cleaning the raw response, and producing publication-style charts and tables.

## WHO Member State metadata

The `who_countries` tibble lists all 194 WHO Member States with their ISO3, ISO2, UN M49 codes, official names, short names, and WHO region. For Western Pacific countries, an extra column `is_pic` identifies the 14 Pacific Island Countries.

```{r}
who_countries
```

For convenience, DSIR offers pre-defined vectors of ISO3 codes for each WHO region.

```{r}
wpro_cty
length(wpro_cty) # 28 Member States in WPR (since May 2025)
```

The `is_pic` flag is useful because Pacific Island Countries are often analysed as a group, given their distinct demographic and geographic profiles.

```{r}
who_countries |>
  filter(is_pic) |>
  select(iso3, name_short)
```

When you have a vector of ISO3 codes and need to know which WHO region each belongs to, `iso3_to_region()` provides the lookup. It is vectorised and returns `NA` for codes that do not match a WHO Member State.

```{r}
iso3_to_region(c("PHL", "FRA", "ZAF", "USA", "XYZ"))
# "WPR" "EUR" "AFR" "AMR" NA
```

This is convenient when joining external datasets (which often arrive keyed only by ISO3) to the WHO regional structure.
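
The lookup-and-join pattern described above can be sketched in base R with `match()`. This is only an illustration under stated assumptions: the four-row table `lookup`, the helper `lookup_region()`, and the values in `ext` are all hypothetical, and DSIR's internal implementation may differ.

```r
# Hypothetical miniature lookup table (a stand-in for who_countries)
lookup <- data.frame(
  iso3   = c("PHL", "FRA", "ZAF", "USA"),
  region = c("WPR", "EUR", "AFR", "AMR")
)

# Vectorised lookup: match() returns NA for codes that are not in the
# table, and indexing with NA yields NA in the result
lookup_region <- function(codes) {
  lookup$region[match(codes, lookup$iso3)]
}

# External dataset keyed only by ISO3 (made-up values)
ext <- data.frame(iso3 = c("PHL", "XYZ", "FRA"), value = c(1.2, 3.4, 5.6))

# Attach the region; the unmatched code "XYZ" gets NA
ext$who_region <- lookup_region(ext$iso3)
ext
```

Because unmatched codes surface as `NA` rather than being dropped, a quick `sum(is.na(ext$who_region))` after the join shows how many rows fall outside the Member State list.
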

The companion helper `iso3_to_m49()` converts ISO3 codes to UN M49 numeric codes. This is useful because the WHO GHO API is keyed by ISO3 (`"PHL"`) while the UN SDG API is keyed by M49 (`"608"`). The M49 values are returned as three-character zero-padded strings, exactly as stored in `who_countries$m49_code`.

```{r}
iso3_to_m49(c("PHL", "FRA", "JPN"))
# "608" "250" "392"

# Case-insensitive; non-Member areas return NA
iso3_to_m49(c("phl", "PRI"))
# "608" NA
```

In practice you can usually skip the explicit conversion: `sdg_data()` and `sdg_coverage()` accept ISO3 codes for their `area` argument and do the lookup internally (see the SDG section below).

## Checking availability before fetching

GHO has thousands of indicators, but any single indicator may not cover the countries or years you need. Before issuing a full download with `gho_data()`, three lightweight helpers let you ask the server what is available without transferring any observations.

`gho_has_data()` is a quick yes / no for a given indicator and filter, useful when screening a list of candidate indicators.

```{r}
# Does WHO have life-expectancy data for France?
gho_has_data("WHOSIS_000001", area = "FRA")
# TRUE

# Bulk-screen several indicators at once
inds <- c("WHOSIS_000001", "NCDMORT3070", "MDG_0000000026")
vapply(inds, gho_has_data, logical(1), area = "PHL")
```

It returns `TRUE`, `FALSE`, or `NA` (for request failures, including a non-existent indicator code, for which GHO returns HTTP 404).

`gho_count()` returns the number of rows the same filter would produce, which is useful for sizing a download.

```{r}
gho_count("WHOSIS_000001", area = wpro_cty)
```

`gho_coverage()` summarises year coverage and observation counts per country. The payload is small because only `SpatialDim` and `TimeDim` are requested from the server.

```{r}
gho_coverage("WHOSIS_000001", area = c("FRA", "DEU", "JPN"))
#>   location year_min year_max n_obs
#> 1      DEU     2000     2021    66
#> 2      FRA     2000     2021    66
#> 3      JPN     2000     2021    66
```

## Fetching indicator data from GHO

To fetch indicators from GHO, the typical workflow is three steps: search for the indicator code, fetch the data, then clean the response. The `area` argument accepts a long ISO3 vector, so a whole region can be pulled in one call.

### Step 1: Search for an indicator

```{r}
gho_indicators("UHC") |>
  head()
```

Pick an `IndicatorCode` from the result. This is the value you pass to `gho_data()` in the next step.

### Step 2: Fetch the data

```{r}
uhc <- gho_data(
  indicator = "UHC_INDEX_REPORTED",
  spatial_type = "country",
  area = wpro_cty,
  year_from = 2015
)

uhc |> glimpse()
```

Here `area = wpro_cty` pulls all 28 WPR Member States in one call.

### Step 3: Clean the raw response

`gho_clean()` produces the **unified DSIR cleaned-indicator schema**, the same 15-column shape as `sdg_clean()`. Columns include `source` (`"gho"`), `id`, `indicator`, `location`, `iso3`, `location_name` (empty for GHO), `year`, `value`, `value_num`, `low`, `high`, `series` (empty for GHO), and the three optional GHO dimensions `dim1`–`dim3`. Columns absent from the raw response are filled with typed `NA`.

```{r}
uhc_clean <- gho_clean(uhc)
uhc_clean
```

## Aggregating indicators with geomean()

Some health indicators are constructed as the geometric mean of component values rather than the arithmetic mean. The UHC Service Coverage Index, for example, aggregates 14 tracer indicators using nested geometric means. DSIR provides `geomean()` for this:

```{r}
# Unweighted geometric mean
geomean(c(0.6, 0.8, 0.95))
#> approximately 0.7697 (the cube root of 0.6 * 0.8 * 0.95 = 0.456)

# With optional weights, useful when tracers have different
# methodological importance
geomean(c(0.6, 0.8, 0.95), w = c(2, 1, 1))
```

`geomean()` has documented handling for missing values, zeros, and negative values; see `?geomean` for details.
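
For readers re-deriving an index by hand, the weighted geometric mean can be written with the log-sum identity. The sketch below is a base-R illustration, not DSIR's `geomean()`: the name `geomean_sketch()` is hypothetical, weights are assumed to be normalised by their sum, and all inputs are assumed strictly positive (no handling of `NA`, zeros, or negatives).

```r
# Weighted geometric mean via the log-sum identity:
#   exp( sum(w * log(x)) / sum(w) )
# Assumption: every x is > 0 and weights are non-negative.
geomean_sketch <- function(x, w = rep(1, length(x))) {
  stopifnot(
    length(x) == length(w),
    all(x > 0),
    all(w >= 0),
    sum(w) > 0
  )
  exp(sum(w * log(x)) / sum(w))
}

# Unweighted: equals the cube root of the product
unweighted <- geomean_sketch(c(0.6, 0.8, 0.95))

# Weighted: the first component counts twice
weighted <- geomean_sketch(c(0.6, 0.8, 0.95), w = c(2, 1, 1))

c(unweighted = unweighted, weighted = weighted)
```

Working on the log scale avoids the underflow that a direct product can hit when many small components are multiplied together.
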
`geomean()` is a small helper, but it removes a common source of bugs when re-implementing index calculations from indicator components.

## Plotting with theme_dsi() and theme_dsi_facet()

DSIR provides two paired `ggplot2` themes tuned for WHO-style charts: clean panels, modest grids, and a consistent accent colour. Use them as drop-in replacements for `theme_minimal()` and `theme_bw()` respectively whenever a chart is heading into a WHO deliverable.

The rule of thumb is simple: **single-panel plots use `theme_dsi()`, faceted plots use `theme_dsi_facet()`**. The two share typography, title treatment, and legend handling, but differ in how they frame the data. The facet variant adds panel borders, light strip backgrounds, and breathing room between panels, all of which would look heavy on a single-panel chart.

### Single panel: `theme_dsi()`

`theme_dsi()` keeps the chart chrome minimal: a half-frame axis, light grid lines, and the WHO-blue accent on the axis line. By default the grid runs in both directions; pass `grid = "y"` for the minimalist horizontal-only look.

```{r, fig.width = 7, fig.height = 4}
uhc_clean |>
  filter(iso3 %in% c("AUS", "CHN", "PHL", "FJI")) |>
  left_join(who_countries, by = "iso3") |>
  ggplot(aes(x = year, y = value_num, group = iso3, color = name_short)) +
  geom_line(linewidth = .8) +
  geom_point(size = 1.8) +
  theme_dsi() +
  labs(
    title = "UHC Service Coverage Index, selected WPR Member States",
    subtitle = "2015 onwards",
    x = NULL,
    y = "SCI",
    color = NULL
  )
```

For bar charts, pair `theme_dsi()` with `scale_y_dsi_col()` (or `scale_x_dsi_col()` when `value` is mapped to `x`). These are thin wrappers around `scale_*_continuous()` that remove the lower axis expansion, so columns sit flush with the axis instead of floating above it.

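
In plain `ggplot2` terms, removing the lower axis expansion looks roughly like the sketch below. It uses only `ggplot2`'s own `scale_y_continuous()` and `expansion()`; the data frame `df` is made up, and the 5% upper padding is an arbitrary choice for illustration, not necessarily what the DSIR wrappers use.

```r
library(ggplot2)

# Made-up values for illustration
df <- data.frame(country = c("A", "B", "C"), value = c(40, 65, 80))

# expansion(mult = c(0, 0.05)): no padding below the bars and
# 5% padding above them (the upper value is an assumption)
flush <- expansion(mult = c(0, 0.05))

p <- ggplot(df, aes(country, value)) +
  geom_col(fill = "#0093D5") +
  scale_y_continuous(expand = flush)
p
```

The chunk that follows uses the DSIR wrapper instead of spelling out the expansion by hand.
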
```{r, fig.width = 7, fig.height = 4}
uhc_clean |>
  filter(year == max(year)) |>
  left_join(who_countries, by = "iso3") |>
  arrange(desc(value_num)) |>
  head(10) |>
  ggplot(aes(reorder(name_short, value_num), value_num)) +
  geom_col(fill = "#0093D5") +
  coord_flip() +
  scale_y_dsi_col() +
  theme_dsi(grid = "x") +
  labs(
    title = "UHC Service Coverage Index, top 10 WPR Member States",
    subtitle = "Latest available year",
    x = NULL,
    y = "SCI"
  )
```

### Faceted: `theme_dsi_facet()`

When the same chart is split across many small panels, the half-frame look becomes visually noisy because the accent-blue axis line repeats across every facet. `theme_dsi_facet()` switches to a full panel border, adds a light grey strip background to clearly mark each facet's label, and introduces panel spacing so adjacent panels don't run together.

```{r, fig.width = 8, fig.height = 5}
uhc_clean |>
  left_join(who_countries, by = "iso3") |>
  filter(is_pic) |>
  ggplot(aes(x = year, y = value_num)) +
  geom_line(color = "#0093D5", linewidth = 0.8) +
  geom_point(color = "#0093D5", size = 1.5) +
  facet_wrap(~ name_short, ncol = 4) +
  theme_dsi_facet() +
  labs(
    title = "UHC Service Coverage Index, Pacific Island Countries",
    subtitle = "Each panel shows one country's trajectory",
    x = NULL,
    y = "SCI"
  )
```

The `strip_fill` argument lets you change the strip background colour for emphasis, for example a light-blue tone derived from the WHO accent for a deliverable where the strips themselves carry meaning:

```{r, fig.width = 8, fig.height = 5}
uhc_clean |>
  left_join(who_countries, by = "iso3") |>
  filter(is_pic) |>
  ggplot(aes(x = year, y = value_num)) +
  geom_line(color = "#0093D5", linewidth = 0.8) +
  facet_wrap(~ name_short, ncol = 4) +
  theme_dsi_facet(strip_fill = "#E5F4FB") +
  labs(title = "UHC SCI, PIC: custom strip colour", x = NULL, y = "SCI")
```

## Tables with dsi_flextable_defaults()

`dsi_flextable_defaults()` sets WHO-style defaults for `flextable` globally: booktabs theme, bold headers, modest padding.
Call it once near the top of your report and every subsequent `flextable()` picks up the formatting.

```{r}
library(flextable)

dsi_flextable_defaults(font_family = "Georgia")

uhc_clean |>
  filter(year == max(year)) |>
  left_join(who_countries, by = "iso3") |>
  select(name_short, value_num) |>
  arrange(desc(value_num)) |>
  flextable() |>
  set_table_properties(layout = "autofit", width = .6) |>
  set_caption("UHC SCI in WPR, latest year")
```

## Working with SDG indicators

`sdg_data()` and `sdg_clean()` follow the same fetch-then-tidy pattern as their GHO counterparts. The main differences are that indicator codes use the dotted SDG format (e.g. `"3.4.1"`) and that `value`, `low`, and `high` are kept as character. The SDG API returns non-numeric entries (`"<0.1"`, aggregate notes) for some rows, so coerce with `as.numeric()` only when you are ready to drop them.

`sdg_indicators()` accepts an optional `search` argument with the same behaviour as `gho_indicators()`: multiple keywords are AND-ed together and matched case-insensitively against the indicator description. The filter runs client-side because the UN SDG indicator list is short (~250 rows) and the endpoint is not OData.

```{r}
# All indicators that mention both mortality and cancer
sdg_indicators("mortality cancer")

# Same AND logic, but with explicit terms (allows whitespace inside a term)
sdg_indicators(c("maternal", "mortality"))
```

The `area` argument of `sdg_data()` and `sdg_coverage()` accepts either ISO3 codes (converted internally via `iso3_to_m49()`) or UN M49 numeric codes, so DSIR's regional vectors (`wpro_cty`, `afro_cty`, etc.) work directly, the same way they do with the GHO client. Do not mix the two formats in a single call.

```{r}
# ISO3: regional vector passed straight through
sdg <- sdg_data(
  indicator = "3.4.1",
  area = wpro_cty
)
sdg |> glimpse()

# M49 also works (e.g. when copy-pasting codes from sdg_areas())
sdg_data("3.4.1", area = c("608", "250"))
```

```{r}
sdg_clean(sdg)
```

`sdg_clean()` produces the same 15-column schema as `gho_clean()`, so the two outputs can be combined directly with `bind_indicators()`. SDG rows populate the `series` column (and the `iso3` column via `m49_to_iso3()` for Member States), while leaving the GHO-only `dim1`–`dim3` columns as `NA`.

### Combining GHO and SDG with bind_indicators()

When an analysis pulls indicators from both sources, `bind_indicators()` stacks any number of cleaned tibbles into one. The `source` column (`"gho"` / `"sdg"`) lets you filter or facet by origin without remembering which frame came from which API.

```{r}
# Two indicators on the same topic from different APIs:
#   GHO NCDMORT3070 (probability of premature NCD mortality)
#   SDG 3.4.1 (mortality rate from NCDs)
gho_ncd <- gho_data("NCDMORT3070", area = wpro_cty) |> gho_clean()
sdg_ncd <- sdg_data("3.4.1", area = wpro_cty) |> sdg_clean()

bind_indicators(gho_ncd, sdg_ncd) |>
  glimpse()
```

### Exploring series with sdg_coverage()

A single SDG indicator often contains several **series** (for example different vaccines, sex strata, or causes of death), each with its own country and year coverage. Indicator `"3.b.1"` (vaccine coverage) is a clear case: it is published as four separate series (DTP3, MCV2, PCV3, HPV), and the year coverage of the newer vaccines is much shorter than that of DTP3.

`sdg_coverage()` summarises the year range and observation count per `(location, series)`, so you can inspect what series exist and how each is covered before deciding which one to analyse.

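
The kind of per-`(location, series)` summary described above can be sketched in base R. This is a minimal illustration on made-up long-format data, not the `sdg_coverage()` implementation; the data frame `obs`, the helper `summarise_years()`, and the series labels `"A"` and `"B"` are hypothetical.

```r
# Made-up long-format observations: one row per (location, series, year)
obs <- data.frame(
  location = c("156", "156", "156", "156", "608", "608"),
  series   = c("A", "A", "A", "B", "A", "A"),
  year     = c(2000, 2001, 2002, 2020, 2010, 2011)
)

# Reduce one group's years to first year, last year, and row count
summarise_years <- function(y) {
  c(year_min = min(y), year_max = max(y), n_obs = length(y))
}

# aggregate() applies the summary to each (location, series) pair;
# do.call(data.frame, ...) flattens the resulting matrix column
coverage <- aggregate(year ~ location + series, data = obs, FUN = summarise_years)
coverage <- do.call(data.frame, coverage)
names(coverage) <- c("location", "series", "year_min", "year_max", "n_obs")

# Sort so that each location's series appear together
coverage <- coverage[order(coverage$location, coverage$series), ]
coverage
```

Series with a late `year_min` or a small `n_obs` stand out immediately in such a table, which is the cue for choosing which series to analyse.
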
```{r}
sdg_coverage("3.b.1", area = c("156", "608"))
#>   location      series year_min year_max n_obs
#> 1      156 SH_ACS_DTP3     2000     2023    24
#> 2      156  SH_ACS_HPV     2018     2023     6
#> 3      156 SH_ACS_MCV2     2000     2023    24
#> 4      156 SH_ACS_PCV3     2017     2023     7
#> 5      608 SH_ACS_DTP3     2000     2023    24
#> 6      608  SH_ACS_HPV     2017     2023     7
#> 7      608 SH_ACS_MCV2     2000     2023    24
#> 8      608 SH_ACS_PCV3     2014     2023    10
```

Note that DSIR intentionally does *not* provide SDG analogues of `gho_has_data()` and `gho_count()`. SDG data is generally complete enough that those screening helpers add little value; the more useful pre-analysis question for SDG is "which series are available?", which is what `sdg_coverage()` answers.

## Where to next

- Source code is hosted on GitHub.
- Bug reports, feature requests, and pull requests are all welcome; please file them on the GitHub issue tracker.