--- title: "Introduction to the cellKey-Package" author: "Bernhard Meindl" date: "2025-09-03" output: rmarkdown::html_vignette: toc: true toc_depth: 5 number_sections: false vignette: > %\VignetteIndexEntry{Introduction to the cellKey-Package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## About the cellKey package This package implements methods to provide perturbation for statistical tables. The implementation if greatly inspired by the the paper *Methodology for the Automatic Confidentialisation of Statistical Outputs from Remote Servers at the Australian Bureau of Statistics* (Thompson, Broadfoot, Elazar). This approach however was generalized and a new specification on how record keys are specified and the lookup-tables are defined is used. This package makes usage of perturbation tables that can be defined in the [**ptable**](https://github.com/sdcTools/ptable) package. This document describes the usage of version `1.0.3` of the package. ## Main Features In the **cellKey** package it is possible to pertub multidimensional count and magnitude tables. Functionality to generate suitable record keys is provided in function `ck_generate_rkeys()`. Using sha1-checksums on the input datasets, we can make sure that the same record keys are generated whenever the same input data set is used for which record keys should be generated. However, it is also possible of course to use already pre-generated record keys available as variable in the microdata set. The package allows to make use of sampling weights. Thus, both weighted and unweighted count and magnitude tables can be perturbed. The hierarchical structure of the variables spanning the table can be arbitrarily complex. To generate these hierarchies, the **cellKey** packages depends on functionality available in package [**sdcHierarchies**](https://cran.r-project.org/package=sdcHierarchies). Finally, auxiliary methods are provided that allow to extract valuable information from objects created by the main function of this package `perturbTable()`. For example `mod_counts()` returns a data object showing for each cell the perturbation value and where it was found in the lookup table. What is left is of course identifying bugs and issues and also to optimize performance of the package as well as to include some real world examples, eg. census tables. ## An Example We now show the capabilities of the **cellKey** package by running an example that ### Load the Package ```r library(cellKey) packageVersion("cellKey") ``` ``` ## [1] '1.0.3' ``` The first step in this approach is to generate a statistical table, which can be achieved using `ck_setup`. This function generates an object that contains all the information required to perturb count- (and optionally continuously scaled) variables. There are however a few inputs that `ck_setup()` requires and that need to be generated beforehand. These inputs are: - `x`: a `data.frame` or `data.table` containing microdata - `rkey`: defines where to find or how to compute record keys. It is possible to either specify a scalar character name or a number. If a number is specified, record keys are internally created by sampling from a uniform distribution between 0 and 1 with the random numbers rounded to the specified value of digits. If a name is specified, this name is interpreted as a variable name within `x` already containing record keys. - `dims`: a named list where each list element contains a hierarchy created with functionality from package [**sdcHierarchies**](https://cran.r-project.org/package=sdcHierarchies) and each name refers to a variable in `x` - `w`: either `NULL` (no weights) or the name of a variable in `x` that contains weights - `countvars`: an optional character vector of variables in `x` holding counts. In any case a special count variable `total` is internally generated that is `1` for each row of `x` - `numvars`: an optional character vector of variables in `x` holding numerical variables that can later be perturbed. We continue to show how to generate the required inputs. The first step is to prepare the inputdata. ### Specifying inputdata The testdata set we are using in this example contains information on person-level along with sampling weights as well as some categorical and continuously scaled variables. We also compute some binary variables for which we also want to create perturbed tables. ```r dat <- ck_create_testdata() dat <- dat[, c("sex", "age", "savings", "income", "sampling_weight")] dat[, cnt_highincome := ifelse(income >= 9000, 1, 0)] ``` We are now adding record keys (which will be referenced later) with 7 digits to the data set using function `ck_generate_rkeys` as shown below: ```r dat$rkeys <- ck_generate_rkeys(dat = dat, nr_digits = 7) print(head(dat)) ``` ``` ## sex age savings income sampling_weight cnt_highincome rkeys ## 1: male age_group3 12 5780 57 0 0.7552293 ## 2: female age_group3 28 2530 48 0 0.3716140 ## 3: male age_group1 550 6920 67 0 0.9195144 ## 4: male age_group1 870 7960 48 0 0.4260026 ## 5: male age_group4 20 9030 55 1 0.5450293 ## 6: female age_group3 102 3290 79 0 0.1694579 ``` To ensure that the same record keys are computed each and every time for the same data set, a seed based on the sha1-hash of the input dataset is computed by default in `ck_generate_rkeys`. This seed is used before sampling the record keys and may be overwritten using the `seed` argument. The goal of this introduction is to create a perturbed table of counts of variables `sex` by `age` for all observations as well as for the subgroups given by `cnt_highincome` that are non-zero. We also want to create perturbed tables of continously scaled variables `savings` and `income` also giving the hierarchical structure defined by `sex` by `age`. ### Specifying dimensions It is required to define hierarchies for each of the classifying variables of the desired table. There are two ways how the hierarchical structure of these variables (including sub-totals) can be specified. One way is to use `"@; value` format as (also) used in [**sdcTable**](https://cran.r-project.org/package=sdcTable). We would suggest however, to use the second alternative which is using functionality from package [**sdcHierarchies**](https://cran.r-project.org/package=sdcHierarchies). In this vignette, we only show the preferred way to generate hierarchies for categorical variables `age` and `sex` below: ```r dim_sex <- hier_create(root = "Total", nodes = c("male", "female")) hier_display(dim_sex) ``` ``` ## Total ## ├─male ## └─female ``` For variable `age` the process is very much the same: ```r dim_age <- hier_create(root = "Total", nodes = paste0("age_group", 1:6)) hier_display(dim_age) ``` ``` ## Total ## ├─age_group1 ## ├─age_group2 ## ├─age_group3 ## ├─age_group4 ## ├─age_group5 ## └─age_group6 ``` The idea of [**sdcHierarchies**](https://github.com/bernhard-da/sdcHierarchies) is to create a tree object using `hier_create()` and then add (using `hier_add()`), delete (with `hier_delete()`) or rename (using `hier_rename()`) elements from this hierarchical structure. For more examples have a look at the package vignette of the package which can be accessed with `sdcHierarchies::hier_vignette()`. [**sdcHierarchies**](https://github.com/bernhard-da/sdcHierarchies) also contains a shiny-based app that can be called with `hier_app()` which allows to interactively change and modify a hierarchy and allows to convert tree- and data.frame based inputs into various formats. For more information, have a look at `?hier_app` and the other help-files of the package. After all dimensions have been specified, these inputs must then be combined into a named list. In this object, the list-names refer to variable names of the input data set and the elements the data objects that hold the hierarchy specification itself. This is why the list elements of `dims` in this example have to be named `sex` and `age` as the specification refers to variables `age` and `sex` in input set `dat`. ```r dims <- list(sex = dim_sex, age = dim_age) ``` ### Setup a table instance We now have prepared the inputs and can define a generic statistical table using `ck_setup()` as shown below: ```r tab <- ck_setup( x = dat, rkey = "rkeys", dims = dims, w = "sampling_weight", countvars = "cnt_highincome", numvars = c("income", "savings")) ``` ``` ## computing contributing indices | rawdata <--> table; this might take a while ``` `ck_setup()` returns a `R6` class object that contains not only the relevant data but also all available methods. Thus, it is not required to assign the results of such methods to new objects, instead, the object itself is automatically updated. These objects also have a custom print method, showing some general information about the object: ```r print(tab) ``` ``` ## ── Table Information ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## ✔ 21 cells in 2 dimensions ('sex', 'age') ## ✔ weights: yes ## ── Tabulated / Perturbed countvars ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## ☐ 'total' ## ☐ 'cnt_highincome' ## ── Tabulated / Perturbed numvars ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## ☐ 'income' ## ☐ 'savings' ``` ### Defining perturbation parameters #### Perturbation parameters for count variables The next task is to define parameters that are used to perturb count variables which can be achieved with `ck_params_cnts`. This function requires as input the result of either `pt_create_pParams`, `pt_create_pTable` or `create_cnt_ptable` from the [**ptable**](https://github.com/sdcTools/ptable) package. Please refer also to the documentation of this package for information on the required parameters. In this example we are going to use - amongst others - exemplary ptables that can are provided by the `ptable`-pkg for demonstration purposes: ```r # two different perturbation parameter sets from the ptable-pkg # an example ptable provided directly ptab1 <- ptable::pt_ex_cnts() # creating a ptable by specifying parameters para2 <- ptable::create_cnt_ptable( D = 8, V = 3, js = 2, pstay = 0.5, optim = 1, mono = TRUE) ``` We then need to create the required inputs for the cellKey package. ```r p_cnts1 <- ck_params_cnts(ptab = ptab1) p_cnts2 <- ck_params_cnts(ptab = para2) ``` `ck_params_cnts()` returns objects that can be used as inputs in method `params_cnts_set()`. In argument `v` one may specify count variables for which the supplied perturbation parameters should be used. If `v` is not specified, the perturbation parameters are used for all count variables. ```r # use `p_cnts1` for variable "total" (which always exists) tab$params_cnts_set(val = p_cnts1, v = "total") ``` ``` ## --> setting perturbation parameters for variable 'total' ``` ```r # use `p_cnts2` for "cnt_highincome" tab$params_cnts_set(val = p_cnts2, v = "cnt_highincome") ``` ``` ## --> setting perturbation parameters for variable 'cnt_highincome' ``` It is therefore entirely possible to use different parameter sets for different variables. Modifying perturbation parameters for some variables is easy, too. It is only required to apply the `params_cnts_set()`-method again which will replace any previously defined parameters. #### Perturbation parameters for continuous variables Setting and defining perturbation parameters for continuous variables works similarily. The required functions are `ck_params_num()` to create input objects that can be set with the `params_nums_set` method. Please note that it is possibly by specifying the `path` argument in both `ck_params_nums()` and `ck_params_cnts()` to save the parameters additionally as yaml-file. Using `ck_read_yaml()`, these files can later be imported again. This feature is useful for re-using parameter settings. The underlying framework on how to perturb continuous tables differs from the proposed method from `ABS`. One possible approach is based on a *"flex function"*. This approach (which is described in deliverable `D4.2` in the project "perturbative confidentiality methods") allows to apply different magnitude of noise to larger and smaller cells. Users can define the required parameters for the flex-approach with function `ck_flexparams()`. The required inputs are: - `fp`: the flexpoint defining at which point should the underlying noise coefficient function reach its desired maximum (which is defined by the first element of `p`) - `p`: numeric vector of length `2` with `p[1] > p[2]` where both elements specify a percentage. The first value refers to the desired maximum perturbation percentage for small cells (depending on fp) while the second element refers to the desired maximum perturbation percentage for large cells. - `epsilon`: a numeric vector in descending order with all values in `[0; 1]` and with the first element forced to equal `1`. The length of this parameter must correspond with the number of `top_k` specified in `ck_params_nums()` (which will be discussed later). ```r # parameters for the flex-function p_flex <- ck_flexparams( fp = 1000, p = c(0.3, 0.03), epsilon = c(1, 0.5, 0.2)) ``` In the [**cellKey**](https://github.com/sdcTools/cellKey) package it is possible to select the underlying data that form the base for the perturbation differently. In `ck_params_nums()` the specific approach can be selected in argument `type`. The valid choices for this argument are: - `"top_contr"`: the `k` largest contributions to each cell are used in the perturbation procedure with the number `k` required to be specified in argument `top_k` - `"mean"`: weighted cellmeans are used as starting points - `"range"`: the difference between largest and smallest unweighted contributions for each cell are used as base for the perturbation procedure - `"sum"`: weighted cellvalues are used as starting points for the perturbation Another, more basic approach, is to use a constant perturbation magnitude for all cells, independent on their (weighted) values. The required parameters can be defined with `ck_simpleparams()` as shown below: ```r # parameters for the simple approach p_simple <- ck_simpleparams( p = 0.05, epsilon = 1) ``` In this appraoch it is only required to specify a single percentage value `p` and - as in the case for the flex function - a vector of epsilons that are used in the case when `top_k > 1`. Further important parameters for `ck_params_nums()` are: - `mu_c`: an extra amount of perturbation applied to sensitive cells (restricted to the first of `top_k` noise components). In the following example we demonstrate how to identify sensitive cells for numeric variables. - `same_key`: a logical value specifying if the original cell key (`TRUE`) should be used for the lookup of the largest contributor of a cell or if a perturbation of the cellkey itself (`FALSE`) should take place. - `use_zero_rkeys`: a logical value defining if record keys of units not contributing to a specific numeric variables should be used (`TRUE`) or ignored (`FALSE`) when cell keys are computed. A very important parameter is `ptab` which actually holds the perturbation tables in which perturbation values are looked up. This input can be specified differently in the case when numeric variables should be perturbed. It can be either an object derived from `ptable::pt_create_pTable(..., table = "nums")` in the most simple case. More advanced is to supply a named list, where the allowed names are shown below and each element must be the output of `ptable::pt_create_pTable(..., table = "nums")`. - `"all"`: this ptable will be used for all cells; if specified, list-elements named `"even"`or `"odd"` are ignored - `"even"`: this perturbation table will be used to look up perturbation values for cells with an even number of contributors * `"odd"`: will be used to look up perturbation values for cells with an odd number of contributors * `"small_cells"`: if specified, this ptable will be used to extract perturbation values for very small cells Please note, that if the goal is to have different perturbation tables for cells with an even/odd number of contributors, both `"even"` or `"odd"` must be available in the input list. In the chunk below we create four different perturbation tables. For details on the parameters, please look at the documentation of the [`ptable`](https://github.com/sdcTools/ptable) package, especially in `ptable::create_num_ptable()`. ```r # same ptable for all cells except for very small ones ex_ptab1 <- ptable::pt_ex_nums(parity = TRUE, separation = TRUE) ``` We can now use these tables to finally create objects containing all the required information to create perturbed magnitude tables using `ck_params_nums`. In the first case we want the same perturbation table (`ptab_all`) for cells with an even/odd number of contributors but want to use `ptab_sc` for very small cells. ```r p_nums1 <- ck_params_nums( type = "top_contr", top_k = 3, ptab = ex_ptab1, mult_params = p_flex, mu_c = 2, same_key = FALSE, use_zero_rkeys = TRUE) ``` The second input we generate should use different ptables for cells with an even/odd number of contributing units (`ptab_even` and `ptab_odd`) but should not use a specific perturbation table for very small cells. ```r ex_ptab2 <- ptable::pt_ex_nums(parity = FALSE, separation = FALSE) ``` As above, we need to use `ck_params_nums()` to compute suitable inputs. ```r p_nums2 <- ck_params_nums( type = "mean", ptab = ex_ptab2, mult_params = p_simple, mu_c = 1.5, same_key = FALSE, use_zero_rkeys = TRUE) ``` The package internally computes the separation point that is used for very small cells in case this is required. Details on this can also be found in deliverable `D4.2` from the "perturbative confidentiality methods" project. Now we can attach the results from `ck_params_nums()` to numeric variables using the `params_nums_set()`-method as shown below: ```r tab$params_nums_set(v = "income", val = p_nums1) ``` ``` ## --> setting perturbation parameters for variable 'income' ``` ```r tab$params_nums_set(v = "savings", val = p_nums1) ``` ``` ## --> setting perturbation parameters for variable 'savings' ``` In order to make use of parameter `mu_c` that allows ab add extra amount of protection to sensitive cells, one may identify sensitive cells according to some rules. The following methods to identify sensitive cells are implemented: - `supp_p()`: identify sensitive cells based on p%-rule - `supp_pp()`: identify sensitive cells based on pq%-rule - `supp_nk()`: identify sensitive cells based on nk-dominance rule - `supp_freq()`: identify sensitive cells based on minimal frequencies for (weighted) number of contributors - `supp_val()`: identify sensitive cells based on (weighted) cell values - `supp_cells()`: identify sensitive cells based on their "names" We now want to set all cells for variable `income` as sensitive to which less than `15` units contribute. ```r tab$supp_freq(v = "income", n = 15, weighted = FALSE) ``` ``` ## freq-rule: 3 new sensitive cells (incl. duplicates) found (total: 3) ``` To set specific cells independent on values but their names, one may use the `$supp_cells()`-method. This cell requires a `data.frame` as input that contains a column for each dimensional variable specified. Each row of this input is considered as a cell where `NAs` are used as placeholders that match any characteristic of the relevant variable. Using the `data.frame` `inp` show below, the programm would suppress the following cells: - `female` x `age_group1` - `male` x `age_group3` - `male` x any age group available in the data ```r inp <- data.frame( "sex" = c("female", "male", "male"), "age" = c("age_group1", "age_group3", NA) ) ``` ### Compute perturbed tables It is now possible to finally perturbed variables using the `perturb()`-method. As `tab` - the object created with `ck_setup()` - already contains all possible data, the only required input is the name of a variable that should be perturbed. - `v`: a character vector specifying one or more variables that should be perturbed. ```r tab$perturb(v = "total") ``` ``` ## Count variable 'total' was perturbed. ``` After this call, object `tab` is updated and contains now also perturbed values for variable `total`. We note that no explicit assignment is required. The following code shows that we also can perturb cnt- and numerical variables in one single call. ```r tab$perturb(v = c("cnt_highincome", "savings", "income")) ``` ``` ## Count variable 'cnt_highincome' was perturbed. ``` ``` ## Numeric variable 'savings' was perturbed. ``` ``` ## Numeric variable 'income' was perturbed. ``` A `data.table` containing original and perturbed values can now be extracted using the `freqtab()`- and `numtab()` methods as discussed next. ### Extracting results #### Obtain perturbed tables for count tables Applying the `freqtab()`-method to one ore more already perturbed variables returns a `data.table` that contains for each table cell the unpertubed and perturbed (weighted and/or unweighted) counts. This function has the following arguments: - `v`: one or more variable names of already perturbed count variables - `path`: if not `NULL`, a (relative or absolute) path to which the resulting output table should be written. A `csv` file will be generated and `.csv` will be appended to the value provided. The method returns a `data.table` with all combinations of the dimensional variables in the first $n$ columns and after those the following columns: - `vname`: name of the perturbed variable - `uwc`: unweighted counts - `wc`: weighted counts - `puwc`: perturbed unweighted counts - `pwc`: perturbed weighted counts ```r tab$freqtab(v = c("total", "cnt_highincome")) ``` ``` ## sex age vname uwc wc puwc pwc ## 1: Total Total total 4580 274539 4580 274539.0000 ## 2: Total age_group1 total 1969 118266 1971 118386.1280 ## 3: Total age_group2 total 1143 68240 1141 68120.5949 ## 4: Total age_group3 total 864 52423 865 52483.6748 ## 5: Total age_group4 total 423 25024 424 25083.1584 ## 6: Total age_group5 total 168 9783 167 9724.7679 ## 7: Total age_group6 total 13 803 14 864.7692 ## 8: male Total total 2296 138169 2298 138289.3563 ## 9: male age_group1 total 1015 61280 1014 61219.6256 ## 10: male age_group2 total 571 33840 572 33899.2644 ## 11: male age_group3 total 424 25818 424 25818.0000 ## 12: male age_group4 total 195 11762 193 11641.3641 ## 13: male age_group5 total 84 4969 83 4909.8452 ## 14: male age_group6 total 7 500 6 428.5714 ## 15: female Total total 2284 136370 2285 136429.7067 ## 16: female age_group1 total 954 56986 955 57045.7338 ## 17: female age_group2 total 572 34400 571 34339.8601 ## 18: female age_group3 total 440 26605 440 26605.0000 ## 19: female age_group4 total 228 13262 229 13320.1667 ## 20: female age_group5 total 84 4814 86 4928.6190 ## 21: female age_group6 total 6 303 7 353.5000 ## 22: Total Total cnt_highincome 445 27952 445 27952.0000 ## 23: Total age_group1 cnt_highincome 192 11711 192 11711.0000 ## 24: Total age_group2 cnt_highincome 123 7970 121 7840.4065 ## 25: Total age_group3 cnt_highincome 82 5241 84 5368.8293 ## 26: Total age_group4 cnt_highincome 34 2068 33 2007.1765 ## 27: Total age_group5 cnt_highincome 14 962 15 1030.7143 ## 28: Total age_group6 cnt_highincome 0 0 0 0.0000 ## 29: male Total cnt_highincome 219 13740 219 13740.0000 ## 30: male age_group1 cnt_highincome 90 5554 89 5492.2889 ## 31: male age_group2 cnt_highincome 66 4249 67 4313.3788 ## 32: male age_group3 cnt_highincome 41 2596 38 2406.0488 ## 33: male age_group4 cnt_highincome 15 826 13 715.8667 ## 34: male age_group5 cnt_highincome 7 515 5 367.8571 ## 35: male age_group6 cnt_highincome 0 0 0 0.0000 ## 36: female Total cnt_highincome 226 14212 226 14212.0000 ## 37: female age_group1 cnt_highincome 102 6157 102 6157.0000 ## 38: female age_group2 cnt_highincome 57 3721 57 3721.0000 ## 39: female age_group3 cnt_highincome 41 2645 43 2774.0244 ## 40: female age_group4 cnt_highincome 19 1242 16 1045.8947 ## 41: female age_group5 cnt_highincome 7 447 7 447.0000 ## 42: female age_group6 cnt_highincome 0 0 0 0.0000 ## sex age vname uwc wc puwc pwc ``` #### Obtain perturbed tables for magnitude tables Using the `numtab()`-method allows to extract results for continuous variables. The required inputs are the same as for the `freqtab()`-method and the output returns a `data.table` with all combinations of the dimensional variables in the first $n$ columns and the following additional columns: - `vname`: name of the perturbed variable - `uws`: unweighted sum - `ws`: weighted cellsum - `pws`: perturbed weighted sum of the given cell We now have a look at the results of the variables `savings` and `income` that we already have perturbed. ```r tab$numtab(v = c("savings", "income")) ``` ``` ## sex age vname uws ws pws ## 1: Total Total savings 2273532 136080418 136083334.6 ## 2: Total age_group1 savings 982386 59144910 59145634.6 ## 3: Total age_group2 savings 552336 33029243 33020004.3 ## 4: Total age_group3 savings 437101 26414476 26414183.8 ## 5: Total age_group4 savings 214661 12446392 12450800.0 ## 6: Total age_group5 savings 80451 4652696 4649450.9 ## 7: Total age_group6 savings 6597 392701 387830.2 ## 8: male Total savings 1159816 69597198 69597973.2 ## 9: male age_group1 savings 517660 31321038 31319611.9 ## 10: male age_group2 savings 280923 16698089 16701276.7 ## 11: male age_group3 savings 214970 13022302 13020960.2 ## 12: male age_group4 savings 99420 5898725 5897903.7 ## 13: male age_group5 savings 43233 2424795 2422438.3 ## 14: male age_group6 savings 3610 232249 235965.8 ## 15: female Total savings 1113716 66483220 66480940.2 ## 16: female age_group1 savings 464726 27823872 27824463.1 ## 17: female age_group2 savings 271413 16331154 16329804.8 ## 18: female age_group3 savings 222131 13392174 13384080.8 ## 19: female age_group4 savings 115241 6547667 6543530.2 ## 20: female age_group5 savings 37218 2227901 2228793.7 ## 21: female age_group6 savings 2987 160452 159026.2 ## 22: Total Total income 22952978 1383502675 1383529374.5 ## 23: Total age_group1 income 9810547 592603578 592606204.7 ## 24: Total age_group2 income 5692119 342438723 342358132.6 ## 25: Total age_group3 income 4406946 266030114 266027152.7 ## 26: Total age_group4 income 2133543 128527130 128580753.7 ## 27: Total age_group5 income 848151 50260125 50213557.5 ## 28: Total age_group6 income 61672 3643005 3523883.0 ## 29: male Total income 11262049 678098383 678119173.7 ## 30: male age_group1 income 4877164 294298901 294309048.5 ## 31: male age_group2 income 2811379 167003473 167018345.8 ## 32: male age_group3 income 2168169 130191360 130158511.4 ## 33: male age_group4 income 978510 60622639 60618171.8 ## 34: male age_group5 income 393134 23699275 23669073.3 ## 35: male age_group6 income 33693 2282735 2389104.5 ## 36: female Total income 11690929 705404292 705401451.7 ## 37: female age_group1 income 4933383 298304677 298301046.4 ## 38: female age_group2 income 2880740 175435250 175426717.7 ## 39: female age_group3 income 2238777 135838754 135753178.2 ## 40: female age_group4 income 1155033 67904491 67849918.6 ## 41: female age_group5 income 455017 26560850 26560686.8 ## 42: female age_group6 income 27979 1360270 1308671.0 ## sex age vname uws ws pws ``` ### Utility-measures #### Utility measures for count variables Method `measures_cnts()` allows to compute information loss measures for perturbed count variables. Its application is as simple as: ```r tab$measures_cnts(v = "total", exclude_zeros = TRUE) ``` ``` ## $overview ## noise cnt pct ## 1: -2 3 0.1428571 ## 2: -1 8 0.3809524 ## 3: 0 3 0.1428571 ## 4: 1 5 0.2380952 ## 5: 2 2 0.0952381 ## ## $measures ## what d1 d2 d3 ## 1: Min 0.000 0.000 0.000 ## 2: Q10 0.000 0.000 0.000 ## 3: Q20 1.000 0.001 0.016 ## 4: Q30 1.000 0.001 0.017 ## 5: Q40 1.000 0.001 0.021 ## 6: Mean 1.095 0.022 0.049 ## 7: Median 1.000 0.002 0.023 ## 8: Q60 1.000 0.002 0.030 ## 9: Q70 1.000 0.006 0.039 ## 10: Q80 2.000 0.012 0.072 ## 11: Q90 2.000 0.077 0.136 ## 12: Q95 2.000 0.143 0.196 ## 13: Q99 2.000 0.162 0.196 ## 14: Max 2.000 0.167 0.196 ## ## $cumdistr_d1 ## cat cnt pct ## 1: 0 3 0.1428571 ## 2: 1 16 0.7619048 ## 3: 2 21 1.0000000 ## ## $cumdistr_d2 ## cat cnt pct ## 1: [0,0.02] 17 0.8095238 ## 2: (0.02,0.05] 18 0.8571429 ## 3: (0.05,0.1] 19 0.9047619 ## 4: (0.1,0.2] 21 1.0000000 ## 5: (0.2,0.3] 21 1.0000000 ## 6: (0.3,0.4] 21 1.0000000 ## 7: (0.4,0.5] 21 1.0000000 ## 8: (0.5,Inf] 21 1.0000000 ## ## $cumdistr_d3 ## cat cnt pct ## 1: [0,0.02] 7 0.3333333 ## 2: (0.02,0.05] 15 0.7142857 ## 3: (0.05,0.1] 17 0.8095238 ## 4: (0.1,0.2] 21 1.0000000 ## 5: (0.2,0.3] 21 1.0000000 ## 6: (0.3,0.4] 21 1.0000000 ## 7: (0.4,0.5] 21 1.0000000 ## 8: (0.5,Inf] 21 1.0000000 ## ## $false_zero ## [1] 0 ## ## $false_nonzero ## [1] 0 ## ## $exclude_zeros ## [1] TRUE ``` This function returns a list with several utility measures that are now discussed. For further information have a look at `?ck_cnt_measures` as the same set of measures can also be computed for two vectors containing original and perturbed values. - `overview`: a `data.table` with the following three columns: * noise: amount of noise computed as orig - pert * cnt: number of cells perturbed with the value given in column noise * pct: percentage of cells perturbed with the value given in column noise - `measures`: a `data.table` containing measures of the distribution of three different distances between original and perturbed values of the unweighted counts. Column `what` specifies the computed measure. The three distances considered are: * `d1`: absolute distance between original and masked values * `d2`: relative absolute distance between original and masked values * `d3`: absolute distance between square-roots of original and perturbed values - `cumdistr_d1`, `cumdistr_d2` and `cumdistr_d3`: for each distance d1, d2 and d3, a data.table with the following three columns: * `cat`: a specific value (for d1) or interval (for distances d2 and d3) * `cnt`: number of records smaller or equal the value in column cat for the given distance * `pct`: proportion of records smaller or equal the value in column cat for the selected distance - `false_zero`: number of cells that were perturbed to zero - `false_nonzero`: number of cells that were initially zero but have been perturbed to a number different from zero If argument `exclude_zeros` is `TRUE` (the default setting), empty cells are excluded when computing distance-based measures `d1`, `d2` and `d3`. #### Utility measures for continuous variables For now, no utility measures for continuous variables are available. This will change in a future version. ### Additional Features Finally we note that there are `print` and `summary` methods implemented for objects created with from `ck_setup()` which can be used as shown below: The `print()`-method shows the dimension of the table as well as the variables that are (or possibly can be) perturbed. It is also displayed if the table was constructed using weights. ```r tab$print() # same as (print(tab)) ``` ``` ## ── Table Information ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## ✔ 21 cells in 2 dimensions ('sex', 'age') ## ✔ weights: yes ## ── Tabulated / Perturbed countvars ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## ☒ 'total' (perturbed) ## ☒ 'cnt_highincome' (perturbed) ## ── Tabulated / Perturbed numvars ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## ☒ 'income' (perturbed) ## ☒ 'savings' (perturbed) ``` The `summary`-method shows some utility statistics for already perturbed variables as shown below: ```r tab$summary() ``` ``` ## ┌──────────────────────────────────────────────┐ ## │Utility measures for perturbed count variables│ ## └──────────────────────────────────────────────┘ ## ── Distribution statistics of perturbations ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## countvar Min Q10 Q20 Q30 Q40 Mean Median Q60 Q70 Q80 Q90 Q95 Q99 Max ## 1: total -2 -1 -1 -1 0 0.238 1 1 1 1 2 2 2 2 ## 2: cnt_highincome -3 -2 -2 -1 0 -0.381 0 0 0 0 1 2 2 2 ## ## ── Distance-based measures ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## ✔ Variable: 'total' ## ## what d1 d2 d3 ## 1: Min 0.000 0.000 0.000 ## 2: Q10 0.000 0.000 0.000 ## 3: Q20 1.000 0.001 0.016 ## 4: Q30 1.000 0.001 0.017 ## 5: Q40 1.000 0.001 0.021 ## 6: Mean 1.095 0.022 0.049 ## 7: Median 1.000 0.002 0.023 ## 8: Q60 1.000 0.002 0.030 ## 9: Q70 1.000 0.006 0.039 ## 10: Q80 2.000 0.012 0.072 ## 11: Q90 2.000 0.077 0.136 ## 12: Q95 2.000 0.143 0.196 ## 13: Q99 2.000 0.162 0.196 ## 14: Max 2.000 0.167 0.196 ## ## ✔ Variable: 'cnt_highincome' ## ## what d1 d2 d3 ## 1: Min 0.000 0.000 0.000 ## 2: Q10 0.000 0.000 0.000 ## 3: Q20 0.000 0.000 0.000 ## 4: Q30 0.000 0.000 0.000 ## 5: Q40 0.800 0.009 0.042 ## 6: Mean 1.111 0.048 0.109 ## 7: Median 1.000 0.016 0.074 ## 8: Q60 1.200 0.025 0.094 ## 9: Q70 2.000 0.047 0.129 ## 10: Q80 2.000 0.072 0.205 ## 11: Q90 2.300 0.141 0.295 ## 12: Q95 3.000 0.177 0.367 ## 13: Q99 3.000 0.264 0.401 ## 14: Max 3.000 0.286 0.410 ## ## ┌──────────────────────────────────────────────────┐ ## │Utility measures for perturbed numerical variables│ ## └──────────────────────────────────────────────────┘ ## ── Distribution statistics of perturbations ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────── ## vname Min Q10 Q20 Q30 Q40 Mean Median Q60 Q70 Q80 Q90 Q95 Q99 Max ## 1: income -119121.989 -80590.410 -51599.027 -32848.599 -8532.253 -13740.085 -3630.625 -2840.283 2626.699 14872.81 26699.529 53623.684 95820.303 106369.457 ## 2: savings -9238.676 -4870.839 -3245.138 -2279.765 -1425.807 -1126.899 -1341.809 -292.242 724.624 892.72 3187.721 3716.843 4269.743 4407.969 ``` ## Summary The package is *"work in progress"* and therefore, suggestions and/or bugreports are welcome. Please feel free to file an issue at [**our issue tracker**](https://github.com/sdcTools/userSupport/issues) or contribute to the package by filing a [**pull request**](https://github.com/sdcTools/cellKey/pulls) against the master branch.