| Title: | Missing Data Explorer | 
| Version: | 0.3.2 | 
| Description: | Correct identification and handling of missing data is one of the most important steps in any analysis. To aid this process, 'mde' provides a very easy to use yet robust framework to quickly get an idea of where the missing data lies and therefore find the most appropriate action to take. Graham WJ (2009) <doi:10.1146/annurev.psych.58.110405.085530>. | 
| License: | GPL-3 | 
| Depends: | R(≥ 3.6.0) | 
| Imports: | dplyr(≥ 1.0.0), tidyr(≥ 1.0.3) | 
| Encoding: | UTF-8 | 
| RoxygenNote: | 7.1.2 | 
| URL: | https://github.com/Nelson-Gon/mde | 
| BugReports: | https://github.com/Nelson-Gon/mde/issues | 
| Suggests: | knitr, rmarkdown, markdown, testthat | 
| VignetteBuilder: | knitr | 
| Config/testthat/edition: | 3 | 
| NeedsCompilation: | no | 
| Packaged: | 2022-02-09 20:24:36 UTC; Nelg | 
| Author: | Nelson Gonzabato [aut, cre] | 
| Maintainer: | Nelson Gonzabato <gonzabato@hotmail.com> | 
| Repository: | CRAN | 
| Date/Publication: | 2022-02-10 12:10:06 UTC | 
Checks that all values are NA
Description
This is a helper function to check if all column/vector values are NA
Usage
all_na(x)
Arguments
| x | A vector or data.frame column | 
Value
Logical. TRUE if all values in the column/vector are NA, FALSE otherwise.
Examples
test <- data.frame(A=c(NA, 2), B= c(NA, NA))
all_na(test)
test_vec <- c("NA",NA,"nope")
test_numeric <- c(NA, 2)
all_na(test_vec)
all_na(test_numeric)
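For intuition, the check described above can be sketched in base R. This is only an assumption about the behaviour the description implies, not the package's actual implementation, and `all_na_sketch` is a hypothetical name used for illustration.

```r
# Hypothetical base-R equivalent of an "all values are NA" check.
# Assumption: a value counts as missing only if is.na() says so.
all_na_sketch <- function(x) all(is.na(x))

test_vec <- c("NA", NA, "nope")  # the string "NA" is not a missing value
only_missing <- c(NA, NA)

all_na_sketch(test_vec)      # FALSE
all_na_sketch(only_missing)  # TRUE
```

Note that the character string "NA" is ordinary data, which is why the first vector is not treated as entirely missing.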
Conditionally Recode NA values based on other Columns
Description
Recode NA as another value based on other columns
Usage
column_based_recode(
  df,
  criterion = "all_na",
  values_from = NULL,
  values_to = NULL,
  value = 0,
  pattern_type = "contains",
  pattern = "Solar",
  case_sensitive = FALSE
)
Arguments
| df | A data.frame object for which recoding is to be done. | 
| criterion | Currently supports one of all_na or any_na to index rows that are either all NA or contain any NA. | 
| values_from | Character. Name of column to get the original values from | 
| values_to | Character. New column name for the newly recoded values. Defaults to the same name if none is supplied. | 
| value | The value to convert to 'NA'. We can for instance change "n/a" to 'NA' or any other value. | 
| pattern_type | One of 'contains', 'starts_with' or 'ends_with'. | 
| pattern | A character pattern to match | 
| case_sensitive | Defaults to FALSE. Patterns are case sensitive if TRUE | 
Value
A 'data.frame' object with target 'NA' values replaced.
Examples
df <- structure(list(id = 40:43, v1 = c(NA, 1L, 1L, 1L), v2 = c(NA, 1L, 1L, 1L),
v3 = c(NA, 2L, NA, 1L),
test = c(1L, 2L, 1L, 3L)), class = "data.frame", row.names = c(NA, -4L))
# recode test as 0 if all NA, return test otherwise
column_based_recode(df,values_from = "test", pattern_type = "starts_with", pattern="v")
Recode NA as another value using a function or a custom equation
Description
Recode NA as another value using a function or a custom equation
Usage
custom_na_recode(
  df,
  func = "mean",
  grouping_cols = NULL,
  across_columns = NULL
)
Arguments
| df | A valid R 'object' for which the percentage of missing values is required. | 
| func | Function to use for the replacement, e.g. "mean". Defaults to "mean". | 
| grouping_cols | A character vector. If supplied, one can provide the columns by which to group the data. | 
| across_columns | A character vector specifying across which columns recoding should be done | 
Examples
# use all columns
head(custom_na_recode(airquality,func="mean"))
# use only a few columns
head(custom_na_recode(airquality,func="mean",across_columns = c("Solar.R","Ozone")))
# use a function from another package
#head(custom_na_recode(airquality, func=dplyr::lead))
some_data <- data.frame(ID=c("A1","A1","A1","A2","A2", "A2"),
A=c(5,NA,0,8,3,4),
B=c(10,0,0,NA,5,6),C=c(1,NA,NA,25,7,8))
# grouping
head(custom_na_recode(some_data,func = "mean", grouping_cols = "ID", across_columns = c("C", "A")))
head(custom_na_recode(some_data,func = "mean", grouping_cols = "ID"))
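The idea of replacing NA with a per-group summary can be sketched in base R. This is an illustrative assumption about what group-wise mean recoding amounts to, not the package's implementation; `fill_with_mean` is a hypothetical helper.

```r
# Hypothetical helper: replace NA in a vector with the mean of its non-missing values.
fill_with_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

some_data <- data.frame(ID = c("A1", "A1", "A1", "A2", "A2", "A2"),
                        A = c(5, NA, 0, 8, 3, 4))

# ave() applies the helper within each ID group and returns values in the original order.
some_data$A_filled <- ave(some_data$A, some_data$ID, FUN = fill_with_mean)
some_data
```

Here the missing value in group "A1" becomes 2.5, the mean of 5 and 0, while group "A2" is left unchanged because it has no NA.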
Recode Missing Values Dictionary-Style
Description
Recode Missing Values Dictionary-Style
Usage
dict_recode(
  df,
  use_func = "recode_na_as",
  pattern_type = "starts_with",
  patterns,
  values
)
Arguments
| df | A data.frame object for which recoding is to be done. | 
| use_func | Function to use for the recoding. One of the various 'recode_*' functions in package 'mde'. | 
| pattern_type | One of 'contains', 'starts_with' or 'ends_with'. | 
| patterns | A vector containing patterns to use for pattern_type | 
| values | A vector containing values to match to the patterns vector | 
Value
A 'data.frame' object with replacements as required.
Examples
head(dict_recode(airquality, pattern_type="starts_with",
patterns = c("Solar", "Ozone"), values = c(190, 41),
use_func="recode_as_na"))
 head(dict_recode(airquality, pattern_type="starts_with",
patterns = c("Solar", "Ozone"), values = c(42, 420),
use_func="recode_na_as"))
Drop columns for which all values are NA
Description
Drop columns for which all values are NA
Usage
drop_all_na(df, grouping_cols = NULL)
Arguments
| df | A valid R 'object' for which the percentage of missing values is required. | 
| grouping_cols | A character vector. If supplied, one can provide the columns by which to group the data. | 
Examples
test <- data.frame(ID= c("A","A","B","A","B"), Vals = c(rep(NA,4),2))
test2 <- data.frame(ID= c("A","A","B","A","B"), Vals = rep(NA, 5))
# drop columns where all values are NA
drop_all_na(test2)
# drop NAs only if all are NA for a given group, drops group too.
drop_all_na(test, "ID")
Drop missing values at columns that match a given pattern
Description
Provides a simple yet efficient way to drop missing values ("NA"s) at columns that match a given pattern.
Usage
drop_na_at(
  df,
  pattern_type = "contains",
  pattern = NULL,
  case_sensitive = FALSE,
  ...
)
Arguments
| df | A data.frame object | 
| pattern_type | One of "contains", "ends_with" or "starts_with" | 
| pattern | The pattern to match, interpreted according to pattern_type. Case sensitivity is controlled by case_sensitive. | 
| case_sensitive | Defaults to FALSE. Patterns are case sensitive if TRUE | 
| ... | Other params to other methods | 
Value
A data.frame object containing only columns that match the given pattern with the missing values removed.
Examples
head(drop_na_at(airquality,pattern_type = "starts_with","O"))
Condition based dropping of columns with missing values
Description
"drop_na_if" provides a simple way to drop columns with missing values if they meet certain criteria/conditions.
Usage
drop_na_if(
  df,
  sign = "gteq",
  percent_na = 50,
  keep_columns = NULL,
  grouping_cols = NULL,
  target_columns = NULL,
  ...
)
Arguments
| df | A data.frame object | 
| sign | Character. One of gteq, lteq, lt, gt or eq, meaning greater than or equal to, less than or equal to, less than, greater than, or equal to, respectively. | 
| percent_na | The percentage to use when dropping columns with missing values | 
| keep_columns | Columns that should be kept despite meeting the target percent_na criterion(criteria) | 
| grouping_cols | For dropping groups that meet a target criterion of percent missingness. | 
| target_columns | If working on grouped data, drop all columns that meet target or only a specific column. | 
| ... | Other arguments to "percent_missing" | 
Value
A data.frame object with columns that meet the target criteria dropped.
Examples
head(drop_na_if(airquality, percent_na = 24))
# drop columns that have less than or equal to 4% missing
head(drop_na_if(airquality,sign="lteq", percent_na = 4))
# Drop columns with greater than or equal to 4% missing, but keep Ozone
head(drop_na_if(airquality, sign="gteq",percent_na = 4, 
keep_columns = "Ozone"))
# Drop groups that meet a given criterion
grouped_drop <- structure(list(ID = c("A", "A", "B", "A", "B"), Vals = c(4, NA, 
NA, NA, NA), Values = c(5, 6, 7, 8, NA)), row.names = c(NA, -5L),
 class = "data.frame")
 drop_na_if(grouped_drop,percent_na = 67,grouping_cols = "ID")
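The column-dropping criterion can be illustrated with base R alone. The sketch below is an assumption about the logic (percent missing per column compared against a threshold), not the package's code; `drop_cols_sketch` is a hypothetical name and only the "gteq" comparison is shown.

```r
# Hypothetical helper: drop columns whose percentage of NA is >= percent_na,
# except those listed in keep_columns.
drop_cols_sketch <- function(df, percent_na = 50, keep_columns = NULL) {
  pct_missing <- colMeans(is.na(df)) * 100
  to_drop <- names(pct_missing)[pct_missing >= percent_na]
  to_drop <- setdiff(to_drop, keep_columns)
  df[, setdiff(names(df), to_drop), drop = FALSE]
}

# In airquality, Ozone is about 24.2% missing and Solar.R about 4.6% missing.
names(drop_cols_sketch(airquality, percent_na = 24))
names(drop_cols_sketch(airquality, percent_na = 4, keep_columns = "Ozone"))
```

The first call removes only Ozone. The second would remove both Ozone and Solar.R, but Ozone is protected by `keep_columns`.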
Conditionally drop rows based on percent missingness
Description
Conditionally drop rows based on percent missingness
Usage
drop_row_if(df, sign = "gt", type = "count", value = 20, as_percent = TRUE)
Arguments
| df | A data.frame object | 
| sign | Character. One of gteq, lteq, lt, gt or eq, meaning greater than or equal to, less than or equal to, less than, greater than, or equal to, respectively. | 
| type | One of either count or percent. Defaults to count | 
| value | Value to use for the drop. | 
| as_percent | Logical. If set to TRUE, value is treated as a percentage. Otherwise, decimals (fractions) are used. | 
Examples
head(drop_row_if(airquality,sign = "gteq",
type = "percent",value=16, as_percent = TRUE))
# should give the same output as above.
head(drop_row_if(airquality, sign="gteq", type="percent",value = 0.15, as_percent=FALSE))
# Drop based on NA counts
df <- data.frame(A=1:5, B=c(1,NA,NA,2, 3), C= c(1,NA,NA,2,3))
drop_row_if(df, type="count",value=2,sign="eq")
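Row-wise missingness, by count or by percentage, can be computed in base R as follows. This is a sketch of the underlying idea under the assumption that rows are compared against a threshold; it does not reproduce the package's internals.

```r
df <- data.frame(A = 1:5, B = c(1, NA, NA, 2, 3), C = c(1, NA, NA, 2, 3))

# Number and percentage of missing values in each row.
na_per_row <- rowSums(is.na(df))
pct_per_row <- rowMeans(is.na(df)) * 100

# Count-based rule: drop rows with exactly 2 missing values.
kept_by_count <- df[na_per_row != 2, , drop = FALSE]

# Percent-based rule: a threshold of 50 (percent) is the same as 0.5 (fraction).
kept_by_percent <- df[pct_per_row < 50, , drop = FALSE]
kept_by_fraction <- df[rowMeans(is.na(df)) < 0.5, , drop = FALSE]

kept_by_count
identical(kept_by_percent, kept_by_fraction)
```

Rows 2 and 3 each contain two NA values out of three columns, so both rules remove them here.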
Add columnwise/groupwise counts of missing values
Description
This function takes a 'data.frame' object as an input and returns the corresponding 'NA' counts. 'NA' refers to R's built-in missing data placeholder.
Usage
get_na_counts(x, grouping_cols = NULL, exclude_cols = NULL)
Arguments
| x | A valid R 'object' for which 'na_counts' are needed. | 
| grouping_cols | A character vector. If supplied, one can provide the columns by which to group the data. | 
| exclude_cols | Columns to exclude from the analysis. | 
Value
An object of the same type as 'x' showing the respective number of missing values. If grouping_cols is supplied, the results are returned by group.
Examples
get_na_counts(airquality)
# Grouped counts
test <- data.frame(Subject = c("A","A","B","B"), res = c(NA,1,2,3),
ID = c("1","1","2","2"))
get_na_counts(test,grouping_cols = c("ID", "Subject"))
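As a cross-check, column-wise and grouped NA counts can be obtained with base R. This is an illustrative sketch, not the package's implementation, and the shape of its output may differ from what `get_na_counts` returns.

```r
# Column-wise NA counts for a data.frame.
na_by_column <- colSums(is.na(airquality))
na_by_column[c("Ozone", "Solar.R")]  # Ozone 37, Solar.R 7

# Grouped counts: na.action = na.pass keeps rows with NA so they can be counted.
test <- data.frame(Subject = c("A", "A", "B", "B"), res = c(NA, 1, 2, 3),
                   ID = c("1", "1", "2", "2"))
grouped <- aggregate(res ~ ID + Subject, data = test,
                     FUN = function(v) sum(is.na(v)), na.action = na.pass)
grouped
```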
Get mean missingness.
Description
Get mean missingness.
Usage
get_na_means(x, as_percent = TRUE)
Arguments
| x | A vector whose mean NA is required. | 
| as_percent | Logical. Should means be reported as percentages? Defaults to TRUE. | 
Examples
get_na_means(airquality)
Get NA counts for a given character, numeric, factor, etc.
Description
Get NA counts for a given character, numeric, factor, etc.
Usage
na_counts(x)
Arguments
| x | A vector whose number of missing values is to be determined. | 
Examples
na_counts(airquality$Ozone) 
An all-in-one missingness report
Description
An all-in-one missingness report
Usage
na_summary(
  df,
  grouping_cols = NULL,
  sort_by = NULL,
  descending = FALSE,
  exclude_cols = NULL,
  pattern = NULL,
  pattern_type = NULL,
  regex_kind = "exclusion",
  round_to = NULL,
  reset_rownames = FALSE
)
Arguments
| df | A valid R 'object' for which the percentage of missing values is required. | 
| grouping_cols | A character vector. If supplied, one can provide the columns by which to group the data. | 
| sort_by | One of counts or percents. This determines whether the results are sorted by counts or percentages. | 
| descending | Logical. Should missing values be sorted in decreasing order ie largest to smallest? Defaults to FALSE. | 
| exclude_cols | A character vector indicating columns to exclude when returning results. | 
| pattern | Pattern to use for column exclusion or inclusion. | 
| pattern_type | A regular expression type. One of "starts_with", "contains", or "regex". Defaults to NULL. Only use for selective inclusion. | 
| regex_kind | One of inclusion or exclusion. Defaults to exclusion to exclude columns using regular expressions. | 
| round_to | Number of decimal places to round to. Defaults to the user's digits option. | 
| reset_rownames | Logical. Should the row names be reset in the output? Defaults to FALSE. | 
Examples
na_summary(airquality)
# grouping
test2 <- data.frame(ID= c("A","A","B","A","B"),Vals = c(rep(NA,4),"No"),
ID2 = c("E","E","D","E","D"))
df <- data.frame(A=1:5,B=c(NA,NA,25,24,53), C=c(NA,1,2,3,4))
na_summary(test2,grouping_cols = c("ID","ID2"))
# sort summary
na_summary(airquality,sort_by = "percent_missing",descending = TRUE)
na_summary(airquality,sort_by = "percent_complete")
# Include only via a regular expression
na_summary(mtcars, pattern_type = "contains",
pattern = "mpg|disp|wt", regex_kind = "inclusion")
na_summary(airquality, pattern_type = "starts_with",
pattern = "ozone", regex_kind = "inclusion")
# exclusion via a regex
na_summary(airquality, pattern_type = "starts_with",
pattern = "oz|Sol", regex_kind = "exclusion")
# reset rownames when sorting by variable
na_summary(df,sort_by="variable",descending=TRUE, reset_rownames = TRUE)
Column-wise missingness percentages
Description
A convenient way to obtain percent missingness column-wise.
Usage
percent_missing(df, grouping_cols = NULL, exclude_cols = NULL)
Arguments
| df | A valid R 'object' for which the percentage of missing values is required. | 
| grouping_cols | A character vector. If supplied, one can provide the columns by which to group the data. | 
| exclude_cols | A character vector indicating columns to exclude when returning results. | 
Value
An object of the same class as df showing the percentage of missing values.
Examples
test <- data.frame(ID= c("A","B","A","B","A","B","A"), 
Vals = c(NA,25,34,NA,67,NA,45))
percent_missing(test,grouping_cols = "ID")
percent_missing(airquality)
percent_missing(airquality,exclude_cols = c("Day","Temp"))
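The percentages themselves are simple to verify by hand. The base-R sketch below recomputes them for the grouped example above; it is meant as a sanity check under the assumption that percent missingness is the share of NA values times 100, and it does not mirror the package's output format.

```r
test <- data.frame(ID = c("A", "B", "A", "B", "A", "B", "A"),
                   Vals = c(NA, 25, 34, NA, 67, NA, 45))

# Overall percent missing per column.
overall <- colMeans(is.na(test)) * 100

# Percent missing in Vals within each ID group.
by_group <- tapply(test$Vals, test$ID, function(v) mean(is.na(v)) * 100)

overall
by_group
```

Group "A" has one NA among four values (25%), and group "B" has two among three (about 66.7%).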
Percent missing, but for vectors.
Description
Percent missing, but for vectors.
Usage
percent_na(x)
Arguments
| x | A vector whose percentage of missing values is required. | 
Examples
percent_na(airquality$Ozone)
Recode a value as NA
Description
This provides a convenient way to convert a number/value that should be "NA" to "NA". In other words, it converts a value to R's recognized NA.
Usage
recode_as_na(
  df,
  value = NULL,
  subset_cols = NULL,
  pattern_type = NULL,
  pattern = NULL,
  case_sensitive = FALSE,
  ...
)
Arguments
| df | A data.frame object for which recoding is to be done. | 
| value | The value to convert to 'NA'. We can for instance change "n/a" to 'NA' or any other value. | 
| subset_cols | An optional character vector to define columns for which changes are required. | 
| pattern_type | One of 'contains', 'starts_with' or 'ends_with'. | 
| pattern | A character pattern to match | 
| case_sensitive | Defaults to FALSE. Patterns are case sensitive if TRUE | 
| ... | Other arguments to other functions | 
Value
An object of the same class as df with values changed to 'NA'.
Examples
head(recode_as_na(airquality,value=c(67,118),pattern_type="starts_with",pattern="S|O"))
head(recode_as_na(airquality,value=c(41),pattern_type="ends_with",pattern="e"))
head(recode_as_na(airquality, value=41,subset_cols="Ozone"))
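What recoding a value as NA in pattern-selected columns amounts to can be sketched in base R. This is an assumption-laden illustration: column selection is done here with a plain regular expression on the column names, which may not match how the package resolves `pattern_type`.

```r
df <- head(airquality)

# Select columns whose names start with "O" (here only "Ozone").
cols <- grep("^O", names(df), value = TRUE)

# Replace the target value with NA in the selected columns only.
df[cols] <- lapply(df[cols], function(col) {
  col[col %in% 41] <- NA
  col
})

df$Ozone
```

Using `%in%` rather than `==` avoids NA results in the comparison when the column already contains missing values.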
Recode Values as NA if they meet defined criteria
Description
Recode Values as NA if they meet defined criteria
Usage
recode_as_na_for(df, criteria = "gt", value = 0, subset_cols = NULL)
Arguments
| df | A data.frame object to manipulate | 
| criteria | One of gt, gteq, lt or lteq, meaning greater than, greater than or equal to, less than, or less than or equal to, respectively. | 
| value | The value to convert to 'NA'. We can for instance change "n/a" to 'NA' or any other value. | 
| subset_cols | An optional character vector for columns to manipulate. | 
Value
A data.frame object with the required changes.
Examples
recode_as_na_for(airquality,value=36, criteria = "gteq",
subset_cols = c("Ozone","Solar.R"))
Conditionally change all column values to NA
Description
Conditionally change all column values to NA
Usage
recode_as_na_if(df, sign = "gteq", percent_na = 50, keep_columns = NULL, ...)
Arguments
| df | A data.frame object | 
| sign | Character. One of gteq, lteq, lt, gt or eq, meaning greater than or equal to, less than or equal to, less than, greater than, or equal to, respectively. | 
| percent_na | The percentage to use when dropping columns with missing values | 
| keep_columns | Columns that should be kept despite meeting the target percent_na criterion(criteria) | 
| ... | Other arguments to "percent_missing" | 
Value
A 'data.frame' with the target columns populated with 'NA's.
Examples
head(recode_as_na_if(airquality, sign="gt", percent_na=20))
Recode as NA based on string match
Description
Recode as NA based on string match
Usage
recode_as_na_str(
  df,
  pattern_type = "ends_with",
  pattern = NULL,
  case_sensitive = FALSE,
  ...
)
Arguments
| df | A data.frame object | 
| pattern_type | One of 'contains', 'starts_with' or 'ends_with'. | 
| pattern | A character pattern to match | 
| case_sensitive | Defaults to FALSE. Patterns are case sensitive if TRUE | 
| ... | Other arguments to grepl | 
Examples
partial_match <- data.frame(A=c("Hi","match_me","nope"), B=c(NA, "not_me","nah"))
# Replace all that end with "me" with NA
recode_as_na_str(partial_match,"ends_with","me")
# Do not recode, ie case-sensitive
recode_as_na_str(partial_match,"ends_with","ME", case_sensitive=TRUE)
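The effect of case sensitivity on string-based recoding can be demonstrated with `grepl()` in base R. The function below is a hypothetical sketch that handles only the "ends with" case and is not the package's implementation.

```r
# Hypothetical helper: set values ending in `pattern` to NA in every column.
recode_str_sketch <- function(df, pattern, case_sensitive = FALSE) {
  df[] <- lapply(df, function(col) {
    hit <- grepl(paste0(pattern, "$"), col, ignore.case = !case_sensitive)
    col[hit] <- NA
    col
  })
  df
}

partial_match <- data.frame(A = c("Hi", "match_me", "nope"),
                            B = c(NA, "not_me", "nah"))

# Case-insensitive by default: values ending in "me" become NA.
recode_str_sketch(partial_match, "me")
# Case-sensitive: "ME" does not match "me", so nothing changes.
recode_str_sketch(partial_match, "ME", case_sensitive = TRUE)
```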
Recode a value as another value
Description
This provides a convenient way to convert a number/value to another value.
Usage
recode_as_value(
  df,
  value = NULL,
  replacement_value = NULL,
  subset_cols = NULL,
  pattern_type = NULL,
  pattern = NULL,
  case_sensitive = FALSE,
  ...
)
Arguments
| df | A data.frame object for which recoding is to be done. | 
| value | The value/vector of values to convert. | 
| replacement_value | New value. | 
| subset_cols | An optional character vector to define columns for which changes are required. | 
| pattern_type | One of 'contains', 'starts_with' or 'ends_with'. | 
| pattern | A character pattern to match | 
| case_sensitive | Defaults to FALSE. Patterns are case sensitive if TRUE | 
| ... | Other arguments to other functions | 
Value
An object of the same class as df with the target values changed to the replacement value.
Examples
head(recode_as_value(airquality,
value=c(67,118),replacement_value=NA, pattern_type="starts_with",pattern="S|O"))
Helper functions in package mde
Description
Helper functions in package mde
Usage
recode_helper(
  x,
  pattern_type = NULL,
  pattern = NULL,
  original_value,
  new_value,
  case_sensitive = FALSE,
  ...
)
Arguments
| x | A data.frame object | 
| pattern_type | One of 'contains', 'starts_with' or 'ends_with'. | 
| pattern | A character pattern to match | 
| original_value | Value to replace | 
| new_value | Replacement value. | 
| case_sensitive | Defaults to FALSE. Patterns are case sensitive if TRUE | 
| ... | Other arguments to other functions | 
Replace missing values with another value
Description
This provides a convenient way to recode "NA" as another value for instance "NaN", "n/a" or any other value a user wishes to use.
Usage
recode_na_as(
  df,
  value = 0,
  subset_cols = NULL,
  pattern_type = NULL,
  pattern = NULL,
  case_sensitive = FALSE,
  ...
)
Arguments
| df | A data.frame object for which recoding is to be done. | 
| value | The value to replace 'NA' with, for instance "n/a" or any other value. Defaults to 0. | 
| subset_cols | An optional character vector to define columns for which changes are required. | 
| pattern_type | One of 'contains', 'starts_with' or 'ends_with'. | 
| pattern | A character pattern to match | 
| case_sensitive | Defaults to FALSE. Patterns are case sensitive if TRUE | 
| ... | Other arguments to other functions | 
Value
An object of the same type as df with NAs replaced with the desired value.
Examples
head(recode_na_as(airquality, "n/a"))
head(recode_na_as(airquality, subset_cols = "Ozone", value = "N/A"))
head(recode_na_as(airquality, value=0, pattern_type="starts_with",
pattern="Solar"))
Recode NA as another value with some conditions
Description
Recode NA as another value with some conditions
Usage
recode_na_if(df, grouping_cols = NULL, target_groups = NULL, replacement = 0)
Arguments
| df | A data.frame object with missing values | 
| grouping_cols | Character columns to use for grouping the data | 
| target_groups | Character. Recode NA if and only if the grouping column value is in this vector of values. | 
| replacement | Values to use to replace NAs for IDs that meet the requirements. Defaults to 0. | 
Examples
some_data <- data.frame(ID=c("A1","A2","A3", "A4"),
A=c(5,NA,0,8), B=c(10,0,0,1),C=c(1,NA,NA,25))
# Replace NAs with 0s only for IDs in A2 and A3
recode_na_if(some_data,"ID",c("A2","A3"),replacement=0)
Helper functions in package mde
Description
Helper functions in package mde
Usage
recode_selectors(
  x,
  column_check = TRUE,
  pattern_type = NULL,
  pattern = NULL,
  case_sensitive = FALSE,
  ...
)
Arguments
| x | data.frame object | 
| column_check | If TRUE, pattern search is performed columnwise. Defaults to TRUE. | 
| pattern_type | One of 'contains', 'starts_with' or 'ends_with'. | 
| pattern | A character pattern to match | 
| case_sensitive | Defaults to FALSE. Patterns are case sensitive if TRUE | 
| ... | Other arguments to other functions | 
Sort Variables according to missingness
Description
Provides a useful way to sort the variables(columns) according to their missingness.
Usage
sort_by_missingness(df, sort_by = "counts", descending = FALSE, ...)
Arguments
| df | A data.frame object | 
| sort_by | One of counts or percents. This determines whether the results are sorted by counts or percentages. | 
| descending | Logical. Should missing values be sorted in decreasing order ie largest to smallest? Defaults to FALSE. | 
| ... | Other arguments to specific functions. See "See also below" | 
Value
A 'data.frame' object sorted by number/percentage of missing values
See Also
Examples
sort_by_missingness(airquality, sort_by = "counts")
# sort by percents
sort_by_missingness(airquality, sort_by="percents")
# descending order
sort_by_missingness(airquality, descending = TRUE)