Type: Package
Title: To Inspect and Manipulate Data; and to Keep Track of This Process
Version: 0.3.0
Author: Sherry Zhao
Maintainer: Sherry Zhao <sxzhao@gwu.edu>
Description: Functions to work with data frames to prepare data for further analysis. The functions for imputation, encoding, partitioning, and other manipulation can produce log files to keep track of the process.
BugReports: https://github.com/sherrisherry/cleandata/issues
URL: https://github.com/sherrisherry/cleandata
Depends: R (≥ 3.0.0)
Imports: stats
Suggests: R.rsp
License: MIT + file LICENSE
Encoding: UTF-8
VignetteBuilder: R.rsp
LazyData: true
NeedsCompilation: no
Packaged: 2018-12-01 01:24:00 UTC; Admin
Repository: CRAN
Date/Publication: 2018-12-01 05:10:02 UTC
List of Encoders
Description
The return value of inspect_map can be used to create inputs for the following functions. Refer to the vignettes for examples.
encode_ordinal: Encode Ordinal Data Into Sequential Integers
encode_binary: Encode Binary Data Into 0 and 1
encode_onehot: Encode Categorical Data by One-Hot Encoding
Encode Binary Data Into 0 and 1
Description
Encodes binary data into 0 and 1. Optionally records the result into a log file.
Usage
encode_binary(x, out.int=FALSE, full_print=TRUE, log = eval.parent(in_log_default))
Arguments
x | The data frame
out.int | Whether to convert encoded
full_print | When set to
log | Controls log files. To produce log files, assign it or the
Value
An encoded data frame.
Warning
x can only be a data frame. Don't pass a vector to it.
Examples
# refer to vignettes if you want to use log files
message('refer to vignettes if you want to use log files')
# building a data frame
A <- as.factor(c('x', 'y', 'x'))
B <- as.factor(c('y', 'x', 'y'))
C <- as.factor(c('i', 'j', 'i'))
df <- data.frame(A, B, C)
# encoding
df <- encode_binary(df)
print(df)
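As a rough illustration of what encoding binary data into 0 and 1 means, the following base-R sketch maps each two-level factor to its integer code minus one. It is not the package's implementation: binary_to_01 is a hypothetical helper, and which level ends up as 0 (here, the first level in alphabetical order) is an assumption.

```r
# Hypothetical helper, not part of cleandata: map a two-level factor to 0/1.
binary_to_01 <- function(col) {
  col <- as.factor(col)
  stopifnot(nlevels(col) == 2)   # only binary columns make sense here
  as.integer(col) - 1L           # assumption: first level -> 0, second level -> 1
}

df <- data.frame(A = factor(c('x', 'y', 'x')),
                 B = factor(c('y', 'x', 'y')))

# apply the helper column by column and rebuild a data frame
encoded <- as.data.frame(lapply(df, binary_to_01))
print(encoded)
```

Because the mapping follows the factor's level order, relevelling a column before encoding would flip which label becomes 1.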
One-Hot Encoding
Description
Encodes categorical data by One-hot encoding. Optionally records the result into a log file.
Usage
encode_onehot(x, colname.sep = '_', drop1st = FALSE,
full_print=TRUE, log = eval.parent(in_log_default))
Arguments
x | The data frame
colname.sep | A character or string that acts as a divider in the column names of the encoding results.
drop1st | Whether to drop the 1st level of every encoded column. The 1st level is the level whose underlying integer code in the factor is 1.
full_print | When set to
log | Controls log files. To produce log files, assign it or the
Value
An encoded data frame.
Warning
x can only be a data frame. Don't pass a vector to it.
Examples
# refer to vignettes if you want to use log files
message('refer to vignettes if you want to use log files')
# building a data frame
A <- as.factor(c('x', 'y', 'x'))
B <- as.factor(c('i', 'j', 'k'))
df <- data.frame(A, B)
# encoding
df0 <- encode_onehot(df)
df0 <- cbind(df, df0)
print(df0)
df0 <- encode_onehot(df, colname.sep = '-', drop1st = TRUE)
df0 <- cbind(df, df0)
rm(df)
print(df0)
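The roles of colname.sep and drop1st can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's code: onehot_sketch is a hypothetical helper, and the output column naming (column name, separator, level) is assumed rather than taken from the package.

```r
# Hypothetical helper, not part of cleandata: one indicator column per level.
onehot_sketch <- function(x, colname.sep = '_', drop1st = FALSE) {
  out <- list()
  for (col in colnames(x)) {
    lv <- levels(as.factor(x[[col]]))
    if (drop1st) lv <- lv[-1]          # drop the level whose integer code is 1
    for (l in lv) {
      # assumed naming scheme: <column><sep><level>
      out[[paste(col, l, sep = colname.sep)]] <- as.integer(x[[col]] == l)
    }
  }
  data.frame(out, check.names = FALSE) # keep names such as 'A-y' unchanged
}

df <- data.frame(A = factor(c('x', 'y', 'x')),
                 B = factor(c('i', 'j', 'k')))

oh <- onehot_sketch(df)
print(oh)

oh_drop <- onehot_sketch(df, colname.sep = '-', drop1st = TRUE)
print(oh_drop)
```

Dropping the first level removes one redundant column per encoded variable, which is the usual reason to do so before fitting a linear model.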
Encode Ordinal Data Into Integers
Description
Encodes ordinal data into sequential integers by a given order. Optionally records the result into a log file.
Usage
encode_ordinal(x, order, none='', out.int=FALSE,
full_print=TRUE, log = eval.parent(in_log_default))
Arguments
x | The data frame
order | A vector of the ordered labels, from low to high.
none | The level that means 'none' but is not NA, which is always encoded as 0.
out.int | Whether to convert encoded
full_print | When set to
log | Controls log files. To produce log files, assign it or the
Value
An encoded data frame.
Warning
x can only be a data frame. Don't pass a vector to it.
Examples
# refer to vignettes if you want to use log files
message('refer to vignettes if you want to use log files')
# building a data frame
A <- as.factor(c('y', 'z', 'x', 'y', 'z'))
B <- as.factor(c('y', 'x', 'z', 'z', 'x'))
C <- as.factor(c('k', 'i', 'i', 'j', 'k'))
df <- data.frame(A, B, C)
# encoding
df[, 1:2] <- encode_ordinal(df[,1:2], order = c('z', 'x', 'y'))
df[, 3] <- encode_ordinal(df[, 3, drop = FALSE], order = c('k', 'j', 'i'))
print(df)
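The interaction between order and none can be illustrated with a small base-R sketch. It is not the package's implementation: ordinal_sketch is a hypothetical helper, and starting the sequence at 1 for the lowest label is an assumption suggested by the fact that 0 is reserved for the none level.

```r
# Hypothetical helper, not part of cleandata: encode one column by a given order.
ordinal_sketch <- function(col, order, none = '') {
  col <- as.character(col)
  code <- match(col, order)               # assumption: lowest label -> 1, next -> 2, ...
  code[!is.na(col) & col == none] <- 0L   # the 'none' level is encoded as 0
  code                                    # real NAs stay NA
}

A <- c('y', 'z', 'x', 'y', 'z')
a_code <- ordinal_sketch(A, order = c('z', 'x', 'y'))
print(a_code)   # 3 1 2 3 1

S <- c('low', 'none', 'high', NA)
s_code <- ordinal_sketch(S, order = c('low', 'high'), none = 'none')
print(s_code)   # 1 0 2 NA
```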
Impute Missing Values
Description
impute_mode: Impute NAs by the modes of their corresponding columns.
impute_median: Impute NAs by the medians of their corresponding columns.
impute_mean: Impute NAs by the means of their corresponding columns.
Usage
impute_mode(x,cols=colnames(x),idx=row.names(x),log = eval.parent(in_log_default))
impute_median(x,cols=colnames(x),idx=row.names(x),log = eval.parent(in_log_default))
impute_mean(x,cols=colnames(x),idx=row.names(x),log = eval.parent(in_log_default))
Arguments
x | The data frame to be imputed.
cols | The index of columns of
idx | The index of rows of
log | Controls log files. To produce log files, assign it or the
Value
An imputed data frame.
Examples
# refer to vignettes if you want to use log files
message('refer to vignettes if you want to use log files')
# building a data frame
A <- as.factor(c('y', 'x', 'x', 'y', 'z'))
B <- c(6, 3:6)
C <- 1:5
df <- data.frame(A, B, C)
df[3, 1] <- NA; df[2, 2] <- NA; df[5, 3] <- NA
print(df)
# imputation
df0 <- impute_mode(df, cols = 1:3)
print(df0)
df0 <- impute_mode(df, cols = 1:3, idx = 1:3)
print(df0)
df0 <- impute_median(df, cols = 2:3)
print(df0)
df0 <- impute_mean(df, cols = 2:3)
print(df0)
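For readers who want to check the arithmetic, the three strategies can be reproduced on a single numeric column in base R. This is a sketch, not the package's implementation: impute_sketch and num_mode are hypothetical helpers, the cols and idx arguments are not modelled, and breaking ties between modes by first occurrence is an assumption.

```r
# Hypothetical helper: replace the NAs of a vector by a statistic of the observed values.
impute_sketch <- function(v, fun) {
  v[is.na(v)] <- fun(v[!is.na(v)])
  v
}

# Hypothetical helper: most frequent value; ties go to the value seen first (assumption).
num_mode <- function(v) {
  u <- unique(v)
  u[which.max(tabulate(match(v, u)))]
}

B <- c(6, NA, 4, 5, 6)   # column B of the example above after inserting the NA

b_mode   <- impute_sketch(B, num_mode)  # NA -> 6    (6 occurs twice)
b_median <- impute_sketch(B, median)    # NA -> 5.5  (median of 4, 5, 6, 6)
b_mean   <- impute_sketch(B, mean)      # NA -> 5.25 (21 / 4)

print(rbind(b_mode, b_median, b_mean))
```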
Classify The Columns of A Data Frame
Description
Provide a map for imputation and encoding.
Usage
inspect_map(x, common = 0, message = TRUE)
Arguments
x | The data frame
common | A non-negative number. If two factor columns share more than 'common' levels, they share the same scheme. 0 means that all the levels must be identical for two factor columns to share the same scheme.
message | Whether to print the process.
Value
A list of factor_cols (list), factor_levels (list), num_cols (vector), char_cols (vector), ordered_cols (vector), and other_cols (vector).
factor_cols | A list in which each member is a vector of the names of the factor columns that share the same scheme. The name of a vector is the same as its 1st member. Refer to the argument
factor_levels | A list in which each member is a scheme of the factor columns. The name of a scheme is the same as its corresponding vector in
num_cols | A vector of the names of the numerical columns.
char_cols | A vector of the names of the string columns.
ordered_cols | A vector of the names of the ordered factor columns.
other_cols | A vector of the names of the other columns.
Examples
# building a data frame
A <- as.factor(c('x', 'y', 'z'))
B <- as.ordered(c('z', 'x', 'y'))
C <- as.factor(c('y', 'z', 'x'))
D <- as.factor(c('i', 'j', 'k'))
E <- 5:7
df <- data.frame(A, B, C, D, E)
# inspection
dmap <- inspect_map(df)
summary(dmap)
print(dmap)
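To make the notion of a shared scheme concrete, here is a base-R sketch of the common = 0 case described under Arguments, where factor columns are grouped only when their levels are identical. It is not the package's implementation: group_by_levels is a hypothetical helper, and ignoring ordered factors and the order of the levels are both assumptions.

```r
# Hypothetical helper, not part of cleandata: group unordered factor columns
# whose sets of levels are identical.
group_by_levels <- function(x) {
  plain <- vapply(x, function(col) is.factor(col) && !is.ordered(col), logical(1))
  fac <- x[plain]
  # one text key per column, built from its sorted levels
  key <- vapply(fac, function(col) paste(sort(levels(col)), collapse = '|'),
                character(1))
  groups <- split(names(key), key)
  # name each group after its 1st member, as in the Value section above
  names(groups) <- unname(vapply(groups, function(g) g[1], character(1)))
  groups
}

df <- data.frame(A = factor(c('x', 'y', 'z')),
                 B = as.ordered(c('z', 'x', 'y')),
                 C = factor(c('y', 'z', 'x')),
                 D = factor(c('i', 'j', 'k')),
                 E = 5:7)

schemes <- group_by_levels(df)
print(schemes)
```

Here A and C end up in one group because both have the levels x, y, and z, while D forms a group of its own.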
Find Out Which Columns Have Most NAs
Description
Returns the names and NA counts of the columns that have the most NAs. The argument top sets how many columns are returned.
Usage
inspect_na(x, top=ncol(x))
Arguments
x | The data frame
top | The number of columns to return, counted from the column with the most NAs.
Value
A named vector.
Examples
# building a data frame
A <- as.factor(c('y', 'x', 'x', 'y', 'z'))
B <- c(6, 3:6)
C <- 1:5
df <- data.frame(A, B, C)
df[3, 1] <- NA; df[2, 2] <- NA; df[4, 2] <- NA; df[5, 3] <- NA
print(df)
# inspection
a <- inspect_na(df)
print(a)
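The same kind of summary can be computed by hand in base R, which may help when checking the result. This is a sketch of the idea rather than the package's code, and the variable names are illustrative only.

```r
# rebuild the example data frame with its four NAs
df <- data.frame(A = factor(c('y', 'x', 'x', 'y', 'z')),
                 B = c(6, 3:6),
                 C = 1:5)
df[3, 1] <- NA; df[2, 2] <- NA; df[4, 2] <- NA; df[5, 3] <- NA

# count the NAs of every column, then sort from most to fewest
na_counts <- sort(colSums(is.na(df)), decreasing = TRUE)

# keep only the top 2 columns (the role played by the argument top)
top <- 2
top_na <- na_counts[seq_len(min(top, length(na_counts)))]
print(top_na)
```

Column B comes first because it holds two of the four NAs.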
Simply Classify The Columns of A Data Frame
Description
A simplified, and thus faster, version of inspect_map.
Usage
inspect_smap(x, message = TRUE)
Arguments
x | The data frame
message | Whether to print the process.
Value
A list of factor_cols (vector), num_cols (vector), char_cols (vector), ordered_cols (vector), and other_cols (vector).
factor_cols | A vector of the names of the factor columns.
num_cols | A vector of the names of the numerical columns.
char_cols | A vector of the names of the string columns.
ordered_cols | A vector of the names of the ordered factor columns.
other_cols | A vector of the names of the other columns.
Examples
# building a data frame
A <- as.factor(c('x', 'y', 'z'))
B <- as.ordered(c('z', 'x', 'y'))
C <- as.factor(c('y', 'z', 'x'))
D <- as.factor(c('i', 'j', 'k'))
E <- 5:7
df <- data.frame(A, B, C, D, E)
# inspection
dmap <- inspect_smap(df)
summary(dmap)
print(dmap)
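The five categories listed under Value can be approximated in base R with a simple type check per column. This is a sketch, not the package's implementation: classify_sketch is a hypothetical helper, and the exact rules the package uses to tell the categories apart are assumed.

```r
# Hypothetical helper, not part of cleandata: sort column names into five buckets.
classify_sketch <- function(x) {
  buckets <- c('factor_cols', 'num_cols', 'char_cols', 'ordered_cols', 'other_cols')
  kind <- vapply(x, function(col) {
    if (is.ordered(col)) 'ordered_cols'      # test first: ordered factors are also factors
    else if (is.factor(col)) 'factor_cols'
    else if (is.numeric(col)) 'num_cols'
    else if (is.character(col)) 'char_cols'
    else 'other_cols'
  }, character(1))
  # fixing the levels keeps empty buckets in the result
  split(names(kind), factor(kind, levels = buckets))
}

df <- data.frame(A = factor(c('x', 'y', 'z')),
                 B = as.ordered(c('z', 'x', 'y')),
                 C = factor(c('y', 'z', 'x')),
                 D = factor(c('i', 'j', 'k')),
                 E = 5:7)

smap <- classify_sketch(df)
print(smap)
```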
Partitioning A Dataset Randomly
Description
Designed to create a validation column. Optionally records the result into a log file.
Usage
partition_random(x, name = 'Partition', train,
val = 10^ceiling(log10(train))-train, test = TRUE,
seed = FALSE, log = eval.parent(in_log_default))
Arguments
x | The data frame
name | The name of the validation column.
train | The proportion of the training set.
val | The proportion of the validation set. If not given, a default value is calculated by assuming the sum of
test | Whether to have test set. If
seed | Whether to set a random seed. If you want a reproducible result, pass a number to
log | Controls log files. To produce log files, assign it or the
Value
A partitioned column.
Warning
x can only be a data frame. Don't pass a vector to it.
Examples
# refer to vignettes if you want to use log files
message('refer to vignettes if you want to use log files')
# building a data frame
A <- 2:16
B <- letters[12:26]
df <- data.frame(A, B)
# partitioning
df0 <- partition_random(df, train = 7)
df0 <- cbind(df, df0)
print(df0)
df0 <- partition_random(df, train = 7, val = 2)
df0 <- cbind(df, df0)
print(df0)
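The default of val in the Usage section, 10^ceiling(log10(train)) - train, pads train up to the next power of ten. The sketch below evaluates that expression for a few inputs and then draws a reproducible random split in base R. The helper default_val and the labels 'train' and 'val' are hypothetical, and the way the package itself assigns rows is not shown here.

```r
# The default expression for val, copied from the Usage section.
default_val <- function(train) 10^ceiling(log10(train)) - train

v7   <- default_val(7)     # 10  - 7   = 3
v70  <- default_val(70)    # 100 - 70  = 30
v0.7 <- default_val(0.7)   # 1   - 0.7 = 0.3 (up to floating-point error)
print(c(v7, v70, v0.7))

# A hand-made 7:3 split of 20 rows with hypothetical labels.
set.seed(1)                # fixed seed, so the split is reproducible
n <- 20
n_train <- n * 7 / (7 + v7)
labels <- sample(rep(c('train', 'val'), times = c(n_train, n - n_train)))
print(table(labels))
```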
Create Data Dictionary from Data Warehouse
Description
Stacks part of a data frame and repeat the other columns to fit the result of stacking. Optionally records the result into a log file.
Usage
wh_dict(x, attr, value)
Arguments
x | The data frame
attr | The index of the column in
value | The index of the column in
Value
A 2-column data frame, in which the Keys column stores the explanation of the values in x[, attr].
Warning
x can only be a data frame. Don't pass a vector to it.
Examples
# refer to vignettes if you want to use log files
message('refer to vignettes if you want to use log files')
# building a data frame
A <- c('i', 'j', 'i', 'k', 'j')
B <- as.factor(c('x', 'y', 'x', 'z', 'y'))
C <- 1:5
df <- data.frame(A, B, C)
print(df)
# encoding
dict <- wh_dict(df, attr = 'B', value = 'A')
print(dict)