The inverseRegex package allows users to reverse engineer regular
expression patterns for R objects. Individual characters that make up an
object are categorised into common groups and encoded into run-lengths.
For example, the phrase “Hello World!” can be translated to
"[[:upper:]][[:lower:]]{4} [[:upper:]][[:lower:]]{4}!".
This could be useful to summarise a dataset without viewing all individual entries or to aid in data cleaning. For example, one could check that every entry in a column of dates follows a “nnnn-nn-nn” format, or that a column of strings consists entirely of alphabetic characters (no zeros entered in place of the letter O, for instance).
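The categorise-then-encode idea described above can be sketched in a few lines of base R. This is only an illustration under simple assumptions (a single string, the three default groups, no special handling of run lengths); the helper name sketch_inverse_regex is hypothetical and this is not the package's actual implementation.

```r
# Illustrative sketch only: NOT the implementation used by inverseRegex.
sketch_inverse_regex <- function(x) {
  # Split the string into individual characters
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  # Map each character to a POSIX class; anything else is left as is
  group <- ifelse(grepl("[[:digit:]]", chars), "[[:digit:]]",
           ifelse(grepl("[[:upper:]]", chars), "[[:upper:]]",
           ifelse(grepl("[[:lower:]]", chars), "[[:lower:]]", chars)))
  # Run-length encode consecutive identical groups
  runs <- rle(group)
  # Append {n} to any group that repeats more than once
  quant <- ifelse(runs$lengths > 1, paste0("{", runs$lengths, "}"), "")
  paste0(runs$values, quant, collapse = "")
}

sketch_inverse_regex("Hello World!")
#> [1] "[[:upper:]][[:lower:]]{4} [[:upper:]][[:lower:]]{4}!"
```

The sketch makes no attempt to escape literal characters that are special in a regex, so it is meant for reading rather than for reuse.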
The main function to use is inverseRegex(x) which will
identify the different characters that make up the input object
x. The different groups that can be identified are:
- '[[:digit:]]'
- '[[:lower:]]'
- '[[:upper:]]'
- '[[:alpha:]]'
- '[[:alnum:]]'
- '[[:space:]]'
- '[[:punct:]]'
See ?regex for an explanation of their meanings.
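Several of these classes overlap, which a small base R check makes visible. The snippet below is a sketch using only grepl; the sample characters are arbitrary and chosen purely for illustration.

```r
chars <- c("7", "a", "A", " ", "!")
classes <- c("[[:digit:]]", "[[:lower:]]", "[[:upper:]]", "[[:alpha:]]",
             "[[:alnum:]]", "[[:space:]]", "[[:punct:]]")

# Rows are characters, columns are classes; TRUE marks membership
membership <- vapply(classes, function(cl) grepl(cl, chars),
                     logical(length(chars)))
rownames(membership) <- chars
membership
```

A lower case letter, for instance, belongs to '[[:lower:]]', '[[:alpha:]]', and '[[:alnum:]]' at the same time, so a choice has to be made about which group to report.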
By default the only groups that will be identified are
[[:digit:]], [[:upper:]], and
[[:lower:]], with any other characters being left as is.
This can be altered with the following arguments:
- combineCases: Use '[[:alpha:]]' instead of '[[:lower:]]' and '[[:upper:]]'.
- combineAlphanumeric: Use '[[:alnum:]]' instead of '[[:digit:]]', '[[:lower:]]', '[[:upper:]]', and '[[:alpha:]]'.
- combinePunctuation: Use '[[:punct:]]' instead of leaving punctuation characters as is.
- combineSpace: Use '[[:space:]]' instead of leaving space characters as is.
Some examples of these arguments are below:
inverseRegex('1aA')
#> [1] "[[:digit:]][[:lower:]][[:upper:]]"
inverseRegex('1aA', combineCases = TRUE)
#> [1] "[[:digit:]][[:alpha:]]{2}"
inverseRegex('1aA', combineAlphanumeric = TRUE)
#> [1] "[[:alnum:]]{3}"
inverseRegex('Hello World!')
#> [1] "[[:upper:]][[:lower:]]{4} [[:upper:]][[:lower:]]{4}!"
inverseRegex('Hello World!', combineSpace = TRUE, combinePunctuation = TRUE)
#> [1] "[[:upper:]][[:lower:]]{4}[[:space:]][[:upper:]][[:lower:]]{4}[[:punct:]]"
Users can also specify the different run lengths that will be
identified. The inverseRegex function has an argument
called numbersToKeep which allows the user to specify what
lengths of repeated sequences should be identified explicitly. The
default value is c(2, 3, 4, 5, 10). Run lengths not
requested will be identified with a +.
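The rule for choosing a quantifier can be sketched as a small base R helper. The function run_quantifier is hypothetical and written only for illustration; it assumes, in line with the examples above, that a run of length one is shown with no quantifier at all.

```r
# Hypothetical helper, for illustration only: not part of the package.
run_quantifier <- function(n, numbersToKeep = c(2, 3, 4, 5, 10)) {
  ifelse(n == 1, "",                               # single character: no quantifier
         ifelse(n %in% numbersToKeep,
                paste0("{", n, "}"),               # requested length: explicit count
                "+"))                              # any other length: generic +
}

paste0("[[:lower:]]", run_quantifier(c(1, 4, 7, 10)))
# "[[:lower:]]" "[[:lower:]]{4}" "[[:lower:]]+" "[[:lower:]]{10}"
```

Keeping numbersToKeep short groups many different long runs under the same + pattern, while a longer vector such as 2:10 distinguishes them.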
The priority argument allows users to specify characters that will
take precedence over the regex grouping patterns, or to make them a
lower priority than the generic . placeholder. For example
below the a and 1 characters are treated
separately to the [[:lower:]] and [[:upper:]]
groups, whilst the ? is treated as a lower priority than
the ., unlike the !.
Many objects with a class other than character are
supported, including logical, integer,
numeric, Date, POSIXct,
factor, matrix, data.frame, and
tibble. They are all (except logical)
converted to character first, and the resulting regex patterns are
returned either as a character vector or, if the input was a matrix,
data frame, or tibble, as an object of the same class. See
?inverseRegex for a full description of how they are
treated. If users need a different character conversion method they can
do it themselves prior to calling inverseRegex.
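As a sketch of doing the conversion by hand, the snippet below formats a numeric vector to a fixed two decimal places with formatC before passing the resulting character vector on. The choice of formatC and of two decimal places is an assumption made for illustration, and the call to inverseRegex is guarded so the snippet still runs if the package is not installed.

```r
x <- c(1, 1.5, 12.25)

# Convert to character with exactly two decimal places and no padding
x_chr <- formatC(x, format = "f", digits = 2)
x_chr
# "1.00" "1.50" "12.25"

# Pass the character vector on, so the package does no numeric conversion
if (requireNamespace("inverseRegex", quietly = TRUE)) {
  print(inverseRegex::inverseRegex(x_chr))
}
```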
Special mention of numerics and data frames will be given here:
numeric
An attempt has been made to convert numeric values into characters as
directly as possible without losing or adding any information. When
passed a numeric vector inverseRegex will convert it to
character using:
vapply(x, format, character(1), nsmall = 1). This will
force at least one decimal place for all entries but will not add extra
decimal places beyond that unless they were present in the individual
input element; it will however remove trailing decimal zeros. For
example:
vapply(c(1, 1.0, 1.10, 1.12, 1.123), format, character(1), nsmall = 1)
#> [1] "1.0" "1.0" "1.1" "1.12" "1.123"
inverseRegex(c(1, 1.0, 1.10, 1.12, 1.123), numbersToKeep = 2:10)
#> [1] "[[:digit:]].[[:digit:]]" "[[:digit:]].[[:digit:]]"
#> [3] "[[:digit:]].[[:digit:]]" "[[:digit:]].[[:digit:]]{2}"
#> [5] "[[:digit:]].[[:digit:]]{3}"
## Vectors of class integer are just converted using as.character.
inverseRegex(1L)
#> [1] "[[:digit:]]"
Numerics are treated differently if they are present in a matrix,
data frame, or tibble. In the case of a matrix if it has a mode of
numeric then the entire object will be converted to character using
trimws(format(x)). For data frames and tibbles each column
of type numeric will be converted using trimws(format(x)).
This means that unlike for numeric vectors described above, all numeric
entries in matrices, data frames, and tibbles will have the same number
of decimal places.
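The difference between the two conversions named above can be seen directly in base R, before any regex patterns are involved. The variable names here are illustrative only.

```r
x <- c(1, 1.0, 1.10, 1.12, 1.123)

# Element-wise conversion, as described for numeric vectors
per_element <- vapply(x, format, character(1), nsmall = 1)
per_element
# "1.0" "1.0" "1.1" "1.12" "1.123"

# Whole-vector conversion, as described for matrices, data frames, and tibbles
whole_vector <- trimws(format(x))
whole_vector
# "1.000" "1.000" "1.100" "1.120" "1.123"
```

Formatting the whole vector at once pads every entry to a common number of decimal places, which is why the patterns for a numeric column are all identical.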
inverseRegex(c(1, 1.0, 1.10, 1.12, 1.123))
#> [1] "[[:digit:]].[[:digit:]]" "[[:digit:]].[[:digit:]]"
#> [3] "[[:digit:]].[[:digit:]]" "[[:digit:]].[[:digit:]]{2}"
#> [5] "[[:digit:]].[[:digit:]]{3}"
inverseRegex(data.frame(a = c(1, 1.0, 1.10, 1.12, 1.123)))
#> a
#> 1 [[:digit:]].[[:digit:]]{3}
#> 2 [[:digit:]].[[:digit:]]{3}
#> 3 [[:digit:]].[[:digit:]]{3}
#> 4 [[:digit:]].[[:digit:]]{3}
#> 5 [[:digit:]].[[:digit:]]{3}
data.frame
When given a data frame, inverseRegex will return a data
frame of similar dimensions with each column representing an individual
call to inverseRegex.
unique(inverseRegex(iris, numbersToKeep = 2:10))
#> Sepal.Length Sepal.Width Petal.Length
#> 1 [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]]
#> 51 [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]]
#> 101 [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]] [[:digit:]].[[:digit:]]
#> Petal.Width Species
#> 1 [[:digit:]].[[:digit:]] [[:lower:]]{6}
#> 51 [[:digit:]].[[:digit:]] [[:lower:]]{10}
#> 101 [[:digit:]].[[:digit:]] [[:lower:]]{9}
One of the main use cases of the package is to identify irregular
entries in a dataset. To this end there is a function
occurrencesLessThan which will call
inverseRegex and return logical values with
TRUE giving the location of any regex patterns that occur
less than a certain number of times.
occurrencesLessThan(c(LETTERS, 1))
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [25] FALSE FALSE TRUE
## When called on a data frame occurrencesLessThan will assess each column individually.
x <- iris
x$Species <- as.character(x$Species)
x[27, 'Species'] <- 'set0sa'
unique(occurrencesLessThan(x))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 FALSE FALSE FALSE FALSE FALSE
#> 27 FALSE FALSE FALSE FALSE TRUE
What constitutes a “rare” pattern can be specified with the
fraction or n arguments. See
?occurrencesLessThan for a full description.
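The idea of flagging rare patterns can be sketched in base R by counting how often each pattern occurs. The helper flag_rare and its threshold of n = 2 are hypothetical and chosen for illustration; they are not the package's function or its default values.

```r
# Hypothetical helper, for illustration only: TRUE where a pattern
# occurs fewer than n times in the vector.
flag_rare <- function(patterns, n = 2) {
  counts <- table(patterns)
  as.vector(counts[patterns]) < n
}

# 26 upper case letters and one digit, written as their patterns
patterns <- c(rep("[[:upper:]]", 26), "[[:digit:]]")
which(flag_rare(patterns))
#> [1] 27
```

The logical vector can then be used to subset the original data and inspect only the entries that look unusual.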