| Title: | Data for Wordpiece-Style Tokenization | 
| Version: | 2.0.0 | 
| Description: | Provides data to be used by the wordpiece algorithm in order to tokenize text into somewhat meaningful chunks. Included vocabularies were retrieved from https://huggingface.co/bert-base-cased/resolve/main/vocab.txt and https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt and parsed into an R-friendly format. | 
| License: | Apache License (≥ 2) | 
| Encoding: | UTF-8 | 
| RoxygenNote: | 7.1.2 | 
| URL: | https://github.com/macmillancontentscience/wordpiece.data | 
| BugReports: | https://github.com/macmillancontentscience/wordpiece.data/issues | 
| Depends: | R (≥ 3.5.0) | 
| Suggests: | testthat (≥ 3.0.0) | 
| Config/testthat/edition: | 3 | 
| NeedsCompilation: | no | 
| Packaged: | 2022-03-03 15:50:03 UTC; jonth | 
| Author: | Jonathan Bratt | 
| Maintainer: | Jon Harmon <jonthegeek@gmail.com> | 
| Repository: | CRAN | 
| Date/Publication: | 2022-03-03 16:20:02 UTC | 
Generate the inst path
Description
Generate the path to a data file in the package's inst directory, given the file type and the number of tokens.
Usage
.get_path(filetype, n_tokens)
Arguments
| filetype | Character scalar; the type of file, like "uncased". | 
| n_tokens | Integer scalar; the number of tokens used for that file. | 
Value
Character scalar; the path to the file.
Load an RDS from inst Dir
Description
Load an RDS file from the package's inst directory, given the file type and the number of tokens.
Usage
.load_inst_rds(filetype, n_tokens)
Arguments
| filetype | Character scalar; the type of file, like "uncased". | 
| n_tokens | Integer scalar; the number of tokens used for that file. | 
Value
The R object stored in the RDS file.
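The two internal helpers above are not exported, and their implementation is not shown here. As a rough illustration only, the sketch below builds a file path from a file type and a token count, then reads an RDS file back with base R. The naming scheme `<filetype>_<n_tokens>.rds`, the temporary directory, and the helper names `example_path` and `example_load_rds` are assumptions made for this example, not the package's actual layout.

```r
# Hypothetical stand-ins for the internal helpers. The file naming scheme
# is an assumption for illustration, not the package's real layout.
example_path <- function(dir, filetype, n_tokens) {
  file.path(dir, paste0(filetype, "_", n_tokens, ".rds"))
}

example_load_rds <- function(dir, filetype, n_tokens) {
  readRDS(example_path(dir, filetype, n_tokens))
}

# Write a tiny made-up object to a temporary directory, then load it back.
demo_dir <- tempfile("rds_demo_")
dir.create(demo_dir)
toy <- c("[PAD]" = 0L, "hello" = 1L, "world" = 2L)
saveRDS(toy, example_path(demo_dir, "uncased", 3L))

loaded <- example_load_rds(demo_dir, "uncased", 3L)
round_trip_ok <- identical(loaded, toy)
```

Keeping path construction separate from loading means the path logic can be checked without touching the file system.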
Load a wordpiece Vocabulary
Description
A wordpiece vocabulary is a named integer vector with class "wordpiece_vocabulary". The names of the vector are the tokens, and the values are the integer identifiers of those tokens. The vocabulary is 0-indexed for compatibility with Python implementations.
Usage
wordpiece_vocab(cased = FALSE)
Arguments
| cased | Logical scalar; if TRUE, load the cased vocabulary; if FALSE (the default), load the uncased vocabulary. | 
Value
A wordpiece_vocabulary.
Examples
head(wordpiece_vocab())
head(wordpiece_vocab(cased = TRUE))
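To make the structure described above concrete without loading the package data, here is a minimal sketch that builds a toy object of the same shape by hand: a named integer vector with class "wordpiece_vocabulary" whose ids start at 0. The tokens and ids are made up for illustration, and the object is constructed directly rather than through any package function.

```r
# Toy vocabulary built by hand; tokens and ids are invented for illustration.
toy_vocab <- structure(
  c("[PAD]" = 0L, "hello" = 1L, "##ing" = 2L),
  class = "wordpiece_vocabulary"
)

# Drop the class to work with the underlying named integer vector.
ids <- unclass(toy_vocab)

# Token -> id: look up by name.
token_to_id <- ids[["hello"]]

# Id -> token: match on the value, not on the R position, because the ids
# are 0-indexed while R vectors are 1-indexed.
id_to_token <- names(ids)[ids == 2L]

# The smallest id is 0, not 1.
min_id <- min(ids)
```

Looking up tokens by value rather than by position avoids off-by-one mistakes when ids are exchanged with 0-indexed Python tooling.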